Device Connectivity Issue

Incident Report for emnify

Postmortem

Executive Summary

A faulty software update caused the Policy and Charging Rules Function (PCRF) to incorrectly enforce prepaid balance checks on postpaid customer organizations. As a result, these organizations were erroneously blocked, triggering a process that disconnected (off-boarded) devices that were previously online.

By the time the erroneous process was identified and halted, some devices had already been off-boarded. Subsequent reconnection attempts by these devices triggered a signalling storm, overloading one of our core network functions, the Authentication Centre (AuC), and delaying recovery. Recovery resumed successfully once the overloaded network function was scaled up, after which off-boarded devices could reconnect.

Full recovery occurred 69 minutes after the fault was introduced.

Impact

A number of postpaid organizations received an organization-blocked event and email notification.

Some organizations were effectively blocked, preventing new device connections from approximately 08:33 UTC until all incorrect blocks were lifted at 09:08 UTC (35 minutes).

A small number of organizations experienced disconnection of active device data sessions, impacting a portion of the devices connected at the time of the incident. These disconnections occurred between 08:33 and 08:43 (UTC).

The vast majority of online devices continued normal operations, provided local network conditions did not change and the devices did not disconnect independently.

New connection attempts and reconnection attempts of previously disconnected devices encountered difficulties registering on the network due to the signalling storm. This resulted in significantly reduced success rates for Update Location and Create PDP Context procedures. The overload condition persisted from 08:35 until 09:42 (UTC), when additional AuC capacity was provisioned.
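To illustrate why a wave of simultaneous reconnection attempts can overload an authentication backend, and how the standard mitigation (jittered exponential backoff) spreads that load out, here is a minimal sketch. It is illustrative only and does not describe emnify's device firmware or AuC; all names and values are assumptions.

```python
import random

def naive_retry_delay(attempt: int) -> float:
    # Every device retries immediately, so each wave of failures produces
    # another synchronized wave of Update Location / Create PDP Context
    # requests against the already-overloaded authentication function.
    return 0.0

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0) -> float:
    # Exponential backoff with full jitter spreads retries over time,
    # letting the overloaded function drain its queue and recover.
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# Average delay grows with each failed attempt instead of staying at zero.
print([round(backoff_delay(a), 1) for a in range(5)])
```

Without some form of backoff, a mass off-boarding event converts directly into a self-reinforcing signalling storm, which is why the corrective actions below include a review of overload protection and LTE device recovery behaviour.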

Stabilization Steps

The following corrective measures were undertaken to resolve the incident:

  • Reversal of the faulty software update.
  • Reversal of erroneous organization blocks.
  • Provisioning of additional AuC capacity.

Contributing Factors

The severity of this incident was exacerbated by the following factor:

  • The implemented software update failed to anticipate that almost all postpaid organizations in the production database previously existed as prepaid accounts, reflecting the common practice of creating accounts initially as prepaid before converting them to postpaid at a later stage.
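The contributing factor above can be sketched as a classification bug: deciding prepaid versus postpaid based on an organization's account history rather than its current account type. This is a hypothetical reconstruction; emnify's actual PCRF code and data model are not public, and all names below are invented.

```python
def requires_balance_check(org: dict) -> bool:
    """Flawed check: looks at whether the organization was EVER prepaid.

    Because most postpaid organizations start life as prepaid accounts,
    this misclassifies nearly all of them as prepaid.
    """
    return "prepaid" in org["account_type_history"]

def requires_balance_check_fixed(org: dict) -> bool:
    # Correct check: only the CURRENT account type matters.
    return org["current_account_type"] == "prepaid"

# A typical organization: created as prepaid, later converted to postpaid.
converted_org = {
    "current_account_type": "postpaid",
    "account_type_history": ["prepaid", "postpaid"],
}

print(requires_balance_check(converted_org))        # True  -> wrongly blocked
print(requires_balance_check_fixed(converted_org))  # False -> correct
```

Under this reading, the update behaved correctly for organizations that had always been postpaid, which is why the flaw only surfaced against the production database, where conversion from prepaid is the common path.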

Corrective Actions (Preventive Measures)

Detailed investigations will continue over the coming days; however, the following corrective actions are currently planned:

  • Assessment and enhancement of overload protection mechanisms, especially considering identified shortcomings in LTE device recovery procedures.
  • Completion of ongoing re-architecture of the AuC to implement a horizontally scalable database, allowing automated adjustment to traffic fluctuations without manual intervention. Deployment of this enhancement is scheduled for completion in Q3 2025.
  • Expansion of test coverage specifically for organization blocking functionalities to provide improved protection during development.
  • Implementation of additional safeguards for organization blocking and device off-boarding procedures. Such safeguards will introduce supplementary checks and mandatory escalations to operational teams for evaluation and approval prior to the execution of abnormally large-scale blocking events.
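The last safeguard above amounts to a guard rail in front of bulk blocking operations. A minimal sketch of that idea follows; the threshold, function names, and approval flow are illustrative assumptions, not emnify's actual implementation.

```python
BULK_BLOCK_THRESHOLD = 50  # assumed value; above this, human approval is required

def execute_block(org_ids: list, approved_by: str = None) -> list:
    """Block the given organizations, escalating abnormally large batches."""
    if len(org_ids) > BULK_BLOCK_THRESHOLD and approved_by is None:
        # Abnormally large batch with no recorded approval: halt and escalate
        # to the operations team instead of executing automatically.
        raise PermissionError(
            f"Blocking {len(org_ids)} organizations exceeds the threshold of "
            f"{BULK_BLOCK_THRESHOLD}; operations-team approval is required."
        )
    return org_ids  # placeholder for the actual blocking call

# A small batch proceeds unattended; a large unapproved batch is halted.
execute_block(["org-1", "org-2"])
```

The design choice here is that routine, small-scale blocking stays fully automated, while an event on the scale of this incident would be stopped for review before any device is off-boarded.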
Posted Jul 30, 2025 - 13:37 UTC

Resolved

Dear emnify Customers,

The incident has been resolved. Current metrics indicate that operations have returned to normal. A detailed incident report will be published in the coming days.

If you are still experiencing issues, please reach out to our Support Team via our Help Center.


Kind regards,
emnify Support Team
Posted Jul 28, 2025 - 10:43 UTC

Monitoring

Dear emnify Customers,

We are seeing that the majority of devices have recovered. Continuous monitoring is ongoing, and we will provide further updates as they become available.


Kind regards,
emnify Support Team
Posted Jul 28, 2025 - 10:17 UTC

Update

Dear emnify Customers,

We’ve implemented changes and are now seeing a significant increase in the number of devices recovering. Monitoring continues to ensure full restoration.

We’ll provide further updates as recovery progresses.


Kind regards,
emnify Support Team
Posted Jul 28, 2025 - 09:52 UTC

Update

Dear emnify Customers,

A fix has been deployed by our engineering team. While we are observing devices beginning to reconnect, the full recovery across all deployed devices is taking longer than initially anticipated. Please rest assured that we are actively monitoring the situation and working to ensure complete restoration as quickly as possible.

Thank you for your continued patience and understanding.


Kind regards,
emnify Support Team
Posted Jul 28, 2025 - 09:45 UTC

Identified

Dear emnify Customers,

We have identified the issue and implemented a fix. We are now seeing devices begin to reconnect.

Further updates will be provided shortly.

Kind regards,
emnify Support Team
Posted Jul 28, 2025 - 09:17 UTC

Investigating

Dear emnify Customers,

We are currently investigating device connectivity issues. We are working on resolving the issue and an update will be provided shortly.

Kind regards,
emnify Support Team
Posted Jul 28, 2025 - 09:05 UTC
This incident affected: Network Connectivity (Network Connectivity - North America, Network Connectivity - Europe, Network Connectivity - Asia-Pacific, Network Connectivity - Central and South America & Caribbean, Network Connectivity - Middle East & Africa).