After deployment of a new version of our policy control function, some devices consumed an incorrect traffic limit configuration. The issue was detected by the responsible team during post-release monitoring through a drop in our overall traffic KPIs.
Approximately 1.5% of our endpoints were blocked between 12:11 and 12:43 UTC. Affected devices were actively disconnected and could not establish a new data connection until 12:50 UTC, when the configuration was corrected.
A new version of the policy control function is deployed
Rollout monitoring shows no findings
>>>INCIDENT STARTS HERE<<<
First blocking cases identified; investigation of the potential problem begins
Alert is created internally
Deployment is rolled back to prevent further blocking
Correction of the blocked endpoints' configuration finished; connectivity recovered
>>>INCIDENT STOPS HERE<<<
Status Page: backfilling incident
Status Page: Resolved - This incident has been resolved.
Incident closed by Incident Manager
Rollback of the policy control application
Correction of the blocked endpoints configuration
Race condition in the configuration stream for the limit configuration, where an incorrect retention policy led to incorrect state for some device endpoints
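To illustrate the failure mode (a minimal sketch with hypothetical names and data, not EMnify's actual implementation): if per-endpoint limit state is rebuilt by folding a keyed event stream in arrival order, a race where a stale "limit exhausted" event lands after the corrected limit leaves the endpoint blocked. Ordering by event offset instead of arrival order avoids this.

```python
def apply_events(events):
    """Fold (endpoint_id, offset, limit) config events into per-endpoint
    state by arrival order (last write wins) -- vulnerable to races."""
    state = {}
    for endpoint_id, offset, limit in events:
        state[endpoint_id] = limit
    return state


def apply_events_ordered(events):
    """Same fold, but discard events older (by offset) than the one
    already applied, so a late-arriving stale event cannot win."""
    state = {}  # endpoint_id -> (offset, limit)
    for endpoint_id, offset, limit in events:
        if endpoint_id not in state or offset > state[endpoint_id][0]:
            state[endpoint_id] = (offset, limit)
    return {ep: ol[1] for ep, ol in state.items()}


# Race: the stale "limit 0" event (offset 1) arrives after the
# corrected 10 MB limit (offset 2) was written.
racy_stream = [("ep-1", 2, 10_000_000), ("ep-1", 1, 0)]

apply_events(racy_stream)          # {'ep-1': 0} -> endpoint blocked
apply_events_ordered(racy_stream)  # {'ep-1': 10000000} -> correct
```

The same stale-wins outcome occurs when retention or compaction deletes the newer record before replay; the fix in both cases is to make state reconstruction order-aware rather than arrival-order dependent.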
Corrective Actions (to prevent this from happening again)
In-depth review of the configuration management system's data consistency
Improve feature management to roll out with the feature disabled, allowing a longer observation period before reactivating the feature
EMnify to implement alert monitoring based on unusual traffic limits set on endpoints.
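As a sketch of what such alert monitoring could look like (hypothetical names and thresholds, not a committed design): flag endpoints whose configured traffic limit is drastically below the fleet median, which would have caught endpoints mistakenly set to a zero or near-zero limit.

```python
from statistics import median


def unusual_limit_alerts(limits, rel_threshold=0.01):
    """Flag endpoints whose configured traffic limit is far below the
    fleet median -- a cheap proxy for 'unusual traffic limit'.

    limits: dict of endpoint_id -> configured limit in bytes.
    Returns a sorted list of endpoint ids below rel_threshold * median.
    """
    fleet_median = median(limits.values())
    return sorted(ep for ep, lim in limits.items()
                  if lim < rel_threshold * fleet_median)


fleet = {"ep-1": 10_000_000, "ep-2": 12_000_000,
         "ep-3": 0, "ep-4": 11_000_000}
unusual_limit_alerts(fleet)  # ['ep-3']
```

In production such a check would run against the configuration store on a schedule and page when the flagged set grows abnormally, rather than relying on traffic KPIs to surface the problem after devices are already blocked.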
Posted May 13, 2022 - 16:49 UTC
Between 12:11 and 12:43 UTC, we mistakenly blocked devices from opening data sessions. This affected some customers that have a traffic limit configured; customers without a traffic limit configured were not affected by the incident. The incident was triggered by our application reprocessing the traffic limit configuration.
The issue is now resolved, and we will further investigate the root cause of the problem.