Devices incorrectly blocked from opening data sessions

Incident Report for emnify

Postmortem

Day/Time (UTC)	Event
2022-05-12 12:11	Incident Start
2022-05-12 12:50	Incident End

Executive summary

After a new version of our policy control function was deployed, an incorrect limit configuration was applied to some devices. The responsible team detected the issue during post-release monitoring through a drop in our overall traffic KPIs.

Impact

Approximately 1.5% of our endpoints were blocked between 12:11 and 12:43. Affected devices were actively disconnected and could not establish a new data connection until 12:50, when the configuration was corrected.

Timeline (UTC)

Time	Event
11:53	A new version of the policy control function is deployed
	Monitoring rollout without findings
12:11	>>>INCIDENT STARTS HERE<<<
	First blocking cases identified; investigation of the potential problem begins
12:36	Alert is created internally
	Incident triggered
12:43	Deployment is rolled back to prevent further blocking
12:50	Configuration of blocked endpoints corrected to recover connectivity
12:50	>>>INCIDENT STOPS HERE<<<
13:30	Status Page: backfilling incident
	Status Page: Resolved - This incident has been resolved.
	Incident closed by Incident Manager

Stabilization Steps

Rollback of the policy control application
Correction of the blocked endpoints configuration

Contributing Factor(s)

A race condition in the configuration stream for the limit configuration, where an incorrect retention policy led to an incorrect state for some endpoints

Corrective Actions (to prevent this from happening again)

In-depth review of data consistency in the configuration management system
Improve feature management so that features are rolled out disabled and observed for longer before being activated again
EMnify to implement alert monitoring based on unusual traffic limits set on endpoints.
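
The last corrective action could take a form like the following. This is a minimal sketch only, not the monitoring that was actually implemented: the `LimitChange` record, the `is_unusual` rule, and both thresholds (`shrink_factor`, `max_fraction`) are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class LimitChange:
    """Hypothetical record of one traffic limit update on an endpoint."""
    endpoint_id: str
    old_limit_mb: Optional[float]  # None means no limit was configured
    new_limit_mb: Optional[float]  # None means the limit was removed


def is_unusual(change: LimitChange, shrink_factor: float = 0.1) -> bool:
    """Flag a change that sets a non-positive limit or cuts an existing
    limit to less than shrink_factor of its previous value (assumed rule)."""
    if change.new_limit_mb is None:
        return False  # removing a limit cannot block a device
    if change.new_limit_mb <= 0:
        return True
    if change.old_limit_mb is None:
        return False  # a first, positive limit is treated as normal
    return change.new_limit_mb < change.old_limit_mb * shrink_factor


def should_alert(changes: Iterable[LimitChange],
                 total_endpoints: int,
                 max_fraction: float = 0.005) -> bool:
    """Alert when the share of endpoints with an unusual limit change in
    the observed window exceeds max_fraction (assumed threshold)."""
    if total_endpoints <= 0:
        return False
    unusual = {c.endpoint_id for c in changes if is_unusual(c)}
    return len(unusual) / total_endpoints > max_fraction
```

With these assumed thresholds, a deployment that drastically lowers the limits of more than 0.5% of endpoints within one observation window would raise an alert, which is well below the roughly 1.5% of endpoints affected in this incident.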

Posted May 13, 2022 - 16:49 UTC

Resolved

Between 12:11 and 12:43 UTC we mistakenly blocked devices from opening data sessions. This affected some customers who have a traffic limit configured. Customers with no traffic limit configured were not affected. The incident was triggered by our application reprocessing the traffic limit configuration.

The issue is now resolved and we will further investigate the root cause of the problem.

We are sorry for any inconvenience caused.

Posted May 12, 2022 - 12:30 UTC