This report provides a breakdown of events for the service outage on 26 February 2018, which affected all customers using our Voice-over-IP telephony platform.
Breakdown of events:
09:42 - Our Network Operations Centre received multiple alerts relating to our phone platform, reporting that a large number of phones were deregistering from it. Engineers immediately began investigating.
09:48 - Engineers identified the cause of the platform instability as a growing queue of requests that was not being served.
09:51 - Engineers rebooted the registration backend.
09:52 - Issue was resolved and service was fully restored.
We identified the root cause as a bug in the maintenance script that was run at around 09:30. The script was manually interrupted midway through its run, which left the call records table locked. Our system backlogged the requests to the backend database, but eventually exhausted its resources and started to drop new ones.
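For readers who want a more concrete picture of this failure mode, the sketch below models a backlog with a fixed capacity. It is an illustration only and not code from our platform: the class `BoundedBacklog`, the function `simulate`, the capacity value and the request names are all hypothetical. It shows how requests accumulate while the backend cannot serve them, and how new requests are dropped once the backlog is full.

```python
from collections import deque


class BoundedBacklog:
    """Hypothetical model: holds requests while the backend is blocked."""

    def __init__(self, capacity):
        self.capacity = capacity   # assumed resource limit, not a real platform value
        self.pending = deque()     # requests waiting for the backend
        self.dropped = 0           # requests rejected because the backlog was full

    def submit(self, request):
        """Queue a request, or drop it if the backlog is already full."""
        if len(self.pending) >= self.capacity:
            self.dropped += 1
            return False
        self.pending.append(request)
        return True

    def drain(self):
        """Return all queued requests in arrival order once the backend recovers."""
        served = list(self.pending)
        self.pending.clear()
        return served


def simulate(capacity, requests_while_locked):
    """Submit requests while nothing is served; return (queued, dropped)."""
    backlog = BoundedBacklog(capacity)
    for i in range(requests_while_locked):
        backlog.submit(f"registration-{i}")
    return len(backlog.pending), backlog.dropped
```

In this model, calling `simulate(3, 5)` leaves three requests queued and two dropped, which mirrors the sequence described above on a very small scale.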
We are putting controls in place to mitigate this set of circumstances. Over the past week our vendor prepared a fix for the maintenance script, and this has now been implemented.
We apologise again for the inconvenience caused by this incident.