Monday 18 May we had an outage in processing browser check results between 15:44 PM UTC and 20:38 PM UTC. This was caused by a bug in our release and deployment software. API checks were not impacted.
For around 5 hours, we did not record any results for our browser checks. This means we also failed to alert for all failed browser checks (~59 checks across ~31 customers).
We recently made a changes to our deployment pipeline to avoid accidental "canary flags" which we use to run new code in production without exposing it to end users. This change was accompanied by another code change which wasn't included in the original pull request. We shipped a part of the solution, but not the full solution.
The deployment using a new version of our deployment pipeline, without the necessary code change to accompany it.
A hotfix to "unset" this canary flag was created and deployed.
We had no monitoring or alerting for non-existing browser check results so this outage went unnoticed for 5 hours before a customer ticket on Intercom alerted a team member. The team member alerted the on-call engineer.
What Are We Doing About This?
- We've fixed the code that checks the canary mode. We also already have an open PR with more checks in the deployment process itself to avoid these issues.
- No critical deploys after 12:30 (CET): We decided to not do any deployments after lunch hours in Berlin. This won't be a permanent measure but will be in place for the foreseeable future. This will help us find possible failures before since there'll be people on our team using many parts of Checkly & Intercom is monitored.
- Add more Checkly checks: We'll create API checks for the check results for our own checks that checks the time of the most recent results.
- Add specific CloudWatch alarms: We are adding anomaly detection on AWS for this specific pipeline in our infrastructure. A dip in results being published to our pipeline will now trigger an alert.
What went well?
- Our new deployment pipeline is much faster, so deploying the hot fix was relatively quick.
- Person on call was also the one who made the change so the fix was ready relatively fast after it was noticed.
What went wrong?
- We had no monitoring/alerting for the case of browser check runners not publishing results
- Because the browser check results are a relatively small part of the total results our other monitoring/alarm setups were not triggered either.
Where we got lucky?
- Team member noticed a customer chat on Intercom.
- 15:44 - Deployment was triggered that included the faulty code
- 20:32 - Alarm was triggered
- 20:38 - Hotfix deployment started
- 20:42 - Issue resolved