A bug in our code, combined with a gap in our error handling and alerting, caused a very limited number of browser checks — to our knowledge just one — to stop reporting results to our processing backend.
Although the impact seems limited, we did a full investigation and are sharing this post mortem.
As far as we can see, one customer was impacted, for one check in their account. We did not receive any other reports, nor could we find any other impacted checks.
An error condition in a Playwright script generated extremely large error logs, which overflowed the error object we normally pass back for further processing. Specifically, the payload exceeded the 256 KB message size limit of the AWS SQS queue.
This means the customer sees no results in their charts and logs and is also not alerted. The data is missing and effectively lost forever.
We managed to reproduce this behaviour and are 99.999% certain it was caused by "out of control" error generation. Most of the time this is harmless. However, we did not truncate the error correctly, as we do for other payloads we handle. This failure mode had always been present in our code and infrastructure.
Once we discovered the root cause and were able to reproduce the issue, we designed a quick fix to truncate the offending payload.
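A minimal sketch of what such a fix looks like, assuming Node.js on the runner. The function names and per-field byte budgets are illustrative assumptions, not our actual implementation:

```javascript
const SQS_LIMIT_BYTES = 256 * 1024; // hard AWS SQS message size limit

// Cut a string at a byte budget, marking the cut so it stays visible downstream.
function truncate(str, maxBytes) {
  if (Buffer.byteLength(str, 'utf8') <= maxBytes) return str;
  return Buffer.from(str, 'utf8').subarray(0, maxBytes).toString('utf8') + ' …[truncated]';
}

// Build a bounded error payload: even a runaway error object can no
// longer push the serialized SQS message past the size limit.
function buildErrorPayload(error) {
  return {
    name: truncate(String(error.name || 'Error'), 256),
    message: truncate(String(error.message || ''), 8 * 1024),
    stack: truncate(String(error.stack || ''), 64 * 1024),
  };
}
```

The key point is budgeting in bytes rather than characters, since the SQS limit is a byte limit and stack traces can contain multi-byte characters.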
We had no effective detection mechanism or alerting in place for this situation. The customer informed us of the odd behaviour, namely missing check result data in their dashboard for a specific check.
What Are We Doing About This?
- We've added appropriate truncation so we no longer hit any size limits.
- We have done, and are still doing, some refactoring to make it easier for our systems to separate "user errors" from "platform errors". User errors here are simply normal errors that can occur in any user-submitted code.
- We've fixed our Sentry logging on our runner infrastructure to correctly report these types of issues.
- We've updated our own alerting to trigger notifications based on these errors.
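The user/platform error split mentioned above can be sketched as follows. The class names and the stack-frame heuristic are illustrative assumptions, not our actual code:

```javascript
class UserScriptError extends Error {} // failure inside user-submitted code
class PlatformError extends Error {}   // failure in our own runner/infra

function classifyError(error) {
  const stack = error.stack || '';
  // Frames from the sandboxed user-script directory mark a user error;
  // everything else is treated as a platform error we must alert on.
  const wrapped = stack.includes('/user-script/')
    ? new UserScriptError(error.message)
    : new PlatformError(error.message);
  wrapped.cause = error; // keep the original error for logging
  return wrapped;
}
```

With a split like this, platform errors can be routed to Sentry and on-call alerting, while user errors are reported back to the customer as a failed check result.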
What went well?
We immediately dropped active work and began investigating the root cause. Once we discovered the root cause and were able to reproduce the issue, we quickly designed both a quick fix and additional logging and alerting measures.
What went wrong?
We did not have effective monitoring in place to alert us of this issue. We had to be informed of it by the customer.
Where did we get lucky?
We got lucky in that it was a very specific issue which affected only one check for one customer.
08:00 - Customer informs us about missed alerts over the weekend.
08:10 - Start root cause analysis and diagnosing the issue.
11:30 - Sync call with the full engineering team.
16:00 - Successfully reproduced the issue.
11:00 - Enabled extra Sentry logging.
16:00 - Deployed truncated error message fix, resolving the issue.
Full day - Continued work on optimising our handling, logging and alerting on these and related errors.