How often should you ping your site? Should you be checking every few minutes, or every hour? Surely you have other ways to detect problems, so maybe just a daily check of your API and main page would be enough, right? While there’s no single right answer for everyone, this post tries to break down how you can find the right cadence for your site checks.
In a recent Reddit thread in the (fantastic) r/SRE community, I asked what frequency the engineers responsible for uptime preferred for their heartbeat checks. The answers varied, but even better than the specific responses were the users who explained the logic behind a specific cadence. There’s a balance between the need for timely detection of outages and the risk of overloading the system with excessive monitoring traffic.
It can be hard to justify downtime! Inevitably, after all the work to define acceptable levels of downtime, you’ll end up having the site go down in the middle of the quarter’s biggest sales pitch. Someone in the C-suite will holler that “we can’t go down like that ever again” and the previous measured responses will go out the window in favor of continuous checks of site status. While no one wants downtime, costs matter. Most of the replies to the Reddit discussion above settled on 1 minute for pinger checks. The biggest barrier mentioned was the cost of various testing options. While infrequent checks for a simple 200 response are nearly free, a complex check run ‘every five minutes, every region’ can end up costing up to $10k a month if you pick the wrong SaaS provider.
If you need super-frequent checks with a budget of (near) zero, a DIY solution can offer ‘better-than-nothing’ coverage. Matt Billenstein’s Pingthing is a simple daemon that runs HTTP checks on your services and emails you if they fail. Of course, if your system goes down hard, there’s a danger this service will stop working along with everything else. And any DIY solution will lack incident escalation, nice performance logs, and so on. But if your budget is stretched, it’s always possible to send frequent pings from your own server.
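The core of such a DIY pinger is only a few lines. The sketch below is not Pingthing itself, just a minimal illustration of the idea; the function names and the alerting hook are placeholders of our own, and a real version would add retries, escalation, and some way to notice if the pinger itself dies.

```python
import time
import urllib.request
import urllib.error


def check(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False


def run_once(urls, checker=check):
    """Check each URL once and return the ones that failed."""
    return [url for url in urls if not checker(url)]


def run_forever(urls, interval_seconds=60, alert=print, checker=check):
    """Loop forever, alerting (here: just printing) on every failed check."""
    while True:
        for url in run_once(urls, checker=checker):
            alert(f"DOWN: {url}")
        time.sleep(interval_seconds)
```

Swap `alert=print` for an email or chat webhook and run it under a process supervisor, and you have ‘better-than-nothing’ coverage for free.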
Have you ever read a comment on a discussion that is so complete, such a good summary of all the knowledge that’s out there, that you wish everyone could see it? Well, that happened to me, and it’s the reason for this post. Reddit user and r/SRE contributor u/chkno lays out how every team can calculate the right frequency for their pinger checks, starting with one question:
“What SLA are you defending?”
If you’re on the Checkly blog, you probably don’t need an SLA defined for you, but just in case: it’s the percentage of time your service is guaranteed to be ‘up’ for your clients and stakeholders. Many enterprise relationships now have contractual obligations to an SLA, meaning service providers lose money if they break it. This is a great question to start with, but u/chkno goes further with some great rules of thumb:
> If you're defending a 99% uptime monthly SLA, you have a budget of seven hours per month of downtime to spend. If diagnosing and remedying an issue takes two hours and you expect a maximum of two issues per month, that's four of your seven hours spent on MTTR, so you had better detect the problem within 90 minutes (the three remaining hours divided by the expected maximum two events per month). If you alert on, say, three failed probes, you'd better probe at least every 45 minutes.
Again, we probably don’t need to define Mean Time to Resolution (MTTR), but I’ll emphasize that this calculation requires a fairly honest accounting of the real total time: here, the time between an incident being detected and it being closed. The detection time before that is counted separately.
> To be clear, I'm not saying "probe every 45 minutes". I'm saying: You should be able to calculate an appropriate probe interval from your availability target and your MTTR.
This is a great starting place, so let’s calculate another example: keep the same seven-hour budget, but suppose your MTTR improves to 90 minutes while you average four downtime incidents per month. You now spend six hours of your SLA budget resolving issues, leaving only 60 minutes per month for detection. Divided across your four likely outages, that means a probe needs to run at least every 15 minutes. You can see why so many teams end up with a 1 minute frequency for the most basic checks!
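The arithmetic is simple enough to wrap in a small helper you can rerun as your numbers change. This is just a sketch of u/chkno’s rule of thumb; the function names are ours, and dividing by `failed_probes_to_alert` is a deliberately conservative reading of “alert on N failed probes.”

```python
def downtime_budget_minutes(sla_pct: float, days_in_month: int = 30) -> float:
    """Minutes of allowed downtime per month for a given SLA percentage."""
    return days_in_month * 24 * 60 * (1 - sla_pct / 100)


def max_probe_interval_minutes(budget_minutes: float,
                               mttr_minutes: float,
                               incidents_per_month: int,
                               failed_probes_to_alert: int = 1) -> float:
    """Longest probe interval that still defends the SLA budget."""
    # Whatever the repair time doesn't consume is your detection budget.
    detection_budget = budget_minutes - mttr_minutes * incidents_per_month
    if detection_budget <= 0:
        raise ValueError("Repair time alone exhausts the SLA budget")
    per_incident = detection_budget / incidents_per_month
    # Conservative: require N failed probes to fit inside the detection window.
    return per_incident / failed_probes_to_alert
```

With a seven-hour (420-minute) budget, a 90-minute MTTR, and four incidents per month, `max_probe_interval_minutes(420, 90, 4)` returns 15 minutes, matching the example above.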
User u/chkno gives some guidance about what to do when the result of that calculation is too costly:
> If that calculation results in an unreasonably high cost, you'll need to do something else, like build a more economical way to get a health signal, improve MTTR (eg: maybe you need to be able to deploy rollbacks faster), or renegotiate the SLA. Choosing to probe at a rate inadequate to defend your SLA is planning to fail.
There are a number of ways to improve your MTTR, and the automation of rollbacks is a great place to start: if you begin your incident response by rolling back to a point when the problem wasn’t present, you’re drastically reducing how long it takes to get the service up again. It’s interesting to note here that faster/better rollbacks may increase how long it takes to investigate the problem and fix the last release. Without the failure continuing to occur, it’s possible that our logging and other observability tools can’t capture the real root cause. This can even lead to some institutional inertia against fast rollbacks. If you’re seeing incidents that you can only diagnose if they keep happening, consider overall observability improvements so the nature of each failure is better recorded.
Working through these ideas, there are some simple steps to help you calibrate how often you should ping your site, and what kind of checks you should use.
- Understand Service Criticality: Not all services demand the same level of monitoring. Classify your services based on their criticality to business operations. High-criticality services need more frequent checks, while for others, a much less frequent schedule will suffice.
- Assess Service Characteristics: Consider the nature of the service. A static content service won’t need as frequent monitoring as a dynamic, user-facing application.
- Schedule dynamically based on region: With some intelligent logic on check timing, you can prioritize finding the status of a single region, or spread your checks around so at least one region is always checking site status. Take a look at the Checkly documentation to see some options on how to schedule checks across different regions.
- Balance with Comprehensive Monitoring: Pinger checks are just one aspect of monitoring. They should be complemented with more detailed health checks and performance metrics to provide a holistic view of service health. As you’ve seen in our Playwright documentation, with Checkly you can test really complex behaviors within your site. Use these checks to find hidden issues that won’t be covered by just looking for a 200 OK response.
- Refine with Feedback Loops: Use feedback from past incidents to refine your monitoring strategy. If frequent pinger checks are missing issues or causing unnecessary noise, adjust accordingly.
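The region-scheduling idea in the list above can be sketched without any vendor API: spread each region’s check evenly across the base interval, so some region is always about to probe. This is a generic illustration of the staggering logic, not Checkly’s scheduler.

```python
def staggered_offsets(regions: list[str],
                      interval_minutes: float) -> dict[str, float]:
    """Assign each region a start offset so checks are spread evenly
    across one interval instead of all firing at the same moment."""
    step = interval_minutes / len(regions)
    return {region: round(i * step, 2) for i, region in enumerate(regions)}
```

With three regions on a 15-minute check, the offsets come out to 0, 5, and 10 minutes, so the site gets probed from somewhere every 5 minutes even though each individual region only checks four times an hour.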
If I could go into battle for pinger checks, I would definitely fight under the banner of
“🏰 🗡️Choosing to probe at a rate inadequate to defend your SLA is planning to fail.🏹 🛡️”
If you’d like to join a community of engineers trying to work out the right way to ping their site, join the Checkly Slack and say hi!