This is the eighth part of our 12-day Advent of Monitoring series. In this series, Checkly's engineers will share practical monitoring tips from their own experience.
If you have large(r) customers, there comes a point where they ask you for service-level agreements, or SLAs for short. These are customer contracts that define different aspects of your service and what you guarantee for each of them. One common agreement is around availability, or, colloquially speaking, uptime.
Your contract might state, and I am not a lawyer, that you guarantee that your service (or core parts of it) is available 99.99% of the time in a given period, usually a month, quarter, or year. This is where the talk about four nines or five nines comes from.
Offering an availability SLA to customers also means that you need to prove that you kept this uptime. For this, you regularly send them SLA reports, or they’ll ask for one at some point ;).
So how can we make sure, and prove for that matter, that our service was available for the time we guaranteed? Enter uptime monitoring with Checkly.
First, as an example, let's take a look at what it means to be available 99.99% of the time of a month. For this, I am using one of my fav websites, which is called uptime.is*. You can type uptime.is/99.99 (or even cooler, uptime.is/four-nines) into your browser bar, and you will get the periods of downtime you can have for that level of availability.
Four nines availability in a month means you can have a maximum downtime of 4m 21s. That's not a lot.
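If you want to reproduce this kind of number yourself, the arithmetic is short. The sketch below assumes a 30-day month; the function names are made up for illustration. Because the result depends on the month length you assume, it can differ by a few seconds from the figure quoted above.

```python
def downtime_budget_seconds(availability_percent: float, period_days: float = 30.0) -> float:
    """Return the maximum downtime, in seconds, allowed within the period."""
    period_seconds = period_days * 24 * 60 * 60
    return period_seconds * (1 - availability_percent / 100)


def format_duration(seconds: float) -> str:
    """Format a number of seconds as minutes and seconds."""
    minutes, rest = divmod(seconds, 60)
    return f"{int(minutes)}m {rest:.1f}s"


# Assumption: a 30-day month. Other month lengths shift the result slightly.
for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {format_duration(downtime_budget_seconds(target))}")
```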
If you are brave, use uptime.is to see what it means for five or even six nines. Sticking with four nines and assuming we have just a single incident in a month, we need to check our service more often than once every four minutes and a few seconds to be sure we detect that downtime.
For example, if we had an API check with a five-minute frequency, an outage could start and end between two checks, so we might report 100% availability at the end of the month while actually being available 99.99% of the time or less. Here, it would be better to use a one-minute frequency, which brings us to a theoretically measurable availability of roughly 99.9977% (one minute is about 0.0023% of a month). Good enough for our four nines, or 99.99%.
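The reasoning here boils down to one formula: the smallest outage a check can reliably see is one check interval, so the finest availability you can measure is one minus the interval divided by the period. Here is a minimal sketch, again assuming a 30-day month and using a hypothetical function name:

```python
def measurable_availability(check_interval_seconds: float, period_days: float = 30.0) -> float:
    """Return the finest availability (in percent) a check interval can resolve."""
    period_seconds = period_days * 24 * 60 * 60
    return 100 * (1 - check_interval_seconds / period_seconds)


# A one-minute check resolves finer than four nines; a five-minute check does not.
print(f"1 min: {measurable_availability(60):.4f}%")
print(f"5 min: {measurable_availability(300):.4f}%")
```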
With a one-minute check, we can detect problems in our services at the granularity that the SLA we gave to our customers requires, and we can report the results with confidence at the end of the month or at the next quarterly business review (QBR).
For convenience, here is a table of the monthly SLA you can measure with the available check frequencies on Checkly.**
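The values in such a table can be recomputed with a few lines of code. This is only a sketch: it assumes a 30-day month, and the list of frequencies (up to one hour) is my assumption about what is available, so adjust it to the options you actually see in your account.

```python
PERIOD_SECONDS = 30 * 24 * 60 * 60  # assumption: a 30-day month

# Assumed check frequencies up to one hour, as (label, seconds) pairs.
FREQUENCIES = [
    ("10s", 10), ("20s", 20), ("30s", 30), ("1m", 60), ("2m", 120),
    ("5m", 300), ("10m", 600), ("15m", 900), ("30m", 1800), ("1h", 3600),
]


def sla_table(frequencies=FREQUENCIES, period_seconds=PERIOD_SECONDS):
    """Return (label, measurable monthly availability in percent) rows."""
    rows = []
    for label, seconds in frequencies:
        availability = 100 * (1 - seconds / period_seconds)
        rows.append((label, availability))
    return rows


for label, availability in sla_table():
    print(f"{label:>4}  {availability:.4f}%")
```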
From the table, you can see that it is important to pick the right frequency for your SLA reporting. If you want five nines of availability for some critical parts of your service, check at least every 20 seconds. If you need four nines, a two-minute frequency is enough.
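You can also turn the question around and ask for the slowest frequency that still fits inside the downtime budget of a given SLA. The sketch below does that under the same assumptions as before: a 30-day month and an assumed list of available intervals. The function name is hypothetical.

```python
# Assumed available check intervals in seconds (10s up to 1h).
INTERVALS = [10, 20, 30, 60, 120, 300, 600, 900, 1800, 3600]


def slowest_sufficient_interval(target_percent, intervals=INTERVALS, period_days=30.0):
    """Return the largest interval not exceeding the downtime budget, or None."""
    period_seconds = period_days * 24 * 60 * 60
    budget_seconds = period_seconds * (1 - target_percent / 100)
    candidates = [i for i in intervals if i <= budget_seconds]
    return max(candidates) if candidates else None


for target in (99.9, 99.99, 99.999, 99.9999):
    print(f"{target}% -> {slowest_sufficient_interval(target)} seconds")
```

With these assumptions, five nines maps to the 20-second interval and four nines to the two-minute interval. For six nines, none of the listed intervals is fast enough, so the function returns `None`.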
Whatever you choose, good luck, and may your systems always be up.
*Checkly is not affiliated with uptime.is in any way; we are just fans. Go check it out, it has a ton more cool features around availability calculations. Also, this service is written in LISP, but that is a story for another time.
**I only included frequencies up to one hour, but you can extend the table to frequencies of up to 24 hours.