If you’re leading a technical organization, you’ll face a critical question when you want to perform uptime monitoring: should we build and run an uptime monitor ourselves, or pay for a service? While it’s entirely possible to build your own basic uptime monitor, doing so will usually end up costing your business more.
Requirements for a Scalable Monitoring Solution
At Checkly, we have the good fortune of seeing the monitoring requirements of some of the world’s most advanced engineering teams. To set up our case, let’s start with some of the features and capabilities that you’d need for a DIY build.
Synthetic Monitoring
You’ll need a solution that can simulate user behavior, which goes beyond just checking an endpoint for a 200 OK response. Some user scenarios you’d like to be able to simulate:
- A user makes a search, waits for the results, and adds the items found to a shopping cart
- User downloads a PDF, renames it, and re-uploads it
- A user loads a page with an accessibility plugin that replaces images with alt-text
It’s tempting to say these ‘go beyond’ uptime monitoring, but if your service isn’t working for these scenarios, your users will certainly feel the site is down. Any solution should support an open source framework like Playwright for simulating a user.
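To make that concrete, here’s a minimal Playwright sketch of the first scenario above. The URL and the `data-testid` selectors are placeholders for your own application, not a real storefront.

```ts
// search-and-add-to-cart.spec.ts
// Minimal sketch of a synthetic user journey: search, wait for results, add to cart.
import { test, expect } from '@playwright/test';

test('user can search and add a result to the cart', async ({ page }) => {
  // Load the storefront (placeholder URL)
  await page.goto('https://shop.example.com');

  // Run a search and wait for results to render
  const searchBox = page.getByTestId('search-input');
  await searchBox.fill('running shoes');
  await searchBox.press('Enter');
  await expect(page.getByTestId('search-result').first()).toBeVisible();

  // Add the first result to the cart and verify the cart badge updates
  await page.getByTestId('add-to-cart').first().click();
  await expect(page.getByTestId('cart-count')).toHaveText('1');
});
```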
Global Distribution
We’ve all experienced outages that largely affect a single geographic area. I personally have scrambled more than once to start up my VPN access to check if an issue reported by multiple users was only happening in one region.
While the internet often feels homogeneous and omnipresent, in reality the connections between continents are tenuous, and most responsive services rely on local servers. Uptime monitoring can’t be localized to a single region unless your users are.
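As an illustration, here’s what scheduling the same check from several regions looks like with Checkly’s CLI constructs. Treat it as a sketch: the region names and option names follow the CLI’s conventions but should be verified against your version, and a DIY build would need equivalent runners deployed in each of these regions.

```ts
// homepage.check.ts
// Illustrative sketch: one uptime check scheduled from three continents,
// so a regional outage is visible at a glance. URL is a placeholder.
import { ApiCheck, AssertionBuilder, Frequency } from 'checkly/constructs';

new ApiCheck('homepage-uptime', {
  name: 'Homepage uptime (multi-region)',
  frequency: Frequency.EVERY_5M,
  // Run the same check from three regions instead of only where the service is hosted
  locations: ['us-east-1', 'eu-west-1', 'ap-southeast-1'],
  request: {
    url: 'https://www.example.com/', // placeholder URL
    method: 'GET',
    assertions: [AssertionBuilder.statusCode().equals(200)],
  },
});
```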
Isolation
Your uptime monitoring isn’t worth much if it goes down along with your own service. That means any DIY service will need to run on its own containers, route data to its own storage, and produce both alerts and dashboards on a system independent from the rest of your services. This is particularly critical as you’re relying on your uptime monitor to notify you and start the incident response process. If you’re relying on users to report some downtime events (because the outage also takes down your uptime monitor), you’re adding at least 15 minutes to the ‘detection’ portion of your mean time to repair (MTTR). If we think about the time needed to attain a reasonable MTTR, this means that delivering more than ‘three nines’ of availability is out of reach.
Alerting Channels
Rule-Based Alerting is one requirement: you need fine-grained control over when alerts go out to your team, including:
- Retry Logic
- Different alert thresholds for different checks
- A ‘degraded’ state for issues of concern that don’t warrant an alert sent to the on-call team
Notifications and Integrations need to go slightly beyond a simple email, since we expect to check multiple routes and types of user interaction. Key features include (a configuration sketch follows this list):
- Multi-channel notifications (email, SMS, Slack, PagerDuty, Teams, etc.)
- Ability to configure escalation policies and on-call schedules
- Flexible integration with ticketing systems (Jira, ServiceNow)
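As a rough sketch of what that configuration looks like in practice, here’s an illustrative example using Checkly-style constructs: two notification channels, a retry policy, and a ‘degraded’ threshold attached to one check. The URLs, addresses, and thresholds are placeholders, and option names may differ across CLI versions; a DIY system would need to build its own equivalent of every one of these pieces.

```ts
// alerting.check.ts
// Illustrative sketch only: multi-channel alerting, retries, and a degraded state.
import {
  ApiCheck,
  AssertionBuilder,
  EmailAlertChannel,
  Frequency,
  RetryStrategyBuilder,
  SlackAlertChannel,
} from 'checkly/constructs';

// Two independent notification channels, so one provider outage can't hide an alert
const email = new EmailAlertChannel('oncall-email', {
  address: 'oncall@example.com', // placeholder address
});
const slack = new SlackAlertChannel('oncall-slack', {
  url: new URL('https://hooks.slack.com/services/XXX/YYY/ZZZ'), // placeholder webhook
  channel: '#incidents',
});

new ApiCheck('checkout-health', {
  name: 'Checkout API health',
  frequency: Frequency.EVERY_5M,
  request: {
    url: 'https://api.example.com/checkout/health', // placeholder URL
    method: 'GET',
    assertions: [AssertionBuilder.statusCode().equals(200)],
  },
  alertChannels: [email, slack],
  // Retry a couple of times before alerting, to filter out transient blips
  retryStrategy: RetryStrategyBuilder.linearStrategy({
    baseBackoffSeconds: 30,
    maxRetries: 2,
  }),
  // Mark the check 'degraded' (no page) above 2s; hard-fail above 5s
  degradedResponseTime: 2000,
  maxResponseTime: 5000,
});
```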
Dashboards
Now that our on-call team has been notified of an incident, we require a clear and elegant dashboard system for our uptime monitor. Remember: we can’t leverage too much of our existing Ops infrastructure, since isolation is a requirement. Let’s look at an example Checkly dashboard to get a sense of the requirements:
Here’s the information we get from this high-level dashboard view, all of which make a midnight incident response that much more effective:
- Pattern of failure - we know how often our check is failing versus succeeding
- Geographic distribution - what failed where is visible at the top level
- Statistical breakdown - whether failures are outliers or the norm
- Trace details - at this high level, we get details on the most recent failure
These details help an incident response team categorize the failure much faster (which users are affected and how severely) and begin root cause analysis.
Challenges with DIY
There are a number of known challenges with running DIY uptime monitoring on your own cloud infrastructure, all of them faced by other teams in the past:
Poor Separation Of Concerns
By default, services stored on the same cloud, cluster, and network as your production services will fail at the same time. There’s not much point in an uptime monitor that will go down with the rest of your service.
Geographic distribution
Every large engineering team has had incidents that were specific to a single region. If the purpose of infrastructure monitoring is to monitor from the outside in, you must monitor from regions other than where your service is hosted.
Maintenance and Upkeep
An uptime monitor requires its own uptime monitor, which means dedicated ongoing effort to make sure it’s always working. As we pursue a good separation of concerns and geographic distribution, we multiply the maintenance workload.
Notification Channels
While email notifications can be easy enough to implement, we don’t want a single point of failure in knowing about downtime. That means work to integrate and maintain webhooks, SMS, and email notifications.
Environment consistency
How do we ensure that service checks always run in an identical environment? Are we building containers to run each check? How will we manage needed package updates to the runners?
🏗️ Customer Story: Consensys
For just one example of a team running into the problems listed above: Consensys used a DIY solution based on Prometheus Blackbox Exporter and some APM tools, but ended up hurting productivity and running up service and infrastructure costs with a fractured approach to what is really a single problem (knowing whether the service was fully working for all users).
Hidden costs of DIY Monitoring
Along with the concerns listed above, there are also a number of unknowns when deciding to build a DIY solution.
Adoption
If your team develops an uptime monitor, it’s likely that only that team will know how to add new checks, routes, and interactions. The result is a tool that’s heavily siloed and prevents shifting monitoring left into your engineering teams.
Runaway infrastructure costs
What systems are in place to ensure that no one can configure site checks that run too frequently, never time out, or retry too many times? In other words, what stops your own checks from running up your compute and network bills as they constantly verify that the service is running?
Self-DDoS
Related to runaway infrastructure costs, what ensures that, when checking a service and performing retries in case of failure, we don’t accidentally create a check that strains our own service, a self-inflicted denial-of-service attack?
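One way to reason about the guardrails involved: a DIY runner needs a hard retry cap, backoff with jitter, and a per-check timeout before it can be trusted not to pile onto a struggling service. Here’s a minimal TypeScript sketch of that logic; the constants are illustrative, not recommendations.

```ts
// check-runner-guardrails.ts
// Sketch of the guardrails a DIY check runner needs so its own retries
// can't hammer the service under test.
const MAX_RETRIES = 3;
const BASE_BACKOFF_MS = 5_000;
const CHECK_TIMEOUT_MS = 10_000;

async function runCheckWithGuardrails(url: string): Promise<boolean> {
  for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) {
    try {
      // Per-check timeout: a hanging request never blocks the runner
      const res = await fetch(url, { signal: AbortSignal.timeout(CHECK_TIMEOUT_MS) });
      if (res.ok) return true;
    } catch {
      // Network error or timeout: fall through to backoff and retry
    }
    if (attempt < MAX_RETRIES) {
      // Exponential backoff plus jitter, so retries from many runners don't align
      const delay = BASE_BACKOFF_MS * 2 ** attempt + Math.random() * 1_000;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  return false; // retries exhausted: raise an alert instead of retrying forever
}
```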
🏗️ Customer Story: LinkedIn
After investing significant time in an in-house solution, LinkedIn was facing significant infrastructure costs to run monitoring tools in-house. As the company scaled, maintaining and expanding these systems became more complex.
As a result, 60% of incidents were tied to changes made in production. There was no way to ensure that critical user flows remained functional after app or service updates went into production, and internal tools proved both expensive and unreliable.
Benefits of a Cloud-based Solution
A great concern for any tech team should be a ‘lock-in’ or ‘sunk cost’ situation that makes it difficult to switch solutions later on. Ironically, a DIY solution may present the highest risk of lock-in, since it’s hard to justify replacing a solution that development time has already been spent on. Consider a paid uptime monitoring solution for the following benefits:
Quicker Value
Rather than waiting weeks or months to implement a synthetic browser, check runner, scheduler, and notification system, launch monitoring in a few days with off-the-shelf tools.
Easier Adoption
A robust tool like Checkly will have a user-friendly web GUI along with CLI tools and full Monitoring as Code support. Engage all your engineers in the practice of monitoring by democratizing monitoring tools.
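For a sense of what Monitoring as Code means in practice, here’s an illustrative sketch that registers the earlier Playwright spec as a scheduled browser check, so it lives in version control next to the application code. Construct and option names follow Checkly’s CLI conventions but should be checked against the current docs.

```ts
// cart.check.ts
// Illustrative sketch: a Playwright spec promoted to a scheduled browser check.
import { BrowserCheck, Frequency } from 'checkly/constructs';

new BrowserCheck('search-and-add-to-cart', {
  name: 'Search and add to cart',
  frequency: Frequency.EVERY_10M,
  locations: ['us-east-1', 'eu-west-1'],
  code: {
    // Path to the Playwright spec shown in the Synthetic Monitoring section
    entrypoint: './search-and-add-to-cart.spec.ts',
  },
});
```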
Open Source Tooling
The right monitoring tool will integrate with a testing and automation framework like Playwright, making it easy to write new monitors. Further, look for integrations with OpenTelemetry, the open source observability standard, to connect back-end insights with synthetic check results.
🏗️ Customer Story: commercetools
Sticking with traditional in-house validation for deployments, commercetools experienced significant downsides to their homegrown solution: uptime checks were limited to shallow verification that services were online, users would find problems that the automated tools couldn’t, and deployments still required manual validation that consumed a great deal of developer time. With an off-the-shelf solution from Checkly, developers could write checks that accurately simulated real user behavior. Automated checks validated each deployment without significant manual work, leading to more reliable releases and 7x more deployments with Checkly than before.
Conclusions
Choosing between building or buying an uptime monitoring solution comes down to weighing the risks, effort, and long-term value. While building your own might seem cost-effective at first, the hidden costs in time, maintenance, and reliability can outweigh the benefits. A paid solution like Checkly offers faster setup, better tools for your team, and a clear path to reliable, user-focused monitoring. By investing in the right tools, you can ensure uptime monitoring works as it should—helping your team deliver a better, more dependable product.
Checkly can help your team implement independent monitoring, with highly targeted notifications, useful failure information, and deep tracing powered by Checkly traces. Get started today to make sure you find every incident before your users do.