The Advent of Monitoring, Day 12: Behind the Scenes: How Checkly Is Using a Smoke Test Matrix to Tame Variant Complexity


This is the last part of our 12-day Advent of Monitoring series. In this series, Checkly's engineers will share practical monitoring tips from their own experience.

At Checkly, the commitment to reliability is not just a tagline; it's embedded in our DNA. As software engineers, we understand the critical importance of dogfooding—using our own product to ensure its robustness and effectiveness. This approach holds immense value, especially since Checkly is designed for observability.

The High Stakes of Shipping Broken Code

In the realm of monitoring and checks, the margin for error is exceptionally narrow. If we ship a broken update, it can cascade into a series of unwanted scenarios for our customers:

  1. False Positives: When a check that should pass starts failing, it triggers unnecessary alerts, potentially waking someone up or causing undue stress.
  2. False Negatives: Conversely, when a failing check erroneously passes, it masks underlying issues, leaving them unaddressed.
  3. False Monitoring: In some cases, checks might not run at all, leading to a complete breakdown in monitoring.

We've encountered each of these scenarios, learning valuable lessons along the way.

The Complexity of Shipping Daily Updates at Checkly

Shipping updates multiple times a day to a diverse customer base brings its own set of challenges. Our customers use a variety of check types: API, Browser, Multi-step, and Heartbeat checks. Our checks support multiple Checkly Runtimes (currently four), each providing a different set of libraries, and are compatible with both JavaScript and TypeScript. Additionally, we offer both Playwright and Playwright Test support, and two different browsers: Chromium and Chrome. On top of that, we ship our runners to 22 different AWS regions. As you can see, there are many variants of checks that our customers use.
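
To give a sense of the combinatorics, here is a rough TypeScript sketch of the dimensions that multiply together (the variable names and the runtime versions other than 2023.09 are illustrative, not our actual code):

// Illustrative only: the dimensions that make up the variant matrix.
const checkTypes = ['api', 'browser', 'multistep', 'heartbeat'] as const
const runtimes = ['2022.02', '2022.10', '2023.02', '2023.09'] as const // 4 runtimes; versions are examples
const languages = ['javascript', 'typescript'] as const
const flavors = ['playwright', 'playwright-test'] as const
const browsers = ['chromium', 'chrome'] as const
const regionCount = 22 // AWS regions we ship our runners to

// Not every combination is valid (a Heartbeat check has no browser, for
// example), but even a subset of this cross product adds up quickly.
const upperBound =
  checkTypes.length * runtimes.length * languages.length *
  flavors.length * browsers.length * regionCount
console.log(`up to ~${upperBound} check variants to keep healthy`)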

The Checkly Smoke Test Matrix: Our Solution

To manage this complexity, we use one of the best monitoring solutions we know: Checkly, our own product. We call our solution the Checkly Smoke Test Matrix: a large collection of Checkly checks that continuously monitor the health of our Runners platform.

For each of the check variants mentioned above, we generate one passing and one failing check. These are real checks that visit and test real websites. They are simple, and their goal is to prove that the basic functionality of each check variant is working.

A typical passing check visits google.com (thank you, Alphabet 😍), a reliable baseline: if Google is down, the internet at large has bigger problems than our Runners platform.

Here is an example of a ✅ passing Playwright Test Chrome browser check that uses TypeScript on runtime 2023.09:

import { expect, test } from '@playwright/test'

// Run against the branded Chrome browser instead of the bundled Chromium
test.use({ channel: 'chrome' })

test.beforeEach(async ({ page }) => {
  await page.goto('https://google.com/')
})

// Ping the corresponding Heartbeat check so a missed run triggers an alert
test.afterAll(async ({ request }) => {
  await request.post(`https://api.checklyhq.com/heartbeats/ping/${process.env.PING_ID}`)
})

test('title is Google', async ({ page }) => {
  await expect(page).toHaveTitle(/Google/)
})

Notice how we call the corresponding Heartbeat check's ping URL to signal that the check actually ran.

For failing variants, we use assertions we know will fail, such as expecting the title of Google's page to be "Not Google". We run these checks every minute, in all 22 AWS regions we operate in.
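
As a sketch of what such a failing variant looks like (mirroring the passing example above, with only the assertion changed):

import { expect, test } from '@playwright/test'

test.use({ channel: 'chrome' })

test.beforeEach(async ({ page }) => {
  await page.goto('https://google.com/')
})

// This assertion is intentionally wrong: the check must keep failing,
// and we get alerted if it ever starts to pass.
test('title is Not Google', async ({ page }) => {
  await expect(page).toHaveTitle('Not Google')
})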

How can the Smoke Test Matrix detect issues?

Each check in our matrix has alerts configured. We get notified if a check that should pass fails (failure alert), if a check that should fail passes (recovery alert), and if a check doesn't run at all. The latter is monitored through a heartbeat check, which alerts if it doesn't receive the expected ping within a time window matching the check's run interval.
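
For that last scenario, a heartbeat check defined with the Checkly CLI constructs could look roughly like this (a sketch assuming the HeartbeatCheck construct with period and grace options; the logical ID, name, and values are made up for illustration):

import { HeartbeatCheck } from 'checkly/constructs'

// If the corresponding browser check stops pinging for longer than its
// one-minute run interval plus a short grace period, this check alerts.
new HeartbeatCheck('smoke-matrix-heartbeat-browser-chrome-ts-2023-09', {
  name: 'Smoke Test Matrix heartbeat: Browser / Chrome / TS / 2023.09',
  period: 1,
  periodUnit: 'minutes',
  grace: 5,
  graceUnit: 'minutes',
})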

This way, we cover all the main scenarios of how our checks can behave, while simultaneously testing multiple components of our platform. If we deploy a change and one of the checks in our Smoke Test Matrix triggers an alert, we immediately investigate. Moreover, because the checks are simple and reliable, we can track their durations to detect performance issues.

Closing Thoughts: A Safety Net for Software Reliability

Leveraging the Checkly CLI, we've automated the creation and maintenance of these checks across our local, dev, staging, and production environments. This forms a crucial part of our safety net, catching bugs before they impact our customers. While no system is bulletproof—a simultaneous failure of all check types and our alerting system is a theoretical possibility—our approach ensures high reliability. We also use public Slack channels for alerting, ensuring transparency and prompt response.
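
As an illustration of what that automation can look like, here is a simplified sketch using the public Checkly CLI constructs (the spec files, logical IDs, and naming scheme are invented for this example, not our actual project code):

import { BrowserCheck, Frequency } from 'checkly/constructs'

// Generate one passing and one failing browser check per runtime.
const runtimeIds = ['2023.02', '2023.09'] // example runtime IDs
const expectations = ['passing', 'failing'] as const

for (const runtimeId of runtimeIds) {
  for (const expectation of expectations) {
    new BrowserCheck(`smoke-browser-${expectation}-${runtimeId}`, {
      name: `Smoke Test Matrix: Browser / ${expectation} / ${runtimeId}`,
      runtimeId,
      frequency: Frequency.EVERY_1M,
      locations: ['us-east-1', 'eu-west-1'], // in reality, all 22 regions
      code: { entrypoint: `./specs/browser-${expectation}.spec.ts` },
    })
  }
}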

In conclusion, at Checkly, we don’t just build tools for observability; we live by them. This commitment to using what we create not only enhances our product's reliability but also ensures that we stay a step ahead in preventing disruptions for our customers.
