
Exploring the Synergy Between Testing and Monitoring in Software Development


Understanding the Distinct Yet Overlapping Worlds

The roles of testing and monitoring often intersect, yet they maintain distinct identities. In my near-decade in the tech sector I've observed how end-to-end (E2E) tests and synthetic monitoring, despite sharing frameworks and requirements, rarely benefit from collaboration. This leads to Operations teams testing black-box code, developers who don't know or care how their code runs in production, and flaky, flappy test suites that don't get taken seriously until users begin to complain.

All these teams and stakeholders want consistent, trustworthy tools that help them ship reliable software, and it seems a shame that the work developers and testers put into writing end-to-end tests isn't better used for monitoring that code in production. Let's talk a bit about how this split happened, and why bridging the gap is harder than you'd think.

The Divergent Paths of E2E Testing and Synthetic Monitoring

At the heart of their differences lies the purpose and scope of each approach. E2E tests are the bulwark against code regressions, simulating user interactions to validate code correctness. They delve deep into the application, not shying away from the less-traveled paths and edge cases. In contrast, synthetic monitoring is the sentinel of infrastructure health and user experience, focusing on ensuring that key functionalities like login and regional accessibility are always up and running.

When it comes to execution cost and frequency, E2E tests are often resource-intensive and time-consuming, a factor that becomes increasingly significant in larger organizations. When I asked about the distinction between E2E and Synthetics on Hacker News, user dexwiz gave a simple explanation for why these tests usually run separately:

E2E tests can be expensive or long running. An E2E test suite for a mature application may take hours (or even days) to run, which means time and money. In larger orgs, the cost of running E2E tests can represent a significant dollar amount. Synthetic tests should be quick so they can give timely feedback and be reran often.

This touches on a distinction that will come up over and over again: cost. Whenever I ask how nice it would be to run E2E tests continuously, the answer boils down to 'sure, if money is no object!' More on this in the test design section below.

The environment in which these tests operate also sets them apart. E2E tests are typically confined to isolated environments, a necessary measure to shield real user data from the potential chaos of testing. This allows for things like stateful tests that push major updates to user records, since the records are fake and the isolated environment can easily be rolled back. Synthetic monitoring, on the other hand, mirrors real user behavior in the production environment, offering a non-intrusive yet accurate reflection of the system's health.

Geographical and network considerations further distinguish these two approaches. E2E tests, often run in the same network as the application code, might miss connectivity issues entirely. This is by design: when testing a new branch I don't want to know that our German network portal is down.

Synthetic monitoring, with its deployment across multiple locations, including external networks, excels at identifying regional and connectivity-related problems. With tools like Checkly, you can monitor from many locations around the world, so you know that users in every region can reach your service.

Differences in test design

What qualifies as ‘failing’? I remember a test suite I interacted with as a junior dev that, when I tested my branch, returned 231 failing tests. With a large measure of embarrassment I told my mentor this, and she told me “Oh that’s fine, as long as it’s 231 tests exactly. Those 231 are always broken. Any fewer or more is a problem.” While I usually tell this story to illustrate the universality of tech debt, it also shows the difference between monitoring a production site and running internal tests: this dev could rely on her tests to fail in the exact same way every single time.

A critical difference is the tolerance for noise, and interestingly, it can cut either way. Developers, focused on code correctness, might exhibit a higher tolerance for a few known-flaky tests. For SRE/DevOps teams the stakes are higher - noisy alerts can signify real-time issues impacting business operations and revenue - so their tolerance for false alarms is much lower. On the other hand, SRE and DevOps may treat a certain level of failure as background noise that doesn't warrant an alert: no matter where the errors come from or what kind of messages are thrown, if they occur fewer than, say, four times per hour, no alert goes out.

This affects every detail of how we design our tests: timeouts are probably quite generous in internal tests, whereas with synthetic monitoring and heartbeat checks we want to know if performance has degraded even slightly from the ideal. We won't wake up the whole Ops team if the site is loading in 7.1 seconds, but it's still worth noting.
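
As a rough illustration (the thresholds and URL here are invented, and this isn't a prescription for how Checkly models degraded checks), a Playwright check can treat "slow but working" as a soft failure worth recording, while reserving hard failures for the generous timeout an internal suite might use:

```ts
import { test, expect } from '@playwright/test';

// Hypothetical thresholds: "degraded" is worth noting, "failed" is worth paging someone.
const DEGRADED_MS = 3_000;
const FAILED_MS = 10_000;

test('homepage loads within acceptable time', async ({ page }) => {
  const start = Date.now();
  await page.goto('https://example.com', { waitUntil: 'load', timeout: FAILED_MS });
  const elapsed = Date.now() - start;

  // Soft assertion: flags the run as degraded without aborting the test.
  expect.soft(elapsed, `loaded in ${elapsed}ms, degraded threshold is ${DEGRADED_MS}ms`)
    .toBeLessThan(DEGRADED_MS);

  // Hard assertion: this is the level that should actually alert someone.
  expect(elapsed).toBeLessThan(FAILED_MS);
});
```

The same spec can serve both audiences; only the thresholds, and the alerting wired up behind them, change.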

This brings us back to cost: both the cost of running tests and the costs of failing for real users. Reddit user u/itasteawesome offers this perspective on how the coverage of testing vs. monitoring is driven by cost:

During testing I expect to be extremely thorough and validate basically everything. Once it has passed those I can cut it down to a subset of the most critical user interactions. If money is no object then sure monitor every single test, but inevitably my customers found that budgets exist and they are reigned in.

Costs do exist, and even though Checkly can beat what you're currently paying for synthetics testing, no testing system is free. A test suite that takes an hour to complete and generates 2 gigabytes of reports can't be run every 5 minutes from your synthetics service. While I want to explore using tests in production as part of a monitoring plan, not every test can be reused wholesale.

The Philosophical Kinship between testing and monitoring

Testing and monitoring are not worlds apart. Both, in essence, involve observing software behavior. Monitoring does so passively, watching over the running system, while testing takes an active role, manipulating the application to validate specific outcomes. As we write synthetics checks that log in, perform account actions, and check detailed performance, we see that there’s a good deal of overlap in how these two fields work in practice.

The Case for Using Pre-Production Tests in Your Synthetics: Not a recycling bin but a resale store.

In reimagining the relationship between QA and Operations, it's helpful to think of it less as a linear pipeline and more as a dynamic, collaborative process. Rather than moving every pre-production test over to a production environment, the most useful tests can be selected and reused. This isn't a conveyor belt at a recycling center, but a resale shop with only the best tools on the shelf.

In this model, the two teams engage in regular check-ins, evaluating the suite of tests developed during the QA phase. These selected tests are then adapted for use in production environments, typically run by synthetic monitoring or pinger systems. This approach ensures that the most valuable tests are reused in a way that maximizes their utility across different stages of the software lifecycle.

Re-using tests in production could involve automation techniques, such as tagging certain tests during the QA phase for potential use in production, or employing a sampling method to periodically select tests for production use. Tests deployed in production should be read-only, designed to not affect performance, data integrity, or security. On Reddit, user u/gmuslera explains it like this:

The map is not the territory. All the synthetic data and loads that you did in testing may not be what the application is actually facing in production. And you may have an important input there. But do it in a way that doesn't affect production, in performance, data integrity, security, privacy and so on.

By carefully selecting and adapting tests for production use, teams can gain valuable insights into real-world performance and system health, saving time in the test-writing phase, all while ensuring the smooth functioning of the live environment.
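
To make the tagging idea concrete, here's a minimal sketch using Playwright's test tags (available in recent Playwright versions); the tag name, selectors, and environment variables are placeholders, not a Checkly convention:

```ts
import { test, expect } from '@playwright/test';

// Tagged during QA as a candidate for production monitoring: read-only, no data mutation.
test('user can log in', { tag: '@synthetic' }, async ({ page }) => {
  await page.goto('/login');
  await page.getByLabel('Email').fill(process.env.MONITOR_USER ?? 'user@example.com');
  await page.getByLabel('Password').fill(process.env.MONITOR_PASS ?? 'secret');
  await page.getByRole('button', { name: 'Log in' }).click();
  await expect(page.getByText('Welcome back')).toBeVisible();
});

// Stateful and destructive: stays in the isolated QA environment, so no tag.
test('admin can bulk-update user records', async ({ page }) => {
  // ...mutates records, relying on the environment being rolled back afterwards
});
```

The full suite runs as usual in QA, while the monitoring side runs only the promoted subset with `npx playwright test --grep @synthetic`.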

The Case for Synthetic Monitoring in Pre-Production: What’s good for the goose.

Integrating synthetic monitoring into QA, test, staging, or even development environments allows for early detection of issues, ensuring that problems are addressed before they escalate in the production environment. It also brings a touch of realism, offering a preview of how the application will perform in the real world.

Bringing production monitors into development is, again, not a direct pipeline: we don't want to wake people up with blaring notifications just because a Staging environment went down for 10 minutes. Further, it may not make sense to handle a failing production-style check the same way as a failed E2E test, by blocking deployment of a branch. Rather, these synthetic monitors should send more analogue signals about what is changing as these environments are updated. If performance slowly degrades as a new feature is merged, that's useful information, and it may surprise the people writing the code. That's the core benefit: a new stream of system-level information that doesn't require separate maintenance to stay in sync with expectations for the production system.
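
One lightweight way to do this (a sketch, not a prescribed Checkly setup; the environment variables and threshold are invented for illustration) is to parameterize the check's target, so the same spec runs against staging or production, recording timings everywhere but only enforcing them where an alert is warranted:

```ts
import { test, expect } from '@playwright/test';

// TARGET_URL decides which environment this check points at.
const baseURL = process.env.TARGET_URL ?? 'https://staging.example.com';
const isProduction = process.env.TARGET_ENV === 'production';

test('checkout page renders', async ({ page }) => {
  const start = Date.now();
  await page.goto(`${baseURL}/checkout`);
  await expect(page.getByRole('heading', { name: 'Checkout' })).toBeVisible();
  const elapsed = Date.now() - start;

  // Always record the timing as an annotation - the "analogue signal".
  test.info().annotations.push({ type: 'load-time-ms', description: String(elapsed) });

  // Only production turns a slow load into a failure that can page someone.
  if (isProduction) {
    expect(elapsed).toBeLessThan(5_000);
  }
});
```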

Adding synthetics checks to earlier stages also has the benefit of warning the entire team when a new update will break existing synthetics checks on production. Knowing this ahead of time can save Operations time, and can even prevent failed canary deployments if we update the tests along with the new feature. Note that this is a benefit of Checkly, since your checks can live in your repository next to your production code, meaning a new feature and its synthetics update can be reviewed and shipped together.
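
As a rough sketch of what that co-location can look like with the Checkly CLI's constructs (the logical ID, spec path, frequency, and locations below are placeholders), a check definition can sit in the same repository, and the same pull request, as the feature it exercises:

```ts
// __checks__/login.check.ts
import { BrowserCheck, Frequency } from 'checkly/constructs';

new BrowserCheck('login-flow', {
  name: 'Login flow',
  frequency: Frequency.EVERY_10M,           // how often the monitor runs in production
  locations: ['us-east-1', 'eu-central-1'], // run from multiple regions
  code: {
    // Reuse the same Playwright spec the E2E suite already maintains.
    entrypoint: './login.spec.ts',
  },
});
```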

Remember that both these sets of tests are probably written in Playwright, so when one of them fails in pre-production, it should be easy for developers to read the spec and see what's being tested.

Wrapping Up: Separate Roles with a Shared Purpose

Testing and Operations will always be separate jobs, and separate responsibilities in the deployment and production process. The integration of testing and monitoring, particularly through synthetic monitoring in pre-production stages, can significantly enhance the quality and reliability of software as it moves through deployment. As developers and operations engineers, embracing this synergy can lead to a more robust and efficient software lifecycle, ensuring that our applications are not only well-tested but also consistently monitored for optimal performance.
