1. Observability Tools Are Falling Short
Often the longest phase of an incident response is the gap between knowing you have a problem and understanding what's causing it. This is almost definitional: once you know the nature of the problem, it's generally possible to fix or at least remediate it quickly, for example with a rollback. So what makes the path to understanding so long? One of the primary challenges is the inadequacy of observability tools. When speaking to enterprise teams, about half of engineers report that their observability solutions require significant manual effort and that alerts lack sufficient context to triage issues effectively. Imagine you're trying to solve an outage that affects only some users and some features. You may start with a nice front-end performance monitoring dashboard, but once it's time to drill into root causes, you find yourself paging through AWS CloudWatch logs, trying to match the timestamps of recent failures to see if any errors were logged. When your goal for mean time to repair is measured in minutes, ending up in CloudWatch is setting yourself up for failure. When tools are cumbersome and fail to provide actionable insights, engineers spend more time diagnosing problems, which increases MTTR.

2. Time to Detect (TTD) Is Overlooked
Another factor contributing to high MTTR is the time it takes to detect an issue in the first place. Reducing mean time to detect (MTTD) is crucial for improving MTTR, as delays in detection lead directly to prolonged outages. An entire guide on time to detection is available on our Learn site, but in short the most common failures are:

- Reliance on user reporting — if a large class of failures will only ever be reported by users, you're automatically adding at least 15 minutes to your time to detection. As a rule of thumb, out of your last 10 incidents, no more than 1 should have been user-reported.
- Low-frequency monitoring — if you're using an automated system to check your service's availability, it's critical to set the right frequency for those checks. Checks that run only every 15 minutes won't give you much better time to detection than waiting for users to report problems.
- Slow alerting — while most teams that monitor think carefully about the optimal frequency for their checks, far less thought goes into the time between detection and an alert actually reaching the on-call team. On more than one occasion this author has seen teams drastically improve their MTTR just by adopting faster standards for how the on-call team gets notified, for example by adding alert channels beyond email (see the sketch after this list).
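To make the alerting point concrete, here is a minimal TypeScript sketch of fanning a single alert out to several channels in parallel instead of relying on email alone. The webhook URLs and the sendEmail helper are hypothetical placeholders, not references to any specific alerting product.

```typescript
// Minimal alert fan-out sketch: notify every channel in parallel so the
// on-call engineer hears about a failure as soon as it is detected.
// The webhook URLs and sendEmail() below are hypothetical placeholders.

interface Alert {
  service: string;
  message: string;
  detectedAt: Date;
}

// Placeholder for an email integration; email alone is often too slow.
async function sendEmail(alert: Alert): Promise<void> {
  console.log(`[email] ${alert.service}: ${alert.message}`);
}

async function notifyOnCall(alert: Alert): Promise<void> {
  const payload = JSON.stringify({
    text: `[${alert.service}] ${alert.message} (detected ${alert.detectedAt.toISOString()})`,
  });

  // Fire all channels concurrently; one slow or failing channel should not
  // delay the others.
  const results = await Promise.allSettled([
    fetch('https://hooks.example.com/chat-channel', { method: 'POST', body: payload }),    // chat webhook
    fetch('https://events.example.com/paging-service', { method: 'POST', body: payload }), // paging webhook
    sendEmail(alert),                                                                      // email as a backstop
  ]);

  results
    .filter((r) => r.status === 'rejected')
    .forEach((r) => console.error('alert channel failed:', (r as PromiseRejectedResult).reason));
}
```

The design choice that matters here is the parallel fan-out: because no channel waits on another, the fastest channel sets your notification latency rather than the slowest one.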
3. The Complexity of Cloud-Native Environments
Many organizations are adopting cloud-native architectures, such as containers and microservices, to improve efficiency and scalability. However, this shift has introduced greater complexity, higher data volumes, and increased pressure on engineering teams, and it has made incidents harder to discover and troubleshoot. That complexity can significantly delay the time it takes to detect and resolve issues. Why is this problem particularly acute with cloud-native architectures? The relative ease of deploying new servers, new stacks, even whole new clusters means that the complexity of operational architectures has grown enormously. After the fateful instruction of 'cattle, not pets,' we have found ourselves swamped with an infinitude of small, disposable containers. That's great until the herd is sick and we cannot diagnose the cause. If you're unsure whether this is true in your own organization, ask your team this question:

- If I point to any particular microservice and select a common user activity, can you tell me with certainty whether taking down that microservice will affect that user activity?
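If nobody on the team can answer that with confidence, a useful first step is simply writing the dependency knowledge down so the question becomes a lookup instead of a guess. Here is a minimal TypeScript sketch; the service and activity names are entirely hypothetical, and in practice this kind of map is better derived from tracing data than maintained by hand.

```typescript
// A hand-maintained (or trace-derived) map from user-facing activities to the
// microservices they depend on. All names here are hypothetical examples.
const activityDependencies: Record<string, string[]> = {
  'checkout':       ['cart-service', 'payment-service', 'inventory-service'],
  'product-search': ['search-service', 'catalog-service'],
  'account-login':  ['auth-service', 'session-service'],
};

// "If I take down this microservice, which user activities break?"
function activitiesAffectedBy(service: string): string[] {
  return Object.entries(activityDependencies)
    .filter(([, deps]) => deps.includes(service))
    .map(([activity]) => activity);
}

console.log(activitiesAffectedBy('payment-service')); // -> ['checkout']
```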
How Can Companies Improve MTTR?
To address these challenges, organizations need to focus on optimizing each stage of the incident response process:

1. Reduce Time to Detect (TTD)
Improving detection times requires robust monitoring and alerting systems that can quickly identify issues. This involves setting low scrape intervals for data collection and ensuring that observability tools can ingest data and generate alerts rapidly. Faster detection allows teams to begin remediation efforts sooner, which can significantly reduce MTTR. This is covered in more detail in our page on optimizing detection time.
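As a rough illustration of what a low check interval buys you, here is a minimal TypeScript polling sketch against a hypothetical health endpoint (the URL and the notifyOnCall handler are placeholders): a 30-second interval bounds detection lag to roughly the interval itself, while a 15-minute interval can leave you blind for up to 15 minutes before the first signal arrives.

```typescript
// Minimal health-check poller. A 30-second interval bounds detection lag to
// roughly the interval (plus alerting time); a 15-minute interval can add up
// to 15 minutes of blindness before the first signal.
// The endpoint URL and notifyOnCall() are hypothetical placeholders.

const HEALTH_URL = 'https://api.example.com/health';
const INTERVAL_MS = 30_000;        // low check interval -> low time to detect
const FAILURES_BEFORE_ALERT = 2;   // small buffer against one-off blips

let consecutiveFailures = 0;

async function notifyOnCall(message: string): Promise<void> {
  // Hand off to an alert fan-out, a paging tool, etc.
  console.error(`ALERT: ${message}`);
}

async function runCheck(): Promise<void> {
  try {
    const res = await fetch(HEALTH_URL, { signal: AbortSignal.timeout(5_000) });
    if (!res.ok) throw new Error(`status ${res.status}`);
    consecutiveFailures = 0;
  } catch (err) {
    consecutiveFailures += 1;
    if (consecutiveFailures >= FAILURES_BEFORE_ALERT) {
      await notifyOnCall(`${HEALTH_URL} failing (${consecutiveFailures} checks in a row): ${err}`);
    }
  }
}

setInterval(runCheck, INTERVAL_MS);
```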
2. Streamline Time to Remediate

Once an issue is detected, the next step is to stop the customer pain by implementing a temporary fix. This can be achieved by:

- Providing engineers with actionable data and context in alerts. For example, linking alerts to dashboards that show affected services or customers can help engineers prioritize and resolve issues more efficiently.
- Linking alerts with deeper root-cause information. While it's critical to show 'who is affected and how,' it's equally important to give some indication of what's happening inside our services, so engineers aren't stuck sorting through backend logs trying to match timestamps. For example, Checkly Traces links backend traces with synthetic monitoring data to show the application internals related to any user-facing issue.
- Easy rollbacks. Whether it's redeploying rolled-back code straight from source control with (for example) GitHub Actions, or using feature flags to disable a new feature that's causing trouble, it's critical that on-call teams are empowered to get the service back to a working state by rolling back the clock (a minimal feature-flag sketch follows below).
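To sketch the feature-flag half of that advice: the in-memory flag store below stands in for whatever flag system you actually use, and the feature name is hypothetical. The point is that disabling a misbehaving feature becomes a flag flip rather than a redeploy.

```typescript
// Feature-flag kill-switch sketch. The in-memory store stands in for a real
// flag service; disabling a misbehaving feature is a flag flip, not a deploy.
const flags = new Map<string, boolean>([
  ['new-checkout-flow', true], // hypothetical feature name
]);

function isEnabled(flag: string): boolean {
  return flags.get(flag) ?? false; // default to the old, known-good path
}

function handleCheckout(): void {
  if (isEnabled('new-checkout-flow')) {
    newCheckoutFlow();    // new code path that might be causing the incident
  } else {
    legacyCheckoutFlow(); // battle-tested fallback
  }
}

// During an incident, on-call flips the flag (via dashboard, API, or CLI in a
// real flag system) and traffic immediately falls back to the legacy path:
flags.set('new-checkout-flow', false);

function newCheckoutFlow(): void { /* ... */ }
function legacyCheckoutFlow(): void { /* ... */ }
```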