
Parallel Scheduling Is Now GA: Detect Regional Outages Up to 20x Faster


I am happy to announce that Checkly now supports parallel scheduling as a new way to schedule your checks.

Parallel scheduling allows you to reduce mean time to detection, provide better insights when addressing outages, and give improved accuracy in performance trends, making it a powerful new feature for all Checkly users.

Vercel's CEO Guillermo Rauch tweets about Checkly's new parallel scheduling feature

Source: Guillermo Rauch's X account

Parallel scheduling vs. round-robin

If you provide a service to users worldwide, you typically want to monitor it from several locations to ensure global availability. For most of Checkly’s history, we have run checks in what is known as a ‘round-robin’ pattern. In this case, a single monitor sequentially checks your application from various geographic locations, one after the other.

While this method provides a broad view of performance, it has limitations regarding time efficiency and real-time problem detection.

Ensuring minimum availability for global services with round-robin

In our example, we’ll consider a service used by a global audience, such as Vercel or Checkly. To ensure minimum availability, we create a check that monitors this service from locations around the world, matching coverage to the user base. In this example we pick the following locations:

  • Frankfurt
  • London
  • North Virginia
  • Ohio
  • Tokyo
  • Sydney

This check is scheduled to run once every minute.

With the round-robin scheduling method, the check executes from Frankfurt, then rotates through the five other locations before returning to Frankfurt again. This leaves a six-minute window during which the service might be unavailable from that location without triggering any alerts.

The above highlights the problem with round-robin scheduling: while it covers every location a service should be available from, the execution pattern leaves gaps during which a regional outage can occur without triggering an alert. The more locations you want to monitor, the greater the risk of a delayed alert.
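The arithmetic behind this gap can be sketched in a few lines. This is a back-of-the-envelope illustration, not Checkly code; the function name and parameters are hypothetical:

```typescript
// Worst-case time (in minutes) before an outage in one region triggers an
// alert, given the check interval and the number of monitored locations.
function worstCaseDetectionMinutes(
  checkFrequencyMinutes: number,
  locationCount: number,
  strategy: "round-robin" | "parallel"
): number {
  // Round-robin probes one location per run, so any given region is only
  // checked every (frequency * locations) minutes. Parallel scheduling
  // probes every region on every run.
  return strategy === "parallel"
    ? checkFrequencyMinutes
    : checkFrequencyMinutes * locationCount;
}

// Six locations, checked every minute:
console.log(worstCaseDetectionMinutes(1, 6, "round-robin")); // 6
console.log(worstCaseDetectionMinutes(1, 6, "parallel"));    // 1
```

As the formula shows, the round-robin detection window grows linearly with the number of locations, while the parallel window stays fixed at the check frequency.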

Parallel scheduling to close the gap

While round-robin scheduling eventually covers every region, it can leave regional outages undetected for minutes at a time. The solution is simple: parallel scheduling. Compared to round-robin, it provides four main benefits:

  • Reduced time to detect issues.
  • Immediate insight into the scope of an outage.
  • Increased chance of catching regional outages.
  • Improved data granularity.

|  | Parallel scheduling | Round-robin |
| --- | --- | --- |
| Mean time to detect regional issues | No longer than the check frequency | Check frequency × number of locations monitored |
| Check run data provided | You immediately know how many locations failed and where | You only know that at least one location failed |
| Likelihood of detecting intermittent regional outages | High, since each location is checked on every run | Medium to low, depending on the number of locations the check rotates through |
| Data granularity | High: data from each location on every run, with regional differences in performance clearly visible | Medium: overall performance degradation is visible over time, but local problems are harder to identify |

With parallel scheduling, a check runs from every configured location on each scheduled run. In our earlier example, the check would run from all six locations every minute and alert as soon as any location reports a failure. This reduces the time to detect a regional issue by up to a factor of six, since we are using six locations. For critical services and user paths, this significantly improves customer happiness and helps ensure SLOs are not broken.

The more locations you monitor, the more you can reduce your MTTD, potentially by a factor of 20.

Additionally, running a check in parallel gives you an immediate understanding of the scope of the problem - is it a global outage, or is it limited to one or two regions? This information can help inform the urgency of the problem and give you some idea of where the problem might lie.

Parallel check runs also reduce the risk of missing a short regional outage. In our earlier example, if the service were unavailable from eu-central-1 for four minutes while the check was running from other locations, the outage would never be registered. With parallel scheduling, catching these shorter outages becomes much more likely.

Finally, a check running in parallel from multiple locations will give you an accurate performance measure from all selected locations each time the check runs, giving you clear signals if a specific location has performance problems.

Checkly's UI showing the check performance from different locations

Using parallel scheduling in Checkly

Parallel scheduling is now available as a scheduling option for API, Browser, and Multistep checks. To select a scheduling strategy, edit your check and go to ‘Scheduling and Locations’:

Checkly's UI showing the new parallel scheduling option

When using our CLI or the Terraform provider, the scheduling strategy is set in the check construct.

With the CLI, note the `runParallel: true` property:

```typescript
import { ApiCheck, Frequency } from 'checkly/constructs'

new ApiCheck('list-all-checks', {
  name: 'List all checks',
  activated: false,
  muted: false,
  runParallel: true,
  locations: ['eu-north-1', 'eu-central-1', 'us-west-1', 'ap-northeast-1'],
  frequency: Frequency.EVERY_10S,
  maxResponseTime: 20000,
  degradedResponseTime: 5000,
  request: {
    url: 'https://developers.checklyhq.com/reference/getv1checks',
    method: 'GET',
  },
})
```
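The same flag applies to Browser checks. As a sketch, assuming a Playwright spec file at `./home.spec.ts` in your CLI project, a parallel Browser check could look like this:

```typescript
import { BrowserCheck, Frequency } from 'checkly/constructs'

// A Browser check that runs the Playwright spec from every listed
// location on each scheduled run, thanks to runParallel: true.
new BrowserCheck('home-page', {
  name: 'Home page',
  frequency: Frequency.EVERY_1M,
  runParallel: true,
  locations: ['eu-central-1', 'us-east-1', 'ap-southeast-2'],
  code: { entrypoint: './home.spec.ts' },
})
```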

With Terraform, we set `run_parallel = true`:

```terraform
resource "checkly_check" "list-all-checks" {
  name                      = "List all checks"
  type                      = "API"
  frequency                 = 0
  frequency_offset          = 10
  activated                 = false
  muted                     = false
  run_parallel              = true
  locations                 = ["eu-north-1", "eu-central-1", "us-west-1", "ap-northeast-1"]
  degraded_response_time    = 5000
  max_response_time         = 20000
  request {
    method                    = "GET"
    url                       = "https://developers.checklyhq.com/reference/getv1checks"
  }
}
```

For more information on how to use our CLI and Terraform provider, check out our docs.

Parallel scheduling and cost optimization

Finally, I want to discuss how parallel scheduling affects your costs when using Checkly. Each check run from each location counts against your total usage, so heavy use of parallel scheduling can push your usage above your budget.

In order to help you understand how the different check settings affect your final cost, we have added a helper in the check editor, giving you a clear indicator of how the number of locations and the check scheduling method changes the monthly cost.

Checkly's cost helper that shows the price based on the scheduling strategy and frequency

Do note that this helper only provides an approximation; the final cost can be higher if the check performs many retries.
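To make the cost difference concrete, here is a rough estimate of monthly check runs under each strategy. This is a hypothetical sketch (it ignores retries and assumes a 30-day month), not Checkly's actual billing logic:

```typescript
// Approximate check runs per month for a given interval and location set.
function monthlyCheckRuns(
  frequencyMinutes: number,
  locationCount: number,
  parallel: boolean
): number {
  const runsPerMonth = (30 * 24 * 60) / frequencyMinutes; // 30-day month
  // Parallel executes one run per location per interval; round-robin
  // executes a single run per interval, rotating through locations.
  return parallel ? runsPerMonth * locationCount : runsPerMonth;
}

// Our six-location check at a one-minute frequency:
console.log(monthlyCheckRuns(1, 6, false)); // 43200 runs (round-robin)
console.log(monthlyCheckRuns(1, 6, true));  // 259200 runs (parallel)
```

In other words, switching the example check to parallel multiplies usage by the number of locations, which is why the strategy is best reserved for the checks that need it.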

I recommend using parallel scheduling for your critical services and flows, where downtime costs money and hurts user happiness, and round-robin scheduling for less critical monitoring, where some detection delay is acceptable.

If you want more details on how check costs are calculated, we have a detailed breakdown in our documentation.

Parallel scheduling is available now on all Hobby, Team, and Enterprise plans.
