Team Culture in Incident Response Playbooks
The term ‘playbook’ sometimes implies a sort of mechanical process that tells engineers exactly what to do, a bit like a robot following lines of code. But of course the term is taken from sport, where a playbook lists individual plays that allow for human participants and implicit randomness. Team culture is a clear part of incident response, and you need to consider how real people will interpret plans when faced with real failures and stressful situations. Two early questions are how much standardization of playbooks should happen between teams, and how to handle cases where people don’t follow the process.
How much should teams standardize their incident response?
A basic question when you start writing playbooks for incident response is how standard these playbooks need to be. Should teams write their own playbooks, or should ops, SRE, or platform teams write the playbook that all other teams need to follow? Standardization across the whole organization can’t be total. Let’s say a playbook has a decision point about the scope of an issue. A backend team working in database interfaces and a frontend team will have totally different answers to what ‘scope’ means and how to measure it. So we want standards for how incidents are handled, but not for how they’re identified, investigated, or remediated. One senior ops engineer talks about the balance between team standardization and organizational process:
“Teams are responsible for their own stuff at the end of the day. They will have their own runbooks as their systems will differ from other teams. The incident response process is standard across all teams, the software teams create and maintain will differ though, so this makes sense as a boundary of responsibility for this organisation. Generally teams manage their own alerts and dashboards. If they continually have problems as a result of poor visibility, it’ll be very clear to the leadership due to the number of incident tickets assigned to them. They’ll get the support they need to fix the underlying problem, be that fixing the system or improving the observability.”
That general truism should apply to your organization: if a team’s individual process for incidents isn’t working, it’ll show up in the number of incidents assigned to the team.