The operational discipline that turns service degradation into a resolved problem — recognising that detection, severity routing, response coordination, communication, and learning are distinct activities that succeed or fail independently, and that the architecture supporting each is what determines whether incidents resolve in minutes or hours.
A primitive incident response process looks like this: a user complains, an engineer notices, the engineer pages a colleague, they investigate, eventually find the problem, fix it, and move on — possibly writing a brief note about what happened. The activity is reactive, dependent on individual heroics, and produces no institutional learning. Each incident teaches the responders, but the institution is no smarter for it; the next similar incident plays out the same way.
A mature incident response architecture treats the same situation as a structured operational discipline with distinct phases that can be designed, measured, and improved independently. Detection is a system property — automated alerts firing on the right signals at the right thresholds, with a documented mean time to detect (MTTD) that's tracked over time. Severity classification is a routing primitive — a sev1 (full outage, customer-impacting) gets one response (page the on-call, declare an incident, appoint a commander); a sev3 (degradation in a non-critical path) gets a different one (ticket queued, addressed in business hours). Response coordination uses defined roles — incident commander, communications lead, scribe — drawn from frameworks like the Incident Command System used in emergency response. Communication runs on documented channels — internal status, customer-facing status page, executive briefings — each with templates and cadences. Post-incident review is blameless and structured, producing actions tracked through completion. The discipline is not to prevent incidents (impossible) but to resolve them faster, learn more from each one, and prevent the next class of similar incidents.
The architectural shift is not "we wrote some runbooks." It is: incident response is a multi-phase operational discipline whose architectural support — detection signals, severity routing, role structures, communication channels, learning loops — determines whether the organisation gets faster and smarter over time, or accumulates incidents the way it accumulates technical debt.
Six principles
1. Detection and response are separate disciplines — designing them as one obscures both failures
A common pattern conflates detection (the system noticing something's wrong) with response (the team doing something about it). The conflation produces architectures where alert thresholds are set to what humans can respond to — thresholds loosened to avoid alert fatigue, so real problems go undetected until they're severe. Separating the two lets each be optimised for its own goal: detection for catching real problems early (tighter thresholds, more signals); response for fast, calm, effective resolution (automation where possible, clear roles where automation can't reach, and consolidated alerts that group related signals into single incidents). The architectural discipline is to design detection for sensitivity and response for effectiveness, with a routing layer between them that escalates only what response actually needs.
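A minimal Python sketch of that routing layer makes the seam concrete. The grouping window, the escalation rule, and the page_on_call hook are all illustrative assumptions, not any particular tool's API: detection emits every alert it fires; the router deduplicates and correlates; response is engaged only when escalation criteria are met.

from dataclasses import dataclass, field
from datetime import datetime, timedelta

@dataclass
class Alert:
    signal: str            # e.g. "checkout-latency-p99-breach"
    service: str
    fired_at: datetime

@dataclass
class Incident:
    service: str
    alerts: list[Alert] = field(default_factory=list)
    declared: bool = False

def page_on_call(incident: Incident) -> None:
    """Hypothetical hook into the paging system."""
    incident.declared = True

class AlertRouter:
    """Sits between detection and response: deduplicate, correlate, escalate."""
    GROUP_WINDOW = timedelta(minutes=5)   # assumed correlation window
    ESCALATE_AT = 3                       # assumed rule: 3 correlated signals declare an incident

    def __init__(self) -> None:
        self.open: dict[str, Incident] = {}        # service -> open incident
        self.firing: set[tuple[str, str]] = set()  # (signal, service) pairs currently firing

    def ingest(self, alert: Alert) -> Incident | None:
        key = (alert.signal, alert.service)
        if key in self.firing:
            return None                   # duplicate: detection stays sensitive, nobody is paged
        self.firing.add(key)
        incident = self.open.get(alert.service)
        if incident and alert.fired_at - incident.alerts[-1].fired_at <= self.GROUP_WINDOW:
            incident.alerts.append(alert) # correlated: fold into the open incident
        else:
            incident = self.open[alert.service] = Incident(alert.service, [alert])
        if not incident.declared and len(incident.alerts) >= self.ESCALATE_AT:
            page_on_call(incident)        # "alert fired" becomes "incident declared" only here
        return incident
        # (alert clearing and incident resolution are omitted from the sketch)

The seam is the point: thresholds upstream of ingest can be as tight as detection needs, because volume is absorbed here rather than by desensitising the signals.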
Architectural implications
Detection signals are tuned for catching real problems, not for tolerable alert volume — alert fatigue is solved at the routing layer (consolidation, intelligent grouping, severity routing) rather than by making detection less sensitive.
A clear distinction exists between "alert fired" (detection event) and "incident declared" (response triggered) — not every alert escalates to an incident; an alerting layer with deduplication and correlation sits between detection and response.
Response architecture is optimised for fast, calm resolution: documented roles, escalation paths, automation for routine recovery actions, runbooks for non-routine ones, and clear declaration of when an incident starts and ends.
Mean time to detect (MTTD) and mean time to resolve (MTTR) are tracked separately and trended over time; improving each requires different actions.
Quick test
Pick a recent incident in your organisation. How many minutes elapsed between the underlying problem starting and detection firing (MTTD)? Between detection and the responders engaging (response latency)? Between engagement and resolution (MTTR)? If those are bundled into a single number ("the incident lasted 47 minutes"), the discipline is unable to improve any phase independently — and the bottleneck phase, whichever it is, stays the bottleneck.
Reference
Google SRE Book — Chapter 12: Effective Troubleshooting treats detection and response as architecturally separable concerns; the framework distinguishes monitoring (the detection system) from incident response (the operational discipline) and treats their interfaces (alerting, severity routing) as designed properties rather than emergent ones.
2. Severity classification is a routing primitive — different severities deserve categorically different responses
A small organisation can treat all incidents the same: page everyone, all-hands, fix it. A large organisation cannot — the cost of all-hands response to every minor degradation is unsustainable, and the cost of soft response to a critical outage is worse. Severity classification is the routing primitive that solves this: a documented taxonomy (sev1 through sev5, or critical/high/medium/low) with clear criteria, applied at incident declaration, that determines which response pattern triggers. Sev1 (e.g. full outage, payment system down, data loss in progress) gets immediate page, incident commander appointed, customer-facing status published, executive notified. Sev2 (significant degradation, partial outage, customer-impacting performance) gets paged response within an SLA, internal coordination, customer comms if user-visible. Sev3 (degradation in redundant or non-critical path) gets ticketed for business-hours work. The architectural discipline is to make these criteria explicit, train people to apply them, and trust the classification rather than treating every incident as severity-0 or severity-N depending on who's awake.
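Sketched as code, severity is literally a lookup. A hedged Python sketch: the pattern fields and cadences are illustrative, drawn from the examples above rather than from any standard taxonomy.

from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3

@dataclass
class Incident:
    title: str
    severity: Severity | None = None

@dataclass(frozen=True)
class ResponsePattern:
    page_immediately: bool
    appoint_commander: bool
    customer_status_page: bool   # for sev2 this is conditional on user-visibility; simplified here
    notify_executive: bool
    update_cadence: str

ROUTING: dict[Severity, ResponsePattern] = {
    Severity.SEV1: ResponsePattern(True,  True,  True,  True,  "customer updates every 30 min"),
    Severity.SEV2: ResponsePattern(True,  False, False, False, "internal updates hourly"),
    Severity.SEV3: ResponsePattern(False, False, False, False, "business-hours ticket"),
}

def route(incident: Incident, severity: Severity) -> ResponsePattern:
    """Classification produces routing; reclassifying mid-incident re-routes."""
    incident.severity = severity
    return ROUTING[severity]

incident = Incident("elevated checkout errors")
pattern = route(incident, Severity.SEV2)   # initial classification at declaration
pattern = route(incident, Severity.SEV1)   # scope clarified (database corruption): escalate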
Architectural implications
Severity criteria are documented with concrete examples — not just "high impact" but "user-facing service unavailable, OR data loss, OR loss of regulated logging."
The classification produces routing: each severity has a documented response pattern (who pages, what cadence of updates, who is communications lead, what status surfaces are activated).
Severity can be revised mid-incident as scope becomes clear — a sev2 that turns out to be a database corruption escalates; a sev1 that turns out to be a misclassified deploy de-escalates — and the routing adjusts.
Severity is tracked as a metric: counts per severity per period, MTTD/MTTR per severity, escalations from one severity to another, providing operational insight into the system's incident profile.
Quick test
Pick the most recent incident. Was its severity classified at declaration? Was the response pattern documented for that severity? Was the severity revisited as the incident developed? If severity classification was implicit ("it felt important"), the response was running on individual judgment rather than designed routing — and the cost is paid in mismatched response across different incidents and different responders.
Reference
PagerDuty's Incident Response Documentation provides an industry-canonical severity taxonomy (sev1–sev5) with concrete criteria and response patterns, freely available for adaptation. Atlassian's Incident Management Handbook covers similar ground with a different severity vocabulary, useful for organisations that prefer the high/medium/low framing.
3. Coordination roles — commander, scribe, communications lead — make the response work as a system
In a major incident, multiple engineers are simultaneously investigating, multiple stakeholders are asking for updates, and the team is operating under time pressure. Without defined roles, the response becomes chaos: investigators interrupt each other with questions, status updates get duplicated, customer comms either don't happen or happen contradictorily across surfaces, and the team's effective bandwidth is much lower than the sum of its individuals. The Incident Command System (ICS), originally developed for emergency response and adopted by software organisations, defines roles that make coordinated response possible: Incident Commander (decides priorities, makes calls when there's disagreement, owns the response — but does not investigate); Operations Lead (drives the technical investigation and remediation work); Communications Lead (handles all internal and external updates, freeing investigators); Scribe (maintains the incident timeline, records decisions and actions). For smaller incidents, one person plays multiple roles; for larger ones, the roles separate. The architectural discipline is to have the role structure defined and trained on, so that when a sev1 hits at 3 AM, people fall into their roles automatically rather than colliding.
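A sketch of role assignment as part of the declaration ritual, in Python. The role names follow the ICS adaptations described above; the roster structure and timeline format are assumptions for illustration.

from dataclasses import dataclass, field
from datetime import datetime, timezone

ROLES = ("incident_commander", "operations_lead", "communications_lead", "scribe")

@dataclass
class IncidentRoster:
    assignments: dict[str, str] = field(default_factory=dict)   # role -> person
    timeline: list[str] = field(default_factory=list)           # the Scribe's record of role changes

    def assign(self, role: str, person: str) -> None:
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        self.assignments[role] = person
        self.timeline.append(f"{datetime.now(timezone.utc).isoformat()} {role} -> {person}")

    def hand_off(self, role: str, fresh_person: str) -> None:
        """The role outlives the person: a tired Commander hands off and the role continues."""
        self.assign(role, fresh_person)

# Small incident: one person deliberately wears every hat.
roster = IncidentRoster()
for role in ROLES:
    roster.assign(role, "alice")

# Larger incident: the roles separate as responders arrive.
roster.assign("operations_lead", "bob")
roster.assign("communications_lead", "carol")
roster.hand_off("incident_commander", "dana")   # hour six: fresh Commander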
Architectural implications
Roles are documented with explicit responsibilities and explicit non-responsibilities (the Incident Commander does not investigate; the investigators do not communicate to customers).
Role assignment is part of the incident declaration ritual — within minutes of declaration, the roles are filled, and everyone knows who's playing each.
Tooling supports the structure: the chat channel for the incident has the roles displayed, the status page tooling routes through the Communications Lead, the timeline tool routes through the Scribe.
People are trained on the roles, including how to step into them under pressure and how to hand them off as responders fatigue (a Commander on a 6-hour incident hands off to a fresh Commander; the role doesn't end when the original person tires).
Quick test
Pick a recent multi-hour incident in your organisation. Who was the Incident Commander, who was Communications Lead, who was Scribe? If those questions don't have clean answers, the response was running without role structure — and the cost was paid in duplicated work, mixed messages, and an incident timeline that has to be reconstructed afterwards from chat scrollback.
Reference
FEMA's Incident Command System (ICS) documentation is the canonical original reference for the role structure, developed for emergency-services coordination; the structure has been adapted to software incident response by PagerDuty, Google SRE, and others, with the role names and responsibilities consistent across adaptations.
4. Runbooks live on a spectrum from prose to automation — choose the placement deliberately per recovery action
A runbook — documented steps to take in response to a specific signal or scenario — exists somewhere on a spectrum from pure prose ("if you see this alert, check X, then Y, then Z") to full automation ("when this alert fires, execute this script that performs the recovery"). The right placement depends on the recovery action's properties. Actions that are frequently performed, deterministic, and low-risk belong on the automation end — the system performs them without human intervention, surfacing only the result. Actions that are infrequent, judgment-requiring, or high-risk (irreversible changes, financial actions, security-sensitive operations) belong on the prose end — humans execute with guidance, but the decision and the responsibility remain human. The architectural mistake is uniform treatment: automating everything (including high-stakes actions that should require human judgment) or scripting nothing (forcing humans through routine recovery work that machines could handle). Each runbook gets a deliberate placement.
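The placement decision can be made mechanical. One possible classification rule, sketched in Python, with the frequency threshold and the middle "scripted-with-approval" tier as stated assumptions:

def runbook_placement(times_run_per_quarter: int,
                      deterministic: bool,
                      high_risk: bool) -> str:
    """Place a recovery action on the prose-to-automation spectrum.

    high_risk covers irreversible changes, financial actions, and
    security-sensitive operations; the frequency threshold is an assumption.
    """
    if high_risk or not deterministic:
        return "prose"                  # human judgment and responsibility stay human
    if times_run_per_quarter >= 4:
        return "automated"              # frequent + deterministic + low-risk
    return "scripted-with-approval"     # middle of the spectrum: a human triggers a script

assert runbook_placement(12, deterministic=True,  high_risk=False) == "automated"
assert runbook_placement(1,  deterministic=False, high_risk=False) == "prose"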
Architectural implications
Each runbook (or each action within a runbook) is classified on the prose-to-automation spectrum, with the placement justified.
Frequently-performed deterministic actions (restart this service, drain this node, scale this group) are automated, with the automation tested in non-production and exercised in game days.
Infrequent or judgment-requiring actions are written as prose runbooks with sufficient context for a fresh responder to execute correctly under pressure — the runbook teaches as it directs.
The runbook collection is a living document: runbooks that are consistently bypassed because they're wrong get fixed; actions that have since been automated have their prose runbooks updated to point at the automated path.
Quick test
Pick a routine recovery action in your organisation — restarting a service after a known failure mode, draining a misbehaving node, rotating a credential. Is it automated, runbook-prose, or "everyone just knows"? If it's "everyone just knows," the action depends on tribal knowledge, and the next responder who doesn't have it will execute differently. If it's runbook-prose for an action that's frequent and deterministic, human time is being spent where automation could free it.
5. Post-incident review is blameless or it doesn't work — and the alternative is repeated incidents
A post-incident review (post-mortem) is the discipline by which the institution learns from an incident: what happened, why, what would have caught it earlier, what changes should be made. The review's value depends entirely on its honesty. Blameful reviews — focused on identifying the individual who made the mistake — produce defensive responses, shallow findings, and a culture where engineers hide problems rather than surface them. Blameless reviews — focused on the system that allowed the incident to happen — produce honest accounts, deeper findings, and a culture where engineers proactively raise risks because they know doing so won't produce blame. The architectural construct here is the facilitation framework: the review is run by someone with explicit authority to keep it blameless, the structure separates "what happened" (timeline, factual) from "why it happened" (system analysis) from "what we'll change" (actions), and the actions are tracked through completion. Without the facilitation, "blameless" is a label the team applies to reviews that aren't actually blameless.
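A sketch of that three-part structure as a record, so action items are first-class tracked objects rather than prose at the bottom of a document. The field names are assumptions, not any tool's schema.

from dataclasses import dataclass, field
from datetime import date, datetime

@dataclass
class ActionItem:
    description: str
    owner: str
    target_date: date
    shipped: bool = False

@dataclass
class PostMortem:
    incident_id: str
    timeline: list[tuple[datetime, str]] = field(default_factory=list)   # what happened (factual)
    contributing_factors: list[str] = field(default_factory=list)        # why it happened (system-level)
    actions: list[ActionItem] = field(default_factory=list)              # what we'll change

def overdue_actions(reviews: list[PostMortem], today: date) -> list[ActionItem]:
    """The periodic review of action items that haven't shipped."""
    return [a for pm in reviews for a in pm.actions
            if not a.shipped and a.target_date < today]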
Architectural implications
Post-incident reviews are scheduled within a documented window (typically 1–2 weeks) and run with explicit blameless framing — not "who made the mistake" but "what conditions made this kind of mistake possible."
The review's structure separates timeline (factual reconstruction), analysis (contributing factors at the system level), and action items (concrete changes with owners and target dates).
Action items are tracked through completion in the same systems used for engineering work, with periodic review of overdue items — a review that produces actions that don't ship is theatre.
The reviews are made available across the organisation (with appropriate redactions for sensitive content) — institutional learning depends on more than the people in the room hearing the lesson.
Quick test
Pick a major incident from 6+ months ago in your organisation. Was a post-mortem written? Were action items produced? Have they shipped? If the answers are "yes, yes, partially," the discipline is working. If the answers are "yes, yes, we don't track that," the post-mortem is producing artefacts that aren't producing change — and similar incidents will continue.
6. Time-to-detect, time-to-resolve, and incident frequency are engineering metrics — measured, trended, acted on
The incident-response system's own performance is measurable, and the metrics tell a different story than incident-narrative summaries do. Mean Time To Detect (MTTD) — from problem starting to detection firing — surfaces the detection layer's effectiveness independent of response. Mean Time To Resolve (MTTR) — from detection to resolution — surfaces the response layer's effectiveness. Incident frequency by severity — how often sev1, sev2, etc. fire per period — surfaces the system's overall reliability profile. Incident-class repetition — how often a similar root cause produces a new incident — surfaces whether learning is actually preventing recurrences. These metrics are tracked, trended, and acted on the same way other engineering metrics are: a deteriorating MTTD signals that detection coverage isn't keeping up with the system's evolution; a deteriorating MTTR signals that response architecture (runbooks, role training, tooling) needs investment; rising incident frequency in a category signals that engineering remediation isn't shipping the changes the post-mortems called for.
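A sketch of the arithmetic over incident records, in Python; the record shape is an assumption, the computations are the point.

from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class IncidentRecord:
    severity: str                 # "sev1", "sev2", ...
    started_at: datetime          # underlying problem began
    detected_at: datetime         # detection fired
    resolved_at: datetime
    root_cause_class: str         # e.g. "config-deploy", "db-failover"

def _mean(deltas: list[timedelta]) -> timedelta | None:
    return sum(deltas, timedelta()) / len(deltas) if deltas else None

def mttd(incidents: list[IncidentRecord], severity: str) -> timedelta | None:
    """Detection-layer effectiveness, independent of response."""
    return _mean([i.detected_at - i.started_at
                  for i in incidents if i.severity == severity])

def mttr(incidents: list[IncidentRecord], severity: str) -> timedelta | None:
    """Response-layer effectiveness, measured from detection."""
    return _mean([i.resolved_at - i.detected_at
                  for i in incidents if i.severity == severity])

def repeat_class_rate(incidents: list[IncidentRecord]) -> float:
    """Fraction of incidents whose root-cause class was seen before:
    a direct read on whether post-mortem actions are actually shipping."""
    seen: set[str] = set()
    repeats = 0
    for i in sorted(incidents, key=lambda r: r.started_at):
        repeats += i.root_cause_class in seen
        seen.add(i.root_cause_class)
    return repeats / len(incidents) if incidents else 0.0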
Architectural implications
MTTD and MTTR are measured per incident and aggregated by severity, period, and service area; trends are monitored as actively as availability or latency.
Incident frequency is tracked by category (data plane vs control plane, service A vs service B, deploy-related vs ambient) — aggregate counts hide patterns that specific categories surface.
Repeat-class incidents are flagged: an incident whose root cause matches a prior incident is a strong signal that the prior post-mortem's actions didn't ship or didn't address the right cause.
The metrics drive investment: MTTD problems route to detection-and-alerting work; MTTR problems route to runbook, automation, and training work; frequency problems route to engineering remediation of recurring root causes.
Quick test
Pick the last quarter. What was the MTTD across sev1 and sev2 incidents, MTTR across the same, and incident frequency by severity? What's the trend? If those numbers don't exist, the incident-response system is running without engineering signal — and improvement is whatever happens by accident rather than by design.
Reference
Google SRE Book — the Service Level Objectives and Practical Alerting chapters treat detection and resolution times as primary engineering signals; the SLO material introduces the error budgets that connect incident frequency to engineering investment. The DORA metrics treat time to restore service (MTTR) as one of the four key engineering performance metrics, alongside deployment frequency, lead time for changes, and change failure rate.
Flowchart
flowchart TB
%% ─── Detection layer ───
subgraph DETECT [Detection Layer]
direction TB
SIGNALS[("Detection Signals<br/>metrics · logs · traces<br/>synthetic · user reports")]
ALERT_RT["Alerting Router<br/>dedup · correlation<br/>consolidation"]
end
%% ─── Severity classification ───
SEV{Severity Classification}
%% ─── Differentiated response ───
subgraph RESPONSES [Differentiated Response Patterns]
direction TB
SEV1["Sev1 / Critical<br/>page on-call<br/>incident commander<br/>customer status"]
SEV2["Sev2 / High<br/>paged within SLA<br/>internal coordination"]
SEV3["Sev3 / Medium<br/>ticketed<br/>business hours"]
end
%% ─── Coordination roles ───
subgraph ROLES [Coordination Roles]
direction LR
IC["Incident Commander<br/>decides priorities"]
OPS["Operations Lead<br/>investigates"]
COMMS["Communications Lead<br/>handles updates"]
SCRIBE["Scribe<br/>maintains timeline"]
end
%% ─── Resolution ───
AUTO["Automated Recovery<br/>frequent deterministic actions"]
RUNBOOK["Prose Runbook<br/>judgment-requiring actions"]
RESOLVE([Incident Resolved])
%% ─── Post-incident ───
subgraph LEARN [Learning Loop]
direction TB
POSTMORTEM["Blameless Post-Mortem<br/>timeline · analysis · actions"]
ACTIONS[("Action Items<br/>tracked in eng system<br/>through completion")]
end
%% ─── Metrics ───
METRICS[/"MTTD · MTTR<br/>frequency by severity<br/>repeat-class rate"/]
%% ─── Flow ───
SIGNALS --> ALERT_RT
ALERT_RT --> SEV
SEV -->|sev1| SEV1
SEV -->|sev2| SEV2
SEV -->|sev3| SEV3
SEV1 --> ROLES
SEV2 --> ROLES
ROLES --> AUTO
ROLES --> RUNBOOK
AUTO --> RESOLVE
RUNBOOK --> RESOLVE
SEV3 --> RUNBOOK
RESOLVE --> POSTMORTEM
POSTMORTEM --> ACTIONS
ACTIONS -.improvements.-> ALERT_RT
ACTIONS -.improvements.-> AUTO
%% ─── Metrics flow ───
DETECT -.MTTD.-> METRICS
RESPONSES -.MTTR.-> METRICS
ACTIONS -.repeat-class.-> METRICS
METRICS -.engineering investment.-> ACTIONS
%% ─── NODE STYLES ───
%% Detection: blue (operational data)
style SIGNALS fill:#DCEAFC,stroke:#2C5F9E,stroke-width:1.5px,color:#1A2840
style ALERT_RT fill:#DCEAFC,stroke:#2C5F9E,stroke-width:1.8px,color:#1A2840
%% Severity: emphasised cream (routing decision)
style SEV fill:#FBEDE0,stroke:#C96330,stroke-width:2.5px,color:#3A2410
%% Response patterns: amber gradient (escalating intensity)
style SEV1 fill:#FBEDE0,stroke:#C96330,stroke-width:2px,color:#3A2410
style SEV2 fill:#FEF3C7,stroke:#92722C,stroke-width:1.8px,color:#5C4316
style SEV3 fill:#FEF8E5,stroke:#A8956A,stroke-width:1.5px,color:#5C4316
%% Roles: cooler tone (the coordination structure)
style IC fill:#F4E5DD,stroke:#A37050,stroke-width:1.8px,color:#3A2410
style OPS fill:#F4E5DD,stroke:#A37050,stroke-width:1.5px,color:#3A2410
style COMMS fill:#F4E5DD,stroke:#A37050,stroke-width:1.5px,color:#3A2410
style SCRIBE fill:#F4E5DD,stroke:#A37050,stroke-width:1.5px,color:#3A2410
%% Recovery actions
style AUTO fill:#FEF8E5,stroke:#A8956A,stroke-width:1.5px,color:#5C4316
style RUNBOOK fill:#FEF8E5,stroke:#A8956A,stroke-width:1.5px,color:#5C4316
style RESOLVE fill:#4A90E2,stroke:#2C5F9E,stroke-width:2px,color:#FFFFFF
%% Learning loop: blue (the institutional memory)
style POSTMORTEM fill:#DCEAFC,stroke:#2C5F9E,stroke-width:2px,color:#1A2840
style ACTIONS fill:#DCEAFC,stroke:#2C5F9E,stroke-width:1.8px,color:#1A2840
%% Metrics
style METRICS fill:#E8E8E2,stroke:#5C5C5C,stroke-width:1.5px,color:#2C2C2C
%% ─── CLUSTER STYLES ───
style DETECT fill:#EDF4FC,stroke:#A8C0DD,stroke-width:1.2px,color:#1A2840
style RESPONSES fill:#FEF1E6,stroke:#D4A87A,stroke-width:1.4px,color:#3A2410
style ROLES fill:#FEF8E5,stroke:#D4C599,stroke-width:1.2px,color:#5C4316
style LEARN fill:#EDF4FC,stroke:#A8C0DD,stroke-width:1.2px,color:#1A2840
The diagram above shows the canonical incident-response architecture: detection signals flowing into an alerting layer with deduplication and correlation; severity classification routing to differentiated response patterns; coordination roles (commander, communications, scribe) instantiated on declaration; runbook/automation execution path with documented escalation; post-incident review with blameless framing producing action items tracked through completion; metrics layer (MTTD, MTTR, frequency, repetition) feeding back into engineering investment.
Common pitfalls when adopting incident-response thinking
Detection tuned for human bandwidth
Alert thresholds are set loose so the team can keep up with the volume. Real problems hide below threshold. By the time something escalates loudly enough to alert, the impact is severe.
What to do instead
Detection tuned for catching real problems early. Alert volume managed at the routing/consolidation layer (deduplication, intelligent grouping, severity-based routing), not by making detection less sensitive.
Severity by gut feel
Every incident is whatever severity the responder assigns based on how the situation feels. Some sev1s are over-classified; some sev3s are under-classified. Response patterns are inconsistent.
What to do instead
Documented severity criteria with concrete examples. Severity assigned at declaration, revisited mid-incident as scope clarifies. Severity routing produces consistent response per class.
The Incident Commander who's also investigating
The most senior engineer takes the page, then takes over coordination AND leads the technical investigation. Coordination decisions get delayed (they're heads-down debugging); investigation gets interrupted (they're answering status requests). Both suffer.
What to do instead
Roles are separate. The Incident Commander coordinates, doesn't investigate. The Operations Lead investigates, doesn't communicate. The Communications Lead handles updates. For small incidents, one person plays multiple roles deliberately; for large ones, the roles separate.
Blameless in name, blameful in practice
The team labels post-mortems "blameless" but the conversation focuses on the engineer who pushed the deploy that caused the incident. The label is a slogan; the culture is the practice. Engineers learn that surfacing risks produces consequences and stop doing it.
What to do instead
Active facilitation by someone with authority to keep the framing honest. Structure separates "what happened" (timeline) from "why it happened" (system) from "what we'll change" (actions). The discipline is enforced, not assumed.
Action items that don't ship
Post-mortems produce action items. The action items live in a spreadsheet. Six months later, the same incident class recurs because the action items never made it into engineering plans.
What to do instead
Action items live in the same engineering tracking system as feature work, with owners and target dates. Periodic review of overdue items. Repeat-class incidents trigger investigation of why prior actions didn't ship.
Adoption checklist
1. Detection is tuned for catching real problems early; alert volume managed at routing layer ☐
Detection sensitivity isn't compromised by alert fatigue. The routing layer (consolidation, deduplication, severity-based routing) handles volume, not the threshold tuning.
2. Severity criteria are documented with concrete examples; severity is assigned at declaration ☐
Severity is a routing primitive, not a label. Concrete criteria mean different responders classify the same situation similarly. Severity is revisited mid-incident as scope clarifies.
3. Each severity has a documented response pattern — paging cadence, comms surfaces, role activation ☐
Different severities deserve categorically different responses. The pattern per severity is documented; the routing produces consistent response.
4. Coordination roles (Commander, Operations, Communications, Scribe) are documented and trained on ☐
ICS-derived roles, with explicit responsibilities and explicit non-responsibilities. Role assignment is part of the incident declaration ritual. People can step into roles under pressure.
5. Runbooks are placed deliberately on the prose-to-automation spectrum per action's properties ☐
Frequent deterministic low-risk actions are automated. Infrequent judgment-requiring actions are prose-runbook. The placement is justified, not uniform.
6. Automation is tested in non-production and exercised in game days ☐
Automation that hasn't been tested fails on the first incident. Game days exercise both the automation and the human response patterns, surfacing gaps before incidents do.
7. Post-incident reviews are blameless with active facilitation ☐
Blameless is a practice, not a label. Facilitation by someone with authority to keep the framing honest. Structure separates timeline, analysis, actions.
8. Action items are tracked in engineering systems with owners and target dates; overdue items are reviewed ☐
Actions live in the same tracking system as feature work. Repeat-class incidents trigger investigation of why prior actions didn't ship. The post-mortem produces change, not artefacts.
9. MTTD, MTTR, incident frequency by severity, and repeat-class rate are tracked and trended ☐
The incident-response system's own performance is measured. Different metrics signal different work: MTTD problems → detection investment; MTTR problems → response investment; frequency problems → engineering remediation.
10. Reviews are made available across the organisation with appropriate redactions ☐
Institutional learning depends on more than the people in the room hearing the lesson. Cross-team availability spreads the lesson; appropriate redactions handle sensitive content without losing the learning.