Incident Management Guide

Master incident response, severity classification, and Critical User Journey mapping

Incident Management Theory & Best Practices

Effective incident management is crucial for maintaining service reliability and minimizing user impact. Learn the core principles that drive successful incident response.

What is an Incident?

An incident is an unplanned interruption or reduction in quality of a service. This includes outages, performance degradation, security breaches, or any event that impacts users negatively.

Core Principles

  • Detect Fast: Mean Time To Detection (MTTD) matters
  • Respond Faster: Mean Time To Recovery (MTTR) is critical
  • Learn Always: Every incident is a learning opportunity
  • Blameless Culture: Focus on systems, not individuals
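
Tracking these metrics only requires consistent timestamps on each incident record. Below is a minimal sketch of computing MTTD and MTTR, assuming a simple in-memory record shape; the field names are illustrative, not the schema of any particular tool:

```python
# A minimal sketch of computing MTTD and MTTR from incident records.
# The record shape (started_at, detected_at, resolved_at) is illustrative.
from datetime import datetime
from statistics import mean

incidents = [
    {
        "started_at": datetime(2026, 1, 5, 10, 0),
        "detected_at": datetime(2026, 1, 5, 10, 12),
        "resolved_at": datetime(2026, 1, 5, 11, 30),
    },
    {
        "started_at": datetime(2026, 1, 20, 22, 0),
        "detected_at": datetime(2026, 1, 20, 22, 3),
        "resolved_at": datetime(2026, 1, 20, 22, 45),
    },
]

# MTTD: average time from impact start to detection.
mttd_min = mean((i["detected_at"] - i["started_at"]).total_seconds() / 60 for i in incidents)
# MTTR: average time from impact start to recovery.
mttr_min = mean((i["resolved_at"] - i["started_at"]).total_seconds() / 60 for i in incidents)

print(f"MTTD: {mttd_min:.0f} min, MTTR: {mttr_min:.0f} min")
```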

Incident Lifecycle

  1. Detection: Identify the incident
  2. Response: Acknowledge and assemble the team
  3. Mitigation: Restore service to normal
  4. Resolution: Fix the root cause
  5. Post-Mortem: Learn and improve

Best Practices for Incident Response

1. Establish Clear Roles

Incident Commander: Leads the response, makes decisions
Communications Lead: Updates stakeholders and customers
Technical Lead: Coordinates technical investigation and remediation

2. Use Runbooks & Playbooks

Document common incident scenarios and response procedures. Runbooks reduce MTTR by providing step-by-step guidance for known issues.
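
Runbooks can live in a wiki, but keeping them as structured data makes them easy to link from alerts and render as a checklist. A minimal sketch, with a hypothetical scenario and illustrative field names:

```python
# A minimal sketch of a runbook stored as structured data.
# The scenario, steps, and field names are illustrative.
RUNBOOK_DB_POOL_EXHAUSTED = {
    "title": "Database connection pool exhausted",
    "severity_hint": "P1",
    "steps": [
        "Confirm pool saturation on the database dashboard",
        "Identify services holding long-lived connections",
        "Restart the worst offender or temporarily raise the pool limit",
        "File a follow-up ticket to fix the connection leak",
    ],
    "escalation": "database on-call",
}

# Render the runbook as a numbered checklist for the responder.
for n, step in enumerate(RUNBOOK_DB_POOL_EXHAUSTED["steps"], start=1):
    print(f"{n}. {step}")
print(f"Escalate to: {RUNBOOK_DB_POOL_EXHAUSTED['escalation']}")
```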

3. Maintain Incident Timeline

Keep detailed timestamps of detection, actions taken, and resolution. This helps with post-mortems and identifying bottlenecks in your process.
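
A low-friction way to do this is an append-only log that timestamps every entry automatically. A minimal sketch (the actors and events are illustrative):

```python
# A minimal sketch of an append-only incident timeline.
# Every action gets a UTC timestamp so the post-mortem can be reconstructed later.
from datetime import datetime, timezone

timeline: list[dict] = []

def record(event: str, actor: str) -> None:
    timeline.append({
        "at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "actor": actor,
        "event": event,
    })

record("Alert fired: checkout error rate above 5%", "monitoring")
record("Incident declared at severity P1", "incident commander")
record("Deploy 2026-02-03.1 rolled back", "technical lead")

for entry in timeline:
    print(entry["at"], entry["actor"], entry["event"], sep=" | ")
```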

4. Communicate Proactively

Send regular updates to stakeholders and users, even when there is no new information. Silence creates uncertainty and erodes trust.

5. Focus on Mitigation First

Get the service back up before diving deep into root cause. Users care about service restoration, not the why (yet).

6. Conduct Blameless Post-Mortems

Review what happened, why it happened, and how to prevent recurrence. Focus on systems and processes, not people.

Severity Level Reference

Not sure what severity level to assign? Use the reference below to determine the appropriate severity based on impact and urgency.

  • P0 (Critical): All or most users cannot use core features. Response time: immediate (24/7). Examples: complete site outage, data loss.
  • P1 (High): A significant portion of the user base is affected. Response time: under 1 hour. Examples: payment processing down, login failures.
  • P2 (Medium): Some users are affected and a workaround exists. Response time: under 4 hours. Examples: feature degradation, slow performance.
  • P3 (Low): Minimal user impact. Response time: under 24 hours. Examples: minor UI issues, cosmetic bugs.
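
If you want to pre-fill a severity suggestion when an incident is declared, the reference above can be approximated in code. A minimal sketch; the percentage thresholds and the function name are illustrative assumptions, not a standard:

```python
# A minimal sketch of the severity reference above as a function.
# The percentage thresholds are illustrative assumptions.
def classify_severity(pct_users_affected: float,
                      core_feature_blocked: bool,
                      workaround_exists: bool) -> str:
    if core_feature_blocked and pct_users_affected >= 80:
        return "P0"  # all or most users cannot use core features
    if pct_users_affected >= 20 and not workaround_exists:
        return "P1"  # significant user base affected, no workaround
    if pct_users_affected > 1:
        return "P2"  # some users affected, workaround exists
    return "P3"      # minimal user impact

print(classify_severity(100, True, False))  # P0: complete checkout outage
print(classify_severity(30, False, True))   # P2: degraded feature with a workaround
```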

Incident Management Tools in the Market

Choosing the right incident management tool is crucial for effective response. Here's a comparison of leading platforms.

  • PagerDuty: Best for enterprise teams and complex on-call schedules. Key features: advanced alerting, on-call management, incident response, automation, AIOps. Pricing: $$$ (starts at $21/user/mo). Integrations: 650+.
  • Opsgenie: Best for Atlassian users and mid-size teams. Key features: alert routing, on-call scheduling, tight Jira integration. Pricing: $$ (starts at $9/user/mo). Integrations: 200+.
  • VictorOps (Splunk): Best for DevOps teams with an observability focus. Key features: timeline view, on-call scheduling, ChatOps integration, post-incident review. Pricing: $$ (starts at $9/user/mo). Integrations: 150+.
  • Statuspage.io: Best for customer communication and transparency. Key features: status pages, incident communication, subscriber notifications. Pricing: $$ (starts at $29/mo). Integrations: Atlassian ecosystem.
  • Incident.io: Best for modern teams with Slack-first workflows. Key features: Slack-native experience, automated workflows, post-mortem automation. Pricing: $$$ (custom pricing). Integrations: Slack, Jira, GitHub.
  • FireHydrant: Best for SRE teams focused on incident learning. Key features: incident tracking, retrospectives, runbooks, analytics. Pricing: $$$ (custom pricing). Integrations: Slack, Datadog, and more.
  • Blameless: Best for SRE maturity and reliability programs. Key features: SLO tracking, incident management, post-mortem automation, reliability insights. Pricing: $$$ (custom pricing). Integrations: major monitoring tools.

How to Choose the Right Tool

Team Size Matters

  • Small (<10): Opsgenie, VictorOps
  • Medium (10-100): PagerDuty, FireHydrant
  • Large (100+): PagerDuty, Blameless

Budget Considerations

  • Budget-conscious: Opsgenie, VictorOps
  • Mid-range: PagerDuty starter plans
  • Enterprise: PagerDuty, Blameless

Ecosystem Fit

  • Atlassian stack: Opsgenie, Statuspage
  • Slack-first: Incident.io
  • Splunk users: VictorOps
  • SRE maturity: Blameless, FireHydrant

Critical User Journey (CUJ) Mapping Playbook

Critical User Journeys (CUJs) are the most important paths users take through your application. Mapping incidents to CUJs helps prioritize response and understand true business impact.

What are Critical User Journeys?

A Critical User Journey is a sequence of steps a user takes to accomplish a high-value task in your application. Examples include:

  • E-commerce: Browse → Add to Cart → Checkout → Payment → Order Confirmation
  • SaaS: Login → Access Dashboard → Perform Key Action → Save/Export
  • Social Media: Login → View Feed → Create Post → Publish
  • Banking: Login → View Balance → Transfer Money → Confirm Transaction
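
It helps to capture each journey as an ordered list of steps tied to the services behind them, so the same definition can be reused for monitoring and impact assessment. A minimal sketch using the e-commerce journey above (service names are illustrative):

```python
# A minimal sketch of a Critical User Journey captured as data.
# Step and service names are illustrative.
CHECKOUT_JOURNEY = {
    "name": "Purchase/Checkout",
    "steps": [
        {"step": "Browse",             "service": "catalog-api"},
        {"step": "Add to Cart",        "service": "cart-service"},
        {"step": "Checkout",           "service": "checkout-service"},
        {"step": "Payment",            "service": "payment-api"},
        {"step": "Order Confirmation", "service": "order-service"},
    ],
}

print(" -> ".join(s["step"] for s in CHECKOUT_JOURNEY["steps"]))
```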

Why Map Incidents to CUJs?

Better Prioritization

Understand which incidents actually impact users vs. which are internal-only

Clear Communication

Explain impact in business terms stakeholders understand

Resource Allocation

Focus engineering effort on protecting critical paths

SLO Alignment

Define SLOs based on real user journeys, not arbitrary metrics

Step-by-Step CUJ Mapping Playbook

1. Identify Your CUJs

Work with product and business teams to list the 3-7 most critical user journeys. Ask:

  • What actions generate revenue?
  • What features do users expect to always work?
  • What failures would cause users to leave?

2. Map System Dependencies

For each CUJ, document which services, APIs, and dependencies are involved:

  • Frontend components
  • Backend services and APIs
  • Databases and data stores
  • Third-party services

3. Define Success Criteria

What does "working" mean for each step in the journey?

  • Response time thresholds (e.g., page load < 2s)
  • Success rates (e.g., API success > 99.9%)
  • Data accuracy requirements
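
These criteria are easiest to act on when kept alongside the journey definition as per-step thresholds that a check can compare against. A minimal sketch (the numbers are illustrative, not recommendations):

```python
# A minimal sketch of per-step success criteria. Thresholds are illustrative.
SUCCESS_CRITERIA = {
    "Checkout": {"p95_latency_ms": 2000, "min_success_rate": 0.999},
    "Payment":  {"p95_latency_ms": 3000, "min_success_rate": 0.999},
}

def step_is_healthy(step: str, observed_p95_ms: float, observed_success_rate: float) -> bool:
    target = SUCCESS_CRITERIA[step]
    return (observed_p95_ms <= target["p95_latency_ms"]
            and observed_success_rate >= target["min_success_rate"])

print(step_is_healthy("Payment", 2500, 0.9992))  # True: within both targets
```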

4. Create CUJ Impact Matrix

Build a matrix showing which services impact which CUJs, so responders can quickly assess user impact during an incident.
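
A minimal sketch of such a matrix as a lookup from service to journeys (service and journey names are illustrative):

```python
# A minimal sketch of a service-to-CUJ impact matrix.
# Service and journey names are illustrative.
IMPACT_MATRIX = {
    "payment-api":    ["Purchase/Checkout"],
    "search-service": ["Search/Discovery"],
    "auth-service":   ["Purchase/Checkout", "Search/Discovery", "Profile Management"],
}

def impacted_cujs(failing_services: list[str]) -> set[str]:
    return {cuj for svc in failing_services for cuj in IMPACT_MATRIX.get(svc, [])}

# During an incident: which journeys does this outage touch?
print(impacted_cujs(["auth-service", "payment-api"]))
```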

5. Instrument and Monitor

Set up synthetic monitoring or real user monitoring (RUM) to track each CUJ:

  • End-to-end journey tests
  • Key step completion rates
  • Performance metrics per step
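
A monitoring product usually handles this, but the idea can be illustrated with a simple synthetic probe that walks the journey step by step and records per-step latency and success. A minimal sketch; the URLs and steps are illustrative:

```python
# A minimal sketch of a synthetic end-to-end journey check.
# URLs and steps are illustrative; a real probe would also carry auth and test data.
import time
import urllib.request

JOURNEY_STEPS = [
    ("Browse",      "https://shop.example.com/catalog"),
    ("Add to Cart", "https://shop.example.com/cart/add?item=123"),
    ("Checkout",    "https://shop.example.com/checkout"),
]

def run_synthetic_check() -> list[dict]:
    results = []
    for name, url in JOURNEY_STEPS:
        start = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                ok = 200 <= resp.status < 300
        except Exception:
            ok = False
        results.append({
            "step": name,
            "ok": ok,
            "latency_ms": round((time.monotonic() - start) * 1000),
        })
    return results

for result in run_synthetic_check():
    print(result)
```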

6. Link to Incident Response

During incidents, always identify the affected CUJs and include them in status updates:

  • Which journeys are impacted?
  • What percentage of users affected?
  • Are there workarounds available?
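
A small helper that forces every update to answer those three questions keeps communication consistent. A minimal sketch (wording and fields are illustrative):

```python
# A minimal sketch of a CUJ-aware status update. Wording and fields are illustrative.
from typing import Optional

def status_update(impacted_cujs: list[str], pct_users_affected: float,
                  workaround: Optional[str] = None) -> str:
    return "\n".join([
        f"Impacted journeys: {', '.join(impacted_cujs) or 'none identified yet'}",
        f"Estimated users affected: ~{pct_users_affected:.0f}%",
        f"Workaround: {workaround or 'none at this time'}",
    ])

print(status_update(["Purchase/Checkout"], 100))
```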

Example: E-commerce Incidents

Mapping an incident to the affected Critical User Journeys makes the real user impact explicit. For example:

  • Payment gateway timeout (Payment API): impacts the Purchase/Checkout journey; 100% of checkout attempts fail. Severity: P0.
  • Search index lag (Search Service): impacts the Search/Discovery journey; search results are one hour stale. Severity: P2.
  • Slow profile image uploads (CDN): impacts the Profile Management journey; uploads take 10s instead of 2s. Severity: P3.

Frequently Asked Questions

Q: What's the difference between P0 and P1 incidents?

A: P0 incidents affect all or most users and require immediate 24/7 response (e.g., complete outage). P1 incidents affect a significant portion of users and require response within 1 hour (e.g., critical feature down but site accessible).

Q: Should every incident have a post-mortem?

A: Not necessarily. Focus post-mortems on P0/P1 incidents, recurring issues, or incidents that reveal systemic problems. P2/P3 might only need brief incident reports.

Q: How do we make post-mortems truly blameless?

A: Focus language on systems: "the deployment process lacked safeguards" not "Bob deployed buggy code." Assume good intentions. Ask "how did the system allow this?" not "who did this?"

Q: How many CUJs should we define?

A: Start with the 3-7 most critical journeys. Too many dilute focus; too few miss important user flows. You can always add more as you mature.

Q: What if multiple services are down - how do we prioritize?

A: Use CUJ mapping! Restore services that impact the most critical user journeys first. Revenue-generating paths typically take priority.

Q: Should we use multiple incident management tools?

A: Generally no. Pick one primary platform to avoid confusion during incidents. You might supplement with a status page tool, but keep alerting/on-call in one system.

Q: How do we reduce MTTR?

A: Focus on: better monitoring/alerting (faster detection), runbooks for common issues, automated rollback capabilities, clear escalation paths, and regular incident response practice/drills.

Q: What metrics should we track for incident management?

A: Key metrics include: MTTD (Mean Time To Detection), MTTM (Mean Time To Mitigation), MTTR (Mean Time To Recovery), incident frequency by severity, and post-mortem action item completion rate.

Last Updated: February 2026

Created For: SRE Teams, DevOps Engineers & Incident Responders

Status: Ready to Use