Incident Management Guide
Master incident response, severity classification, and Critical User Journey mapping
Incident Management Theory & Best Practices
Effective incident management is crucial for maintaining service reliability and minimizing user impact. Learn the core principles that drive successful incident response.
What is an Incident?
An incident is an unplanned interruption or reduction in quality of a service. This includes outages, performance degradation, security breaches, or any event that impacts users negatively.
Core Principles
- Detect Fast: Mean Time To Detection (MTTD) matters
- Respond Faster: Mean Time To Recovery (MTTR) is critical
- Learn Always: Every incident is a learning opportunity
- Blameless Culture: Focus on systems, not individuals
Incident Lifecycle
- Detection: Identify the incident
- Response: Acknowledge and assemble the team
- Mitigation: Restore service to normal
- Resolution: Fix the root cause
- Post-Mortem: Learn and improve
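If you track incident state in your own tooling, the five stages above map naturally onto a small state machine. The sketch below is a minimal Python illustration; the class and mapping names are assumptions, not any particular tool's API.

```python
from enum import Enum


class Stage(Enum):
    DETECTION = "detection"
    RESPONSE = "response"
    MITIGATION = "mitigation"
    RESOLUTION = "resolution"
    POST_MORTEM = "post-mortem"


# Each stage may only advance to the next one in the lifecycle above.
NEXT_STAGE = {
    Stage.DETECTION: Stage.RESPONSE,
    Stage.RESPONSE: Stage.MITIGATION,
    Stage.MITIGATION: Stage.RESOLUTION,
    Stage.RESOLUTION: Stage.POST_MORTEM,
}


def advance(current: Stage) -> Stage:
    """Return the next lifecycle stage; post-mortem is the final stage."""
    if current is Stage.POST_MORTEM:
        raise ValueError("Incident has already reached post-mortem.")
    return NEXT_STAGE[current]
```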
Best Practices for Incident Response
1. Establish Clear Roles
Incident Commander: Leads the response, makes decisions
Communications Lead: Updates stakeholders and customers
Technical Lead: Coordinates technical investigation and remediation
2. Use Runbooks & Playbooks
Document common incident scenarios and response procedures. Runbooks reduce MTTR by providing step-by-step guidance for known issues.
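Even a lightweight runbook registry kept in version control beats tribal knowledge. Here is a minimal sketch in Python; the alert names and steps are hypothetical placeholders, not a prescribed format.

```python
# Hypothetical alert names mapped to ordered remediation steps.
RUNBOOKS = {
    "checkout-5xx-spike": [
        "Check recent deploys to the checkout service",
        "Roll back the latest deploy if it correlates with the spike",
        "Check the payment provider's status page",
        "Escalate to the payments on-call if errors persist after rollback",
    ],
    "db-replication-lag": [
        "Confirm lag on the replica dashboards",
        "Pause non-critical batch jobs writing to the primary",
        "Escalate to the database on-call if lag keeps growing",
    ],
}


def print_runbook(alert_name: str) -> None:
    """Print numbered steps for a known alert, or prompt to create a runbook."""
    steps = RUNBOOKS.get(alert_name)
    if steps is None:
        print(f"No runbook for '{alert_name}' yet; add one during the post-mortem.")
        return
    for i, step in enumerate(steps, start=1):
        print(f"{i}. {step}")
```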
3. Maintain Incident Timeline
Keep detailed timestamps of detection, actions taken, and resolution. This helps with post-mortems and identifying bottlenecks in your process.
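Most teams capture the timeline in their incident tool or a shared doc, but the shape of the data is simple either way. A bare-bones sketch, with made-up actors and notes:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class TimelineEntry:
    timestamp: datetime
    actor: str
    note: str


@dataclass
class IncidentTimeline:
    entries: list[TimelineEntry] = field(default_factory=list)

    def log(self, actor: str, note: str) -> None:
        """Record who did what, stamped in UTC."""
        self.entries.append(TimelineEntry(datetime.now(timezone.utc), actor, note))


# Example usage (names and events are illustrative).
timeline = IncidentTimeline()
timeline.log("alice (IC)", "Acknowledged page for checkout 5xx spike")
timeline.log("bob (Tech Lead)", "Rolled back the latest checkout deploy")
```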
4. Communicate Proactively
Send regular updates to stakeholders and users, even when there is no new information. Silence creates uncertainty and erodes trust.
5. Focus on Mitigation First
Get the service back up before diving deep into root cause. Users care about service restoration, not the why (yet).
6. Conduct Blameless Post-Mortems
Review what happened, why it happened, and how to prevent recurrence. Focus on systems and processes, not people.
Incident Severity Calculator
Not sure what severity level to assign? Use the reference table below to determine the appropriate incident severity based on impact and urgency; a small classifier sketch follows the table.
Severity Level Reference
| Level | Description | User Impact | Response Time | Example |
|---|---|---|---|---|
| P0 | Critical | All or most users cannot use core features | Immediate (24/7) | Complete site outage, data loss |
| P1 | High | Significant user base affected | < 1 hour | Payment processing down, login failures |
| P2 | Medium | Some users affected, workaround exists | < 4 hours | Feature degradation, slow performance |
| P3 | Low | Minimal user impact | < 24 hours | Minor UI issues, cosmetic bugs |
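The table can also be approximated in code for automated triage. The sketch below is one possible mapping; the thresholds (80%, 20%, 2% of users) are illustrative assumptions, not fixed policy, so tune them to your own severity definitions.

```python
def classify_severity(pct_users_affected: float,
                      core_feature_down: bool,
                      workaround_exists: bool) -> str:
    """Map impact to a P0-P3 level, loosely following the reference table above."""
    if core_feature_down and pct_users_affected >= 0.80:
        return "P0"  # all or most users cannot use core features
    if pct_users_affected >= 0.20:
        return "P1"  # significant user base affected
    if pct_users_affected >= 0.02:
        # Some users affected; an available workaround softens the severity.
        return "P2" if workaround_exists else "P1"
    return "P3"      # minimal user impact


print(classify_severity(1.00, True, False))   # P0 - complete outage
print(classify_severity(0.05, False, True))   # P2 - degradation with workaround
```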
Incident Management Tools in the Market
Choosing the right incident management tool is crucial for effective response. Here's a comprehensive comparison of leading platforms.
| Tool | Best For | Key Features | Pricing | Integration |
|---|---|---|---|---|
| PagerDuty | Enterprise teams, complex on-call schedules | Advanced alerting, on-call management, incident response, automation, AIOps | $$$ (Starts at $21/user/mo) | 650+ integrations |
| Opsgenie | Atlassian users, mid-size teams | Alert routing, on-call scheduling, integrates well with Jira | $$ (Starts at $9/user/mo) | 200+ integrations |
| VictorOps (Splunk) | DevOps teams, observability focus | Timeline view, on-call, ChatOps integration, post-incident review | $$ (Starts at $9/user/mo) | 150+ integrations |
| Statuspage.io | Customer communication, transparency | Status pages, incident communication, subscriber notifications | $$ (Starts at $29/mo) | Atlassian ecosystem |
| Incident.io | Modern teams, Slack-first workflows | Slack-native, automated workflows, post-mortem automation | $$$ (Custom pricing) | Slack, Jira, GitHub |
| FireHydrant | SRE teams, incident learning | Incident tracking, retrospectives, runbooks, analytics | $$$ (Custom pricing) | Slack, DataDog, etc |
| Blameless | SRE maturity, reliability programs | SLO tracking, incident management, postmortem automation, reliability insights | $$$ (Custom pricing) | Major monitoring tools |
How to Choose the Right Tool
Team Size Matters
- Small (<10): Opsgenie, VictorOps
- Medium (10-100): PagerDuty, FireHydrant
- Large (100+): PagerDuty, Blameless
Budget Considerations
- Budget-conscious: Opsgenie, VictorOps
- Mid-range: PagerDuty starter plans
- Enterprise: PagerDuty, Blameless
Ecosystem Fit
- Atlassian stack: Opsgenie, Statuspage
- Slack-first: Incident.io
- Splunk users: VictorOps
- SRE maturity: Blameless, FireHydrant
Critical User Journey (CUJ) Mapping Playbook
Critical User Journeys (CUJs) are the most important paths users take through your application. Mapping incidents to CUJs helps prioritize response and understand true business impact.
What are Critical User Journeys?
A Critical User Journey is a sequence of steps a user takes to accomplish a high-value task in your application. Examples include:
- E-commerce: Browse → Add to Cart → Checkout → Payment → Order Confirmation
- SaaS: Login → Access Dashboard → Perform Key Action → Save/Export
- Social Media: Login → View Feed → Create Post → Publish
- Banking: Login → View Balance → Transfer Money → Confirm Transaction
Why Map Incidents to CUJs?
Better Prioritization
Understand which incidents actually impact users vs. which are internal-only
Clear Communication
Explain impact in business terms stakeholders understand
Resource Allocation
Focus engineering effort on protecting critical paths
SLO Alignment
Define SLOs based on real user journeys, not arbitrary metrics
Step-by-Step CUJ Mapping Playbook
Identify Your CUJs
Work with product and business teams to list the 3-7 most critical user journeys. Ask:
- What actions generate revenue?
- What features do users expect to always work?
- What failures would cause users to leave?
Map System Dependencies
For each CUJ, document which services, APIs, and dependencies are involved:
- Frontend components
- Backend services and APIs
- Databases and data stores
- Third-party services
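For the e-commerce journey used as an example earlier, a dependency map might look like the sketch below. The service names are placeholders standing in for whatever your architecture actually contains.

```python
# Hypothetical service names; replace with your real components.
CUJ_DEPENDENCIES = {
    "Purchase/Checkout": [
        "web-frontend", "cart-service", "checkout-api",
        "payment-gateway", "orders-db",
    ],
    "Search/Discovery": ["web-frontend", "search-service", "search-index"],
    "Profile Management": ["web-frontend", "profile-api", "cdn", "users-db"],
}
```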
Define Success Criteria
What does "working" mean for each step in the journey?
- Response time thresholds (e.g., page load < 2s)
- Success rates (e.g., API success > 99.9%)
- Data accuracy requirements
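Success criteria are easiest to act on when they live next to your tooling as data. A sketch with made-up thresholds; the real numbers should come out of your product and SLO discussions.

```python
# Illustrative thresholds per journey step (assumed values, not recommendations).
SUCCESS_CRITERIA = {
    "checkout": {"p95_latency_ms": 2000, "success_rate": 0.999},
    "search": {"p95_latency_ms": 800, "success_rate": 0.995},
    "profile-upload": {"p95_latency_ms": 3000, "success_rate": 0.99},
}


def step_is_healthy(step: str, p95_latency_ms: float, success_rate: float) -> bool:
    """Compare observed metrics for a step against its agreed thresholds."""
    criteria = SUCCESS_CRITERIA[step]
    return (p95_latency_ms <= criteria["p95_latency_ms"]
            and success_rate >= criteria["success_rate"])


print(step_is_healthy("checkout", p95_latency_ms=1800, success_rate=0.9995))  # True
```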
Create CUJ Impact Matrix
Build a matrix showing which services impact which CUJs. This helps during incidents to quickly assess user impact.
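The matrix is essentially the dependency map from the previous sketch, inverted: given a failing service, list the journeys at risk. A minimal Python version, again with hypothetical service names:

```python
from collections import defaultdict

# Trimmed copy of the dependency map sketched earlier (hypothetical names).
CUJ_DEPENDENCIES = {
    "Purchase/Checkout": ["checkout-api", "payment-gateway", "orders-db"],
    "Search/Discovery": ["search-service", "search-index"],
    "Profile Management": ["profile-api", "cdn"],
}

# Invert CUJ -> services into service -> CUJs.
SERVICE_TO_CUJS: dict[str, list[str]] = defaultdict(list)
for cuj, services in CUJ_DEPENDENCIES.items():
    for service in services:
        SERVICE_TO_CUJS[service].append(cuj)


def impacted_cujs(failing_service: str) -> list[str]:
    """Journeys at risk when the given service is degraded."""
    return SERVICE_TO_CUJS.get(failing_service, [])


print(impacted_cujs("payment-gateway"))  # ['Purchase/Checkout']
```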
Instrument and Monitor
Set up synthetic monitoring or real user monitoring (RUM) to track each CUJ:
- End-to-end journey tests
- Key step completion rates
- Performance metrics per step
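Dedicated synthetic-monitoring products handle this at scale, but the core idea fits in a few lines. The sketch below assumes the third-party requests library is installed and uses placeholder URLs standing in for real health endpoints.

```python
import time

import requests  # third-party; install with `pip install requests`


def check_step(name: str, url: str, max_latency_s: float = 2.0) -> bool:
    """Return True if the step responds successfully within its latency budget."""
    start = time.monotonic()
    try:
        resp = requests.get(url, timeout=max_latency_s)
    except requests.RequestException:
        print(f"[CUJ] {name}: request failed")
        return False
    latency = time.monotonic() - start
    ok = resp.status_code < 400 and latency <= max_latency_s
    print(f"[CUJ] {name}: status={resp.status_code} latency={latency:.2f}s ok={ok}")
    return ok


# Placeholder URLs for each step of a checkout journey.
steps = [
    ("browse", "https://shop.example.com/products"),
    ("add-to-cart", "https://shop.example.com/cart"),
    ("checkout", "https://shop.example.com/checkout/health"),
]
results = [check_step(name, url) for name, url in steps]
print("journey healthy:", all(results))
```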
Link to Incident Response
During incidents, always identify the affected CUJs and include them in status updates:
- Which journeys are impacted?
- What percentage of users affected?
- Are there workarounds available?
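Those three questions can be answered with a short, consistent status template. A sketch with hypothetical field names and an example incident:

```python
def format_status_update(title: str, impacted_cujs: list[str],
                         pct_users_affected: float, workaround: str | None) -> str:
    """Render a stakeholder-facing update that leads with user impact."""
    return "\n".join([
        f"Incident: {title}",
        f"Impacted journeys: {', '.join(impacted_cujs) or 'none identified yet'}",
        f"Estimated users affected: {pct_users_affected:.0%}",
        f"Workaround: {workaround or 'none known at this time'}",
    ])


print(format_status_update(
    "Payment gateway timeouts",        # hypothetical incident
    ["Purchase/Checkout"],
    0.37,
    "Retry the payment after a few minutes",
))
```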
Mapping an Incident to CUJs
Map each incident to the Critical User Journeys it affects to understand the real user impact, as in the e-commerce example below.
Example: E-commerce Incident
| Incident | Affected Service | Impacted CUJs | User Impact | Severity |
|---|---|---|---|---|
| Payment gateway timeout | Payment API | Purchase/Checkout | 100% of checkout attempts fail | P0 |
| Search index lag | Search Service | Search/Discovery | Search results 1 hour stale | P2 |
| Profile image upload slow | CDN | Profile Management | Upload takes 10s instead of 2s | P3 |
Frequently Asked Questions
Q: What's the difference between P0 and P1 incidents?
A: P0 incidents affect all or most users and require immediate 24/7 response (e.g., complete outage). P1 incidents affect a significant portion of users and require response within 1 hour (e.g., critical feature down but site accessible).
Q: Should every incident have a post-mortem?
A: Not necessarily. Focus post-mortems on P0/P1 incidents, recurring issues, or incidents that reveal systemic problems. P2/P3 might only need brief incident reports.
Q: How do we make post-mortems truly blameless?
A: Focus language on systems: "the deployment process lacked safeguards" not "Bob deployed buggy code." Assume good intentions. Ask "how did the system allow this?" not "who did this?"
Q: How many CUJs should we define?
A: Start with the 3-7 most critical journeys. Too many dilute focus; too few miss important user flows. You can always add more as you mature.
Q: If multiple services are down at once, how do we prioritize?
A: Use CUJ mapping! Restore services that impact the most critical user journeys first. Revenue-generating paths typically take priority.
Q: Should we use multiple incident management tools?
A: Generally no. Pick one primary platform to avoid confusion during incidents. You might supplement with a status page tool, but keep alerting/on-call in one system.
Q: How do we reduce MTTR?
A: Focus on: better monitoring/alerting (faster detection), runbooks for common issues, automated rollback capabilities, clear escalation paths, and regular incident response practice/drills.
Q: What metrics should we track for incident management?
A: Key metrics include: MTTD (Mean Time To Detection), MTTM (Mean Time To Mitigation), MTTR (Mean Time To Recovery), incident frequency by severity, and post-mortem action item completion rate.
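All of these metrics fall out of a few timestamps per incident, so the main work is recording them consistently. A sketch, assuming your tracker exports fields like the ones below (conventions for where each interval starts vary; this is one common choice):

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    started_at: datetime    # when user impact actually began
    detected_at: datetime   # when monitoring or on-call noticed it
    mitigated_at: datetime  # when user impact ended
    resolved_at: datetime   # when the root cause was fixed


def _mean_minutes(deltas) -> float:
    return mean(d.total_seconds() / 60 for d in deltas)


def incident_metrics(incidents: list[Incident]) -> dict[str, float]:
    """Compute MTTD, MTTM, and MTTR in minutes from incident timestamps."""
    return {
        "MTTD_min": _mean_minutes(i.detected_at - i.started_at for i in incidents),
        "MTTM_min": _mean_minutes(i.mitigated_at - i.detected_at for i in incidents),
        "MTTR_min": _mean_minutes(i.resolved_at - i.started_at for i in incidents),
    }
```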
Last Updated: February 2026
Created For: SRE Teams, DevOps Engineers & Incident Responders
Status: Ready to Use