Incident Management Guide
Master incident response, severity classification, and Critical User Journey mapping
Incident Management Theory & Best Practices
Effective incident management is crucial for maintaining service reliability and minimizing user impact. Learn the core principles that drive successful incident response.
What is an Incident?
An incident is an unplanned interruption or reduction in quality of a service. This includes outages, performance degradation, security breaches, or any event that impacts users negatively.
Core Principles
- Detect Fast: Mean Time To Detection (MTTD) matters
- Respond Faster: Mean Time To Recovery (MTTR) is critical
- Learn Always: Every incident is a learning opportunity
- Blameless Culture: Focus on systems, not individuals
Incident Lifecycle
- Detection: Identify the incident
- Response: Acknowledge and assemble the team
- Mitigation: Restore service to normal
- Resolution: Fix the root cause
- Post-Mortem: Learn and improve
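If you track incident state in your own tooling, the five stages above map naturally onto a small state machine. The sketch below is a minimal Python illustration; the class and mapping names are assumptions, not any particular tool's API.

```python
from enum import Enum


class Stage(Enum):
    DETECTION = "detection"
    RESPONSE = "response"
    MITIGATION = "mitigation"
    RESOLUTION = "resolution"
    POST_MORTEM = "post-mortem"


# Each stage may only advance to the next one in the lifecycle above.
NEXT_STAGE = {
    Stage.DETECTION: Stage.RESPONSE,
    Stage.RESPONSE: Stage.MITIGATION,
    Stage.MITIGATION: Stage.RESOLUTION,
    Stage.RESOLUTION: Stage.POST_MORTEM,
}


def advance(current: Stage) -> Stage:
    """Return the next lifecycle stage; post-mortem is the final stage."""
    if current is Stage.POST_MORTEM:
        raise ValueError("Incident has already reached post-mortem.")
    return NEXT_STAGE[current]
```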
Best Practices for Incident Response
1. Establish Clear Roles
Incident Commander: Leads the response, makes decisions
Communications Lead: Updates stakeholders and customers
Technical Lead: Coordinates technical investigation and remediation
2. Use Runbooks & Playbooks
Document common incident scenarios and response procedures. Runbooks reduce MTTR by providing step-by-step guidance for known issues.
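Even a lightweight runbook registry kept in version control beats tribal knowledge. Here is a minimal sketch in Python; the alert names and steps are hypothetical placeholders, not a prescribed format.

```python
# Hypothetical alert names mapped to ordered remediation steps.
RUNBOOKS = {
    "checkout-5xx-spike": [
        "Check recent deploys to the checkout service",
        "Roll back the latest deploy if it correlates with the spike",
        "Check the payment provider's status page",
        "Escalate to the payments on-call if errors persist after rollback",
    ],
    "db-replication-lag": [
        "Confirm lag on the replica dashboards",
        "Pause non-critical batch jobs writing to the primary",
        "Escalate to the database on-call if lag keeps growing",
    ],
}


def print_runbook(alert_name: str) -> None:
    """Print numbered steps for a known alert, or prompt to create a runbook."""
    steps = RUNBOOKS.get(alert_name)
    if steps is None:
        print(f"No runbook for '{alert_name}' yet; add one during the post-mortem.")
        return
    for i, step in enumerate(steps, start=1):
        print(f"{i}. {step}")
```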
3. Maintain Incident Timeline
Keep detailed timestamps of detection, actions taken, and resolution. This helps with post-mortems and identifying bottlenecks in your process.
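Most teams capture the timeline in their incident tool or a shared doc, but the shape of the data is simple either way. A bare-bones sketch, with made-up actors and notes:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class TimelineEntry:
    timestamp: datetime
    actor: str
    note: str


@dataclass
class IncidentTimeline:
    entries: list[TimelineEntry] = field(default_factory=list)

    def log(self, actor: str, note: str) -> None:
        """Record who did what, stamped in UTC."""
        self.entries.append(TimelineEntry(datetime.now(timezone.utc), actor, note))


# Example usage (names and events are illustrative).
timeline = IncidentTimeline()
timeline.log("alice (IC)", "Acknowledged page for checkout 5xx spike")
timeline.log("bob (Tech Lead)", "Rolled back the latest checkout deploy")
```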
4. Communicate Proactively
Send regular updates to stakeholders and users, even when there is no new information. Silence creates uncertainty and erodes trust.
5. Focus on Mitigation First
Get the service back up before diving deep into root cause. Users care about service restoration, not the why (yet).
6. Conduct Blameless Post-Mortems
Review what happened, why it happened, and how to prevent recurrence. Focus on systems and processes, not people.
Incident Severity Calculator
Not sure what severity level to assign? Use the reference table below to determine the appropriate incident severity based on impact and urgency; a small classifier sketch follows the table.
Severity Level Reference
| Level | Description | User Impact | Response Time | Example |
|---|---|---|---|---|
| P0 | Critical | All or most users cannot use core features | Immediate (24/7) | Complete site outage, data loss |
| P1 | High | Significant user base affected | < 1 hour | Payment processing down, login failures |
| P2 | Medium | Some users affected, workaround exists | < 4 hours | Feature degradation, slow performance |
| P3 | Low | Minimal user impact | < 24 hours | Minor UI issues, cosmetic bugs |
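The table can also be approximated in code for automated triage. The sketch below is one possible mapping; the thresholds (80%, 20%, 2% of users) are illustrative assumptions, not fixed policy, so tune them to your own severity definitions.

```python
def classify_severity(pct_users_affected: float,
                      core_feature_down: bool,
                      workaround_exists: bool) -> str:
    """Map impact to a P0-P3 level, loosely following the reference table above."""
    if core_feature_down and pct_users_affected >= 0.80:
        return "P0"  # all or most users cannot use core features
    if pct_users_affected >= 0.20:
        return "P1"  # significant user base affected
    if pct_users_affected >= 0.02:
        # Some users affected; an available workaround softens the severity.
        return "P2" if workaround_exists else "P1"
    return "P3"      # minimal user impact


print(classify_severity(1.00, True, False))   # P0 - complete outage
print(classify_severity(0.05, False, True))   # P2 - degradation with workaround
```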
Incident Management Tools in the Market
Choosing the right incident management tool is crucial for effective response. Here's a comprehensive comparison of leading platforms.
| Tool | Best For | Key Features | Pricing | Integration |
|---|---|---|---|---|
| PagerDuty | Enterprise teams, complex on-call schedules | Advanced alerting, on-call management, incident response, automation, AIOps | $$$ (Starts at $21/user/mo) | 650+ integrations |
| Opsgenie | Atlassian users, mid-size teams | Alert routing, on-call scheduling, integrates well with Jira | $$ (Starts at $9/user/mo) | 200+ integrations |
| VictorOps (Splunk) | DevOps teams, observability focus | Timeline view, on-call, ChatOps integration, post-incident review | $$ (Starts at $9/user/mo) | 150+ integrations |
| Statuspage.io | Customer communication, transparency | Status pages, incident communication, subscriber notifications | $$ (Starts at $29/mo) | Atlassian ecosystem |
| Incident.io | Modern teams, Slack-first workflows | Slack-native, automated workflows, post-mortem automation | $$$ (Custom pricing) | Slack, Jira, GitHub |
| FireHydrant | SRE teams, incident learning | Incident tracking, retrospectives, runbooks, analytics | $$$ (Custom pricing) | Slack, DataDog, etc |
| Blameless | SRE maturity, reliability programs | SLO tracking, incident management, postmortem automation, reliability insights | $$$ (Custom pricing) | Major monitoring tools |
How to Choose the Right Tool
Team Size Matters
- Small (<10): Opsgenie, VictorOps
- Medium (10-100): PagerDuty, FireHydrant
- Large (100+): PagerDuty, Blameless
Budget Considerations
- Budget-conscious: Opsgenie, VictorOps
- Mid-range: PagerDuty starter plans
- Enterprise: PagerDuty, Blameless
Ecosystem Fit
- Atlassian stack: Opsgenie, Statuspage
- Slack-first: Incident.io
- Splunk users: VictorOps
- SRE maturity: Blameless, FireHydrant
Critical User Journey (CUJ) Mapping Playbook
Critical User Journeys (CUJs) are the most important paths users take through your application. Mapping incidents to CUJs helps prioritize response and understand true business impact.
What are Critical User Journeys?
A Critical User Journey is a sequence of steps a user takes to accomplish a high-value task in your application. Examples include:
- E-commerce: Browse → Add to Cart → Checkout → Payment → Order Confirmation
- SaaS: Login → Access Dashboard → Perform Key Action → Save/Export
- Social Media: Login → View Feed → Create Post → Publish
- Banking: Login → View Balance → Transfer Money → Confirm Transaction
Why Map Incidents to CUJs?
Better Prioritization
Understand which incidents actually impact users vs. which are internal-only
Clear Communication
Explain impact in business terms stakeholders understand
Resource Allocation
Focus engineering effort on protecting critical paths
SLO Alignment
Define SLOs based on real user journeys, not arbitrary metrics
Step-by-Step CUJ Mapping Playbook
Identify Your CUJs
Work with product and business teams to list the 3-7 most critical user journeys. Ask:
- What actions generate revenue?
- What features do users expect to always work?
- What failures would cause users to leave?
Map System Dependencies
For each CUJ, document which services, APIs, and dependencies are involved:
- Frontend components
- Backend services and APIs
- Databases and data stores
- Third-party services
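For the e-commerce journey used as an example earlier, a dependency map might look like the sketch below. The service names are placeholders standing in for whatever your architecture actually contains.

```python
# Hypothetical service names; replace with your real components.
CUJ_DEPENDENCIES = {
    "Purchase/Checkout": [
        "web-frontend", "cart-service", "checkout-api",
        "payment-gateway", "orders-db",
    ],
    "Search/Discovery": ["web-frontend", "search-service", "search-index"],
    "Profile Management": ["web-frontend", "profile-api", "cdn", "users-db"],
}
```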
Define Success Criteria
What does "working" mean for each step in the journey?
- Response time thresholds (e.g., page load < 2s)
- Success rates (e.g., API success > 99.9%)
- Data accuracy requirements
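Success criteria are easiest to act on when they live next to your tooling as data. A sketch with made-up thresholds; the real numbers should come out of your product and SLO discussions.

```python
# Illustrative thresholds per journey step (assumed values, not recommendations).
SUCCESS_CRITERIA = {
    "checkout": {"p95_latency_ms": 2000, "success_rate": 0.999},
    "search": {"p95_latency_ms": 800, "success_rate": 0.995},
    "profile-upload": {"p95_latency_ms": 3000, "success_rate": 0.99},
}


def step_is_healthy(step: str, p95_latency_ms: float, success_rate: float) -> bool:
    """Compare observed metrics for a step against its agreed thresholds."""
    criteria = SUCCESS_CRITERIA[step]
    return (p95_latency_ms <= criteria["p95_latency_ms"]
            and success_rate >= criteria["success_rate"])


print(step_is_healthy("checkout", p95_latency_ms=1800, success_rate=0.9995))  # True
```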
Create CUJ Impact Matrix
Build a matrix showing which services impact which CUJs. This helps during incidents to quickly assess user impact.
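The matrix is essentially the dependency map from the previous sketch, inverted: given a failing service, list the journeys at risk. A minimal Python version, again with hypothetical service names:

```python
from collections import defaultdict

# Trimmed copy of the dependency map sketched earlier (hypothetical names).
CUJ_DEPENDENCIES = {
    "Purchase/Checkout": ["checkout-api", "payment-gateway", "orders-db"],
    "Search/Discovery": ["search-service", "search-index"],
    "Profile Management": ["profile-api", "cdn"],
}

# Invert CUJ -> services into service -> CUJs.
SERVICE_TO_CUJS: dict[str, list[str]] = defaultdict(list)
for cuj, services in CUJ_DEPENDENCIES.items():
    for service in services:
        SERVICE_TO_CUJS[service].append(cuj)


def impacted_cujs(failing_service: str) -> list[str]:
    """Journeys at risk when the given service is degraded."""
    return SERVICE_TO_CUJS.get(failing_service, [])


print(impacted_cujs("payment-gateway"))  # ['Purchase/Checkout']
```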
Instrument and Monitor
Set up synthetic monitoring or real user monitoring (RUM) to track each CUJ:
- End-to-end journey tests
- Key step completion rates
- Performance metrics per step
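Dedicated synthetic-monitoring products handle this at scale, but the core idea fits in a few lines. The sketch below assumes the third-party requests library is installed and uses placeholder URLs standing in for real health endpoints.

```python
import time

import requests  # third-party; install with `pip install requests`


def check_step(name: str, url: str, max_latency_s: float = 2.0) -> bool:
    """Return True if the step responds successfully within its latency budget."""
    start = time.monotonic()
    try:
        resp = requests.get(url, timeout=max_latency_s)
    except requests.RequestException:
        print(f"[CUJ] {name}: request failed")
        return False
    latency = time.monotonic() - start
    ok = resp.status_code < 400 and latency <= max_latency_s
    print(f"[CUJ] {name}: status={resp.status_code} latency={latency:.2f}s ok={ok}")
    return ok


# Placeholder URLs for each step of a checkout journey.
steps = [
    ("browse", "https://shop.example.com/products"),
    ("add-to-cart", "https://shop.example.com/cart"),
    ("checkout", "https://shop.example.com/checkout/health"),
]
results = [check_step(name, url) for name, url in steps]
print("journey healthy:", all(results))
```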
Link to Incident Response
During incidents, always identify the affected CUJs and include them in status updates:
- Which journeys are impacted?
- What percentage of users affected?
- Are there workarounds available?
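Those three questions can be answered with a short, consistent status template. A sketch with hypothetical field names and an example incident:

```python
def format_status_update(title: str, impacted_cujs: list[str],
                         pct_users_affected: float, workaround: str | None) -> str:
    """Render a stakeholder-facing update that leads with user impact."""
    return "\n".join([
        f"Incident: {title}",
        f"Impacted journeys: {', '.join(impacted_cujs) or 'none identified yet'}",
        f"Estimated users affected: {pct_users_affected:.0%}",
        f"Workaround: {workaround or 'none known at this time'}",
    ])


print(format_status_update(
    "Payment gateway timeouts",        # hypothetical incident
    ["Purchase/Checkout"],
    0.37,
    "Retry the payment after a few minutes",
))
```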
Mapping an Incident to CUJs
Map each incident to the Critical User Journeys it affects to understand the real user impact, as in the e-commerce example below.
Example: E-commerce Incident
| Incident | Affected Service | Impacted CUJs | User Impact | Severity |
|---|---|---|---|---|
| Payment gateway timeout | Payment API | Purchase/Checkout | 100% of checkout attempts fail | P0 |
| Search index lag | Search Service | Search/Discovery | Search results 1 hour stale | P2 |
| Profile image upload slow | CDN | Profile Management | Upload takes 10s instead of 2s | P3 |
Frequently Asked Questions
Q: What's the difference between P0 and P1 incidents?
A: P0 incidents affect all or most users and require immediate 24/7 response (e.g., complete outage). P1 incidents affect a significant portion of users and require response within 1 hour (e.g., critical feature down but site accessible).
Q: Should every incident have a post-mortem?
A: Not necessarily. Focus post-mortems on P0/P1 incidents, recurring issues, or incidents that reveal systemic problems. P2/P3 might only need brief incident reports.
Q: How do we make post-mortems truly blameless?
A: Focus language on systems: "the deployment process lacked safeguards" not "Bob deployed buggy code." Assume good intentions. Ask "how did the system allow this?" not "who did this?"
Q: How many CUJs should we define?
A: Start with the 3-7 most critical journeys. Too many dilute focus; too few miss important user flows. You can always add more as you mature.
Q: If multiple services are down at once, how do we prioritize?
A: Use CUJ mapping! Restore services that impact the most critical user journeys first. Revenue-generating paths typically take priority.
Q: Should we use multiple incident management tools?
A: Generally no. Pick one primary platform to avoid confusion during incidents. You might supplement with a status page tool, but keep alerting/on-call in one system.
Q: How do we reduce MTTR?
A: Focus on: better monitoring/alerting (faster detection), runbooks for common issues, automated rollback capabilities, clear escalation paths, and regular incident response practice/drills.
Q: What metrics should we track for incident management?
A: Key metrics include: MTTD (Mean Time To Detection), MTTM (Mean Time To Mitigation), MTTR (Mean Time To Recovery), incident frequency by severity, and post-mortem action item completion rate.
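All of these metrics fall out of a few timestamps per incident, so the main work is recording them consistently. A sketch, assuming your tracker exports fields like the ones below (conventions for where each interval starts vary; this is one common choice):

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    started_at: datetime    # when user impact actually began
    detected_at: datetime   # when monitoring or on-call noticed it
    mitigated_at: datetime  # when user impact ended
    resolved_at: datetime   # when the root cause was fixed


def _mean_minutes(deltas) -> float:
    return mean(d.total_seconds() / 60 for d in deltas)


def incident_metrics(incidents: list[Incident]) -> dict[str, float]:
    """Compute MTTD, MTTM, and MTTR in minutes from incident timestamps."""
    return {
        "MTTD_min": _mean_minutes(i.detected_at - i.started_at for i in incidents),
        "MTTM_min": _mean_minutes(i.mitigated_at - i.detected_at for i in incidents),
        "MTTR_min": _mean_minutes(i.resolved_at - i.started_at for i in incidents),
    }
```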
Last Updated: February 2026
Created For: SRE Teams, DevOps Engineers & Incident Responders
Status: Ready to Use