CUJ → SLI → SLO → Error Budget
See how Critical User Journeys drive your reliability pipeline — from measurement to budget
🔗 The Reliability Pipeline
Every reliability practice starts with understanding your users. Here's how the pipeline connects — from what users care about, to how you measure it, to the targets you set, and the budget you earn.
👤 What is a Critical User Journey?
A CUJ is a sequence of steps a user takes to accomplish a key goal. It defines what matters for your users — and therefore for your service reliability.
How to Identify CUJs
- Ask: What actions must users always be able to complete?
- Focus: On the most critical, high-frequency user paths
- Validate: With product, business, and user research
- Prioritize: By business impact, not technical complexity
Good CUJ Examples
- ✅ User can search and book a flight
- ✅ User can complete airport check-in
- ✅ User receives real-time flight updates
- ❌ Too broad: "User can use the app"
- ❌ Too narrow: "User can sort results by price"
Why CUJs Drive Everything
Without a CUJ, SLIs are just metrics. With a CUJ, they become user-meaningful signals that tell you whether users are succeeding at what they came to do.
CUJs ensure your reliability work targets what actually matters to the business and its customers.
✈ Interactive Demo: Airport Reliability Pipeline
Select a Critical User Journey below to see how it drives SLI measurement, SLO targets, and Error Budget management.
Pick a Critical User Journey
Select a journey above to begin exploring the reliability pipeline.
SLI — Measuring What Matters
SLO — Our Service Promise
Error Budget — How Much Can We Fail?
💡 Team Decision
📊 Understanding SLIs
A Service Level Indicator is a quantitative measure of your service's behaviour from the user's perspective. It answers: "How well is the service performing right now?"
Common SLI Types
Availability
Is the service up and responding?
successful_requests / total_requests
Latency
Is the service responding fast enough?
requests_under_threshold / total_requests
Throughput
Is the service successfully processing the work submitted to it?
successful_jobs / total_jobs_submitted
Freshness
Is the data current enough?
updates_within_window / expected_updates
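The four ratio formulas above can be sketched as one small function. This is a minimal illustration, not a production pipeline; the counter values are hypothetical numbers for a single measurement window.

```python
def sli_ratio(good: int, total: int) -> float:
    """Return a bounded 0-100% SLI; define it as 100% when there is no traffic."""
    return 100.0 if total == 0 else 100.0 * good / total

# Hypothetical counts scraped from monitoring for one window
availability = sli_ratio(good=99_620, total=100_000)  # successful_requests / total_requests
latency      = sli_ratio(good=99_120, total=100_000)  # requests_under_threshold / total_requests
throughput   = sli_ratio(good=4_970,  total=5_000)    # successful_jobs / total_jobs_submitted
freshness    = sli_ratio(good=1_430,  total=1_440)    # updates_within_window / expected_updates

print(f"availability={availability:.2f}%  latency={latency:.2f}%")
print(f"throughput={throughput:.2f}%  freshness={freshness:.2f}%")
```

Note the zero-traffic guard: treating an empty window as 100% keeps the SLI bounded and avoids a division-by-zero alert at quiet hours, though some teams prefer to report "no data" instead.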
Good SLI Design Principles
- User-focused: Measure what users experience, not what's easy to instrument
- Ratio-based: Express as a proportion (e.g., 99.5%) not raw counts
- Bounded: Should have a clear 0–100% range
- Actionable: A change in the SLI should trigger a meaningful response
- Aligned to CUJ: Each SLI should map directly to a user goal
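To make the "ratio-based" and "bounded" principles concrete, here is a hypothetical comparison of a raw error count against the same data expressed as a proportion:

```python
# Same window, two presentations (hypothetical numbers):
errors_raw = 380                # raw count: meaningless without knowing total traffic
good, total = 99_620, 100_000
sli = 100.0 * good / total      # ratio: bounded 0-100%, comparable across windows and services

print(f"'{errors_raw} errors' says little on its own")
print(f"SLI = {sli:.2f}% compares directly against a 99.5% SLO target")
```

The raw count doubles when traffic doubles even if reliability is unchanged; the ratio stays stable, which is why SLOs are set against proportions.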
🔧 Putting It All Together
Here's the complete pipeline in action — from user need to engineering decision.
Identify the CUJ
"Users must be able to complete a flight booking in under 3 minutes"
Define the SLI
Booking Success Rate = successful_bookings / total_attempts × 100
Set the SLO
Booking Success Rate ≥ 99.5% measured over a 30-day rolling window
Calculate the Error Budget
(1 − 0.995) × 720h = 3.6 hours of allowed downtime per month
Make Engineering Decisions
Budget healthy? Deploy freely. Budget low? Stabilise. Budget exhausted? Stop, fix, review.
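The walkthrough above can be sketched end to end as a small budget-driven deploy gate. The SLO and window come from the worked example; the 25% "low budget" threshold and the downtime figures are assumptions for illustration, since the article does not specify where "healthy" ends and "low" begins.

```python
def error_budget_hours(slo: float, window_hours: float = 30 * 24) -> float:
    """Allowed 'bad' time in the window: (1 - SLO) x window."""
    return (1.0 - slo) * window_hours

def deploy_policy(budget_hours: float, downtime_hours: float) -> str:
    """Map remaining budget to the team decisions listed above."""
    remaining = budget_hours - downtime_hours
    if remaining <= 0:
        return "stop, fix, review"       # budget exhausted
    if remaining < 0.25 * budget_hours:  # "low" threshold: an assumed policy choice
        return "stabilise"
    return "deploy freely"

budget = error_budget_hours(slo=0.995)   # 3.6 hours over a 30-day window
print(f"budget={budget:.1f}h, 1.0h spent -> {deploy_policy(budget, downtime_hours=1.0)}")
print(f"budget={budget:.1f}h, 3.0h spent -> {deploy_policy(budget, downtime_hours=3.0)}")
```

In practice the thresholds and the consequences of exhaustion belong in a written error-budget policy agreed with the team, not hard-coded in a script; the sketch only shows how the arithmetic feeds the decision.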