CUJ → SLI → SLO → Error Budget
See how Critical User Journeys drive your reliability pipeline — from measurement to budget
🔗 The Reliability Pipeline
Every reliability practice starts with understanding your users. Here's how the pipeline connects — from what users care about, to how you measure it, to the targets you set, and the budget you earn.
👤 What is a Critical User Journey?
A CUJ is a sequence of steps a user takes to accomplish a key goal. It defines what matters for your users — and therefore for your service reliability.
How to Identify CUJs
- Ask: What actions must users always be able to complete?
- Focus: On the most critical, high-frequency user paths
- Validate: With product, business, and user research
- Prioritize: By business impact, not technical complexity
Good CUJ Examples
- ✅ User can search and book a flight
- ✅ User can complete airport check-in
- ✅ User receives real-time flight updates
- ❌ Too broad: "User can use the app"
- ❌ Too narrow: "User can sort results by price"
Why CUJs Drive Everything
Without a CUJ, SLIs are just metrics. With a CUJ, they become user-meaningful signals that tell you whether users are succeeding at what they came to do.
CUJs ensure your reliability work targets what actually matters to the business and its customers.
✈ Interactive Demo: Airport Reliability Pipeline
Select a Critical User Journey below to see how it drives SLI measurement, SLO targets, and Error Budget management.
Pick a Critical User Journey
Select a journey above to begin exploring the reliability pipeline.
SLI — Measuring What Matters
SLO — Our Service Promise
Error Budget — How Much Can We Fail?
💡 Team Decision
📊 Understanding SLIs
A Service Level Indicator is a quantitative measure of your service's behaviour from the user's perspective. It answers: "How well is the service performing right now?"
Common SLI Types
Availability
Is the service up and responding?
successful_requests / total_requests
Latency
Is the service responding fast enough?
requests_under_threshold / total_requests
Throughput
Is the service successfully processing the work submitted to it?
successful_jobs / total_jobs_submitted
Freshness
Is the data current enough?
updates_within_window / expected_updates
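The four ratio formulas above can be sketched as one small function. This is a minimal illustration, not a production pipeline; the counter values are hypothetical numbers for a single measurement window.

```python
def sli_ratio(good: int, total: int) -> float:
    """Return a bounded 0-100% SLI; define it as 100% when there is no traffic."""
    return 100.0 if total == 0 else 100.0 * good / total

# Hypothetical counts scraped from monitoring for one window
availability = sli_ratio(good=99_620, total=100_000)  # successful_requests / total_requests
latency      = sli_ratio(good=99_120, total=100_000)  # requests_under_threshold / total_requests
throughput   = sli_ratio(good=4_970,  total=5_000)    # successful_jobs / total_jobs_submitted
freshness    = sli_ratio(good=1_430,  total=1_440)    # updates_within_window / expected_updates

print(f"availability={availability:.2f}%  latency={latency:.2f}%")
print(f"throughput={throughput:.2f}%  freshness={freshness:.2f}%")
```

Note the zero-traffic guard: treating an empty window as 100% keeps the SLI bounded and avoids a division-by-zero alert at quiet hours, though some teams prefer to report "no data" instead.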
Good SLI Design Principles
- User-focused: Measure what users experience, not what's easy to instrument
- Ratio-based: Express as a proportion (e.g., 99.5%) not raw counts
- Bounded: Should have a clear 0–100% range
- Actionable: A change in the SLI should trigger a meaningful response
- Aligned to CUJ: Each SLI should map directly to a user goal
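To make the "ratio-based" and "bounded" principles concrete, here is a hypothetical comparison of a raw error count against the same data expressed as a proportion:

```python
# Same window, two presentations (hypothetical numbers):
errors_raw = 380                # raw count: meaningless without knowing total traffic
good, total = 99_620, 100_000
sli = 100.0 * good / total      # ratio: bounded 0-100%, comparable across windows and services

print(f"'{errors_raw} errors' says little on its own")
print(f"SLI = {sli:.2f}% compares directly against a 99.5% SLO target")
```

The raw count doubles when traffic doubles even if reliability is unchanged; the ratio stays stable, which is why SLOs are set against proportions.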
🔧 Putting It All Together
Here's the complete pipeline in action — from user need to engineering decision.
Identify the CUJ
"Users must be able to complete a flight booking in under 3 minutes"
Define the SLI
Booking Success Rate = successful_bookings / total_attempts × 100
Set the SLO
Booking Success Rate ≥ 99.5% measured over a 30-day rolling window
Calculate the Error Budget
(1 − 0.995) × 720h = 3.6 hours of allowed downtime per month
Make Engineering Decisions
Budget healthy? Deploy freely. Budget low? Stabilise. Budget exhausted? Stop, fix, review.
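The walkthrough above can be sketched end to end as a small budget-driven deploy gate. The SLO and window come from the worked example; the 25% "low budget" threshold and the downtime figures are assumptions for illustration, since the article does not specify where "healthy" ends and "low" begins.

```python
def error_budget_hours(slo: float, window_hours: float = 30 * 24) -> float:
    """Allowed 'bad' time in the window: (1 - SLO) x window."""
    return (1.0 - slo) * window_hours

def deploy_policy(budget_hours: float, downtime_hours: float) -> str:
    """Map remaining budget to the team decisions listed above."""
    remaining = budget_hours - downtime_hours
    if remaining <= 0:
        return "stop, fix, review"       # budget exhausted
    if remaining < 0.25 * budget_hours:  # "low" threshold: an assumed policy choice
        return "stabilise"
    return "deploy freely"

budget = error_budget_hours(slo=0.995)   # 3.6 hours over a 30-day window
print(f"budget={budget:.1f}h, 1.0h spent -> {deploy_policy(budget, downtime_hours=1.0)}")
print(f"budget={budget:.1f}h, 3.0h spent -> {deploy_policy(budget, downtime_hours=3.0)}")
```

In practice the thresholds and the consequences of exhaustion belong in a written error-budget policy agreed with the team, not hard-coded in a script; the sketch only shows how the arithmetic feeds the decision.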