AI SLOs & Error Budgets

Defining and measuring "good enough" for systems that can never be 100% deterministic

Why AI Systems Require a Different Approach

Traditional SLOs assume binary outcomes — a request either succeeds or fails. AI systems introduce probabilistic outputs, where "correct" is often subjective and quality exists on a spectrum. This fundamentally changes how we define, measure, and budget for reliability.

⚙️ Traditional Systems

  • Deterministic: Same input → same output, every time
  • Binary success: Request succeeds (200 OK) or fails (5xx)
  • Clear thresholds: Latency < 200ms is measurable
  • Objective quality: Data is correct or incorrect
  • Stable baselines: Metrics don't drift without code changes

🤖 AI/ML Systems

  • Non-deterministic: Same input may produce varied outputs
  • Probabilistic quality: Responses can be good, acceptable, or poor
  • Subjective thresholds: "Good enough" depends on context
  • Distribution shifts: Quality can degrade as data changes
  • Model drift: Performance changes over time without code changes

💡 The Core Insight: For AI systems, you don't aim for perfection — you define what "good enough" means, measure how often you achieve it, and budget for the times you don't. The question shifts from "did it work?" to "was it good enough?"

AI Service Level Indicators (SLIs)

AI systems require new categories of SLIs beyond traditional availability and latency. Here are the key dimensions to measure:

🎯 Quality / Accuracy

Measures how often the AI output meets an acceptable quality bar. Requires human evaluation, automated scoring, or proxy metrics.

Key metrics: Human thumbs-up rate · BLEU / ROUGE scores · Task completion rate · Retrieval precision@k
Example SLO: 90% of responses rated ≥ 4/5 by users over a 28-day window
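
A rough sketch of how this SLI might be computed, assuming feedback is stored as (timestamp, rating) pairs (a hypothetical shape; adapt to your feedback store):

```python
from datetime import datetime, timedelta

def quality_sli(feedback_events, window_days=28):
    """Fraction of rated responses meeting the quality bar (rating >= 4/5)
    over a rolling window. `feedback_events` is a list of (timestamp, rating)
    tuples -- a stand-in for whatever feedback store you actually use."""
    cutoff = datetime.utcnow() - timedelta(days=window_days)
    recent = [rating for ts, rating in feedback_events if ts >= cutoff]
    if not recent:
        return None  # no signal yet; don't report a misleading 100%
    return sum(1 for rating in recent if rating >= 4) / len(recent)

# SLO check: 90% of responses rated >= 4/5 over 28 days
# sli = quality_sli(events); slo_met = sli is not None and sli >= 0.90
```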

Latency

Time-to-first-token (TTFT) and total generation time. AI latency is often higher and more variable than traditional APIs.

Key metrics: Time-to-first-token · Total response time · Tokens per second · P99 latency
Example SLO: 95% of requests return the first token within 500ms
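
A minimal sketch of instrumenting TTFT around a streaming response; the `stream` argument is a placeholder for whatever token iterator your inference client returns:

```python
import time

def measure_streaming_latency(stream):
    """Wrap a token stream and record time-to-first-token (TTFT), total
    generation time, and throughput. Works with any iterator of tokens."""
    start = time.monotonic()
    ttft = None
    tokens = 0
    for token in stream:
        if ttft is None:
            ttft = time.monotonic() - start  # first token observed
        tokens += 1
        yield token  # pass tokens through unchanged
    total = time.monotonic() - start
    if ttft is not None and total > 0:
        # In production you would emit these to your metrics backend.
        print(f"ttft={ttft:.3f}s total={total:.3f}s tok/s={tokens / total:.1f}")
```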

Availability

The proportion of requests that receive any valid response. Includes model endpoint availability and inference infrastructure uptime.

Key metrics: Successful inference rate · Model endpoint uptime · Timeout rate · Error rate (5xx)
Example SLO: 99.5% of inference requests complete without error

🛡️ Safety & Alignment

Measures how often the AI produces harmful, off-topic, or policy-violating outputs. Critical for user-facing AI products.

Key metrics: Policy violation rate · Hallucination rate · Refusal accuracy · Toxicity rate
Example SLO: < 0.1% of responses flagged by the safety classifier

📊 Freshness / Relevance

For RAG and knowledge-based systems, measures how up-to-date and contextually relevant the AI's knowledge base is.

Key metrics: Index staleness · Context hit rate · Knowledge cutoff drift · Retrieval relevance score
Example SLO: Knowledge base refreshed within 24 hours for 99% of updates

💸 Cost Efficiency

Token and compute costs per successful output. Helps balance quality improvements against infrastructure spend.

Key metrics: Cost per request · Tokens per response · Cache hit rate · Model utilisation
Example SLO: Average cost per successful completion < $0.005
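
One way to compute this SLI, assuming a hypothetical telemetry export with per-request cost and success fields:

```python
def cost_per_successful_completion(requests):
    """`requests` is an iterable of dicts with 'cost_usd' and 'success' keys
    (an illustrative shape). Spend on failed requests still counts in the
    numerator: it is money that bought no successful output."""
    total_cost = sum(r["cost_usd"] for r in requests)
    successes = sum(1 for r in requests if r["success"])
    return total_cost / successes if successes else float("inf")

# SLO check: average cost per successful completion < $0.005
# ok = cost_per_successful_completion(reqs) < 0.005
```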

Error Budgets for AI Systems

An AI error budget quantifies how much quality degradation, latency, or policy violation is acceptable before impacting the product's reliability promise. It creates a shared language between ML engineers and SREs.

The Formula

Error Budget = 100% − SLO Target
Example: SLO = 90% quality → Error Budget = 10% (up to 10% of responses may fall below the quality bar)
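
In code, the budget and its remaining fraction might look like the following sketch, where "good" counts come from whichever quality SLI you track:

```python
def error_budget(slo_target: float) -> float:
    """Error budget as a fraction: 100% minus the SLO target."""
    return 1.0 - slo_target

def budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent in the current window."""
    budget = error_budget(slo_target)  # e.g. 0.10 for a 90% SLO
    burned = (total - good) / total    # observed failure fraction
    return 1.0 - burned / budget       # 1.0 = untouched, <= 0 = blown

# 90% quality SLO, 10,000 responses, 9,400 at or above the bar:
# budget_remaining(0.90, 9_400, 10_000) -> 0.40 (40% of budget left)
```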

Key Principles for AI Error Budgets

1. Separate Budgets per SLI Dimension

Track quality, latency, and safety as independent budgets. A quality regression shouldn't consume your latency budget.
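
A sketch of what independent per-dimension budgets could look like (the dataclass shape is illustrative):

```python
from dataclasses import dataclass

@dataclass
class DimensionBudget:
    """One independent error budget per SLI dimension."""
    slo_target: float
    good: int = 0
    total: int = 0

    def remaining(self) -> float:
        if self.total == 0:
            return 1.0  # nothing observed, nothing burned
        burned = (self.total - self.good) / self.total
        return 1.0 - burned / (1.0 - self.slo_target)

# Independent budgets: a latency regression never touches the quality budget.
budgets = {
    "quality": DimensionBudget(slo_target=0.90),
    "latency": DimensionBudget(slo_target=0.95),
    "safety":  DimensionBudget(slo_target=0.999),
}
```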

2. Use Rolling Windows

AI quality can shift with model updates or data drift. A 28-day rolling window captures trends better than cumulative metrics.
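
A minimal in-memory illustration of a rolling-window SLI; a real system would compute this in your metrics backend:

```python
from collections import deque
from datetime import datetime, timedelta

class RollingSLI:
    """Rolling-window SLI: old events age out, so a bad week stops dragging
    the metric down once it leaves the window."""
    def __init__(self, window_days: int = 28):
        self.window = timedelta(days=window_days)
        self.events: deque = deque()  # (timestamp, ok) pairs

    def record(self, ok: bool, ts: datetime | None = None) -> None:
        self.events.append((ts or datetime.utcnow(), ok))

    def value(self) -> float | None:
        cutoff = datetime.utcnow() - self.window
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()  # expire events outside the window
        if not self.events:
            return None
        return sum(ok for _, ok in self.events) / len(self.events)
```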

3. Account for Evaluation Lag

Human-in-the-loop quality scores arrive after the fact. Design your error budget calculations to handle delayed signals.
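
One way to handle the lag, sketched below: attribute each label to the serving timestamp rather than the labeling timestamp, so late signals land in the correct window and recent windows are treated as provisional:

```python
from datetime import datetime

# (served_at, met_quality_bar) pairs; labels may arrive hours or days late.
labels: list = []

def ingest_label(served_at: datetime, met_bar: bool) -> None:
    """Store the label against the time the request was SERVED."""
    labels.append((served_at, met_bar))

def sli_for_window(start: datetime, end: datetime) -> float | None:
    """Recomputable at any time; the value for a recent window can still
    change as stragglers arrive, so alert conservatively on fresh windows."""
    in_window = [ok for ts, ok in labels if start <= ts < end]
    return sum(in_window) / len(in_window) if in_window else None
```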

4. Gate Model Deployments

Before deploying a new model version, project its error budget impact using offline evals. Don't ship if the projected burn would exhaust the remaining budget.
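
A hedged sketch of such a gate, assuming a 28-day budget window and that the offline eval pass rate transfers to production traffic (a strong assumption worth validating):

```python
def deployment_gate(offline_pass_rate: float, slo_target: float,
                    budget_remaining: float) -> bool:
    """Project one week of budget burn if the candidate model's offline eval
    pass rate holds in production; block the deploy if the projection would
    exhaust what's left. An illustrative heuristic, not a standard formula."""
    budget = 1.0 - slo_target                               # e.g. 0.10 for 90% SLO
    projected_burn_rate = (1.0 - offline_pass_rate) / budget
    projected_weekly_burn = projected_burn_rate * (7 / 28)  # week vs window
    return projected_weekly_burn <= budget_remaining

# Candidate passes 86% of evals against a 90% SLO with 30% budget left:
# deployment_gate(0.86, 0.90, 0.30) -> False (projected ~35% burn in a week)
```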

5. Invest Budget in Experiments

Like traditional SRE, a healthy AI error budget means you can experiment (new prompts, fine-tuning, model upgrades) with acceptable risk.

Error Budget States & Actions

Healthy (> 50% remaining)

Quality is consistently above SLO. Safe to experiment with new model versions, prompt changes, or new features.

  • Run A/B experiments
  • Deploy model updates
  • Try new prompting strategies

At Risk (10–50% remaining)

Quality is trending toward SLO breach. Limit risky changes and investigate any degradation signals.

  • Pause non-critical experiments
  • Review recent model/data changes
  • Increase evaluation sample rate

Breached (< 10% remaining)

Quality has fallen below acceptable levels. Freeze all changes and focus on restoring quality.

  • Freeze model deployments
  • Consider rollback to last stable model
  • Escalate to ML team for root cause
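
These states translate naturally into a small policy table; the thresholds below mirror the tiers above:

```python
def budget_state(remaining: float) -> str:
    """Map remaining error budget to the policy states described above."""
    if remaining > 0.50:
        return "healthy"   # experiment freely
    if remaining > 0.10:
        return "at_risk"   # pause non-critical experiments, investigate
    return "breached"      # freeze deploys, roll back, escalate

ALLOWED_ACTIONS = {
    "healthy":  {"ab_experiments", "model_updates", "prompt_changes"},
    "at_risk":  {"critical_fixes", "increased_eval_sampling"},
    "breached": {"rollback", "root_cause_investigation"},
}
```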

Alerting Strategies for AI Systems

AI alerting requires both traditional infrastructure alerts and new ML-specific signals. Here's how to build a comprehensive alerting strategy.

The Three Layers of AI Alerting

Layer 1: Infrastructure & Availability

Traditional infrastructure alerts — the foundation of AI reliability.

  • Model Endpoint Error Rate: alert when the 5xx rate > 1% over 5 min
  • Inference Latency Spike: alert when p95 latency > 2× baseline over 10 min
  • GPU / Compute Saturation: alert when GPU utilisation > 90% for 15 min
  • Queue Depth: alert when inference queue depth > 100 requests

Layer 2: Quality & Model Performance

ML-specific signals that indicate quality degradation before users notice.

  • Burn Rate Alert: alert when the error budget burn rate > 2× over a 1-hour window
  • Quality Score Drop: alert when the 7-day rolling quality score drops > 5% from baseline
  • User Negative Feedback Spike: alert when the thumbs-down rate > 3× the 7-day average
  • Output Token Anomaly: alert when average response length deviates > 3σ from baseline

Layer 3: Safety & Policy Compliance

Critical alerts for user-facing AI — policy violations require immediate response.

  • Safety Classifier Violation: page the on-call immediately for any spike above a 0.1% violation rate
  • Hallucination Rate Increase: alert when factual accuracy drops > 10% in a 1-hour window
  • Unexpected Refusal Spike: alert when the refusal rate rises > 2× normal (the model may be over-refusing)
  • PII / Sensitive Data Leak: page the on-call immediately for any detected PII in outputs

Multi-Window Burn Rate Alerting for AI

Borrowed from Google's SRE practices, burn rate alerting detects quality degradation at different speeds — fast burns for immediate outages, slow burns for gradual drift.

Alert Type     Long Window   Short Window   Burn Rate   Budget Consumed   Urgency
P0 – Critical  1 hour        5 min          14.4×       2% in 1 hour      Page Now
P1 – High      6 hours       30 min         6×          5% in 6 hours     Page Now
P2 – Medium    1 day         2 hours        3×          10% in 1 day      Ticket
P3 – Low       3 days        6 hours        1×          10% in 3 days     Monitor

Burn rate is calculated against a 28-day window. A burn rate of 14.4× means the budget would be fully consumed in ~2 days if sustained.
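
A sketch of the two-window check, where `budget` is the SLO's allowed failure fraction (e.g. 0.10 for a 90% quality SLO):

```python
def burn_rate(bad: int, total: int, budget: float) -> float:
    """Observed failure rate expressed as a multiple of the allowed rate."""
    return ((bad / total) / budget) if total else 0.0

def should_page(long_window_rate: float, short_window_rate: float,
                threshold: float) -> bool:
    """Multi-window rule: the long window proves the burn is sustained, the
    short window proves it is still happening (so resolved incidents stop
    paging). Fire only when BOTH exceed the threshold."""
    return long_window_rate >= threshold and short_window_rate >= threshold

# P0 example: 1-hour and 5-minute windows against the 14.4x threshold
# page = should_page(burn_rate(bad_1h, total_1h, 0.10),
#                    burn_rate(bad_5m, total_5m, 0.10), threshold=14.4)
```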

Detecting & Managing Model Drift

Unlike traditional software, AI systems can degrade without any code changes. Model drift occurs when the real-world data distribution diverges from training data, silently eroding quality.

📉 Data Drift

Input distributions shift over time (e.g., users ask different types of questions). The model wasn't trained on the new patterns.

Detection: Monitor statistical properties of inputs (embedding distance, feature histograms)
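
One common statistic for this is the Population Stability Index (PSI) over binned input features or embedding clusters; a minimal sketch:

```python
import math

def psi(expected: list, actual: list) -> float:
    """Population Stability Index between two normalized histograms with
    matching bins. Common rule of thumb: < 0.1 stable, 0.1-0.25 moderate
    shift, > 0.25 significant drift (thresholds vary by team)."""
    eps = 1e-6  # avoid log(0) on empty bins
    score = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)
        score += (a - e) * math.log(a / e)
    return score

# Compare this week's input histogram to the training-time baseline:
# drift = psi(baseline_hist, current_hist); investigate if drift > 0.25
```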

🎯 Concept Drift

The relationship between inputs and "correct" outputs changes (e.g., new product features change what a "good" support response looks like).

Detection: Monitor quality scores over time; compare recent samples to baseline eval set

🔄 Model Staleness

The world changes but the model doesn't. Knowledge cutoffs, outdated product information, or changed regulations can make old responses wrong.

Detection: Track freshness metrics, user correction rates, and knowledge coverage gaps

Drift Response Playbook

1. Establish a Quality Baseline

Run a curated eval set on every model deployment. This becomes your reference point for detecting drift. Store scores in your metrics platform.

2. Instrument Production Quality Proxies

Human eval is expensive and slow. Use proxy signals: user feedback rates, engagement metrics, downstream task success, or an automated evaluator model.

3. Set Drift Thresholds & Alerts

Alert when your rolling quality proxy deviates more than 2σ from the 30-day baseline. Trigger an investigation — not necessarily a rollback.
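
A minimal version of that threshold check, assuming one quality-proxy value per day:

```python
import statistics

def drift_alert(baseline_daily_scores: list, today: float,
                n_sigma: float = 2.0) -> bool:
    """Flag when today's quality proxy falls more than n_sigma below the
    baseline mean. One-sided on purpose: only degradation should alert."""
    mean = statistics.mean(baseline_daily_scores)
    stdev = statistics.stdev(baseline_daily_scores)  # needs >= 2 data points
    return today < mean - n_sigma * stdev

# thirty_days = [...]  # one quality-proxy value per day for the baseline
# if drift_alert(thirty_days, todays_score): open_investigation()
```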

4. Investigate Before Acting

Confirm drift is real (not a metric pipeline issue), identify affected request types, and determine whether a hotfix, prompt update, or model retrain is appropriate.

5. Close the Loop with Retraining or RAG Updates

For persistent drift, update your training data, fine-tuning set, or retrieval knowledge base. Validate against your eval set before deploying.

Frequently Asked Questions

Q: What SLO target should I set for AI quality?

A: Start by measuring current quality, then set a target slightly above it (e.g., if you're at 85%, target 88%). Don't start with a 99% quality SLO — AI systems are probabilistic by nature. Tighten targets over time as you understand your system better.

Q: How do I measure AI quality without human raters?

A: Use proxy metrics: user satisfaction signals (thumbs up/down, session continuation, task completion), automated evaluator models (LLM-as-judge), downstream task metrics (did the user need to retry?), or statistical measures like embedding similarity to ideal responses.

Q: Should AI quality SLOs be in my SLA with customers?

A: Treat quality SLOs as internal commitments first. SLAs typically cover availability and latency. If you include AI quality in an SLA, use conservative targets (e.g., 80% quality) with clear, objective measurement criteria that customers can verify.

Q: How often should I re-evaluate my AI SLO targets?

A: Review AI SLOs quarterly or after every major model update. Unlike traditional services, AI quality can improve significantly with a model upgrade, so your baseline shifts. Don't lock in SLO targets so tightly that they prevent you from upgrading models.

Q: How do I handle the latency variance in LLM responses?

A: Set separate SLOs for time-to-first-token (TTFT) and total completion time. Use streaming responses to improve perceived latency. Alert on p95 or p99 latency rather than average — long tail latency is where user experience breaks down.

Q: What's the difference between AI error budgets and traditional error budgets?

A: Traditional error budgets track binary failures (requests that return errors). AI error budgets additionally track quality failures (requests that return responses that are below the quality bar). You need both: an availability budget and a quality budget, managed independently.

Q: How do I alert on gradual model drift without too many false positives?

A: Use multi-window burn rate alerting (fast + slow windows). For gradual drift, a slow burn alert (e.g., 3× burn rate over 3 days) catches trends early without firing on normal variation. Combine with statistical process control (SPC) charts for ongoing monitoring.

Q: Can I apply SLOs to generative AI outputs that are inherently subjective?

A: Yes — the key is defining a clear evaluation rubric upfront. Decide what "good enough" means for your use case (e.g., "contains all required information", "tone matches brand guidelines", "no factual errors on verifiable claims") and operationalise it as a measurable SLI.