AI SLOs & Error Budgets

Defining and measuring "good enough" for systems that can never be 100% deterministic

Why AI Systems Require a Different Approach

Traditional SLOs assume binary outcomes — a request either succeeds or fails. AI systems introduce probabilistic outputs, where "correct" is often subjective and quality exists on a spectrum. This fundamentally changes how we define, measure, and budget for reliability.

⚙️ Traditional Systems

  • Deterministic: Same input → same output, every time
  • Binary success: Request succeeds (200 OK) or fails (5xx)
  • Clear thresholds: Latency < 200ms is measurable
  • Objective quality: Data is correct or incorrect
  • Stable baselines: Metrics don't drift without code changes

🤖 AI/ML Systems

  • Non-deterministic: Same input may produce varied outputs
  • Probabilistic quality: Responses can be good, acceptable, or poor
  • Subjective thresholds: "Good enough" depends on context
  • Distribution shifts: Quality can degrade as data changes
  • Model drift: Performance changes over time without code changes

💡 The Core Insight: For AI systems, you don't aim for perfection — you define what "good enough" means, measure how often you achieve it, and budget for the times you don't. The question shifts from "did it work?" to "was it good enough?"

AI Service Level Indicators (SLIs)

AI systems require new categories of SLIs beyond traditional availability and latency. Here are the key dimensions to measure:

🎯 Quality / Accuracy

Measures how often the AI output meets an acceptable quality bar. Requires human evaluation, automated scoring, or proxy metrics.

Key metrics: Human thumbs-up rate · BLEU / ROUGE scores · Task completion rate · Retrieval precision@k
Example SLO: 90% of responses rated ≥ 4/5 by users over a 28-day window
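
A rough sketch of how this SLI might be computed, assuming feedback is stored as (timestamp, rating) pairs (a hypothetical shape; adapt to your feedback store):

```python
from datetime import datetime, timedelta

def quality_sli(feedback_events, window_days=28):
    """Fraction of rated responses meeting the quality bar (rating >= 4/5)
    over a rolling window. `feedback_events` is a list of (timestamp, rating)
    tuples -- a stand-in for whatever feedback store you actually use."""
    cutoff = datetime.utcnow() - timedelta(days=window_days)
    recent = [rating for ts, rating in feedback_events if ts >= cutoff]
    if not recent:
        return None  # no signal yet; don't report a misleading 100%
    return sum(1 for rating in recent if rating >= 4) / len(recent)

# SLO check: 90% of responses rated >= 4/5 over 28 days
# sli = quality_sli(events); slo_met = sli is not None and sli >= 0.90
```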

Latency

Time-to-first-token (TTFT) and total generation time. AI latency is often higher and more variable than traditional APIs.

Key metrics: Time-to-first-token · Total response time · Tokens per second · P99 latency
Example SLO: 95% of requests return the first token within 500ms
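
A minimal sketch of instrumenting TTFT around a streaming response; the `stream` argument is a placeholder for whatever token iterator your inference client returns:

```python
import time

def measure_streaming_latency(stream):
    """Wrap a token stream and record time-to-first-token (TTFT), total
    generation time, and throughput. Works with any iterator of tokens."""
    start = time.monotonic()
    ttft = None
    tokens = 0
    for token in stream:
        if ttft is None:
            ttft = time.monotonic() - start  # first token observed
        tokens += 1
        yield token  # pass tokens through unchanged
    total = time.monotonic() - start
    if ttft is not None and total > 0:
        # In production you would emit these to your metrics backend.
        print(f"ttft={ttft:.3f}s total={total:.3f}s tok/s={tokens / total:.1f}")
```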

Availability

The proportion of requests that receive any valid response. Includes model endpoint availability and inference infrastructure uptime.

Key metrics: Successful inference rate · Model endpoint uptime · Timeout rate · Error rate (5xx)
Example SLO: 99.5% of inference requests complete without error

🛡️ Safety & Alignment

Measures how often the AI produces harmful, off-topic, or policy-violating outputs. Critical for user-facing AI products.

Key metrics: Policy violation rate · Hallucination rate · Refusal accuracy · Toxicity rate
Example SLO: < 0.1% of responses flagged by the safety classifier

📊 Freshness / Relevance

For RAG and knowledge-based systems, measures how up-to-date and contextually relevant the AI's knowledge base is.

Key metrics: Index staleness · Context hit rate · Knowledge cutoff drift · Retrieval relevance score
Example SLO: Knowledge base refreshed within 24 hours for 99% of updates

💸 Cost Efficiency

Token and compute costs per successful output. Helps balance quality improvements against infrastructure spend.

Key metrics: Cost per request · Tokens per response · Cache hit rate · Model utilisation
Example SLO: Average cost per successful completion < $0.005
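
One way to compute this SLI, assuming a hypothetical telemetry export with per-request cost and success fields:

```python
def cost_per_successful_completion(requests):
    """`requests` is an iterable of dicts with 'cost_usd' and 'success' keys
    (an illustrative shape). Spend on failed requests still counts in the
    numerator: it is money that bought no successful output."""
    total_cost = sum(r["cost_usd"] for r in requests)
    successes = sum(1 for r in requests if r["success"])
    return total_cost / successes if successes else float("inf")

# SLO check: average cost per successful completion < $0.005
# ok = cost_per_successful_completion(reqs) < 0.005
```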

Error Budgets for AI Systems

An AI error budget quantifies how much quality degradation, latency, or policy violation is acceptable before impacting the product's reliability promise. It creates a shared language between ML engineers and SREs.

The Formula

Error Budget = 100% − SLO Target
Example: SLO = 90% quality → Error Budget = 10% (up to 10% of responses may fall below the quality bar)
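
In code, the budget and its remaining fraction might look like the following sketch, where "good" counts come from whichever quality SLI you track:

```python
def error_budget(slo_target: float) -> float:
    """Error budget as a fraction: 100% minus the SLO target."""
    return 1.0 - slo_target

def budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent in the current window."""
    budget = error_budget(slo_target)  # e.g. 0.10 for a 90% SLO
    burned = (total - good) / total    # observed failure fraction
    return 1.0 - burned / budget       # 1.0 = untouched, <= 0 = blown

# 90% quality SLO, 10,000 responses, 9,400 at or above the bar:
# budget_remaining(0.90, 9_400, 10_000) -> 0.40 (40% of budget left)
```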

Key Principles for AI Error Budgets

1. Separate Budgets per SLI Dimension

Track quality, latency, and safety as independent budgets. A quality regression shouldn't consume your latency budget.
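
A sketch of what independent per-dimension budgets could look like (the dataclass shape is illustrative):

```python
from dataclasses import dataclass

@dataclass
class DimensionBudget:
    """One independent error budget per SLI dimension."""
    slo_target: float
    good: int = 0
    total: int = 0

    def remaining(self) -> float:
        if self.total == 0:
            return 1.0  # nothing observed, nothing burned
        burned = (self.total - self.good) / self.total
        return 1.0 - burned / (1.0 - self.slo_target)

# Independent budgets: a latency regression never touches the quality budget.
budgets = {
    "quality": DimensionBudget(slo_target=0.90),
    "latency": DimensionBudget(slo_target=0.95),
    "safety":  DimensionBudget(slo_target=0.999),
}
```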

2. Use Rolling Windows

AI quality can shift with model updates or data drift. A 28-day rolling window captures trends better than cumulative metrics.
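
A minimal in-memory illustration of a rolling-window SLI; a real system would compute this in your metrics backend:

```python
from collections import deque
from datetime import datetime, timedelta

class RollingSLI:
    """Rolling-window SLI: old events age out, so a bad week stops dragging
    the metric down once it leaves the window."""
    def __init__(self, window_days: int = 28):
        self.window = timedelta(days=window_days)
        self.events: deque = deque()  # (timestamp, ok) pairs

    def record(self, ok: bool, ts: datetime | None = None) -> None:
        self.events.append((ts or datetime.utcnow(), ok))

    def value(self) -> float | None:
        cutoff = datetime.utcnow() - self.window
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()  # expire events outside the window
        if not self.events:
            return None
        return sum(ok for _, ok in self.events) / len(self.events)
```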

3. Account for Evaluation Lag

Human-in-the-loop quality scores arrive after the fact. Design your error budget calculations to handle delayed signals.
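
One way to handle the lag, sketched below: attribute each label to the serving timestamp rather than the labeling timestamp, so late signals land in the correct window and recent windows are treated as provisional:

```python
from datetime import datetime

# (served_at, met_quality_bar) pairs; labels may arrive hours or days late.
labels: list = []

def ingest_label(served_at: datetime, met_bar: bool) -> None:
    """Store the label against the time the request was SERVED."""
    labels.append((served_at, met_bar))

def sli_for_window(start: datetime, end: datetime) -> float | None:
    """Recomputable at any time; the value for a recent window can still
    change as stragglers arrive, so alert conservatively on fresh windows."""
    in_window = [ok for ts, ok in labels if start <= ts < end]
    return sum(in_window) / len(in_window) if in_window else None
```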

4. Gate Model Deployments

Before deploying a new model version, project its error budget impact using offline evals. Don't ship if the projected burn would exhaust the remaining budget.
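
A hedged sketch of such a gate, assuming a 28-day budget window and that the offline eval pass rate transfers to production traffic (a strong assumption worth validating):

```python
def deployment_gate(offline_pass_rate: float, slo_target: float,
                    budget_remaining: float) -> bool:
    """Project one week of budget burn if the candidate model's offline eval
    pass rate holds in production; block the deploy if the projection would
    exhaust what's left. An illustrative heuristic, not a standard formula."""
    budget = 1.0 - slo_target                               # e.g. 0.10 for 90% SLO
    projected_burn_rate = (1.0 - offline_pass_rate) / budget
    projected_weekly_burn = projected_burn_rate * (7 / 28)  # week vs window
    return projected_weekly_burn <= budget_remaining

# Candidate passes 86% of evals against a 90% SLO with 30% budget left:
# deployment_gate(0.86, 0.90, 0.30) -> False (projected ~35% burn in a week)
```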

5. Invest Budget in Experiments

Like traditional SRE, a healthy AI error budget means you can experiment (new prompts, fine-tuning, model upgrades) with acceptable risk.

Error Budget States & Actions

Healthy (> 50% remaining)

Quality is consistently above SLO. Safe to experiment with new model versions, prompt changes, or new features.

  • Run A/B experiments
  • Deploy model updates
  • Try new prompting strategies

At Risk (10–50% remaining)

Quality is trending toward SLO breach. Limit risky changes and investigate any degradation signals.

  • Pause non-critical experiments
  • Review recent model/data changes
  • Increase evaluation sample rate

Breached (< 10% remaining)

Quality has fallen below acceptable levels. Freeze all changes and focus on restoring quality.

  • Freeze model deployments
  • Consider rollback to last stable model
  • Escalate to ML team for root cause
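
These states translate naturally into a small policy table; the thresholds below mirror the tiers above:

```python
def budget_state(remaining: float) -> str:
    """Map remaining error budget to the policy states described above."""
    if remaining > 0.50:
        return "healthy"   # experiment freely
    if remaining > 0.10:
        return "at_risk"   # pause non-critical experiments, investigate
    return "breached"      # freeze deploys, roll back, escalate

ALLOWED_ACTIONS = {
    "healthy":  {"ab_experiments", "model_updates", "prompt_changes"},
    "at_risk":  {"critical_fixes", "increased_eval_sampling"},
    "breached": {"rollback", "root_cause_investigation"},
}
```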

Alerting Strategies for AI Systems

AI alerting requires both traditional infrastructure alerts and new ML-specific signals. Here's how to build a comprehensive alerting strategy.

The Three Layers of AI Alerting

Layer 1: Infrastructure & Availability

Traditional infrastructure alerts — the foundation of AI reliability.

  • Model Endpoint Error Rate: alert when the 5xx rate > 1% over 5 min
  • Inference Latency Spike: alert when p95 latency > 2× baseline over 10 min
  • GPU / Compute Saturation: alert when GPU utilisation > 90% for 15 min
  • Queue Depth: alert when inference queue depth > 100 requests

Layer 2: Quality & Model Performance

ML-specific signals that indicate quality degradation before users notice.

  • Burn Rate Alert: alert when the error budget burn rate > 2× over a 1-hour window
  • Quality Score Drop: alert when the 7-day rolling quality score drops > 5% from baseline
  • User Negative Feedback Spike: alert when the thumbs-down rate > 3× the 7-day average
  • Output Token Anomaly: alert when average response length deviates > 3σ from baseline

Layer 3: Safety & Policy Compliance

Critical alerts for user-facing AI — policy violations require immediate response.

  • Safety Classifier Violation: page the on-call immediately for any spike above a 0.1% violation rate
  • Hallucination Rate Increase: alert when factual accuracy drops > 10% in a 1-hour window
  • Unexpected Refusal Spike: alert when the refusal rate rises > 2× normal (the model may be over-refusing)
  • PII / Sensitive Data Leak: page the on-call immediately for any detected PII in outputs

Multi-Window Burn Rate Alerting for AI

Borrowed from Google's SRE practices, burn rate alerting detects quality degradation at different speeds — fast burns for immediate outages, slow burns for gradual drift.

Alert Type     Long Window   Short Window   Burn Rate   Budget Consumed   Urgency
P0 – Critical  1 hour        5 min          14.4×       2% in 1 hour      Page Now
P1 – High      6 hours       30 min         6×          5% in 6 hours     Page Now
P2 – Medium    1 day         2 hours        3×          10% in 1 day      Ticket
P3 – Low       3 days        6 hours        1×          10% in 3 days     Monitor

Burn rate is calculated against a 28-day window. A burn rate of 14.4× means the budget would be fully consumed in ~2 days if sustained.
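
A sketch of the two-window check, where `budget` is the SLO's allowed failure fraction (e.g. 0.10 for a 90% quality SLO):

```python
def burn_rate(bad: int, total: int, budget: float) -> float:
    """Observed failure rate expressed as a multiple of the allowed rate."""
    return ((bad / total) / budget) if total else 0.0

def should_page(long_window_rate: float, short_window_rate: float,
                threshold: float) -> bool:
    """Multi-window rule: the long window proves the burn is sustained, the
    short window proves it is still happening (so resolved incidents stop
    paging). Fire only when BOTH exceed the threshold."""
    return long_window_rate >= threshold and short_window_rate >= threshold

# P0 example: 1-hour and 5-minute windows against the 14.4x threshold
# page = should_page(burn_rate(bad_1h, total_1h, 0.10),
#                    burn_rate(bad_5m, total_5m, 0.10), threshold=14.4)
```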

Detecting & Managing Model Drift

Unlike traditional software, AI systems can degrade without any code changes. Model drift occurs when the real-world data distribution diverges from training data, silently eroding quality.

📉 Data Drift

Input distributions shift over time (e.g., users ask different types of questions). The model wasn't trained on the new patterns.

Detection: Monitor statistical properties of inputs (embedding distance, feature histograms)
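
One common statistic for this is the Population Stability Index (PSI) over binned input features or embedding clusters; a minimal sketch:

```python
import math

def psi(expected: list, actual: list) -> float:
    """Population Stability Index between two normalized histograms with
    matching bins. Common rule of thumb: < 0.1 stable, 0.1-0.25 moderate
    shift, > 0.25 significant drift (thresholds vary by team)."""
    eps = 1e-6  # avoid log(0) on empty bins
    score = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)
        score += (a - e) * math.log(a / e)
    return score

# Compare this week's input histogram to the training-time baseline:
# drift = psi(baseline_hist, current_hist); investigate if drift > 0.25
```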

🎯 Concept Drift

The relationship between inputs and "correct" outputs changes (e.g., new product features change what a "good" support response looks like).

Detection: Monitor quality scores over time; compare recent samples to baseline eval set

🔄 Model Staleness

The world changes but the model doesn't. Knowledge cutoffs, outdated product information, or changed regulations can make old responses wrong.

Detection: Track freshness metrics, user correction rates, and knowledge coverage gaps

Drift Response Playbook

1. Establish a Quality Baseline

Run a curated eval set on every model deployment. This becomes your reference point for detecting drift. Store scores in your metrics platform.

2. Instrument Production Quality Proxies

Human eval is expensive and slow. Use proxy signals: user feedback rates, engagement metrics, downstream task success, or an automated evaluator model.

3. Set Drift Thresholds & Alerts

Alert when your rolling quality proxy deviates more than 2σ from the 30-day baseline. Trigger an investigation — not necessarily a rollback.
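
A minimal version of that threshold check, assuming one quality-proxy value per day:

```python
import statistics

def drift_alert(baseline_daily_scores: list, today: float,
                n_sigma: float = 2.0) -> bool:
    """Flag when today's quality proxy falls more than n_sigma below the
    baseline mean. One-sided on purpose: only degradation should alert."""
    mean = statistics.mean(baseline_daily_scores)
    stdev = statistics.stdev(baseline_daily_scores)  # needs >= 2 data points
    return today < mean - n_sigma * stdev

# thirty_days = [...]  # one quality-proxy value per day for the baseline
# if drift_alert(thirty_days, todays_score): open_investigation()
```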

4. Investigate Before Acting

Confirm drift is real (not a metric pipeline issue), identify affected request types, and determine whether a hotfix, prompt update, or model retrain is appropriate.

5. Close the Loop with Retraining or RAG Updates

For persistent drift, update your training data, fine-tuning set, or retrieval knowledge base. Validate against your eval set before deploying.

Frequently Asked Questions

Q: What SLO target should I set for AI quality?

A: Start by measuring current quality, then set a target slightly above it (e.g., if you're at 85%, target 88%). Don't start with a 99% quality SLO — AI systems are probabilistic by nature. Tighten targets over time as you understand your system better.

Q: How do I measure AI quality without human raters?

A: Use proxy metrics: user satisfaction signals (thumbs up/down, session continuation, task completion), automated evaluator models (LLM-as-judge), downstream task metrics (did the user need to retry?), or statistical measures like embedding similarity to ideal responses.

Q: Should AI quality SLOs be in my SLA with customers?

A: Treat quality SLOs as internal commitments first. SLAs typically cover availability and latency. If you include AI quality in an SLA, use conservative targets (e.g., 80% quality) with clear, objective measurement criteria that customers can verify.

Q: How often should I re-evaluate my AI SLO targets?

A: Review AI SLOs quarterly or after every major model update. Unlike traditional services, AI quality can improve significantly with a model upgrade, so your baseline shifts. Don't lock in SLO targets so tightly that they prevent you from upgrading models.

Q: How do I handle the latency variance in LLM responses?

A: Set separate SLOs for time-to-first-token (TTFT) and total completion time. Use streaming responses to improve perceived latency. Alert on p95 or p99 latency rather than average — long tail latency is where user experience breaks down.

Q: What's the difference between AI error budgets and traditional error budgets?

A: Traditional error budgets track binary failures (requests that return errors). AI error budgets additionally track quality failures (requests that return responses that are below the quality bar). You need both: an availability budget and a quality budget, managed independently.

Q: How do I alert on gradual model drift without too many false positives?

A: Use multi-window burn rate alerting (fast + slow windows). For gradual drift, a slow burn alert (e.g., 3× burn rate over 3 days) catches trends early without firing on normal variation. Combine with statistical process control (SPC) charts for ongoing monitoring.

Q: Can I apply SLOs to generative AI outputs that are inherently subjective?

A: Yes — the key is defining a clear evaluation rubric upfront. Decide what "good enough" means for your use case (e.g., "contains all required information", "tone matches brand guidelines", "no factual errors on verifiable claims") and operationalise it as a measurable SLI.