AI SLOs & Error Budgets
Defining and measuring "good enough" for systems that can never be 100% deterministic
Why AI Systems Require a Different Approach
Traditional SLOs assume binary outcomes — a request either succeeds or fails. AI systems introduce probabilistic outputs, where "correct" is often subjective and quality exists on a spectrum. This fundamentally changes how we define, measure, and budget for reliability.
Traditional Systems
- Deterministic: Same input → same output, every time
- Binary success: Request succeeds (200 OK) or fails (5xx)
- Clear thresholds: Latency < 200ms is measurable
- Objective quality: Data is correct or incorrect
- Stable baselines: Metrics don't drift without code changes
AI/ML Systems
- Non-deterministic: Same input may produce varied outputs
- Probabilistic quality: Responses can be good, acceptable, or poor
- Subjective thresholds: "Good enough" depends on context
- Distribution shifts: Quality can degrade as data changes
- Model drift: Performance changes over time without code changes
AI Service Level Indicators (SLIs)
AI systems require new categories of SLIs beyond traditional availability and latency. Here are the key dimensions to measure:
Quality / Accuracy
Measures how often the AI output meets an acceptable quality bar. Requires human evaluation, automated scoring, or proxy metrics.
Latency
Time-to-first-token (TTFT) and total generation time. AI latency is often higher and more variable than traditional APIs.
Availability
The proportion of requests that receive any valid response. Includes model endpoint availability and inference infrastructure uptime.
Safety & Alignment
Measures how often the AI produces harmful, off-topic, or policy-violating outputs. Critical for user-facing AI products.
Freshness / Relevance
For RAG and knowledge-based systems, measures how up-to-date and contextually relevant the AI's knowledge base is.
Cost Efficiency
Token and compute costs per successful output. Helps balance quality improvements against infrastructure spend.
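These dimensions can be rolled up from raw request logs. A minimal sketch in Python, assuming a hypothetical per-request record — the field names are illustrative and should be mapped onto your own telemetry:

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    # Hypothetical per-request log fields; adapt names to your own telemetry.
    ok: bool              # endpoint returned a valid response
    quality_score: float  # 0..1 from human eval, LLM-as-judge, or proxy signal
    safe: bool            # passed policy/safety checks
    cost_usd: float       # token + compute cost for the request

def compute_slis(records: list[RequestRecord], quality_bar: float = 0.7) -> dict:
    """Roll raw request logs up into the SLI dimensions listed above."""
    n = len(records)
    ok = [r for r in records if r.ok]
    good = [r for r in ok if r.quality_score >= quality_bar]
    return {
        "availability": len(ok) / n,
        "quality": len(good) / len(ok) if ok else 0.0,
        "safety": sum(r.safe for r in ok) / len(ok) if ok else 1.0,
        "cost_per_good_output_usd": sum(r.cost_usd for r in records) / max(len(good), 1),
    }
```

Note that quality and safety are computed over responses that arrived at all — a failed request consumes the availability budget, not the quality budget.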
Error Budgets for AI Systems
An AI error budget quantifies how much quality degradation, latency, or policy violation is acceptable before impacting the product's reliability promise. It creates a shared language between ML engineers and SREs.
The Formula
Error budget = 100% − SLO target. A 95% quality SLO leaves a 5% error budget: over a 28-day window of 1,000,000 requests, up to 50,000 responses may fall below the quality bar before the SLO is breached.
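The budget is the complement of the SLO target (error budget = 1 − SLO). A small illustration of the arithmetic:

```python
def error_budget(slo_target: float, window_requests: int) -> dict:
    """Error budget = 1 - SLO target, expressed both as a rate and as an
    absolute number of allowed bad events over the measurement window."""
    budget_rate = 1.0 - slo_target
    return {
        "budget_rate": budget_rate,
        "allowed_bad_events": int(budget_rate * window_requests),
    }

def budget_remaining(slo_target: float, bad_events: int, window_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once breached)."""
    allowed = (1.0 - slo_target) * window_requests
    return 1.0 - bad_events / allowed
```

For example, a 95% quality SLO over 1,000,000 requests allows 50,000 below-bar responses; after 20,000 bad responses, 60% of the budget remains.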
Key Principles for AI Error Budgets
Separate Budgets per SLI Dimension
Track quality, latency, and safety as independent budgets. A quality regression shouldn't consume your latency budget.
Use Rolling Windows
AI quality can shift with model updates or data drift. A 28-day rolling window captures trends better than cumulative metrics.
Account for Evaluation Lag
Human-in-the-loop quality scores arrive after the fact. Design your error budget calculations to handle delayed signals.
Gate Model Deployments
Before deploying a new model version, project its error budget impact using offline evals. Don't deploy if it burns the remaining budget.
Invest Budget in Experiments
Like traditional SRE, a healthy AI error budget means you can experiment (new prompts, fine-tuning, model upgrades) with acceptable risk.
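The first two principles — independent budgets per SLI dimension, tracked over a rolling window — can be sketched as follows. This is illustrative only; in production the daily counters would come from your metrics platform:

```python
from collections import deque

class DimensionBudget:
    """Independent 28-day rolling error budget for one SLI dimension
    (quality, latency, or safety), per the principles above."""

    def __init__(self, slo_target: float, window_days: int = 28):
        self.slo_target = slo_target
        # (bad_events, total_events) per day; deque drops days past the window
        self.days = deque(maxlen=window_days)

    def record_day(self, bad_events: int, total_events: int) -> None:
        self.days.append((bad_events, total_events))

    def remaining(self) -> float:
        """Fraction of this dimension's budget still unspent in the window."""
        bad = sum(b for b, _ in self.days)
        total = sum(t for _, t in self.days)
        if total == 0:
            return 1.0
        allowed = (1.0 - self.slo_target) * total
        return 1.0 - bad / allowed

# One budget per dimension: a quality regression cannot consume the
# latency or safety budget (targets here are examples, not recommendations).
budgets = {
    "quality": DimensionBudget(slo_target=0.90),
    "latency": DimensionBudget(slo_target=0.99),
    "safety": DimensionBudget(slo_target=0.999),
}
```

Evaluation lag can be handled in this shape by re-recording a day's counts once delayed human-eval scores arrive, since the window is recomputed from per-day counters.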
Error Budget States & Actions
Healthy (> 50% remaining)
Quality is consistently above SLO. Safe to experiment with new model versions, prompt changes, or new features.
- Run A/B experiments
- Deploy model updates
- Try new prompting strategies
At Risk (10–50% remaining)
Quality is trending toward SLO breach. Limit risky changes and investigate any degradation signals.
- Pause non-critical experiments
- Review recent model/data changes
- Increase evaluation sample rate
Breached (< 10% remaining)
Quality has fallen below acceptable levels. Freeze all changes and focus on restoring quality.
- Freeze model deployments
- Consider rollback to last stable model
- Escalate to ML team for root cause
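The three states above reduce to a simple policy function, with the thresholds taken directly from the headings:

```python
def budget_state(remaining: float) -> str:
    """Map remaining error budget (0..1) to the policy states above."""
    if remaining > 0.50:
        return "healthy"   # experiment freely: A/B tests, model updates
    if remaining >= 0.10:
        return "at_risk"   # pause non-critical experiments, investigate
    return "breached"      # freeze deployments, focus on restoring quality
```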
Alerting Strategies for AI Systems
AI alerting requires both traditional infrastructure alerts and new ML-specific signals. Here's how to build a comprehensive alerting strategy.
The Three Layers of AI Alerting
Infrastructure & Availability
Traditional infrastructure alerts — the foundation of AI reliability.
Quality & Model Performance
ML-specific signals that indicate quality degradation before users notice.
Safety & Policy Compliance
Critical alerts for user-facing AI — policy violations require immediate response.
Multi-Window Burn Rate Alerting for AI
Borrowed from Google's SRE practices, burn rate alerting detects quality degradation at different speeds — fast burns for immediate outages, slow burns for gradual drift.
| Alert Type | Long Window | Short Window | Burn Rate | Budget Consumed | Urgency |
|---|---|---|---|---|---|
| P0 – Critical | 1 hour | 5 min | 14.4× | 2% in 1 hour | Page Now |
| P1 – High | 6 hours | 30 min | 6× | 5% in 6 hours | Page Now |
| P2 – Medium | 1 day | 2 hours | 3× | 10% in 1 day | Ticket |
| P3 – Low | 3 days | 6 hours | 1× | 10% in 3 days | Monitor |
Burn rate is calculated against a 28-day window. A burn rate of 14.4× means the budget would be fully consumed in ~2 days if sustained.
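A sketch of how the table translates into code. An alert fires only when both its long and short window exceed the burn-rate threshold, which is what suppresses flapping on brief spikes; the window lengths and thresholds mirror the table, while the data structures are illustrative:

```python
# Window lengths from the table above, in hours; 28-day budget window = 672 h.
ALERTS = [
    # (name, long_window_h, short_window_h, burn_rate_threshold)
    ("P0-critical", 1,  5 / 60, 14.4),
    ("P1-high",     6,  0.5,     6.0),
    ("P2-medium",  24,  2,       3.0),
    ("P3-low",     72,  6,       1.0),
]

def burn_rate(bad_fraction: float, slo_target: float) -> float:
    """How fast the budget is burning: 1.0 means exactly on pace to
    spend the whole 28-day budget in 28 days."""
    return bad_fraction / (1.0 - slo_target)

def firing_alerts(long_bad: dict, short_bad: dict, slo_target: float) -> list[str]:
    """`long_bad` / `short_bad` map window length (hours) to the observed
    bad-event fraction in that window. Both windows must exceed the
    threshold for the alert to fire."""
    fired = []
    for name, long_h, short_h, threshold in ALERTS:
        if (burn_rate(long_bad[long_h], slo_target) >= threshold
                and burn_rate(short_bad[short_h], slo_target) >= threshold):
            fired.append(name)
    return fired
```

With a 99% SLO (1% budget rate), a 15% bad-event fraction in the last hour is a ~15× burn rate and trips the P0 alert, while a sustained 1.1% bad-event fraction over three days trips only the slow P3 monitor.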
Detecting & Managing Model Drift
Unlike traditional software, AI systems can degrade without any code changes. Model drift occurs when the real-world data distribution diverges from training data, silently eroding quality.
Data Drift
Input distributions shift over time (e.g., users ask different types of questions). The model wasn't trained on the new patterns.
Concept Drift
The relationship between inputs and "correct" outputs changes (e.g., new product features change what a "good" support response looks like).
Model Staleness
The world changes but the model doesn't. Knowledge cutoffs, outdated product information, or changed regulations can make old responses wrong.
Drift Response Playbook
Establish a Quality Baseline
Run a curated eval set on every model deployment. This becomes your reference point for detecting drift. Store scores in your metrics platform.
Instrument Production Quality Proxies
Human eval is expensive and slow. Use proxy signals: user feedback rates, engagement metrics, downstream task success, or an automated evaluator model.
Set Drift Thresholds & Alerts
Alert when your rolling quality proxy deviates more than 2σ from the 30-day baseline. Trigger an investigation — not necessarily a rollback.
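The 2σ rule can be implemented as a simple statistical check. A sketch, where `baseline_scores` stands in for your stored baseline quality samples:

```python
import statistics

def drift_alert(baseline_scores: list[float], current_score: float,
                sigmas: float = 2.0) -> bool:
    """Fire when the current rolling quality proxy falls more than `sigmas`
    standard deviations BELOW the baseline mean. Only drops in quality
    alert; an unexpectedly high score is not drift worth paging on."""
    mean = statistics.mean(baseline_scores)
    sd = statistics.stdev(baseline_scores)
    return current_score < mean - sigmas * sd
```

Remember the playbook's caveat: this should trigger an investigation, not an automatic rollback.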
Investigate Before Acting
Confirm drift is real (not a metric pipeline issue), identify affected request types, and determine whether a hotfix, prompt update, or model retrain is appropriate.
Close the Loop with Retraining or RAG Updates
For persistent drift, update your training data, fine-tuning set, or retrieval knowledge base. Validate against your eval set before deploying.
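Steps 1 and 5 meet in a deployment gate, echoing the earlier "Gate Model Deployments" principle: validate the candidate against the stored eval baseline before shipping. A minimal sketch with illustrative thresholds:

```python
def safe_to_deploy(candidate_eval_score: float, baseline_eval_score: float,
                   budget_remaining: float, min_budget: float = 0.10) -> bool:
    """Offline deployment gate: ship only if the candidate does not regress
    against the stored eval baseline AND there is enough error budget left
    to absorb surprises. Thresholds here are illustrative, not prescriptive."""
    return (candidate_eval_score >= baseline_eval_score
            and budget_remaining > min_budget)
```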
Frequently Asked Questions
Q: What SLO target should I set for AI quality?
A: Start by measuring current quality, then set a target slightly above it (e.g., if you're at 85%, target 88%). Don't start with a 99% quality SLO — AI systems are probabilistic by nature. Tighten targets over time as you understand your system better.
Q: How do I measure AI quality without human raters?
A: Use proxy metrics: user satisfaction signals (thumbs up/down, session continuation, task completion), automated evaluator models (LLM-as-judge), downstream task metrics (did the user need to retry?), or statistical measures like embedding similarity to ideal responses.
Q: Should AI quality SLOs be in my SLA with customers?
A: Treat quality SLOs as internal commitments first. SLAs typically cover availability and latency. If you include AI quality in an SLA, use conservative targets (e.g., 80% quality) with clear, objective measurement criteria that customers can verify.
Q: How often should I re-evaluate my AI SLO targets?
A: Review AI SLOs quarterly or after every major model update. Unlike traditional services, AI quality can improve significantly with a model upgrade, so your baseline shifts. Don't lock in SLO targets so tightly that they prevent you from upgrading models.
Q: How do I handle the latency variance in LLM responses?
A: Set separate SLOs for time-to-first-token (TTFT) and total completion time. Use streaming responses to improve perceived latency. Alert on p95 or p99 latency rather than average — long tail latency is where user experience breaks down.
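The separate TTFT and total-completion SLIs described above might be computed like this — a nearest-rank percentile sketch, not a production metrics pipeline:

```python
def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile; adequate for SLO checks at this granularity."""
    s = sorted(values)
    idx = min(int(p / 100 * len(s)), len(s) - 1)
    return s[idx]

def latency_slis(ttft_ms: list[float], total_ms: list[float]) -> dict:
    """Tail-latency SLIs for time-to-first-token and total completion time.
    Alert on these tails, not on the mean."""
    return {
        "ttft_p95_ms": percentile(ttft_ms, 95),
        "total_p99_ms": percentile(total_ms, 99),
    }
```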
Q: What's the difference between AI error budgets and traditional error budgets?
A: Traditional error budgets track binary failures (requests that return errors). AI error budgets additionally track quality failures — requests that technically succeed but return responses below the quality bar. You need both: an availability budget and a quality budget, managed independently.
Q: How do I alert on gradual model drift without too many false positives?
A: Use multi-window burn rate alerting (fast + slow windows). For gradual drift, a slow burn alert (e.g., 3× burn rate over 3 days) catches trends early without firing on normal variation. Combine with statistical process control (SPC) charts for ongoing monitoring.
Q: Can I apply SLOs to generative AI outputs that are inherently subjective?
A: Yes — the key is defining a clear evaluation rubric upfront. Decide what "good enough" means for your use case (e.g., "contains all required information", "tone matches brand guidelines", "no factual errors on verifiable claims") and operationalise it as a measurable SLI.