What is an Error Budget?
An error budget is the maximum amount of unreliability your service can tolerate before breaking its Service Level Objective (SLO). It’s not about preventing all failures—it’s about quantifying how much failure is acceptable and using that budget to guide engineering decisions.
Think of it like a financial budget. You have a fixed amount to spend. Once it’s gone, you stop spending until it replenishes. Error budgets work the same way: when you’ve exhausted your allowed unreliability, you pause risky changes and invest in stability.
The beauty of error budgets is that they replace subjective arguments (“we need to move faster” vs “we need more reliability”) with objective data: “we have 20 minutes of downtime remaining this month—do we deploy now or wait?”
How Error Budgets Work
Error budgets derive directly from your SLO. If your SLO promises 99.9% uptime over 30 days, your error budget is the remaining 0.1%—roughly 43 minutes of allowed downtime per month.
Formula:
Error Budget = 100% - SLO Target
Example Calculations:
| SLO Target | Error Budget | 30-Day Allowance |
|---|---|---|
| 99.9% | 0.1% | 43.2 minutes |
| 99.95% | 0.05% | 21.6 minutes |
| 99.99% | 0.01% | 4.32 minutes |
| 95% | 5% | 36 hours |
The stricter your SLO, the smaller your error budget. A 99.99% (“four nines”) SLO allows only about 4.3 minutes of downtime per month, leaving almost no room for mistakes.
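To make the arithmetic concrete, here is a minimal Python sketch that converts an SLO target into a 30-day downtime allowance and reproduces the table above (the function name is illustrative, not from any particular library):

```python
def downtime_allowance_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Convert an SLO target into allowed downtime minutes for the window."""
    error_budget_fraction = (100.0 - slo_percent) / 100.0
    window_minutes = window_days * 24 * 60
    return error_budget_fraction * window_minutes

for slo in (99.9, 99.95, 99.99, 95.0):
    print(f"{slo}% SLO -> {downtime_allowance_minutes(slo):.2f} minutes per 30 days")
```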
Why Error Budgets Matter
Aligning Conflicting Incentives
Product teams want to ship fast. Operations teams want systems stable. Error budgets create a shared framework where both goals coexist.
When your error budget is healthy:
- Deploy frequently
- Experiment with new features
- Take calculated risks
- Prioritize velocity
When your error budget is exhausted:
- Freeze risky deployments
- Focus on reliability improvements
- Pay down technical debt
- Investigate root causes
This isn’t subjective—it’s policy. Error budgets remove politics from reliability decisions.
Quantifying Risk
Every change carries risk. Deployments can introduce bugs. Infrastructure changes can cause outages. Even innocuous updates can cascade into failures.
Error budgets make risk tangible. Before deploying a new feature, teams can ask: “How much error budget do we have left?” If the answer is “not much,” maybe that risky database migration waits until next week.
This doesn’t eliminate risk—it makes risk visible and manageable.
Preventing Burnout
Without error budgets, reliability teams fight an endless battle against entropy. Every outage becomes a crisis. Every incident triggers blame.
Error budgets normalize failure. Not all failures are catastrophic. Some failures fit within acceptable bounds. When your error budget says “we can tolerate this,” teams stop treating every issue as an emergency.
Measuring Error Budget Consumption
Error budgets measure unreliability over time. How you calculate consumption depends on what you’re measuring.
Availability-Based Budgets
For uptime-focused SLOs, error budget consumption tracks downtime:
Error Budget Consumed = (Total Downtime / Allowed Downtime) * 100
where Allowed Downtime = (100% - SLO Target) * Measurement Window
Example: Your SLO is 99.9% uptime over 30 days (43.2 minutes allowed downtime).
- Downtime so far: 15 minutes this month
- Budget consumed: (15 / 43.2) * 100 = 34.7%
- Remaining budget: 65.3% (28.2 minutes)
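The same calculation as a small Python sketch, using the numbers from the example above (the function name and defaults are illustrative):

```python
def availability_budget_consumed(downtime_minutes: float,
                                 slo_percent: float = 99.9,
                                 window_days: int = 30) -> float:
    """Return the percentage of the error budget consumed by downtime."""
    allowed_downtime = (100.0 - slo_percent) / 100.0 * window_days * 24 * 60
    return downtime_minutes / allowed_downtime * 100.0

consumed = availability_budget_consumed(15)  # 15 minutes of downtime this month
print(f"Consumed: {consumed:.1f}%, remaining: {100 - consumed:.1f}%")
# Consumed: 34.7%, remaining: 65.3%
```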
Request-Based Budgets
For services measured by request success rate:
Error Budget Consumed = (Failed Requests / Allowed Failed Requests) * 100
where Allowed Failed Requests = (100% - SLO Target) * Total Requests
Example: Your SLO is 99.9% request success rate.
- Total requests: 1,000,000
- Failed requests: 500
- Allowed failed requests: 0.1% of 1,000,000 = 1,000
- Budget consumed: (500 / 1,000) * 100 = 50%
- Remaining budget: 50% (500 more failures allowed)
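A sketch of the request-based version, assuming you already have failed and total request counts from your monitoring (names are illustrative):

```python
def request_budget_consumed(failed: int, total: int, slo_percent: float = 99.9) -> float:
    """Percentage of the error budget consumed by failed requests."""
    allowed_failures = (100.0 - slo_percent) / 100.0 * total
    return failed / allowed_failures * 100.0

print(f"{request_budget_consumed(failed=500, total=1_000_000):.1f}% consumed")  # 50.0%
```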
Latency-Based Budgets
For performance-focused SLOs:
Error Budget Consumed = (Slow Requests / Allowed Slow Requests) * 100
Example: Your SLO is 95th percentile response time under 200ms.
- Allowed slow requests: the SLO permits 5% of traffic to exceed 200ms
- Track requests over the threshold; each one consumes part of that 5% allowance
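Assuming you have per-request latency samples available, one illustrative way to compute this is the following sketch (the threshold and SLO values mirror the example):

```python
def latency_budget_consumed(latencies_ms: list[float],
                            threshold_ms: float = 200.0,
                            slo_percent: float = 95.0) -> float:
    """Percentage of the latency error budget consumed by slow requests."""
    slow = sum(1 for latency in latencies_ms if latency > threshold_ms)
    allowed_slow = (100.0 - slo_percent) / 100.0 * len(latencies_ms)
    return slow / allowed_slow * 100.0

# Example: 10,000 requests, 300 of them slower than 200ms
samples = [150.0] * 9_700 + [250.0] * 300
print(f"{latency_budget_consumed(samples):.0f}% of latency budget consumed")  # 60%
```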
Using Error Budgets to Drive Decisions
Error budgets aren’t just dashboards—they’re decision-making frameworks.
Policy-Based Responses
Teams establish policies tied to budget levels:
Budget Status: Healthy (greater than 50% remaining)
- Normal development velocity
- Standard deployment cadence
- Feature work prioritized
- Monitoring improvements continue
Budget Status: Warning (10-50% remaining)
- Increase code review rigor
- Require extra testing for risky changes
- Delay non-critical deployments
- Begin investigating reliability gaps
Budget Status: Exhausted (less than 10% remaining)
- Deployment freeze for non-critical changes
- All engineering effort on stability
- Post-incident reviews required
- Root cause analysis for all incidents
- Technical debt prioritization
These policies aren’t suggestions—they’re enforced. When budget runs out, velocity slows. When budget replenishes, velocity resumes.
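Thresholds like these can be encoded directly in tooling. A minimal sketch, assuming the three tiers above (the policy wording is illustrative):

```python
def budget_policy(remaining_percent: float) -> str:
    """Map remaining error budget to the deployment policy tiers described above."""
    if remaining_percent > 50:
        return "healthy: normal velocity, standard deployment cadence"
    if remaining_percent >= 10:
        return "warning: extra review and testing, delay non-critical deploys"
    return "exhausted: deployment freeze, all effort on stability"

print(budget_policy(65.3))  # healthy
print(budget_policy(8.0))   # exhausted
```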
Tracking Burn Rate
How fast are you consuming your error budget? If you’re burning through it in the first week of the month, you won’t survive until the end.
Burn rate measures error budget consumption velocity:
Burn Rate = (Budget Consumed in Period) / (Expected Consumption if Linear)
- Burn rate = 1.0: On track (consuming budget evenly)
- Burn rate greater than 1.0: Burning too fast (need intervention)
- Burn rate less than 1.0: Consuming slowly (healthy margin)
Example: Your monthly error budget is 43 minutes.
- After 10 days (33% of month), you’ve used 20 minutes (46% of budget)
- Burn rate = 46% / 33% = 1.4x
- Interpretation: You’re burning budget 1.4x faster than sustainable
Burn rate alerts catch problems before budget exhaustion.
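A small sketch of the burn rate calculation and a naive alert check, using the example numbers above (a production setup would usually evaluate several time windows, not just one):

```python
def burn_rate(budget_consumed_percent: float, elapsed_fraction: float) -> float:
    """Burn rate: actual consumption versus linear (even) consumption."""
    return budget_consumed_percent / (elapsed_fraction * 100.0)

# 20 of 43 budget minutes used after 10 of 30 days
consumed = 20 / 43 * 100            # ~46.5% of budget
rate = burn_rate(consumed, 10 / 30)
print(f"Burn rate: {rate:.1f}x")    # ~1.4x
if rate > 1.0:
    print("Alert: burning budget faster than sustainable")
```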
Common Mistakes
Setting Unrealistic SLOs
An SLO of 99.999% (“five nines”) sounds impressive. It also leaves only 26 seconds of downtime per month. For most teams, that’s unachievable without massive infrastructure investment.
Unrealistic SLOs create tiny error budgets that get exhausted immediately, forcing permanent deployment freezes. Instead, set SLOs based on:
- Current performance baselines
- User expectations (what do users actually notice?)
- Infrastructure maturity
- Team capacity
Better to promise 99.9% and consistently deliver than promise 99.999% and constantly break it.
Ignoring Error Budget Signals
Error budgets only work if teams respect them. If you exhaust your budget but keep deploying anyway, you’ve just created theater—useless numbers on a dashboard.
Enforcement requires organizational buy-in. Leadership must support deployment freezes when budgets run out. Product managers must accept delayed features during reliability sprints.
Measuring the Wrong Thing
Your error budget should measure what users care about. If users don’t notice occasional 503 errors but they do notice slow page loads, your budget should prioritize latency over availability.
Misaligned budgets waste engineering effort on problems users don’t experience while ignoring issues that matter.
Treating All Failures Equally
A 10-second outage at 3 a.m. affecting 5 users is not the same as a 10-second outage at noon affecting 5,000 users. Yet most error budget calculations treat them identically.
Some teams weight budget consumption by impact:
- User-facing failures consume more budget
- Peak-hour failures consume more than off-peak
- Critical path failures (checkout, login) consume more than secondary features
This creates nuanced budgets that reflect business priorities.
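One illustrative way to implement weighted consumption; the categories and weights here are invented for the example and would need to reflect your own business priorities:

```python
# Hypothetical impact weights; real values depend on business priorities.
WEIGHTS = {
    "critical_path": 3.0,   # checkout, login
    "peak_hours": 2.0,
    "off_peak": 0.5,
    "default": 1.0,
}

def weighted_consumption(failures: list[tuple[str, float]]) -> float:
    """Sum downtime minutes, scaling each failure by its impact weight."""
    return sum(minutes * WEIGHTS.get(category, WEIGHTS["default"])
               for category, minutes in failures)

# A 10-minute critical-path outage costs as much budget as 60 off-peak minutes
print(weighted_consumption([("critical_path", 10), ("off_peak", 60)]))  # 60.0
```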
Practical Implementation
Start Simple
Don’t build a complex error budget system on day one. Start with:
- Pick one critical service with clear uptime requirements
- Define a realistic SLO based on current performance
- Track downtime manually using incident logs
- Calculate monthly budget consumption in a spreadsheet
- Review monthly and adjust SLO if needed
Once this works, expand to more services and automate tracking.
Automate Tracking
Manual error budget tracking doesn’t scale. As your system grows, you need tools that:
- Continuously monitor SLIs (uptime, latency, error rates)
- Automatically calculate budget consumption
- Alert when burn rate exceeds thresholds
- Provide historical trends and forecasts
Platforms like Upstat provide uptime monitoring and performance tracking that generate the underlying data needed for error budget calculations, helping teams maintain visibility into service health and make informed reliability decisions.
Review and Iterate
Error budgets aren’t static. Review them quarterly:
- Are SLOs still aligned with user expectations?
- Is budget consumption consistent or spiking?
- Are teams respecting budget policies?
- Should burn rate thresholds change?
Adjust as your service matures and requirements evolve.
Conclusion: Making Reliability Measurable
Error budgets transform reliability from a vague aspiration into a measurable, actionable framework. They answer questions like:
- How much downtime is acceptable?
- Should we deploy this risky change?
- When do we focus on stability versus features?
By quantifying acceptable unreliability, error budgets align teams around shared goals, reduce conflict between velocity and stability, and prevent both over-engineering (chasing perfection) and under-engineering (ignoring reliability).
Perfect reliability is impossible. But predictable, budgeted reliability—where teams know how much failure they can tolerate and what to do when they exceed it—is entirely achievable.
Define your error budget. Measure it. Respect it. And let data, not politics, guide your reliability decisions.
Explore In Upstat
Track service uptime and performance metrics that enable error budget calculations, helping teams make data-driven reliability decisions.