
What is an Error Budget?

Error budgets quantify acceptable unreliability, translating SLOs into actionable guidance for balancing feature velocity with system stability. This guide explains how error budgets work and how teams use them to make data-driven reliability decisions.

August 14, 2025
sre

What is an Error Budget?

An error budget is the maximum amount of unreliability your service can tolerate before breaking its Service Level Objective (SLO). It’s not about preventing all failures—it’s about quantifying how much failure is acceptable and using that budget to guide engineering decisions.

Think of it like a financial budget. You have a fixed amount to spend. Once it’s gone, you stop spending until it replenishes. Error budgets work the same way: when you’ve exhausted your allowed unreliability, you pause risky changes and invest in stability.

The beauty of error budgets is that they replace subjective arguments (“we need to move faster” vs “we need more reliability”) with objective data: “we have 20 minutes of downtime remaining this month—do we deploy now or wait?”

How Error Budgets Work

Error budgets derive directly from your SLO. If your SLO promises 99.9% uptime over 30 days, your error budget is the remaining 0.1%—roughly 43 minutes of allowed downtime per month.

Formula:

Error Budget = 100% - SLO Target

Example Calculations:

SLO Target    Error Budget    30-Day Allowance
99.9%         0.1%            43.2 minutes
99.95%        0.05%           21.6 minutes
99.99%        0.01%           4.32 minutes
95%           5%              36 hours

The stricter your SLO, the smaller your error budget. A 99.99% (“four nines”) SLO leaves only about 4.3 minutes of downtime per month, with almost no room for mistakes.
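
For concreteness, here is a minimal Python sketch that reproduces the allowance column above from the formula, assuming a 30-day measurement window:

# Sketch: derive the allowed downtime from an SLO target.
# Assumes a 30-day measurement window (43,200 minutes).
WINDOW_MINUTES = 30 * 24 * 60

def error_budget_minutes(slo_target: float) -> float:
    """Allowed downtime in minutes for an SLO target given as a percentage, e.g. 99.9."""
    return WINDOW_MINUTES * (100.0 - slo_target) / 100.0

for slo in (99.9, 99.95, 99.99, 95.0):
    print(f"{slo}% SLO -> {error_budget_minutes(slo):.2f} minutes of allowed downtime")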

Why Error Budgets Matter

Aligning Conflicting Incentives

Product teams want to ship fast. Operations teams want systems stable. Error budgets create a shared framework where both goals coexist.

When your error budget is healthy:

  • Deploy frequently
  • Experiment with new features
  • Take calculated risks
  • Prioritize velocity

When your error budget is exhausted:

  • Freeze risky deployments
  • Focus on reliability improvements
  • Pay down technical debt
  • Investigate root causes

This isn’t subjective—it’s policy. Error budgets remove politics from reliability decisions.

Quantifying Risk

Every change carries risk. Deployments can introduce bugs. Infrastructure changes can cause outages. Even innocuous updates can cascade into failures.

Error budgets make risk tangible. Before deploying a new feature, teams can ask: “How much error budget do we have left?” If the answer is “not much,” maybe that risky database migration waits until next week.

This doesn’t eliminate risk—it makes risk visible and manageable.

Preventing Burnout

Without error budgets, reliability teams fight an endless battle against entropy. Every outage becomes a crisis. Every incident triggers blame.

Error budgets normalize failure. Not all failures are catastrophic. Some failures fit within acceptable bounds. When your error budget says “we can tolerate this,” teams stop treating every issue as an emergency.

Measuring Error Budget Consumption

Error budgets measure unreliability over time. How you calculate consumption depends on what you’re measuring.

Availability-Based Budgets

For uptime-focused SLOs, error budget consumption tracks downtime:

Error Budget Consumed = (Total Downtime / Allowed Downtime) * 100

Example: Your SLO is 99.9% uptime over 30 days (43.2 minutes allowed downtime).

  • Current outage: 15 minutes total downtime this month
  • Budget consumed: (15 / 43.2) * 100 = 34.7%
  • Remaining budget: 65.3% (28.2 minutes)
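
The same arithmetic as a short Python sketch, using the hypothetical 15-minute outage and 43.2-minute allowance from the example:

# Sketch: availability-based budget consumption.
allowed_downtime_min = 43.2   # 0.1% of a 30-day window
total_downtime_min = 15.0     # downtime recorded so far this month

consumed_pct = total_downtime_min / allowed_downtime_min * 100   # 34.7%
remaining_min = allowed_downtime_min - total_downtime_min        # 28.2 minutes
print(f"Consumed: {consumed_pct:.1f}%, remaining: {remaining_min:.1f} minutes")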

Request-Based Budgets

For services measured by request success rate:

Error Budget Consumed = (Failed Requests / Allowed Failed Requests) * 100

Example: Your SLO is 99.9% request success rate.

  • Total requests: 1,000,000
  • Allowed failures (0.1% of total): 1,000
  • Failed requests: 500
  • Budget consumed: (500 / 1,000) * 100 = 50%
  • Remaining budget: 50% (500 more failed requests allowed)
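
A short sketch of the same calculation, deriving the allowed failure count from the SLO before dividing (the counts are the hypothetical values from the example above):

# Sketch: request-based budget consumption for a 99.9% success-rate SLO.
slo_target = 99.9
total_requests = 1_000_000
failed_requests = 500

allowed_failures = total_requests * (100 - slo_target) / 100   # 1,000 failures allowed
consumed_pct = failed_requests / allowed_failures * 100        # 50%
remaining_failures = allowed_failures - failed_requests        # 500 more failures allowed
print(f"Consumed: {consumed_pct:.0f}%, remaining failures: {remaining_failures:.0f}")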

Latency-Based Budgets

For performance-focused SLOs:

Error Budget Consumed = (Slow Requests / Allowed Slow Requests) * 100

Example: Your SLO is 95th percentile response time under 200ms.

  • Track requests whose response time exceeds the 200ms threshold
  • A p95 target tolerates up to 5% slow requests; each slow request consumes part of that allowance
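
A minimal sketch under the same assumption, where a p95 target tolerates up to 5% of requests over the threshold (the request counts are illustrative):

# Sketch: latency-based budget consumption for a "p95 under 200ms" SLO.
total_requests = 10_000
slow_requests = 120                     # requests that exceeded the 200ms threshold

allowed_slow = total_requests * 0.05    # p95 target tolerates up to 5% slow requests
consumed_pct = slow_requests / allowed_slow * 100   # 24%
print(f"Slow-request budget consumed: {consumed_pct:.0f}%")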

Using Error Budgets to Drive Decisions

Error budgets aren’t just dashboards—they’re decision-making frameworks.

Policy-Based Responses

Teams establish policies tied to budget levels:

Budget Status: Healthy (greater than 50% remaining)

  • Normal development velocity
  • Standard deployment cadence
  • Feature work prioritized
  • Monitoring improvements continue

Budget Status: Warning (10-50% remaining)

  • Increase code review rigor
  • Require extra testing for risky changes
  • Delay non-critical deployments
  • Begin investigating reliability gaps

Budget Status: Exhausted (less than 10% remaining)

  • Deployment freeze for non-critical changes
  • All engineering effort on stability
  • Post-incident reviews required
  • Root cause analysis for all incidents
  • Technical debt prioritization

These policies aren’t suggestions—they’re enforced. When budget runs out, velocity slows. When budget replenishes, velocity resumes.
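
One way to encode such a policy is sketched below; the tier names and the 50%/10% thresholds mirror the example levels above and should be tuned to your organization:

# Sketch: map remaining error budget to a policy tier.
# The 50% and 10% thresholds follow the example policy above; they are assumptions, not a standard.
def budget_policy(remaining_pct: float) -> str:
    if remaining_pct > 50:
        return "healthy: normal deployment cadence"
    if remaining_pct >= 10:
        return "warning: extra review and testing, delay non-critical deploys"
    return "exhausted: freeze non-critical deployments, focus on stability"

for remaining in (80, 35, 5):
    print(f"{remaining}% remaining -> {budget_policy(remaining)}")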

Tracking Burn Rate

How fast are you consuming your error budget? If you’re burning through it in the first week of the month, the budget won’t last until the end.

Burn rate measures error budget consumption velocity:

Burn Rate = (Budget Consumed in Period) / (Expected Consumption if Linear)

  • Burn rate = 1.0: On track (consuming budget evenly)
  • Burn rate greater than 1.0: Burning too fast (need intervention)
  • Burn rate less than 1.0: Consuming slowly (healthy margin)

Example: Your monthly error budget is 43 minutes.

  • After 10 days (33% of month), you’ve used 20 minutes (46% of budget)
  • Burn rate = 46% / 33% = 1.4x
  • Interpretation: You’re burning budget 1.4x faster than sustainable

Burn rate alerts catch problems before budget exhaustion.
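
A sketch of the burn-rate calculation with a simple threshold alert, using the example numbers above (alerting at exactly 1.0 is an assumption; many teams alert on higher short-window burn rates):

# Sketch: burn rate = actual consumption relative to even (linear) consumption.
def burn_rate(budget_consumed_pct: float, window_elapsed_pct: float) -> float:
    """Ratio of budget consumed to the fraction of the window that has elapsed."""
    return budget_consumed_pct / window_elapsed_pct

elapsed_pct = 10 / 30 * 100     # 10 of 30 days elapsed (~33%)
consumed_pct = 20 / 43.2 * 100  # 20 of 43.2 minutes consumed (~46%)

rate = burn_rate(consumed_pct, elapsed_pct)   # ~1.4
if rate > 1.0:
    print(f"Burn rate {rate:.1f}x: consuming budget faster than sustainable")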

Common Mistakes

Setting Unrealistic SLOs

An SLO of 99.999% (“five nines”) sounds impressive. It also leaves only 26 seconds of downtime per month. For most teams, that’s unachievable without massive infrastructure investment.

Unrealistic SLOs create tiny error budgets that get exhausted immediately, forcing permanent deployment freezes. Instead, set SLOs based on:

  • Current performance baselines
  • User expectations (what do users actually notice?)
  • Infrastructure maturity
  • Team capacity

Better to promise 99.9% and consistently deliver than promise 99.999% and constantly break it.

Ignoring Error Budget Signals

Error budgets only work if teams respect them. If you exhaust your budget but keep deploying anyway, you’ve just created theater—useless numbers on a dashboard.

Enforcement requires organizational buy-in. Leadership must support deployment freezes when budgets run out. Product managers must accept delayed features during reliability sprints.

Measuring the Wrong Thing

Your error budget should measure what users care about. If users don’t notice occasional 503 errors but they do notice slow page loads, your budget should prioritize latency over availability.

Misaligned budgets waste engineering effort on problems users don’t experience while ignoring issues that matter.

Treating All Failures Equally

A 10-second outage at 3 a.m. affecting 5 users is not the same as a 10-second outage at noon affecting 5,000 users. Yet most error budget calculations treat them identically.

Some teams weight budget consumption by impact:

  • User-facing failures consume more budget
  • Peak-hour failures consume more than off-peak
  • Critical path failures (checkout, login) consume more than secondary features

This creates nuanced budgets that reflect business priorities.
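
A sketch of impact-weighted consumption; the categories and weight values here are illustrative assumptions, not a standard:

# Sketch: weight budget consumption by business impact.
# Categories and weights are illustrative; choose values that reflect your priorities.
IMPACT_WEIGHTS = {
    "critical_path_peak": 2.0,      # e.g. checkout failing at noon
    "critical_path_offpeak": 1.5,
    "secondary_peak": 1.0,
    "secondary_offpeak": 0.5,       # e.g. a minor feature failing at 3 a.m.
}

def weighted_downtime(incidents: list[tuple[str, float]]) -> float:
    """Sum downtime minutes, scaled by each incident's impact category."""
    return sum(IMPACT_WEIGHTS[category] * minutes for category, minutes in incidents)

incidents = [("critical_path_peak", 5.0), ("secondary_offpeak", 10.0)]
print(f"Weighted downtime: {weighted_downtime(incidents):.1f} minutes")   # 15.0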

Practical Implementation

Start Simple

Don’t build a complex error budget system on day one. Start with:

  1. Pick one critical service with clear uptime requirements
  2. Define a realistic SLO based on current performance
  3. Track downtime manually using incident logs
  4. Calculate monthly budget consumption in a spreadsheet
  5. Review monthly and adjust SLO if needed

Once this works, expand to more services and automate tracking.

Automate Tracking

Manual error budget tracking doesn’t scale. As your system grows, you need tools that:

  • Continuously monitor SLIs (uptime, latency, error rates)
  • Automatically calculate budget consumption
  • Alert when burn rate exceeds thresholds
  • Provide historical trends and forecasts

Platforms like Upstat provide uptime monitoring and performance tracking that generate the underlying data needed for error budget calculations, helping teams maintain visibility into service health and make informed reliability decisions.

Review and Iterate

Error budgets aren’t static. Review them quarterly:

  • Are SLOs still aligned with user expectations?
  • Is budget consumption consistent or spiking?
  • Are teams respecting budget policies?
  • Should burn rate thresholds change?

Adjust as your service matures and requirements evolve.

Conclusion: Making Reliability Measurable

Error budgets transform reliability from a vague aspiration into a measurable, actionable framework. They answer questions like:

  • How much downtime is acceptable?
  • Should we deploy this risky change?
  • When do we focus on stability versus features?

By quantifying acceptable unreliability, error budgets align teams around shared goals, reduce conflict between velocity and stability, and prevent both over-engineering (chasing perfection) and under-engineering (ignoring reliability).

Perfect reliability is impossible. But predictable, budgeted reliability—where teams know how much failure they can tolerate and what to do when they exceed it—is entirely achievable.

Define your error budget. Measure it. Respect it. And let data, not politics, guide your reliability decisions.

Explore In Upstat

Track service uptime and performance metrics that enable error budget calculations, helping teams make data-driven reliability decisions.