What is an Error Budget?
An error budget is the maximum amount of unreliability your service can tolerate before breaking its Service Level Objective (SLO). It’s not about preventing all failures—it’s about quantifying how much failure is acceptable and using that budget to guide engineering decisions.
Think of it like a financial budget. You have a fixed amount to spend. Once it’s gone, you stop spending until it replenishes. Error budgets work the same way: when you’ve exhausted your allowed unreliability, you pause risky changes and invest in stability.
The beauty of error budgets is that they replace subjective arguments (“we need to move faster” vs “we need more reliability”) with objective data: “we have 20 minutes of downtime remaining this month—do we deploy now or wait?”
How Error Budgets Work
Error budgets derive directly from your SLO. If your SLO promises 99.9% uptime over 30 days, your error budget is the remaining 0.1%—roughly 43 minutes of allowed downtime per month.
Formula:
Error Budget = 100% - SLO Target
Example Calculations:
| SLO Target | Error Budget | 30-Day Allowance |
|---|---|---|
| 99.9% | 0.1% | 43.2 minutes |
| 99.95% | 0.05% | 21.6 minutes |
| 99.99% | 0.01% | 4.32 minutes |
| 95% | 5% | 36 hours |
The stricter your SLO, the smaller your error budget. A 99.99% (“four nines”) SLO allows only about 4.3 minutes of downtime per month, leaving almost no room for mistakes.
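To make the arithmetic concrete, here is a minimal Python sketch that converts an SLO target into a 30-day downtime allowance and reproduces the table above (the function name is illustrative, not from any particular library):

```python
def downtime_allowance_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Convert an SLO target into allowed downtime minutes for the window."""
    error_budget_fraction = (100.0 - slo_percent) / 100.0
    window_minutes = window_days * 24 * 60
    return error_budget_fraction * window_minutes

for slo in (99.9, 99.95, 99.99, 95.0):
    print(f"{slo}% SLO -> {downtime_allowance_minutes(slo):.2f} minutes per 30 days")
```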
Why Error Budgets Matter
Aligning Conflicting Incentives
Product teams want to ship fast. Operations teams want systems stable. Error budgets create a shared framework where both goals coexist.
When your error budget is healthy:
- Deploy frequently
- Experiment with new features
- Take calculated risks
- Prioritize velocity
When your error budget is exhausted:
- Freeze risky deployments
- Focus on reliability improvements
- Pay down technical debt
- Investigate root causes
This isn’t subjective—it’s policy. Error budgets remove politics from reliability decisions.
Quantifying Risk
Every change carries risk. Deployments can introduce bugs. Infrastructure changes can cause outages. Even innocuous updates can cascade into failures.
Error budgets make risk tangible. Before deploying a new feature, teams can ask: “How much error budget do we have left?” If the answer is “not much,” maybe that risky database migration waits until next week.
This doesn’t eliminate risk—it makes risk visible and manageable.
Preventing Burnout
Without error budgets, reliability teams fight an endless battle against entropy. Every outage becomes a crisis. Every incident triggers blame.
Error budgets normalize failure. Not all failures are catastrophic. Some failures fit within acceptable bounds. When your error budget says “we can tolerate this,” teams stop treating every issue as an emergency.
Measuring Error Budget Consumption
Error budgets measure unreliability over time. How you calculate consumption depends on what you’re measuring.
Availability-Based Budgets
For uptime-focused SLOs, error budget consumption tracks downtime:
Error Budget Consumed = (Total Downtime / Allowed Downtime) * 100
where Allowed Downtime = (100% - SLO Target) * Measurement Window
Example: Your SLO is 99.9% uptime over 30 days (43.2 minutes allowed downtime).
- Downtime so far: 15 minutes this month
- Budget consumed: (15 / 43.2) * 100 = 34.7%
- Remaining budget: 65.3% (28.2 minutes)
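The same calculation as a small Python sketch, using the numbers from the example above (the function name and defaults are illustrative):

```python
def availability_budget_consumed(downtime_minutes: float,
                                 slo_percent: float = 99.9,
                                 window_days: int = 30) -> float:
    """Return the percentage of the error budget consumed by downtime."""
    allowed_downtime = (100.0 - slo_percent) / 100.0 * window_days * 24 * 60
    return downtime_minutes / allowed_downtime * 100.0

consumed = availability_budget_consumed(15)  # 15 minutes of downtime this month
print(f"Consumed: {consumed:.1f}%, remaining: {100 - consumed:.1f}%")
# Consumed: 34.7%, remaining: 65.3%
```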
Request-Based Budgets
For services measured by request success rate:
Error Budget Consumed = (Failed Requests / Allowed Failed Requests) * 100
where Allowed Failed Requests = (100% - SLO Target) * Total Requests
Example: Your SLO is 99.9% request success rate.
- Total requests: 1,000,000
- Failed requests: 500
- Allowed failed requests: 0.1% of 1,000,000 = 1,000
- Budget consumed: (500 / 1,000) * 100 = 50%
- Remaining budget: 50% (500 more failures allowed)
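A sketch of the request-based version, assuming you already have failed and total request counts from your monitoring (names are illustrative):

```python
def request_budget_consumed(failed: int, total: int, slo_percent: float = 99.9) -> float:
    """Percentage of the error budget consumed by failed requests."""
    allowed_failures = (100.0 - slo_percent) / 100.0 * total
    return failed / allowed_failures * 100.0

print(f"{request_budget_consumed(failed=500, total=1_000_000):.1f}% consumed")  # 50.0%
```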
Latency-Based Budgets
For performance-focused SLOs:
Error Budget Consumed = (Slow Requests / Allowed Slow Requests) * 100
Example: Your SLO is 95th percentile response time under 200ms.
- Allowed slow requests: the SLO permits 5% of traffic to exceed 200ms
- Track requests over the threshold; each one consumes part of that 5% allowance
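Assuming you have per-request latency samples available, one illustrative way to compute this is the following sketch (the threshold and SLO values mirror the example):

```python
def latency_budget_consumed(latencies_ms: list[float],
                            threshold_ms: float = 200.0,
                            slo_percent: float = 95.0) -> float:
    """Percentage of the latency error budget consumed by slow requests."""
    slow = sum(1 for latency in latencies_ms if latency > threshold_ms)
    allowed_slow = (100.0 - slo_percent) / 100.0 * len(latencies_ms)
    return slow / allowed_slow * 100.0

# Example: 10,000 requests, 300 of them slower than 200ms
samples = [150.0] * 9_700 + [250.0] * 300
print(f"{latency_budget_consumed(samples):.0f}% of latency budget consumed")  # 60%
```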
Using Error Budgets to Drive Decisions
Error budgets aren’t just dashboards—they’re decision-making frameworks.
Policy-Based Responses
Teams establish policies tied to budget levels:
Budget Status: Healthy (greater than 50% remaining)
- Normal development velocity
- Standard deployment cadence
- Feature work prioritized
- Monitoring improvements continue
Budget Status: Warning (10-50% remaining)
- Increase code review rigor
- Require extra testing for risky changes
- Delay non-critical deployments
- Begin investigating reliability gaps
Budget Status: Exhausted (less than 10% remaining)
- Deployment freeze for non-critical changes
- All engineering effort on stability
- Post-incident reviews required
- Root cause analysis for all incidents
- Technical debt prioritization
These policies aren’t suggestions—they’re enforced. When budget runs out, velocity slows. When budget replenishes, velocity resumes.
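Thresholds like these can be encoded directly in tooling. A minimal sketch, assuming the three tiers above (the policy wording is illustrative):

```python
def budget_policy(remaining_percent: float) -> str:
    """Map remaining error budget to the deployment policy tiers described above."""
    if remaining_percent > 50:
        return "healthy: normal velocity, standard deployment cadence"
    if remaining_percent >= 10:
        return "warning: extra review and testing, delay non-critical deploys"
    return "exhausted: deployment freeze, all effort on stability"

print(budget_policy(65.3))  # healthy
print(budget_policy(8.0))   # exhausted
```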
Tracking Burn Rate
How fast are you consuming your error budget? If you’re burning through it in the first week of the month, you won’t survive until the end.
Burn rate measures error budget consumption velocity:
Burn Rate = (Budget Consumed in Period) / (Expected Consumption if Linear)
- Burn rate = 1.0: On track (consuming budget evenly)
- Burn rate greater than 1.0: Burning too fast (need intervention)
- Burn rate less than 1.0: Consuming slowly (healthy margin)
Example: Your monthly error budget is 43 minutes.
- After 10 days (33% of month), you’ve used 20 minutes (46% of budget)
- Burn rate = 46% / 33% = 1.4x
- Interpretation: You’re burning budget 1.4x faster than sustainable
Burn rate alerts catch problems before budget exhaustion.
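A small sketch of the burn rate calculation and a naive alert check, using the example numbers above (a production setup would usually evaluate several time windows, not just one):

```python
def burn_rate(budget_consumed_percent: float, elapsed_fraction: float) -> float:
    """Burn rate: actual consumption versus linear (even) consumption."""
    return budget_consumed_percent / (elapsed_fraction * 100.0)

# 20 of 43 budget minutes used after 10 of 30 days
consumed = 20 / 43 * 100            # ~46.5% of budget
rate = burn_rate(consumed, 10 / 30)
print(f"Burn rate: {rate:.1f}x")    # ~1.4x
if rate > 1.0:
    print("Alert: burning budget faster than sustainable")
```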
Common Mistakes
Setting Unrealistic SLOs
An SLO of 99.999% (“five nines”) sounds impressive. It also leaves only 26 seconds of downtime per month. For most teams, that’s unachievable without massive infrastructure investment.
Unrealistic SLOs create tiny error budgets that get exhausted immediately, forcing permanent deployment freezes. Instead, set SLOs based on:
- Current performance baselines
- User expectations (what do users actually notice?)
- Infrastructure maturity
- Team capacity
Better to promise 99.9% and consistently deliver than promise 99.999% and constantly break it.
Ignoring Error Budget Signals
Error budgets only work if teams respect them. If you exhaust your budget but keep deploying anyway, you’ve just created theater—useless numbers on a dashboard.
Enforcement requires organizational buy-in. Leadership must support deployment freezes when budgets run out. Product managers must accept delayed features during reliability sprints.
Measuring the Wrong Thing
Your error budget should measure what users care about. If users don’t notice occasional 503 errors but they do notice slow page loads, your budget should prioritize latency over availability.
Misaligned budgets waste engineering effort on problems users don’t experience while ignoring issues that matter.
Treating All Failures Equally
A 10-second outage at 3 a.m. affecting 5 users is not the same as a 10-second outage at noon affecting 5,000 users. Yet most error budget calculations treat them identically.
Some teams weight budget consumption by impact:
- User-facing failures consume more budget
- Peak-hour failures consume more than off-peak
- Critical path failures (checkout, login) consume more than secondary features
This creates nuanced budgets that reflect business priorities.
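One illustrative way to implement weighted consumption; the categories and weights here are invented for the example and would need to reflect your own business priorities:

```python
# Hypothetical impact weights; real values depend on business priorities.
WEIGHTS = {
    "critical_path": 3.0,   # checkout, login
    "peak_hours": 2.0,
    "off_peak": 0.5,
    "default": 1.0,
}

def weighted_consumption(failures: list[tuple[str, float]]) -> float:
    """Sum downtime minutes, scaling each failure by its impact weight."""
    return sum(minutes * WEIGHTS.get(category, WEIGHTS["default"])
               for category, minutes in failures)

# A 10-minute critical-path outage costs as much budget as 60 off-peak minutes
print(weighted_consumption([("critical_path", 10), ("off_peak", 60)]))  # 60.0
```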
Practical Implementation
Start Simple
Don’t build a complex error budget system on day one. Start with:
- Pick one critical service with clear uptime requirements
- Define a realistic SLO based on current performance
- Track downtime manually using incident logs
- Calculate monthly budget consumption in a spreadsheet
- Review monthly and adjust SLO if needed
Once this works, expand to more services and automate tracking.
Automate Tracking
Manual error budget tracking doesn’t scale. As your system grows, you need tools that:
- Continuously monitor SLIs (uptime, latency, error rates)
- Automatically calculate budget consumption
- Alert when burn rate exceeds thresholds
- Provide historical trends and forecasts
Platforms like Upstat provide uptime monitoring and performance tracking that generate the underlying data needed for error budget calculations, helping teams maintain visibility into service health and make informed reliability decisions.
Review and Iterate
Error budgets aren’t static. Review them quarterly:
- Are SLOs still aligned with user expectations?
- Is budget consumption consistent or spiking?
- Are teams respecting budget policies?
- Should burn rate thresholds change?
Adjust as your service matures and requirements evolve.
Conclusion: Making Reliability Measurable
Error budgets transform reliability from a vague aspiration into a measurable, actionable framework. They answer questions like:
- How much downtime is acceptable?
- Should we deploy this risky change?
- When do we focus on stability versus features?
By quantifying acceptable unreliability, error budgets align teams around shared goals, reduce conflict between velocity and stability, and prevent both over-engineering (chasing perfection) and under-engineering (ignoring reliability).
Perfect reliability is impossible. But predictable, budgeted reliability—where teams know how much failure they can tolerate and what to do when they exceed it—is entirely achievable.
Define your error budget. Measure it. Respect it. And let data, not politics, guide your reliability decisions.
Explore In Upstat
Track service uptime and performance metrics that enable error budget calculations, helping teams make data-driven reliability decisions.