What Are Service Level Objectives?
A Service Level Objective is a target value for a service’s reliability. It’s not a vague aspiration like “be highly available”—it’s a specific, measurable commitment: “99.9 percent of requests will succeed over a 30-day window” or “95th percentile latency stays under 200 milliseconds.”
SLOs answer the fundamental question: how reliable does this service actually need to be?
Not “how reliable can we make it” or “how reliable do we wish it were,” but how reliable is good enough to meet user expectations without overinvesting in diminishing returns.
Most teams discover their services are already more reliable than users require. You might be chasing four nines of availability when users would be perfectly satisfied with three. That gap represents wasted engineering effort that could have shipped features instead.
Why SLOs Matter
Preventing Reliability Theater
Without explicit targets, teams default to reliability theater. Every outage becomes a crisis. Every incident triggers blame. Engineers spend nights hardening systems against edge cases users will never encounter.
SLOs replace theater with engineering discipline. They force honest conversations: what do users actually care about? What’s the cost of an extra nine of availability? Is eliminating this failure mode worth delaying that feature by a quarter?
Aligning Reliability with Business Value
Not all services need identical reliability. Your authentication service failing affects every user immediately. A batch job processing analytics reports can tolerate occasional failures without user impact.
SLOs let you differentiate. Set tight targets for critical user paths. Accept looser targets for internal tools. Invest reliability effort where it delivers business value, not uniformly across everything.
Creating Forcing Functions
SLOs paired with error budgets create self-enforcing reliability practices. When you’re within budget, deploy freely. When you’ve exhausted your budget, stop shipping and fix reliability. No arguments, no politics—just data.
This transforms reliability from a constraint on velocity into a shared framework where both goals coexist.
Choosing Service Level Indicators
Before setting an SLO, you need something to measure. Service Level Indicators are the quantitative measurements that tell you how your service is performing.
What Makes a Good SLI
Good SLIs are:
User-Focused: They measure what users experience, not internal system state. A server’s CPU utilization is a metric. The percentage of user requests that succeed is an SLI.
Measurable: You can collect the data reliably without complex instrumentation or manual calculation.
Actionable: Engineering work can improve the SLI. If you can’t affect it through code or infrastructure changes, it’s not useful.
Common SLI Categories
Availability: The most fundamental SLI—is the service working?
- Measured as: successful requests divided by total requests
- Example: 99.95 percent of API requests return 200-level status codes
Latency: How fast does the service respond?
- Measured as: percentile response times (P50, P95, P99)
- Example: 95th percentile latency stays under 300 milliseconds
Error Rate: How often do requests fail?
- Measured as: failed requests divided by total requests
- Example: Error rate stays under 0.1 percent
Throughput: Can the service handle the load?
- Measured as: requests processed per second
- Example: System sustains 10,000 requests per second during peak hours
Durability: Does data persist correctly?
- Measured as: data loss events over total storage operations
- Example: 99.999999999 percent of stored objects remain retrievable
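To make these definitions concrete, here is a minimal sketch of how the availability, error rate, and latency SLIs above could be computed from raw request data. The `Request` record and its fields are illustrative assumptions; in practice this data comes from your monitoring or logging pipeline.

```python
import math
from dataclasses import dataclass

@dataclass
class Request:
    status: int        # HTTP status code returned to the user
    latency_ms: float  # time taken to serve the request

def availability(requests: list[Request]) -> float:
    """Fraction of requests that returned a 2xx status."""
    ok = sum(1 for r in requests if 200 <= r.status < 300)
    return ok / len(requests)

def error_rate(requests: list[Request]) -> float:
    """Fraction of requests that returned a 5xx status."""
    failed = sum(1 for r in requests if r.status >= 500)
    return failed / len(requests)

def latency_percentile(requests: list[Request], pct: float) -> float:
    """Nearest-rank percentile of observed latencies, e.g. pct=95."""
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]

# Three example requests: two succeed (one slowly), one fails with a 503.
sample = [Request(200, 120.0), Request(200, 480.0), Request(503, 95.0)]
print(f"availability: {availability(sample):.3f}")           # 0.667
print(f"error rate:   {error_rate(sample):.3f}")             # 0.333
print(f"p95 latency:  {latency_percentile(sample, 95)} ms")  # 480.0 ms
```

Production systems usually pull percentiles from their metrics backend rather than sorting raw samples, but the definitions are the same.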
Choosing Your SLI
Don’t measure everything. Start with the one or two metrics that best reflect user experience. For most services, availability and latency cover the majority of user concerns.
Ask: if this metric degrades, will users notice and care? If yes, it’s probably a good SLI. If users wouldn’t notice, measure something else.
Setting Realistic SLO Targets
This is where most teams fail. They pick aspirational targets based on what sounds impressive rather than what’s achievable or necessary.
Start with Historical Data
Never set an SLO before measuring current performance. You need at least 30 days of historical data showing actual system behavior.
Review your existing SLIs over the past month, quarter, and year. What’s your worst-case performance? What’s typical? Where are the outliers?
If your service currently delivers 99.5 percent availability, setting an SLO of 99.99 percent guarantees immediate failure. You’ll violate your target constantly and train teams to ignore it.
Choose Targets You Can Sustain
Your SLO should be tighter than your worst recent performance but looser than your best. You want a target that challenges the team but remains achievable even during difficult months.
Framework:
- Look at the worst month in the past year
- Set your SLO slightly better than that worst month
- Ensure you can meet it 11 out of 12 months
If your worst month was 99.7 percent availability, an SLO of 99.8 percent is reasonable. You’re committing to perform better than your historical worst case while leaving margin for the occasional bad month.
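As a rough sketch of that framework, the helper below suggests a target from however many months of availability figures you have. The function name and the 0.1-point margin are illustrative assumptions, not a prescription.

```python
def suggest_slo(monthly_availability: list[float], margin: float = 0.001) -> float:
    """Suggest an availability SLO slightly better than the worst recent month.

    monthly_availability: fractions like 0.997 for 99.7 percent, ideally a year of data.
    margin: how much tighter than the worst month to aim (0.001 = 0.1 percentage points).
    """
    worst = min(monthly_availability)
    best = max(monthly_availability)
    # Slightly better than the worst month, but never tighter than your best month.
    return min(worst + margin, best)

# Example: the worst month was 99.7 percent, so the suggestion is 99.8 percent.
history = [0.999, 0.9985, 0.997, 0.9992, 0.9988, 0.9979]
print(f"suggested SLO: {suggest_slo(history):.4%}")
```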
The Number of Nines
Availability SLOs are often expressed as “nines”—99.9 percent is “three nines,” 99.99 percent is “four nines.”
Each additional nine roughly doubles operational complexity and cost:
| Target | Downtime per Month | Difficulty |
|---|---|---|
| 99% (two nines) | 7.2 hours | Easy to achieve |
| 99.9% (three nines) | 43 minutes | Requires good practices |
| 99.95% | 21.6 minutes | Demands investment |
| 99.99% (four nines) | 4.3 minutes | Requires significant engineering |
| 99.999% (five nines) | 26 seconds | Extremely difficult and expensive |
Most services don’t need five nines. Three nines is perfectly acceptable for many use cases. Don’t chase extra nines for bragging rights—chase them only when users need them and you can sustain them.
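The table is just arithmetic on the unavailability fraction. Here is a quick sketch for converting any target into a monthly downtime budget:

```python
def downtime_budget(target: float, window_days: int = 30) -> str:
    """Allowed downtime for an availability target over a window (the math behind the table)."""
    seconds = (1 - target) * window_days * 24 * 60 * 60
    if seconds >= 3600:
        return f"{seconds / 3600:.1f} hours"
    if seconds >= 60:
        return f"{seconds / 60:.1f} minutes"
    return f"{seconds:.0f} seconds"

for target in (0.99, 0.999, 0.9995, 0.9999, 0.99999):
    print(f"{target:.3%} -> {downtime_budget(target)} per 30 days")
```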
Buffer from SLAs
If you offer Service Level Agreements to customers, set internal SLOs tighter than those external commitments.
Example:
- Customer SLA: 99.95 percent availability
- Internal SLO: 99.98 percent availability
This buffer gives you room to detect and fix problems before customers are affected or before you owe refunds. Your SLO violation alerts you to trouble; your SLA violation triggers contract penalties.
Get Stakeholder Agreement
SLOs only work when everyone enforces them. That requires buy-in from:
Product Teams: Accept that error budget exhaustion means deployment freezes
Engineering Teams: Commit to respecting budget-driven velocity changes
Leadership: Support reliability work over features when budgets run low
If any stakeholder refuses to respect error budgets, your SLOs are theater. You need organizational commitment before implementation.
Implementing Your First SLO
Don’t build a comprehensive SLO system on day one. Start small and expand as you learn what works.
Pick One Critical Service
Choose your most important user-facing service. Not your entire system—one service. Authentication, API gateway, checkout flow—something users interact with directly and care about deeply.
Complex infrastructure services can wait. Start where user impact is clear and measurement is straightforward.
Define the Measurement Window
SLOs need time windows—the period over which you measure compliance.
Common windows:
- 30 days rolling: Most common; provides recent performance view
- Calendar month: Aligns with business reporting cycles
- 7 days rolling: Shorter feedback loop for fast-moving services
- 90 days rolling: Longer view reduces noise from individual incidents
Most teams use 30-day rolling windows. It’s recent enough to detect trends but long enough to smooth out one-time incidents.
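Here is a minimal sketch of a 30-day rolling window, assuming you can aggregate daily successful and total request counts from your monitoring data; the class and method names are illustrative.

```python
from collections import deque

class RollingSLO:
    """Track availability over a rolling window of daily (successful, total) request counts."""

    def __init__(self, target: float, window_days: int = 30):
        self.target = target
        # maxlen drops the oldest day automatically, giving the rolling window.
        self.days: deque[tuple[int, int]] = deque(maxlen=window_days)

    def record_day(self, successful: int, total: int) -> None:
        self.days.append((successful, total))

    def availability(self) -> float:
        ok = sum(s for s, _ in self.days)
        total = sum(t for _, t in self.days)
        return ok / total if total else 1.0

    def within_slo(self) -> bool:
        return self.availability() >= self.target

slo = RollingSLO(target=0.999)
slo.record_day(successful=998_500, total=1_000_000)  # a rough day
slo.record_day(successful=999_990, total=1_000_000)
print(f"30-day availability: {slo.availability():.4%}, within SLO: {slo.within_slo()}")
```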
Set Up Measurement
You need systems that continuously track your SLIs and calculate SLO compliance.
Requirements:
- Monitor endpoints or measure requests
- Calculate success rates and latency percentiles
- Track compliance against target
- Alert when approaching violation
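As an illustration of the first requirement, here is a minimal synthetic probe that records success and latency for a single endpoint. The URL is hypothetical, and a monitoring platform would run checks like this continuously from multiple regions; this is only a sketch of where the underlying SLI data comes from.

```python
import time
import urllib.request

def probe(url: str, timeout: float = 5.0) -> tuple[bool, float]:
    """Single synthetic check: did the endpoint answer with a 2xx, and how fast?"""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            ok = 200 <= response.status < 300
    except OSError:  # covers URLError, timeouts, and connection failures
        ok = False
    latency_ms = (time.monotonic() - start) * 1000
    return ok, latency_ms

# Run a few checks against a hypothetical health endpoint and print raw SLI data.
for _ in range(3):
    success, latency = probe("https://example.com/health")
    print(f"success={success} latency={latency:.0f}ms")
    time.sleep(1)
```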
Platforms like Upstat provide the monitoring foundation—uptime tracking, response time measurement, multi-region checks—that generate the underlying SLI data teams need to track their SLOs.
Many teams start with manual tracking: review monitoring dashboards weekly, calculate SLO compliance in spreadsheets, and manually determine if targets were met. Once this proves valuable, invest in automation.
Establish Response Policies
Define what happens at different error budget levels:
Budget Healthy (over 50 percent remaining):
- Normal deployment cadence
- Standard feature prioritization
- Routine monitoring
Budget Warning (10 to 50 percent remaining):
- Increase testing rigor
- Delay non-critical deploys
- Investigate reliability gaps
Budget Exhausted (under 10 percent remaining):
- Deployment freeze for features
- All engineering effort on reliability
- Post-incident reviews required
- Root cause analysis mandatory
Make these policies explicit and enforce them consistently. Error budgets only work when teams respect the numbers.
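A small sketch of how those thresholds might be encoded, assuming availability and target are measured over the same window; the function names are illustrative.

```python
def error_budget_remaining(availability: float, target: float) -> float:
    """Fraction of the error budget still unspent for this window.

    The budget is the allowed unreliability (1 - target); what has been
    consumed is the observed unreliability (1 - availability).
    """
    budget = 1 - target
    consumed = 1 - availability
    return max(0.0, 1 - consumed / budget)

def deployment_policy(remaining: float) -> str:
    """Apply the thresholds above: healthy over 50 percent, warning 10-50 percent, exhausted under 10 percent."""
    if remaining > 0.50:
        return "healthy: normal deployment cadence"
    if remaining > 0.10:
        return "warning: increase testing rigor, delay non-critical deploys"
    return "exhausted: feature freeze, all effort on reliability"

# Example: a 99.9 percent target with 99.97 percent measured leaves about 70 percent of the budget.
remaining = error_budget_remaining(availability=0.9997, target=0.999)
print(f"{remaining:.0%} of error budget remaining -> {deployment_policy(remaining)}")
```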
When to Adjust Your SLOs
SLOs aren’t set in stone. They should evolve as your service matures and requirements change.
Quarterly Reviews
Schedule formal SLO reviews every quarter. Examine:
Historical Performance: Are you consistently exceeding targets? If you hit 99.99 percent when targeting 99.9 percent, you’re overinvesting in reliability.
User Feedback: Are users complaining about reliability issues despite meeting SLOs? Your targets might not reflect actual user expectations.
Incident Patterns: Are you exhausting error budgets repeatedly? Your targets might be unrealistic given current infrastructure.
Business Changes: Did traffic increase? Did you launch in new markets? Business changes often require SLO adjustments.
Tightening SLOs
As systems mature and reliability improves, you might tighten targets. But only tighten when:
- You’ve consistently exceeded current targets for 3+ months
- Teams have capacity for additional reliability investment
- Users would benefit from improved reliability
- The business case justifies the engineering cost
Don’t tighten targets just because you can. Tightening increases operational burden. Make sure the benefit outweighs the cost.
Loosening SLOs
Sometimes you need to loosen targets. This isn’t failure—it’s honesty.
Loosen when:
- Current targets are unrealistic given infrastructure constraints
- You’re constantly exhausting error budgets and freezing deployments
- Users don’t notice or care about the additional reliability
It’s better to set achievable targets you consistently meet than aspirational targets you constantly violate.
Deprecating SLOs
Not all services need SLOs forever. Internal tools, experimental features, or sunset products might not justify the maintenance overhead.
Regularly audit your SLOs. Remove measurements for services that no longer justify the effort.
Common SLO Mistakes to Avoid
Measuring Too Many Things
Teams often create dozens of SLOs, tracking every conceivable metric. This creates noise and diffuses focus.
Start with one or two SLIs per service. Add more only when clearly necessary. More SLOs isn’t better—focused SLOs are better.
Aiming for Perfect Reliability
100 percent availability is impossible. Network partitions happen. Hardware fails. Software has bugs. Cloud providers have outages.
Chasing perfection wastes engineering effort on diminishing returns. Set realistic targets that satisfy users while preserving development velocity.
Ignoring Error Budget Signals
Error budgets only work when teams respect them. If you exhaust your budget but keep deploying anyway, you’ve created meaningless dashboards.
Enforcement requires discipline. When the budget runs out, velocity must decrease. No exceptions, no politics.
Setting Targets Without User Research
Many teams set SLOs based on what sounds professional rather than what users actually need. They chase four nines because it looks impressive, not because users require that level of reliability.
Talk to users. Understand their expectations. Set targets that reflect real needs, not aspirational engineering goals.
Treating All Services Identically
Your authentication service isn’t the same as your internal admin panel. Critical user paths need tight SLOs. Background jobs can tolerate looser targets.
Differentiate reliability investment based on business impact. Not everything needs identical SLOs.
Conclusion: Engineering Reliability Through Measurement
Service Level Objectives transform reliability from a vague aspiration into an engineering discipline. They force teams to explicitly define how reliable each service needs to be, measure progress toward those targets, and make data-driven decisions about when to prioritize reliability over features.
The most effective SLOs aren’t the most aggressive—they’re the most realistic. They reflect actual user needs, align with current infrastructure capabilities, and create sustainable practices teams can maintain long-term.
Start small. Pick one service. Define one or two SLIs. Set realistic targets based on historical data. Measure compliance. Adjust quarterly based on what you learn.
Over time, SLOs become the foundation for how teams think about reliability. Not something to achieve once, but a continuous practice of measurement, learning, and improvement.
Explore In Upstat
Track uptime percentages, response time percentiles, and availability metrics that provide the SLI data teams need to measure progress against SLO targets.
