What Are SLOs, SLAs, and SLIs?
If you’ve ever worked on production systems, you’ve probably heard these three acronyms thrown around. They sound similar, they’re often used interchangeably (incorrectly), and they all relate to service reliability. But they mean very different things.
- SLI (Service Level Indicator) - A quantitative measurement of service behavior
- SLO (Service Level Objective) - An internal target for that measurement
- SLA (Service Level Agreement) - A contractual commitment with consequences
Understanding the distinction isn’t just academic. These concepts shape how teams prioritize work, how much risk they’re willing to take, and how they communicate reliability to users and stakeholders.
Service Level Indicators (SLIs)
An SLI is a metric that measures some aspect of your service’s behavior. It’s the raw data that tells you how your system is actually performing.
Good SLIs are:
- Measurable - You can collect the data
- Meaningful - They reflect user experience
- Actionable - You can improve them through engineering work
Common SLI Examples
Availability: Percentage of successful requests
- Measured as: (successful_requests / total_requests) * 100
- Example: 99.95% of API requests return 200 status codes
Latency: How fast your service responds
- Measured as: 95th percentile response time
- Example: 95% of requests complete in under 200ms
Error Rate: How often requests fail
- Measured as: (failed_requests / total_requests) * 100
- Example: 0.1% of requests return 5xx errors
Durability: Data retention over time
- Measured as: Percentage of data that remains intact
- Example: 99.999999999% of objects stored are retrievable
The key is choosing SLIs that matter to your users. If your service feels slow despite high availability, maybe latency is more important than uptime. If users don’t notice occasional errors, maybe throughput matters more than error rate.
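As a minimal sketch, the availability, error rate, and latency formulas above could be computed from raw request records like this. The Request type and the function names are hypothetical, "successful" is assumed to mean any non-5xx response, and the percentile uses the simple nearest-rank method; a real monitoring system may define each of these differently.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code of the response
    latency_ms: float  # response time in milliseconds


def availability_pct(requests):
    """Percentage of requests that succeeded (assumed here: any non-5xx status)."""
    if not requests:
        raise ValueError("no requests to measure")
    ok = sum(1 for r in requests if r.status < 500)
    return 100 * ok / len(requests)


def error_rate_pct(requests):
    """Percentage of requests that returned a 5xx status."""
    if not requests:
        raise ValueError("no requests to measure")
    failed = sum(1 for r in requests if r.status >= 500)
    return 100 * failed / len(requests)


def latency_percentile_ms(requests, pct=95):
    """Nearest-rank percentile of response time, e.g. pct=95 for p95."""
    if not requests:
        raise ValueError("no requests to measure")
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Illustrative data: 18 fast successes, one slow success, one server error.
sample = [Request(200, 120.0)] * 18 + [Request(200, 450.0), Request(500, 900.0)]
print(availability_pct(sample), error_rate_pct(sample), latency_percentile_ms(sample))
```

With this definition of success, availability and error rate always sum to 100%, so the two SLIs carry the same information. Teams that count 4xx responses or timeouts differently will see them diverge.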
Service Level Objectives (SLOs)
An SLO is a target value or range for an SLI. It’s what you’re aiming for—your internal goal for how reliable your service should be.
SLOs answer the question: How good is good enough?
Why SLOs Matter
Setting an SLO forces teams to make explicit tradeoffs between reliability and velocity. Chasing 100% uptime sounds noble, but it’s impractical and expensive. Each nine you add to your availability target (99% to 99.9% to 99.99%) cuts the downtime you are allowed by a factor of ten, and the engineering and operational cost of meeting it rises steeply.
Example SLOs:
- API availability: 99.9% over a 30-day window
- Page load latency: 95th percentile less than 500ms
- Database write error rate: less than 0.01%
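One way to make targets like these checkable is to store each SLO as data and compare it with the measured SLI. The sketch below is illustrative only: the SLO class, the metric names, and the measured values are all made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    name: str
    target: float
    higher_is_better: bool  # True for availability, False for latency or error rate

    def is_met(self, measured: float) -> bool:
        """Compare a measured SLI value against this objective."""
        if self.higher_is_better:
            return measured >= self.target
        return measured < self.target


# The three example SLOs from the list above.
slos = [
    SLO("api_availability_pct", 99.9, higher_is_better=True),
    SLO("page_load_p95_ms", 500, higher_is_better=False),
    SLO("db_write_error_rate_pct", 0.01, higher_is_better=False),
]

# Hypothetical measurements for one 30-day window.
measured = {
    "api_availability_pct": 99.93,
    "page_load_p95_ms": 540.0,
    "db_write_error_rate_pct": 0.004,
}

results = {slo.name: slo.is_met(measured[slo.name]) for slo in slos}
for name, ok in results.items():
    print(f"{name}: {'met' if ok else 'missed'}")
```

Recording the direction of each target explicitly avoids the common bug of comparing a "lower is better" metric the wrong way round.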
Error Budgets
An SLO implies an error budget—the amount of unreliability you’re allowed before breaking your target.
If your SLO is 99.9% availability over 30 days, you have:
- Total minutes in 30 days: 43,200
- Error budget: 43,200 × 0.001 = 43.2 minutes of downtime allowed
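The same arithmetic works for any target and window. A minimal sketch, where the function name and the default 30-day window are assumptions for illustration:

```python
def error_budget_minutes(slo_pct, window_days=30):
    """Minutes of downtime an availability SLO allows over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - slo_pct) / 100


print(f"{error_budget_minutes(99.9):.1f} minutes")   # the 99.9% / 30-day case above
print(f"{error_budget_minutes(99.99):.2f} minutes")  # a tighter target, same window
```

For 99.9% over 30 days this reproduces the 43.2 minutes above. At 99.99% the budget shrinks to roughly 4.3 minutes.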
Once you’ve burned through your error budget, the usual policy is to pause new feature deployments and focus on reliability work until the budget recovers. This creates a forcing function: if you want to keep shipping, you need to keep services reliable.
Error budgets align incentives between product teams (who want to move fast) and reliability teams (who want stability). They provide a data-driven answer to “should we slow down or keep shipping?”
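The "slow down or keep shipping" question can be reduced to a small check against the remaining budget. This is a sketch under stated assumptions: the function names are hypothetical, downtime is tracked in minutes, and the rule is the simplest possible one (freeze once the budget is fully spent). Real policies often add warning thresholds or burn-rate alerts.

```python
def error_budget_status(slo_pct, downtime_minutes, window_days=30):
    """Return (remaining_minutes, consumed_fraction) for the current window."""
    if not 0 < slo_pct < 100:
        raise ValueError("slo_pct must be strictly between 0 and 100")
    budget = window_days * 24 * 60 * (100 - slo_pct) / 100
    remaining = budget - downtime_minutes
    consumed_fraction = downtime_minutes / budget
    return remaining, consumed_fraction


def release_decision(slo_pct, downtime_minutes, window_days=30):
    """'ship' while budget remains, 'freeze' once it is spent (assumed policy)."""
    remaining, _ = error_budget_status(slo_pct, downtime_minutes, window_days)
    return "ship" if remaining > 0 else "freeze"


print(release_decision(99.9, downtime_minutes=10))  # well inside the 43.2-minute budget
print(release_decision(99.9, downtime_minutes=50))  # budget exhausted
```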
Service Level Agreements (SLAs)
An SLA is a contractual commitment to your users, backed by financial or legal consequences if you fail to meet it.
SLAs should be more conservative than SLOs. Don’t promise externally what you’re barely achieving internally.
SLA Structure
A typical SLA includes:
- The commitment - What you promise (e.g., 99.95% uptime)
- The measurement window - How you measure it (e.g., monthly)
- The consequences - What happens if you miss (e.g., service credits, refunds)
- Exclusions - What doesn’t count (e.g., planned maintenance, customer-caused issues)
Example SLA:
We guarantee 99.95% uptime for our API service, measured monthly. If uptime falls below this threshold (excluding scheduled maintenance), customers will receive a 10% service credit for that month.
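The terms of this example SLA can be expressed as a short calculation. The sketch below assumes one particular reading of the exclusion clause: scheduled maintenance minutes are removed from the measured window entirely. Actual contracts define this in their own wording. The function names are hypothetical.

```python
def monthly_uptime_pct(total_minutes, unplanned_downtime_minutes, excluded_minutes=0):
    """Uptime percentage over the window, ignoring excluded (e.g. maintenance) minutes."""
    measured = total_minutes - excluded_minutes
    if measured <= 0:
        raise ValueError("measured window must be positive")
    return 100 * (measured - unplanned_downtime_minutes) / measured


def service_credit_pct(uptime_pct, sla_pct=99.95, credit_pct=10.0):
    """Credit owed for the month: credit_pct if uptime fell below the SLA, else 0."""
    return credit_pct if uptime_pct < sla_pct else 0.0


minutes_in_month = 30 * 24 * 60  # assumes a 30-day month
uptime = monthly_uptime_pct(minutes_in_month, unplanned_downtime_minutes=30, excluded_minutes=60)
print(f"uptime {uptime:.3f}%, credit {service_credit_pct(uptime)}%")
```

In this example, 30 minutes of unplanned downtime exceeds the roughly 21.6 minutes that 99.95% allows in a 30-day month, so the credit applies.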
Why the Gap Between SLO and SLA?
If your SLA is 99.95% and your SLO is also 99.95%, you have zero room for error. Any incident that pushes you past your internal target also puts you in breach of contract.
Most teams set SLOs tighter than SLAs to create a buffer:
- SLA: 99.95% (what we promise customers)
- SLO: 99.99% (what we aim for internally)
This buffer gives you room to detect problems, respond to incidents, and improve reliability before customers are impacted—or before you owe refunds.
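To see how much room that gap buys, compare the downtime each target allows. This sketch assumes a 30-day month as the window, and the names are illustrative.

```python
WINDOW_MINUTES = 30 * 24 * 60  # assumed 30-day month


def allowed_downtime_minutes(target_pct, window_minutes=WINDOW_MINUTES):
    """Downtime, in minutes, that an availability target permits over the window."""
    return window_minutes * (100 - target_pct) / 100


sla_allowance = allowed_downtime_minutes(99.95)  # external promise
slo_allowance = allowed_downtime_minutes(99.99)  # internal target
buffer_minutes = sla_allowance - slo_allowance

print(f"SLA allows {sla_allowance:.2f} min, SLO allows {slo_allowance:.2f} min, "
      f"buffer {buffer_minutes:.2f} min")
```

Under these assumptions the SLA tolerates about 21.6 minutes a month and the SLO about 4.3, leaving roughly 17 minutes between missing the internal target and owing credits.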
How They Work Together
Here’s how SLIs, SLOs, and SLAs work together in practice:
- Choose meaningful SLIs that reflect user experience (availability, latency, error rate)
- Set realistic SLOs based on current performance and reliability investment
- Offer conservative SLAs to customers, with a buffer below your SLOs
- Monitor SLIs continuously to detect when you’re approaching SLO violations
- Use error budgets to balance feature velocity with reliability work
When you’re exceeding your SLOs, you can take more risks—ship faster, experiment more, deploy more frequently. When you’re burning through your error budget, you slow down and invest in stability.
Common Mistakes
Setting Unrealistic SLOs
An SLO of 99.999% (“five nines”) sounds impressive, but it allows only 5.26 minutes of downtime per year. For most services, that’s unrealistic and counterproductive. It forces teams to overinvest in reliability at the expense of feature development.
Instead, set SLOs based on:
- User expectations - How much downtime would users actually notice or care about?
- Current performance - Where are you now? Aim for incremental improvement.
- Business impact - What’s the cost of downtime versus the cost of reliability work?
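When weighing these factors, it helps to see what each extra nine actually permits. The sketch below assumes an average year of 365.25 days, and the function name is illustrative.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, including leap days


def downtime_per_year_minutes(availability_pct):
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (100 - availability_pct) / 100


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_per_year_minutes(target)
    print(f"{target}% -> {minutes:,.2f} minutes/year ({minutes / 60:,.2f} hours)")
```

Each added nine divides the allowance by ten: about 3.65 days a year at 99%, under nine hours at 99.9%, under an hour at 99.99%, and the 5.26 minutes quoted above at 99.999%.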
Using Availability as the Only SLI
Availability is important, but it’s not the whole story. A service can be “up” but unusable if it’s slow, throwing errors, or losing data.
Use multiple SLIs to capture different dimensions of reliability:
- Availability (is it up?)
- Latency (is it fast?)
- Error rate (is it working correctly?)
- Throughput (can it handle the load?)
Treating SLAs as Aspirations
SLAs are legal commitments. If you can’t meet them consistently, don’t put them in a contract. Your SLA should be comfortably achievable based on historical performance—with room for the occasional bad month.
Putting It Into Practice
For teams managing production systems, tracking SLIs and measuring against SLOs is essential for maintaining reliability without sacrificing velocity. Platforms like Upstat help teams monitor uptime, track incidents, and maintain visibility into service health across multiple systems. Whether you’re defining your first SLO or refining error budgets, having the right tools to measure and respond to service degradation makes all the difference.
Conclusion: Reliability as a Negotiation
SLOs, SLAs, and SLIs force teams to have honest conversations about reliability. Not “how reliable should we be?” but “how reliable do we need to be, given our constraints?”
Perfect reliability is impossible. But predictable, measurable, and improvable reliability is achievable—and that’s what these concepts enable.
Define your SLIs. Set realistic SLOs. Offer conservative SLAs. And when you miss your targets, treat it as an opportunity to learn, improve, and recalibrate—not as a failure.
That’s the difference between chasing uptime and engineering reliability.