
Service Level Indicators: Measuring What Matters

Service Level Indicators are quantitative measurements that reflect service behavior from the user's perspective. This guide explains what SLIs are, how to choose meaningful indicators, and practical strategies for measuring reliability without drowning in metrics.

November 11, 2025 5 min read
sre

What Are Service Level Indicators?

A Service Level Indicator is a quantitative measurement that reflects how well your service is performing from the user’s perspective. It’s not about server CPU utilization or database query counts—those are internal metrics. SLIs measure what users experience: Is the service available? Is it fast? Is it reliable?

Think of SLIs as the vital signs of your service. Just as doctors track heart rate, blood pressure, and temperature to assess patient health, engineers track availability, latency, and error rates to assess service health.

The key word is “indicator.” SLIs don’t tell you everything about your system, but they indicate whether users are having a good experience or hitting problems. And that’s what matters.

Why SLIs Matter

Most teams drown in metrics. Application performance monitoring tools can track hundreds of signals—memory usage, garbage collection pauses, network packet loss, cache hit rates, thread pool saturation. All valuable data for debugging, but not all meaningful for reliability.

SLIs force you to answer a critical question: Which metrics actually reflect user experience?

If your database is under heavy load but users don’t notice slower response times, is that a problem? Maybe not yet. If your error rate spikes but it’s all internal retries that succeed before users see failures, does it matter to reliability? Probably not.

SLIs cut through the noise. They’re the metrics that, when degraded, mean users are having a bad time. When your SLIs are healthy, your service is healthy—regardless of what internal metrics might suggest.

Common Types of SLIs

SLIs measure different dimensions of service behavior. The right combination depends on what your users need and what failure modes impact them most.

Availability

Definition: The proportion of requests that succeed.

Measurement:

Availability = (Successful Requests / Total Requests) * 100

Example:

  • 99.9% of API requests return successful status codes (200, 201, etc.)
  • 99.95% of database queries complete without connection errors

Availability is the most common SLI because it’s simple to measure and users notice when services are down. But “up” doesn’t always mean “working well”—a service can be available but slow or throwing errors.
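
To make the formula concrete, here is a minimal sketch in Python, assuming each request is represented by its HTTP status code and that any 2xx or 3xx response counts as a success (a definition each team chooses for itself):

  # Availability sketch: proportion of requests that succeeded.
  # Assumes 2xx/3xx responses count as success -- adjust to your own definition.
  def availability(status_codes: list[int]) -> float:
      if not status_codes:
          return 100.0  # no traffic, nothing failed
      successes = sum(1 for code in status_codes if 200 <= code < 400)
      return successes / len(status_codes) * 100

  print(availability([200, 201, 503, 200, 200]))  # 80.0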

Latency

Definition: How fast your service responds, typically measured as percentiles.

Measurement:

  • P50 (median): Half of requests complete faster than this threshold
  • P95: 95% of requests complete faster than this threshold
  • P99: 99% of requests complete faster than this threshold

Example:

  • P95 response time under 300ms for page loads
  • P99 API latency under 500ms

Latency matters because slow systems frustrate users even when they’re technically “available.” Measuring percentiles instead of averages keeps slow outliers from disappearing into an otherwise healthy-looking mean.
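
As an illustration, here is a small Python sketch that computes nearest-rank percentiles from raw response times and checks them against thresholds like the ones above; the sample data and the percentile method are both assumptions for the example:

  import math

  # Latency sketch: nearest-rank percentile over observed response times (ms).
  def percentile(samples: list[float], p: float) -> float:
      ordered = sorted(samples)
      rank = max(1, math.ceil(p / 100 * len(ordered)))  # nearest-rank method
      return ordered[rank - 1]

  response_times_ms = [110, 95, 140, 180, 120, 210, 160, 130, 250, 175,
                       145, 200, 115, 230, 165, 105, 190, 260, 290, 850]
  p95 = percentile(response_times_ms, 95)
  p99 = percentile(response_times_ms, 99)
  print(f"P95 = {p95}ms (target < 300ms), P99 = {p99}ms (target < 500ms)")
  # P95 = 290ms (target < 300ms), P99 = 850ms (target < 500ms)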

Error Rate

Definition: The proportion of requests that fail.

Measurement:

Error Rate = (Failed Requests / Total Requests) * 100

Example:

  • Under 0.1% of requests return 5xx errors
  • Under 0.01% of payment transactions fail

Error rate captures correctness. Even if your service is fast and available, if every tenth request throws an error, users will abandon it.
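
A sketch of the same idea in Python, counting only server-side (5xx) responses as failures; whether 4xx client errors count against the SLI is a judgment call, and excluding them here is just one reasonable choice:

  # Error-rate sketch: proportion of requests that failed with a server error.
  # Treats 5xx responses as failures; 4xx client errors are excluded here,
  # which is a per-team decision, not a rule.
  def error_rate(status_codes: list[int]) -> float:
      if not status_codes:
          return 0.0
      failures = sum(1 for code in status_codes if code >= 500)
      return failures / len(status_codes) * 100

  print(error_rate([200, 200, 404, 500, 200]))  # 20.0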

Throughput

Definition: The volume of work your service successfully processes.

Measurement:

  • Requests per second (RPS)
  • Transactions processed per minute
  • Messages delivered per hour

Example:

  • Process at least 1,000 orders per minute during peak hours
  • Handle 10,000 concurrent users without degradation

Throughput matters for capacity planning and understanding whether your service can handle expected load. It’s less commonly used as a standalone SLI but critical for high-volume systems.
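
As a rough sketch, a throughput SLI can be derived from completion timestamps alone; the function name and the per-minute framing mirror the order-processing example above and are otherwise illustrative:

  from datetime import datetime, timedelta

  # Throughput sketch: successfully processed orders per minute, measured
  # over the most recent window of activity.
  def orders_per_minute(completed_at: list[datetime], window: timedelta) -> float:
      if not completed_at:
          return 0.0
      cutoff = max(completed_at) - window
      recent = [t for t in completed_at if t >= cutoff]
      return len(recent) / (window.total_seconds() / 60)

  # e.g. check orders_per_minute(timestamps, timedelta(minutes=5)) >= 1000 at peak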

Choosing the Right SLIs

Not every metric deserves to be an SLI. The goal isn’t comprehensive coverage—it’s meaningful measurement. Here’s how to choose.

Start With User Journeys

Map out critical user workflows:

  • How do users interact with your service?
  • What actions must succeed for users to accomplish their goals?
  • Where do users notice slowness or failures?

For an e-commerce site, critical journeys might include:

  • Browse catalog (latency-sensitive)
  • Add to cart (availability-critical)
  • Complete checkout (error-sensitive)

Each journey suggests relevant SLIs: latency for browsing, availability for cart operations, error rate for payment processing.

Measure What Users Care About

Users don’t care about internal system health—they care about their experience. Choose SLIs that reflect user-facing behavior, not backend implementation details.

Good SLIs:

  • Page load time (users notice slow pages)
  • API success rate (users notice errors)
  • Search result accuracy (users notice irrelevant results)

Bad SLIs (as primary indicators):

  • Database connection pool utilization (users don’t see this)
  • Memory garbage collection frequency (internal optimization metric)
  • Cache hit rate (useful for debugging, not user experience)

That doesn’t mean internal metrics aren’t valuable—they’re essential for debugging and capacity planning. But they shouldn’t be your primary reliability indicators.

Keep It Simple

Start with 2-4 SLIs. More than that and you’ll struggle to maintain focus. Every SLI you track requires monitoring, alerting, and analysis. Don’t create SLIs for every possible dimension unless there’s clear value.

A typical service might track:

  1. Availability: 99.9% of requests succeed
  2. Latency: P95 response time under 200ms
  3. Error rate: Under 0.1% of requests fail

Three indicators. Easy to understand. Clear to communicate. Simple to monitor.

You can always add more later if specific user pain points emerge. But starting lean forces you to prioritize what actually matters.
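
One lightweight way to keep those three definitions visible is a small declarative catalog the whole team can read at a glance; the structure below is an illustrative sketch, not any particular tool's schema:

  # A hypothetical SLI catalog for one service -- field names are illustrative.
  SLIS = {
      "availability": {"what": "requests that succeed",   "target": "99.9%"},
      "latency_p95":  {"what": "P95 response time",       "target": "< 200ms"},
      "error_rate":   {"what": "requests returning 5xx",  "target": "< 0.1%"},
  }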

Make Them Measurable

An SLI you can’t measure is useless. Before committing to an indicator, confirm you can collect the data reliably.

Questions to ask:

  • Do you have monitoring infrastructure to capture this metric?
  • Can you aggregate data over meaningful time windows?
  • Is the measurement accurate and consistent?
  • Can you detect when the SLI degrades?

If you can’t answer “yes” to these questions, either invest in better monitoring or choose a different SLI.

Measuring SLIs in Practice

Once you’ve chosen your indicators, you need infrastructure to measure and track them over time.

Data Collection

Most SLIs are calculated from metrics emitted by your services and infrastructure:

  • Availability: Count successful vs failed requests from load balancer or application logs
  • Latency: Collect response time distributions from request tracing
  • Error rate: Aggregate error status codes from application metrics

Monitoring platforms collect this raw data through agents, instrumentation, or log aggregation. Platforms like Upstat provide endpoint monitoring that tracks uptime percentages, response time percentiles, and failure rates—the foundational data teams use to calculate their Service Level Indicators.
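
As a sketch of how raw data becomes SLIs, a single pass over structured request records can feed both an availability and a latency indicator; the record fields used here ("status", "duration_ms") are an invented log schema for illustration:

  import math

  # Data-collection sketch: derive two SLIs from one batch of request records.
  # The record fields ("status", "duration_ms") are a made-up log schema.
  def compute_slis(records: list[dict]) -> dict:
      total = len(records)
      successes = sum(1 for r in records if 200 <= r["status"] < 400)
      durations = sorted(r["duration_ms"] for r in records)
      p95_index = max(0, math.ceil(0.95 * total) - 1)  # nearest-rank P95
      return {
          "availability_pct": successes / total * 100 if total else 100.0,
          "latency_p95_ms": durations[p95_index] if total else 0.0,
      }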

Aggregation Windows

SLIs are typically measured over rolling time windows:

  • Short-term (1 hour, 24 hours): Detect immediate degradation
  • Medium-term (7 days, 30 days): Understand reliability trends
  • Long-term (90 days, 1 year): Track improvements over time

Shorter windows catch problems faster. Longer windows smooth out noise and reveal patterns.
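
A minimal sketch of a rolling-window calculation, assuming per-request records carry a timestamp and a success flag (an invented shape for the example):

  from datetime import datetime, timedelta

  # Rolling-window sketch: availability over the last `window` of traffic.
  def windowed_availability(records: list[tuple[datetime, bool]],
                            now: datetime, window: timedelta) -> float:
      recent = [ok for ts, ok in records if now - window <= ts <= now]
      if not recent:
          return 100.0  # no traffic in the window, nothing failed
      return sum(recent) / len(recent) * 100

  # The same function answers different questions at different windows:
  #   windowed_availability(records, now, timedelta(hours=1))  -> catch sudden drops
  #   windowed_availability(records, now, timedelta(days=30))  -> track the trend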

Percentile Calculations

For latency SLIs, percentiles matter more than averages. A P95 latency of 500ms means 95% of requests complete within 500ms, even if the remaining 5% are slower.

Percentiles prevent outliers from hiding in averages. If your average latency is 100ms but your P99 is 10 seconds, most users have a fast experience—but 1% are suffering. Percentiles make that visible.
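
A quick illustrative calculation (with made-up numbers) shows how that happens:

  import statistics

  # 980 fast requests and 20 very slow ones -- numbers invented for illustration.
  latencies_ms = [100] * 980 + [10_000] * 20

  mean = statistics.mean(latencies_ms)                 # 298ms: looks tolerable
  p99 = statistics.quantiles(latencies_ms, n=100)[98]  # 10000ms: 2% wait ten seconds

  print(f"mean = {mean:.0f}ms, P99 = {p99:.0f}ms")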

Common Mistakes When Defining SLIs

Tracking Too Many Indicators

More SLIs don’t mean better observability. They mean divided attention.

If you track 20 different indicators, which ones do you prioritize when multiple degrade simultaneously? Which ones justify waking someone at 3 a.m.?

Fewer, well-chosen SLIs force clarity. They make it obvious what matters and what can wait.

Measuring Internal Metrics Instead of User Experience

CPU utilization, memory pressure, and database query time are useful debugging metrics. But they’re not SLIs unless users directly experience their impact.

A database might be running hot, but if caching prevents user-facing slowness, your latency SLI remains healthy. The database metric is worth investigating, but it’s not an indicator of poor user experience yet.

Setting Unrealistic Targets

SLIs measure what’s happening. SLOs define what you’re aiming for. Don’t confuse the two.

If your current availability is 99.5%, don’t set an SLI target of 99.99%. That’s an SLO aspiration, not a measurement. First measure accurately, then decide what target makes sense.

Ignoring the Cost of Measurement

Collecting, storing, and analyzing metrics has costs—infrastructure, engineering time, complexity. Don’t track indicators you won’t act on.

If an SLI never triggers alerts or influences decisions, why measure it? Either commit to using the data or stop collecting it.

Putting SLIs to Work

SLIs don’t just sit on dashboards. They drive decisions.

Alerting: When SLIs degrade below acceptable thresholds, alerts notify teams to investigate and respond.

Error budgets: SLIs feed into error budgets, which quantify how much unreliability you can tolerate before taking action.

Prioritization: When SLIs are healthy, teams can move fast. When SLIs degrade, reliability work takes priority over new features.

Communication: SLIs provide objective data for stakeholder conversations. Instead of “the service feels slow,” you can say “P95 latency increased from 200ms to 800ms.”
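
The error-budget arithmetic mentioned above is simple enough to sketch in a few lines; the 99.9% objective and the request counts are illustrative:

  # Error-budget sketch: how much of the allowed unreliability has been spent?
  slo_target = 0.999          # 99.9% availability objective (illustrative)
  total_requests = 2_000_000  # requests this period (illustrative)
  failed_requests = 1_400

  allowed_failures = (1 - slo_target) * total_requests  # 2,000 failures allowed
  budget_consumed = failed_requests / allowed_failures  # 0.70

  print(f"Error budget consumed: {budget_consumed:.0%}")  # Error budget consumed: 70%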

Conclusion: Measure What Matters

Service Level Indicators transform vague notions of “reliability” into concrete, measurable signals. They answer the fundamental question: Is our service working for users?

The best SLIs are simple, user-focused, and actionable. They cut through metric noise to surface what actually matters. And they provide the foundation for data-driven reliability decisions—not guesswork or politics.

Start with 2-4 indicators that reflect critical user journeys. Measure them consistently. Use them to guide engineering priorities. And resist the temptation to track everything just because you can.

Reliability isn’t about perfection. It’s about knowing what to measure, when to react, and how much unreliability you can tolerate while still delivering value to users. SLIs make that possible.

Explore In Upstat

Track uptime percentages, response time percentiles, and error rates—the foundational monitoring data teams use to measure their Service Level Indicators.