
Service Level Indicators: Measuring What Matters

Service Level Indicators are quantitative measurements that reflect service behavior from the user's perspective. This guide explains what SLIs are, how to choose meaningful indicators, and practical strategies for measuring reliability without drowning in metrics.

November 11, 2025 5 min read
sre

What Are Service Level Indicators?

A Service Level Indicator is a quantitative measurement that reflects how well your service is performing from the user’s perspective. It’s not about server CPU utilization or database query counts—those are internal metrics. SLIs measure what users experience: Is the service available? Is it fast? Is it reliable?

Think of SLIs as the vital signs of your service. Just as doctors track heart rate, blood pressure, and temperature to assess patient health, engineers track availability, latency, and error rates to assess service health.

The key word is “indicator.” SLIs don’t tell you everything about your system, but they indicate whether users are having a good experience or hitting problems. And that’s what matters.

Why SLIs Matter

Most teams drown in metrics. Application performance monitoring tools can track hundreds of signals—memory usage, garbage collection pauses, network packet loss, cache hit rates, thread pool saturation. All valuable data for debugging, but not all meaningful for reliability.

SLIs force you to answer a critical question: Which metrics actually reflect user experience?

If your database is under heavy load but users don’t notice slower response times, is that a problem? Maybe not yet. If your error rate spikes but it’s all internal retries that succeed before users see failures, does it matter to reliability? Probably not.

SLIs cut through the noise. They’re the metrics that, when degraded, mean users are having a bad time. When your SLIs are healthy, your service is healthy—regardless of what internal metrics might suggest.

Common Types of SLIs

SLIs measure different dimensions of service behavior. The right combination depends on what your users need and what failure modes impact them most.

Availability

Definition: The proportion of requests that succeed.

Measurement:

Availability = (Successful Requests / Total Requests) * 100

Example:

  • 99.9% of API requests return successful status codes (200, 201, etc.)
  • 99.95% of database queries complete without connection errors

Availability is the most common SLI because it’s simple to measure and users notice when services are down. But “up” doesn’t always mean “working well”—a service can be available but slow or throwing errors.
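
To make the formula concrete, here is a minimal sketch in Python, assuming each request is represented by its HTTP status code and that any 2xx or 3xx response counts as a success (a definition each team chooses for itself):

  # Availability sketch: proportion of requests that succeeded.
  # Assumes 2xx/3xx responses count as success -- adjust to your own definition.
  def availability(status_codes: list[int]) -> float:
      if not status_codes:
          return 100.0  # no traffic, nothing failed
      successes = sum(1 for code in status_codes if 200 <= code < 400)
      return successes / len(status_codes) * 100

  print(availability([200, 201, 503, 200, 200]))  # 80.0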

Latency

Definition: How fast your service responds, typically measured as percentiles.

Measurement:

  • P50 (median): Half of requests complete faster than this threshold
  • P95: 95% of requests complete faster than this threshold
  • P99: 99% of requests complete faster than this threshold

Example:

  • P95 response time under 300ms for page loads
  • P99 API latency under 500ms

Latency matters because slow systems frustrate users even when they’re technically “available.” Measuring percentiles instead of averages keeps slow outliers from disappearing into an otherwise healthy-looking mean.
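
As an illustration, here is a small Python sketch that computes nearest-rank percentiles from raw response times and checks them against thresholds like the ones above; the sample data and the percentile method are both assumptions for the example:

  import math

  # Latency sketch: nearest-rank percentile over observed response times (ms).
  def percentile(samples: list[float], p: float) -> float:
      ordered = sorted(samples)
      rank = max(1, math.ceil(p / 100 * len(ordered)))  # nearest-rank method
      return ordered[rank - 1]

  response_times_ms = [110, 95, 140, 180, 120, 210, 160, 130, 250, 175,
                       145, 200, 115, 230, 165, 105, 190, 260, 290, 850]
  p95 = percentile(response_times_ms, 95)
  p99 = percentile(response_times_ms, 99)
  print(f"P95 = {p95}ms (target < 300ms), P99 = {p99}ms (target < 500ms)")
  # P95 = 290ms (target < 300ms), P99 = 850ms (target < 500ms)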

Error Rate

Definition: The proportion of requests that fail.

Measurement:

Error Rate = (Failed Requests / Total Requests) * 100

Example:

  • Under 0.1% of requests return 5xx errors
  • Under 0.01% of payment transactions fail

Error rate captures correctness. Even if your service is fast and available, if every tenth request throws an error, users will abandon it.
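
A sketch of the same idea in Python, counting only server-side (5xx) responses as failures; whether 4xx client errors count against the SLI is a judgment call, and excluding them here is just one reasonable choice:

  # Error-rate sketch: proportion of requests that failed with a server error.
  # Treats 5xx responses as failures; 4xx client errors are excluded here,
  # which is a per-team decision, not a rule.
  def error_rate(status_codes: list[int]) -> float:
      if not status_codes:
          return 0.0
      failures = sum(1 for code in status_codes if code >= 500)
      return failures / len(status_codes) * 100

  print(error_rate([200, 200, 404, 500, 200]))  # 20.0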

Throughput

Definition: The volume of work your service successfully processes.

Measurement:

  • Requests per second (RPS)
  • Transactions processed per minute
  • Messages delivered per hour

Example:

  • Process at least 1,000 orders per minute during peak hours
  • Handle 10,000 concurrent users without degradation

Throughput matters for capacity planning and understanding whether your service can handle expected load. It’s less commonly used as a standalone SLI but critical for high-volume systems.
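
As a rough sketch, a throughput SLI can be derived from completion timestamps alone; the function name and the per-minute framing mirror the order-processing example above and are otherwise illustrative:

  from datetime import datetime, timedelta

  # Throughput sketch: successfully processed orders per minute, measured
  # over the most recent window of activity.
  def orders_per_minute(completed_at: list[datetime], window: timedelta) -> float:
      if not completed_at:
          return 0.0
      cutoff = max(completed_at) - window
      recent = [t for t in completed_at if t >= cutoff]
      return len(recent) / (window.total_seconds() / 60)

  # e.g. check orders_per_minute(timestamps, timedelta(minutes=5)) >= 1000 at peak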

Choosing the Right SLIs

Not every metric deserves to be an SLI. The goal isn’t comprehensive coverage—it’s meaningful measurement. Here’s how to choose.

Start With User Journeys

Map out critical user workflows:

  • How do users interact with your service?
  • What actions must succeed for users to accomplish their goals?
  • Where do users notice slowness or failures?

For an e-commerce site, critical journeys might include:

  • Browse catalog (latency-sensitive)
  • Add to cart (availability-critical)
  • Complete checkout (error-sensitive)

Each journey suggests relevant SLIs: latency for browsing, availability for cart operations, error rate for payment processing.

Measure What Users Care About

Users don’t care about internal system health—they care about their experience. Choose SLIs that reflect user-facing behavior, not backend implementation details.

Good SLIs:

  • Page load time (users notice slow pages)
  • API success rate (users notice errors)
  • Search result accuracy (users notice irrelevant results)

Bad SLIs (as primary indicators):

  • Database connection pool utilization (users don’t see this)
  • Memory garbage collection frequency (internal optimization metric)
  • Cache hit rate (useful for debugging, not user experience)

That doesn’t mean internal metrics aren’t valuable—they’re essential for debugging and capacity planning. But they shouldn’t be your primary reliability indicators.

Keep It Simple

Start with 2-4 SLIs. More than that and you’ll struggle to maintain focus. Every SLI you track requires monitoring, alerting, and analysis. Don’t create SLIs for every possible dimension unless there’s clear value.

A typical service might track:

  1. Availability: 99.9% of requests succeed
  2. Latency: P95 response time under 200ms
  3. Error rate: Under 0.1% of requests fail

Three indicators. Easy to understand. Clear to communicate. Simple to monitor.

You can always add more later if specific user pain points emerge. But starting lean forces you to prioritize what actually matters.
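
One lightweight way to keep those three definitions visible is a small declarative catalog the whole team can read at a glance; the structure below is an illustrative sketch, not any particular tool's schema:

  # A hypothetical SLI catalog for one service -- field names are illustrative.
  SLIS = {
      "availability": {"what": "requests that succeed",   "target": "99.9%"},
      "latency_p95":  {"what": "P95 response time",       "target": "< 200ms"},
      "error_rate":   {"what": "requests returning 5xx",  "target": "< 0.1%"},
  }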

Make Them Measurable

An SLI you can’t measure is useless. Before committing to an indicator, confirm you can collect the data reliably.

Questions to ask:

  • Do you have monitoring infrastructure to capture this metric?
  • Can you aggregate data over meaningful time windows?
  • Is the measurement accurate and consistent?
  • Can you detect when the SLI degrades?

If you can’t answer “yes” to these questions, either invest in better monitoring or choose a different SLI.

Measuring SLIs in Practice

Once you’ve chosen your indicators, you need infrastructure to measure and track them over time.

Data Collection

Most SLIs are calculated from metrics emitted by your services and infrastructure:

  • Availability: Count successful vs failed requests from load balancer or application logs
  • Latency: Collect response time distributions from request tracing
  • Error rate: Aggregate error status codes from application metrics

Monitoring platforms collect this raw data through agents, instrumentation, or log aggregation. Platforms like Upstat provide endpoint monitoring that tracks uptime percentages, response time percentiles, and failure rates—the foundational data teams use to calculate their Service Level Indicators.
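
As a sketch of how raw data becomes SLIs, a single pass over structured request records can feed both an availability and a latency indicator; the record fields used here ("status", "duration_ms") are an invented log schema for illustration:

  import math

  # Data-collection sketch: derive two SLIs from one batch of request records.
  # The record fields ("status", "duration_ms") are a made-up log schema.
  def compute_slis(records: list[dict]) -> dict:
      total = len(records)
      successes = sum(1 for r in records if 200 <= r["status"] < 400)
      durations = sorted(r["duration_ms"] for r in records)
      p95_index = max(0, math.ceil(0.95 * total) - 1)  # nearest-rank P95
      return {
          "availability_pct": successes / total * 100 if total else 100.0,
          "latency_p95_ms": durations[p95_index] if total else 0.0,
      }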

Aggregation Windows

SLIs are typically measured over rolling time windows:

  • Short-term (1 hour, 24 hours): Detect immediate degradation
  • Medium-term (7 days, 30 days): Understand reliability trends
  • Long-term (90 days, 1 year): Track improvements over time

Shorter windows catch problems faster. Longer windows smooth out noise and reveal patterns.
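
A minimal sketch of a rolling-window calculation, assuming per-request records carry a timestamp and a success flag (an invented shape for the example):

  from datetime import datetime, timedelta

  # Rolling-window sketch: availability over the last `window` of traffic.
  def windowed_availability(records: list[tuple[datetime, bool]],
                            now: datetime, window: timedelta) -> float:
      recent = [ok for ts, ok in records if now - window <= ts <= now]
      if not recent:
          return 100.0  # no traffic in the window, nothing failed
      return sum(recent) / len(recent) * 100

  # The same function answers different questions at different windows:
  #   windowed_availability(records, now, timedelta(hours=1))  -> catch sudden drops
  #   windowed_availability(records, now, timedelta(days=30))  -> track the trend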

Percentile Calculations

For latency SLIs, percentiles matter more than averages. A P95 latency of 500ms means 95% of requests complete within 500ms, even if the remaining 5% are slower.

Percentiles prevent outliers from hiding in averages. If your average latency is 100ms but your P99 is 10 seconds, most users have a fast experience—but 1% are suffering. Percentiles make that visible.
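
A quick illustrative calculation (with made-up numbers) shows how that happens:

  import statistics

  # 980 fast requests and 20 very slow ones -- numbers invented for illustration.
  latencies_ms = [100] * 980 + [10_000] * 20

  mean = statistics.mean(latencies_ms)                 # 298ms: looks tolerable
  p99 = statistics.quantiles(latencies_ms, n=100)[98]  # 10000ms: 2% wait ten seconds

  print(f"mean = {mean:.0f}ms, P99 = {p99:.0f}ms")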

Common Mistakes When Defining SLIs

Tracking Too Many Indicators

More SLIs don’t mean better observability. They mean divided attention.

If you track 20 different indicators, which ones do you prioritize when multiple degrade simultaneously? Which ones justify waking someone at 3 a.m.?

Fewer, well-chosen SLIs force clarity. They make it obvious what matters and what can wait.

Measuring Internal Metrics Instead of User Experience

CPU utilization, memory pressure, and database query time are useful debugging metrics. But they’re not SLIs unless users directly experience their impact.

A database might be running hot, but if caching prevents user-facing slowness, your latency SLI remains healthy. The database metric is worth investigating, but it’s not an indicator of poor user experience yet.

Setting Unrealistic Targets

SLIs measure what’s happening. SLOs define what you’re aiming for. Don’t confuse the two.

If your current availability is 99.5%, don’t set an SLI target of 99.99%. That’s an SLO aspiration, not a measurement. First measure accurately, then decide what target makes sense.

Ignoring the Cost of Measurement

Collecting, storing, and analyzing metrics has costs—infrastructure, engineering time, complexity. Don’t track indicators you won’t act on.

If an SLI never triggers alerts or influences decisions, why measure it? Either commit to using the data or stop collecting it.

Putting SLIs to Work

SLIs don’t just sit on dashboards. They drive decisions.

Alerting: When SLIs degrade below acceptable thresholds, alerts notify teams to investigate and respond.

Error budgets: SLIs feed into error budgets, which quantify how much unreliability you can tolerate before taking action.

Prioritization: When SLIs are healthy, teams can move fast. When SLIs degrade, reliability work takes priority over new features.

Communication: SLIs provide objective data for stakeholder conversations. Instead of “the service feels slow,” you can say “P95 latency increased from 200ms to 800ms.”
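
The error-budget arithmetic mentioned above is simple enough to sketch in a few lines; the 99.9% objective and the request counts are illustrative:

  # Error-budget sketch: how much of the allowed unreliability has been spent?
  slo_target = 0.999          # 99.9% availability objective (illustrative)
  total_requests = 2_000_000  # requests this period (illustrative)
  failed_requests = 1_400

  allowed_failures = (1 - slo_target) * total_requests  # 2,000 failures allowed
  budget_consumed = failed_requests / allowed_failures  # 0.70

  print(f"Error budget consumed: {budget_consumed:.0%}")  # Error budget consumed: 70%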

Conclusion: Measure What Matters

Service Level Indicators transform vague notions of “reliability” into concrete, measurable signals. They answer the fundamental question: Is our service working for users?

The best SLIs are simple, user-focused, and actionable. They cut through metric noise to surface what actually matters. And they provide the foundation for data-driven reliability decisions—not guesswork or politics.

Start with 2-4 indicators that reflect critical user journeys. Measure them consistently. Use them to guide engineering priorities. And resist the temptation to track everything just because you can.

Reliability isn’t about perfection. It’s about knowing what to measure, when to react, and how much unreliability you can tolerate while still delivering value to users. SLIs make that possible.

Explore In Upstat

Track uptime percentages, response time percentiles, and error rates—the foundational monitoring data teams use to measure their Service Level Indicators.