Your monitoring dashboard displays hundreds of metrics. CPU utilization. Memory consumption. Disk I/O. Network bandwidth. Database connections. Cache hit rates. Queue depths. Thread pool sizes.
When an incident strikes, which metrics matter?
The four golden signals—latency, traffic, errors, and saturation—cut through the noise. These metrics reveal system health from the user perspective and catch the majority of user-impacting issues before they escalate.
This guide explains what each signal measures, how to track them effectively, and when they indicate problems.
The Four Golden Signals
Google’s Site Reliability Engineering team identified these four metrics as the foundation of effective monitoring. They apply to virtually every service, from HTTP APIs to batch processing systems to message queues.
1. Latency
What it measures: Time to process requests.
Latency tracks how long your system takes to respond. This includes successful requests and failed requests, measured separately. A fast error response tells a different story than a slow success response.
Why it matters:
Users experience latency directly. A search that returns results in 50ms feels instant. The same search taking 5 seconds feels broken, even if it eventually succeeds.
Latency reveals resource constraints, external dependency problems, and inefficient code paths. Rising latency often precedes complete failures as systems approach capacity limits.
What to measure:
Break latency measurement into phases to identify bottlenecks (see the measurement sketch after this list):
- DNS resolution time: How long to resolve domain names
- TCP connection time: Time to establish network connection
- TLS handshake time: SSL/TLS negotiation duration
- Time to first byte (TTFB): Server processing before sending response
- Total response time: Complete request-response cycle
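As one way to capture this breakdown, libcurl exposes per-phase timings. The minimal sketch below assumes the Python pycurl package and a placeholder URL; note that libcurl reports each value as cumulative time from the start of the request, so individual phase durations are differences between consecutive values.

```python
# Sketch: per-phase latency for one HTTP request using pycurl.
# Assumes `pip install pycurl`; the URL is a placeholder.
from io import BytesIO
import pycurl

def latency_phases(url: str) -> dict:
    buffer = BytesIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.WRITEDATA, buffer)
    c.perform()
    phases = {
        "dns_ms":   c.getinfo(pycurl.NAMELOOKUP_TIME) * 1000,     # DNS resolution
        "tcp_ms":   c.getinfo(pycurl.CONNECT_TIME) * 1000,        # TCP connect (cumulative)
        "tls_ms":   c.getinfo(pycurl.APPCONNECT_TIME) * 1000,     # TLS handshake done (cumulative)
        "ttfb_ms":  c.getinfo(pycurl.STARTTRANSFER_TIME) * 1000,  # time to first byte (cumulative)
        "total_ms": c.getinfo(pycurl.TOTAL_TIME) * 1000,          # full request-response cycle
    }
    c.close()
    return phases

print(latency_phases("https://api.example.com/health"))
```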
Track latency at multiple percentiles, not just averages:
- p50 (median): Typical user experience
- p95: Experience for 1 in 20 requests
- p99: Experience for 1 in 100 requests
- p99.9: Worst-case latency for power users
A service with 100ms p50 latency and 5000ms p99 latency has severe outlier problems that averages hide.
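For illustration, here is a minimal nearest-rank percentile sketch over raw per-request latencies; the sample values are made up, but they show how a healthy-looking median can coexist with terrible tail latency.

```python
# Sketch: nearest-rank percentiles from a list of per-request latencies (ms).
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile; p is in the range (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [42, 48, 51, 55, 60, 75, 90, 120, 800, 5200]  # illustrative samples
for p in (50, 95, 99, 99.9):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")
```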
Practical thresholds:
Different services have different latency requirements:
- Interactive APIs: p95 under 300ms, p99 under 500ms
- Background jobs: p95 under 5 seconds, p99 under 10 seconds
- Batch operations: Completion within expected window
Example alert logic:
Alert when latency exceeds thresholds for sustained periods:
if (p95_latency > 500ms for 3 consecutive minutes) {
  alert: "API latency degraded"
  severity: high
  action: "Check database query performance and external dependencies"
}

2. Traffic
What it measures: Demand on your system.
Traffic quantifies how much your system is being used. For web services, this is requests per second. For databases, it might be queries per second or connection count. For storage systems, it could be read/write operations.
Why it matters:
Traffic reveals usage patterns, capacity planning needs, and attack signatures. Sudden traffic spikes might indicate viral growth, marketing campaigns, or DDoS attacks. Unexpected traffic drops suggest upstream failures preventing requests from reaching your service.
Traffic also provides context for other metrics. An error rate of 5 percent means different things at 100 requests/second versus 10,000 requests/second.
What to measure:
Track traffic across multiple dimensions (see the labeling sketch after this list):
- Total volume: Overall requests per second
- Success traffic: Requests that complete successfully
- Error traffic: Requests that fail
- By endpoint: Which operations are most used
- By region: Geographic distribution of demand
- By user segment: Free tier versus paid users
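One common way to capture these dimensions is a labeled counter. The sketch below uses the Python prometheus_client library; the label values, metric names, and port are illustrative assumptions.

```python
# Sketch: a labeled request counter so traffic can be sliced by dimension.
# Assumes `pip install prometheus-client`; label values are illustrative.
from prometheus_client import Counter, start_http_server

REQUESTS = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["endpoint", "region", "status", "tier"],
)

def record_request(endpoint: str, region: str, status: int, tier: str) -> None:
    # One increment per request; dashboards derive requests/second from the counter.
    REQUESTS.labels(endpoint=endpoint, region=region, status=str(status), tier=tier).inc()

start_http_server(9100)  # expose /metrics for scraping
record_request("/search", "eu-west", 200, "paid")
record_request("/search", "us-east", 503, "free")
```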
Practical baselines:
Establish normal traffic ranges based on historical data (see the baseline sketch after this list):
- Weekday baseline: 1,000-1,500 requests/second
- Weekend baseline: 600-900 requests/second
- Peak hours: 2,000-2,500 requests/second
- Overnight minimum: 200-400 requests/second
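One simple way to derive baselines like these is to bucket historical samples by hour of week and take the median; the sketch below assumes you already have (timestamp, requests-per-second) pairs from your metrics store.

```python
# Sketch: hour-of-week traffic baselines from historical samples.
# `history` is assumed to be (unix_timestamp, requests_per_second) pairs.
from collections import defaultdict
from datetime import datetime, timezone
from statistics import median

def hourly_baselines(history: list[tuple[float, float]]) -> dict[tuple[int, int], float]:
    buckets: dict[tuple[int, int], list[float]] = defaultdict(list)
    for ts, rps in history:
        dt = datetime.fromtimestamp(ts, tz=timezone.utc)
        buckets[(dt.weekday(), dt.hour)].append(rps)
    # Median per (weekday, hour) bucket smooths out one-off spikes.
    return {key: median(values) for key, values in buckets.items()}

# Usage idea: compare current traffic against the baseline for the current hour,
# alerting when current_rps falls below 0.5x or rises above 2.0x that baseline.
```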
Example alert logic:
Alert on significant deviations from expected patterns:
if (current_traffic < 0.5 * baseline for 5 minutes) {
  alert: "Traffic drop detected"
  severity: high
  action: "Check upstream load balancers and DNS resolution"
}

if (current_traffic > 2.0 * baseline for 5 minutes) {
  alert: "Traffic spike detected"
  severity: medium
  action: "Verify capacity, check for attack patterns"
}

3. Errors
What it measures: Rate of failed requests.
Errors track requests that fail to complete successfully. This includes HTTP 5xx errors, exceptions, timeouts, and any request that does not produce the expected result for users.
Why it matters:
Errors directly impact user experience. A user seeing error messages cannot complete their work. High error rates indicate bugs, infrastructure failures, or capacity problems.
Error rates provide different information than latency. A service can respond quickly with errors, hiding the fact that nothing is working. A service can have low error rates while responding slowly, masking performance problems.
What to measure:
Track errors with granularity:
- Total error rate: Percentage of all requests that fail
- By error type: 500 Internal Server Error versus 503 Service Unavailable
- By endpoint: Which operations fail most often
- By user action: Login errors versus data processing errors
- Client errors versus server errors: 4xx (client issues) versus 5xx (server issues)
Focus monitoring and alerting on server errors (5xx). Client errors (4xx) might indicate client bugs or invalid requests but do not reflect service health.
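As a small sketch, the server error rate can be computed from a window of response status codes, following the 4xx/5xx split above; the sample window is illustrative.

```python
# Sketch: server error rate over a window of HTTP status codes.
def server_error_rate(status_codes: list[int]) -> float:
    """Fraction of requests in the window that returned a 5xx status."""
    if not status_codes:
        return 0.0
    server_errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return server_errors / len(status_codes)

window = [200, 200, 201, 404, 500, 200, 503, 200, 200, 200]  # illustrative
print(f"server error rate: {server_error_rate(window):.1%}")  # 20.0% here, far past 1%
```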
Practical thresholds:
Different error rates require different responses:
- Under 0.1 percent: Normal baseline for most services
- 0.1 to 1 percent: Investigate, may indicate emerging issues
- 1 to 5 percent: Significant degradation, page on-call team
- Over 5 percent: Critical incident, all hands response
Example alert logic:
Alert when error rates exceed acceptable thresholds:
if (server_error_rate > 1% for 3 consecutive checks) {
  alert: "Elevated server error rate"
  severity: critical
  action: "Check application logs, database connectivity, external dependencies"
}

4. Saturation
What it measures: How full your system is.
Saturation reveals how close your system is to maximum capacity. This includes CPU, memory, disk, network bandwidth, database connections, and any other resource that can run out.
Why it matters:
Saturation predicts future failures. A database connection pool at 95 percent usage will hit 100 percent soon, causing new requests to fail. A service consuming 90 percent of available CPU will slow down before crashing.
Saturation provides early warning. Monitoring saturation lets you add capacity before users experience degraded performance or errors.
What to measure:
Track saturation for all constrained resources:
- CPU utilization: Percentage of processing capacity used
- Memory utilization: Percentage of RAM consumed
- Disk utilization: Storage capacity used and I/O throughput
- Network bandwidth: Percentage of network capacity used
- Connection pools: Database connections, HTTP connections
- Queue depths: Messages waiting for processing
- Thread pools: Available threads for request handling
Each resource has different saturation thresholds. CPU can safely reach 80 percent, but database connection pools should stay under 70 percent to handle traffic spikes.
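Because safe limits differ per resource, a small threshold table keeps the checks explicit. The sketch below reuses the 80 percent CPU and 70 percent connection-pool limits mentioned above; the other thresholds and the sample utilization values are illustrative assumptions.

```python
# Sketch: per-resource saturation check with resource-specific thresholds.
# Utilization values are fractions of capacity (0.0 - 1.0); thresholds are assumptions.
SATURATION_THRESHOLDS = {
    "cpu": 0.80,
    "memory": 0.85,
    "disk": 0.80,
    "db_connection_pool": 0.70,
}

def saturated_resources(utilization: dict[str, float]) -> list[str]:
    """Return the resources currently at or above their saturation threshold."""
    return [
        name
        for name, value in utilization.items()
        if value >= SATURATION_THRESHOLDS.get(name, 0.80)
    ]

current = {"cpu": 0.62, "memory": 0.71, "disk": 0.55, "db_connection_pool": 0.74}
print(saturated_resources(current))  # ['db_connection_pool'] -> act before it hits 100%
```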
Practical thresholds:
Different resources have different safe limits:
- CPU: Alert above 80 percent sustained usage
- Memory: Alert above 85 percent to prevent OOM kills
- Disk: Alert above 80 percent capacity
- Connection pools: Alert above 70 percent utilization
- Queue depths: Alert when depth grows unbounded
Example alert logic:
Alert before resources exhaust completely:
if (database_connection_pool_usage > 70% for 10 minutes) {
  alert: "Database connection pool saturation"
  severity: high
  action: "Scale connection pool size or reduce query load"
}

How Golden Signals Work Together
Monitoring all four signals reveals the complete picture of system health. Each signal provides context for the others:
Scenario 1: High latency + low errors + high saturation
Your system is overloaded but still processing requests. Users experience slow responses, but requests succeed eventually. This indicates capacity constraints. Solution: Add resources or optimize performance.
Scenario 2: Low latency + high errors + low saturation
Requests fail quickly without resource strain. This suggests application bugs, configuration problems, or external dependency failures. Solution: Check application logs and external service status.
Scenario 3: Traffic spike + high saturation + stable latency + stable errors
Your system handles increased load successfully. Resources are utilized but within limits. This indicates healthy scaling. Action: Monitor for continued growth and plan capacity increases.
Scenario 4: Traffic drop + low errors + low latency + low saturation
Everything looks healthy from your service perspective, but traffic disappeared. This indicates upstream problems. Solution: Check load balancers, DNS, CDN, or upstream services preventing traffic from reaching you.
Measuring Golden Signals in Practice
Theoretical understanding helps, but practical implementation determines monitoring effectiveness.
Instrumentation approach:
Measure golden signals at the service boundary where requests enter your system:
- Application instrumentation: Libraries like Prometheus client or OpenTelemetry embedded in code
- Proxy instrumentation: API gateways or reverse proxies measuring traffic
- External monitoring: Synthetic checks simulating real user behavior
Combine approaches for comprehensive coverage. Application instrumentation provides detailed internal metrics. External monitoring validates what users experience.
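For the application-instrumentation side, a minimal sketch is a wrapper that records a latency histogram and an outcome counter at the service boundary. It assumes the Python prometheus_client library and a hypothetical handle() function standing in for your request handler.

```python
# Sketch: measuring latency, traffic, and errors at the service boundary.
# Assumes `pip install prometheus-client`; handle() is whatever serves the request.
import time
from prometheus_client import Counter, Histogram

REQUEST_LATENCY = Histogram(
    "request_latency_seconds", "Request latency", ["endpoint"]
)
REQUEST_COUNT = Counter(
    "requests_total", "Requests by outcome", ["endpoint", "outcome"]
)

def instrumented(endpoint: str, handle, *args, **kwargs):
    start = time.monotonic()
    try:
        result = handle(*args, **kwargs)
        REQUEST_COUNT.labels(endpoint=endpoint, outcome="success").inc()
        return result
    except Exception:
        REQUEST_COUNT.labels(endpoint=endpoint, outcome="error").inc()
        raise
    finally:
        # Histograms let the monitoring backend derive p50/p95/p99 latency.
        REQUEST_LATENCY.labels(endpoint=endpoint).observe(time.monotonic() - start)
```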
Aggregation windows:
Different signals need different measurement intervals:
- Latency: 1-minute rolling windows for p50/p95/p99 percentiles
- Traffic: 1-minute request counts for real-time visibility
- Errors: 1-minute error rates with 5-minute trending
- Saturation: 30-second sampling for resource utilization
Balance freshness against accuracy. Too frequent sampling creates noise. Too infrequent sampling misses brief spikes.
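A rolling window can be as simple as a timestamped buffer that drops old samples before each read. The sketch below keeps one minute of latency samples and answers percentile queries over them; the window length and sample values are illustrative.

```python
# Sketch: a 1-minute rolling window of latency samples for percentile queries.
import math
import time
from collections import deque

class RollingLatencyWindow:
    def __init__(self, window_seconds: float = 60.0):
        self.window_seconds = window_seconds
        self.samples: deque[tuple[float, float]] = deque()  # (timestamp, latency_ms)

    def record(self, latency_ms: float) -> None:
        self.samples.append((time.monotonic(), latency_ms))

    def percentile(self, p: float) -> float | None:
        cutoff = time.monotonic() - self.window_seconds
        while self.samples and self.samples[0][0] < cutoff:
            self.samples.popleft()  # drop samples older than the window
        if not self.samples:
            return None
        ordered = sorted(latency for _, latency in self.samples)
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]

window = RollingLatencyWindow()
window.record(120.0)
window.record(480.0)
print(window.percentile(95))
```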
Multi-region consideration:
Services deployed across multiple regions need region-specific baselines:
- Regional traffic patterns: Europe traffic peaks while US sleeps
- Regional latency baselines: Cross-ocean requests inherently slower
- Regional error correlation: Issues affecting single region versus global impact
Aggregate regional metrics for global overview while maintaining regional granularity for troubleshooting.
Alert Design Based on Golden Signals
Transform signal measurements into actionable alerts:
Alert on user impact, not internal metrics:
Bad alert: “CPU usage above 80 percent”
Good alert: “API p95 latency above 500ms for 5 minutes”
Bad alert: “Memory usage above 90 percent”
Good alert: “Error rate above 1 percent due to OOM exceptions”
Users care about latency and errors. They do not care about CPU or memory except when those resources cause user-facing problems.
Use multi-signal conditions:
Require multiple signals to trigger critical alerts:
if ((error_rate > 1%) AND (p95_latency > 1000ms) AND (traffic > baseline)) {
  alert: "Service degradation affecting users"
  severity: critical
}

Single-signal alerts create false positives. Combined signals confirm real problems.
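In code, the same gate is a conjunction over a snapshot of current signal values. The sketch below mirrors the thresholds in the pseudocode above; the snapshot values are made up.

```python
# Sketch: require several degraded signals before paging anyone.
from dataclasses import dataclass

@dataclass
class SignalSnapshot:
    error_rate: float      # fraction of failed requests, e.g. 0.02 == 2%
    p95_latency_ms: float
    traffic_rps: float
    baseline_rps: float

def is_user_impacting(s: SignalSnapshot) -> bool:
    """All three conditions must hold, matching the pseudocode above."""
    return (
        s.error_rate > 0.01
        and s.p95_latency_ms > 1000
        and s.traffic_rps > s.baseline_rps
    )

snapshot = SignalSnapshot(error_rate=0.03, p95_latency_ms=1400, traffic_rps=1800, baseline_rps=1200)
if is_user_impacting(snapshot):
    print("page on-call: service degradation affecting users")
```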
Progressive severity levels:
Different signal thresholds trigger different severities:
- Info: Early warning, no immediate action needed
- Medium: Investigation required during business hours
- High: Page on-call, investigation required immediately
- Critical: All hands, customer-impacting outage in progress
Design thresholds so most alerts are informational, catching issues before they become emergencies.
Implementing Golden Signals with Modern Tools
Manual metric collection fails at scale. Modern monitoring platforms automate golden signal tracking.
HTTP endpoint monitoring:
Platforms like Upstat measure golden signals automatically for HTTP services:
- Latency tracking: Captures DNS, TCP, TLS, TTFB, and total response time per request
- Traffic measurement: Records request volume with regional distribution
- Error detection: Identifies HTTP error codes and connection failures
- Saturation inference: Correlates latency increases with capacity constraints
Multi-region checking from US East/West, EU West/Central, and Asia Pacific regions reveals geographic performance variations that single-location monitoring misses.
Alert condition configuration:
Modern platforms support flexible alert logic matching golden signal patterns:
// Alert when latency degrades
{
  "and": [
    { ">=": [{ "var": "response_time" }, 1000] },
    { ">=": [{ "var": "consecutive_failures" }, 3] }
  ]
}

JSON-based rule engines allow complex multi-condition alerts without code changes.
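To illustrate how a rule like this might be evaluated, here is a deliberately tiny interpreter for just the operators used above (and, >=, var). It is a sketch of the idea, not the rule engine any particular platform ships.

```python
# Sketch: minimal evaluator for the JSON rule shown above.
# Supports only "and", ">=", and "var"; real rule engines cover far more operators.
def evaluate(rule, data: dict):
    if not isinstance(rule, dict):
        return rule  # literal value such as 1000
    operator, args = next(iter(rule.items()))
    if operator == "var":
        return data.get(args)
    if operator == "and":
        return all(evaluate(arg, data) for arg in args)
    if operator == ">=":
        left, right = (evaluate(arg, data) for arg in args)
        return left >= right
    raise ValueError(f"unsupported operator: {operator}")

rule = {
    "and": [
        {">=": [{"var": "response_time"}, 1000]},
        {">=": [{"var": "consecutive_failures"}, 3]},
    ]
}
print(evaluate(rule, {"response_time": 1450, "consecutive_failures": 3}))  # True -> alert
```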
Anti-fatigue systems:
Golden signal monitoring generates high volumes of measurements. Anti-fatigue features prevent alert storms (see the deduplication sketch after this list):
- Deduplication: Suppress repeated alerts for same issue
- Intelligent grouping: Aggregate related failures into single notification
- Escalation policies: Progressive notification as incidents persist
- Maintenance windows: Suppress expected degradation during deployments
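Deduplication in particular can be sketched as a fingerprint plus a suppression window: identical alerts within the window are dropped. The example below is a simplified illustration with an assumed 15-minute window, not any specific platform's implementation.

```python
# Sketch: suppress repeated alerts for the same issue within a time window.
import time

class AlertDeduplicator:
    def __init__(self, suppression_seconds: float = 900.0):
        self.suppression_seconds = suppression_seconds
        self.last_sent: dict[tuple[str, str], float] = {}

    def should_notify(self, monitor_id: str, alert_name: str) -> bool:
        """Return True only for the first alert with this fingerprint per window."""
        fingerprint = (monitor_id, alert_name)
        now = time.monotonic()
        last = self.last_sent.get(fingerprint)
        if last is not None and now - last < self.suppression_seconds:
            return False  # duplicate within the suppression window
        self.last_sent[fingerprint] = now
        return True

dedup = AlertDeduplicator()
print(dedup.should_notify("api-checkout", "Elevated server error rate"))  # True
print(dedup.should_notify("api-checkout", "Elevated server error rate"))  # False
```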
Common Pitfalls
Monitoring infrastructure instead of user experience:
Teams monitor CPU, memory, and disk because those metrics are easy to collect. But infrastructure health does not guarantee service health.
A server with low CPU can still serve slow requests due to database query problems. A server with plenty of free memory can return errors due to external API failures.
Prioritize golden signals that measure what users experience.
Ignoring percentiles:
Average latency hides problems. A service with a 60ms average can still have 0.1 percent of requests taking 10 seconds, because those outliers add only about 10ms to the mean. The slow requests ruin the experience for the users they hit.
Always track p95, p99, and p99.9 percentiles in addition to median latency.
Setting alerts without baselines:
Alerting when CPU exceeds 70 percent might be too sensitive for services that regularly run hot or too permissive for services that should stay under 50 percent.
Establish baselines from historical data. Alert on deviations from normal behavior.
Collecting metrics without using them:
Dashboards full of graphs that nobody looks at waste resources. Every metric should either drive alerts or answer specific questions during incident investigation.
Review dashboard usage quarterly. Remove unused metrics.
Moving From Metrics to Action
Golden signals reveal problems. The next step is fixing them.
When latency is high:
- Check database query performance
- Review external API response times
- Examine application profiling data
- Verify resource availability
- Consider caching frequently accessed data
When traffic is abnormal:
- Verify upstream routing and DNS
- Check for DDoS attack patterns
- Review recent marketing campaigns
- Investigate client behavior changes
- Validate capacity planning assumptions
When errors are elevated:
- Examine application error logs
- Check database connectivity
- Verify external dependency status
- Review recent deployments for bugs
- Confirm configuration correctness
When saturation is high:
- Add horizontal capacity (more instances)
- Optimize resource usage (more efficient code)
- Implement rate limiting to protect system
- Offload work to queues for async processing
- Review capacity planning and growth projections
The Foundation of Reliable Systems
Monitoring every possible metric creates noise. Monitoring nothing creates surprises.
The four golden signals—latency, traffic, errors, and saturation—provide the essential foundation. These metrics reveal system health from the user perspective and detect the majority of problems that impact reliability.
Instrument your services to measure these signals. Configure alerts based on signal thresholds. Review signal trends to identify capacity needs. Use signal data to guide incident investigation.
The complexity comes from implementation details, measurement accuracy, and alert design. The concept stays simple: measure how fast your system responds, how much traffic it handles, how often it fails, and how close it is to capacity limits.
Get those four measurements right, and you catch most problems before users notice. That is why they are golden.
Explore In Upstat
Monitor HTTP endpoints with multi-region checking, track latency across DNS/TCP/TLS/TTFB phases, measure uptime and error rates, and configure alerts based on the golden signals that matter for your services.
