Your monitoring dashboard displays hundreds of metrics. CPU utilization. Memory consumption. Disk I/O. Network bandwidth. Database connections. Cache hit rates. Queue depths. Thread pool sizes.
When an incident strikes, which metrics matter?
The four golden signals—latency, traffic, errors, and saturation—cut through the noise. These metrics reveal system health from the user perspective and catch the majority of user-impacting issues before they escalate.
This guide explains what each signal measures, how to track them effectively, and when they indicate problems.
The Four Golden Signals
Google’s Site Reliability Engineering team identified these four metrics as the foundation of effective monitoring. They apply to virtually every service, from HTTP APIs to batch processing systems to message queues.
1. Latency
What it measures: Time to process requests.
Latency tracks how long your system takes to respond. This includes successful requests and failed requests, measured separately. A fast error response tells a different story than a slow success response.
Why it matters:
Users experience latency directly. A search that returns results in 50ms feels instant. The same search taking 5 seconds feels broken, even if it eventually succeeds.
Latency reveals resource constraints, external dependency problems, and inefficient code paths. Rising latency often precedes complete failures as systems approach capacity limits.
What to measure:
Break latency measurement into phases to identify bottlenecks (see the measurement sketch after this list):
- DNS resolution time: How long to resolve domain names
- TCP connection time: Time to establish network connection
- TLS handshake time: SSL/TLS negotiation duration
- Time to first byte (TTFB): Server processing before sending response
- Total response time: Complete request-response cycle
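As one way to capture this breakdown, libcurl exposes per-phase timings. The minimal sketch below assumes the Python pycurl package and a placeholder URL; note that libcurl reports each value as cumulative time from the start of the request, so individual phase durations are differences between consecutive values.

```python
# Sketch: per-phase latency for one HTTP request using pycurl.
# Assumes `pip install pycurl`; the URL is a placeholder.
from io import BytesIO
import pycurl

def latency_phases(url: str) -> dict:
    buffer = BytesIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.WRITEDATA, buffer)
    c.perform()
    phases = {
        "dns_ms":   c.getinfo(pycurl.NAMELOOKUP_TIME) * 1000,     # DNS resolution
        "tcp_ms":   c.getinfo(pycurl.CONNECT_TIME) * 1000,        # TCP connect (cumulative)
        "tls_ms":   c.getinfo(pycurl.APPCONNECT_TIME) * 1000,     # TLS handshake done (cumulative)
        "ttfb_ms":  c.getinfo(pycurl.STARTTRANSFER_TIME) * 1000,  # time to first byte (cumulative)
        "total_ms": c.getinfo(pycurl.TOTAL_TIME) * 1000,          # full request-response cycle
    }
    c.close()
    return phases

print(latency_phases("https://api.example.com/health"))
```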
Track latency at multiple percentiles, not just averages:
- p50 (median): Typical user experience
- p95: Experience for 1 in 20 requests
- p99: Experience for 1 in 100 requests
- p99.9: Worst-case latency for power users
A service with 100ms p50 latency and 5000ms p99 latency has severe outlier problems that averages hide.
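For illustration, here is a minimal nearest-rank percentile sketch over raw per-request latencies; the sample values are made up, but they show how a healthy-looking median can coexist with terrible tail latency.

```python
# Sketch: nearest-rank percentiles from a list of per-request latencies (ms).
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile; p is in the range (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [42, 48, 51, 55, 60, 75, 90, 120, 800, 5200]  # illustrative samples
for p in (50, 95, 99, 99.9):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")
```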
Practical thresholds:
Different services have different latency requirements:
- Interactive APIs: p95 under 300ms, p99 under 500ms
- Background jobs: p95 under 5 seconds, p99 under 10 seconds
- Batch operations: Completion within expected window
Example alert logic:
Alert when latency exceeds thresholds for sustained periods:
if (p95_latency > 500ms for 3 consecutive minutes) {
  alert: "API latency degraded"
  severity: high
  action: "Check database query performance and external dependencies"
}

2. Traffic
What it measures: Demand on your system.
Traffic quantifies how much your system is being used. For web services, this is requests per second. For databases, it might be queries per second or connection count. For storage systems, it could be read/write operations.
Why it matters:
Traffic reveals usage patterns, capacity planning needs, and attack signatures. Sudden traffic spikes might indicate viral growth, marketing campaigns, or DDoS attacks. Unexpected traffic drops suggest upstream failures preventing requests from reaching your service.
Traffic also provides context for other metrics. An error rate of 5 percent means different things at 100 requests/second versus 10,000 requests/second.
What to measure:
Track traffic across multiple dimensions (see the labeling sketch after this list):
- Total volume: Overall requests per second
- Success traffic: Requests that complete successfully
- Error traffic: Requests that fail
- By endpoint: Which operations are most used
- By region: Geographic distribution of demand
- By user segment: Free tier versus paid users
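One common way to capture these dimensions is a labeled counter. The sketch below uses the Python prometheus_client library; the label values, metric names, and port are illustrative assumptions.

```python
# Sketch: a labeled request counter so traffic can be sliced by dimension.
# Assumes `pip install prometheus-client`; label values are illustrative.
from prometheus_client import Counter, start_http_server

REQUESTS = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["endpoint", "region", "status", "tier"],
)

def record_request(endpoint: str, region: str, status: int, tier: str) -> None:
    # One increment per request; dashboards derive requests/second from the counter.
    REQUESTS.labels(endpoint=endpoint, region=region, status=str(status), tier=tier).inc()

start_http_server(9100)  # expose /metrics for scraping
record_request("/search", "eu-west", 200, "paid")
record_request("/search", "us-east", 503, "free")
```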
Practical baselines:
Establish normal traffic ranges based on historical data (see the baseline sketch after this list):
- Weekday baseline: 1,000-1,500 requests/second
- Weekend baseline: 600-900 requests/second
- Peak hours: 2,000-2,500 requests/second
- Overnight minimum: 200-400 requests/second
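One simple way to derive baselines like these is to bucket historical samples by hour of week and take the median; the sketch below assumes you already have (timestamp, requests-per-second) pairs from your metrics store.

```python
# Sketch: hour-of-week traffic baselines from historical samples.
# `history` is assumed to be (unix_timestamp, requests_per_second) pairs.
from collections import defaultdict
from datetime import datetime, timezone
from statistics import median

def hourly_baselines(history: list[tuple[float, float]]) -> dict[tuple[int, int], float]:
    buckets: dict[tuple[int, int], list[float]] = defaultdict(list)
    for ts, rps in history:
        dt = datetime.fromtimestamp(ts, tz=timezone.utc)
        buckets[(dt.weekday(), dt.hour)].append(rps)
    # Median per (weekday, hour) bucket smooths out one-off spikes.
    return {key: median(values) for key, values in buckets.items()}

# Usage idea: compare current traffic against the baseline for the current hour,
# alerting when current_rps falls below 0.5x or rises above 2.0x that baseline.
```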
Example alert logic:
Alert on significant deviations from expected patterns:
if (current_traffic < 0.5 * baseline for 5 minutes) {
  alert: "Traffic drop detected"
  severity: high
  action: "Check upstream load balancers and DNS resolution"
}

if (current_traffic > 2.0 * baseline for 5 minutes) {
  alert: "Traffic spike detected"
  severity: medium
  action: "Verify capacity, check for attack patterns"
}

3. Errors
What it measures: Rate of failed requests.
Errors track requests that fail to complete successfully. This includes HTTP 5xx errors, exceptions, timeouts, and any request that does not produce the expected result for users.
Why it matters:
Errors directly impact user experience. A user seeing error messages cannot complete their work. High error rates indicate bugs, infrastructure failures, or capacity problems.
Error rates provide different information than latency. A service can respond quickly with errors, hiding the fact that nothing is working. A service can have low error rates while responding slowly, masking performance problems.
What to measure:
Track errors with granularity:
- Total error rate: Percentage of all requests that fail
- By error type: 500 Internal Server Error versus 503 Service Unavailable
- By endpoint: Which operations fail most often
- By user action: Login errors versus data processing errors
- Client errors versus server errors: 4xx (client issues) versus 5xx (server issues)
Focus monitoring and alerting on server errors (5xx). Client errors (4xx) might indicate client bugs or invalid requests but do not reflect service health.
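As a small sketch, the server error rate can be computed from a window of response status codes, following the 4xx/5xx split above; the sample window is illustrative.

```python
# Sketch: server error rate over a window of HTTP status codes.
def server_error_rate(status_codes: list[int]) -> float:
    """Fraction of requests in the window that returned a 5xx status."""
    if not status_codes:
        return 0.0
    server_errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return server_errors / len(status_codes)

window = [200, 200, 201, 404, 500, 200, 503, 200, 200, 200]  # illustrative
print(f"server error rate: {server_error_rate(window):.1%}")  # 20.0% here, far past 1%
```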
Practical thresholds:
Different error rates require different responses:
- Under 0.1 percent: Normal baseline for most services
- 0.1 to 1 percent: Investigate, may indicate emerging issues
- 1 to 5 percent: Significant degradation, page on-call team
- Over 5 percent: Critical incident, all hands response
Example alert logic:
Alert when error rates exceed acceptable thresholds:
if (server_error_rate > 1% for 3 consecutive checks) {
  alert: "Elevated server error rate"
  severity: critical
  action: "Check application logs, database connectivity, external dependencies"
}

4. Saturation
What it measures: How full your system is.
Saturation reveals how close your system is to maximum capacity. This includes CPU, memory, disk, network bandwidth, database connections, and any other resource that can run out.
Why it matters:
Saturation predicts future failures. A database connection pool at 95 percent usage will hit 100 percent soon, causing new requests to fail. A service consuming 90 percent of available CPU will slow down before crashing.
Saturation provides early warning. Monitoring saturation lets you add capacity before users experience degraded performance or errors.
What to measure:
Track saturation for all constrained resources:
- CPU utilization: Percentage of processing capacity used
- Memory utilization: Percentage of RAM consumed
- Disk utilization: Storage capacity used and I/O throughput
- Network bandwidth: Percentage of network capacity used
- Connection pools: Database connections, HTTP connections
- Queue depths: Messages waiting for processing
- Thread pools: Available threads for request handling
Each resource has different saturation thresholds. CPU can safely reach 80 percent, but database connection pools should stay under 70 percent to handle traffic spikes.
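Because safe limits differ per resource, a small threshold table keeps the checks explicit. The sketch below reuses the 80 percent CPU and 70 percent connection-pool limits mentioned above; the other thresholds and the sample utilization values are illustrative assumptions.

```python
# Sketch: per-resource saturation check with resource-specific thresholds.
# Utilization values are fractions of capacity (0.0 - 1.0); thresholds are assumptions.
SATURATION_THRESHOLDS = {
    "cpu": 0.80,
    "memory": 0.85,
    "disk": 0.80,
    "db_connection_pool": 0.70,
}

def saturated_resources(utilization: dict[str, float]) -> list[str]:
    """Return the resources currently at or above their saturation threshold."""
    return [
        name
        for name, value in utilization.items()
        if value >= SATURATION_THRESHOLDS.get(name, 0.80)
    ]

current = {"cpu": 0.62, "memory": 0.71, "disk": 0.55, "db_connection_pool": 0.74}
print(saturated_resources(current))  # ['db_connection_pool'] -> act before it hits 100%
```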
Practical thresholds:
Different resources have different safe limits:
- CPU: Alert above 80 percent sustained usage
- Memory: Alert above 85 percent to prevent OOM kills
- Disk: Alert above 80 percent capacity
- Connection pools: Alert above 70 percent utilization
- Queue depths: Alert when depth grows unbounded
Example alert logic:
Alert before resources exhaust completely:
if (database_connection_pool_usage > 70% for 10 minutes) {
  alert: "Database connection pool saturation"
  severity: high
  action: "Scale connection pool size or reduce query load"
}

How Golden Signals Work Together
Monitoring all four signals reveals the complete picture of system health. Each signal provides context for the others:
Scenario 1: High latency + low errors + high saturation
Your system is overloaded but still processing requests. Users experience slow responses, but requests succeed eventually. This indicates capacity constraints. Solution: Add resources or optimize performance.
Scenario 2: Low latency + high errors + low saturation
Requests fail quickly without resource strain. This suggests application bugs, configuration problems, or external dependency failures. Solution: Check application logs and external service status.
Scenario 3: Traffic spike + high saturation + stable latency + stable errors
Your system handles increased load successfully. Resources are utilized but within limits. This indicates healthy scaling. Action: Monitor for continued growth and plan capacity increases.
Scenario 4: Traffic drop + low errors + low latency + low saturation
Everything looks healthy from your service perspective, but traffic disappeared. This indicates upstream problems. Solution: Check load balancers, DNS, CDN, or upstream services preventing traffic from reaching you.
Measuring Golden Signals in Practice
Theoretical understanding helps, but practical implementation determines monitoring effectiveness.
Instrumentation approach:
Measure golden signals at the service boundary where requests enter your system:
- Application instrumentation: Libraries like Prometheus client or OpenTelemetry embedded in code
- Proxy instrumentation: API gateways or reverse proxies measuring traffic
- External monitoring: Synthetic checks simulating real user behavior
Combine approaches for comprehensive coverage. Application instrumentation provides detailed internal metrics. External monitoring validates what users experience.
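For the application-instrumentation side, a minimal sketch is a wrapper that records a latency histogram and an outcome counter at the service boundary. It assumes the Python prometheus_client library and a hypothetical handle() function standing in for your request handler.

```python
# Sketch: measuring latency, traffic, and errors at the service boundary.
# Assumes `pip install prometheus-client`; handle() is whatever serves the request.
import time
from prometheus_client import Counter, Histogram

REQUEST_LATENCY = Histogram(
    "request_latency_seconds", "Request latency", ["endpoint"]
)
REQUEST_COUNT = Counter(
    "requests_total", "Requests by outcome", ["endpoint", "outcome"]
)

def instrumented(endpoint: str, handle, *args, **kwargs):
    start = time.monotonic()
    try:
        result = handle(*args, **kwargs)
        REQUEST_COUNT.labels(endpoint=endpoint, outcome="success").inc()
        return result
    except Exception:
        REQUEST_COUNT.labels(endpoint=endpoint, outcome="error").inc()
        raise
    finally:
        # Histograms let the monitoring backend derive p50/p95/p99 latency.
        REQUEST_LATENCY.labels(endpoint=endpoint).observe(time.monotonic() - start)
```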
Aggregation windows:
Different signals need different measurement intervals:
- Latency: 1-minute rolling windows for p50/p95/p99 percentiles
- Traffic: 1-minute request counts for real-time visibility
- Errors: 1-minute error rates with 5-minute trending
- Saturation: 30-second sampling for resource utilization
Balance freshness against accuracy. Too frequent sampling creates noise. Too infrequent sampling misses brief spikes.
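A rolling window can be as simple as a timestamped buffer that drops old samples before each read. The sketch below keeps one minute of latency samples and answers percentile queries over them; the window length and sample values are illustrative.

```python
# Sketch: a 1-minute rolling window of latency samples for percentile queries.
import math
import time
from collections import deque

class RollingLatencyWindow:
    def __init__(self, window_seconds: float = 60.0):
        self.window_seconds = window_seconds
        self.samples: deque[tuple[float, float]] = deque()  # (timestamp, latency_ms)

    def record(self, latency_ms: float) -> None:
        self.samples.append((time.monotonic(), latency_ms))

    def percentile(self, p: float) -> float | None:
        cutoff = time.monotonic() - self.window_seconds
        while self.samples and self.samples[0][0] < cutoff:
            self.samples.popleft()  # drop samples older than the window
        if not self.samples:
            return None
        ordered = sorted(latency for _, latency in self.samples)
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]

window = RollingLatencyWindow()
window.record(120.0)
window.record(480.0)
print(window.percentile(95))
```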
Multi-region consideration:
Services deployed across multiple regions need region-specific baselines:
- Regional traffic patterns: Europe traffic peaks while US sleeps
- Regional latency baselines: Cross-ocean requests inherently slower
- Regional error correlation: Issues affecting single region versus global impact
Aggregate regional metrics for global overview while maintaining regional granularity for troubleshooting.
Alert Design Based on Golden Signals
Transform signal measurements into actionable alerts:
Alert on user impact, not internal metrics:
Bad alert: “CPU usage above 80 percent”
Good alert: “API p95 latency above 500ms for 5 minutes”
Bad alert: “Memory usage above 90 percent”
Good alert: “Error rate above 1 percent due to OOM exceptions”
Users care about latency and errors. They do not care about CPU or memory except when those resources cause user-facing problems.
Use multi-signal conditions:
Require multiple signals to trigger critical alerts:
if ((error_rate > 1%) AND (p95_latency > 1000ms) AND (traffic > baseline)) {
  alert: "Service degradation affecting users"
  severity: critical
}

Single-signal alerts create false positives. Combined signals confirm real problems.
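In code, the same gate is a conjunction over a snapshot of current signal values. The sketch below mirrors the thresholds in the pseudocode above; the snapshot values are made up.

```python
# Sketch: require several degraded signals before paging anyone.
from dataclasses import dataclass

@dataclass
class SignalSnapshot:
    error_rate: float      # fraction of failed requests, e.g. 0.02 == 2%
    p95_latency_ms: float
    traffic_rps: float
    baseline_rps: float

def is_user_impacting(s: SignalSnapshot) -> bool:
    """All three conditions must hold, matching the pseudocode above."""
    return (
        s.error_rate > 0.01
        and s.p95_latency_ms > 1000
        and s.traffic_rps > s.baseline_rps
    )

snapshot = SignalSnapshot(error_rate=0.03, p95_latency_ms=1400, traffic_rps=1800, baseline_rps=1200)
if is_user_impacting(snapshot):
    print("page on-call: service degradation affecting users")
```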
Progressive severity levels:
Different signal thresholds trigger different severities:
- Info: Early warning, no immediate action needed
- Medium: Investigation required during business hours
- High: Page on-call, investigation required immediately
- Critical: All hands, customer-impacting outage in progress
Design thresholds so most alerts are informational, catching issues before they become emergencies.
Implementing Golden Signals with Modern Tools
Manual metric collection fails at scale. Modern monitoring platforms automate golden signal tracking.
HTTP endpoint monitoring:
Platforms like Upstat measure golden signals automatically for HTTP services:
- Latency tracking: Captures DNS, TCP, TLS, TTFB, and total response time per request
- Traffic measurement: Records request volume with regional distribution
- Error detection: Identifies HTTP error codes and connection failures
- Saturation inference: Correlates latency increases with capacity constraints
Multi-region checking from US East/West, EU West/Central, and Asia Pacific regions reveals geographic performance variations that single-location monitoring misses.
Alert condition configuration:
Modern platforms support flexible alert logic matching golden signal patterns:
// Alert when latency degrades
{
  "and": [
    { ">=": [{ "var": "response_time" }, 1000] },
    { ">=": [{ "var": "consecutive_failures" }, 3] }
  ]
}

JSON-based rule engines allow complex multi-condition alerts without code changes.
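To illustrate how a rule like this might be evaluated, here is a deliberately tiny interpreter for just the operators used above (and, >=, var). It is a sketch of the idea, not the rule engine any particular platform ships.

```python
# Sketch: minimal evaluator for the JSON rule shown above.
# Supports only "and", ">=", and "var"; real rule engines cover far more operators.
def evaluate(rule, data: dict):
    if not isinstance(rule, dict):
        return rule  # literal value such as 1000
    operator, args = next(iter(rule.items()))
    if operator == "var":
        return data.get(args)
    if operator == "and":
        return all(evaluate(arg, data) for arg in args)
    if operator == ">=":
        left, right = (evaluate(arg, data) for arg in args)
        return left >= right
    raise ValueError(f"unsupported operator: {operator}")

rule = {
    "and": [
        {">=": [{"var": "response_time"}, 1000]},
        {">=": [{"var": "consecutive_failures"}, 3]},
    ]
}
print(evaluate(rule, {"response_time": 1450, "consecutive_failures": 3}))  # True -> alert
```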
Anti-fatigue systems:
Golden signal monitoring generates high volumes of measurements. Anti-fatigue features prevent alert storms (see the deduplication sketch after this list):
- Deduplication: Suppress repeated alerts for same issue
- Intelligent grouping: Aggregate related failures into single notification
- Escalation policies: Progressive notification as incidents persist
- Maintenance windows: Suppress expected degradation during deployments
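Deduplication in particular can be sketched as a fingerprint plus a suppression window: identical alerts within the window are dropped. The example below is a simplified illustration with an assumed 15-minute window, not any specific platform's implementation.

```python
# Sketch: suppress repeated alerts for the same issue within a time window.
import time

class AlertDeduplicator:
    def __init__(self, suppression_seconds: float = 900.0):
        self.suppression_seconds = suppression_seconds
        self.last_sent: dict[tuple[str, str], float] = {}

    def should_notify(self, monitor_id: str, alert_name: str) -> bool:
        """Return True only for the first alert with this fingerprint per window."""
        fingerprint = (monitor_id, alert_name)
        now = time.monotonic()
        last = self.last_sent.get(fingerprint)
        if last is not None and now - last < self.suppression_seconds:
            return False  # duplicate within the suppression window
        self.last_sent[fingerprint] = now
        return True

dedup = AlertDeduplicator()
print(dedup.should_notify("api-checkout", "Elevated server error rate"))  # True
print(dedup.should_notify("api-checkout", "Elevated server error rate"))  # False
```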
Common Pitfalls
Monitoring infrastructure instead of user experience:
Teams monitor CPU, memory, and disk because those metrics are easy to collect. But infrastructure health does not guarantee service health.
A server with low CPU can still serve slow requests due to database query problems. A server with plenty of free memory can return errors due to external API failures.
Prioritize golden signals that measure what users experience.
Ignoring percentiles:
Average latency hides problems. A service with a 60ms average can still have 0.1 percent of requests taking 10 seconds, because those outliers add only about 10ms to the mean. The slow requests ruin the experience for the users they hit.
Always track p95, p99, and p99.9 percentiles in addition to median latency.
Setting alerts without baselines:
Alerting when CPU exceeds 70 percent might be too sensitive for services that regularly run hot or too permissive for services that should stay under 50 percent.
Establish baselines from historical data. Alert on deviations from normal behavior.
Collecting metrics without using them:
Dashboards full of graphs that nobody looks at waste resources. Every metric should either drive alerts or answer specific questions during incident investigation.
Review dashboard usage quarterly. Remove unused metrics.
Moving From Metrics to Action
Golden signals reveal problems. The next step is fixing them.
When latency is high:
- Check database query performance
- Review external API response times
- Examine application profiling data
- Verify resource availability
- Consider caching frequently accessed data
When traffic is abnormal:
- Verify upstream routing and DNS
- Check for DDoS attack patterns
- Review recent marketing campaigns
- Investigate client behavior changes
- Validate capacity planning assumptions
When errors are elevated:
- Examine application error logs
- Check database connectivity
- Verify external dependency status
- Review recent deployments for bugs
- Confirm configuration correctness
When saturation is high:
- Add horizontal capacity (more instances)
- Optimize resource usage (more efficient code)
- Implement rate limiting to protect system
- Offload work to queues for async processing
- Review capacity planning and growth projections
The Foundation of Reliable Systems
Monitoring every possible metric creates noise. Monitoring nothing creates surprises.
The four golden signals—latency, traffic, errors, and saturation—provide the essential foundation. These metrics reveal system health from the user perspective and detect the majority of problems that impact reliability.
Instrument your services to measure these signals. Configure alerts based on signal thresholds. Review signal trends to identify capacity needs. Use signal data to guide incident investigation.
The complexity comes from implementation details, measurement accuracy, and alert design. The concept stays simple: measure how fast your system responds, how much traffic it handles, how often it fails, and how close it is to capacity limits.
Get those four measurements right, and you catch most problems before users notice. That is why they are golden.
Explore In Upstat
Monitor HTTP endpoints with multi-region checking, track latency across DNS/TCP/TLS/TTFB phases, measure uptime and error rates, and configure alerts based on the golden signals that matter for your services.
