The Foundation of Modern Observability
When your distributed system starts misbehaving at 3 AM, you need answers fast. Why are requests slow? Which service is the bottleneck? What changed in the last hour? Traditional monitoring shows that something is wrong, but observability tells you why it’s wrong.
Observability relies on three fundamental types of telemetry data: logs, metrics, and traces. These “three pillars” work together to provide complete visibility into system behavior. Each pillar captures different information, serves distinct purposes, and answers specific questions during debugging.
Understanding how logs, metrics, and traces complement each other transforms incident response from guesswork into systematic investigation. This post explains what each pillar does, when to use it, and how they work together in production systems.
What Are the Three Pillars?
The three pillars represent different ways of capturing system behavior:
Logs record discrete events with detailed context about what happened at a specific moment.
Metrics aggregate numerical measurements over time to show trends and patterns.
Traces track individual requests as they flow through distributed systems, revealing the path and timing of operations.
Each pillar provides a different lens for understanding system behavior. Used together, they enable comprehensive observability that supports both proactive monitoring and reactive debugging.
Pillar One: Logs
Logs are timestamped records of discrete events that happened in your system. Application code emits logs when requests arrive, when database queries execute, when errors occur, and when state changes.
What Logs Contain
A log entry typically includes:
- Timestamp: When the event occurred
- Severity: Info, warning, error, critical
- Message: Human-readable description of what happened
- Context: User ID, request ID, service name, additional metadata
Example log entry:
2025-11-04T14:23:45Z [ERROR] OrderService: Payment processing failed
user_id: 12345
order_id: ord_789abc
payment_gateway: stripe
error: card_declined
amount: $127.50

When Logs Are Essential
Logs excel at providing detailed context about specific events:
Error investigation: When something goes wrong, logs explain exactly what failed, with stack traces and variable states.
Audit trails: Logs document who did what and when, essential for security and compliance.
Event sequencing: Logs show the order of operations, helping reconstruct what led to a problem.
Edge case debugging: Unusual situations that don’t fit patterns show up clearly in logs with full context.
Log Challenges
Volume: Production systems generate millions of log entries. Finding relevant logs requires good filtering and search.
Cost: Storing and indexing logs at scale becomes expensive. Teams must balance retention with budget.
Structure: Unstructured text logs are hard to query. Structured logging with consistent fields enables better analysis.
Noise: Too much logging drowns out important information. Appropriate log levels improve the signal-to-noise ratio.
Pillar Two: Metrics
Metrics are numerical measurements aggregated over time. Instead of capturing every individual event, metrics summarize system behavior using counts, rates, and statistical distributions.
What Metrics Measure
Common metric types include:
- Counters: Total requests, total errors, cache hits
- Gauges: Current CPU usage, active connections, queue depth
- Histograms: Response time distribution, request size distribution
- Rates: Requests per second, errors per minute
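As a rough sketch of how these types look in code, here is how they might be registered and updated with the Python prometheus_client library. The metric and label names mirror the example that follows; the port and the handle_checkout handler are assumptions for illustration.

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Counter: a monotonically increasing total, partitioned by bounded labels.
REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["service", "endpoint"])

# Gauge: a value that can rise and fall.
DB_CONNECTIONS = Gauge("database_connections_active", "Active database connections", ["pool"])

# Histogram: bucketed observations; percentiles and rates are derived at query time.
LATENCY = Histogram("http_request_duration_seconds", "HTTP request latency", ["service"])

def handle_checkout():
    pass  # hypothetical request handler

start_http_server(9100)  # expose /metrics for scraping; the port is an assumption

REQUESTS.labels(service="api", endpoint="/checkout").inc()
DB_CONNECTIONS.labels(pool="main").set(23)
with LATENCY.labels(service="api").time():  # records the duration of the block
    handle_checkout()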
Example metrics:
http_requests_total{service="api", endpoint="/checkout"}: 45,892
http_request_duration_seconds{service="api", p99}: 0.450
database_connections_active{pool="main"}: 23
error_rate{service="api"}: 0.0012

When Metrics Are Essential
Metrics excel at trend analysis and alerting:
Performance monitoring: Is response time increasing? Are error rates within acceptable bounds?
Capacity planning: How much headroom remains before scaling? What’s the growth trend?
SLA tracking: Are we meeting uptime targets? What’s our actual availability?
Alert triggers: Metrics power threshold-based alerts that fire when values cross defined boundaries.
Metric Challenges
Cardinality: High-cardinality dimensions (user IDs, request IDs) create enormous metric series that overwhelm systems.
Aggregation loss: Metrics lose individual event details. A P99 latency metric doesn’t identify which specific request was slow.
Dimensionality: Deciding which labels to add requires balancing query flexibility against metric explosion.
Context limitations: Metrics show what changed but not why. Investigation requires other data sources.
Pillar Three: Traces
Traces track individual requests as they flow through distributed systems. A single user action might trigger calls to ten microservices, two databases, and a cache. Traces show this entire path with timing for each step.
What Traces Capture
A trace consists of spans representing operations within a request:
- Trace ID: Unique identifier linking all spans for one request
- Span ID: Unique identifier for each operation
- Parent span ID: Shows operation hierarchy
- Operation name: What this span represents
- Start time and duration: When it started and how long it took
- Attributes: Request parameters, status codes, error details
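As a minimal sketch, nested spans like those in the example below can be created with the OpenTelemetry Python API. The service and span names are illustrative, and exporter configuration (shown later in this post) is required before anything is actually recorded.

from opentelemetry import trace

tracer = trace.get_tracer("gateway")

# Each start_as_current_span call creates a child of the enclosing span,
# which is what produces the parent span IDs and the hierarchy below.
with tracer.start_as_current_span("Gateway: /api/checkout"):
    with tracer.start_as_current_span("OrderService: createOrder"):
        with tracer.start_as_current_span("Database: INSERT INTO orders"):
            pass  # run the query
    with tracer.start_as_current_span("PaymentService: processPayment") as span:
        span.set_attribute("payment.gateway", "stripe")  # attributes carry request context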
Example trace structure:
Trace: e7b2c4a9-1234-5678-abcd-ef1234567890
└─ [200ms] Gateway: /api/checkout
   ├─ [150ms] OrderService: createOrder
   │  ├─ [80ms] Database: INSERT INTO orders
   │  └─ [60ms] InventoryService: reserveItems
   │     └─ [45ms] Database: UPDATE inventory
   └─ [40ms] PaymentService: processPayment
      └─ [35ms] Stripe API: charge

When Traces Are Essential
Traces excel at understanding request flows:
Latency investigation: Where in the distributed call chain is time being spent?
Dependency mapping: How do services actually interact at runtime, not just in architecture diagrams?
Performance optimization: Which operations are the bottlenecks? Where can we parallelize?
Error correlation: When a request fails, which service caused the failure and how did it propagate?
Trace Challenges
Instrumentation complexity: Every service must propagate trace context correctly. Missing instrumentation creates gaps.
Sampling: Capturing every trace is prohibitively expensive. Sampling risks missing important transactions.
Storage costs: Traces contain detailed data for every sampled request. Retention windows are often short.
Implementation coordination: Distributed tracing requires coordination across teams to ensure consistent instrumentation.
How the Three Pillars Work Together
The power of observability emerges when you use all three pillars together. Each provides different information that complements the others.
The Investigation Workflow
Step 1: Metrics detect the problem
Monitoring dashboards show error rate spiking from 0.1 percent to 5 percent. An alert fires. You know something is wrong.
Step 2: Metrics narrow the scope
Filter metrics by endpoint, region, and service. The spike affects only /api/checkout in the US-East region. You know where the problem is.
Step 3: Traces identify the bottleneck
Query traces for slow /api/checkout requests in US-East. Traces reveal that PaymentService calls are timing out after 30 seconds. You know which component is failing.
Step 4: Logs explain why
Search logs for PaymentService errors with matching trace IDs. Logs show connection pool exhaustion: all database connections are held by long-running queries. You know why it’s failing.
Step 5: Metrics verify the fix
After scaling database connection pools, metrics confirm error rates returned to baseline. Traces show response times back to normal. You know the problem is resolved.
Complementary Strengths
Each pillar compensates for the others’ weaknesses:
Metrics provide the overview that would be impossible with logs alone. You can’t understand system-wide trends by reading individual log entries.
Logs provide the detail that metrics aggregate away. When metrics show something is wrong, logs explain the specifics.
Traces provide the connections that neither logs nor metrics capture. They show how operations relate across service boundaries.
What Modern Systems Actually Need
Not every system needs all three pillars at the same level of sophistication. The right investment depends on your architecture and operational maturity.
Start with Metrics
Every system needs basic metrics: request rate, error rate, response time, and resource utilization. These fundamentals support alerting and performance tracking.
Tools like Prometheus, CloudWatch, or Datadog provide metrics collection and visualization. This foundation works for monolithic applications and simple architectures.
Add Structured Logs
As complexity grows, upgrade from text logs to structured logs with consistent fields. Include request IDs and trace IDs in every log entry so you can connect related events.
Structured logging enables querying like “show me all errors for user 12345 in the last hour” instead of searching through text. Tools like ELK Stack, Loki, or Splunk handle log aggregation and search.
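As a minimal sketch using only the Python standard library, structured logging can be as simple as rendering each record as one JSON object. The field names, including request_id and trace_id, are illustrative.

import json
import logging

class JsonFormatter(logging.Formatter):
    # Render each record as a single JSON object so log tools can filter on fields.
    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("OrderService")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("Payment processing failed", extra={"context": {
    "user_id": 12345,  # illustrative identifiers
    "request_id": "req_456def",
    "trace_id": "e7b2c4a9-1234-5678-abcd-ef1234567890",
    "error": "card_declined",
}})

Because every entry carries the same fields, "all errors for user 12345 in the last hour" becomes a field filter rather than a text search.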
Implement Distributed Tracing
When you have multiple services communicating with each other, distributed tracing becomes essential. You can’t understand request flows across service boundaries without traces.
Frameworks like OpenTelemetry provide language-agnostic instrumentation. Backends like Jaeger, Zipkin, Tempo, or commercial APM tools store and visualize traces.
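As one possible setup, assuming a collector or compatible backend listening for OTLP over gRPC on localhost:4317 and a 10 percent head-sampling rate (both assumptions), the Python SDK might be wired up like this:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Sample 10 percent of new traces, but always follow the caller's sampling decision.
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.1)))

# Batch finished spans and export them to the tracing backend over OTLP/gRPC.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)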
Balance Cost and Value
Observability infrastructure isn’t free. Each pillar has storage, processing, and operational costs:
Metrics: Relatively inexpensive, but costs explode with high cardinality.
Logs: Expensive at scale, requiring careful retention policies.
Traces: Very expensive due to detailed per-request data.
Start with the pillars that address your biggest pain points. Add sophistication as needs grow.
Building Effective Observability
Implementing the three pillars requires technical work and operational discipline.
Technical Implementation
Standardize instrumentation: Use consistent libraries and conventions across all services. OpenTelemetry provides vendor-neutral instrumentation.
Propagate context: Ensure trace IDs and span IDs flow through every system component, including message queues and async workers (see the sketch after this list).
Structure everything: Use structured logging and add meaningful attributes to metrics and spans. Consistent field names enable better querying.
Sample intelligently: Capture all errors and slow transactions. Sample successful fast requests at lower rates to manage costs.
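To make the propagate-context point above concrete, here is a hedged sketch using OpenTelemetry's propagation API: the caller injects the W3C traceparent header into outgoing request headers, and the callee extracts it so its spans join the same trace. Service and function names are hypothetical.

from opentelemetry import trace
from opentelemetry.propagate import extract, inject

# Caller: copy the current trace context into outgoing headers.
headers = {}
inject(headers)  # adds the W3C "traceparent" header to the dict
# http_client.get("http://inventory/reserve", headers=headers)  # hypothetical downstream call

# Callee: restore the context from incoming headers so new spans share the trace ID.
def handle_reserve(incoming_headers):
    ctx = extract(incoming_headers)
    tracer = trace.get_tracer("inventory-service")
    with tracer.start_as_current_span("reserveItems", context=ctx):
        pass  # do the work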
Operational Practices
Define standards: Document what teams must log, which metrics to export, and when to create spans. Consistency across teams is critical.
Build dashboards: Create monitoring dashboards that combine metrics from related services. Make them accessible during incidents.
Enable searching: Invest in log and trace search capabilities. Teams need to find relevant data quickly during outages.
Train engineers: Teach investigation workflows that leverage all three pillars. Observability only helps if teams know how to use it.
Common Implementation Mistakes
Teams building observability capabilities make predictable errors:
Logging Everything
More logs don’t automatically mean better observability. Excessive logging creates noise that hides important signals and increases costs.
Solution: Use appropriate log levels. Debug logs stay disabled in production. Info logs document significant events. Errors capture problems.
Metric Explosion
Adding high-cardinality labels (user IDs, session IDs) to metrics creates millions of time series that overwhelm monitoring systems.
Solution: Use metrics for aggregated patterns. Use traces and logs for high-cardinality investigation.
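A small sketch of the difference, again with prometheus_client; the metric names are illustrative:

from prometheus_client import Counter

# Bounded label values: a handful of error types, so a handful of time series.
CHECKOUT_ERRORS = Counter("checkout_errors_total", "Checkout errors", ["error_type"])
CHECKOUT_ERRORS.labels(error_type="card_declined").inc()

# Avoid this: one time series per user.
# ERRORS_BY_USER = Counter("checkout_errors_by_user_total", "Checkout errors", ["user_id"])
# Record user_id on the log entry or as a span attribute instead.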
Incomplete Tracing
Instrumenting some services but not others creates gaps that make traces useless. Partial visibility is worse than no visibility.
Solution: Treat distributed tracing as all-or-nothing. Incomplete instrumentation provides false confidence.
Ignoring Correlation
When logs, metrics, and traces can’t be connected, you lose the power of observability. Each pillar becomes an isolated data silo.
Solution: Use trace IDs everywhere. Link logs to traces through shared trace IDs. Tag metrics with trace IDs when possible.
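One way to share trace IDs, sketched with the OpenTelemetry API: read the active span's trace ID and attach it to every log entry. The logger here is assumed to be the JSON-formatted logger from the structured-logging sketch earlier.

import logging

from opentelemetry import trace

logger = logging.getLogger("OrderService")  # the JSON-formatted logger configured earlier

def current_trace_id():
    # Trace IDs are integers; the 32-character hex form matches what trace backends display.
    ctx = trace.get_current_span().get_span_context()
    return format(ctx.trace_id, "032x") if ctx.is_valid else None

logger.error("Payment processing failed",
             extra={"context": {"trace_id": current_trace_id(), "error": "card_declined"}})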
How Upstat Provides Observability Support
Incident response platforms need observability features to support effective debugging. Upstat combines monitoring and observability capabilities for comprehensive system visibility.
The platform monitors HTTP and HTTPS endpoints from multiple geographic regions, collecting performance metrics including DNS resolution time, TCP connection time, TLS handshake duration, and time to first byte. These metrics provide the foundation for understanding service health and performance trends.
Beyond basic monitoring, Upstat maintains detailed event logs for every check execution, status change, and incident action. These logs include structured data that links monitoring events to incident context, supporting investigation workflows that require connecting metrics to specific operational events.
When incidents occur, teams can query historical performance data, examine regional check results, and trace status changes over time to understand what triggered alerts and how system behavior evolved during outages.
Moving Forward with Observability
Start where you are. If you only have basic metrics today, that’s fine. Add structured logging next. Implement distributed tracing when microservices make it necessary.
The goal isn’t implementing perfect observability on day one. The goal is building visibility that improves debugging and reduces mean time to resolution.
Metrics tell you when problems occur. Traces show you where problems live. Logs explain why problems happened.
Together, these three pillars transform how teams understand and debug complex distributed systems. Invest in all three as your architecture demands it, and you’ll spend less time guessing and more time fixing.
Explore In Upstat
Track system health with detailed performance metrics and event logs that connect monitoring data to incident context.
