The Foundation of Modern Observability
When your distributed system starts misbehaving at 3 AM, you need answers fast. Why are requests slow? Which service is the bottleneck? What changed in the last hour? Traditional monitoring shows that something is wrong, but observability tells you why it’s wrong.
Observability relies on three fundamental types of telemetry data: logs, metrics, and traces. These “three pillars” work together to provide complete visibility into system behavior. Each pillar captures different information, serves distinct purposes, and answers specific questions during debugging.
Understanding how logs, metrics, and traces complement each other transforms incident response from guesswork into systematic investigation. This post explains what each pillar does, when to use it, and how they work together in production systems.
What Are the Three Pillars?
The three pillars represent different ways of capturing system behavior:
Logs record discrete events with detailed context about what happened at a specific moment.
Metrics aggregate numerical measurements over time to show trends and patterns.
Traces track individual requests as they flow through distributed systems, revealing the path and timing of operations.
Each pillar provides a different lens for understanding system behavior. Used together, they enable comprehensive observability that supports both proactive monitoring and reactive debugging.
Pillar One: Logs
Logs are timestamped records of discrete events that happened in your system. Application code emits logs when requests arrive, when database queries execute, when errors occur, and when state changes.
What Logs Contain
A log entry typically includes:
- Timestamp: When the event occurred
- Severity: Info, warning, error, critical
- Message: Human-readable description of what happened
- Context: User ID, request ID, service name, additional metadata
Example log entry:
2025-11-04T14:23:45Z [ERROR] OrderService: Payment processing failed
user_id: 12345
order_id: ord_789abc
payment_gateway: stripe
error: card_declined
amount: $127.50

When Logs Are Essential
Logs excel at providing detailed context about specific events:
Error investigation: When something goes wrong, logs explain exactly what failed, with stack traces and variable states.
Audit trails: Logs document who did what and when, essential for security and compliance.
Event sequencing: Logs show the order of operations, helping reconstruct what led to a problem.
Edge case debugging: Unusual situations that don’t fit patterns show up clearly in logs with full context.
Log Challenges
Volume: Production systems generate millions of log entries. Finding relevant logs requires good filtering and search.
Cost: Storing and indexing logs at scale becomes expensive. Teams must balance retention with budget.
Structure: Unstructured text logs are hard to query. Structured logging with consistent fields enables better analysis.
Noise: Too much logging drowns out important information. Appropriate log levels improve the signal-to-noise ratio.
Pillar Two: Metrics
Metrics are numerical measurements aggregated over time. Instead of capturing every individual event, metrics summarize system behavior using counts, rates, and statistical distributions.
What Metrics Measure
Common metric types include:
- Counters: Total requests, total errors, cache hits
- Gauges: Current CPU usage, active connections, queue depth
- Histograms: Response time distribution, request size distribution
- Rates: Requests per second, errors per minute
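As a rough sketch of how these types look in code, here is how they might be registered and updated with the Python prometheus_client library. The metric and label names mirror the example that follows; the port and the handle_checkout handler are assumptions for illustration.

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Counter: a monotonically increasing total, partitioned by bounded labels.
REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["service", "endpoint"])

# Gauge: a value that can rise and fall.
DB_CONNECTIONS = Gauge("database_connections_active", "Active database connections", ["pool"])

# Histogram: bucketed observations; percentiles and rates are derived at query time.
LATENCY = Histogram("http_request_duration_seconds", "HTTP request latency", ["service"])

def handle_checkout():
    pass  # hypothetical request handler

start_http_server(9100)  # expose /metrics for scraping; the port is an assumption

REQUESTS.labels(service="api", endpoint="/checkout").inc()
DB_CONNECTIONS.labels(pool="main").set(23)
with LATENCY.labels(service="api").time():  # records the duration of the block
    handle_checkout()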
Example metrics:
http_requests_total{service="api", endpoint="/checkout"}: 45,892
http_request_duration_seconds{service="api", p99}: 0.450
database_connections_active{pool="main"}: 23
error_rate{service="api"}: 0.0012

When Metrics Are Essential
Metrics excel at trend analysis and alerting:
Performance monitoring: Is response time increasing? Are error rates within acceptable bounds?
Capacity planning: How much headroom remains before scaling? What’s the growth trend?
SLA tracking: Are we meeting uptime targets? What’s our actual availability?
Alert triggers: Metrics power threshold-based alerts that fire when values cross defined boundaries.
Metric Challenges
Cardinality: High-cardinality dimensions (user IDs, request IDs) create enormous metric series that overwhelm systems.
Aggregation loss: Metrics lose individual event details. A P99 latency metric doesn’t identify which specific request was slow.
Dimensionality: Deciding which labels to add requires balancing query flexibility against metric explosion.
Context limitations: Metrics show what changed but not why. Investigation requires other data sources.
Pillar Three: Traces
Traces track individual requests as they flow through distributed systems. A single user action might trigger calls to ten microservices, two databases, and a cache. Traces show this entire path with timing for each step.
What Traces Capture
A trace consists of spans representing operations within a request:
- Trace ID: Unique identifier linking all spans for one request
- Span ID: Unique identifier for each operation
- Parent span ID: Shows operation hierarchy
- Operation name: What this span represents
- Start time and duration: When it started and how long it took
- Attributes: Request parameters, status codes, error details
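As a minimal sketch, nested spans like those in the example below can be created with the OpenTelemetry Python API. The service and span names are illustrative, and exporter configuration (shown later in this post) is required before anything is actually recorded.

from opentelemetry import trace

tracer = trace.get_tracer("gateway")

# Each start_as_current_span call creates a child of the enclosing span,
# which is what produces the parent span IDs and the hierarchy below.
with tracer.start_as_current_span("Gateway: /api/checkout"):
    with tracer.start_as_current_span("OrderService: createOrder"):
        with tracer.start_as_current_span("Database: INSERT INTO orders"):
            pass  # run the query
    with tracer.start_as_current_span("PaymentService: processPayment") as span:
        span.set_attribute("payment.gateway", "stripe")  # attributes carry request context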
Example trace structure:
Trace: e7b2c4a9-1234-5678-abcd-ef1234567890
└─ [200ms] Gateway: /api/checkout
   ├─ [150ms] OrderService: createOrder
   │  ├─ [80ms] Database: INSERT INTO orders
   │  └─ [60ms] InventoryService: reserveItems
   │     └─ [45ms] Database: UPDATE inventory
   └─ [40ms] PaymentService: processPayment
      └─ [35ms] Stripe API: charge

When Traces Are Essential
Traces excel at understanding request flows:
Latency investigation: Where in the distributed call chain is time being spent?
Dependency mapping: How do services actually interact at runtime, not just in architecture diagrams?
Performance optimization: Which operations are the bottlenecks? Where can we parallelize?
Error correlation: When a request fails, which service caused the failure and how did it propagate?
Trace Challenges
Instrumentation complexity: Every service must propagate trace context correctly. Missing instrumentation creates gaps.
Sampling: Capturing every trace is prohibitively expensive. Sampling risks missing important transactions.
Storage costs: Traces contain detailed data for every sampled request. Retention windows are often short.
Implementation coordination: Distributed tracing requires coordination across teams to ensure consistent instrumentation.
How the Three Pillars Work Together
The power of observability emerges when you use all three pillars together. Each provides different information that complements the others.
The Investigation Workflow
Step 1: Metrics detect the problem
Monitoring dashboards show error rate spiking from 0.1 percent to 5 percent. An alert fires. You know something is wrong.
Step 2: Metrics narrow the scope
Filter metrics by endpoint, region, and service. The spike affects only /api/checkout in the US-East region. You know where the problem is.
Step 3: Traces identify the bottleneck
Query traces for slow /api/checkout requests in US-East. Traces reveal that PaymentService calls are timing out after 30 seconds. You know which component is failing.
Step 4: Logs explain why
Search logs for PaymentService errors with matching trace IDs. Logs show connection pool exhaustion: all database connections are held by long-running queries. You know why it’s failing.
Step 5: Metrics verify the fix
After scaling database connection pools, metrics confirm error rates returned to baseline. Traces show response times back to normal. You know the problem is resolved.
Complementary Strengths
Each pillar compensates for the others’ weaknesses:
Metrics provide the overview that would be impossible with logs alone. You can’t understand system-wide trends by reading individual log entries.
Logs provide the detail that metrics aggregate away. When metrics show something is wrong, logs explain the specifics.
Traces provide the connections that neither logs nor metrics capture. They show how operations relate across service boundaries.
What Modern Systems Actually Need
Not every system needs all three pillars at the same level of sophistication. The right investment depends on your architecture and operational maturity.
Start with Metrics
Every system needs basic metrics: request rate, error rate, response time, and resource utilization. These fundamentals support alerting and performance tracking.
Tools like Prometheus, CloudWatch, or Datadog provide metrics collection and visualization. This foundation works for monolithic applications and simple architectures.
Add Structured Logs
As complexity grows, upgrade from text logs to structured logs with consistent fields. Include request IDs and trace IDs in every log entry so you can connect related events.
Structured logging enables querying like “show me all errors for user 12345 in the last hour” instead of searching through text. Tools like ELK Stack, Loki, or Splunk handle log aggregation and search.
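As a minimal sketch using only the Python standard library, structured logging can be as simple as rendering each record as one JSON object. The field names, including request_id and trace_id, are illustrative.

import json
import logging

class JsonFormatter(logging.Formatter):
    # Render each record as a single JSON object so log tools can filter on fields.
    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("OrderService")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("Payment processing failed", extra={"context": {
    "user_id": 12345,  # illustrative identifiers
    "request_id": "req_456def",
    "trace_id": "e7b2c4a9-1234-5678-abcd-ef1234567890",
    "error": "card_declined",
}})

Because every entry carries the same fields, "all errors for user 12345 in the last hour" becomes a field filter rather than a text search.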
Implement Distributed Tracing
When you have multiple services communicating with each other, distributed tracing becomes essential. You can’t understand request flows across service boundaries without traces.
Frameworks like OpenTelemetry provide language-agnostic instrumentation. Backends like Jaeger, Zipkin, Tempo, or commercial APM tools store and visualize traces.
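As one possible setup, assuming a collector or compatible backend listening for OTLP over gRPC on localhost:4317 and a 10 percent head-sampling rate (both assumptions), the Python SDK might be wired up like this:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Sample 10 percent of new traces, but always follow the caller's sampling decision.
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.1)))

# Batch finished spans and export them to the tracing backend over OTLP/gRPC.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)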
Balance Cost and Value
Observability infrastructure isn’t free. Each pillar has storage, processing, and operational costs:
Metrics: Relatively inexpensive, but costs explode with high cardinality.
Logs: Expensive at scale, requiring careful retention policies.
Traces: Very expensive due to detailed per-request data.
Start with the pillars that address your biggest pain points. Add sophistication as needs grow.
Building Effective Observability
Implementing the three pillars requires technical work and operational discipline.
Technical Implementation
Standardize instrumentation: Use consistent libraries and conventions across all services. OpenTelemetry provides vendor-neutral instrumentation.
Propagate context: Ensure trace IDs and span IDs flow through every system component, including message queues and async workers (see the sketch after this list).
Structure everything: Use structured logging and add meaningful attributes to metrics and spans. Consistent field names enable better querying.
Sample intelligently: Capture all errors and slow transactions. Sample successful fast requests at lower rates to manage costs.
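To make the propagate-context point above concrete, here is a hedged sketch using OpenTelemetry's propagation API: the caller injects the W3C traceparent header into outgoing request headers, and the callee extracts it so its spans join the same trace. Service and function names are hypothetical.

from opentelemetry import trace
from opentelemetry.propagate import extract, inject

# Caller: copy the current trace context into outgoing headers.
headers = {}
inject(headers)  # adds the W3C "traceparent" header to the dict
# http_client.get("http://inventory/reserve", headers=headers)  # hypothetical downstream call

# Callee: restore the context from incoming headers so new spans share the trace ID.
def handle_reserve(incoming_headers):
    ctx = extract(incoming_headers)
    tracer = trace.get_tracer("inventory-service")
    with tracer.start_as_current_span("reserveItems", context=ctx):
        pass  # do the work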
Operational Practices
Define standards: Document what teams must log, which metrics to export, and when to create spans. Consistency across teams is critical.
Build dashboards: Create monitoring dashboards that combine metrics from related services. Make them accessible during incidents.
Enable searching: Invest in log and trace search capabilities. Teams need to find relevant data quickly during outages.
Train engineers: Teach investigation workflows that leverage all three pillars. Observability only helps if teams know how to use it.
Common Implementation Mistakes
Teams building observability capabilities make predictable errors:
Logging Everything
More logs don’t automatically mean better observability. Excessive logging creates noise that hides important signals and increases costs.
Solution: Use appropriate log levels. Debug logs stay disabled in production. Info logs document significant events. Errors capture problems.
Metric Explosion
Adding high-cardinality labels (user IDs, session IDs) to metrics creates millions of time series that overwhelm monitoring systems.
Solution: Use metrics for aggregated patterns. Use traces and logs for high-cardinality investigation.
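A small sketch of the difference, again with prometheus_client; the metric names are illustrative:

from prometheus_client import Counter

# Bounded label values: a handful of error types, so a handful of time series.
CHECKOUT_ERRORS = Counter("checkout_errors_total", "Checkout errors", ["error_type"])
CHECKOUT_ERRORS.labels(error_type="card_declined").inc()

# Avoid this: one time series per user.
# ERRORS_BY_USER = Counter("checkout_errors_by_user_total", "Checkout errors", ["user_id"])
# Record user_id on the log entry or as a span attribute instead.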
Incomplete Tracing
Instrumenting some services but not others creates gaps that make traces useless. Partial visibility is worse than no visibility.
Solution: Treat distributed tracing as all-or-nothing. Incomplete instrumentation provides false confidence.
Ignoring Correlation
When logs, metrics, and traces can’t be connected, you lose the power of observability. Each pillar becomes an isolated data silo.
Solution: Use trace IDs everywhere. Link logs to traces through shared trace IDs. Tag metrics with trace IDs when possible.
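One way to share trace IDs, sketched with the OpenTelemetry API: read the active span's trace ID and attach it to every log entry. The logger here is assumed to be the JSON-formatted logger from the structured-logging sketch earlier.

import logging

from opentelemetry import trace

logger = logging.getLogger("OrderService")  # the JSON-formatted logger configured earlier

def current_trace_id():
    # Trace IDs are integers; the 32-character hex form matches what trace backends display.
    ctx = trace.get_current_span().get_span_context()
    return format(ctx.trace_id, "032x") if ctx.is_valid else None

logger.error("Payment processing failed",
             extra={"context": {"trace_id": current_trace_id(), "error": "card_declined"}})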
How Upstat Provides Observability Support
Incident response platforms need observability features to support effective debugging. Upstat combines monitoring and observability capabilities for comprehensive system visibility.
The platform monitors HTTP and HTTPS endpoints from multiple geographic regions, collecting performance metrics including DNS resolution time, TCP connection time, TLS handshake duration, and time to first byte. These metrics provide the foundation for understanding service health and performance trends.
Beyond basic monitoring, Upstat maintains detailed event logs for every check execution, status change, and incident action. These logs include structured data that links monitoring events to incident context, supporting investigation workflows that require connecting metrics to specific operational events.
When incidents occur, teams can query historical performance data, examine regional check results, and trace status changes over time to understand what triggered alerts and how system behavior evolved during outages.
Moving Forward with Observability
Start where you are. If you only have basic metrics today, that’s fine. Add structured logging next. Implement distributed tracing when microservices make it necessary.
The goal isn’t implementing perfect observability on day one. The goal is building visibility that improves debugging and reduces mean time to resolution.
Metrics tell you when problems occur. Traces show you where problems live. Logs explain why problems happened.
Together, these three pillars transform how teams understand and debug complex distributed systems. Invest in all three as your architecture demands it, and you’ll spend less time guessing and more time fixing.
Explore In Upstat
Track system health with detailed performance metrics and event logs that connect monitoring data to incident context.
