Three Pillars of Observability
Updated June 8, 2026
Something is wrong in production. Response times are spiking. Users are complaining. Your pager is going off at 2 AM.
How do you find the problem?
If your answer is "ssh into servers and poke around," you don't have observability. If your answer is "check the dashboard, trace a few slow requests, search the logs," you do. The difference between those two scenarios is the three pillars of observability: logs, metrics, and traces.
Monitoring vs Observability
These two words get used interchangeably, but they mean different things.
Monitoring is knowing that something is wrong. Your CPU is at 95%. Your error rate crossed the threshold. The SLA alert fired. Monitoring is reactive — it tells you when predefined conditions are met.
Observability is the ability to understand why something is wrong, by asking arbitrary questions of your system without having to add new instrumentation. An observable system lets you explore its internal state through the outputs it exposes. You can ask questions you didn't anticipate when you wrote the code.
Monitoring tells you the house is on fire. Observability lets you figure out which room and why.
The three pillars provide the raw material for observability. Use all three — none of them alone is sufficient.
Pillar 1: Logs
A log is a timestamped record of something that happened. It's the most primitive form of telemetry and also the most human-readable.

    2026-06-03T14:32:01.221Z ERROR [OrderService] Payment processing failed: timeout after 5000ms
      userId: u_8821
      orderId: ord_44291
      paymentMethod: stripe
      attempt: 3

Structured vs Unstructured Logging
Unstructured logs are plain text. They're readable by humans but terrible for machines. Searching for all errors related to a specific user means regex matching across gigabytes of text.
Structured logs are JSON (or another machine-parseable format). Every field is named and typed. You can filter by userId = "u_8821" or paymentMethod = "stripe" instantly, without regex.

    {
      "timestamp": "2026-06-03T14:32:01.221Z",
      "level": "ERROR",
      "service": "OrderService",
      "message": "Payment processing failed",
      "userId": "u_8821",
      "orderId": "ord_44291",
      "paymentMethod": "stripe",
      "durationMs": 5000,
      "attempt": 3
    }

Always use structured logging in production systems. The operational difference when debugging an incident at 3 AM is enormous.
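One way to produce entries like the JSON above is a custom formatter on Python's standard `logging` module. This is a minimal sketch, not a recommended library setup: the `JsonFormatter` class and the `context` key passed through `extra=` are names invented for this example, and the timestamp is rendered with a `+00:00` offset rather than `Z`.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": ts.isoformat(timespec="milliseconds"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Keys passed via `extra=` become attributes on the record.
        # "context" is a name chosen for this sketch, not a logging built-in.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


logger = logging.getLogger("order_service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="OrderService"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "Payment processing failed",
    extra={"context": {"userId": "u_8821", "orderId": "ord_44291",
                       "paymentMethod": "stripe", "durationMs": 5000,
                       "attempt": 3}},
)
```

Because every field is a named key, a log backend can filter on `userId` or `paymentMethod` directly instead of pattern-matching the message text.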
Log Levels
Use log levels consistently — they're how you filter noise from signal:
| Level | When to Use |
|---|---|
| DEBUG | Detailed diagnostic info. Usually disabled in production because of the volume; enable it temporarily when diagnosing a specific issue. |
| INFO | Normal operations: service started, user logged in, job completed. |
| WARN | Unexpected but handled: retry succeeded after failure, degraded mode. |
| ERROR | Something failed that shouldn't have: exception, request failed. |
| FATAL | System cannot continue: unrecoverable error, service shutting down. |
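The filtering effect of a level threshold can be seen with Python's standard `logging` module. This is a small sketch: `ListHandler` and `emit_sample` are helper names made up for the example, and the log messages are illustrative. Note that Python calls the top level `CRITICAL` (with `FATAL` as an alias).

```python
import logging


class ListHandler(logging.Handler):
    """Collect records in memory so the effect of the threshold is visible."""

    def __init__(self) -> None:
        super().__init__()
        self.lines: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.lines.append(f"{record.levelname} {record.getMessage()}")


def emit_sample(level: int) -> list[str]:
    """Log one message per level and return what passed the threshold."""
    logger = logging.getLogger(f"levels_demo_{level}")
    logger.setLevel(level)
    logger.propagate = False  # keep the demo output out of the root logger
    handler = ListHandler()
    logger.addHandler(handler)

    logger.debug("cache miss for key cart:u_8821")
    logger.info("order ord_44291 created")
    logger.warning("payment retry 2 of 3 succeeded")
    logger.error("payment failed after 3 attempts")
    return handler.lines


# With the threshold at INFO, the DEBUG line is dropped.
production_lines = emit_sample(logging.INFO)
```

Raising the threshold to `logging.ERROR` would leave only the last message, which is the kind of filter you want when scanning for failures during an incident.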
Centralized Logging
In a distributed system, logs from 50 services across 200 instances need to end up in one searchable place. Don't SSH into individual servers.
- ELK Stack (Elasticsearch + Logstash + Kibana) — the classic open-source stack. Logstash (or Filebeat) ships logs to Elasticsearch for indexing. Kibana provides the search UI.
- Grafana Loki — log aggregation designed to work alongside Prometheus. Much cheaper than ELK because it doesn't index log content, only labels (service name, environment, etc.). Full-text search is done at query time.
- Splunk — enterprise-grade, extremely powerful, extremely expensive. The choice when compliance and advanced analytics matter more than cost.
- Datadog Logs, AWS CloudWatch Logs, Google Cloud Logging — managed, cloud-native options that integrate with the rest of each cloud provider's observability suite.
Pillar 2: Metrics
A metric is a numeric measurement over time. Error rate. Request latency. CPU utilization. Active connections. Queue depth. Revenue per minute.
Unlike logs (which describe specific events), metrics are aggregated. You don't store "this specific request took 142ms" — you store "the p99 latency in the last minute was 312ms." That aggregation is what makes metrics cheap to store and fast to query.
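To make the aggregation concrete, here is a minimal sketch that reduces one minute of raw request durations to a handful of numbers. The sample data is invented, and the nearest-rank percentile used here is only one of several common definitions; real metrics systems often estimate percentiles from histogram buckets rather than from raw samples.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it."""
    if not samples:
        raise ValueError("percentile of an empty sample set is undefined")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(durations_ms: list[float]) -> dict[str, float]:
    """Collapse raw per-request durations into a few aggregate values."""
    return {
        "count": len(durations_ms),
        "mean": sum(durations_ms) / len(durations_ms),
        "p50": percentile(durations_ms, 50),
        "p99": percentile(durations_ms, 99),
    }


# 100 hypothetical requests in one minute: most are fast, two are slow.
durations_ms = [100.0] * 98 + [312.0, 900.0]
summary = summarize(durations_ms)
```

For this data the mean is about 110ms while the p99 is 312ms. Four numbers replace a hundred samples, and the gap between the mean and the tail shows why percentiles are worth tracking.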
The Four Golden Signals
Google's Site Reliability Engineering book defines four signals to monitor for any service:
- Latency — How long does a request take? (P50, P95, P99 — not just the average)
- Traffic — How much demand is there? (requests per second)
- Errors — What fraction of requests fail? (5xx rate, timeout rate)
- Saturation — How full is the system? (CPU %, memory %, queue depth)
If you instrument nothing else, instrument these four for every service.
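As a rough illustration, all four signals can be derived from a window of request records plus one resource gauge. The `Request` type, the field names, and the choice of queue depth as the saturation measure are assumptions made for this sketch; in practice these values usually come from a metrics library rather than hand-rolled code.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    duration_ms: float
    status: int  # HTTP status code


def golden_signals(requests: list[Request], window_s: float,
                   queue_depth: int, queue_capacity: int) -> dict[str, float]:
    """Compute latency, traffic, errors, and saturation for one time window."""
    if not requests:
        raise ValueError("no requests in window")
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank p99 over the window.
    rank = max(1, math.ceil(99 * len(durations) / 100))
    server_errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p99_ms": durations[rank - 1],
        "traffic_rps": len(requests) / window_s,
        "error_rate": server_errors / len(requests),
        "saturation": queue_depth / queue_capacity,
    }


# Hypothetical 10-second window: 95 fast successes and 5 gateway timeouts.
window = [Request(100.0, 200)] * 95 + [Request(5000.0, 504)] * 5
signals = golden_signals(window, window_s=10.0, queue_depth=40, queue_capacity=50)
```

Here the window yields 10 requests per second, a 5% error rate, a p99 of 5000ms, and a queue that is 80% full.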
Prometheus and Grafana
Prometheus is the de facto standard for metrics collection in cloud-native systems. It works by "scraping": pulling metrics from each service on a regular interval. Your service exposes a /metrics endpoint; Prometheus reads it on a configured interval (15 seconds is a common choice) and stores the samples in its time-series database.
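To show what a scrape actually reads, the sketch below renders a counter in the Prometheus text exposition format and serves it at /metrics using only the standard library. The metric name `http_requests_total`, its labels, and the counter values are illustrative, and a real service would normally use a Prometheus client library instead of hand-writing the format.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-process counters keyed by (method, status). Values are made up.
REQUEST_COUNTS: dict[tuple[str, str], int] = {
    ("POST", "200"): 1027,
    ("POST", "500"): 3,
}


def render_metrics(counts: dict[tuple[str, str], int]) -> str:
    """Render counters as Prometheus text: HELP and TYPE lines, then samples."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for (method, status), value in sorted(counts.items()):
        lines.append(
            f'http_requests_total{{method="{method}",status="{status}"}} {value}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Blocks forever; call this from the service's entry point."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Counters only ever increase. The scraper stores each reading with a timestamp, and rates such as requests per second are computed at query time from the difference between readings.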
Grafana is the visualization layer. Connect it to Prometheus (or any other data source) and build dashboards. Grafana also handles alerting: define a threshold, and Grafana fires an alert to PagerDuty, Slack, or your on-call system when it's crossed.
The combination of Prometheus + Grafana is the standard open-source metrics stack. Kubernetes components expose their own metrics in the Prometheus format, and most cloud-native companies run this stack or something compatible with it.
Pillar 3: Traces
A trace shows you the complete journey of a single request as it travels through your distributed system.
Without traces, you know that P99 latency is 800ms. But which service is slow? Which database query? Which external API call? Logs show you what happened in each service individually, but correlating them across services requires a trace ID to link them.
A trace is composed of spans — individual units of work. Each span records:
- What work was done (name of the operation)
- Which service did it
- When it started and how long it took
- Any relevant attributes (HTTP status, database query, user ID)
- Its parent span (so you can reconstruct the call tree)

    Trace: ord_44291
    ├── [OrderService] POST /orders          248ms
    │   ├── [InventoryService] checkStock     15ms
    │   ├── [PaymentService] chargeCard      210ms  ← slow!
    │   │   └── [Stripe API] POST /charges   198ms  ← root cause
    │   └── [NotificationService] sendEmail   18ms

At a glance: the Stripe API call accounts for 198ms of the 248ms request. That's your root cause.
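The same reasoning can be automated. The sketch below models the spans from the example trace, links them through parent IDs, and ranks them by self time, meaning a span's duration minus the time spent in its direct children. The `Span` type and the span IDs are invented for illustration, and subtracting child durations assumes the children run one after another rather than in parallel.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    service: str
    operation: str
    duration_ms: int


SPANS = [
    Span("s1", None, "OrderService", "POST /orders", 248),
    Span("s2", "s1", "InventoryService", "checkStock", 15),
    Span("s3", "s1", "PaymentService", "chargeCard", 210),
    Span("s4", "s3", "Stripe API", "POST /charges", 198),
    Span("s5", "s1", "NotificationService", "sendEmail", 18),
]


def self_times(spans: list[Span]) -> dict[str, int]:
    """Duration of each span minus the total duration of its direct children."""
    child_total: dict[str, int] = {}
    for s in spans:
        if s.parent_id is not None:
            child_total[s.parent_id] = child_total.get(s.parent_id, 0) + s.duration_ms
    return {s.span_id: s.duration_ms - child_total.get(s.span_id, 0) for s in spans}


def slowest_by_self_time(spans: list[Span]) -> Span:
    """The span where the most time was actually spent."""
    times = self_times(spans)
    return max(spans, key=lambda s: times[s.span_id])


def render(spans: list[Span], parent_id: Optional[str] = None, depth: int = 0) -> list[str]:
    """Rebuild the call tree as indented lines by following parent IDs."""
    lines: list[str] = []
    for s in spans:
        if s.parent_id == parent_id:
            lines.append(f"{'  ' * depth}[{s.service}] {s.operation} {s.duration_ms}ms")
            lines.extend(render(spans, s.span_id, depth + 1))
    return lines
```

For the example trace, `slowest_by_self_time(SPANS)` returns the Stripe API span. OrderService and PaymentService have large total durations, but almost all of that time is spent waiting on their children.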
Distributed Tracing Tools
- Jaeger (CNCF) — open-source, built by Uber, designed for large-scale distributed tracing. Integrates with OpenTelemetry.
- Zipkin (Twitter) — older, simpler, battle-tested. Good for smaller deployments.
- Tempo (Grafana) — open-source trace storage that works natively with Grafana, Prometheus, and Loki for a fully integrated observability stack.
- Datadog APM, AWS X-Ray, Honeycomb — managed options, each with strong integrations in their respective ecosystems.
OpenTelemetry: The Standard
OpenTelemetry (OTel) is the open standard for instrumenting applications to produce logs, metrics, and traces. Instead of writing Datadog-specific or Jaeger-specific instrumentation, you write to the OpenTelemetry API. The OpenTelemetry Collector, or an exporter configured in the SDK, routes the data to whichever backends you choose.
This vendor neutrality matters in practice: you can switch from Datadog to Grafana Cloud without changing your instrumentation code, because typically only the exporter or Collector configuration changes. OTel is the safe default for new instrumentation work.
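The decoupling can be illustrated without installing any OTel packages. The sketch below is a toy model of the idea, not the OpenTelemetry API: `Exporter`, `ConsoleExporter`, `InMemoryExporter`, `Telemetry`, and `charge_card` are all hypothetical names chosen for this example.

```python
from typing import Protocol


class Exporter(Protocol):
    """Anything that can receive a finished span."""

    def export(self, span: dict) -> None: ...


class ConsoleExporter:
    def export(self, span: dict) -> None:
        print(span)


class InMemoryExporter:
    def __init__(self) -> None:
        self.spans: list[dict] = []

    def export(self, span: dict) -> None:
        self.spans.append(span)


class Telemetry:
    """Stable facade the application codes against; the backend is injected."""

    def __init__(self, exporter: Exporter) -> None:
        self._exporter = exporter

    def record_span(self, name: str, duration_ms: float, **attributes: object) -> None:
        self._exporter.export(
            {"name": name, "durationMs": duration_ms, "attributes": attributes}
        )


def charge_card(telemetry: Telemetry, order_id: str) -> None:
    # Application code knows only the facade, never the backend behind it.
    telemetry.record_span("chargeCard", 210, orderId=order_id)


# Swapping the backend changes one constructor argument, not charge_card.
memory = InMemoryExporter()
charge_card(Telemetry(memory), "ord_44291")
```

In a real OTel deployment the same role is played by the API your code calls and the exporter or Collector configuration that decides where the data goes.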
Using All Three Together
The power is in correlation. A production incident workflow looks like this:
1. Metrics alert fires: P99 latency is 800ms, up from 120ms baseline (Grafana/Prometheus)
2. Filter logs for errors in the affected time window: spot a spike in "payment timeout" errors (ELK/Loki)
3. Find a slow trace for a failed payment request: see that Stripe API calls started taking 200ms+ (Jaeger/Zipkin)
4. Cross-reference with Stripe's status page: Stripe reported a regional incident starting at the same timestamp
Total time to root cause: minutes, not hours.
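The log and trace steps of that workflow depend on a shared identifier. Here is a minimal sketch of the correlation, assuming each structured log entry carries a `traceId` field. The field names, timestamps, and sample data are all invented for illustration.

```python
from collections import Counter

# Hypothetical structured log entries; "ts" is seconds since some epoch.
LOGS = [
    {"ts": 100, "level": "INFO", "message": "order created", "traceId": "t1"},
    {"ts": 130, "level": "ERROR", "message": "payment timeout", "traceId": "t2"},
    {"ts": 135, "level": "ERROR", "message": "payment timeout", "traceId": "t3"},
    {"ts": 140, "level": "ERROR", "message": "inventory unavailable", "traceId": "t4"},
    {"ts": 300, "level": "ERROR", "message": "payment timeout", "traceId": "t5"},
]

# Hypothetical total duration of each trace, as a tracing backend might report it.
TRACE_DURATIONS_MS = {"t1": 120, "t2": 5100, "t3": 800, "t4": 90, "t5": 700}


def errors_in_window(logs: list[dict], start: int, end: int) -> list[dict]:
    """Filter logs to ERROR entries inside the alert window."""
    return [e for e in logs if e["level"] == "ERROR" and start <= e["ts"] <= end]


def dominant_error(entries: list[dict]) -> str:
    """Find the most frequent error message in the window."""
    return Counter(e["message"] for e in entries).most_common(1)[0][0]


def slowest_trace_for(entries: list[dict], message: str,
                      durations: dict[str, int]) -> str:
    """Pick the slowest trace among the entries with that message."""
    trace_ids = [e["traceId"] for e in entries if e["message"] == message]
    return max(trace_ids, key=lambda t: durations[t])


window = errors_in_window(LOGS, start=120, end=180)
top_error = dominant_error(window)
trace_to_open = slowest_trace_for(window, top_error, TRACE_DURATIONS_MS)
```

With this data the window contains three errors, "payment timeout" dominates, and trace `t2` is the one to open in the tracing UI. Without a shared trace ID in the logs, that last hop would be guesswork.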
Summary
Observability is built on three pillars: logs for the narrative of what happened, metrics for the aggregate picture over time, and traces for the journey of individual requests through your system. You need all three. Logs without metrics give you no alerting. Metrics without traces give you no root cause. Traces without logs miss the edge cases and context. The modern open-source stack (Prometheus + Grafana for metrics, Loki or ELK for logs, Jaeger or Tempo for traces, unified under OpenTelemetry) gives you observability tooling that was once practical only for companies with very large engineering budgets.