Distributed Tracing
Updated June 8, 2026
Distributed tracing is the practice of tracking a single request as it moves through a distributed system, recording the timing and outcome of every operation it triggers. The result is a complete, ordered picture of where a request spent its time, including which service was slow, which downstream call it was waiting on, and where an error originated.
Metrics tell you that p99 latency is 800ms. Logs tell you errors occurred in the Payment Service. Tracing tells you that a specific request spent 12ms in the API Gateway, 8ms in Auth, and 760ms waiting on a Stripe API call. Those are three different layers of the observability stack, and you need all three.
Traces and Spans
A trace represents the entire lifecycle of one request. It's identified by a globally unique trace ID, generated when the request enters the system.
A span represents a single unit of work within that trace. Each service call, database query, and external API call gets its own span. Every span records:
- The operation name (payment-service.charge, postgres.query, stripe.POST /charges)
- Start timestamp and duration
- The trace ID it belongs to
- Its own span ID
- The parent span ID (so the call tree can be reconstructed)
- Tags and metadata (HTTP status, error message, user ID)
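The fields listed above map naturally onto a small record type. The following is a minimal sketch, not any particular library's data model: the class name Span, its field names, and the helpers new_root_span and new_child_span are illustrative assumptions. The ID sizes (16 bytes for the trace ID, 8 for the span ID) follow the W3C traceparent format.

```python
from dataclasses import dataclass, field
from typing import Optional
import secrets
import time


@dataclass
class Span:
    """One unit of work inside a trace (illustrative field names)."""
    name: str                 # operation name, e.g. "postgres.query"
    trace_id: str             # shared by every span in the trace
    span_id: str              # unique to this span
    parent_id: Optional[str]  # None for the root span
    start_ns: int             # start timestamp, nanoseconds since the epoch
    duration_ns: int = 0      # filled in when the span ends
    tags: dict = field(default_factory=dict)


def new_root_span(name: str) -> Span:
    # A fresh trace ID is generated where the request enters the system.
    return Span(name, secrets.token_hex(16), secrets.token_hex(8), None, time.time_ns())


def new_child_span(parent: Span, name: str) -> Span:
    # A child inherits the trace ID and points back at its parent's span ID.
    return Span(name, parent.trace_id, secrets.token_hex(8), parent.span_id, time.time_ns())


root = new_root_span("api-gateway.POST /checkout")
child = new_child_span(root, "payment-service.charge")
child.tags["http.status_code"] = 200
```

The parent_id link is the only thing needed to rebuild the call tree later; the spans themselves can arrive at the backend in any order.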
[Figure: Spans from multiple services collected and stitched into a single trace timeline]
When the trace backend receives all spans with the same trace ID, it reconstructs the call tree:
Trace: ord_44291 (total: 812ms)
├── [API Gateway] POST /checkout 22ms
│ ├── [Auth Service] validateToken 8ms
│ ├── [Order Service] createOrder 780ms
│ │ ├── [Inventory] reserveItem 12ms
│ │ └── [Payment] chargeCard 762ms ← slow
│ │ └── [Stripe] POST /charges 748ms ← root cause
│ └── [Notify Service] sendEmail 18ms
This visualization makes it immediately obvious that Stripe is the bottleneck. No log searching required.
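The reconstruction step can be sketched in a few lines. This is an illustration of the idea rather than how any specific backend implements it. The durations are copied from the example trace; the single-letter span IDs and the function names are made up for readability.

```python
from collections import defaultdict

# (span_id, parent_id, label, duration_ms): durations from the example trace,
# span IDs invented for this sketch.
spans = [
    ("a", None, "[API Gateway] POST /checkout", 22),
    ("b", "a", "[Auth Service] validateToken", 8),
    ("c", "a", "[Order Service] createOrder", 780),
    ("d", "c", "[Inventory] reserveItem", 12),
    ("e", "c", "[Payment] chargeCard", 762),
    ("f", "e", "[Stripe] POST /charges", 748),
    ("g", "a", "[Notify Service] sendEmail", 18),
]


def build_children(spans):
    """Index spans by parent ID so the tree can be walked top-down."""
    children = defaultdict(list)
    for span in spans:
        children[span[1]].append(span)
    return children


def render(children, parent_id=None, depth=0):
    """Return the call tree as indented lines, depth-first."""
    lines = []
    for span_id, _, label, ms in children[parent_id]:
        lines.append(f"{'  ' * depth}{label} {ms}ms")
        lines.extend(render(children, span_id, depth + 1))
    return lines


def slowest_leaf(spans, children):
    """The slowest span with no children is a reasonable first suspect."""
    leaves = [s for s in spans if not children[s[0]]]
    return max(leaves, key=lambda s: s[3])


children = build_children(spans)
tree_lines = render(children)
suspect = slowest_leaf(spans, children)
print("\n".join(tree_lines))
print("slowest leaf:", suspect[2])
```

Picking the slowest leaf is a deliberately crude heuristic. It works here because the slow parents (createOrder, chargeCard) are slow only because they wait on the Stripe call beneath them.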
Propagation
For spans to form a connected trace, the trace context must be passed from service to service. When Service A calls Service B, it includes the trace ID and its own span ID in the request headers. Service B creates a new child span with Service A's span ID as the parent.
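The W3C traceparent header is the standard format: traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01. This single header carries four dash-separated fields: a version (00), the 32-hex-character trace ID, the 16-hex-character parent span ID, and a flags byte whose lowest bit marks the request as sampled.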
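To make the header layout concrete, here is a small parser for the version 00 format. It is a sketch for illustration: the names TraceContext and parse_traceparent are assumptions, and it deliberately ignores the spec's forward-compatibility rules for future versions.

```python
import re
from typing import NamedTuple, Optional

# version "-" trace-id "-" parent-id "-" trace-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceContext(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version "ff" and all-zero IDs are invalid per the W3C spec.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    return TraceContext(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx)
```

Returning None on a malformed header mirrors the usual behavior of starting a fresh trace rather than failing the request.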
OpenTelemetry handles context propagation for you. Once the SDK and the instrumentation libraries for your HTTP client and server are initialized in each service, they inject the traceparent header into outgoing calls and extract it from incoming ones.
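It can help to see what that inject and extract step amounts to when done by hand. The sketch below is not the OpenTelemetry API; extract, start_server_span, and inject are hypothetical helpers operating on plain dicts, and real HTTP header lookups would need to be case-insensitive.

```python
import secrets


def extract(headers: dict):
    """Read (trace_id, parent_span_id, flags) from incoming headers, if present."""
    value = headers.get("traceparent")
    if not value:
        return None
    parts = value.split("-")
    if len(parts) != 4:
        return None
    _version, trace_id, parent_span_id, flags = parts
    return trace_id, parent_span_id, flags


def start_server_span(headers: dict) -> dict:
    """Continue the caller's trace, or start a new one at the system edge."""
    incoming = extract(headers)
    if incoming is None:
        # Assumption for this sketch: new traces are always marked sampled.
        trace_id, parent_id, flags = secrets.token_hex(16), None, "01"
    else:
        trace_id, parent_id, flags = incoming
    return {"trace_id": trace_id, "span_id": secrets.token_hex(8),
            "parent_id": parent_id, "flags": flags}


def inject(span: dict, headers: dict) -> dict:
    """Add a traceparent header naming this span as the parent of the next hop."""
    out = dict(headers)
    out["traceparent"] = f"00-{span['trace_id']}-{span['span_id']}-{span['flags']}"
    return out


# Service A receives a request with no trace context and calls Service B.
span_a = start_server_span({})
outgoing = inject(span_a, {"content-type": "application/json"})
# Service B receives A's headers and creates a child span.
span_b = start_server_span(outgoing)
```

Service B ends up with the same trace ID as Service A and records A's span ID as its parent, which is exactly the link the backend needs to join the two spans.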
Sampling
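Generating and storing a full trace for every request is expensive. A high-traffic service at 10,000 requests per second, with each request fanning out into tens of spans, would produce hundreds of thousands of spans per second, which adds up to billions of spans per day. The storage and processing cost of keeping all of them is prohibitive.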
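A back-of-the-envelope calculation shows how quickly the volume grows. Only the request rate comes from the text; the 20 spans per request and 500 bytes per span are illustrative assumptions, not measurements, so treat the results as order-of-magnitude estimates.

```python
REQUESTS_PER_SECOND = 10_000  # from the text
SPANS_PER_REQUEST = 20        # assumption: a moderately deep call tree
BYTES_PER_SPAN = 500          # assumption: stored size of one span
SECONDS_PER_DAY = 86_400

spans_per_second = REQUESTS_PER_SECOND * SPANS_PER_REQUEST
spans_per_day = spans_per_second * SECONDS_PER_DAY
terabytes_per_day = spans_per_day * BYTES_PER_SPAN / 1e12

# The same workload with a 5% sample kept.
sampled_terabytes_per_day = terabytes_per_day * 0.05

print(f"{spans_per_second:,} spans/s")
print(f"{spans_per_day:,} spans/day")
print(f"{terabytes_per_day:.2f} TB/day unsampled")
print(f"{sampled_terabytes_per_day:.3f} TB/day at 5% sampling")
```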
Head-based sampling makes the decision at the start of a request. The API Gateway probabilistically decides to trace N% of incoming requests and marks them via the sampling flag in the traceparent header. All downstream services respect the flag: traced requests produce spans, untraced requests don't. The trade-off is that you might not capture a rare, slow request if it happened to fall in the unsampled fraction.
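A head-based decision can be sketched as a deterministic function of the trace ID, so every service that sees the same ID would reach the same answer even without the flag. This is similar in spirit to ratio-based samplers in tracing SDKs but is not copied from any of them; the function names and the use of the low 8 bytes of the ID are assumptions of this sketch.

```python
import random


def should_sample(trace_id: str, rate: float) -> bool:
    """Deterministic decision: the same trace ID always gets the same answer."""
    # Treat the low 8 bytes of the trace ID as a uniform 64-bit number.
    bucket = int(trace_id[-16:], 16)
    return bucket < int(rate * 2**64)


def make_traceparent(trace_id: str, span_id: str, rate: float) -> str:
    """Gateway side: record the decision in the flags field of the header."""
    flags = "01" if should_sample(trace_id, rate) else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def downstream_should_record(traceparent: str) -> bool:
    """Downstream side: respect the sampled bit instead of deciding again."""
    flags = traceparent.rsplit("-", 1)[1]
    return (int(flags, 16) & 0x01) == 1


# Check that roughly 5% of random trace IDs are selected at rate=0.05.
rng = random.Random(0)
trace_ids = [f"{rng.getrandbits(128):032x}" for _ in range(10_000)]
fraction = sum(should_sample(t, 0.05) for t in trace_ids) / len(trace_ids)
print(f"sampled fraction: {fraction:.3f}")
```

Because the decision is made before anything is known about the outcome, a request that later fails has the same 5% chance of being traced as any other.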
Tail-based sampling buffers spans in memory and makes the sampling decision after the request completes. If the request was fast and successful, discard the spans. If it was slow, returned an error, or triggered a specific condition, keep them. This is more useful for debugging because you capture the interesting cases. It's also more complex to operate: you need a collector that holds all in-flight spans until the trace is complete.
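The buffering behavior described above can be sketched as a small in-memory class. It makes a simplifying assumption that real collectors do not: a trace is treated as complete as soon as its root span (the one with no parent) arrives, whereas production samplers typically wait for a timeout because spans can arrive late or out of order. The class name TailSampler and the dict-based span shape are hypothetical.

```python
from collections import defaultdict


class TailSampler:
    """Buffer spans per trace and decide once the root span has arrived."""

    def __init__(self, latency_threshold_ms: float):
        self.latency_threshold_ms = latency_threshold_ms
        self._buffer = defaultdict(list)  # trace_id -> spans seen so far
        self.kept = []                    # stand-in for export to a backend

    def on_span(self, span: dict) -> None:
        self._buffer[span["trace_id"]].append(span)
        # Assumption: the root span ends last, so its arrival completes the trace.
        if span["parent_id"] is None:
            self._decide(span["trace_id"], root=span)

    def _decide(self, trace_id: str, root: dict) -> None:
        spans = self._buffer.pop(trace_id)
        has_error = any(s.get("error") for s in spans)
        is_slow = root["duration_ms"] >= self.latency_threshold_ms
        if has_error or is_slow:
            self.kept.extend(spans)
        # Fast, successful traces are simply dropped.


sampler = TailSampler(latency_threshold_ms=500)
# Children finish before their root, so they are reported first.
sampler.on_span({"trace_id": "fast", "parent_id": "r1", "duration_ms": 5})
sampler.on_span({"trace_id": "fast", "parent_id": None, "duration_ms": 20})
sampler.on_span({"trace_id": "slow", "parent_id": "r2", "duration_ms": 748})
sampler.on_span({"trace_id": "slow", "parent_id": None, "duration_ms": 812})
sampler.on_span({"trace_id": "err", "parent_id": "r3", "duration_ms": 3, "error": True})
sampler.on_span({"trace_id": "err", "parent_id": None, "duration_ms": 15})

kept_trace_ids = {s["trace_id"] for s in sampler.kept}
print(sorted(kept_trace_ids))
```

The memory cost is visible in the sketch: every in-flight trace sits in the buffer until its decision is made, which is the operational burden that head-based sampling avoids.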
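A common production approach combines both goals: keep a 1-5% probabilistic sample of all traffic for baseline observability, and keep 100% of traces that result in errors or exceed a latency threshold. Note that tail-based sampling can only keep traces whose spans actually reach the collector. If the gateway drops 95% of requests with head-based sampling, most failing requests are gone before the tail-based rules run. To capture every error, services export all spans to the collector, and the baseline percentage is applied there alongside the error and latency rules.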
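A combined policy of this kind reduces to a short, ordered list of rules evaluated once a trace is complete. The sketch below is illustrative only: the function name keep_trace, the 500ms threshold, and the 5% baseline are assumptions, not recommended values.

```python
from typing import Optional


def keep_trace(trace_id: str, duration_ms: float, has_error: bool, *,
               baseline_rate: float = 0.05,
               latency_threshold_ms: float = 500.0) -> Optional[str]:
    """Return the reason a completed trace is kept, or None to drop it."""
    # Rule 1: every trace containing an error is kept.
    if has_error:
        return "error"
    # Rule 2: every trace slower than the threshold is kept.
    if duration_ms >= latency_threshold_ms:
        return "slow"
    # Rule 3: a deterministic baseline sample of everything else,
    # derived from the low 8 bytes of the trace ID.
    if int(trace_id[-16:], 16) < int(baseline_rate * 2**64):
        return "baseline"
    return None


tid = "4bf92f3577b34da6a3ce929d0e0e4736"
print(keep_trace(tid, 20, True))
print(keep_trace(tid, 900, False))
print(keep_trace(tid, 20, False))
```

Returning the reason rather than a plain boolean makes it possible to tag kept traces, so that baseline traces can be told apart from the ones retained because they failed or were slow.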
Tools
Jaeger (CNCF project, originally built at Uber) is one of the most widely used open-source distributed tracing backends. It accepts spans over the OpenTelemetry protocol (OTLP) and provides a search UI and trace visualization. It is designed for large-scale deployments.
Zipkin (built by Twitter) is older and simpler. Good for smaller deployments or teams that want minimal operational overhead.
Grafana Tempo is the trace storage backend in the Grafana observability stack. It integrates natively with Loki (logs) and Prometheus (metrics), enabling you to jump from a metric spike to a correlated trace to the relevant log lines in a single workflow.
Managed options: Datadog APM, AWS X-Ray, Honeycomb, and Lightstep are managed platforms that handle collection, storage, and visualization with no infrastructure to operate. Honeycomb is particularly strong for tail-based sampling and high-cardinality trace queries.
OpenTelemetry
OpenTelemetry is the open standard for trace instrumentation. Instead of writing Jaeger-specific or Datadog-specific code, you write to the OpenTelemetry SDK, and an OTel Collector routes spans to whichever backend you choose.
This vendor neutrality matters long-term. Switching from Datadog to Grafana Cloud means changing the collector configuration, not rewriting instrumentation in every service. Auto-instrumentation libraries for common frameworks (Express, gRPC, SQLAlchemy, Spring) can instrument most of a service's inbound and outbound calls without any manual span creation.
Summary
Distributed tracing records the full journey of a request through a distributed system as a tree of timed spans. Spans are connected into a trace by a trace ID propagated through request headers using the W3C traceparent standard. Use sampling to control cost: a small probabilistic sample for baseline coverage, plus tail-based sampling to capture errors and slow requests. OpenTelemetry is the standard instrumentation API; Jaeger, Tempo, and managed platforms like Datadog handle storage and visualization. Tracing is the layer of observability that answers "which service caused this latency" after metrics tell you latency is elevated.