Metrics & Instrumentation
Updated June 8, 2026
Metrics are numeric measurements sampled over time. Error rate. Request latency. CPU utilization. Active database connections. Queue depth. Unlike logs, which record discrete events with full context, metrics are aggregated: you don't store "this specific request took 142ms." You store "the p99 latency in the last minute was 312ms." That aggregation is what makes metrics cheap to store, fast to query, and suitable for real-time alerting.
Logs tell you what happened in detail. Metrics tell you whether the system is healthy right now.
Metric Types
Counter: a value that only increases. Total HTTP requests received, total orders placed, total exceptions thrown. Counters are used to compute rates: if the requests counter grew by 1,500 in the last 5 seconds, the request rate is 300/s.
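The rate calculation can be sketched in a few lines. This is a minimal illustration, not how any particular metrics system implements it: the function name `rate_per_second` is hypothetical, and treating a decrease as a restart that reset the counter to zero is an assumption.

```python
def rate_per_second(prev_value, curr_value, interval_seconds):
    """Per-second rate between two samples of an ever-increasing counter."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = curr_value - prev_value
    if delta < 0:
        # A counter never decreases on its own. Assume the process restarted
        # and the counter began again from zero, so the current value is
        # everything counted since the reset.
        delta = curr_value
    return delta / interval_seconds


# The counter grew by 1,500 over 5 seconds.
print(rate_per_second(10_000, 11_500, 5))  # 300.0
```

The reset branch is the reason counters are stored as raw totals rather than precomputed rates: a restart loses only the increments between the last sample and the crash, not the whole history.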
Gauge: a value that can go up or down. Current memory usage, active database connections, queue length, CPU utilization. A gauge captures the current state at the moment of measurement.
Histogram: records the distribution of a value. Used primarily for latency and payload size. Histograms let you compute percentiles: p50, p95, p99. This matters because averages hide tail latency. If 98 requests complete in 100ms and two take 10,000ms, the average is about 298ms, which looks acceptable. The p99 is 10,000ms, which does not.
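To make the gap between the mean and the tail concrete, the sketch below computes both for a hypothetical sample of 100 latencies in which two requests are slow. The nearest-rank percentile definition used here is an assumption chosen for simplicity; metrics systems typically estimate percentiles from histogram buckets rather than from raw samples.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based position in the sorted list
    return ordered[rank - 1]


# Hypothetical sample: 98 fast requests and 2 very slow ones.
latencies_ms = [100] * 98 + [10_000] * 2

print(sum(latencies_ms) / len(latencies_ms))  # 298.0
print(percentile(latencies_ms, 50))           # 100
print(percentile(latencies_ms, 99))           # 10000
```

Under this definition the p99 only moves once more than 1% of requests are slow, which is why the sample uses two slow requests out of 100.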
The RED Method
For any HTTP or RPC service, three metrics cover most failure modes:
- Rate: requests per second being served
- Errors: fraction of requests returning errors (5xx rate, timeout rate)
- Duration: latency distribution (p50, p95, p99)
A dashboard showing these three for every microservice will surface the majority of production problems. Add these before any other metrics.
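As a rough sketch, the three RED values for one time window could be derived like this. It works from raw request samples purely for illustration; a real service would usually derive the same numbers from a request counter, an error counter, and a latency histogram. `RequestSample`, `red_summary`, and the choice to count only 5xx responses as errors are all assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestSample:
    status: int         # HTTP status code of the response
    duration_ms: float  # time spent serving the request


def nearest_rank(sorted_values, p):
    """Nearest-rank percentile over an already sorted, non-empty list."""
    rank = math.ceil(p * len(sorted_values) / 100)
    return sorted_values[rank - 1]


def red_summary(samples, window_seconds):
    """Rate, Errors, and Duration for the requests seen in one time window."""
    if not samples or window_seconds <= 0:
        raise ValueError("need at least one sample and a positive window")
    durations = sorted(s.duration_ms for s in samples)
    server_errors = sum(1 for s in samples if s.status >= 500)
    return {
        "rate_per_s": len(samples) / window_seconds,
        "error_ratio": server_errors / len(samples),
        "p50_ms": nearest_rank(durations, 50),
        "p95_ms": nearest_rank(durations, 95),
        "p99_ms": nearest_rank(durations, 99),
    }


# Hypothetical one-minute window: 57 fast successes and 3 slow failures.
window = [RequestSample(200, 80.0)] * 57 + [RequestSample(500, 900.0)] * 3
summary = red_summary(window, window_seconds=60)
print(summary)
```

In this window the p50 and p95 both stay at 80ms while the p99 jumps to 900ms, so the failing requests are visible in the error ratio and the tail latency but not in the median.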
Prometheus and the Pull Model
Prometheus is the de facto standard open-source metrics system. Rather than having each service push data to a central collector, Prometheus pulls: your service exposes a /metrics endpoint, and Prometheus scrapes it at a configured interval, commonly every 15 seconds.
# /metrics endpoint output
http_requests_total{method="GET", status="200"} 10534
http_requests_total{method="POST", status="500"} 42
memory_usage_bytes 204857600
http_request_duration_seconds_bucket{le="0.1"} 9812
http_request_duration_seconds_bucket{le="0.5"} 10490
Prometheus stores these time-series values in its own database. Grafana connects to Prometheus and renders dashboards and alerts.
The pull model has an operational advantage: adding a new service to monitoring means registering its endpoint with Prometheus, not configuring the service to push somewhere. It also makes it easy to see when a service stops responding entirely (the scrape fails).
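The text a /metrics endpoint returns is simple enough to render by hand, which helps show what a scrape actually reads. The sketch below is illustrative only: `format_sample` and `render_metrics` are hypothetical helpers, serving the result over HTTP is left out, and a real service would normally use an official Prometheus client library instead.

```python
def escape_label_value(value):
    """Escape backslashes, double quotes, and newlines inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_sample(name, labels, value):
    """Render one sample as: name{label="value",...} value"""
    if not labels:
        return f"{name} {value}"
    pairs = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    return f"{name}{{{pairs}}} {value}"


def render_metrics(samples):
    """Join (name, labels, value) tuples into the body of a /metrics response."""
    return "\n".join(format_sample(*sample) for sample in samples) + "\n"


body = render_metrics([
    ("http_requests_total", {"method": "GET", "status": "200"}, 10534),
    ("http_requests_total", {"method": "POST", "status": "500"}, 42),
    ("memory_usage_bytes", {}, 204857600),
])
print(body, end="")
```

Because the service only has to render its current values on request, it needs no knowledge of where the metrics system lives, which is the operational advantage of the pull model.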
Labels and Cardinality
A raw counter like http_requests_total: 1000 isn't useful for diagnosis. Labels let you slice the data:
http_requests_total{service="payment", endpoint="/charge", status_code="200", region="us-east-1"} 982
http_requests_total{service="payment", endpoint="/charge", status_code="500", region="us-east-1"} 18
Now you can query: "show me the error rate for the payment service, grouped by region." You can see that us-east-1 is failing while eu-west-1 is fine.
The constraint is cardinality. Every unique combination of label values creates a separate time series in Prometheus's storage. Labels with a small, finite set of values (HTTP status codes, regions, service names) are fine. Labels with unbounded values (user IDs, order IDs, session IDs) can create millions of series and overwhelm your metrics database. Never use entity identifiers as metric labels.
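The arithmetic behind the warning is a product: the worst-case number of series for one metric is the number of distinct values of each label multiplied together. A small sketch, with hypothetical label sets and a hypothetical `series_count` helper:

```python
import math


def series_count(label_values):
    """Upper bound on time series for one metric: product of distinct values per label."""
    return math.prod(len(values) for values in label_values.values())


# Bounded labels: each has a small, fixed set of values.
bounded = {
    "service": ["payment", "cart", "search"],
    "status_code": ["200", "400", "404", "500"],
    "region": ["us-east-1", "eu-west-1"],
}
print(series_count(bounded))  # 24

# Adding one unbounded label, here a million user IDs, multiplies everything.
with_user_id = dict(bounded, user_id=range(1_000_000))
print(series_count(with_user_id))  # 24000000
```

The result is an upper bound, since only combinations that actually occur become series, but a single unbounded label is enough to make that bound unmanageable.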
The Four Golden Signals
Google's SRE book defines four signals to measure for every service:
- Latency: how long requests take, including the p99 to surface tail issues
- Traffic: request rate (requests per second)
- Errors: fraction of failing requests
- Saturation: how full the service's resources are (CPU %, memory %, queue depth)
RED and the Four Golden Signals cover largely the same ground from slightly different angles. Either works as a starting framework. The point is to instrument something standard across all services so you can compare them on the same axes.
Business Metrics
System metrics show you whether the infrastructure is healthy. Business metrics show you whether the product is working.
A checkout service might have a perfect p99 latency but a "checkout completion rate" that dropped 10% because a CSS change hid the payment button on mobile. The infrastructure looks fine. The business is not.
Instrument business events the same way you instrument infrastructure: checkout completions, signup conversions, video play events. These metrics are often the first to detect problems that system metrics miss entirely.
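A business metric such as checkout completion rate can be derived from two ordinary counters. The sketch below assumes hypothetical counters for checkouts started and checkouts completed, and the numbers are made up for illustration.

```python
def completion_rate(started, completed):
    """Fraction of started checkouts that completed, or None if none were started."""
    if started == 0:
        return None
    return completed / started


def relative_drop(baseline, current):
    """How far the current rate has fallen relative to the baseline rate."""
    return (baseline - current) / baseline


# Hypothetical counts for two comparable hours, before and after a deploy.
before = completion_rate(started=1000, completed=800)
after = completion_rate(started=1000, completed=720)

print(f"{before:.0%} -> {after:.0%}")                 # 80% -> 72%
print(f"drop: {relative_drop(before, after):.0%}")  # drop: 10%
```

Nothing in this calculation depends on latency or error codes, so a drop like this can appear while every system metric stays healthy.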
Summary
Metrics are numeric time-series measurements aggregated across all requests, cheap to store and fast to query. Use counters for totals, gauges for current state, and histograms for latency distributions. Apply the RED method (Rate, Errors, Duration) to every service as a baseline. Use Prometheus for collection and Grafana for visualization. Keep label cardinality low by never using entity IDs as label values. Add business metrics alongside infrastructure metrics so you're measuring user outcomes, not just server health.