Dashboards & Runbooks

Updated June 8, 2026
Magic Magnets Team

Dashboards make the current state of a system visible. Runbooks tell an engineer what to do when something in that state is wrong. They work together: the dashboard surfaces the anomaly, the runbook guides the response.

What to Put on a Dashboard

The most common mistake teams make when building their first dashboard is putting everything on it. A dashboard with 80 panels is nearly as useless as one with none, because the signal is buried in noise.

Start with the Four Golden Signals from Google's SRE book, applied to each service:

  1. Latency: how long requests take (p50, p95, p99, not just average)
  2. Traffic: request rate (requests per second or per minute)
  3. Errors: fraction of requests returning errors
  4. Saturation: how full the resource is (CPU %, memory %, queue depth)

These four metrics surface most production problems. Once you have them for every service, add metrics specific to your domain: checkout conversion rate, active video streams, message queue lag.
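
As a rough illustration of what the four signals measure, the sketch below computes them from one window of raw request records. The `Request` type, the nearest-rank percentile method, the choice to count only 5xx responses as errors, and the CPU figure passed in as saturation are all assumptions made for the example. In practice a metrics system such as Prometheus would compute these for you.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # how long the request took
    status: int         # HTTP status code


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(requests, window_seconds, cpu_fraction):
    """Summarize one non-empty window of requests into the four signals."""
    durations = [r.duration_ms for r in requests]
    errors = sum(1 for r in requests if r.status >= 500)  # assumption: 5xx only
    return {
        "latency_ms": {p: percentile(durations, p) for p in (50, 95, 99)},
        "traffic_rps": len(requests) / window_seconds,
        "error_fraction": errors / len(requests),
        "saturation": cpu_fraction,
    }


# 100 requests in a 10-second window: 98 fast successes and 2 slow server errors.
sample = [Request(20.0, 200)] * 98 + [Request(900.0, 500)] * 2
signals = golden_signals(sample, window_seconds=10, cpu_fraction=0.65)
average_ms = sum(r.duration_ms for r in sample) / len(sample)
```

In this sample the average latency is 37.6 ms while the p99 is 900 ms, which is why the list above asks for percentiles and not just the average.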

Dashboard Tiers

Not everything belongs on the same dashboard. Organize by audience and scope.

Service health dashboard: one per service, showing the RED metrics (Rate, Errors, Duration, which correspond to the Traffic, Errors, and Latency signals above) and key resource utilization. This is what the on-call engineer opens first when paged for that service.

Business metrics dashboard: checkout completion rate, signup funnel, revenue per minute. Shows whether the product is working, not just whether the servers are running. A deploy that breaks the payment button won't show up on a CPU dashboard but will appear here immediately.

Infrastructure dashboard: host-level CPU, memory, disk I/O, network throughput across the fleet. Used for capacity planning and diagnosing node-level problems that service dashboards don't expose.

Keep each dashboard focused on its audience. An engineer handling a 3 AM page for the payment service doesn't need to see CDN cache hit rates.

Grafana in Practice

Grafana is a widely used open-source dashboard tool for cloud-native systems. It connects to Prometheus, Loki, Tempo, and most cloud metrics APIs as data sources. Dashboards are defined as JSON and can be version-controlled alongside application code.

Variable selectors at the top of a Grafana dashboard let users filter by environment, region, or service without editing queries. A single dashboard template can serve production, staging, and development with a dropdown.
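
To make the idea concrete, here is a simplified sketch of a dashboard definition with one variable, built as a Python dictionary and serialized to JSON. The field names follow the general shape of Grafana's dashboard JSON model, but this is a trimmed fragment, not a complete export. The metric name `http_requests_total` and the `env` label are hypothetical.

```python
import json

# Simplified fragment of a dashboard definition; a real export has many more fields.
dashboard = {
    "title": "Service health",
    "templating": {
        "list": [
            {
                "name": "env",      # referenced in queries as $env
                "label": "Environment",
                "type": "custom",   # a fixed list of options shown as a dropdown
                "query": "production,staging,development",
            }
        ]
    },
    "panels": [
        {
            "title": "Request rate",
            "type": "timeseries",
            "targets": [
                # Hypothetical metric and label names; $env is filled in by the dropdown.
                {"expr": 'sum(rate(http_requests_total{env="$env"}[5m]))'}
            ],
        }
    ],
}

# The JSON text is what would be committed to version control.
dashboard_json = json.dumps(dashboard, indent=2)
```

Because the query refers to `$env` and not a hard-coded environment, the same panel serves all three environments.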

Datadog and New Relic are managed alternatives that bundle metrics, logs, and traces into one platform. They typically cost more but eliminate the operational work of running Prometheus, Loki, and Grafana yourself.

Runbooks

A runbook is a document that answers the question: "The [alert name] just fired. What do I do?"

It's written for someone who is stressed and possibly half-asleep. It should require no prior knowledge of the system beyond what's written in the document. Every diagnosis step should include a direct link to the specific dashboard or tool. Every action should be specific enough to execute without interpretation.

A runbook entry for a "Payment Service p99 latency high" alert:

Alert: Payment Service p99 > 2s for 5 minutes

Impact: Users are experiencing slow or failing checkouts.

Step 1: Check [Payment Service dashboard] for error rate. If errors > 5%, this is a severity-1 incident. Follow the SEV-1 incident process.

Step 2: Check [deploy log] for any deploys in the last 30 minutes. If a deploy is recent, roll back using [Jenkins rollback link].

Step 3: Check [Stripe status page]. If Stripe has an active incident, the fix is to wait. Set a timer for 15 minutes and check again.

Step 4: If none of the above, check [database dashboard] for connection pool exhaustion. If active connections > 90% of pool size, restart the payment service pod in [Kubernetes dashboard].

Escalation: If not resolved in 20 minutes, page the Payment Platform team via @payment-platform in Slack.

That specificity is what makes a runbook valuable. Vague guidance like "check the database" leaves the engineer guessing. Specific links and thresholds don't.
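
One way to check whether a runbook is specific enough is to see whether its steps can be written as plain conditions. The sketch below encodes the payment latency runbook above as a function that returns the first matching step. The `Observations` type and its field names are hypothetical, and sending the unmatched case straight to escalation is an assumption, since the runbook itself only escalates after 20 minutes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observations:
    error_rate: float                      # fraction of requests failing, 0.0 to 1.0
    minutes_since_deploy: Optional[float]  # None when no recent deploy is known
    stripe_incident: bool                  # True if the Stripe status page shows an incident
    db_pool_used: float                    # active connections / pool size


def next_action(obs: Observations) -> str:
    """Return the first runbook step whose condition matches, checked in runbook order."""
    if obs.error_rate > 0.05:
        return "SEV-1: follow the SEV-1 incident process"
    if obs.minutes_since_deploy is not None and obs.minutes_since_deploy <= 30:
        return "Roll back the recent deploy"
    if obs.stripe_incident:
        return "Wait 15 minutes, then re-check the Stripe status page"
    if obs.db_pool_used > 0.90:
        return "Restart the payment service pod"
    # Assumption: when no step matches, go straight to the escalation path.
    return "Escalate to the Payment Platform team"


# Example: low error rate, but a deploy went out 12 minutes ago.
action = next_action(Observations(0.01, 12, False, 0.40))
```

A step like "check the database" cannot be written this way, which is a useful signal that it needs a threshold and a link.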

Keeping Runbooks Current

Runbooks go stale. Links break, procedures change, new failure modes appear. A runbook that reflects a system from 18 months ago can mislead an engineer during an incident.

Two practices help. First, update the runbook as part of resolving every incident. The post-mortem for any incident where the runbook was unhelpful should include a runbook update as an action item. Second, review runbooks quarterly: walk through each runbook the team owns and confirm it still reflects the system.
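
A quarterly review is easier to keep up when something lists the runbooks that are due. The following is a minimal sketch, assuming each runbook records the date of its last review. The runbook names, the dates, and the 90-day interval standing in for "quarterly" are illustrative.

```python
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=90)  # roughly one quarter


def overdue_runbooks(last_reviewed, today):
    """Return names of runbooks last reviewed more than REVIEW_INTERVAL ago, oldest first."""
    overdue = [
        (reviewed, name)
        for name, reviewed in last_reviewed.items()
        if today - reviewed > REVIEW_INTERVAL
    ]
    return [name for reviewed, name in sorted(overdue)]


# Hypothetical runbook names and review dates.
last_reviewed = {
    "payment-latency-high": date(2026, 5, 20),
    "db-pool-exhausted": date(2025, 11, 3),
    "queue-lag-high": date(2026, 1, 15),
}
stale = overdue_runbooks(last_reviewed, today=date(2026, 6, 8))
```

Here `stale` contains the two runbooks not reviewed in the last 90 days, with the most neglected one first.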

Automation

The best version of a runbook is one that doesn't need to be read. If an alert fires because CPU is at 90% and the right action is always "add more instances," that step should happen automatically via auto-scaling rules. If a service crashes and the right action is always "restart the pod," Kubernetes already does that by restarting failed containers.

Automation removes the latency of paging a human, waiting for them to wake up, and waiting for them to execute steps. For well-understood failure modes, it also eliminates human error.

Reserve manual runbooks for failure modes that are novel, multi-system, or require judgment. Automate the routine ones.
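
The split between automated and manual handling can be sketched as a small dispatch table: alerts with a known, always-correct action run that action, and everything else pages a human. The alert names and the `scale_out` and `page_on_call` functions are hypothetical stand-ins. Real versions would call your platform's scaling and paging APIs.

```python
# Hypothetical remediation actions; these only return a description of what they would do.
def scale_out(service):
    return f"scaled out {service}"


def page_on_call(service, alert):
    return f"paged on-call for {service}: {alert}"


# Alerts whose correct response is always the same action.
AUTOMATED_ACTIONS = {
    "cpu_high": scale_out,
}


def handle_alert(alert, service):
    """Run the automated action if one is registered, otherwise page a human."""
    action = AUTOMATED_ACTIONS.get(alert)
    if action is not None:
        return action(service)
    return page_on_call(service, alert)


routine = handle_alert("cpu_high", "checkout")
novel = handle_alert("payment_latency_high", "payment")
```

Keeping the table explicit makes it easy to see which failure modes are considered well understood, and adding an entry becomes a deliberate, reviewable decision.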

Summary

Dashboards surface problems. Runbooks tell engineers how to respond. Build dashboards around the Four Golden Signals (Latency, Traffic, Errors, Saturation) per service, supplemented by business metrics. Write runbooks for stressed engineers: specific steps, direct links, and clear escalation paths. Update runbooks during post-mortems, not months later. Automate any runbook step that is always the right action for a known condition.
