Alert & Monitoring
Updated June 8, 2026
Metrics and dashboards tell you what's happening. Alerting tells a human that something requires attention right now. A perfect Grafana dashboard showing a degraded checkout flow is useless if no engineer is looking at the screen. Alerting is the system that routes anomalies to the right person at the right time.
Alert on Symptoms, Not Causes
The most common alerting mistake is writing rules against causes instead of symptoms.
Cause-based alert: "Notify when database CPU exceeds 90%."
This will page engineers for routine batch jobs, scheduled analytics queries, and index rebuilds that have no user impact. Engineers learn to ignore it. When a real problem hits, the alert is already noise.
Symptom-based alert: "Notify when p99 latency on the checkout API exceeds 2 seconds."
This fires when users are actually experiencing something. If database CPU at 90% is causing slow queries, the symptom alert catches it. If CPU is high but latency is fine (a batch job), no one is paged.
Write alerts against user-facing Service Level Objectives (SLOs): error rate, availability, and latency thresholds that directly reflect user experience. The infrastructure metrics that explain why an SLO is being missed belong in dashboards for diagnosis, not in alerts.
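As a rough sketch of what a symptom-based rule evaluates, the check below uses the 2-second p99 threshold from the checkout example and only fires when the breach is sustained. It assumes latency samples are already grouped into per-minute batches in milliseconds; the function names, the nearest-rank percentile method, and the data layout are illustrative assumptions, not any monitoring tool's API. In practice this logic would usually live in the monitoring system's own rule language.

```python
import math


def p99(latencies_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(99 * len(ordered) / 100)  # 1-based nearest-rank position
    return ordered[rank - 1]


def should_alert(per_minute_latencies_ms, threshold_ms=2000, sustained_minutes=5):
    """Fire only if p99 exceeded the threshold in each of the last N minutes.

    per_minute_latencies_ms is a list of per-minute sample lists, oldest first.
    A minute with no samples is treated as not breaching (an assumption).
    """
    recent = per_minute_latencies_ms[-sustained_minutes:]
    if len(recent) < sustained_minutes:
        return False  # not enough history to call the breach sustained
    return all(bool(minute) and p99(minute) > threshold_ms for minute in recent)
```

Requiring several consecutive breaching minutes is one way to keep a single slow minute from paging anyone; the right window depends on the SLO.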
Alert Tiers
Not every problem justifies waking someone up. Structure alerts into at least two tiers:
Paging alerts are high-severity, immediate, and require action now. Examples: checkout error rate above 5%, primary database unreachable, p99 latency above SLO for 5 consecutive minutes. Route these through PagerDuty or OpsGenie with phone escalation. If the on-call engineer doesn't acknowledge within 5 minutes, escalate to the next tier.
Ticket alerts cover problems that need to be addressed while the system is still functional. Examples: disk utilization above 80%, cache hit rate declining, certificate expiring in 14 days. Route these to a Slack channel or Jira automatically, to be handled during business hours.
Every alert that fires must be actionable. If the response to an alert is "this happens sometimes, ignore it," that alert should be deleted immediately. Noise from low-quality alerts causes engineers to start dismissing notifications, and a real incident gets lost.
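The two tiers can be sketched as a small routing function. This is a minimal illustration, assuming each alert carries a flag saying whether it needs action right now; the `Alert` fields, the `Tier` enum, and the destination labels are hypothetical and do not correspond to the PagerDuty, OpsGenie, Slack, or Jira APIs.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    PAGE = "page"      # wake someone up now
    TICKET = "ticket"  # handle during business hours


# Hypothetical destination labels, one per tier.
DESTINATIONS = {
    Tier.PAGE: "on-call pager",
    Tier.TICKET: "team ticket queue",
}


@dataclass(frozen=True)
class Alert:
    name: str
    needs_action_now: bool  # True when users are affected right now


def route(alert):
    """Pick the tier for an alert based on whether it needs immediate action."""
    return Tier.PAGE if alert.needs_action_now else Tier.TICKET


def destination(alert):
    """Return the (hypothetical) destination label for an alert."""
    return DESTINATIONS[route(alert)]
```

For example, `destination(Alert("checkout error rate above 5%", True))` returns the pager label, while a disk-utilization alert with `needs_action_now=False` lands in the ticket queue.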
Alert Fatigue
Alert fatigue sets in when engineers stop responding to alerts because the signal-to-noise ratio is too low. It's one of the most dangerous conditions an on-call rotation can reach. The team becomes desensitized to notifications, and the page for a major outage goes unanswered because it looks like the usual noise.
Common causes: alerting on causes instead of symptoms, low thresholds that fire before problems are real, alerts that resolve on their own without action, and separate alerts for every instance of a replicated service rather than an aggregate view.
Review your alert history regularly. Any alert that fired in the last 30 days and required no action should be either removed or converted to a ticket alert.
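A review like this can be partly automated. The sketch below assumes alert history is available as `(alert_name, fired_at, action_taken)` tuples, which is a hypothetical export format, and it interprets "required no action" as "no firing of that alert in the window led to any action".

```python
from datetime import datetime, timedelta


def noisy_alerts(history, now, window_days=30):
    """Return names of alerts that fired in the window but never needed action.

    history: iterable of (alert_name, fired_at, action_taken) tuples,
             where fired_at is a datetime and action_taken is a bool.
    """
    cutoff = now - timedelta(days=window_days)
    acted_on = {}  # alert name -> True if any firing in the window needed action
    for name, fired_at, action_taken in history:
        if fired_at >= cutoff:
            acted_on[name] = acted_on.get(name, False) or action_taken
    return sorted(name for name, acted in acted_on.items() if not acted)
```

Every name this returns is a candidate for removal or for demotion to a ticket alert; the decision itself still needs a human who knows the service.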
Routing and Escalation
Alerts should route to the team that owns the service, not a single central on-call queue. At Amazon, engineers own and operate the services they write: when code they deployed breaks, they're the ones paged. This directly ties code quality to operational cost and creates strong incentives to write reliable systems and effective alerts.
A basic escalation chain:
- Page the on-call engineer for the owning team
- If unacknowledged after 5 minutes, escalate to the team lead
- If unacknowledged after 15 minutes, escalate to the engineering manager and open a war room
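The chain above can be expressed as a function of how long a page has gone unacknowledged. This is a minimal sketch using the 5- and 15-minute thresholds from the list; the function name and role labels are illustrative, and a real setup would normally configure this in the paging tool's escalation policy rather than in application code.

```python
def escalation_targets(minutes_unacknowledged):
    """Return everyone who should have been notified by this point."""
    targets = ["on-call engineer"]  # always paged first
    if minutes_unacknowledged >= 5:
        targets.append("team lead")
    if minutes_unacknowledged >= 15:
        targets.append("engineering manager")  # also the point to open a war room
    return targets
```

Acknowledging the page stops the clock, so in this model an acknowledged incident never reaches the later targets.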
The Incident Lifecycle
When a paging alert fires, a standard incident response keeps recovery fast:
- Acknowledge: the on-call engineer takes the page, stopping escalation
- Triage: establish impact scope. How many users are affected? Is it degraded or completely down?
- Mitigate: restore service, even imperfectly. Roll back the recent deploy, scale up the affected component, flip a feature flag to disable the broken path. Mitigation is more important than root cause analysis during active impact.
- Resolve: confirm the alert is cleared and service metrics have returned to normal
- Post-mortem: write up what happened, why the alert fired, what the fix was, and what changes will prevent recurrence. Update the runbook.
Runbooks
Every paging alert should have a linked runbook: a short document that tells the engineer receiving the page what to check and what actions to take. Runbooks are written for someone who is half-asleep and under stress. They include direct links to dashboards, specific queries to run, and the escalation contact to use if the listed steps don't resolve the problem.
A runbook for a "Search API latency high" alert might say:
- Check the Search service dashboard for error spikes (link)
- Check whether a deploy happened in the last 30 minutes (link to deploy log)
- If yes, roll back via the deploy tool (link)
- If no deploy, check Elasticsearch CPU (link to Elasticsearch dashboard)
- If Elasticsearch CPU > 90%, enable rate limiting via feature flag (link)
- If still unresolved after 15 minutes, page the Search Platform team (contact)
The runbook replaces guesswork with procedure.
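One way to see how mechanical a good runbook is: the Search API example reduces to a short decision function. This is only an illustration of the example's branching, assuming the three inputs have already been read off the linked dashboards and deploy log; the function name, parameters, and returned strings are hypothetical.

```python
def next_step(recent_deploy, es_cpu_percent, minutes_elapsed):
    """Return the next runbook action for a 'Search API latency high' page.

    recent_deploy:   True if a deploy happened in the last 30 minutes
    es_cpu_percent:  current Elasticsearch CPU utilization
    minutes_elapsed: minutes spent on the incident so far
    """
    if minutes_elapsed >= 15:
        return "page the Search Platform team"
    if recent_deploy:
        return "roll back the deploy"
    if es_cpu_percent > 90:
        return "enable rate limiting"
    return "keep investigating"
```

If a runbook cannot be reduced to branches this plain, it is probably asking the on-call engineer to improvise.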
Summary
Alerts must be actionable. Write alert rules against symptoms (user-facing SLOs) rather than causes (infrastructure thresholds). Separate paging alerts from ticket alerts based on severity and required response time. Delete any alert that routinely fires without requiring action. Link every paging alert to a runbook with step-by-step diagnostic and mitigation steps. Treat alert fatigue as a first-class operational risk: when engineers stop responding to alerts, incidents stop being caught.