Logging Best Practices

Updated June 8, 2026

Magic Magnets Team

Logs are the oldest form of observability. Before dashboards, before distributed tracing, before APM tools — there were logs. A text file on a server, recording what the process was doing.

The fundamental idea is still the same. What's changed is the scale, the tooling, and — most importantly — the standards for doing it well. Because bad logging is almost as dangerous as no logging. It gives you the illusion of visibility while burying the signal in noise.

Why Logging Matters

Think about the last time something broke in production. What did you do first? You looked at logs. They're your first-response tool for every incident:

Debugging: what exactly happened, in what order, with what context?
Auditing: who did what, when, and with what result?
Security forensics: what requests did this compromised account make?
Root cause analysis: what was the error message, and which line of code threw it?

No metric tells you "the database query that caused the timeout was SELECT * FROM orders WHERE user_id = ...". Only a log does.

But logs are only useful if you can find the right one at the right time. That's what good logging practices enable.

Structured vs Unstructured Logging

Here's a real log line from a system that logs in plain text:

[2026-06-03 14:32:01] ERROR: Payment failed for user 8821 after 3 attempts

And here's the same event logged as structured JSON:

{
  "timestamp": "2026-06-03T14:32:01.221Z",
  "level": "error",
  "service": "payment-service",
  "message": "Payment failed after max retries",
  "userId": "u_8821",
  "orderId": "ord_44291",
  "attempts": 3,
  "durationMs": 5012,
  "errorCode": "STRIPE_TIMEOUT",
  "traceId": "4bf92f3577b34da6"
}

The structured version lets you:

Filter instantly: userId = "u_8821" — find all events for this user in seconds
Aggregate: how many STRIPE_TIMEOUT errors in the last hour?
Join with traces: the traceId links this log to the full distributed trace
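As a rough illustration of the filter and aggregate operations above, here is a minimal Python sketch that works on JSON lines held in memory. The sample events, the CARD_DECLINED code, and the helper names (parse_lines, filter_events, count_by) are invented for this example; a real logging platform would run the equivalent query on the server side.

```python
import json
from collections import Counter

# Hypothetical sample data: one JSON object per line, shaped like the event above.
RAW_LOGS = """\
{"level": "error", "service": "payment-service", "userId": "u_8821", "errorCode": "STRIPE_TIMEOUT"}
{"level": "info", "service": "order-service", "userId": "u_8821", "message": "Order created"}
{"level": "error", "service": "payment-service", "userId": "u_1007", "errorCode": "STRIPE_TIMEOUT"}
{"level": "error", "service": "payment-service", "userId": "u_1007", "errorCode": "CARD_DECLINED"}
"""


def parse_lines(raw):
    """Parse newline-delimited JSON, skipping blank lines."""
    return [json.loads(line) for line in raw.splitlines() if line.strip()]


def filter_events(events, **fields):
    """Return the events whose fields all equal the given values."""
    return [e for e in events if all(e.get(k) == v for k, v in fields.items())]


def count_by(events, field):
    """Count events per value of `field`, ignoring events that lack it."""
    return Counter(e[field] for e in events if field in e)


events = parse_lines(RAW_LOGS)
user_events = filter_events(events, userId="u_8821")
error_counts = count_by(filter_events(events, level="error"), "errorCode")
```

Neither helper needs a regular expression or any knowledge of how the message text is worded, which is the practical payoff of logging fields instead of sentences.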

Always use structured logging in production. During a 3 AM incident, it can be the difference between a 10-minute investigation and a 2-hour one.

Use a logging library that emits structured output natively: Pino or Winston in Node.js, structlog in Python, Zap or slog in Go, Logback with Logstash encoder in Java.
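To show what "structured output" means in practice, here is a minimal sketch that uses only Python's standard logging module. The libraries named above do this for you; this version is just small enough to read in one go. The JsonFormatter class, the build_logger helper, and the convention of passing event fields under extra={"fields": ...} are assumptions made for this example, not part of any library.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        event = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": self.service,
            "message": record.getMessage(),
        }
        # Example convention: fields passed as extra={"fields": {...}} are merged in.
        event.update(getattr(record, "fields", {}))
        if record.exc_info:
            event["exception"] = self.formatException(record.exc_info)
        return json.dumps(event)


def build_logger(name, service, stream):
    """Create a logger that writes JSON lines to `stream`."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


stream = io.StringIO()  # stand-in for stdout so the output can be inspected
logger = build_logger("payments", "payment-service", stream)
logger.error(
    "Payment failed after max retries",
    extra={"fields": {"userId": "u_8821", "attempts": 3, "errorCode": "STRIPE_TIMEOUT"}},
)
event = json.loads(stream.getvalue())
```

In a real service the stream would be stdout, and a log shipper would pick the lines up from there.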

Log Levels: Signal vs Noise

Log levels exist to let you filter. If everything is INFO, nothing is. Here's how to use them correctly:

DEBUG

Verbose diagnostic information. Variable values, intermediate computation steps, SQL queries with parameters.

Rule: Never emit DEBUG logs in production by default. They generate enormous volume and can expose sensitive data. Enable them temporarily when diagnosing a specific issue.

INFO

Normal, expected operations that are worth recording.

"Service started on port 3000"
"User u_8821 logged in"
"Order ord_44291 created successfully"
"Background job completed: processed 1,432 records in 8.2s"

Rule: INFO logs should give you a coherent narrative of what your system did. Not every function call — just meaningful business events and lifecycle events.

WARN

Something unexpected happened, but the system handled it and continued.

"Retry succeeded after 2 failures (attempt 3)"
"Cache unavailable: falling back to database"
"Rate limit approaching for user u_8821 (85% of quota used)"
"Deprecated API endpoint called: /v1/users (use /v2/users)"

Rule: A WARN should prompt investigation but not immediate action. WARNs that go unacknowledged tend to turn into ERRORs.

ERROR

Something failed that shouldn't have. This requires attention.

"Database query timed out after 5000ms"
"Payment processing failed: STRIPE_TIMEOUT"
"Failed to send notification email to u_8821: invalid address"
"Unhandled exception in request handler"

Rule: Every ERROR should be actionable. If you're logging ERRORs that you always ignore, they're WARNs. If you can't imagine fixing an ERROR when it fires, it shouldn't be an ERROR.

FATAL

The system cannot continue. Shutting down.

"Cannot connect to database after 10 attempts. Exiting."
"Required configuration missing: DATABASE_URL"

Rule: Use FATAL sparingly. Most errors should be handled gracefully. FATAL means the process is about to terminate.
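The filtering that levels make possible can be sketched with Python's standard logging module. Two things here are assumptions for the example: the LEVELS table, which maps the five names used in this article onto Python's constants (Python calls the top level CRITICAL), and the idea of reading the level name from a LOG_LEVEL environment variable, which is a common convention rather than anything built in.

```python
import io
import logging

# Map the level names used in this article onto Python's logging constants.
LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARN": logging.WARNING,
    "ERROR": logging.ERROR,
    "FATAL": logging.CRITICAL,  # Python's name for the highest level
}


def resolve_level(name, default=logging.INFO):
    """Translate a level name into a logging constant, falling back to INFO."""
    return LEVELS.get(str(name).upper(), default)


def make_logger(level_name, stream):
    """Build a logger that drops everything below the configured level."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger = logging.getLogger("levels-demo")
    logger.handlers = [handler]
    logger.propagate = False
    logger.setLevel(resolve_level(level_name))
    return logger


stream = io.StringIO()
# A service might pass os.environ.get("LOG_LEVEL", "INFO") here instead.
logger = make_logger("INFO", stream)

logger.debug("Loaded 3 config values")                       # below INFO: dropped
logger.info("Service started on port 3000")                  # kept
logger.warning("Retry succeeded after 2 failures (attempt 3)")  # kept

lines = stream.getvalue().splitlines()
```

Because the threshold is a single setting, DEBUG output can be switched on for one investigation and back off afterwards without touching the call sites.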

What to Log

Log things that help you answer "what happened?"

Request received (with method, path, user ID, but not full request body by default)
Request completed (with status code, duration)
Significant state changes (order placed, payment processed, user account locked)
Errors and exceptions (with full stack trace and context)
External calls (to databases, APIs, queues) with outcomes and timing
Background job starts and completions
Configuration loaded at startup
Retries and fallbacks
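One way to cover the "external calls with outcomes and timing" item without repeating boilerplate at every call site is a small context manager. This is a sketch under stated assumptions: logged_call, ListHandler, the target names, and the extra={"fields": ...} convention are all made up for the example.

```python
import logging
import time
from contextlib import contextmanager


class ListHandler(logging.Handler):
    """Collect records in memory so the example can inspect them."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


@contextmanager
def logged_call(logger, target, operation):
    """Log the outcome and duration of one external call."""
    start = time.perf_counter()

    def fields():
        elapsed_ms = round((time.perf_counter() - start) * 1000, 1)
        return {"fields": {"target": target, "operation": operation, "durationMs": elapsed_ms}}

    try:
        yield
    except Exception:
        # logger.exception logs at ERROR and attaches the current stack trace.
        logger.exception("External call failed", extra=fields())
        raise
    logger.info("External call completed", extra=fields())


handler = ListHandler()
logger = logging.getLogger("calls-demo")
logger.handlers = [handler]
logger.setLevel(logging.INFO)
logger.propagate = False

with logged_call(logger, "orders-db", "select_orders"):
    pass  # the real database query would run here

try:
    with logged_call(logger, "stripe", "create_charge"):
        raise TimeoutError("simulated timeout")
except TimeoutError:
    pass  # the caller still sees the original exception
```

The failure path re-raises after logging, so the wrapper records the event without changing how errors are handled.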

What NOT to Log

This is just as important. Logging the wrong things causes two problems: security breaches and noise.

Never Log Sensitive Data

Passwords — ever, for any reason
Full credit card numbers (log only the last 4 digits) and CVVs (never log them at all)
Social Security Numbers or national ID numbers
API keys, tokens, secrets
Full JWTs (log only the user ID extracted from the token)
Encryption keys
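A common safety net is to scrub events just before they are written. The sketch below masks values by key name. The key lists and the redact helper are illustrative assumptions, not a complete catalogue, and key-based masking does nothing for a secret that has been interpolated into the message text itself.

```python
import re

# Illustrative key names, compared in lowercase. Extend these for your own payloads.
SENSITIVE_KEYS = {"password", "cvv", "ssn", "apikey", "api_key", "token", "secret", "authorization"}
CARD_KEYS = {"cardnumber", "card_number"}


def redact(event):
    """Return a copy of `event` with sensitive values masked, recursing into nested dicts."""
    clean = {}
    for key, value in event.items():
        lowered = key.lower()
        if lowered in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif lowered in CARD_KEYS:
            # Keep only the last 4 digits of the card number.
            digits = re.sub(r"\D", "", str(value))
            clean[key] = "****" + digits[-4:]
        elif isinstance(value, dict):
            clean[key] = redact(value)
        else:
            clean[key] = value
    return clean


event = {
    "userId": "u_8821",
    "password": "hunter2",
    "cardNumber": "4242 4242 4242 4242",
    "headers": {"Authorization": "Bearer abc.def.ghi"},
}
safe = redact(event)
```

Treat this as a second line of defense. The first is not putting the value into the event at all.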

Be Careful with PII (Personally Identifiable Information)

Full names, email addresses, phone numbers, addresses — log sparingly and only when necessary for debugging
Health information and financial data, which may be subject to regulations such as HIPAA, PCI-DSS, and GDPR
When you do log PII, ensure your logging infrastructure has appropriate access controls and retention policies

Rule of thumb: if you wouldn't be comfortable with this log line appearing in a news story, don't log it.

Don't Log Noise

Don't log inside tight loops (generates millions of lines)
Don't leave DEBUG logging enabled in production by default
Don't log the same event redundantly (once per request is enough)
Don't log health check requests unless you're debugging health checks
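Some of these rules can be enforced mechanically instead of by code review. Below is a minimal sketch of a filter for Python's standard logging module that drops health-check records and caps repeats of the same message. NoiseFilter, the path attribute passed through extra, and the health-check paths are assumptions for this example.

```python
import io
import logging


class NoiseFilter(logging.Filter):
    """Drop health-check records and cap repeats of the same message template."""

    def __init__(self, max_repeats=3, health_paths=("/health", "/healthz")):
        super().__init__()
        self.max_repeats = max_repeats
        self.health_paths = set(health_paths)
        self.seen = {}

    def filter(self, record):
        # `path` is an example attribute the caller supplies through `extra`.
        if getattr(record, "path", None) in self.health_paths:
            return False
        # record.msg is the unformatted template, so "item 1" and "item 2"
        # from the same call site count as the same message.
        key = (record.name, record.levelno, record.msg)
        self.seen[key] = self.seen.get(key, 0) + 1
        return self.seen[key] <= self.max_repeats


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.addFilter(NoiseFilter(max_repeats=2))
logger = logging.getLogger("noise-demo")
logger.handlers = [handler]
logger.setLevel(logging.INFO)
logger.propagate = False

logger.info("GET /healthz", extra={"path": "/healthz"})  # dropped by the filter
for item in range(1000):
    logger.info("processing item %d", item)  # only the first 2 are written

lines = stream.getvalue().splitlines()
```

The counters in this sketch never reset. A production version would more likely count per time window, or sample, so that a recurring problem stays visible.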

Centralized Logging: Getting Logs to One Place

In a distributed system, logs from dozens of services across hundreds of instances need to be searchable in one place. You need a centralized logging system.

ELK Stack (Elasticsearch + Logstash/Filebeat + Kibana)

The classic open-source centralized logging stack:

Filebeat runs as a sidecar or agent on each instance, shipping logs to Logstash (or directly to Elasticsearch)
Logstash parses, transforms, and enriches log events
Elasticsearch indexes and stores the logs
Kibana provides search, filtering, and visualization

Very powerful, very customizable, and quite expensive to operate at scale (Elasticsearch is resource-hungry). Many teams use a managed service such as Elastic Cloud to avoid the operational burden.

Grafana Loki

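Loki takes a different approach: it indexes only log metadata (labels like service name, environment, host), not the full log content. Log lines are stored as compressed chunks, typically in object storage. Searching log content happens at query time via LogQL, which selects streams by label and then filters their lines, rather than looking terms up in a full-text index.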

This makes Loki dramatically cheaper than ELK for high-volume log storage. It integrates natively with Grafana, so if you're already using Prometheus + Grafana for metrics, adding Loki gives you a fully integrated open-source observability stack.

Splunk

The enterprise choice. Splunk can ingest almost any log format, offers a powerful query language (SPL), and has extensive compliance and security features (used heavily in financial services, healthcare, government). The price tag is substantial — it's typically the choice when compliance requirements drive the decision more than cost.

Cloud-Native Options

AWS CloudWatch Logs — tight integration with AWS services; the path of least resistance if you're all-in on AWS
Google Cloud Logging — same idea on GCP
Azure Monitor Logs — same on Azure
Datadog Log Management — part of Datadog's unified observability platform; excellent if you're already paying for Datadog

Correlation: Connecting Logs to Traces

The most powerful technique in centralized logging: trace ID correlation.

Add a trace ID to every log line for the duration of a request. When something goes wrong and you have a trace ID from your distributed tracing system, you can instantly pull up every log line from every service that was involved in that specific request.

{ "traceId": "4bf92f3577b34da6", "spanId": "a2fb4a1d1a96d312", "service": "order-service", "message": "Order created", ... }
{ "traceId": "4bf92f3577b34da6", "spanId": "b9c1f2e3d4a56789", "service": "payment-service", "message": "Charge initiated", ... }
{ "traceId": "4bf92f3577b34da6", "spanId": "b9c1f2e3d4a56789", "service": "payment-service", "message": "Charge failed: timeout", ... }

OpenTelemetry can handle this for you when you use it for distributed tracing: the trace context propagates through request headers, and the SDK's logging integration injects the current trace and span IDs into your logs.
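To make the mechanism concrete, here is a minimal standard-library sketch of the same idea. It is not OpenTelemetry's API. The trace ID is kept in a context variable for the duration of a request, and a logging filter stamps it onto every record. The names current_trace_id, TraceIdFilter, JsonLineFormatter, and handle_request are assumptions for this example, and in a real service the ID would come from the incoming request's trace header.

```python
import contextvars
import io
import json
import logging

# Holds the trace ID of the request the current thread or task is handling.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


class TraceIdFilter(logging.Filter):
    """Attach the active trace ID to every record passing through."""

    def filter(self, record):
        record.traceId = current_trace_id.get()
        return True


class JsonLineFormatter(logging.Formatter):
    """Emit a small JSON object per record, including the trace ID."""

    def format(self, record):
        return json.dumps({
            "traceId": record.traceId,
            "service": record.name,
            "message": record.getMessage(),
        })


def handle_request(logger, trace_id):
    """Simulate handling one request under the given trace ID."""
    token = current_trace_id.set(trace_id)
    try:
        logger.info("Charge initiated")
        logger.error("Charge failed: timeout")
    finally:
        current_trace_id.reset(token)  # don't leak the ID into the next request


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.addFilter(TraceIdFilter())
handler.setFormatter(JsonLineFormatter())
logger = logging.getLogger("payment-service")
logger.handlers = [handler]
logger.setLevel(logging.INFO)
logger.propagate = False

handle_request(logger, "4bf92f3577b34da6")
events = [json.loads(line) for line in stream.getvalue().splitlines()]
```

The call sites never mention the trace ID, so the correlation can't be forgotten on a new log line.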

Log Retention

Logs take space. You need a retention policy:

Production errors and warnings: keep for 90-365 days (depends on compliance requirements)
Production info logs: 30-90 days
Debug logs: 7-14 days (only emit when needed)
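Security and audit logs: often 1-7 years (HIPAA: 6 years for required documentation; PCI-DSS: at least 1 year of audit log history; GDPR: no fixed period, but personal data must be kept no longer than necessary)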

Use your logging platform's lifecycle policies to automatically archive old logs to cold storage (S3 Glacier, for example) after 30 days and delete after the retention period.
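In practice this is configured in the platform rather than written as code, but the decision logic is simple enough to sketch. The RETENTION_DAYS values below pick one point from each range above purely for illustration, and lifecycle_action is a hypothetical helper, not a platform API.

```python
from datetime import date, timedelta

# Retention per log category, in days. Illustrative values chosen from the ranges above.
RETENTION_DAYS = {
    "error": 365,
    "warn": 365,
    "info": 90,
    "debug": 14,
    "audit": 6 * 365,
}
ARCHIVE_AFTER_DAYS = 30  # move to cold storage after this age


def lifecycle_action(category, logged_on, today):
    """Return 'keep', 'archive', or 'delete' for a log written on `logged_on`."""
    age_days = (today - logged_on).days
    if age_days > RETENTION_DAYS[category]:
        return "delete"
    if age_days > ARCHIVE_AFTER_DAYS:
        return "archive"
    return "keep"


today = date(2026, 6, 8)
recent_info = lifecycle_action("info", today - timedelta(days=7), today)
older_info = lifecycle_action("info", today - timedelta(days=60), today)
stale_debug = lifecycle_action("debug", today - timedelta(days=20), today)
old_audit = lifecycle_action("audit", today - timedelta(days=400), today)
```

Debug logs expire before the archive threshold in this configuration, so they go straight from hot storage to deletion.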

Summary

Good logging is a professional discipline, not an afterthought. Use structured logging so logs are machine-queryable, not just human-readable. Use log levels consistently to separate signal from noise. Know what not to log — passwords, full PII, secrets — because a logging mistake can be a security incident. Route everything to a centralized logging system (ELK, Loki, or a managed service) so you can search across your entire infrastructure at once. And add trace IDs to correlate log lines with distributed traces. The logs you write today are the primary tool the engineer on call tomorrow — possibly you — will have when something goes wrong at 2 AM.
