Log Aggregation

Updated June 8, 2026

Magic Magnets Team


In a distributed system, a single user request may touch 15 services running across dozens of containers in multiple availability zones. Each service writes its own logs. Log aggregation is the practice of collecting those logs from every node and routing them to a single searchable store.

Without it, debugging a production failure means SSH-ing into individual servers, hoping the container that logged the error hasn't been rescheduled already, and manually correlating timestamps across files. Log aggregation replaces that with one search query.

The Core Pipeline

Every log aggregation setup has three stages:

Collection: a lightweight agent runs on each host or container, reads log output, and ships it forward. Filebeat, Fluent Bit, and the Datadog agent are common choices.
Processing: an intermediate layer parses, enriches, and normalizes events. Logstash and Fluentd are the main open-source options. For simple cases, a managed delivery service such as Amazon Kinesis Data Firehose (now Amazon Data Firehose) can stand in for this layer.
Storage and search: a purpose-built backend indexes the data and answers queries. Elasticsearch is the standard for full-text search. Loki trades full-text indexing for cheaper label-based queries.
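
To make the processing stage concrete, here is a minimal sketch of the parse, enrich, and normalize steps in plain Python. It is an illustration rather than how Logstash or Fluentd are implemented; the `process` function, the `LEVEL_ALIASES` mapping, and the `host` field name are assumptions made for this example.

```python
import json
from datetime import datetime, timezone

# Hypothetical mapping from assorted level spellings to one canonical set.
LEVEL_ALIASES = {"warn": "warning", "err": "error", "fatal": "critical"}


def process(raw_line: str, host: str) -> dict:
    """Parse one raw log line, enrich it, and normalize its fields."""
    try:
        event = json.loads(raw_line)  # parse: structured input
        if not isinstance(event, dict):
            raise ValueError("not a JSON object")
    except ValueError:
        # Fallback for freeform text: wrap it so downstream code sees one shape.
        event = {"message": raw_line.strip(), "level": "info"}

    event["host"] = host  # enrich: attach where the line came from

    # normalize: lower-case the level and collapse known aliases
    level = str(event.get("level", "info")).lower()
    event["level"] = LEVEL_ALIASES.get(level, level)

    # normalize: make sure every event carries a timestamp
    event.setdefault("timestamp", datetime.now(timezone.utc).isoformat())
    return event
```

Calling `process('{"level": "WARN", "event": "retry"}', host="web-1")` yields an event whose level is `warning` and whose `host` is `web-1`, which is the shape the storage stage can index consistently.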

Figure: Log pipeline. Services emit structured JSON, a shipper collects and buffers to Kafka, Elasticsearch indexes for search, and Kibana provides the query UI.

Structured Logging Is Required

Aggregation only works if the data is machine-parseable. Freeform strings like "Payment failed for user 8821 after 3 attempts" require regex extraction to filter or aggregate by field. JSON emitted natively by your logger does not.

{
  "timestamp": "2026-06-03T10:00:00Z",
  "level": "error",
  "event": "checkout_failed",
  "userId": "u_123",
  "itemId": "prod_456",
  "errorCode": "STRIPE_TIMEOUT",
  "durationMs": 5012,
  "traceId": "4bf92f3577b34da6"
}

With structured logs, Elasticsearch understands each field as a typed value. Queries like level:error AND itemId:prod_456 return exact results with no post-processing.
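
One way to emit JSON natively is a custom formatter on Python's standard `logging` module. The sketch below is an assumption about how a service might do it, not a prescribed setup: the `JsonFormatter` class, the `checkout` logger name, and the convention of passing structured fields under a `fields` key in `extra` are all hypothetical.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname.lower(),
            "event": record.getMessage(),
        }
        # "fields" is a hypothetical convention: callers pass it via extra=.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


# Write to an in-memory stream here; a real service would write to stdout.
stream = io.StringIO()
log = build_logger(stream)
log.error(
    "checkout_failed",
    extra={"fields": {"userId": "u_123", "itemId": "prod_456",
                      "errorCode": "STRIPE_TIMEOUT", "durationMs": 5012}},
)
event = json.loads(stream.getvalue())
```

Because every field arrives as a key with a typed value, `event["durationMs"]` is already a number and `event["itemId"]` can be matched exactly, with no regex in between.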

Storage Backends

ELK Stack (Elasticsearch + Logstash/Filebeat + Kibana) is the classic open-source choice. By default, Elasticsearch builds an inverted index over the tokens in every text field, which makes ad-hoc full-text searches fast but storage costs high. Kibana provides the query UI and visualization layer. Many teams use a managed offering such as Elastic Cloud rather than operating their own clusters.

Grafana Loki takes the opposite approach: it indexes only labels (service name, environment, host) and stores log content as compressed chunks. Full-text search runs at query time via LogQL, which is slower but dramatically cheaper for high-volume logs. Loki integrates natively with Grafana, so if you're already running Prometheus and Grafana for metrics, adding Loki gives you a complete open-source observability stack.
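
The difference between the two indexing strategies can be shown with a toy model. This is a deliberately simplified sketch, not how Elasticsearch or Loki store data internally; the sample log lines and the function names are invented for illustration.

```python
from collections import defaultdict

# Each entry is (labels, line content). Sample data is made up.
logs = [
    ({"service": "checkout", "env": "prod"}, "payment failed STRIPE_TIMEOUT"),
    ({"service": "checkout", "env": "prod"}, "payment ok"),
    ({"service": "search", "env": "prod"}, "query timeout"),
]

# Full-text style: index every token of every line (bigger index, direct lookup).
token_index = defaultdict(set)
for i, (_, line) in enumerate(logs):
    for token in line.lower().split():
        token_index[token].add(i)

# Label style: index only label pairs; line content stays unindexed.
label_index = defaultdict(set)
for i, (labels, _) in enumerate(logs):
    for pair in labels.items():
        label_index[pair].add(i)


def full_text_search(token: str) -> list:
    """Answer straight from the token index."""
    return sorted(token_index.get(token.lower(), set()))


def label_search(labels: dict, contains: str) -> list:
    """Narrow by labels via the index, then scan matching lines at query time."""
    candidates = set(range(len(logs)))
    for pair in labels.items():
        candidates &= label_index.get(pair, set())
    return sorted(i for i in candidates if contains in logs[i][1])
```

Here `label_search({"service": "checkout"}, "STRIPE_TIMEOUT")` plays the role of a LogQL query such as `{service="checkout"} |= "STRIPE_TIMEOUT"`: the label selector uses the small index, and the substring filter is a scan over whatever lines remain.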

Splunk is the enterprise option. Its query language (SPL) is powerful, its compliance features are extensive, and it ingests almost any log format. The licensing cost is significant, which is why it tends to appear in financial services, healthcare, and government rather than startups.

Cloud-native options: AWS CloudWatch Logs, Google Cloud Logging, and Azure Monitor Logs are each the path of least resistance within their respective clouds. They are fully managed, with no infrastructure to run, but the trade-off is vendor lock-in.

Buffering and Reliability

At scale, log pipelines need a buffer between collectors and storage. If Elasticsearch goes down during a traffic spike, you don't want to lose logs from that window.

A message queue like Kafka acts as a durable buffer: collectors write to Kafka topics, and the storage backend consumes at its own pace. Uber, for example, has described routing service logs through Kafka before they reach its indexing backend. If the indexing layer falls behind, logs queue up safely rather than being dropped.
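
The decoupling can be sketched with an in-memory stand-in for a topic. This is not the Kafka client API; `BufferTopic` and `Indexer` are hypothetical classes that only model the idea of an append-only log read by offset, with the consumer tracking how far it has got.

```python
class BufferTopic:
    """In-memory stand-in for a durable topic: an append-only list read by offset."""

    def __init__(self):
        self._events = []

    def append(self, event: dict) -> None:
        self._events.append(event)

    def read_from(self, offset: int, max_events: int) -> list:
        return self._events[offset:offset + max_events]


class Indexer:
    """Hypothetical storage consumer that tracks its own committed offset."""

    def __init__(self, topic: BufferTopic):
        self.topic = topic
        self.offset = 0
        self.indexed = []
        self.available = True

    def poll(self, max_events: int = 100) -> int:
        if not self.available:
            return 0  # backend down: nothing consumed, nothing lost
        batch = self.topic.read_from(self.offset, max_events)
        self.indexed.extend(batch)
        self.offset += len(batch)  # advance only after the batch is indexed
        return len(batch)


topic = BufferTopic()
indexer = Indexer(topic)

# Simulate an outage during a traffic spike: collectors keep writing.
indexer.available = False
for i in range(5):
    topic.append({"event": "spike", "seq": i})
consumed_during_outage = indexer.poll()

# The backend recovers and resumes from its last offset.
indexer.available = True
caught_up = indexer.poll()
```

The collectors never talk to the indexer directly, so its downtime only delays indexing. The offset is what lets the consumer pick up exactly where it stopped.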

Cost and Retention

Storing searchable text at scale is expensive. Elasticsearch clusters serving heavy query loads can exceed the cost of the application database they're helping to debug.

Two strategies keep costs manageable. First, log level filtering at the shipper: drop DEBUG-level logs before they leave the host in production. DEBUG events that would be valuable for a specific investigation can be enabled temporarily per service. Second, tiered retention: keep the last 30-90 days in hot, searchable storage and archive older logs to object storage (S3 Glacier, GCS Nearline). Set retention policies in your logging platform to handle this automatically.
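
Both strategies reduce to small decisions per event. The sketch below assumes events are dictionaries with `level` and `service` keys and uses a 30-day hot window; the function names and defaults are illustrative, and in practice the shipper's filter configuration and the platform's retention policy would do this work.

```python
from datetime import datetime, timedelta, timezone

LEVEL_ORDER = {"debug": 10, "info": 20, "warning": 30, "error": 40, "critical": 50}


def should_ship(event: dict, min_level: str = "info",
                debug_services: frozenset = frozenset()) -> bool:
    """Drop events below min_level unless the service has debug temporarily enabled."""
    if event.get("service") in debug_services:
        return True
    level = LEVEL_ORDER.get(event.get("level", "info"), LEVEL_ORDER["info"])
    return level >= LEVEL_ORDER[min_level]


def storage_tier(event_time: datetime, now: datetime, hot_days: int = 30) -> str:
    """Return 'hot' for recent events and 'archive' for older ones."""
    return "hot" if now - event_time <= timedelta(days=hot_days) else "archive"
```

For example, `should_ship({"level": "debug", "service": "checkout"})` is false by default and becomes true when called with `debug_services={"checkout"}`, which mirrors enabling DEBUG for one service during an investigation.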

Never Log PII

Once a log line reaches a centralized aggregator, it's queryable by everyone with access, retained for months, and potentially exported during audits. Fields containing PII or secrets (passwords, full credit card numbers, social security numbers, unmasked email addresses, API keys) become a data breach surface.

Audit your log fields before you set up aggregation. Mask sensitive values in the application, before the event is written, rather than in the shipper or processing pipeline, because PII that enters a shipper has already left the application's security perimeter.
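
A minimal sketch of application-side masking follows, assuming events are flat dictionaries. The field names in `SENSITIVE_KEYS` and the email pattern are assumptions for illustration; a real service would derive them from its own field audit.

```python
import re

# Hypothetical field names; replace with the results of your own audit.
SENSITIVE_KEYS = {"password", "cardNumber", "ssn", "apiKey"}

# Keeps the first character of the local part and the full domain.
EMAIL_RE = re.compile(
    r"([A-Za-z0-9._%+-])[A-Za-z0-9._%+-]*@([A-Za-z0-9.-]+\.[A-Za-z]{2,})"
)


def mask_event(event: dict) -> dict:
    """Return a copy of the event that is safe to hand to the logger."""
    masked = {}
    for key, value in event.items():
        if key in SENSITIVE_KEYS:
            masked[key] = "[REDACTED]"  # never emit the raw value
        elif isinstance(value, str):
            masked[key] = EMAIL_RE.sub(r"\1***@\2", value)  # mask emails in text
        else:
            masked[key] = value
    return masked
```

Calling `mask_event` on every event before it is logged means the raw values never reach the log stream, so nothing downstream has to be trusted to remove them.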

Summary

Log aggregation centralizes distributed log output into a single searchable store. The pipeline is: structured logs from application code, collected by a lightweight shipper, optionally buffered through a queue for reliability, indexed in Elasticsearch or Loki, and queried through Kibana or Grafana. The operational payoff is the ability to search across your entire infrastructure with a single query during an incident. The main cost is storage, so filter aggressively at the source and tier retention in the backend. And never let PII reach the aggregator.
