Dead Letter Queues

Updated June 6, 2026

Magic Magnets Team


Let's say you're running an e-commerce store. When a customer places an order, an event goes into a message queue. A background worker picks up the event, processes the payment, and sends a confirmation email.

One day, a customer enters a completely invalid email address format that breaks your email provider's API.

Your background worker grabs the message, tries to send the email, gets a 500 error, and crashes. Because you're using at-least-once delivery, the message broker puts it back in the queue. Another worker picks it up. It crashes again. The broker puts it back. Over and over.

This single broken message is now acting like a poison pill, endlessly crashing your workers and blocking legitimate orders from being processed.

How do you stop this? You use a Dead Letter Queue (DLQ).

What is a Dead Letter Queue?

A Dead Letter Queue is a secondary queue specifically designed to hold messages that cannot be processed successfully after a configured number of attempts.

Instead of infinitely retrying a failing message and blocking your system, the message broker moves the problematic message out of the main queue and into the DLQ.

Think of a real-world post office. A mail carrier is trying to deliver a letter, but the address is completely smudged. The carrier doesn't stand on the sidewalk forever. After a few failed attempts, they toss it into an "Undeliverable Mail" bin. This gets the bad letter out of the daily route. Later, a clerk can slowly go through the bin to figure out the address or return it to the sender.

How It Works

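When you configure a queue in AWS SQS, RabbitMQ, or Google Cloud Pub/Sub, you attach a dead-letter target and set a limit on delivery attempts. SQS calls this limit maxReceiveCount, Pub/Sub calls it the maximum delivery attempts, and RabbitMQ quorum queues call it the delivery limit. The flow then looks like this: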

1. Worker pulls the message, fails to process it, and crashes (no ACK sent)
2. Broker retries after a backoff period, since it may have been a temporary network issue
3. Worker fails again, and again
4. The broker sees the message has exceeded the max retry limit (e.g., 3)
5. The broker moves the message from the main queue to the DLQ automatically
6. The main queue continues processing all other messages
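
The steps above can be sketched as a small in-memory simulation. This is not how any particular broker is implemented; the `process` handler, the envelope fields (`receive_count`, `last_error`), and the limit of 3 are assumptions chosen for illustration.

```python
from collections import deque

MAX_RECEIVE_COUNT = 3  # assumed limit, matching the "e.g., 3" in the steps


def process(message):
    """Hypothetical handler: fails on a malformed email address."""
    if "@" not in message["email"]:
        raise ValueError(f"invalid email: {message['email']!r}")
    return f"confirmation sent to {message['email']}"


def run_broker(messages, handler, max_receive_count=MAX_RECEIVE_COUNT):
    """Deliver messages until the main queue is empty.

    Returns (processed_results, dead_letters).
    """
    main_queue = deque({"body": m, "receive_count": 0} for m in messages)
    dead_letters = []
    results = []

    while main_queue:
        envelope = main_queue.popleft()
        envelope["receive_count"] += 1
        try:
            results.append(handler(envelope["body"]))  # success acts as the ACK
        except Exception as exc:
            if envelope["receive_count"] >= max_receive_count:
                # Limit reached: isolate the message instead of retrying again.
                envelope["last_error"] = str(exc)
                dead_letters.append(envelope)
            else:
                # No ACK: the message goes back to the queue for another attempt.
                main_queue.append(envelope)

    return results, dead_letters


orders = [
    {"order_id": 1, "email": "ada@example.com"},
    {"order_id": 2, "email": "not-an-email"},
    {"order_id": 3, "email": "grace@example.com"},
]
results, dead_letters = run_broker(orders, process)
print(results)
print([(d["body"]["order_id"], d["receive_count"]) for d in dead_letters])
```

Orders 1 and 3 are confirmed, while order 2 is delivered three times and then lands in `dead_letters` with its last error attached. Without the `max_receive_count` check, the `while` loop would never terminate.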


A poison pill message is one that consistently causes the consumer to fail. With at-least-once delivery, every crash sends the message back to the queue for another attempt. Without a DLQ, this becomes an infinite loop: the same bad message crashes workers indefinitely, consuming resources and blocking healthy messages behind it. A DLQ breaks this cycle. You configure a maximum receive count — typically 3-5 retries. Once a message exceeds the limit, the broker automatically moves it out of the main queue into the DLQ. The main queue continues processing all other messages as if nothing happened. The failed message is preserved in the DLQ for later inspection.


Poison pill message — retry loop until max retries, then routed to DLQ
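
As one concrete illustration of configuring the limit, here is a minimal sketch that builds the `RedrivePolicy` attribute SQS uses to link a queue to its DLQ. The helper name `build_redrive_policy` and the ARN are placeholders invented for this example; only the `deadLetterTargetArn` and `maxReceiveCount` keys come from SQS itself.

```python
import json


def build_redrive_policy(dlq_arn, max_receive_count=5):
    """Return SQS queue attributes that route repeated failures to a DLQ."""
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    return {
        # SQS expects the policy as a JSON-encoded string, not a nested object.
        "RedrivePolicy": json.dumps(
            {
                "deadLetterTargetArn": dlq_arn,
                "maxReceiveCount": max_receive_count,
            }
        )
    }


# Placeholder ARN for illustration only.
attributes = build_redrive_policy(
    "arn:aws:sqs:us-east-1:123456789012:orders-dlq", max_receive_count=3
)
print(attributes["RedrivePolicy"])
```

With boto3 installed and credentials configured, a dictionary like this would typically be passed as the `Attributes` argument of the SQS client's `set_queue_attributes` call on the main queue. Other brokers express the same idea with different settings.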


What to Do With the DLQ

Having a DLQ is only half the battle. If you let messages rot in the DLQ, you're quietly losing data. You need a strategy:

Set up alerts. You should have an alarm that fires whenever the DLQ size goes above zero. A message in a DLQ means a bug happened or a downstream service is broken.

Inspect and fix. Engineers look at the messages in the DLQ to diagnose the failure. Why did they fail? Maybe there's a null pointer exception triggered by a specific input format. You patch the bug and deploy a fix.

Re-drive (replay). Once the fix is deployed, you take the messages from the DLQ and put them back into the main queue. The system processes them as if nothing went wrong.
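
The re-drive step might look like the sketch below, using in-memory deques in place of real queues. The `redrive` function and the envelope fields are hypothetical; managed brokers usually offer their own replay tooling, which you would use instead.

```python
from collections import deque


def redrive(dead_letters, main_queue, limit=None):
    """Move up to `limit` messages from the DLQ back to the main queue.

    Receive counts are reset so each message gets a full set of attempts.
    Returns the number of messages moved.
    """
    moved = 0
    while dead_letters and (limit is None or moved < limit):
        envelope = dead_letters.popleft()
        envelope["receive_count"] = 0
        envelope.pop("last_error", None)
        main_queue.append(envelope)
        moved += 1
    return moved


dlq = deque(
    [
        {"body": {"order_id": 2}, "receive_count": 3, "last_error": "invalid email"},
        {"body": {"order_id": 7}, "receive_count": 3, "last_error": "invalid email"},
    ]
)
main = deque()

# Replay one message first to confirm the fix, then drain the rest.
print(redrive(dlq, main, limit=1))
print(redrive(dlq, main))
print(len(dlq), len(main))
```

The `limit` parameter is a design choice: replaying a single message before draining the whole DLQ lets you confirm the fix works without sending every failed message straight back into the DLQ.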


A DLQ is only useful if you act on it. The most important operational practice is alerting: any message landing in the DLQ should trigger a notification to your on-call rotation. A DLQ with messages means something broke — either the message payload is invalid, a downstream service threw an unexpected error, or a bug was deployed. Engineers inspect the raw payloads in the DLQ to diagnose the root cause. Once the fix is deployed, the messages are "re-driven" back into the main queue, where they process successfully. No data is lost, and the failure window is contained to the time between detection and fix.


DLQ recovery — alert fires, engineers inspect and fix, messages re-driven to main queue
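
The alerting rule itself is simple enough to sketch in a few lines. In practice this is usually a metric alarm configured in your monitoring system rather than application code; the `check_dlq` function and its `notify` callback are assumptions made for illustration.

```python
def check_dlq(queue_name, depth, notify):
    """Call `notify` when the DLQ holds any messages; return True if it fired."""
    if depth < 0:
        raise ValueError("depth cannot be negative")
    if depth > 0:
        notify(f"ALERT: {queue_name} holds {depth} dead-lettered message(s)")
        return True
    return False


sent = []
check_dlq("orders-dlq", 0, sent.append)  # empty DLQ: stays quiet
check_dlq("orders-dlq", 2, sent.append)  # any message at all: fires
print(sent)
```

The threshold is deliberately zero rather than some "acceptable" backlog, since even a single dead-lettered message points to a failure that needs a look.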


Real-World Examples

Stripe Webhooks: When Stripe charges a customer's subscription, it sends a webhook to your server. If your server is down and returns 500 errors, Stripe retries with exponential backoff (for up to three days in live mode) and then gives up. In an event-driven architecture, you'd route that webhook payload through a queue. If processing repeatedly fails, sending the message to a DLQ ensures you don't lose the record that a user paid you, so you can credit their account manually later.

Video Encoding: A user uploads a corrupted video file. The video encoding worker will repeatedly crash trying to read the file. Moving this job to a DLQ prevents the corrupted file from clogging the encoding pipeline for everyone else.

DLQ Configuration Best Practices

Set max retries thoughtfully. Too low (1-2) and you'll DLQ messages from transient network errors. Too high (20+) and a poison pill blocks your queue for a long time before being isolated.
Alert on DLQ depth > 0. Any message in a DLQ is a signal that something broke.
Store raw payloads. Log the full message body when it arrives in the DLQ — you need this for debugging.
Test your re-drive process. Knowing how to replay messages before you need to do it under pressure is valuable.
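
To get a feel for the retry-count trade-off, the sketch below estimates how long a message that always fails stays in circulation before it is dead-lettered. The backoff schedule (1 second base, doubling, capped at 60 seconds) is an assumption, not a broker default, and processing time is ignored.

```python
def seconds_until_dead_lettered(max_attempts, base_delay=1.0, factor=2.0, cap=60.0):
    """Total wait between attempts before an always-failing message is isolated.

    The delay after attempt n (1-based) is min(cap, base_delay * factor**(n - 1)).
    There are max_attempts - 1 waits, since the final failure triggers the move.
    """
    return sum(
        min(cap, base_delay * factor ** (n - 1)) for n in range(1, max_attempts)
    )


for attempts in (2, 5, 20):
    print(attempts, seconds_until_dead_lettered(attempts))
```

Under these assumptions, 5 attempts isolate a poison pill after about 15 seconds of waiting, while 20 attempts keep it cycling for roughly 14 minutes (843 seconds). Your broker's actual delays will differ, so treat the numbers as a way to reason about the setting rather than as a prediction.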


Summary

A Dead Letter Queue is a safety net for message-driven architectures. It catches "poison pill" messages that repeatedly fail to process, removing them from the main queue so they don't block healthy traffic. By monitoring and inspecting your DLQs, you can recover from unexpected bugs, bad data, or downstream API failures without losing critical information. The DLQ is not where messages go to die — it's where they wait for someone to fix the problem and replay them.
