Dead Letter Queues
Updated June 6, 2026
Let's say you're running an e-commerce store. When a customer places an order, an event goes into a message queue. A background worker picks up the event, processes the payment, and sends a confirmation email.
One day, a customer enters a malformed email address that your email provider's API cannot handle.
Your background worker grabs the message, tries to send the email, gets a 500 error, and crashes. Because you're using at-least-once delivery, the message broker puts it back in the queue. Another worker picks it up. It crashes again. The broker puts it back. Over and over.
This single broken message is now acting like a poison pill, endlessly crashing your workers and blocking legitimate orders from being processed.
How do you stop this? You use a Dead Letter Queue (DLQ).
What is a Dead Letter Queue?
A Dead Letter Queue is a secondary queue specifically designed to hold messages that cannot be processed successfully after a configured number of attempts.
Instead of infinitely retrying a failing message and blocking your system, the message broker moves the problematic message out of the main queue and into the DLQ.
Think of a real-world post office. A mail carrier is trying to deliver a letter, but the address is completely smudged. The carrier doesn't stand on the sidewalk forever. After a few failed attempts, they toss it into an "Undeliverable Mail" bin. This gets the bad letter out of the daily route. Later, a clerk can slowly go through the bin to figure out the address or return it to the sender.
How It Works
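When you configure a queue in AWS SQS, RabbitMQ, or Google Cloud Pub/Sub, you can attach a DLQ and specify a maximum number of delivery attempts (SQS calls this the maximum receive count). Once that is set, a failing message goes through these steps: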
- Worker pulls the message, fails to process it, and crashes (no ACK sent)
- Broker redelivers the message after a delay (a visibility timeout or backoff period), in case the failure was a temporary network issue
- Worker fails again, and again
- The broker sees the message has exceeded the max retry limit (e.g., 3)
- The broker moves the message from the main queue to the DLQ automatically
- The main queue continues processing all other messages
Poison pill message — retry loop until max retries, then routed to DLQ
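The steps above can be sketched as a tiny in-memory simulation. This is an illustration only, not how any particular broker is implemented: `Message`, `run_broker`, and `send_confirmation` are hypothetical names, the backoff delay is left out, and a raised exception stands in for a worker that crashes without sending an ACK.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Message:
    body: str
    receive_count: int = 0  # how many times the broker has delivered it


def run_broker(messages, handler, max_receive_count=3):
    """Deliver each message to handler; dead-letter a message once it has
    been delivered max_receive_count times without succeeding."""
    main_queue = deque(Message(body) for body in messages)
    dead_letter_queue = []
    processed = []

    while main_queue:
        msg = main_queue.popleft()
        msg.receive_count += 1
        try:
            handler(msg.body)
        except Exception:
            # No ACK: the message is either redelivered or dead-lettered.
            if msg.receive_count >= max_receive_count:
                dead_letter_queue.append(msg)
            else:
                main_queue.append(msg)  # back of the queue, retried later
            continue
        processed.append(msg.body)  # ACK: the message leaves the queue

    return processed, dead_letter_queue


def send_confirmation(email):
    # Stand-in for the email provider call; rejects malformed addresses.
    if "@" not in email:
        raise ValueError(f"invalid email address: {email!r}")


processed, dead = run_broker(
    ["ann@example.com", "not-an-email", "bob@example.com"],
    send_confirmation,
)
print(processed)                 # ['ann@example.com', 'bob@example.com']
print([m.body for m in dead])    # ['not-an-email']
print(dead[0].receive_count)     # 3
```

The bad message is delivered three times and then set aside, while the two valid orders go through normally.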
What to Do With the DLQ
Having a DLQ is only half the battle. If you let messages rot in the DLQ, you're quietly losing data. You need a strategy:
Set up alerts. You should have an alarm that fires whenever the DLQ size goes above zero. A message in a DLQ means a bug happened or a downstream service is broken.
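As a rough sketch of the "above zero" rule, the check itself can be very small. The function name `dlq_alerts` and the queue names are made up for illustration, and the depths are hard-coded here; in a real system they would come from your broker's metrics (for SQS, the `ApproximateNumberOfMessagesVisible` metric is the usual source) and the alert would go to your paging tool rather than to `print`.

```python
def dlq_alerts(dlq_depths, threshold=0):
    """Return one alert line for every DLQ whose depth exceeds threshold."""
    return [
        f"ALERT: {name} holds {depth} dead-lettered message(s)"
        for name, depth in sorted(dlq_depths.items())
        if depth > threshold
    ]


# Hypothetical depths, as if read from broker metrics.
depths = {"orders-dlq": 2, "emails-dlq": 0}
for line in dlq_alerts(depths):
    print(line)  # ALERT: orders-dlq holds 2 dead-lettered message(s)
```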
Inspect and fix. Engineers look at the messages in the DLQ to diagnose the failure. Why did they fail? Maybe there's a null pointer exception triggered by a specific input format. You patch the bug and deploy a fix.
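Re-drive (replay). Once the fix is deployed, you take the messages from the DLQ and put them back into the main queue. The system processes them as if nothing went wrong. Because delivery is at-least-once, make sure the handler is idempotent, so a replayed message can't, for example, charge a customer twice.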
DLQ recovery — alert fires, engineers inspect and fix, messages re-driven to main queue
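Here is a minimal sketch of what a re-drive does, using plain in-memory queues. The `redrive` function and the message shape are assumptions for illustration; managed brokers usually provide their own redrive tooling, which you would use instead of hand-rolling this.

```python
from collections import deque


def redrive(dead_letter_queue, main_queue, max_messages=None):
    """Move messages from the DLQ back to the main queue, oldest first.
    Returns the number of messages moved."""
    moved = 0
    while dead_letter_queue and (max_messages is None or moved < max_messages):
        message = dead_letter_queue.popleft()
        message["receive_count"] = 0  # fresh retry budget after the fix
        main_queue.append(message)
        moved += 1
    return moved


dlq_messages = deque([
    {"body": "order-17", "receive_count": 3},
    {"body": "order-42", "receive_count": 3},
])
main_messages = deque()

# Replay a single message first to confirm the fix works, then the rest.
first_batch = redrive(dlq_messages, main_messages, max_messages=1)
second_batch = redrive(dlq_messages, main_messages)
print(first_batch, second_batch)  # 1 1
```

Replaying a small batch before the rest is a design choice: if the fix is incomplete, only one message bounces back into the DLQ instead of all of them.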
Real-World Examples
Stripe Webhooks: When Stripe charges a customer's subscription, it sends a webhook to your server. If your server is down and returns 500 errors, Stripe retries with exponential backoff for a limited period and then gives up. In an event-driven architecture, you'd accept the webhook quickly and route its payload through a queue. If processing repeatedly fails, sending the message to a DLQ ensures you keep a record that the user paid you, so you can credit the account manually later.
Video Encoding: A user uploads a corrupted video file. The video encoding worker will repeatedly crash trying to read the file. Moving this job to a DLQ prevents the corrupted file from clogging the encoding pipeline for everyone else.
DLQ Configuration Best Practices
- Set max retries thoughtfully. Too low (1-2) and you'll DLQ messages from transient network errors. Too high (20+) and a poison pill blocks your queue for a long time before being isolated.
- Alert on DLQ depth > 0. Any message in a DLQ is a signal that something broke.
- Store raw payloads. Log the full message body when it arrives in the DLQ — you need this for debugging.
- Test your re-drive process. Practice replaying messages ahead of time, so you aren't learning how to do it under pressure during an incident.
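To make the max-retries setting concrete, here is a sketch of how it is expressed for AWS SQS, where the DLQ and retry limit are attached to the source queue through a `RedrivePolicy` attribute holding a JSON string. The helper name `redrive_policy_attributes`, the default of 5, and the ARN are placeholders; the returned dict is the kind of value you would pass as queue attributes when creating or updating the queue with an AWS SDK.

```python
import json


def redrive_policy_attributes(dlq_arn, max_receive_count=5):
    """Build the queue attributes that point a source queue at its DLQ."""
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    policy = {
        "deadLetterTargetArn": dlq_arn,        # where failed messages go
        "maxReceiveCount": max_receive_count,  # deliveries before dead-lettering
    }
    # SQS expects the policy as a JSON-encoded string, not a nested dict.
    return {"RedrivePolicy": json.dumps(policy)}


# Placeholder ARN for illustration only.
attributes = redrive_policy_attributes(
    "arn:aws:sqs:us-east-1:123456789012:orders-dlq",
    max_receive_count=5,
)
print(attributes["RedrivePolicy"])
```

Other brokers express the same idea with different settings, so treat this as one example rather than a general recipe.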
Summary
A Dead Letter Queue is a safety net for message-driven architectures. It catches "poison pill" messages that repeatedly fail to process, removing them from the main queue so they don't block healthy traffic. By monitoring and inspecting your DLQs, you can recover from unexpected bugs, bad data, or downstream API failures without losing critical information. The DLQ is not where messages go to die — it's where they wait for someone to fix the problem and replay them.