In message-based architectures the most expensive sentence is this: “This message arrives only once.” In production, messages repeat, get delayed, arrive out of order, or the same event is regenerated through different channels. So the real question isn’t “will it repeat?” but “what do we do when it repeats?”
In this post I’ll cover replay and idempotency (resistance to repeated processing) not just as a design pattern, but from an operational runbook and observability perspective.
Why you need to make peace with the “at-least-once” reality
Most brokers and distributed systems deliver “at-least-once” in practice. The reason is simple: network errors, consumer restarts, timeouts, and ack uncertainty. When an ack is lost, the broker cannot tell whether the message was processed, so it delivers it again. In this environment, two requirements emerge for reliability:
- Idempotent consumer: receiving the same message again must not corrupt the result
- Replay strategy: you must be able to reprocess DLQ messages or stored events in a controlled way
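To make the first requirement concrete, here is a minimal sketch of an idempotent consumer. The `Message` shape, the `event_id` field, and the in-memory `processed` set are all assumptions for illustration; in a real system the set would be a durable dedup store, ideally updated in the same transaction as the side effect.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Message:
    event_id: str  # hypothetical unique id supplied by the producer
    amount: int


@dataclass
class IdempotentConsumer:
    # Stand-in for a durable dedup store (assumption: in-memory only).
    processed: Set[str] = field(default_factory=set)
    balance: int = 0

    def handle(self, msg: Message) -> bool:
        """Apply the message once; return False when it is a duplicate."""
        if msg.event_id in self.processed:
            return False  # seen before: skip the side effect
        self.balance += msg.amount
        self.processed.add(msg.event_id)
        return True
```

Delivering the same `Message` twice changes `balance` only once, which is exactly the property at-least-once delivery demands from the consumer.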
The idempotency key: the central decision
The most fundamental question for idempotency is: “What makes this operation unique?”
In practice, there are three common key types:
- Business key: order number, invoice number, transaction id
- Event id: a unique event identifier provided by the producer
- Natural key + version: (entity id, version/sequence)
The most common mistake I see in the field: “no idempotency key, but there’s a correlation id in the log.” A correlation id is for tracing; it usually spans several messages in the same flow, so on its own it may not be enough for idempotency.
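As a rough illustration of the “natural key + version” option, the sketch below derives a stable key from a consumer name, an entity id, and a version. The function name `idempotency_key` and the choice of SHA-256 over a JSON-encoded list are my assumptions, not a standard; the point is only that the key is built from what makes the operation unique and that the correlation id plays no part in it.

```python
import hashlib
import json


def idempotency_key(consumer: str, entity_id: str, version: int) -> str:
    """Derive a stable key from (consumer, entity id, version).

    Hypothetical helper: the same logical operation always maps to the
    same key, while a new version of the same entity maps to a new one.
    """
    # JSON-encode the parts as a list so field boundaries stay unambiguous
    # (plain string concatenation could make "a|b" + "c" collide with "a" + "b|c").
    raw = json.dumps([consumer, entity_id, version], separators=(",", ":"))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```

Including the consumer name lets two different consumers process the same event independently without sharing dedup state.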
Outbox: tying the source and the message into the same transaction
The outbox pattern reduces the risk of two inconsistencies: “the data was written but no event was emitted” and “the event was emitted but no data was written.” A simple summary:
- Inside a transaction, the application writes both the domain change and the outbox record
- A separate publisher safely emits from the outbox to the broker
- The “published” marker is updated idempotently
Operational benefit: when replay is needed, “source data + event history” are managed together.
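The three steps above can be sketched with SQLite from the Python standard library. The table names (`orders`, `outbox`), the payload shape, and the `send` callback standing in for a broker client are all assumptions for illustration. Note that a crash between `send` and the update would cause a re-publish, so this publisher is itself at-least-once and still relies on idempotent consumers.

```python
import json
import sqlite3


def init(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
    conn.execute(
        "CREATE TABLE outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "payload TEXT NOT NULL, "
        "published INTEGER NOT NULL DEFAULT 0)"
    )


def place_order(conn: sqlite3.Connection, order_id: str) -> None:
    # "with conn" commits both inserts together, or rolls both back on error.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, status) VALUES (?, 'placed')", (order_id,)
        )
        payload = json.dumps({"type": "order_placed", "order_id": order_id})
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (payload,))


def publish_pending(conn: sqlite3.Connection, send) -> int:
    """Emit unpublished outbox rows via send(payload); return how many were sent."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        send(payload)  # hypothetical broker call
        with conn:
            # Marking is idempotent: running it again changes nothing.
            conn.execute(
                "UPDATE outbox SET published = 1 WHERE id = ? AND published = 0",
                (row_id,),
            )
    return len(rows)
```

Calling `publish_pending` a second time with nothing new in the outbox sends nothing, which is what makes the publisher safe to run on a schedule.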
DLQ: not a dump, but controlled quarantine
A Dead Letter Queue exists to “rescue” the message, not to “ignore” it. In DLQ design, the runbook should answer these questions:
- Are the reasons for falling into the DLQ classified? (schema, validation, downstream error)
- For which reasons is there automatic retry, and for which is manual intervention required?
- How will replay be performed and how will side effects be controlled?
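One way to make the first two questions answerable in code is a small classification table. In the sketch below, the mapping from Python exception types to reasons and the retry policy are assumptions for illustration; a real consumer would classify on its own error types and its own policy.

```python
from enum import Enum


class Reason(Enum):
    SCHEMA = "schema"
    VALIDATION = "validation"
    DOWNSTREAM = "downstream"
    UNKNOWN = "unknown"


class Action(Enum):
    AUTO_RETRY = "auto_retry"
    MANUAL = "manual"


# Assumed policy: only transient downstream failures are retried automatically.
POLICY = {
    Reason.DOWNSTREAM: Action.AUTO_RETRY,
    Reason.SCHEMA: Action.MANUAL,
    Reason.VALIDATION: Action.MANUAL,
}


def classify(error: Exception) -> Reason:
    """Map an exception to a DLQ reason (illustrative mapping only)."""
    if isinstance(error, (TimeoutError, ConnectionError)):
        return Reason.DOWNSTREAM
    if isinstance(error, (KeyError, TypeError)):
        return Reason.SCHEMA  # e.g. a missing or wrongly typed field
    if isinstance(error, ValueError):
        return Reason.VALIDATION
    return Reason.UNKNOWN


def next_action(error: Exception) -> Action:
    # Anything unclassified defaults to manual intervention.
    return POLICY.get(classify(error), Action.MANUAL)
```

Storing the `Reason` value alongside each DLQ message makes the later “sample and classify” step a query rather than an investigation.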
Replay runbook: safe steps for re-emission
Replay is not a “command”; it’s an operational flow. The template I use:
- Freeze: Prevent the same messages from landing in the DLQ again (fix the root cause first)
- Sample: Take 10–50 messages, classify them, understand impact
- Dry run: Try in staging or in an isolated consumer (where possible)
- Gradual replay: Apply in production with batching and rate limiting
- Verify: business metrics + technical metrics
- Close: postmortem and permanent improvement
A simple approach for gradual replay
A system-agnostic principle: “don’t start at 1x speed.”
- First 5 min: low rate, observe
- Then: increase in a controlled way
- If you see errors: stop automatically once an error threshold is crossed
This approach prevents replay from turning into a second incident.
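A minimal sketch of this idea, under assumptions: the ramp is expressed as growing batch sizes rather than wall-clock minutes, and `batch_sizes`, `max_error_ratio`, and the injected `sleep` are hypothetical parameters chosen for illustration. The `handle` callback is whatever reprocesses a single message.

```python
import time
from typing import Any, Callable, Dict, Iterable, Sequence


def gradual_replay(
    messages: Iterable[Any],
    handle: Callable[[Any], None],
    batch_sizes: Sequence[int] = (1, 5, 20),
    max_error_ratio: float = 0.2,
    pause_seconds: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Replay in growing batches; stop when a batch exceeds the error ratio."""
    pending = list(messages)
    succeeded = failed = step = 0
    while pending:
        # Ramp up: use the next batch size, then stay at the last one.
        size = batch_sizes[min(step, len(batch_sizes) - 1)]
        batch, pending = pending[:size], pending[size:]
        errors = 0
        for msg in batch:
            try:
                handle(msg)
                succeeded += 1
            except Exception:
                errors += 1
                failed += 1
        if errors / len(batch) > max_error_ratio:
            # Automatic stop: leave the remaining messages untouched.
            return {"succeeded": succeeded, "failed": failed,
                    "remaining": len(pending), "stopped": True}
        step += 1
        sleep(pause_seconds)  # observation window between batches
    return {"succeeded": succeeded, "failed": failed,
            "remaining": 0, "stopped": False}
```

The returned `remaining` count tells the operator how much was left in place when the stop threshold fired, which is the number the runbook needs before deciding on the next attempt.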
Observability: idempotency must be visible
If you’ve performed an idempotency check, turn it into a metric:
- idempotency_hit (came again, no operation performed)
- idempotency_miss (first arrival, operation performed)
- dedup_store_latency (state store latency)
- replay_batch_success/fail
Without these metrics, “are we idempotent?” is, in production, just a matter of belief.
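As a rough sketch of how the hit, miss, and latency metrics could be recorded at the point of the dedup check, the class below uses a plain `Counter` and a list instead of a real metrics client. The class name `ObservedDedup` and the in-memory `seen` set are assumptions; in production the counters would go to whatever metrics library you already use.

```python
import time
from collections import Counter
from typing import Callable, List, Set


class ObservedDedup:
    """Dedup check that records a metric for every decision it makes."""

    def __init__(self) -> None:
        self.seen: Set[str] = set()  # stand-in for the state store
        self.counters: Counter = Counter()
        self.dedup_store_latency: List[float] = []  # seconds per lookup

    def process(self, event_id: str, apply: Callable[[], None]) -> bool:
        start = time.perf_counter()
        duplicate = event_id in self.seen  # the state store lookup
        self.dedup_store_latency.append(time.perf_counter() - start)
        if duplicate:
            self.counters["idempotency_hit"] += 1  # came again, skipped
            return False
        apply()
        self.seen.add(event_id)
        self.counters["idempotency_miss"] += 1  # first arrival, applied
        return True
```

A hit count that stays at zero for weeks is worth as much attention as a spike: it may mean the check is never being exercised.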
Conclusion
In messaging architecture, replay and idempotency are not just a software design decision; they’re a matter of operational maturity. If you strengthen consistency with outbox, manage the DLQ like quarantine, and tie replay to a runbook, then a “repeated message” stops being a frightening unknown in production. From an operational leadership perspective, the biggest gain is being able to replace incident-time panic (“it happened again”) with the calm of “we have a replay plan.”