Learn how the Outbox Pattern atomically stores updates, guarantees event delivery, retries on broker failures, and routes errors to a dead‑letter queue.
Introduction
In distributed systems, reliably publishing events to a message broker while keeping the underlying business data consistent is a classic challenge.
The Outbox Pattern solves this by persisting events in the same database transaction that modifies the business data, then delivering those events asynchronously.
This post walks through how the pattern guarantees delivery, why you never need to roll back a committed transaction when publishing fails, and how to handle truly unrecoverable errors.
How the Outbox Pattern Works
| Step | What Happens |
|---|---|
| 1️⃣ Start a DB transaction | Open a transaction in your service. |
| 2️⃣ Update business state | e.g., create an order, change inventory, etc. |
| 3️⃣ Insert an outbox record | Write a row to an outbox_events table containing the event type, payload, and any correlation IDs. This write happens inside the same transaction as the business update. |
| 4️⃣ Commit the transaction | Both the business data and the outbox entry become durable together. |
| 5️⃣ Asynchronous processor polls | A background job (or change‑data‑capture stream) reads rows where sent_at IS NULL. |
| 6️⃣ Publish to the broker | The processor sends the payload to Kafka, RabbitMQ, etc. |
| 7️⃣ Mark the row as sent | On success, update sent_at (or delete the row). If publishing fails, the row stays untouched for a later retry. |
Result: The database guarantees atomicity between the business change and the "intent to publish". The separate processor guarantees at‑least‑once delivery despite broker outages.
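The seven steps can be traced end to end in a few lines. The following is a minimal sketch, not a production implementation: it assumes an in-memory SQLite database in place of the service database, a plain Python list as a stand-in broker, and hypothetical helper names (`create_order`, `relay_once`).

```python
import json
import sqlite3

# In-memory SQLite stands in for the service database (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL NOT NULL);
    CREATE TABLE outbox_events (
        id INTEGER PRIMARY KEY,
        event_type TEXT NOT NULL,
        payload TEXT NOT NULL,
        sent_at TEXT
    );
""")

def create_order(total):
    """Steps 1-4: the business row and the outbox row commit in one transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        order_id = conn.execute(
            "INSERT INTO orders (total) VALUES (?)", (total,)
        ).lastrowid
        conn.execute(
            "INSERT INTO outbox_events (event_type, payload) VALUES (?, ?)",
            ("OrderCreated", json.dumps({"order_id": order_id, "total": total})),
        )
    return order_id

def relay_once(publish):
    """Steps 5-7: read unsent rows, publish each one, then mark it as sent."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox_events "
        "WHERE sent_at IS NULL ORDER BY id"
    ).fetchall()
    for event_id, event_type, payload in rows:
        publish(event_type, payload)  # expected to raise if the broker rejects it
        with conn:
            conn.execute(
                "UPDATE outbox_events SET sent_at = datetime('now') WHERE id = ?",
                (event_id,),
            )
    return len(rows)

broker = []  # hypothetical in-memory broker: a list of (topic, payload) tuples
order_id = create_order(42.0)
sent = relay_once(lambda topic, payload: broker.append((topic, payload)))
```

A second call to `relay_once` finds nothing to do, because the only row already has `sent_at` set.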
Guarantees of Service Delivery
- Atomic persistence – because the outbox entry is written in the same transaction as the domain data, either both succeed or both fail.
- Durable storage – the outbox table lives in the same relational store that already guarantees durability and recovery.
- Retry‑until‑success – the poller keeps trying until the broker acknowledges the message.
- Decoupling – your core service code doesn’t need to know whether the broker is up; it only cares about committing its transaction.
The pattern therefore provides eventual consistency and at‑least‑once semantics without forcing the service to block on external systems.
What Happens When Publishing Fails?
- The transaction has already been committed. The business operation (e.g., “order created”) is now part of the system’s state.
- The outbox row remains unprocessed, so the poller will retry.
- No rollback is required or possible – the database cannot undo a committed transaction, and there is no need to: the event is still stored durably in the outbox and will be delivered on a later attempt.
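To make the retry behaviour concrete, here is a small sketch under stated assumptions: `FlakyBroker` and `BrokerDown` are hypothetical stand-ins for a real broker client and its transient error, and SQLite replaces the service database. The outbox row simply stays pending until a publish attempt succeeds.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE outbox_events ("
    "id INTEGER PRIMARY KEY, payload TEXT NOT NULL, sent_at TEXT)"
)
with conn:  # pretend the business transaction already committed this event
    conn.execute("INSERT INTO outbox_events (payload) VALUES ('order-1-created')")

class BrokerDown(Exception):
    """Hypothetical transient error raised while the broker is unreachable."""

class FlakyBroker:
    """Fake broker that rejects the first `failures` publish attempts."""

    def __init__(self, failures):
        self.failures = failures
        self.delivered = []

    def publish(self, payload):
        if self.failures > 0:
            self.failures -= 1
            raise BrokerDown("broker unavailable")
        self.delivered.append(payload)

def relay_once(broker):
    """One poll: try each pending row; a failed publish leaves the row as-is."""
    pending = conn.execute(
        "SELECT id, payload FROM outbox_events WHERE sent_at IS NULL ORDER BY id"
    ).fetchall()
    for event_id, payload in pending:
        try:
            broker.publish(payload)
        except BrokerDown:
            continue  # row stays pending, so the next poll retries it
        with conn:
            conn.execute(
                "UPDATE outbox_events SET sent_at = datetime('now') WHERE id = ?",
                (event_id,),
            )

def pending_count():
    return conn.execute(
        "SELECT COUNT(*) FROM outbox_events WHERE sent_at IS NULL"
    ).fetchone()[0]

broker = FlakyBroker(failures=2)
history = []  # pending count observed after each poll
for _ in range(3):
    relay_once(broker)
    history.append(pending_count())
```

The first two polls fail and leave the backlog at one row; the third poll delivers the event and clears it. At no point is the committed business data touched.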
Why Not Roll Back?
| Situation | Correct Action |
|---|---|
| Broker down before you attempt to publish | Still commit the transaction; let the background processor retry later. |
| Broker down after you’ve committed | The outbox row stays in the table; the processor will resend when the broker recovers. |
| Permanent publishing error (malformed payload, schema mismatch) | Move the message to a Dead Letter Queue (DLQ), flag it as failed, and alert ops. Do not roll back the business data. |
Handling Irrecoverable Failures
- Detect the error – the poller catches exceptions that are not transient (e.g., validation errors).
- Mark the event as failed – set status = 'failed' on the row, or move the row to a separate dead_letter_outbox table.
- Log and alert – feed the error into monitoring/alerting pipelines.
- Manual or automated re‑processing – after fixing the payload or broker config, replay the event from the DLQ.
Because the original business transaction is already committed, you never need to “undo” it.
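One way to implement the "move the row" option is sketched below. It assumes a `dead_letter_outbox` table with an extra `error_message` column, a hypothetical `PermanentError` type, and a fake `publish` function that rejects malformed JSON; alerting is left out to keep the example short.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE outbox_events (
        id INTEGER PRIMARY KEY, event_type TEXT NOT NULL, payload TEXT NOT NULL
    );
    CREATE TABLE dead_letter_outbox (
        id INTEGER PRIMARY KEY, event_type TEXT NOT NULL, payload TEXT NOT NULL,
        error_message TEXT NOT NULL
    );
    INSERT INTO outbox_events (event_type, payload) VALUES
        ('OrderCreated', '{"order_id": 1}'),
        ('OrderCreated', 'not-json');
""")

class PermanentError(Exception):
    """Hypothetical error type for failures that retrying cannot fix."""

def publish(event_type, payload):
    """Stand-in for a broker client that validates the payload before sending."""
    try:
        json.loads(payload)
    except json.JSONDecodeError as exc:
        raise PermanentError(f"malformed payload: {exc.msg}") from exc

def relay_once():
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox_events ORDER BY id"
    ).fetchall()
    for event_id, event_type, payload in rows:
        try:
            publish(event_type, payload)
        except PermanentError as exc:
            # Copy and delete in one transaction so the event is never lost
            # and never present in both tables.
            with conn:
                conn.execute(
                    "INSERT INTO dead_letter_outbox VALUES (?, ?, ?, ?)",
                    (event_id, event_type, payload, str(exc)),
                )
                conn.execute("DELETE FROM outbox_events WHERE id = ?", (event_id,))
            continue
        with conn:  # delivered: this sketch deletes sent rows instead of flagging them
            conn.execute("DELETE FROM outbox_events WHERE id = ?", (event_id,))

relay_once()
dead = conn.execute(
    "SELECT id, payload, error_message FROM dead_letter_outbox"
).fetchall()
remaining = conn.execute("SELECT COUNT(*) FROM outbox_events").fetchone()[0]
```

Replaying an event after a fix then amounts to copying the corrected row back into `outbox_events`, again inside a single transaction.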
Building a Robust Outbox Implementation
Core Table Schema
    CREATE TABLE outbox_events (
        id             BIGSERIAL PRIMARY KEY,
        aggregate_type TEXT NOT NULL,            -- e.g., 'Order'
        aggregate_id   BIGINT NOT NULL,          -- primary key of the domain entity
        event_type     TEXT NOT NULL,            -- e.g., 'OrderCreated'
        payload        JSONB NOT NULL,
        created_at     TIMESTAMPTZ DEFAULT now(),
        sent_at        TIMESTAMPTZ NULL,
        failed_at      TIMESTAMPTZ NULL,
        error_message  TEXT NULL,
        status         TEXT NOT NULL DEFAULT 'pending'  -- pending | sent | failed
    );
Transactional Write (Pseudo‑code)
    with db.transaction():
        order = Order.create(user_id=uid, total=price)
        outbox = OutboxEvent(
            aggregate_type='Order',
            aggregate_id=order.id,
            event_type='OrderCreated',
            payload=json.dumps({'order_id': order.id, 'total': price})
        )
        db.save(outbox)  # both inserts happen atomically
    # commit happens here
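A runnable variant helps show the "both or neither" property. This sketch assumes SQLite via Python's standard `sqlite3` module, whose connection context manager commits on success and rolls back when an exception escapes; the `metadata` argument is a hypothetical addition used only to force a failure between the two inserts.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL NOT NULL);
    CREATE TABLE outbox_events (
        id INTEGER PRIMARY KEY, event_type TEXT NOT NULL, payload TEXT NOT NULL
    );
""")

def create_order(total, metadata):
    with conn:  # one transaction around both writes
        order_id = conn.execute(
            "INSERT INTO orders (total) VALUES (?)", (total,)
        ).lastrowid
        # Serialization happens after the order insert, so a failure here
        # exercises the rollback of an already-executed statement.
        payload = json.dumps(
            {"order_id": order_id, "total": total, "metadata": metadata}
        )
        conn.execute(
            "INSERT INTO outbox_events (event_type, payload) VALUES (?, ?)",
            ("OrderCreated", payload),
        )
    return order_id

def counts():
    orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
    events = conn.execute("SELECT COUNT(*) FROM outbox_events").fetchone()[0]
    return orders, events

create_order(10.0, {"channel": "web"})
after_success = counts()

try:
    create_order(20.0, {"bad": object()})  # object() is not JSON serializable
except TypeError:
    pass
after_failure = counts()
```

The failed call leaves no orphaned order behind: the order insert is rolled back together with the outbox insert that never happened.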
Polling Publisher (Simplified)
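    import logging
    import time

    logger = logging.getLogger(__name__)

    # db, kafka, alert_ops, TransientError, PermanentError and POLL_INTERVAL
    # are placeholders supplied by the application.
    def publish_loop():
        while True:
            # Oldest rows first, so events go out roughly in creation order.
            batch = db.fetch(
                "SELECT * FROM outbox_events "
                "WHERE sent_at IS NULL AND status='pending' ORDER BY id LIMIT 50"
            )
            for ev in batch:
                try:
                    # Assumes send() returns only after the broker acknowledges
                    # (or raises). With an asynchronous client, wait for the
                    # acknowledgement before marking the row as sent.
                    # ev.payload is assumed to be a JSON string.
                    kafka.producer.send(ev.event_type, ev.payload.encode())
                    db.execute(
                        "UPDATE outbox_events SET sent_at = now(), status='sent' WHERE id = %s",
                        (ev.id,)
                    )
                except TransientError as e:
                    logger.warning(f'Transient failure for {ev.id}: {e}')
                    # leave row untouched; it will be retried on the next poll
                except PermanentError as e:
                    db.execute(
                        "UPDATE outbox_events SET failed_at = now(), status='failed', error_message=%s WHERE id = %s",
                        (str(e), ev.id)
                    )
                    alert_ops(ev, e)
            time.sleep(POLL_INTERVAL)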
Monitoring & Alerts
- Metrics: number of pending events, failed events, average latency from created_at → sent_at.
- Dashboards: visualize backlog spikes indicating broker issues.
- Alert thresholds: e.g., “> 5 min backlog” → page on‑call.
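These metrics can be derived directly from the outbox table. The sketch below assumes SQLite with timestamps stored as text, a hypothetical `outbox_metrics` helper, and a five-minute paging threshold taken from the example above; a real deployment would export the values to its metrics system rather than compute a boolean inline.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE outbox_events (
        id INTEGER PRIMARY KEY,
        status TEXT NOT NULL DEFAULT 'pending',
        created_at TEXT NOT NULL,
        sent_at TEXT
    );
    INSERT INTO outbox_events (status, created_at, sent_at) VALUES
        ('sent',    '2024-01-01 12:00:00', '2024-01-01 12:00:02'),
        ('sent',    '2024-01-01 12:00:10', '2024-01-01 12:00:14'),
        ('pending', '2024-01-01 12:01:00', NULL),
        ('failed',  '2024-01-01 12:02:00', NULL);
""")

def outbox_metrics(now):
    """Return backlog size, failure count, and latencies in whole seconds."""
    row = conn.execute("""
        SELECT
            COALESCE(SUM(status = 'pending'), 0),
            COALESCE(SUM(status = 'failed'), 0),
            AVG(CASE WHEN status = 'sent'
                     THEN strftime('%s', sent_at) - strftime('%s', created_at) END),
            MAX(CASE WHEN status = 'pending'
                     THEN strftime('%s', ?) - strftime('%s', created_at) END)
        FROM outbox_events
    """, (now,)).fetchone()
    pending, failed, avg_latency_s, oldest_pending_age_s = row
    return {
        "pending": pending,
        "failed": failed,
        "avg_latency_s": avg_latency_s,
        "oldest_pending_age_s": oldest_pending_age_s,
    }

BACKLOG_THRESHOLD_S = 5 * 60  # the "> 5 min backlog" rule

metrics = outbox_metrics("2024-01-01 12:07:00")
# oldest_pending_age_s is None when nothing is pending, hence the "or 0".
should_page = (metrics["oldest_pending_age_s"] or 0) > BACKLOG_THRESHOLD_S
```

With the sample rows, the oldest pending event is six minutes old at the chosen `now`, so the threshold is exceeded.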
Idempotent Consumers – Closing the Loop
Since the Outbox pattern delivers at‑least‑once, downstream services must tolerate duplicates:
- Include a deduplication key (the id of the outbox row) in the message.
- Consumers store processed IDs in a fast lookup (Redis, DB) and ignore repeats.
- Alternatively, design business logic to be idempotent (e.g., “create order if not exists”).
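A consumer-side sketch of the deduplication approach follows. It assumes the message carries the outbox row id as `event_id`, and it uses hypothetical `processed_events` and `shipments` tables in SQLite; recording the ID and applying the side effect in the same transaction keeps the two from drifting apart.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE processed_events (event_id INTEGER PRIMARY KEY);
    CREATE TABLE shipments (order_id INTEGER NOT NULL);
""")

def handle(message):
    """Apply the event once; return False when it is a duplicate delivery."""
    event = json.loads(message)
    try:
        with conn:  # dedup record and side effect commit together
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)",
                (event["event_id"],),
            )
            conn.execute(
                "INSERT INTO shipments (order_id) VALUES (?)",
                (event["order_id"],),
            )
    except sqlite3.IntegrityError:
        return False  # primary-key conflict: this event_id was already handled
    return True

message = json.dumps({"event_id": 7, "order_id": 1001})
results = [handle(message), handle(message)]  # the broker delivers it twice
shipment_count = conn.execute("SELECT COUNT(*) FROM shipments").fetchone()[0]
```

The second delivery hits the primary-key constraint, the transaction rolls back, and only one shipment is created.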
TL;DR Checklist
- Write domain changes and outbox record inside a single DB transaction.
- Commit the transaction; never try to roll it back after a publish failure.
- Run an asynchronous poller that reliably pushes outbox rows to the broker, retrying on transient errors.
- Move permanently failing rows to a DLQ, log, and alert.
- Make consumers idempotent to handle at‑least‑once delivery safely.
By embracing eventual consistency and decoupling persistence from delivery, the Outbox pattern gives you reliable event propagation without tightly coupling your service to external messaging systems.