A transaction is authorized. You write the result to the database. Then you publish an event to Kafka so downstream services can react - fire a webhook, update the settlement ledger, trigger a payout. The database write succeeds. The Kafka publish fails. Now the database says "approved" but no webhook was sent, no settlement was recorded, and the merchant thinks the payment never happened.
Or worse: the Kafka publish succeeds but the database transaction rolls back. Now downstream services think the payment went through, but there is no record. The merchant gets a webhook for a transaction that does not exist.
This is the dual-write problem. Two systems need to be updated atomically, but there is no transaction that spans both. In payment infrastructure, this is not a theoretical concern. It is the root cause of phantom transactions, missing webhooks, settlement mismatches, and hours of manual investigation.
Why dual-write always fails eventually
Figure: The Dual-Write Problem - three ways a direct DB + Kafka write can fail. Pictured: the DB write (INSERT INTO payments ...) commits, but the Kafka publish fails (connection refused / timeout), so the webhook service is never notified and settlement is never triggered.
The naive approach: write to the database, then publish to the message broker. Two separate operations in sequence.
Failure modes:
- Database succeeds, publish fails. The transaction is recorded but no event is emitted. Downstream services never know it happened. Webhooks are never sent. Settlements are never matched.
- Publish succeeds, database fails. Downstream services process an event for a transaction that was rolled back. Phantom records. Duplicate credits. Reconciliation breaks.
- Process crashes between the two operations. The database write committed but the publish never executed. The event is lost permanently unless something detects and retries it.
The traditional solution is a distributed transaction (2PC) spanning the database and the message broker. In practice, 2PC is not viable for payment infrastructure. It is slow, fragile under network partitions, and most message brokers do not support it. Kafka does not participate in XA transactions. R2DBC does not support 2PC. And even if they did, the latency and availability cost makes it incompatible with high-throughput payment processing.
The Transactional Outbox pattern
The solution: do not publish directly to the message broker. Instead, write the event to an outbox table in the same database, inside the same transaction that updates the business entity.
One database transaction. One atomic commit. The transaction record and the outbox message either both persist or neither does. No inconsistency possible.
A separate process - the outbox relay - reads undelivered messages from the outbox table and publishes them to Kafka. Once published and acknowledged, the relay marks the message as sent.
The flow:
- Begin database transaction
- Insert or update the transaction record (e.g., status = "approved")
- Insert an outbox message in the same transaction (e.g., event = "transaction.approved")
- Commit transaction - both writes are atomic
- Outbox relay picks up the message, publishes to Kafka
- On successful publish, relay marks the outbox row as delivered
If the relay crashes after publishing but before marking delivered, it will re-publish on restart. This means consumers must be idempotent - they must handle duplicate messages gracefully. But at-least-once delivery with idempotent consumers is far safer than at-most-once delivery where events silently disappear.
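Under the stack described later in this article (Spring Boot WebFlux with spring-data-r2dbc), the atomic step of the flow above can be sketched roughly like this. The `PaymentService` shape, table names, and columns are illustrative assumptions, not the production code:

```kotlin
import kotlinx.coroutines.reactor.awaitSingle
import org.springframework.r2dbc.core.DatabaseClient
import org.springframework.transaction.reactive.TransactionalOperator
import org.springframework.transaction.reactive.executeAndAwait

// Sketch: both writes share one R2DBC transaction, so they commit or roll back together.
class PaymentService(
    private val db: DatabaseClient,
    private val tx: TransactionalOperator,
) {
    suspend fun approve(paymentId: String, eventPayload: String) {
        tx.executeAndAwait {
            // 1. Business write: flip the payment to approved.
            db.sql("UPDATE payments SET status = 'approved' WHERE id = :id")
                .bind("id", paymentId)
                .fetch().rowsUpdated().awaitSingle()

            // 2. Outbox write: record the domain event in the same transaction.
            db.sql(
                """INSERT INTO outbox (event_type, payload, created_at, delivered)
                   VALUES ('transaction.approved', :payload, now(), false)"""
            )
                .bind("payload", eventPayload)
                .fetch().rowsUpdated().awaitSingle()
        }
        // If anything above throws, neither row exists; if it commits, both do.
    }
}
```

Because `executeAndAwait` suspends rather than blocks, the thread is released while the database round-trip is in flight.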
“One database transaction. One atomic commit. The transaction record and the event message either both persist or neither does.”
Figure: Transactional Outbox Pattern - entity and event written in one atomic transaction; the relay publishes after commit. Flow: an incoming payment request (POST /api/v1/payments) hits the payment gateway service (REST API), which writes the entity table and the outbox table in one transaction; the outbox relay polls and publishes to Kafka, and downstream services consume.
Implementing with R2DBC and Kotlin Coroutines
Our payment gateway runs on Spring Boot WebFlux with Kotlin Coroutines and R2DBC for non-blocking database access. The outbox pattern fits naturally into this stack.
The coroutine-based transaction block wraps both the business write and the outbox insert in a single R2DBC transaction. Because R2DBC is non-blocking, this does not hold a thread during the database round-trip. The coroutine suspends, the thread is released, and another transaction can proceed on the same thread.
The outbox table is simple: an ID, the event type, the payload (serialized as JSON), a created timestamp, and a delivered flag. The relay queries for undelivered rows, publishes them to the appropriate Kafka topic, and updates the delivered flag - again using R2DBC, non-blocking end to end.
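A minimal PostgreSQL sketch of such a table; the exact names and types here are assumptions, not the production DDL:

```sql
-- Illustrative outbox schema; names and types are assumptions.
CREATE TABLE outbox (
    id          BIGSERIAL PRIMARY KEY,              -- insertion order doubles as publish order
    event_type  TEXT        NOT NULL,               -- e.g. 'transaction.approved'
    payload     JSONB       NOT NULL,               -- serialized domain event
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    delivered   BOOLEAN     NOT NULL DEFAULT false
);

-- Partial index keeps the relay's poll cheap even as delivered rows accumulate.
CREATE INDEX outbox_undelivered_idx ON outbox (id) WHERE delivered = false;
```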
Key design decisions in our implementation:
- Ordering guarantees. Outbox messages for the same transaction are published in insertion order. The relay processes them sequentially per transaction ID, ensuring downstream consumers see state transitions in the correct order.
- Retry with backoff. If Kafka is temporarily unavailable, the relay retries with exponential backoff. Messages accumulate in the outbox table until the broker recovers. No messages are lost.
- Cleanup. Delivered messages are pruned after a configurable retention period. The outbox table stays small.
- Monitoring. We track outbox lag - the time between message creation and delivery. If lag exceeds thresholds, alerts fire. Under normal operation, lag is under 100ms.
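Putting the poll, publish, mark-delivered, and backoff behavior together, a simplified relay might look like the sketch below. The `publish` function is a hypothetical stand-in for a reactive Kafka client (e.g. reactor-kafka's KafkaSender); queries and names are illustrative. Note this toy version processes rows strictly sequentially, which trivially preserves order; a production relay would parallelize per transaction ID as described above:

```kotlin
import kotlinx.coroutines.delay
import kotlinx.coroutines.reactor.awaitSingle
import org.springframework.r2dbc.core.DatabaseClient

// Simplified relay sketch. `publish` is a hypothetical stand-in for a reactive Kafka send.
class OutboxRelay(
    private val db: DatabaseClient,
    private val publish: suspend (eventType: String, payload: String) -> Unit,
) {
    suspend fun run() {
        var backoffMs = 100L
        while (true) {
            try {
                val published = relayBatch()
                backoffMs = 100L                     // healthy pass: reset backoff
                if (published == 0) delay(backoffMs) // nothing pending: idle poll
            } catch (e: Exception) {
                delay(backoffMs)                     // broker unavailable: wait, then retry
                backoffMs = (backoffMs * 2).coerceAtMost(30_000L)
            }
        }
    }

    private suspend fun relayBatch(): Int {
        val rows = db.sql(
            """SELECT id, event_type, payload FROM outbox
               WHERE delivered = false ORDER BY id LIMIT 100"""
        ).map { row, _ ->
            Triple(
                row.get("id", java.lang.Long::class.java)!!.toLong(),
                row.get("event_type", String::class.java)!!,
                row.get("payload", String::class.java)!!,
            )
        }.all().collectList().awaitSingle()

        for ((id, type, payload) in rows) {
            publish(type, payload) // at-least-once: a crash here means a re-send on restart
            db.sql("UPDATE outbox SET delivered = true WHERE id = :id")
                .bind("id", id)
                .fetch().rowsUpdated().awaitSingle()
        }
        return rows.size
    }
}
```

Messages simply accumulate in the table while Kafka is down, which is exactly the durability property the pattern is for.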
What this solves in payment operations
The outbox pattern is not about engineering elegance. It solves specific operational problems that payment teams deal with daily.
Missing webhooks. A merchant reports they never received a payment notification. Without the outbox, the publish to Kafka failed during a brief network blip and the event is gone forever. With the outbox, the message sits in the database until confirmed delivered. Nothing is lost.
Settlement mismatches. The settlement service shows a different count than the transaction database. Without the outbox, some transaction state changes never reached the settlement stream. With the outbox, every state change is guaranteed to be published. If the numbers do not match, the problem is downstream - not in the event pipeline.
Phantom transactions. A downstream service processed a payment that was actually rolled back. Without the outbox, the event was published before the database transaction committed. With the outbox, events are only written inside committed transactions. Rolled-back transactions produce no events.
Debugging and audit. The outbox table is a persistent log of every event the system intended to publish. When something goes wrong, you do not reconstruct from Kafka offsets and consumer group states. You query the outbox and see exactly what was written, when, and whether it was delivered.
Tracing across the entire flow
Figure: Transaction trace - the full lifecycle from creation to PSP callback confirmation, including the roughly 3-hour gap before the PSP callback.
Every inbound request to the payment gateway carries a trace ID - generated at the API entry point or propagated from the caller. This ID is attached to every database operation, every outbox message, every Kafka event, and every downstream service call.
When a merchant calls support about transaction txn_8f2k4m9x, we pull the trace ID and see:
- API request received at 14:23:07.102
- Database transaction committed at 14:23:07.118
- Outbox message created at 14:23:07.118 (same commit)
- Outbox relay published to Kafka at 14:23:07.195
- Webhook dispatcher consumed event at 14:23:07.210
- Webhook delivered to merchant at 14:23:07.340
- PSP callback received at 17:45:12.891 (3 hours later)
- State transition published via outbox at 17:45:12.904
- Settlement matched at 02:15:33.204 (next day)
One trace ID. The complete lifecycle from API call to settlement. The outbox messages are part of the trace - you can see exactly when each event was created, when it was published, and how long delivery took.
The trace ID propagates through Kotlin coroutines via the coroutine context. When a coroutine suspends and resumes on a different thread, the trace ID follows. No MDC hacks. No manual propagation. The micrometer-tracing library handles it, and every log line includes the trace ID automatically.
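To make the mechanism concrete, here is a self-contained toy version of a context-carried trace ID. This is not micrometer-tracing's implementation, just an illustration of the coroutine-context principle it relies on:

```kotlin
import kotlin.coroutines.AbstractCoroutineContextElement
import kotlin.coroutines.CoroutineContext
import kotlin.coroutines.coroutineContext
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.delay
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.withContext

// Toy trace-id element: a value that rides along in the coroutine context.
class TraceId(val value: String) : AbstractCoroutineContextElement(TraceId) {
    companion object Key : CoroutineContext.Key<TraceId>
}

suspend fun currentTraceId(): String? = coroutineContext[TraceId]?.value

fun main() = runBlocking {
    withContext(TraceId("4f2a9c") + Dispatchers.Default) {
        delay(10) // suspend; the coroutine may resume on a different thread...
        println(currentTraceId()) // ...but the trace id follows: prints 4f2a9c
    }
}
```

Because the element lives in the `CoroutineContext` rather than a thread-local, no per-thread bookkeeping is needed when the coroutine hops threads.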
“The outbox is not just a delivery mechanism. It is an audit trail. Every event the system intended to publish is recorded, timestamped, and traceable.”
Idempotency on the consumer side
The outbox guarantees at-least-once delivery. That means consumers will occasionally see the same message twice - after relay restarts, network retries, or Kafka rebalances.
Every consumer in our system is idempotent by design. The transaction state machine rejects transitions that have already occurred. If a "transaction approved" event arrives twice, the second one is a no-op because the transaction is already in the "approved" state. Logged, acknowledged, not processed.
For webhook delivery specifically: each webhook carries a unique event ID. The dispatcher tracks delivered event IDs. Duplicates are suppressed before they reach the merchant. Merchants who implement their own idempotency (checking the event ID) get a second layer of protection.
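The state-machine style of idempotency can be illustrated with a self-contained toy; the class and the in-memory map are stand-ins for the real transaction store:

```kotlin
// Toy idempotent consumer: a duplicate "approved" event is a no-op.
// The map stands in for the real transaction store; names are illustrative.
enum class PaymentState { PENDING, APPROVED }

class TransactionStates {
    private val states = mutableMapOf<String, PaymentState>()

    /** Returns true if the event changed state, false if it was a duplicate no-op. */
    fun onApproved(txnId: String): Boolean {
        if (states[txnId] == PaymentState.APPROVED) return false // already applied: log + ack only
        states[txnId] = PaymentState.APPROVED
        return true
    }
}

fun main() {
    val store = TransactionStates()
    println(store.onApproved("txn_1")) // first delivery: prints true
    println(store.onApproved("txn_1")) // redelivery: prints false (suppressed)
}
```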
Why not Change Data Capture?
Pattern Comparison: Transactional Outbox vs Change Data Capture (Debezium)

Where the Transactional Outbox wins:
- Event schema control: you define the event shape
- Selective publishing: choose which changes emit events
- R2DBC compatible: works with reactive DB drivers
- Operational simplicity: no external infra dependency

Where CDC (Debezium) wins:
- Zero application changes: reads directly from WAL
- Captures all DB mutations: including direct SQL changes
An alternative to the outbox relay is Change Data Capture (CDC) - reading the database transaction log (WAL in PostgreSQL) and publishing changes directly to Kafka. Tools like Debezium do this.
We considered CDC and chose the outbox approach for payment-specific reasons:
- Event schema control. CDC publishes raw database changes. The outbox publishes domain events with controlled schemas. Downstream consumers receive "transaction.approved" with a defined payload - not a database row diff.
- Selective publishing. Not every database write should become an event. The outbox lets us choose exactly which state transitions are published. CDC publishes everything or requires complex filtering.
- Operational simplicity. CDC adds infrastructure (Debezium, Kafka Connect, connector management). The outbox is a table and a coroutine. Fewer moving parts in a system where every moving part is a failure risk.
- R2DBC compatibility. CDC tools typically require access to the database replication stream, which adds operational complexity with R2DBC connection pools. The outbox works through the same R2DBC connection the application already uses.
The business case
Engineers understand why consistency matters. The business case is simpler: inconsistency costs money.
Every missing webhook is a support ticket. Every settlement mismatch is a manual investigation. Every phantom transaction is a potential dispute. At scale, these add up to full-time headcount dedicated to fixing data problems that should not exist.
The outbox pattern eliminates an entire category of operational issues. Not reduces. Eliminates. If the database committed it, the event will be published. If the database rolled it back, no event exists. The state of the database and the state of the event stream are always consistent.
For operators processing thousands of transactions per hour, this is the difference between a finance team that reconciles clean data and a finance team that spends every morning hunting for mismatches.
Consistency is not a feature. It is the foundation that everything else - settlements, webhooks, reporting, disputes - depends on. Without it, you are building on sand.