A blocking payment system loses you money every time a provider slows down. Not because it crashes. Because threads freeze on I/O, the queue backs up, and transactions that could have been approved on a different provider never get the chance. The customer sees a timeout. You see a decline. The revenue is gone.
We built Exirom on a fully reactive, non-blocking stack because payment infrastructure is an I/O coordination problem. Every meaningful operation - PSP authorization, database write, cache lookup, message publish - is a network call. If your threads block on those calls, your throughput is capped by your slowest dependency. In payments, that dependency changes every hour.
What reactive means in practice
No thread ever waits for I/O. When a request goes to a PSP, the thread is released immediately and picks up the next transaction. When the response arrives, a different thread continues the flow. This applies to every outbound I/O boundary: PSP authorization requests, PostgreSQL via R2DBC, Redis, Kafka publishing, and outbound webhook delivery. Internal service-to-service calls use gRPC with Kotlin coroutine stubs - suspend functions, not blocking threads.
A traditional thread-per-request server with 200 threads handles 200 concurrent I/O calls. At 500ms average PSP latency, that caps at 400 transactions per second. A reactive server with the same hardware never blocks - the same box pushes 2,000-5,000 concurrent operations. Observed under our production load: 5-10x throughput on identical infrastructure.
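The ceiling above is just Little's Law: with N threads each parked for L seconds per blocking call, throughput tops out at N / L. A minimal sketch using the numbers from the text (an illustration of the arithmetic, not a benchmark):

```python
def blocking_throughput_cap(threads: int, avg_io_latency_s: float) -> float:
    """Little's Law: concurrency = throughput * latency. A pool of
    `threads` threads, each blocked `avg_io_latency_s` per call, caps
    throughput at threads / latency transactions per second."""
    return threads / avg_io_latency_s

# 200 threads, 500ms average PSP latency -> 400 TPS ceiling
print(blocking_throughput_cap(200, 0.5))  # 400.0
```

The reactive numbers follow from the same formula: once threads stop waiting, concurrency is no longer bounded by pool size, so the latency term drops out of the capacity equation.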
[Figure: Thread Model Comparison - how threads are utilized in blocking vs reactive systems. Blocking (Traditional): every I/O call blocks a thread; DB, cache, PSP, message bus all waiting. Reactive (Exirom): R2DBC, Redis, PSP calls, Kafka - all non-blocking; threads never idle.]
Cascading depends on speed you cannot fake
Smart routing with cascading is the core value of payment orchestration. Provider A declines, Provider B fires instantly. The customer never sees a failure.
But cascading only works if the retry is fast enough. The customer is watching a loading screen. If Provider A takes 800ms to decline and your system needs 100ms of thread scheduling before it can fire the retry, Provider B now has less than a second to authorize before the checkout times out.
In a reactive system, the decline callback triggers the routing engine immediately. No thread scheduling. No queue wait. We measured the gap in production: 12ms between first decline and second authorization request. Under the same load on a blocking architecture, internal benchmarks showed 80-150ms - enough that a meaningful percentage of retries would have timed out at the checkout.
During a production incident where a major provider soft-declined 40% of traffic, our cascading recovered 87% of those declines on secondary providers. The 12ms retry gap made that recovery rate possible. At 80-150ms, modeling shows recovery drops to 60-70%.
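In coroutine form, the cascade is just a loop over an ordered provider list: the retry fires in the same continuation that received the decline, with no thread handoff in between. A minimal asyncio sketch (the provider names and `authorize` stub are illustrative, not Exirom's API):

```python
import asyncio

async def authorize(provider: str, txn: dict) -> str:
    # Stand-in for a real non-blocking PSP call; "Alpha" soft-declines here.
    await asyncio.sleep(0)  # yield to the event loop, as a network await would
    return "declined" if provider == "Alpha" else "approved"

async def cascade(txn: dict, providers: list[str]) -> tuple[str, str]:
    """Try providers in routing order. On a soft decline, fire the next
    attempt immediately instead of surfacing the failure to the customer."""
    for provider in providers:
        result = await authorize(provider, txn)
        if result == "approved":
            return provider, result
    return providers[-1], "declined"

provider, result = asyncio.run(cascade({"id": "txn-1"}, ["Alpha", "Beta"]))
print(provider, result)  # Beta approved
```

The retry latency in this shape is the cost of resuming a coroutine, not of scheduling a fresh thread onto a contended pool, which is where the 12ms vs 80-150ms gap comes from.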
“The milliseconds between a decline and a retry are not a technical detail. They are the difference between recovered revenue and a lost customer.”
[Figure: Cascading Timeline - Provider A soft-declines; the retry fires to Provider B in 12ms.]
Predictable behavior under stress
Speed matters. But what senior engineers actually care about is predictability.
A blocking system performs fine at normal load. At 60% capacity it still looks healthy. At 85%, latency starts climbing. At 95%, it falls off a cliff: tail latency explodes, timeouts cascade, and the system goes from healthy to degraded to failing in minutes.
[Figure: Stress Behavior - how latency scales with load in blocking (thread-per-request) vs reactive (non-blocking) systems.]
A reactive system degrades linearly. Throughput scales with load until resource limits are hit, then it applies backpressure - slowing intake rather than collapsing. No cliff. No cascade. No global degradation from one slow dependency.
[Figure: Provider Isolation - what happens when Provider Alpha degrades to 1800ms latency while Beta, Gamma, and Delta hold a healthy 280ms baseline, in blocking (thread-per-request) vs reactive (non-blocking) systems.]
In payment terms: when Provider Alpha starts responding in 2 seconds instead of 300ms, a blocking system runs out of threads and everything fails - including transactions to Provider Beta, which is perfectly healthy. In our reactive system, Provider Alpha gets backpressure (reduced request rate), the routing engine shifts traffic to healthier providers, and Provider Beta continues at full speed. Stable tail latency across the board.
This is not a nice-to-have. Every PSP has bad days. Network issues, maintenance windows, capacity limits during peak. The question is whether one slow provider takes down your entire payment stack or just reduces its own throughput.
Event-driven, not request-response
Reactive I/O is half the architecture. The other half is how services communicate.
We use Kafka as the backbone. Every transaction state change is an event. The routing engine publishes "routed to Provider Alpha." The PSP adapter publishes "authorization declined." The callback handler publishes status updates hours later. Services consume events independently - no direct calls, no synchronous chains.
Why this matters operationally: when a PSP sends a callback 3 hours after the original transaction, the callback handler - a separate service receiving inbound requests - processes it and publishes the state change to Kafka. The routing engine does not need to be involved. When a settlement file arrives at 2am, the reconciler processes 50,000 line items against the event log. No batch jobs. No cron scripts. Every transaction already has a complete, ordered history.
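The shape of this is easy to see with an in-memory stand-in for the Kafka topic (the real services use Kafka consumers; this sketch only shows the decoupling, and the event fields are illustrative):

```python
from typing import Callable

# Tiny stand-in for a Kafka topic: an append-only log that any number
# of consumers react to independently.
log: list[dict] = []
subscribers: list[Callable[[dict], None]] = []

def publish(event: dict) -> None:
    log.append(event)            # ordered, replayable history
    for handle in subscribers:   # each consumer reacts on its own
        handle(event)

# One consumer: a projection of current transaction status.
statuses: dict[str, str] = {}
subscribers.append(lambda e: statuses.__setitem__(e["txn"], e["status"]))

# A late PSP callback publishes a state change; the routing engine that
# originally handled this transaction is not involved at all.
publish({"txn": "txn-42", "status": "routed", "provider": "Alpha"})
publish({"txn": "txn-42", "status": "declined", "provider": "Alpha"})
publish({"txn": "txn-42", "status": "approved", "provider": "Beta"})

print(statuses["txn-42"], len(log))  # approved 3
```

The reconciler in the text is just another consumer of the same log: it replays the ordered history instead of running a batch job, which is why every transaction already has a complete audit trail.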
Event-Driven Architecture
Every service publishes and consumes events through Kafka.
- Decoupled: services operate independently. One failure does not cascade.
- Replayable: every event stored. Full transaction history on demand.
- Real-time: monitoring, alerts, and rerouting consume the same stream.
Monitoring built into the architecture
Because every event flows through Kafka, we get real-time monitoring as a side effect. Not as a bolt-on polling system.
A stream processor computes rolling approval rates, latency percentiles, and error rates per provider, per BIN range, per geography - updated every second. When a provider degrades, the routing engine knows through the same stream. Traffic shifts before the next transaction arrives.
Detection-to-reroute in production: under 3 seconds. Polling-based monitoring with 30-60 second intervals means that, at 100 transactions per second, 3,000-6,000 transactions are routed to a failing provider before anyone notices.
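A per-provider rolling approval rate over the event stream needs nothing more than a bounded window per provider. A minimal sketch (the window size of 100 is arbitrary; production tracks latency percentiles and error rates the same way):

```python
from collections import defaultdict, deque

WINDOW = 100  # arbitrary window size for the sketch
outcomes: defaultdict = defaultdict(lambda: deque(maxlen=WINDOW))

def record(provider: str, approved: bool) -> None:
    """Every transaction doubles as a health signal for its provider."""
    outcomes[provider].append(approved)

def approval_rate(provider: str) -> float:
    window = outcomes[provider]
    return sum(window) / len(window) if window else 1.0

# 90 approvals, then 10 declines, streaming in transaction order.
for _ in range(90):
    record("Alpha", True)
for _ in range(10):
    record("Alpha", False)

print(approval_rate("Alpha"))  # 0.9
```

Because the metric updates on every event, a degrading provider shows up within a handful of transactions rather than at the next polling interval, which is what makes sub-3-second rerouting possible.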
“We do not poll for provider health. Every transaction is a health signal. The architecture that processes payments is the same architecture that monitors them.”
Idempotency by architecture
PSP callbacks are inbound requests - each provider hits your callback endpoints with status updates. This is a separate problem from outbound I/O. The challenge is not speed. It is correctness.
Every PSP behaves differently. One sends a single callback. Another sends three - pending, approved, settled. A third retries the same callback if it does not get a 200 within 5 seconds. A fourth changes status retroactively 6 hours later. Multiply by 20 providers and thousands of transactions per hour.
Every transaction in our system has a state machine - a directed graph of valid transitions, not a status string. "Settled" cannot move to "pending." Duplicate callbacks are acknowledged but processed once. Out-of-order events are detected by sequence numbers in the event log. The callback handler is a separate service that publishes state transitions to Kafka - decoupled from the authorization flow entirely.
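The transition graph itself can be a few lines: everything outside it is refused, and a replay of an already-applied transition is acknowledged without effect. A minimal sketch (the state names follow the text; the graph is illustrative, not Exirom's full model):

```python
# Valid transitions form a directed graph; anything else is rejected.
VALID = {
    "created":  {"pending", "declined"},
    "pending":  {"approved", "declined"},
    "approved": {"settled"},
    "declined": set(),  # terminal
    "settled":  set(),  # terminal: "settled" cannot move to "pending"
}

class Transaction:
    def __init__(self) -> None:
        self.state = "created"

    def apply(self, new_state: str) -> bool:
        """Apply a PSP callback. Duplicates are acknowledged but ignored;
        invalid or out-of-order transitions are refused."""
        if new_state == self.state:
            return False                  # duplicate callback
        if new_state not in VALID[self.state]:
            return False                  # invalid transition
        self.state = new_state
        return True

txn = Transaction()
# One provider's callback stream: a duplicate "approved" and a late,
# retroactive "pending" both arrive and are safely dropped.
for status in ["pending", "approved", "approved", "settled", "pending"]:
    txn.apply(status)
print(txn.state)  # settled
```

Because the check lives in the state machine rather than in each callback handler, every one of the 20 providers' quirks hits the same guard, which is what makes the zero-duplicate-charge result a structural property rather than a per-integration fix.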
Result: zero duplicate charges. Zero phantom credits. Clean reconciliation data every day. Your finance team works with facts, not artifacts.
Why most platforms are not built this way
Most payment platforms started as blocking monoliths. They work. They process transactions. They generate reports. But they are fundamentally limited by an architecture that was not designed for I/O-heavy, multi-provider, failure-prone workloads.
Rewriting a production payment system from blocking to reactive is a multi-year project that most companies cannot justify. So they optimize around the edges: more servers, bigger thread pools, tuned timeouts. These help - up to a point. But they increase cost without increasing resilience. Scaling a blocking system means paying more to handle the same failure modes.
The alternative - patching a blocking core with async wrappers - creates the worst of both worlds: reactive complexity with blocking failure modes. The system looks modern but collapses the same way under provider degradation.
We started from zero in 2024. No legacy. No migration path. We chose reactive because the problem demanded it. Payment orchestration is I/O coordination under adversarial conditions - slow providers, flaky callbacks, unpredictable latency, concurrent failures. The infrastructure is invisible to operators. But the architecture shows up in every metric that matters: approval rates, latency, uptime, recovery speed.
Architecture is the product.