21 May 2026 • 21 min read
From 400 TPS to 4,800 TPS: How FinPulse Rebuilt Its Payment Infrastructure to Orchestrate 47 Countries
When FinPulse's payment orchestration platform buckled under 400 transactions per second during Black Friday 2023, triggering 1,200 merchant escalations and $3.2M in SLA penalties, the company faced a critical decision: rebuild or accept permanent client erosion. With six enterprise renewals totalling $9.8M at risk, our team set a 10-week deadline. This case study documents the event-driven microservices rebuild that lifted throughput from 400 to 4,800 TPS, cut end-to-end P99 latency from 2,800ms to 380ms, and eliminated every SLA breach in the following peak cycle. We cover the architectural split of command and query paths using CQRS, the dual-write data migration strategy that preserved all 140M transaction records, the load-testing failures surfaced in week five that saved the cutover, and the post-launch operational lessons, from circuit-breaker design for 34 external bank APIs to an error-budget policy that dropped monthly incident count by 60 percent. Any team running a latency-sensitive financial platform will find actionable patterns here.
Overview
In March 2024, FinPulse — a B2B payment orchestration platform processing $2.3B in annual transaction volume across 47 countries — approached our team with an open crisis. During the previous year's Black Friday peak (November 2023), their platform had buckled under approximately 400 transactions per second (TPS). The result was 1,200 merchant escalation tickets, $3.2M in SLA penalty credits already paid out, and at least six enterprise clients who had privately signaled that they would not renew their contracts unless the platform's reliability improved materially before their renewal dates in Q1 2025.
FinPulse sits at a uniquely demanding position in the financial infrastructure stack: it is not a payment processor itself, but an orchestration layer that routes, splits, reconciles, and retries transactions across 34 acquiring banks, 12 payout networks, and 8 card schemes in real time. Every transaction that touches FinPulse's platform — and there were an estimated 140M of them in 2024 — passes through at least three internal services before a final routing decision is made. The latency budget for that entire inner path was 800 milliseconds end to end; during Black Friday 2023 the actual P99 was 2,800 milliseconds. The gap was not just uncomfortable — it was contractually and operationally untenable.
Our engagement lasted 10 weeks and produced a complete platform rebuild that ultimately ran at 4,800 TPS during Black Friday 2024, a 12× uplift from the prior year. That rebuild also reduced end-to-end P99 latency from 2,800ms to 380ms, eliminated every SLA breach during that peak window, allowed FinPulse to close six stalled enterprise renewals worth roughly $9.8M in ARR, and established an internal culture that treats technical investment as a direct revenue lever. This case study documents the architectural decisions, implementation strategy, incident-driven hardening, data-migration approach, and post-launch economics of that engagement, along with the operational and organisational lessons the joint FinPulse and delivery team absorbed along the way.
Challenge
The Legacy Architecture
FinPulse's platform was built in 2019 — early enough that the team reached for the most familiar, pragmatic choices available at the time rather than the most durable. The result was a three-tier Node.js application running on AWS ECS, backed by a single Amazon RDS MySQL cluster with a 32-CPU instance class, a Redis caching layer for read-heavy endpoints, and a single Amazon SQS queue for asynchronous work. The application and database shared the same availability zone with no cross-AZ redundancy. Median transaction latency was 410ms under normal load; P99 quickly saturated above 2,000ms at just 350 TPS, and at approximately 500 TPS the RDS write queue began to back up so severely that new transactions simply stopped being acknowledged to merchant callers.
Reported uptime SLA for that platform was 99.5% per calendar year. In practice, FinPulse achieved 99.12% over its 2023 fiscal year, driven primarily by three recurring failure modes.
Failure Mode 1 — Database Write Queue Backpressure
The MySQL cluster handled every transaction write synchronously. Under peak load — roughly 400 concurrent writes — the InnoDB flush-to-disk cycle could not keep pace with the rate of committed transactions. Write queue depth increased monotonically. Within two minutes of sustained peak load, new callers began timing out at 800ms. Recovery took 8 to 12 minutes of load-shedding and write pause. Each incident killed approximately 18,000 transactions and generated roughly $180,000 in automatic SLA reparations.
Failure Mode 2 — Synchronous External Dependencies
Every routing decision in the platform required synchronous API calls to external providers: a bank availability check, an FX rate fetch, compliance validation through an identity provider, and an anti-fraud risk call. Four synchronous hops with a median overhead of 80ms each added a typical 320ms of path latency before a single database operation was attempted, and that was on a clean, uncongested network. Under load, any provider that exceeded its own two-second timeout added a full retry cycle, compounding path latency unpredictably.
Failure Mode 3 — Monolithic Deployment and Recovery Risk
FinPulse's deployment process shipped the entire application in a single ECS task image, meaning every code push, regardless of the size of the change, required re-deploying all services together. Deployment windows averaged 32 minutes. The rollback process was manual and required a database migration reversal, adding another 20-minute delay before the system was operational again. No blue-green or canary pattern had ever been implemented. Recovery from any production incident was a binary choice: keep serving with degraded performance, or stop serving entirely and accept a customer-impacting outage. The company chose degraded serving more often than not, which kept the clock running on SLA penalty accumulation.
These three failure modes compounded one another: latency climbed, availability slipped, merchant trust eroded, and the business was accumulating SLA debt faster than it could secure renewals. The 90-day window before renewal season was the forcing function that made the rebuild possible.
Goals
Technical Goals
We agreed on four concrete, measurable technical objectives before writing a single line of code.
Goal 1 — Throughput: 4,000 TPS sustained, 5,000 TPS burst for 10 minutes. This gave a 10× safety margin over the 2023 peak of 400 TPS, allowing for at least five years of organic growth without another architectural intervention.
Goal 2 — End-to-end latency P99: 500ms or less. The contractual latency SLA with enterprise merchants was 800ms end to end. Setting an internal P99 target of 500ms with a 300ms buffer allowed engineering to use the headroom for retry logic and error-recovery without burning through the contractual budget.
Goal 3 — Availability: 99.95% monthly. This required eliminating the write-queue backpressure failure mode entirely and reducing the mean time to recovery (MTTR) for any incident below four minutes.
Goal 4 — Deploy without friction. Any single engineer should be able to deploy a code change to production in under five minutes with automated canary validation and zero coordination.
Business Goals
Technical success metrics needed to connect directly to business outcomes:
Revenue protection and growth: FinPulse had approximately $9.8M in enterprise renewals at risk during the Q1 2025 renewal window. The platform rebuild was designed to clear those renewals at full rate and at the higher price tier that FinPulse was offering in its new enterprise pricing model.
SLA penalty elimination: Full-year 2023 SLA penalties had reached approximately $4.7M, roughly 5% of total annual revenue. Zero SLA penalty consumption in 2025 was a hard business requirement.
Velocity improvement: Engineering management had set an internal velocity target of 2× increase in feature delivery throughput after the rebuild. The monolithic architecture had driven deployment frequency down to approximately once per month per team.
What We Did Not Do
Scope was tight: 10 weeks. We explicitly ruled out several directions that would have sounded appealing in isolation but would have added weeks or months without delivering measurable business value against the stated goals.
No cross-border data-pipeline rebuild. The existing ETL pipeline for financial reporting was slow and brittle, but it was not contributing to the peak-throughput failure mode. It was deferred to a subsequent project.
No multi-region active-active deployment. The most ambitious architects would argue for multi-region active-active as a reliability baseline for a critical financial platform. In a 10-week window, active-active was beyond scope. We committed to active-passive with automated failover in a second availability zone — achievable, measurable, and sufficient for the stated availability target.
No front-end changes. The merchant dashboard was rebuilt as a separate effort on a different team timeline. Our scope was platform latency, throughput, and availability — not the merchant UI.
Approach
Architecture Decision: Event-Driven Microservices with CQRS
FinPulse's processing problem was a close match for a command-query responsibility segregation (CQRS) pattern with event-stream synchronization. The primary failure modes (write queue backpressure, synchronous externals, and slow read path recovery) all pointed to a single architectural problem: the application was mixing write-path demand (transaction orchestration) and read-path demand (merchant dashboards, reporting queries, reconciliation APIs) on the same database.
Our decision split every transaction into two stages:
Command path (write, synchronous, low-latency-required): A transaction enters the system, a routing decision is made, and a result is returned to the caller. This path must complete in 500ms or less and cannot wait for expensive downstream operations like reconciliation, reporting updates, or fee calculation. We offloaded those downstream writes into an asynchronous event stream.
Query path (read, asynchronous, no tight latency budget): Reporting writes, dashboard feeds, reconciliation runs, and compliance audit logs consume events from a durable stream and maintain their own materialised read models. They never run synchronously against the command path.
The asynchronous separation meant the database write queue no longer backed up because the synchronous command path handled only the minimal write-set: one transaction record. All downstream enrichment — ledger entry, auth-code storage, fee-model evaluation, reconciliation event — was handled in an eventually consistent model via stream consumers. A single write operation cannot cause a write-queue cascade failure by dragging downstream work into the synchronous path.
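The split described above can be illustrated with a minimal in-memory sketch. It is not FinPulse's implementation: the class names, the dictionary standing in for the transaction store, and the list standing in for the event stream are all assumptions made for illustration. The point it shows is that the command side performs exactly one write and one publish, while the read model catches up later from events alone.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class CommandSide:
    """Synchronous path: one minimal write, then one event."""
    records: dict = field(default_factory=dict)  # stand-in for the transaction store
    stream: list = field(default_factory=list)   # stand-in for the durable event stream

    def handle(self, txn_id: str, merchant: str, amount_cents: int, bank: str) -> dict:
        # The only synchronous write: the authoritative transaction record.
        self.records[txn_id] = {"merchant": merchant, "amount_cents": amount_cents, "bank": bank}
        # All downstream work is deferred by publishing a single event.
        self.stream.append({"txn_id": txn_id, "merchant": merchant,
                            "amount_cents": amount_cents, "bank": bank})
        return {"txn_id": txn_id, "routed_to": bank}


class MerchantTotalsReadModel:
    """Asynchronous path: a materialised view built only from events."""

    def __init__(self) -> None:
        self.totals = defaultdict(int)
        self.offset = 0  # index of the next unread event

    def catch_up(self, stream: list) -> None:
        for event in stream[self.offset:]:
            self.totals[event["merchant"]] += event["amount_cents"]
        self.offset = len(stream)


command_side = CommandSide()
read_model = MerchantTotalsReadModel()

command_side.handle("t1", "acme", 1_000, "bank-a")
command_side.handle("t2", "acme", 2_500, "bank-b")

stale_total = read_model.totals["acme"]   # nothing consumed yet: eventually consistent
read_model.catch_up(command_side.stream)
fresh_total = read_model.totals["acme"]
```

Between the two reads the view is stale, which is the eventual-consistency window the query path accepts in exchange for keeping the command path short.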
Technology Stack Choices
Given the 10-week timeline, we needed to minimise new operational surface area. Every technology chosen was either already in FinPulse's AWS environment or was a managed service requiring minimal team learning.
| Layer | Technology | Rationale |
|---|---|---|
| Command API | AWS API Gateway + Lambda | Eliminates container orchestration complexity for synchronous ingress; automatic per-transaction scaling |
| Event streaming | AWS Kinesis Data Streams | At-least-once delivery semantics, per-record checkpoint capability, sub-second propagation latency at 5,000 TPS |
| Command transactions | Amazon DynamoDB | Single-digit millisecond writes at any scale; no per-request connection management; ACID transactions via TransactWriteItems |
| Routed transactions | Aurora PostgreSQL (serverless v2) | Preserves transactional guarantees for reconciliation and ledger use cases while scaling independently |
| Routing service | AWS Fargate on ECS | Stateful routing logic that needed persistent connections to acquiring bank APIs; Fargate isolates this complexity from Lambda cold-start behaviour |
| Bank API calls | Node.js with built-in retry + Circuit Breaker (opossum) | Team familiarity; opossum proved reliable at 400 ms circuit-break open thresholds |
| Observability | Datadog APM + CloudWatch + X-Ray | FinPulse already owned Datadog; added X-Ray integration for Lambda traces and transaction-level span annotation |
External Dependency Isolation
Four synchronous external API hops made latency uncontrollable under load. We tackled this with two complementary strategies.
Strategy 1 — Parallelise where ordering allowed: The identity/compliance check and FX rate fetch could execute in parallel because neither is a prerequisite for the other. Server-side routing orchestration in the Fargate routing service executed these in Promise.all with a four-second combined timeout, reducing the worst-case path from four sequential hops to two sequential plus one parallel step.
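As a rough illustration of the parallel step, the sketch below uses Python's asyncio.gather under a single deadline as an analogue of Promise.all in the Node.js routing service. The two coroutines, their simulated 200ms latencies, and the returned values are hypothetical stand-ins for the real provider calls.

```python
import asyncio
import time
from typing import Tuple


async def fetch_fx_rate() -> float:
    await asyncio.sleep(0.2)   # simulated provider latency
    return 1.08


async def check_compliance() -> bool:
    await asyncio.sleep(0.2)   # simulated provider latency
    return True


async def independent_checks(timeout_s: float) -> Tuple[float, bool]:
    # Neither call depends on the other, so both run concurrently under one deadline.
    rate, compliant = await asyncio.wait_for(
        asyncio.gather(fetch_fx_rate(), check_compliance()),
        timeout=timeout_s,
    )
    return rate, compliant


start = time.perf_counter()
rate, compliant = asyncio.run(independent_checks(timeout_s=4.0))
elapsed = time.perf_counter() - start   # close to one hop, not the sum of two

# With a deadline shorter than the slowest call, the whole step fails together.
try:
    asyncio.run(independent_checks(timeout_s=0.01))
    timed_out = False
except asyncio.TimeoutError:
    timed_out = True
```

The combined step costs roughly the latency of the slower call, which is why two of the four hops collapse into one.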
Strategy 2 — Circuit breakers and timeouts on every external call: We instrumented every outbound call with opossum circuit breakers configured to open after three consecutive failures or two consecutive timeouts. Open circuits returned cached or default values — for example, a frozen FX rate from the 30-second regional cache rather than a blocking call to a rate provider that was already degraded.
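The breaker behaviour described above can be sketched in a few lines. This is a simplified Python model, not opossum's API: the class, its parameters, the 30-second cool-down, and the FX values are assumptions chosen for illustration. It opens after three consecutive failures, serves the cached value while open, and lets one trial call through after the cool-down.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    def __init__(self, call: Callable[[], float], fallback: Callable[[], float],
                 failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.call = call
        self.fallback = fallback
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.consecutive_failures = 0
        self.opened_at: Optional[float] = None

    @property
    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        # Once the cool-down has passed, allow a trial call (half-open).
        return self.clock() - self.opened_at < self.reset_after_s

    def fire(self) -> float:
        if self.is_open:
            return self.fallback()            # fail fast, no outbound call
        try:
            result = self.call()
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return self.fallback()
        self.consecutive_failures = 0         # success closes the circuit
        self.opened_at = None
        return result


# Fake clock and provider so the example runs instantly.
state = {"calls": 0, "healthy": False, "now": 0.0}
CACHED_RATE = 1.07                            # hypothetical last known-good FX rate


def fx_provider() -> float:
    state["calls"] += 1
    if not state["healthy"]:
        raise ConnectionError("provider degraded")
    return 1.09


breaker = CircuitBreaker(fx_provider, lambda: CACHED_RATE, clock=lambda: state["now"])

results = [breaker.fire() for _ in range(5)]  # 3 real failures, then 2 short-circuited
calls_while_degraded = state["calls"]

state["healthy"] = True
state["now"] = 31.0                           # cool-down elapsed
recovered = breaker.fire()
```

Only three of the five attempts reach the degraded provider; the other two return the cached rate without any outbound call.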
Implementation
Week 1 — Baseline Measurement and Infrastructure Prep
Before writing any migration code, we spent the first week building the measurement layer and core infrastructure scaffold. The most important decision of the entire project was made in this week: establish a factual, organisationally trusted baseline before any changes were made.
The baseline we captured against a production mirror:
- P99 end-to-end transaction latency at 400 TPS: 2,800ms
- RDS write queue depth at 400 TPS: 3,400 pending writes
- RDS CPU at 400 TPS: 96% across the 32-CPU primary instance
- Average synchronous external API latency per hop: 88ms (low load), 420ms (peak)
- Deployment time per full release cycle: 47 minutes
With baseline defined, we provisioned command infrastructure — Kinesis data stream, DynamoDB table with 4,000 RCU and 4,000 WCU, Aurora PostgreSQL serverless v2 cluster, and Cognito-based service identity — using AWS CDK. All infrastructure was version-controlled and deployed via a CI pipeline before any application code was written, ensuring that the infrastructure could not drift from its defined state during the project.
Week 2 — Command Path and Event Streaming
The command API gateway handled the transaction ingress path. Each arriving transaction was validated, a routing decision was made via the Fargate routing service, and the result was written to DynamoDB as the authoritative transaction record. A single asynchronous event was published to Kinesis with the routing decision before the command API returned a response to the caller — typically in under 120ms. The caller received an immediate acknowledgement within 800ms of submission; all downstream enrichment ran asynchronously.
The key insight here is that the caller only needed to know the routing decision — which acquiring bank to hit, with which credentials, at which endpoint. The caller did not need the ledger entry, auth-code storage, or reconciliation event to complete. Decoupling the two paths meant a spike in transaction volume could not cascade into a reconciliation pipeline failure, and vice versa.
Three asynchronous stream consumers were created in parallel:
- Ledger consumer — writes to Aurora PostgreSQL for double-entry ledger maintenance
- Auth-code consumer — stores bank authentication codes with expiry for merchant retrieval
- Reconciliation consumer — writes to the reconciliation engine for nightly bank-statement matching
None of these consumers sat on the synchronous path. Each operated at eventual consistency, with at-least-once delivery semantics guaranteeing that every transaction event would be processed, possibly more than once.
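At-least-once delivery means a consumer can see the same event twice, so a ledger-style consumer generally needs to be idempotent. The sketch below shows one common way to do that, deduplicating on an event ID. The class, the event_id field, and the in-memory set are assumptions for illustration; the source does not describe how FinPulse's consumers handled redelivery.

```python
class LedgerConsumer:
    """Applies each event once, even if the stream delivers it again."""

    def __init__(self) -> None:
        self.balance_cents = 0
        self.seen_event_ids = set()

    def process(self, event: dict) -> bool:
        if event["event_id"] in self.seen_event_ids:
            return False                       # duplicate delivery: skip
        self.balance_cents += event["amount_cents"]
        self.seen_event_ids.add(event["event_id"])
        return True


deliveries = [
    {"event_id": "e1", "amount_cents": 500},
    {"event_id": "e2", "amount_cents": 250},
    {"event_id": "e1", "amount_cents": 500},   # redelivered after a consumer retry
]

consumer = LedgerConsumer()
applied = [consumer.process(event) for event in deliveries]
```

In a real consumer the set of seen IDs would live in durable storage and be updated in the same transaction as the ledger write; the in-memory set here only demonstrates the check.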
Week 3 — Routing Service and External Dependency Hardening
The routing service needed persistent TCP connections to 34 acquiring bank APIs. This is the part that could not move to Lambda: cold starts and the lack of persistent connections made Lambda unsuitable for bank-grade connection management. Fargate on ECS with a warm pool of two tasks provided exactly what was needed: sub-second start time, process-level connection pooling, and independent scaling per routing lane.
We also architecturally separated the routing decision — which bank to hit — from the bank API call — actually posting the transaction to the bank. By extracting routing decision logic into a rules engine backed by Redis, pure routing decisions complete in under 5ms before any external call is made. The expensive external calls then execute asynchronously through the stream consumer that handles bank-api responses and reconciliation.
Week 4 — Kinesis Backlog Handler and Observability Layer
Kinesis throughput is driven in part by the number of shards and in part by per-shard throughput capacity. We ran load tests at peak simulated throughput in weeks 4 and 5 to size Kinesis correctly and to validate that stream consumers could keep pace with a writer producing 5,000 events per second across 20 shards.
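The sizing question can be approximated with simple arithmetic. In provisioned mode each Kinesis shard accepts up to 1,000 records per second and about 1 MB per second of writes, so the shard count is set by whichever bound binds first. The record sizes below are hypothetical (the source does not state them), and the helper name is illustrative.

```python
import math

# Per-shard write limits for Kinesis Data Streams in provisioned mode.
MAX_RECORDS_PER_SHARD_PER_S = 1_000
MAX_BYTES_PER_SHARD_PER_S = 1_000_000   # 1 MB/s, taken conservatively as 10^6 bytes


def min_shards(events_per_s: int, avg_record_bytes: int, headroom: float = 1.0) -> int:
    """Smallest shard count that satisfies both the record and byte limits."""
    by_count = events_per_s / MAX_RECORDS_PER_SHARD_PER_S
    by_bytes = events_per_s * avg_record_bytes / MAX_BYTES_PER_SHARD_PER_S
    return math.ceil(max(by_count, by_bytes) * headroom)


small_records = min_shards(5_000, avg_record_bytes=512)     # record-count bound
large_records = min_shards(5_000, avg_record_bytes=4_000)   # byte-rate bound
```

With small records the 5,000 events per second need only five shards; the byte limit takes over as records grow, which is one reason load testing with realistic payloads matters more than the arithmetic alone.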
The observability layer let us make those sizing decisions with confidence. Datadog APM traced the full end-to-end path for every transaction from API Gateway, through the routing service, into Kinesis, and out through stream consumers, with a per-transaction trace ID propagated at every hop. Service-level dashboards tracked the Kinesis PutRecords.Success rate and the consumer GetRecords.IteratorAge. The latter was critical: a growing IteratorAge meant consumers were falling behind, and records left unread past the stream's retention period would expire before being processed, while the growing backlog risked retry storms downstream.
Week 5 — Load Testing and Chaos Engineering
We ran a structured load test in week 5 against a production-scale environment mirroring the real AWS VPC, using k6 to simulate 5,000 concurrent writers each posting at 1 TPS. That produced 5,000 TPS held for ten minutes, after a 15-minute ramp-up.
Two things broke in testing that had not broken in development:
DynamoDB write throttling. The 5,000 TPS test load exceeded the 4,000 provisioned WCU, and once burst capacity was exhausted DynamoDB began throttling writes. The capacity we had provisioned was only just within the margin of the projected peak write load. The fix was to switch the table to on-demand mode, which removed the provisioned ceiling at a cost increase of approximately ten percent.
Aurora PostgreSQL was not actually running serverless v2. The infrastructure code had scaffolded provisioned capacity instead, meaning the database capped out at approximately 2,500 TPS writes rather than scaling to match the consumer flow. The problem was detected in load testing and corrected before the next test.
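The DynamoDB shortfall follows from how write capacity is metered: one WCU covers one standard write per second for an item of up to 1 KB, with item size rounded up to the next kilobyte and transactional writes costing double. The sketch below applies that rule; the 800-byte item size is a hypothetical figure, not one stated in the source.

```python
import math


def required_wcu(writes_per_s: int, item_bytes: int, transactional: bool = False) -> int:
    """Write capacity units needed for a steady write rate."""
    units_per_write = math.ceil(item_bytes / 1024)   # size rounds up to the next 1 KB
    if transactional:
        units_per_write *= 2                          # transactional writes cost double
    return writes_per_s * units_per_write


PROVISIONED_WCU = 4_000

at_sustained_target = required_wcu(4_000, item_bytes=800)
at_burst_target = required_wcu(5_000, item_bytes=800)
shortfall = at_burst_target - PROVISIONED_WCU
```

Under these assumptions the table had no headroom at the sustained target and was 1,000 WCU short at the burst target, and the gap doubles if the record is written through TransactWriteItems.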
Weeks 6–7 — Data Migration and Cutover Preparation
Data migration was the riskiest element of the entire project. FinPulse had approximately 140M transaction records in MySQL. We could not afford a Big Bang cutover of the write path — every merchant would have lost data during the cutover window. Instead, we used a dual-write pattern:
During a four-hour window on a low-traffic Sunday night, FinPulse's existing platform began writing every new transaction to both MySQL and DynamoDB simultaneously. We used a CDC pipeline (Debezium on the MySQL binlog) to stream existing records into DynamoDB as a batch. Once the backlog was fully synchronised, after approximately 36 hours of batch processing, the command-path switch was gated behind a feature flag.
A team of five engineers stood by during the 20-minute cutover window with a full rollback procedure tested in rehearsal the week before. Rollback time was under two minutes. Only one rollback was needed: on the first attempt, a misconfigured feature flag in the Kinesis Producer Library caused the dual-write to fail silently on 20% of transactions. The flag was corrected, and a five-minute grey release confirmed the fix.
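A silent dual-write failure of the kind described above is only visible if something compares the two stores. The sketch below pairs a dual writer with a parity check that could gate a cutover. Every name in it is hypothetical, dictionaries stand in for MySQL and DynamoDB, and the flaky store that drops every fifth write is contrived to mimic a 20% silent loss.

```python
class DualWriter:
    """Writes to the legacy store first, then mirrors to the new store."""

    def __init__(self, legacy: dict, new: dict, mirror_enabled: bool = True) -> None:
        self.legacy = legacy
        self.new = new
        self.mirror_enabled = mirror_enabled   # feature flag for the second write

    def write(self, txn_id: str, record: dict) -> None:
        self.legacy[txn_id] = record           # legacy stays authoritative
        if self.mirror_enabled:
            self.new[txn_id] = dict(record)


def missing_or_mismatched(legacy: dict, new: dict) -> list:
    """Parity check to run before the command path is switched over."""
    return sorted(key for key, value in legacy.items() if new.get(key) != value)


class FlakyStore(dict):
    """Hypothetical target that silently drops every fifth write."""

    def __init__(self) -> None:
        super().__init__()
        self.attempts = 0

    def __setitem__(self, key, value) -> None:
        self.attempts += 1
        if self.attempts % 5 == 0:
            return                              # no error raised: the loss is silent
        super().__setitem__(key, value)


legacy_store = {}
new_store = FlakyStore()
writer = DualWriter(legacy_store, new_store)

for i in range(1, 11):
    writer.write(f"t{i:02d}", {"amount_cents": i * 100})

gaps = missing_or_mismatched(legacy_store, new_store)
safe_to_cut_over = not gaps
```

Because the legacy store remains authoritative until the parity check passes, a failed check means rolling back the flag rather than repairing lost data.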
Weeks 8–10 — Hardening and Gradual Traffic Ramp
Instead of shifting full traffic at once, we opened it at 10% in week 8, 30% in week 9, and 70% in week 10, using real production merchant traffic with the legacy platform running in parallel. AWS X-Ray distributed trace visualisation allowed the team to see end-to-end latency in real time across live traffic, and Datadog APM alerts triggered on a P99 breach in any four-minute window.
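One common way to implement a staged ramp like this is to hash a stable key into a bucket and compare it against the current rollout percentage. The sketch below assumes the split is made per merchant, which the source does not state; the function names and merchant IDs are illustrative.

```python
import hashlib


def bucket(merchant_id: str) -> int:
    """Map a merchant to a stable bucket in the range 0 to 99."""
    digest = hashlib.sha256(merchant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def routes_to_new_platform(merchant_id: str, rollout_percent: int) -> bool:
    return bucket(merchant_id) < rollout_percent


merchants = [f"merchant-{i}" for i in range(10_000)]

# Observed share of merchants on the new platform at each stage of the ramp.
share_at = {
    percent: sum(routes_to_new_platform(m, percent) for m in merchants) / len(merchants)
    for percent in (10, 30, 70, 100)
}

# Raising the percentage only ever adds merchants; nobody flips back and forth.
stays_on = all(
    routes_to_new_platform(m, 30) for m in merchants if routes_to_new_platform(m, 10)
)
```

Because the bucket depends only on the merchant ID, a merchant that moves to the new platform at 10% stays there at 30% and 70%, which keeps per-merchant behaviour consistent while the ramp proceeds.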
By the end of week 10 the platform was running at full production traffic with zero incidents for seven consecutive days. Engineering teams began migrating dependent services with confidence, disconnecting them one by one from the legacy platform and rebuilding them on the new event-driven stack.
Results
Quantitative Platform Improvements
| Metric | Pre-Rebuild (2023 peak) | Post-Rebuild (2024 peak) | Change |
|---|---|---|---|
| Sustained TPS | 400 | 4,800 | ↑ 12× |
| End-to-end latency P99 | 2,800ms | 380ms | ↓ 86% |
| Write queue depth at peak TPS | 3,400 pending writes | 12 pending writes | ↓ 99.6% |
| SLA incidents (Black Friday period) | 1,200 escalations | 0 | ↓ 100% |
| MTTR (average incident) | 18 minutes | 2.4 minutes | ↓ 87% |
| SLA penalties (annual) | $4.7M | $0 | ↓ 100% |
| Deployment frequency | Monthly | Multiple per week | ↑ ≥12× |
| AWS monthly infrastructure cost | $87,000 | $31,000 | ↓ 64% |
Business Outcomes
Six enterprise accounts that had signalled before the rebuild that they would not renew signed new contracts extending through 2026 at the 15% price increase tier. Total value of retained contracts: approximately $9.8M in annualised revenue. The platform rebuild directly and measurably drove that retention by demonstrating, to management and merchants alike, that the engineering team was capable of solving systemic reliability problems within a tight deadline.
FinPulse closed 12 new enterprise accounts in Q1 2025 — approximately 44% more than the same quarter in the prior year — with platform reliability cited as a deciding factor in 9 of those closes per sales team reporting.
Internal churn declined by approximately 39% among mid-level engineers over the 12 months following the rebuild. The chief engineering officer attributed this in part to the team's increased confidence after entering and surviving a high-stakes rebuild with a clear technical plan and visible positive outcomes.
Metrics
Performance Metrics
Three tiers of performance metrics drove the engineering decisions, post-launch monitoring, and incident response throughout the project and continued to structure the platform's reliability operations afterwards.
Tier 1 — Merchant-facing SLIs: end-to-end transaction latency P99, transaction acknowledgement rate (percentage of transactions acknowledged within contract SLA timeout), and platform uptime percentage. These are the metrics merchant contracts reference.
Tier 2 — Internal service SLIs: per-service error rate, write path operation duration, Kinesis GetRecords IteratorAge, and circuit-breaker open state. These are internal operational guardrails.
Tier 3 — Cost and velocity: infrastructure cost per million transactions processed, deployment lead time, and number of deployments per week. These are the metrics engineering leadership tracks for team health and business productivity.
Together these three tiers form a three-layer observability pyramid that covers both what the business directly cares about and the early-warning signals that detect problems before they reach the merchant contract.
Synthetic Monitoring and Error Budget Policy
FinPulse adopted a formal error-budget policy in the month after go-live, inspired by Google's Site Reliability Engineering approach. Every merchant-facing SLI was tracked against a monthly error budget (for example, no more than 21.9 minutes of monthly downtime at 99.95% availability). When a team's services consumed more than 50% of the monthly budget before the month was 50% complete, the on-call team paused all feature releases and focused exclusively on reliability until the budget consumed fell below 50%. The policy was invoked only twice in the 18 months following the rebuild — both times in the same month as a bank API degradation unrelated to FinPulse's internal operations — and reduced per-month incident count on average by approximately 60% compared to the same period before policy adoption.
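The 21.9-minute figure and the freeze rule can be reproduced with a short calculation. The sketch assumes an average month of 365.25 / 12 days, which is what yields 21.9 minutes (a 30-day month gives 21.6). The function names are illustrative rather than part of FinPulse's tooling.

```python
def monthly_error_budget_minutes(slo: float, days_in_month: float = 365.25 / 12) -> float:
    """Allowed downtime per month for a given availability target."""
    return (1.0 - slo) * days_in_month * 24 * 60


def should_freeze_releases(downtime_min: float, budget_min: float, month_elapsed: float) -> bool:
    """Freeze if over half the budget is spent before half the month has passed."""
    return downtime_min > 0.5 * budget_min and month_elapsed < 0.5


budget = monthly_error_budget_minutes(0.9995)

# 12 minutes of downtime by day 12 of a 30-day month exceeds half the budget early.
freeze_early_burn = should_freeze_releases(12.0, budget, month_elapsed=0.4)
freeze_light_burn = should_freeze_releases(5.0, budget, month_elapsed=0.4)
```

Expressing the policy as a function of budget consumed and time elapsed makes the trigger mechanical, so the decision to pause releases does not depend on a judgment call during an incident.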
Lessons Learned
1. Establish Baseline Metrics Before Engineering Begins
Week 1 of the project established a factual baseline that everyone — merchant success, engineering leadership, and the delivery team — agreed on before any changes were committed. That alignment made it possible to argue against scope expansion, demonstrate measurable improvement to skeptical stakeholders, and set recovery targets for every engineering decision. Without a shared baseline, the rebuild could have become a technical exercise disconnected from business outcomes.
2. The Easiest Failure Mode to Fix Is the One You Find in Load Testing, Not Production
DynamoDB throughput limits and Aurora serverless v2 provisioning issues emerged in load testing in week 5, before any production traffic had moved to the new platform. Both were fixed in a matter of hours. Had they gone undetected, they would have surfaced on Black Friday with merchant escalations already in flight.
3. Dual-Write and CDC Are Safer Than Big-Bang for Financial Writes
Financial transaction data is the one type of system state where correctness is not negotiable. The dual-write pattern during cutover — writing to both databases until reconciliation was confirmed — allowed a two-minute return to the previous state without any transaction loss. A cutover where both databases maintained a brief period without synchronization would have left 20,000 transactions in an uncertain state. The dual-write approach cost approximately 36 hours of extra processing but eliminated the risk.
4. Circuit Breakers and Eventual Consistency Must Be Designed, Not Added Later
The event-driven architecture made eventual consistency an inherent property of the system. That consistency was not a problem to tolerate but a design feature to exploit: by splitting command and query paths, eventual consistency of the read model never blocked the write path. Circuit breakers for external banks were designed in from the start, not bolted on afterward. This is an architectural decision, not a configuration default.
5. A Dedicated Platform Team Unlocks Product Team Velocity
FinPulse established a formal platform engineering team (three engineers, one manager) as part of the post-rebuild handoff. This team owned the event streaming infrastructure, CI/CD pipeline, Kinesis backpressure configuration, observability standardisation, and infrastructure support. The product engineering teams were freed to focus entirely on merchant-facing features. Eighteen months after the rebuild, the platform team's NPS score, measured via quarterly engineering surveys, reached 72, while the NPS for pay and benefits across broader engineering was approximately the same. This suggested that the platform team's investment in reducing the operational burden on product engineering was paying compound returns in both quality of life and speed of product delivery.
6. Bloom Filters Are the Right Default for Rate-Limiting Outer Layer
One post-launch optimisation we implemented 18 months into operation was a bloom filter pre-check for standard routing rules, which diverted 94% of requests away from the heavier routing engine before any Kinesis write was attempted. The implementation was approximately 30 lines of code and cost roughly $400 per month in DynamoDB read capacity; it reduced Kinesis PUT costs by approximately $6,700 per month, a 17× return. Small optimisations at the request-entry layer compound enormously across the volume profile of a platform processing 140M transactions per year.
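A bloom filter of roughly that size can be sketched as follows. This is not FinPulse's code: the filter parameters, the merchant IDs, and the choice to store the keys that need the full routing engine are assumptions. That choice matters because a bloom filter never returns a false negative, so a "not present" answer can safely send a request down the fast path, while an occasional false positive only costs an unnecessary trip through the heavier engine.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits: int = 8192, num_hashes: int = 4) -> None:
        self.size_bits = size_bits           # must be a multiple of 8
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key: str) -> list:
        # Derive k bit positions from two halves of one SHA-256 digest (double hashing).
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size_bits for i in range(self.num_hashes)]

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


needs_full_engine = BloomFilter()
for merchant in ("m-17", "m-42", "m-99"):    # hypothetical merchants with custom rules
    needs_full_engine.add(merchant)


def take_fast_path(merchant_id: str) -> bool:
    # A negative answer is certain, so skipping the heavy engine is safe.
    return not needs_full_engine.might_contain(merchant_id)


fast_path_count = sum(take_fast_path(f"standard-{i}") for i in range(1_000))
```

Sizing the bit array and hash count against the expected number of stored keys controls the false-positive rate, which in this design translates directly into how much traffic still reaches the full engine unnecessarily.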
Conclusion
Payment orchestration might sound like an unglamorous infrastructure problem — but it sits at the intersection of reliability, regulatory compliance, financial accountability, and merchant trust in a way that very few engineering problems do. The FinPulse rebuild proved that the same architecture and operational discipline that powers the most demanding consumer-facing platforms are directly applicable and directly valuable in financial operations infrastructure.
The three things that, in retrospect, determined whether this project succeeded or failed were: a tight, explicitly defined scope refusing feature expansion; a rigorous baseline and load-testing regime that surfaced practical failure modes before production exposure; and an organisational culture willing to invest in platform velocity as a function of merchant retention rather than treating infrastructure debt as a quarterly burn problem. Payment platforms that are processing millions of dollars per hour cannot wait for a perfect architecture before they modernise — the architecture has to improve fast enough to protect the platform's most valuable currencies: reliability, latency, and merchant trust.
FinPulse processes approximately $70M per month through the rebuilt platform as of Q2 2026 and has had zero SLA incidents in 18 consecutive months of production. The engineering team now spends 76% of its time on feature and merchant-facing work, with platform reliability treated as a feature rather than a cost centre.
