How We Helped a FinTech Startup Scale Transaction Processing from 1K to 100K TPS Without Downtime
A 100x throughput leap, zero downtime, and a 40% reduction in operational costs: here's the full playbook behind one of the most demanding infrastructure migrations we've ever led.
Case Study, Fintech, Scalability, Kafka, Microservices, Cloud Infrastructure, Performance, DevOps, CQRS
---
## Overview
In early 2025, a Series B fintech startup approached us with a ticking clock. Their payment infrastructure, built on a monolithic Node.js service, was buckling under growth: 1,000 transactions per second (TPS) had become the ceiling, and peak loads during regional payment windows were causing 3–5 minute outages every quarter. The business was losing merchant trust, and the engineering team was burning out on incident post-mortems.
Over the next 14 weeks, our team designed and executed a full platform migration — from a single-threaded monolith to an event-driven, horizontally scalable architecture. The result was a 100x throughput increase, zero downtime during Black Friday–season spikes, and a 40% reduction in monthly cloud spend.
---
## Challenge
The client operated a payment orchestration platform that sat between merchants, acquiring banks, and regional payment gateways. Three core problems made scaling uniquely difficult:
1. **Synchronous call chains.** Every transaction flowed through a single Express.js server with no queueing. A slow bank API response blocked the entire worker pool.
2. **Stateful session locking.** User sessions and pending transactions were stored in-memory, meaning horizontal scaling required sticky sessions — a fatal flaw during autoscaling events.
3. **Legacy database schema.** The primary PostgreSQL instance handled both transactional writes and complex analytics queries on the same tables, causing lock contention that spiked with traffic.
Beyond technical debt, there was organizational pressure: a critical merchant onboarding deadline in 90 days meant the platform had to handle 50K TPS before the end of Q3. Any slip would mean broken contracts and reputational damage in a tight B2B market.
---
## Goals
We aligned the project around four measurable objectives:
- **Throughput:** Sustain 100,000 TPS during simulated peak load with p99 latency under 200ms.
- **Availability:** Achieve 99.99% uptime during the migration period — no planned or unplanned downtime windows exceeding 30 seconds.
- **Cost efficiency:** Reduce monthly infrastructure spend by at least 30% through right-sizing and eliminating overprovisioned failover capacity.
- **Developer velocity:** Cut deploy times from 45 minutes to under 10 minutes and reduce incident resolution time by 50%.
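
To make the availability objective concrete, it helps to translate the percentage into a downtime budget. The sketch below is plain arithmetic, not client tooling; the function name `downtime_budget_seconds` is our own, and the 14-week window comes from the migration timeline.

```python
def downtime_budget_seconds(availability_pct: float, window_days: float) -> float:
    """Seconds of downtime permitted in a window at a given availability target."""
    window_seconds = window_days * 24 * 60 * 60
    return window_seconds * (1 - availability_pct / 100)


# 99.99% target over the 14-week migration window.
budget = downtime_budget_seconds(99.99, window_days=14 * 7)
print(f"{budget / 60:.1f} minutes")  # prints "14.1 minutes"
```

Under these assumptions the whole migration could tolerate roughly 14 minutes of cumulative downtime, which is why the 30-second cap on any single window matters.
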
---
## Approach
Rather than attempting a risky big-bang rewrite, we chose a **strangler-fig pattern**: incrementally peel off functionality from the monolith, route traffic through new services, and decommission old code only after the replacement proved stable in production.
### 1. Event-Driven Core
We introduced Apache Kafka as the central nervous system. Instead of synchronous HTTP calls between services, all state changes (payment authorized, settlement failed, refund initiated) became immutable events. This decoupled the ingestion layer from downstream processing and eliminated head-of-line blocking.
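
As a rough illustration of what an immutable event can look like, here is a minimal sketch. The `PaymentEvent` class, its field names, and the choice of transaction ID as the message key are assumptions for illustration, not the client's actual schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class PaymentEvent:
    event_type: str       # e.g. "payment_authorized", "settlement_failed"
    transaction_id: str
    merchant_id: str
    amount_minor: int     # amount in minor units (cents) to avoid float rounding
    currency: str

    def key(self) -> bytes:
        # With key-based partitioning, events that share a key go to the
        # same partition, so one transaction's events keep their order.
        return self.transaction_id.encode("utf-8")

    def value(self) -> bytes:
        # Stable JSON encoding of the event payload.
        return json.dumps(asdict(self), sort_keys=True).encode("utf-8")


event = PaymentEvent("payment_authorized", "txn-001", "m-42", 1999, "EUR")
```

The `key()` and `value()` bytes are what a Kafka producer would be given; the producer call itself is omitted because it depends on the client library in use.
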
### 2. CQRS & Event Sourcing
For the highest-traffic paths — authorization and capture — we separated read and write concerns. Write models persisted command events to Kafka and a compact PostgreSQL table. Read models projected into Redis clusters keyed by merchant and transaction ID, giving sub-millisecond lookups for status pages and dashboards.
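
The read side of this split can be pictured as a fold over the event stream. In the sketch below a plain dictionary stands in for Redis, and the event type names and status values are hypothetical; only the keying by merchant and transaction ID comes from the design described above.

```python
from typing import Dict, Iterable, Tuple

# Assumed mapping from event type to the status a dashboard would show.
STATUS_BY_EVENT = {
    "payment_authorized": "authorized",
    "payment_captured": "captured",
    "refund_initiated": "refund_pending",
}

ReadModel = Dict[Tuple[str, str], str]


def project(events: Iterable[dict], read_model: ReadModel) -> ReadModel:
    """Fold an ordered event stream into a status lookup table."""
    for event in events:
        status = STATUS_BY_EVENT.get(event["type"])
        if status is None:
            continue  # skip event types this projection does not track
        read_model[(event["merchant_id"], event["transaction_id"])] = status
    return read_model


events = [
    {"type": "payment_authorized", "merchant_id": "m-42", "transaction_id": "txn-001"},
    {"type": "payment_captured", "merchant_id": "m-42", "transaction_id": "txn-001"},
    {"type": "payment_authorized", "merchant_id": "m-42", "transaction_id": "txn-002"},
]
model = project(events, {})
```

Because the projection only ever applies the latest event per key, replaying the same ordered stream rebuilds the same read model, which is what makes backfills and rebuilds possible.
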
### 3. Async Worker Pools
CPU-intensive tasks (3-D Secure redirect validation, risk scoring, currency conversion) moved to Go-based worker services connected to Kafka consumer groups. Because workers were stateless, Kubernetes Horizontal Pod Autoscaling could react to lag metrics in under 30 seconds.
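
Lag-based scaling follows the proportional rule that the Kubernetes Horizontal Pod Autoscaler documents: desired replicas equal the current count scaled by the ratio of observed metric to target metric, rounded up. The sketch below reproduces that arithmetic only; the lag figures and replica bounds are made-up values, and how lag reaches the autoscaler (for example through a metrics adapter) is not shown.

```python
import math


def desired_replicas(current_replicas: int, current_lag_per_pod: float,
                     target_lag_per_pod: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Proportional scaling rule, clamped to the configured replica bounds."""
    ratio = current_lag_per_pod / target_lag_per_pod
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical numbers: 12 pods, each 2.5x over its target consumer lag.
print(desired_replicas(12, 2500, 1000, min_replicas=4, max_replicas=60))  # prints 30
```

The clamp matters in practice: without an upper bound, a sudden lag spike could request more pods than the cluster or the topic's partition count can usefully support.
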
### 4. Database Decomposition
We split the monolith database into three bounded contexts:
- **Transactions DB:** Write-optimized, sharded by merchant ID.
- **Settlements DB:** Eventually consistent, batch-updated every 5 minutes.
- **Analytics DB:** Read replicas via logical replication, serving BI queries without affecting OLTP performance.
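
Sharding by merchant ID needs a routing function that always sends a given merchant to the same shard. The following is one simple way to do that, assuming a fixed shard count and hash-modulo placement; the source does not state which scheme was actually used, and `shard_for_merchant` is a hypothetical helper.

```python
import hashlib


def shard_for_merchant(merchant_id: str, shard_count: int) -> int:
    """Map a merchant ID to a shard index in the range [0, shard_count)."""
    # SHA-256 is used for a stable, well-distributed hash; Python's built-in
    # hash() is randomized per process and would not be stable across restarts.
    digest = hashlib.sha256(merchant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


shard = shard_for_merchant("m-42", shard_count=16)
```

Plain modulo placement is easy to reason about but reshuffles most merchants if the shard count changes, so a lookup table or consistent hashing may be a better fit when resharding is expected.
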
---
## Implementation
The implementation spanned four phases over 14 weeks.
**Weeks 1–3: Foundation & Observability**
Before writing a single line of new business logic, we instrumented everything. OpenTelemetry collectors fed traces, metrics, and logs into a Grafana stack. We established SLOs for latency, error rate, and throughput, and built automated canary deployments using Argo Rollouts.
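
An SLO only gates a canary if it is reduced to a pass/fail check. Here is a minimal sketch of such a check, assuming the 200ms p99 target from the project goals and a hypothetical 0.1% error-rate ceiling; it illustrates the logic and is not the Argo Rollouts configuration itself.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def canary_passes(latencies_ms: Sequence[float], errors: int, requests: int,
                  p99_slo_ms: float = 200.0, max_error_rate: float = 0.001) -> bool:
    """Return True only if both the latency and error-rate SLOs hold."""
    if requests == 0 or not latencies_ms:
        return False  # no traffic observed: do not promote on missing data
    within_latency = percentile(latencies_ms, 99) <= p99_slo_ms
    within_errors = errors / requests <= max_error_rate
    return within_latency and within_errors
```

Treating an empty sample as a failure is a deliberate choice here: a canary that received no traffic has proven nothing.
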
**Weeks 4–8: Incremental Decomposition**
Using feature flags (LaunchDarkly), we routed 5% of live traffic to the new Kafka-backed ingestion path. We ran chaos tests weekly: killed worker pods, injected latency into bank API responses, and throttled database connections. Each failure revealed a bottleneck before it reached production at scale.
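
Percentage rollouts like the 5% split are typically implemented by hashing a stable identifier into a bucket, so the same caller always lands on the same path. The sketch below is a hand-rolled illustration of that idea, not LaunchDarkly's API; bucketing by merchant ID is an assumption, since the source does not say whether traffic was split per merchant or per request.

```python
import hashlib


def use_new_path(merchant_id: str, rollout_percent: float) -> bool:
    """Deterministically assign a merchant to the new path for a given rollout percentage."""
    digest = hashlib.sha256(merchant_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000  # bucket in 0..9999
    return bucket < rollout_percent * 100                # 5% -> buckets 0..499


routed = use_new_path("m-42", rollout_percent=5)
```

Setting `rollout_percent` to 0 sends every merchant back to the old path, which is the kill-switch behavior a flag-driven rollout relies on.
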
**Weeks 9–12: Data Migration & Cutover**
We backfilled 18 months of transaction history into the new read models using a CDC (Change Data Capture) pipeline. Once read consistency was verified, we performed a staged cutover — first internal staff, then 20% of merchants, then full production traffic. The old monolith stayed read-only for rollback until week 14.
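
Verifying read consistency comes down to comparing what the source of truth says against what the read model serves. A minimal sketch of that comparison follows, with dictionaries standing in for the source database and the read model; `find_mismatches` and the sample data are hypothetical.

```python
from typing import Dict, List


def find_mismatches(source: Dict[str, str], read_model: Dict[str, str]) -> List[str]:
    """Return transaction IDs that are missing, stale, or unexpected in the read model."""
    # Present in the source but absent or different in the read model.
    stale = [txn for txn, status in source.items() if read_model.get(txn) != status]
    # Present in the read model but unknown to the source.
    extra = [txn for txn in read_model if txn not in source]
    return sorted(stale + extra)


source = {"txn-001": "captured", "txn-002": "authorized", "txn-003": "refunded"}
read_model = {"txn-001": "captured", "txn-002": "captured", "txn-999": "authorized"}
print(find_mismatches(source, read_model))  # prints ['txn-002', 'txn-003', 'txn-999']
```

At 18 months of history a real check would run per shard or per time range and compare checksums rather than full rows, but the pass condition is the same: an empty mismatch list.
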
**Weeks 13–14: Hardening & Cleanup**
Post-migration, we focused on cost optimization. Spot instances replaced 60% of worker nodes. Database storage for cold settlements data moved from SSD to a cheaper storage tier. We also removed dead code paths and reduced container image sizes by 70%, cutting deploy times.
---
## Results
### Performance
The platform handled 112,000 TPS during the final load test — exceeding the target by 12%. p99 latency settled at 165ms, well within the 200ms SLO. During a live production simulation of regional peak traffic (91,000 TPS), the system autoscaled from 12 to 47 worker pods in 22 seconds with zero dropped connections.
### Reliability
Over the 14-week migration, the platform recorded 99.996% availability. No postmortems were required after week 5. The mean time to detect (MTTD) incidents dropped from 14 minutes to 47 seconds, thanks to proactive anomaly detection on the new observability stack.
### Cost
Monthly cloud spend fell from $48,000 to $28,500 — a 41% reduction. The biggest win came from eliminating overprovisioned failover databases and moving batch workloads to spot instances.
### Developer Experience
Deploy pipelines, previously fragile 45-minute monolith builds, now take 7 minutes on average. The team shipped 23 feature releases during the migration quarter, compared to 9 in the previous quarter, because independent services could be deployed without coordinating across a 40-person engineering org.
---
## Key Metrics at a Glance
| Metric | Before | After | Change |
|--------|--------|-------|--------|
| Max Sustained TPS | ~1,000 | 112,000 | +11,100% |
| p99 Latency | 890ms | 165ms | -81% |
| Monthly Uptime | 99.85% | 99.996% | +0.146pp |
| Monthly Cloud Spend | $48,000 | $28,500 | -41% |
| Deploy Time | 45 min | 7 min | -84% |
| Incident Resolution Time | 4.2 hours | 1.1 hours | -74% |
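
The Change column is a plain relative difference, and the figures can be reproduced from the Before and After values. The snippet below does only that arithmetic; the dictionary labels are ours.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


rows = {
    "Max Sustained TPS": (1_000, 112_000),
    "p99 Latency (ms)": (890, 165),
    "Monthly Cloud Spend ($)": (48_000, 28_500),
    "Deploy Time (min)": (45, 7),
    "Incident Resolution Time (h)": (4.2, 1.1),
}

for name, (before, after) in rows.items():
    print(f"{name}: {percent_change(before, after):+.0f}%")
```

Uptime is reported in percentage points rather than percent because it is already a ratio: 99.996 minus 99.85 gives the 0.146pp shown.
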
---
## Lessons Learned
**1. Observability is not optional.**
We instrumented before refactoring, and that decision saved weeks of debugging. When the first Kafka consumer lag spike hit during week 4, we had trace data to pinpoint the slow partition assignment in under 10 minutes. Without observability, we would have been guessing.
**2. Strangle, don't rewrite.**
The strangler-fig approach meant the business never stopped. Merchants processed transactions continuously, and feature flags gave us an instant kill switch if a new path misbehaved. A big-bang rewrite would have required a maintenance window that the client could not afford.
**3. Data migration is the hidden complexity.**
Backfilling 18 months of transaction data into new read models took longer than expected because of eventual consistency edge cases. We now budget 30% more time for data migration in any event-sourced project and run consistency checks at every shard boundary.
**4. Cost optimization works best when done post-feature-complete.**
If we had tried to choose spot instances and downsize databases during active development, we would have sacrificed reliability for savings. Instead, we built for correctness first, optimized spend second, and still achieved a 41% reduction.
---
## Conclusion
What made this project succeed was not any single technology choice — Kafka, CQRS, Go workers, Kubernetes — but the discipline of incremental change backed by real-time observability. The client now has a platform that can absorb 10x growth without another architectural overhaul, an engineering team that ships faster with confidence, and a lower cost structure that improves unit economics at every scale.
If your team is staring at a similar scaling cliff, the playbook is straightforward: instrument, decompose, async-ize, and cut over in stages. The hardest part isn't the code — it's the patience to let small, reversible changes compound into a transformative result.