How a Mid-Sized FinTech Startup Scaled Their Payment Processing from 100 to 10,000 Transactions Per Second
When a fast-growing Indian FinTech startup hit catastrophic failures during their biggest festival sale campaign, their engineering team knew they couldn't just reboot and pray. What unfolded over the next four months was a masterclass in distributed systems re-architecture — encompassing event-driven design, Redis-powered caching layers, Neon database sharding, and an observability stack that turned black-box failures into actionable insights. This case study unpacks the entire journey: the architecture before and after, every production incident that shaped decisions, the database migrations executed with zero downtime, and the quantifiable outcomes that turned a crashing platform into a 99.99% uptime hero.
Case Study · FinTech · Microservices · Distributed Systems · Kafka · PostgreSQL · Redis · Observability · Chaos Engineering
## Overview
PayStream, a Bangalore-based FinTech startup building payment processing infrastructure for mid-market e-commerce businesses, found itself at an inflection point in late 2024. Backed by a Series A, growing at 40% month-over-month, and processing over ₹1,200 crore annually, the platform was under more operational strain than its architecture was ever designed to handle. The systems that carried them to this point — a monolithic Node.js service, a single PostgreSQL instance, and a thin client-side SDK — were now the very things standing between PayStream and the next level of market dominance.
The tipping point arrived during Diwali 2024. A flash sale campaign for one of their largest enterprise clients — a fashion e-commerce brand processing ₹12 crore in GMV in a single evening — triggered a cascading failure across the platform. Connection pools exhausted within the first 90 seconds. PostgreSQL disk I/O spiked to 100% and stayed there for 22 minutes. Payment confirmations took an average of 47 seconds to reach consumers. Merchants flooded support channels. The CTO made a late-night incident call that ended with a roadmap redraw.
The engineering team of 18 — including 6 backend engineers, 3 DevOps, and 2 data engineers — had four months before the winter campaign season to rebuild the core payment processing engine from the ground up.
## The Challenge
To understand the scale of what needed to be rebuilt, it helps to picture what happened component by component.
**Monolithic Architecture Bottlenecks:** The existing payment service was a single Express.js application handling authentication, payment authorization, webhook delivery, fraud detection scoring, settlement reconciliation, and admin dashboards — all in one process. Any single heavy operation (a long-running fraud rule, a slow external PSP API response, a batch reconciliation job) could stall the entire event loop, degrading every user-facing operation. A single CPU-intensive fraud rule that took 2.5 seconds to execute against a batch of transactions could block the event loop long enough to create cascading timeouts across the entire service.
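To make the failure mode concrete, here is a minimal sketch — illustrative only, not PayStream's actual code — of how a single synchronous, CPU-bound handler in an Express.js process starves every other request, including a trivial health check:

```typescript
// Minimal illustration of the bottleneck: one CPU-bound handler blocks the
// single Node.js event loop, so every other request queues behind it.
import express from "express";

const app = express();

// Stand-in for a CPU-intensive fraud rule evaluated synchronously in-process.
function scoreBatchSync(iterations: number): number {
  let score = 0;
  for (let i = 0; i < iterations; i++) {
    score += Math.sqrt(i) % 7;
  }
  return score;
}

app.post("/internal/fraud/batch", (_req, res) => {
  // A couple of seconds of synchronous work: nothing else is served meanwhile.
  res.json({ score: scoreBatchSync(500_000_000) });
});

app.get("/health", (_req, res) => {
  // Normally answers in microseconds -- but it queues behind the blocked
  // event loop whenever the handler above is running.
  res.json({ ok: true });
});

app.listen(3000);
```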
**Database Overload and Single Point of Failure:** The primary PostgreSQL instance was a r6g.2xlarge EC2 instance on AWS, managing 12 databases including the core payments schema, user identities, settlements, webhooks, analytics, and audit logs — running at 68% CPU in steady-state conditions. During traffic spikes, replication lag climbed as high as 12 seconds. The transactions table alone was growing by approximately 800,000 rows per day, and a full scan of the latest 90 days already touched 342 million rows. Foreign key relationships between payments, refunds, settlements, and audit events were so deep that a single refund inquiry traversed 7 tables and 5 indexes before returning.
**Connection Pool Exhaustion:** The application used a single pg Pool instance with a max size of 35 connections. Every concurrent request — a payment authorization, a webhook lookup, or a reconciliation status check — consumed a connection from this pool. During the Diwali spike, the pool saturated within 3.4 seconds of the sale opening, causing the application to error out on authorization attempts with an immediate client-facing message: "Payment failed. Please try again." From the client's perspective, the payment API was broken. In reality, the pool was overwhelmed. The retry logic in the client SDK then compounded the problem by generating a 143% increase in total failure volume within 90 seconds, a textbook retry storm that collapsed every downstream service simultaneously.
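A rough sketch of the pre-migration pooling behaviour (the pool size comes from the text; all other values are assumptions) shows why saturation surfaced to users as instant failures rather than queued requests:

```typescript
// Illustrative sketch of the pre-migration pool behaviour (values assumed,
// not taken from PayStream's config). Once all 35 connections are checked
// out, further requests wait only briefly and then reject -- which the SDK
// surfaced to users as "Payment failed. Please try again."
import { Pool } from "pg";

const pool = new Pool({
  max: 35,                       // single shared pool for every request type
  connectionTimeoutMillis: 2000, // waiters give up after 2s when saturated
  idleTimeoutMillis: 300_000,    // 5-minute idle timeout (later cut to 30s)
});

export async function authorizePayment(paymentId: string): Promise<void> {
  try {
    // Every authorization, webhook lookup and reconciliation check competed
    // for the same 35 connections.
    await pool.query("SELECT status FROM payments WHERE id = $1", [paymentId]);
  } catch (err) {
    // Pool saturation surfaces here as a timeout error; the client SDK's
    // automatic retries then amplified the load into a retry storm.
    throw new Error(`Payment authorization failed for ${paymentId}: ${err}`);
  }
}
```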
**Fraud Scoring in the Critical Path:** The fraud detection engine ran synchronously during payment authorization. Each transaction had to clear 30 fraud rules — 17 deterministic (blacklisted IP, velocity checks, BIN range blocklists) and 13 heuristic/ML-based rules. The ML rules involved a live call to a TensorFlow Serving endpoint, adding an average of 320ms of latency directly into every payment authorization. On a successful authorization, this took roughly 480ms total. Under load — when rules were evaluated against cache misses and the heuristic engine had to be recomputed for each batch — this inflated to 3.2 seconds mean latency. There was no async path, no graceful degradation, and no circuit breaker configuration.
**Cache Inconsistency Led to Repeated Failures:** Caching was implemented with a basic node-cache in-memory approach, effectively keeping one copy of cached data in each of the 15 running pods in the Kubernetes cluster — meaning 15 different versions of the payment status cache. There was no invalidation strategy. A payment status change only propagated across pods on TTL expiry, which was set to 5 minutes, so a user checking payment status on Pod A shortly after the status changed via Pod B would see a stale response. During high-traffic periods, when cache invalidations relied on time-based expiry rather than explicit invalidation, 18% of cache lookups returned stale data. In cases of payment disputes, users reported seeing previous status values for up to 10 minutes after the actual update — directly contributing to 23% of all customer support tickets in the quarter.
**Observability Blackout:** The platform ran on basic Vertex APM logs with a 30-day retention period. There were no structured logs, no distributed tracing, and no real-time dashboards for per-endpoint P99 latency. In the first month following the incident, the engineering team received 7 critical alerts; only 2 were actionable, and on every occasion the team responded by increasing memory allocations on the affected pods rather than addressing the root cause — masking the issue until the next spike.
## Goals
The engineering team defined six clear, measurable goals:
1. **Achieve 10,000 TPS sustained throughput** — up from 100 TPS — without degradation.
2. **Reduce 99th percentile (P99) payment authorization latency to under 200ms**, down from 2.8 seconds in peak conditions.
3. **Eliminate all single points of failure** — no service, database, or compute node whose failure could bring the entire platform down.
4. **Maintain 99.99% availability** (roughly 4.3 minutes of downtime per month at most) during the winter campaign season.
5. **Enable zero-downtime deployments** for all payment services, meaning zero users should ever need to retry a payment because of a rolling or blue-green deployment.
6. **Reduce time-to-resolution for critical incidents from 47 minutes to under 10 minutes** by improving observability.
## Approach
The team chose a total re-architecture rather than piecemeal optimization. The rationale was simple: patching individual bottlenecks had already delivered diminishing returns, and each fix created a new chokepoint at the next layer up. A full redesign would address structural problems that patching never could.
### 1. Event-Driven Microservices Architecture
The team broke down the monolithic service into four independent microservice boundaries: `payment-gateway` (authorization + settlement), `fraud-service` (async fraud scoring), `notification-service` (SMS + email + webhooks + in-app), and `settlement-service` (reconciliation + ledger).
The critical architectural decision was to move fraud scoring fully out of the authorization path. Instead of synchronously evaluating the fraud engine before proceeding with the authorization, the payment gateway now authorizes the transaction ahead of the fraud score. The fraud-service subscribes to the `PaymentAuthorized` Kafka event and evaluates fraud rules asynchronously. If a fraud score exceeds the configured threshold, it emits a `PaymentHoldRequested` event. The settlement-service subscribes to both `PaymentHoldRequested` and `PaymentSettled` events, keeping a running ledger state, and at settlement time only payments with clearance status are released. This change meant that the fraud engine could consume as much CPU and memory as it needed without impacting authorization latency.
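The flow can be sketched roughly as follows, assuming kafkajs as the client library and a hypothetical `scoreTransaction()` helper standing in for the 30-rule engine; only the event names come from the description above:

```typescript
// Sketch of the fraud-service consumer loop: consume PaymentAuthorized,
// score asynchronously, emit PaymentHoldRequested only when the score is
// high. Broker address, group id and the scoring helper are assumptions.
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "fraud-service", brokers: ["kafka:9092"] });
const consumer = kafka.consumer({ groupId: "fraud-service" });
const producer = kafka.producer({ idempotent: true });

// Hypothetical scoring helper standing in for the 30-rule engine.
async function scoreTransaction(payment: { id: string; amount: number }): Promise<number> {
  return payment.amount > 100_000 ? 0.95 : 0.1;
}

const FRAUD_THRESHOLD = 0.8;

async function run(): Promise<void> {
  await Promise.all([consumer.connect(), producer.connect()]);
  await consumer.subscribe({ topics: ["PaymentAuthorized"] });

  await consumer.run({
    eachMessage: async ({ message }) => {
      const payment = JSON.parse(message.value!.toString());
      const score = await scoreTransaction(payment);

      // Authorization has already completed; a high score only places a
      // hold that the settlement-service honours at settlement time.
      if (score >= FRAUD_THRESHOLD) {
        await producer.send({
          topic: "PaymentHoldRequested",
          messages: [{ key: payment.id, value: JSON.stringify({ paymentId: payment.id, score }) }],
        });
      }
    },
  });
}

run().catch(console.error);
```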
All inter-service communication is handled via Kafka topics with a replication factor of 3 and a minimum in-sync replicas (ISR) setting of 2. Consumer offsets are committed through Kafka's `__consumer_offsets` topic, message compaction is enabled for state topics, and all consumers are idempotent, meaning any consumer group can safely replay events from any offset without producing duplicate side effects.
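As an illustration of those topic settings, a kafkajs admin call along these lines would create a compacted state topic with replication factor 3 and `min.insync.replicas` of 2 (the topic name, partition count, and broker address are assumptions):

```typescript
// Illustrative kafkajs admin call for the topic settings described above;
// the topic name, partition count and broker address are assumptions.
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "platform-admin", brokers: ["kafka:9092"] });

async function createStateTopic(): Promise<void> {
  const admin = kafka.admin();
  await admin.connect();
  await admin.createTopics({
    topics: [
      {
        topic: "settlement-ledger-state",   // hypothetical state topic
        numPartitions: 24,                   // assumed partition count
        replicationFactor: 3,                // every partition lives on 3 brokers
        configEntries: [
          { name: "min.insync.replicas", value: "2" },  // writes need 2 in-sync acks
          { name: "cleanup.policy", value: "compact" }, // compaction for state topics
        ],
      },
    ],
  });
  await admin.disconnect();
}

createStateTopic().catch(console.error);
```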
### 2. Database Sharding and Neon PostgreSQL Migration
The team migrated from their monolithic PostgreSQL database to a sharded Neon PostgreSQL architecture. The primary shard-split criterion was `tenant_id` — each enterprise customer gets its own shard, while smaller tenants below a volume threshold are dynamically co-located. This gave the team both isolation and efficient resource utilization. Migrations were handled with Liquibase, with a pre-migration validation step that verified all active queries against an offline copy of the target schema. Zero-downtime cutover was achieved using a shadow-write phase in which writes were fanned out to both the old and new databases simultaneously for 72 hours, with read-only confirmation queries against the new database executed on every request — until full traffic cutover to the new sharded cluster was executed safely.
Read replicas were configured per shard to handle read-heavy settlement and analytics queries without touching the primary write node. Connection pool limits per shard were sized in proportion to active tenant count, and idle connection timeouts were tightened from 5 minutes to 30 seconds to prevent connection pool bloat.
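A simplified sketch of the tenant-to-shard routing and the tightened pool settings might look like this; shard URLs, pool sizes, and the co-location rule are assumptions, not PayStream's actual configuration:

```typescript
// Sketch of tenant-based shard routing with the tightened idle timeout;
// shard URLs, the co-location rule and getShardUrl() are assumptions.
import { Pool } from "pg";

// One pool per shard, created lazily and reused across requests.
const shardPools = new Map<string, Pool>();

// Hypothetical mapping: large tenants get a dedicated shard, small tenants
// are co-located on a shared shard.
function getShardUrl(tenantId: string): string {
  const dedicated: Record<string, string> = {
    "tenant-fashion-brand": "postgres://neon-shard-01/payments",
  };
  return dedicated[tenantId] ?? "postgres://neon-shard-shared/payments";
}

export function poolForTenant(tenantId: string): Pool {
  const url = getShardUrl(tenantId);
  let pool = shardPools.get(url);
  if (!pool) {
    pool = new Pool({
      connectionString: url,
      max: 20,                    // sized per shard by active tenant count
      idleTimeoutMillis: 30_000,  // tightened from 5 minutes to 30 seconds
    });
    shardPools.set(url, pool);
  }
  return pool;
}

// Usage: every query is scoped to the caller's tenant shard.
export async function getPaymentStatus(tenantId: string, paymentId: string) {
  const { rows } = await poolForTenant(tenantId).query(
    "SELECT status FROM payments WHERE id = $1",
    [paymentId],
  );
  return rows[0]?.status;
}
```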
### 3. Redis Cloud Cluster for Caching and Idempotency
The team replaced node-cache with a 30-node Redis Cloud cluster with cluster mode enabled and 3 replicas per shard. All payment status lookups, PSP response caching, fraud rule blacklists, and BIN range lookups now flow through Redis. The payment gateway's idempotency key store — previously a PostgreSQL table — was moved to Redis with a sha256-hashed key and a 24-hour TTL, eliminating the most frequent point of contention during peak transaction periods. Idempotency is now the first check in the authorization path and short-circuits everything else if a key already exists — the idempotency-key hit rate is now 60.7% in peak periods.
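An idempotency-first check of this shape can be sketched with ioredis: the key is the SHA-256 hash of the client-supplied idempotency key, claimed atomically with `SET ... NX EX` and a 24-hour TTL. The key naming, cluster endpoint, and pending-value handling are assumptions:

```typescript
// Sketch of an idempotency-first authorization check: SHA-256 key, 24-hour
// TTL, atomic SET ... NX. Cluster endpoint and key naming are assumptions.
import Redis from "ioredis";
import { createHash } from "crypto";

const redis = new Redis.Cluster([{ host: "redis-cluster.internal", port: 6379 }]);

const DAY_IN_SECONDS = 24 * 60 * 60;

export async function authorizeOnce(
  idempotencyKey: string,
  authorize: () => Promise<string>, // returns a payment id
): Promise<{ paymentId: string; replayed: boolean }> {
  const key = "idem:" + createHash("sha256").update(idempotencyKey).digest("hex");

  // Atomically claim the key; NX means only the first request wins.
  const claimed = await redis.set(key, "pending", "EX", DAY_IN_SECONDS, "NX");

  if (claimed === null) {
    // Duplicate request: short-circuit and return the stored payment id.
    const existing = await redis.get(key);
    return { paymentId: existing ?? "pending", replayed: true };
  }

  const paymentId = await authorize();
  await redis.set(key, paymentId, "EX", DAY_IN_SECONDS);
  return { paymentId, replayed: false };
}
```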
### 4. Observability Stack Overhaul
The team deployed a full observability stack using Coralogix for structured logs, Prometheus + Grafana for metrics, and Jaeger for distributed tracing across all four microservices. Logs are structured JSON with a trace ID in every log line. Jaeger was configured with 100% sampling for internal services (the cost is negligible relative to the value) and 1% sampling at the edge. Alerting rules were defined at the service level, with separate severity classifications for latency spikes, error rates, connection pool saturation, and Kafka lag. On-call rotations were formalized with two engineers on call at any given time, and a dedicated incident response Slack channel was created with automated routing for critical alerts. The reduction in time-to-detection from an average of 18 minutes to under 1 minute — and time-to-resolution from 47 minutes to 6 minutes on average during the three-week load-test period after the re-architecture — is the engineering team's proudest post-incident metric.
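A minimal sketch of the logging convention — assuming a small helper like the one below rather than any specific SDK — shows the key idea: every log line is a single JSON object carrying the active trace ID, so one trace can be joined across all four services:

```typescript
// Sketch of a structured JSON logger that carries the active trace id on
// every line, so a search for one trace returns the full causal chain.
// The traceId plumbing is an assumption; in production it would come from
// the tracing SDK's active span.
interface LogContext {
  service: string;
  traceId: string;
}

export function logEvent(
  ctx: LogContext,
  level: "info" | "warn" | "error",
  message: string,
  fields: Record<string, unknown> = {},
): void {
  // One JSON object per line: easy to index, easy to join on traceId.
  console.log(JSON.stringify({
    timestamp: new Date().toISOString(),
    level,
    service: ctx.service,
    traceId: ctx.traceId,
    message,
    ...fields,
  }));
}

// Usage inside a request handler (names hypothetical):
// logEvent({ service: "payment-gateway", traceId }, "info",
//   "authorization accepted", { paymentId, latencyMs: 42 });
```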
### 5. Stateless Application with Live Reload
All four microservices are now deployed as stateless Kubernetes Deployments with a minimum of 3 replicas and HPA configured to scale at 50% CPU utilization. Zero-downtime deployments are achieved with rolling updates that drain and replace pods one at a time, with `terminationGracePeriodSeconds` tuned so in-flight requests complete before shutdown. The `readinessProbe` now hits an in-memory endpoint that responds within 100ms, removing the dependency on an external database connection (which previously caused rolling restarts to fail while a database rebalance was ongoing). The `livenessProbe` is likewise a ~200ms in-memory health check, so a pod that has initialized but not yet connected to Kafka can still pass liveness — acceptable because the pod self-heals without a forced restart, with the HPA scaling up a new replica while the lagging pod recovers.
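The probe endpoints can be sketched as plain in-memory Express handlers; the paths, port, and `kafkaConnected` flag are assumptions, but the important property is that neither probe touches the database:

```typescript
// Sketch of the in-memory health endpoints: neither probe queries the
// database, so rolling restarts no longer fail during a shard rebalance.
// Paths, port and the kafkaConnected flag are assumptions.
import express from "express";

const app = express();

let kafkaConnected = false;

// Called by the Kafka client's connect callback in a real service.
export function markKafkaConnected(): void {
  kafkaConnected = true;
}

app.get("/livez", (_req, res) => {
  // Liveness: the process is up and the event loop is responsive.
  // A pod that has not yet connected to Kafka still passes and self-heals.
  res.status(200).json({ alive: true });
});

app.get("/readyz", (_req, res) => {
  // Readiness: only advertise the pod to the Service once it can do work.
  if (kafkaConnected) {
    res.status(200).json({ ready: true });
  } else {
    res.status(503).json({ ready: false });
  }
});

app.listen(8080);
```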
## Implementation
The implementation was executed in four phases spanning exactly 120 days, starting in mid-November 2024 and concluding before the winter campaign season in mid-March 2025.
**Phase 1 — Skeleton and Contracts (Weeks 1–3):** The team began by defining the Event Contract Specification for all topics using AsyncAPI specifications stored in the repository as source of truth. Contract tests with Pact were introduced and required for any consumer or producer change. All four services had their skeleton deployed, connected to Kafka, and all P1 payment event topics (`PaymentSucceeded`, `PaymentFailed`, `PaymentHoldRequested`, `AuthorizationExpired`) had integration tests passing with 100% backwards compatibility with the existing contract. This phase meant that the monolithic service could continue running side by side with the new services throughout the migration, without breaking compatibility.
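Alongside the AsyncAPI documents, a shared, versioned TypeScript contract of roughly this shape (field names hypothetical, not PayStream's actual schema) keeps producers and consumers honest between contract-test runs:

```typescript
// Sketch of a shared, versioned event contract mirroring what the AsyncAPI
// documents encode; field names and the runtime guard are hypothetical.
export interface PaymentHoldRequestedV1 {
  eventType: "PaymentHoldRequested";
  version: 1;
  paymentId: string;
  tenantId: string;
  fraudScore: number;      // 0..1, set by fraud-service
  requestedAt: string;     // ISO-8601 timestamp
}

// Runtime guard used by consumers before acting on a message, so an
// incompatible producer change fails loudly in contract tests.
export function isPaymentHoldRequestedV1(value: unknown): value is PaymentHoldRequestedV1 {
  const v = value as Partial<PaymentHoldRequestedV1>;
  return (
    v?.eventType === "PaymentHoldRequested" &&
    v.version === 1 &&
    typeof v.paymentId === "string" &&
    typeof v.tenantId === "string" &&
    typeof v.fraudScore === "number"
  );
}
```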
**Phase 2 — Shadow Traffic Canary (Weeks 4–7):** The new payment gateway was brought live, receiving a mirror copy of 1% of real production traffic without producing any customer-visible effects — emitting only logs and metrics for validation. The team watched performance metrics, error rates, and trace consistency over this 3-week period. The traffic fraction was ramped up to 10% over the following weeks until the team saw consistent stability. At 10% mirror traffic, they raised a red flag: Kafka event ordering was slightly inconsistent — in fewer than 0.02% of cases, a `PaymentFailed` event arriving before `PaymentAuthorized` caused the settlement-service ledger to briefly go out of sync. The team fixed partitioning to use `payment_id` as the partition key (instead of random assignment), which guarantees strict per-payment event ordering across all services. By the end of Phase 2, the new architecture was battle-tested with real traffic, behaving exactly as expected, but still not customer-facing.
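The partitioning fix amounts to keying every event for a payment by its `payment_id`, so Kafka hashes all of that payment's events to the same partition of a topic and preserves their order. A hedged sketch with kafkajs (broker address and payload shape assumed):

```typescript
// Sketch of the partitioning fix: keying every event by payment_id means all
// events for one payment hash to the same partition of a topic, preserving
// their order. Broker address and payload shape are assumptions.
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "payment-gateway", brokers: ["kafka:9092"] });
const producer = kafka.producer({ idempotent: true });

export async function initProducer(): Promise<void> {
  await producer.connect();
}

export async function emitPaymentEvent(
  topic: "PaymentAuthorized" | "PaymentFailed" | "PaymentSettled",
  paymentId: string,
  payload: Record<string, unknown>,
): Promise<void> {
  await producer.send({
    topic,
    messages: [
      {
        key: paymentId, // same key => same partition => strict per-payment order
        value: JSON.stringify({ paymentId, ...payload }),
      },
    ],
  });
}
```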
**Phase 3 — Gradual Traffic Cutover (Weeks 8–10):** The cutover was executed at the load balancer, pointing 5% of payment authorization traffic to the new payment-gateway service under circuit-breaker protection — meaning any repeated failure or high-latency condition would instantly redirect traffic back to the monolithic service. Over the following weeks this was incremented to 25%, 50%, 75%, and finally 100% of read traffic, followed by 100% of write traffic one week later. No incidents were reported during this phase. The monolithic service was kept running as a warm standby read path for 2 weeks, then decommissioned on day 90 of the project with zero downtime.
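The failover behaviour can be approximated with a small in-process breaker; this is an illustrative sketch of the logic, not the actual load balancer configuration, and the thresholds and cooldown are assumptions:

```typescript
// Illustrative in-process circuit breaker, not the production load balancer
// configuration; thresholds and cooldown values are assumptions.
type Target = "new-gateway" | "monolith";

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly maxFailures = 5,
    private readonly cooldownMs = 30_000,
  ) {}

  target(): Target {
    const open =
      this.failures >= this.maxFailures &&
      Date.now() - this.openedAt < this.cooldownMs;
    return open ? "monolith" : "new-gateway";
  }

  recordSuccess(): void {
    this.failures = 0;
  }

  recordFailure(): void {
    this.failures += 1;
    if (this.failures >= this.maxFailures) {
      this.openedAt = Date.now(); // (re)open the breaker and start the cooldown
    }
  }
}

const breaker = new CircuitBreaker();

// Wraps every proxied authorization call: failures trip the breaker and
// subsequent calls fall back to the monolith until the cooldown elapses.
export async function routeAuthorization<T>(
  callNewGateway: () => Promise<T>,
  callMonolith: () => Promise<T>,
): Promise<T> {
  if (breaker.target() === "monolith") return callMonolith();
  try {
    const result = await callNewGateway();
    breaker.recordSuccess();
    return result;
  } catch {
    breaker.recordFailure();
    return callMonolith();
  }
}
```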
**Phase 4 — Campaign Readiness and Stress Testing (Weeks 11–12):** The two weeks before the winter sale season were dedicated to rigorous stress testing. The chaos engineering team ran 24-hour tests at 1.2× projected peak load (12,000 TPS), simulating node failures, Kafka partition loss, Redis cluster rebalancing, and database failovers under load. Failure scenarios tested included: killing the primary PostgreSQL node mid-authorization batch, disconnecting 50% of Redis nodes, dropping the fraud-service WebSocket connection during peak load, and draining a Kafka broker mid-produce — scenarios that previously would have been 47-minute incidents. Under chaos conditions the platform sustained the full simulated load with error rates staying below 0.001% throughout. P99 authorization latency measured 68ms under chaos conditions. The incident response drill that followed — a simultaneous strike of 3 injected failure conditions — was resolved in 5 minutes, 42 minutes faster than the previous 47-minute baseline.
## Results & Metrics
The numbers from the post-go-live period — spanning January through early March 2025 — tell the story clearly.
- **Throughput increased from 100 TPS to 11,200 sustained TPS**, a 112× improvement.
- **P99 authorization latency dropped from 2.8s to 89ms**, a 31× improvement.
- **P99 settlement pipeline latency dropped from 12.3s to 420ms**.
- **Platform availability reached 99.997%** (13 minutes of downtime across the entire campaign season) — comfortably ahead of the original 99.99% target, verified by independent uptime monitoring via UptimeRobot across 12 geographic locations.
- **Customer support tickets related to payment status inconsistency dropped 78%** from the prior quarter.
- **Idempotency hit rate at peak load was 60.7%** — meaning more than 6 in 10 requests were saved from expensive processing by the idempotency layer.
- **Chaos-engineering fault-injection drills resulted in zero impact to clients** across 12 scenarios tested.
- **Connection pool utilization averaged 17%** across all shards at peak sustained 11,200 TPS load.
- **Infrastructure cost as a percentage of recurring revenue — previously a rising concern at 18% — dropped to 9.3%**, a direct result of more efficient service design, better per-tenant database utilization, and auto-scaling based on traffic rather than scheduled capacity.
- **Time to market for new payment methods** dropped from 21 days to 5 days, because the event-driven architecture allows the team to simply introduce a new `PaymentMethodXAuthorized` event rather than modifying the monolith's code base.
The most surprising victory was cost efficiency: the team had budgeted for a 40% increase in infrastructure cost during the migration, expecting to deploy additional replicas and more compute-heavy services. The final cost was actually 12% below the previous quarter's cost — largely because Redis caching removed so many database read paths, and because stateless autoscaling replaced previously over-provisioned scheduled capacity.
## Lessons Learned
Several lessons stand out as potentially broadly applicable.
**Moving asynchronous work out of the critical path is the single highest-leverage change you can make for API performance.** Moving fraud scoring — and, in a future version, settlement analytics, reconciliation jobs, and webhook retry queues — out of the critical path freed the authorization path from a structural latency ceiling. The P99 improvements were not produced by faster queries, better indexes, or application-level caching alone. They came from recognizing where latency budget was being wasted and removing it entirely.
**Observability is not a post-incident improvement project — it's a precondition for serious scale.** Trying to operate at the scale PayStream reached without full observability creates a fog-of-war situation in which faith, not evidence, drives decision making. This was visible in the original alerting behavior: engineers trusted the alert count rather than the underlying signal, increasing pod memory as a prophylactic instead of tracing the actual causal chain — extra pods sat idle while cache invalidations continued to fail.
**Chaos engineering is an exercise, not a precaution — it builds confidence long before the storm arrives.** Running intentional failure scenarios in staging — nodes killed, Kafka partitions lost, database primary failovers — revealed six architectural gaps that no load test or code review would have caught. The gap between thinking your system is resilient and knowing it is resilient is measured in injection drills and failure scenarios validated live.
**Migration path compatibility matters more than migration ambition.** The shadow traffic approach — mirroring production without producing customer-visible side effects — allowed the team to confirm the new architecture was executing correctly before it received any write traffic. This is highly recommended for any platform that can afford the investment in traffic mirroring and regression validation. Running shadow traffic caught the Kafka partitioning inconsistency before it could affect production.
**Idempotency should be the first check in the request path.** After the migration, the idempotency hit rate stabilized at 60%+ during normal periods, meaning nearly 6 in 10 payment requests were deduplicated before touching any downstream service. For high-volume APIs, idempotency should be a first-class architectural concern with its own SLA, not an afterthought tacked on to the authorization handler.
**Cost efficiency follows good architecture.** The team had assumed scaling would require increasing infrastructure spend proportionally. Instead, caching-aware patterns, queue-driven decoupling, and sharded compute meant that cost efficiency improved along with performance. Efficiency gains are often an emergent property of architecture done right β not just a trade-off you make against performance.
## Looking Ahead
The architecture is now production-steady, but not static. The team is in the early phases of introducing a stream-processor layer using Kafka Streams with materialized views to serve live analytics dashboards for enterprise customers without querying the primary database. The next architectural milestone is shard-level quota management — dynamically allocating shard capacity per tenant and automatically throttling tenants that exceed alert thresholds. That capability will let PayStream onboard enterprise clients growing from ₹100L to ₹100Cr in annual GMV without a manual capacity engineering review.
The story of PayStream's re-architecture is also the story of what happens when structural architectural problems are confronted early rather than deferred. The platform hit a hard wall in November 2024, and four months later emerged stronger than most platforms achieve in two years. The engineering team has gone from firefighting to shipping β a transition that only full re-architecture, with all its disruption and controversy, could catalyze.