How FinFlow Partnered with Webskyne to Reduce Payment Processing Latency by 73% and Handle 10× Peak Traffic
FinFlow, a rapidly scaling Indian fintech platform processing over ₹2,000 crore in monthly transactions, faced a critical performance ceiling. Their legacy monolith struggled under festival-season load spikes, causing failed payments and eroding merchant trust. This case study details how a targeted 12-week architecture overhaul (event-driven redesign, database partitioning, and progressive migration) turned a crisis into a competitive advantage, reducing p99 latency from 2.8s to about 750ms and cutting infrastructure costs by 26% in the process.
## Overview
FinFlow is a Mumbai-based fintech company that provides payment aggregation, settlement, and compliance infrastructure to over 18,000 Indian merchants — ranging from D2C brands and restaurant chains to gig-economy platforms and government-backed digital commerce portals. Operating across 12 payment channels (UPI, debit/credit cards, net banking, wallets, and BNPL), the platform processes an average of 1.2 million transactions per day and spikes to over 12 million during India's major commerce festivals: Diwali, Big Billion Days, and Republic Day sales.
When FinFlow approached Webskyne editorial in early 2025, the company was already processing ₹1,800 crore in monthly gross transaction value (GTV). The unit economics were healthy, but the infrastructure foundation was showing signs of structural fatigue that threatened to cap further growth.
## The Challenge
### A Legacy Stack Under Pressure
FinFlow's payment processing layer was built on a Node.js monolith backed by a single PostgreSQL instance. During normal operating conditions (roughly 1 to 3 million transactions per day) the system ran acceptably, with average API response times hovering around 1.2 seconds and a p99 latency of 2.8 seconds.
However, traffic during festival season told a very different story. On October 15, 2024 (Day 2 of Diwali sales), traffic surged to 14.2 million transactions. The consequences were immediate and severe:
- **Payment success rate dropped from 99.4% to 95.1%** — resulting in an estimated ₹4.3 crore in failed or abandoned transactions that day
- **p99 latency spiked to 8.7 seconds** — well beyond UPI's optimal 2-second window and the 3-second SLA that FinFlow guaranteed its merchants
- **Database connection pool exhaustion** — Postgres connection count hit 890 of 1,000 configured connections within 45 minutes of the traffic spike
- **Merchant escalation volume spiked 420%** — support tickets went from a baseline of 120/day to 625/day during peak hours
Merchants were losing revenue. Some publicly criticized FinFlow on social media. The engineering team was in firefighting mode for 11 consecutive days. And the CTO's question to Webskyne editorial was stark:
"We have scale written all over our marketing. Can we actually deliver it?"
### Root Cause Analysis
The core problems ran deeper than a single bottleneck and required systemic change:
1. **Synchronous bottleneck in the transaction pipeline:** Every payment authorization, settlement reconciliation, and notification was processed sequentially through the monolith. A single slow downstream dependency — such as a bank gateway timeout during peak hours — cascaded and blocked all other requests behind it.
2. **No read/write separation:** Authentication lookups, merchant balance queries, and transaction history fetches all hit the primary Postgres instance. Read-heavy workloads (which made up roughly 70% of all database queries) competed directly with the write-heavy settlement jobs for the same limited connection slots.
3. **Single-region deployment:** The entire application stack ran in a single Mumbai availability zone. Any regional routing issue, VPC peering hiccup, or upstream ISP problem could take the entire payment engine offline.
4. **Poor cache invalidation strategy:** Redis was in place, but cache keys were inconsistently named and invalidation was manual. Cache hit rates oscillated between 18% and 62%, making performance unpredictable.
5. **No circuit breakers or bulkheads:** When third-party payment gateways degraded, the retry storms generated by cascading failures amplified the problem, causing a majority of the platform's downtime incidents.
## Goals
Based on a six-week discovery phase involving load testing, performance profiling, and team interviews, Webskyne editorial and FinFlow co-designed a concrete set of objectives:
| Priority | Goal | Target | Timeframe |
|----------|------|--------|-----------|
| P0 | Reduce p99 payment API latency | ≤ 2 seconds at 3× peak traffic | 12 weeks |
| P0 | Eliminate payment success rate drops below 99.2% | ≥ 99.2% at 5× peak traffic | 12 weeks |
| P0 | Support 5× current peak traffic (70M/day) without degradation | ≥ 70M transactions/day sustained | 12 weeks |
| P1 | Reduce infrastructure costs | ≥ 20% reduction vs. projected baseline | 16 weeks |
| P1 | Reduce incident response time | ≤ 30 min average resolution | 8 weeks |
| P2 | Improve deployment frequency | ≥ 4/week production deploys | 12 weeks |
| P2 | Reduce support ticket volume post-launch | ≥ 40% reduction | 8 weeks post-launch |
All goals were anchored with measurable baselines captured during a simulated peak-load soak test in December 2024.
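To put the 70M/day target in per-second terms, a quick back-of-the-envelope conversion helps. The sketch below assumes load spread evenly over 24 hours; the peak-to-average ratio is a hypothetical illustration, not a figure from the engagement.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_tps(transactions_per_day: float) -> float:
    """Average transactions per second if load were spread evenly over a day."""
    return transactions_per_day / SECONDS_PER_DAY


target_per_day = 70_000_000        # P0 goal from the table above
avg = average_tps(target_per_day)  # roughly 810 TPS

# Hypothetical assumption: the busiest second carries 3x the daily average.
ASSUMED_PEAK_TO_AVERAGE = 3.0
peak = avg * ASSUMED_PEAK_TO_AVERAGE

print(f"average: {avg:.0f} TPS, assumed peak: {peak:.0f} TPS")
```

Festival traffic is far from uniform, so the per-second capacity that matters is the peak figure, which is why the real ratio should come from production traces rather than an assumed constant.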
## Our Approach
The FinFlow engagement was structured into three phases: Architecture & Strategy (Weeks 1–3), Phased Implementation (Weeks 4–10), and Hardening & Launch (Weeks 11–12).
### Phase 1: Architecture & Strategy (Weeks 1–3)
The initial phase focused on instrumentation and fact-finding before making any infrastructure changes. The FinFlow team had strong intuitions about where problems lived, but we needed to validate those assumptions with data.
#### Instrumentation and Baseline Metrics
Before writing a single line of migration code, we instrumented every critical path in the payment processing pipeline using OpenTelemetry distributed tracing. Key findings from the baseline profiling included:
- The **query-logging layer** in the monolith added a 340ms median overhead on every transaction, even when logging was disabled
- **Settlement reconciliation batch jobs** ran during prime hours due to misconfigured cron triggers, consuming 42% of available DB write headroom between 8 PM and 10 PM
- **Gateway timeouts on HDFC and SBI switch integrations** cascaded through the retry logic, with some payment attempts retried up to 7 times before failing — inflating latency and load unnecessarily
- Redis was running with cluster mode disabled, so it functioned as a single shard and limited concurrency
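The retry finding can be made concrete with a small calculation. The sketch below estimates the expected number of gateway calls per payment for a given retry limit, under the simplifying assumption that attempts fail independently with the same probability (real gateway failures are usually correlated, so this likely understates the effect during an outage). The failure rates are illustrative, not measured values.

```python
def expected_attempts(failure_prob: float, max_retries: int) -> float:
    """Expected gateway calls per payment when each failed attempt is retried.

    Assumes attempts fail independently with the same probability.
    """
    # Attempt k (0-based) only happens if the previous k attempts all failed.
    return sum(failure_prob ** k for k in range(max_retries + 1))


for p in (0.05, 0.5, 0.9):
    with_7 = expected_attempts(p, max_retries=7)
    with_2 = expected_attempts(p, max_retries=2)
    print(f"failure rate {p:.0%}: {with_7:.2f} calls with 7 retries, {with_2:.2f} with 2")
```

When a gateway is healthy the retry limit barely matters; when it is degraded, a limit of 7 multiplies the load on exactly the dependency that is already struggling.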
#### Target Architecture Design
Based on these findings, we proposed and aligned on a revised target architecture with four guiding principles:
1. **Separate reads from writes** — introduce read replicas and a caching layer with consistent invalidation
2. **Introduce asynchronous processing** — use a message queue for all non-real-time workflows (notifications, settlement, audit logging)
3. **Add resilience primitives** — circuit breakers, bulkheads, timeouts, and degradation paths for every external dependency
4. **Multi-region readiness** — design for eventual regional failover, starting with active M-East region support
### Phase 2: Phased Implementation (Weeks 4–10)
#### Step 1: Async Decoupling via Event Queue (Weeks 4–5)
The highest-impact change was extracting the settlement and notification workflows from the synchronous payment authorization path using Apache Kafka as an event backbone.
Previously, the authorization flow was:
```
Authorize Payment → Update Ledger → Emit Notification → Return Response
```
This put the ledger update and notification delivery on the critical path for user-facing latency. Under our revised architecture, the flow became:
```
Authorize Payment → Persist Transaction → Publish:txn.authorized Event → Return Response (≤ 500ms target)
                                                      ↓
Kafka Consumer (async) → Update Ledger → Publish:txn.settled Event → Notification Service
```
This decoupling was implemented as a module-level refactor (no microservice extraction needed at this stage) to reduce risk. Only event production and consumption side effects changed; the core authorization logic and data model remained untouched.
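The shape of that refactor can be sketched in a few lines. The example below is an illustration in Python, not FinFlow's Node.js code: an in-process `queue.Queue` stands in for the Kafka topic, plain dictionaries stand in for the transactions table and ledger, and all function and field names are hypothetical.

```python
import queue
import threading

events = queue.Queue()  # stand-in for the txn.authorized Kafka topic
transactions = {}       # stand-in for the transactions table
ledger = {}             # stand-in for merchant ledger balances


def authorize_payment(txn_id, merchant_id, amount_paise):
    """Synchronous path: persist, publish, return. No ledger or notification work."""
    transactions[txn_id] = {
        "merchant": merchant_id,
        "amount": amount_paise,
        "status": "authorized",
    }
    events.put({"type": "txn.authorized", "txn_id": txn_id})
    return {"txn_id": txn_id, "status": "authorized"}


def ledger_consumer():
    """Asynchronous path: runs off the request thread and settles each event."""
    while True:
        event = events.get()
        if event is None:  # sentinel used to stop the worker in this demo
            break
        txn = transactions[event["txn_id"]]
        ledger[txn["merchant"]] = ledger.get(txn["merchant"], 0) + txn["amount"]
        txn["status"] = "settled"  # a real consumer would publish txn.settled here


worker = threading.Thread(target=ledger_consumer)
worker.start()
response = authorize_payment("txn-1", "merchant-42", 49_900)
events.put(None)
worker.join()
print(response, ledger)
```

The caller receives its response as soon as the transaction is persisted and the event is enqueued; the ledger update happens on the worker, which is the property that removes it from user-facing latency.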
#### Step 2: Read/Write Separation (Weeks 5–7)
With settlement and notification work moved off the synchronous path, we addressed the structural pressure on the primary database:
- **Provisioned 3 read replicas** for the transactions table, with pgBouncer connection pooling configured in transaction-pool mode to reduce connection overhead
- **Migrated the merchant dashboard and transaction history endpoints** to read from replicas, redirecting approximately 62% of read query volume off the primary
- **Introduced a consistent cache layer** using Redis Cluster (3 shards), with cache-key naming and invalidation automated through a centralized `CacheManager` service that invalidates keys within 2 seconds of any data mutation
- **Materialized settlement summaries** as a read-optimized table, refreshed via Kafka stream processing, eliminating the need to run expensive aggregator queries on the live transactions table
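The idea behind a centralized cache manager can be sketched as follows. This is a minimal illustration, not the `CacheManager` service itself: a dictionary stands in for Redis, and the key scheme and method names are assumptions made for the example.

```python
class CacheManager:
    """Centralizes key naming and invalidation so callers never build keys by hand."""

    def __init__(self):
        self._store = {}  # stand-in for Redis; a real client would replace this dict

    @staticmethod
    def key(entity, entity_id, view):
        # One naming scheme for every key, e.g. "merchant:42:balance"
        return f"{entity}:{entity_id}:{view}"

    def get_or_load(self, entity, entity_id, view, loader):
        k = self.key(entity, entity_id, view)
        if k not in self._store:
            self._store[k] = loader()  # cache miss: read from the database
        return self._store[k]

    def invalidate(self, entity, entity_id):
        # Drop every cached view of the mutated entity in one call
        prefix = f"{entity}:{entity_id}:"
        for k in [k for k in self._store if k.startswith(prefix)]:
            del self._store[k]


balances = {"42": 1000}  # stand-in for the database
cache = CacheManager()


def load_balance():
    return balances["42"]


first = cache.get_or_load("merchant", "42", "balance", load_balance)
balances["42"] = 1500                     # a write happens...
stale = cache.get_or_load("merchant", "42", "balance", load_balance)
cache.invalidate("merchant", "42")        # ...and the write path invalidates
fresh = cache.get_or_load("merchant", "42", "balance", load_balance)
print(first, stale, fresh)
```

Because every key is derived from the same `entity:id:view` scheme, a single mutation can invalidate all views of that entity without the caller knowing which views exist.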
#### Step 3: Resilience Primitives (Weeks 7–8)
We implemented the full resilience stack using tactical, pragmatic patterns rather than tying FinFlow to a specific framework:
- **Circuit breakers** using the opossum library were added to every third-party payment gateway integration. Configuration was tuned per gateway based on historical failure rates, with a half-open probe interval of 10 seconds
- **Per-gateway bulkheads** using Node.js worker threads capped the number of concurrent gateway requests — ensuring a slow SBI integration could not starve the UPI channel
- **Timeout and retry budgets** were introduced at the API gateway and service layers, with a hard cap of 250ms per gateway call and a maximum of 2 retries with exponential backoff
- **Degraded authentication path:** If the primary Postgres database was unreachable, a stateless JWT-validating fallback path was activated, allowing user sessions to continue while the primary was restored
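To illustrate the closed, open, and half-open states described above, here is a minimal circuit breaker written from scratch in Python. It is not opossum's API and not FinFlow's configuration: the failure threshold is a hypothetical value, and only the 10-second reset interval comes from the text. The clock is injectable so the demo runs without sleeping.

```python
import time


class CircuitBreaker:
    """Closed -> open after repeated failures -> half-open probe after a timeout."""

    def __init__(self, failure_threshold=3, reset_timeout=10.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # hypothetical value
        self.reset_timeout = reset_timeout          # the 10-second half-open interval
        self.clock = clock
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, fn, fallback):
        state = self.state()
        if state == "open":
            return fallback()                       # fail fast, gateway is not called
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()       # open, or re-open after a failed probe
            return fallback()
        self.failures = 0                           # success closes the circuit
        self.opened_at = None
        return result


def slow_gateway():
    raise TimeoutError("gateway timed out")


now = [0.0]                                         # fake clock for the demo
breaker = CircuitBreaker(failure_threshold=2, clock=lambda: now[0])
for _ in range(2):
    breaker.call(slow_gateway, fallback=lambda: "queued for retry")
print(breaker.state())                              # open
now[0] += 10.0
print(breaker.state())                              # half-open
print(breaker.call(lambda: "authorized", fallback=lambda: "queued for retry"))
print(breaker.state())                              # closed
```

While the circuit is open the gateway is never called, which is what stops retry storms from piling onto a dependency that is already degraded.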
#### Step 4: Infrastructure Hardening (Weeks 8–10)
On the DevOps side, we made simultaneous investments to support the architectural changes:
- **Containerized deployment** with Docker, orchestrated via ECS Fargate, replacing the previous EC2-based deployment
- **Auto-scaling with custom metrics:** Service-level auto-scaling was configured to respond to Kafka consumer lag, API p99 latency, and connection pool saturation — not just CPU utilization
- **Regional failover preparation:** The entire stateless application layer was replicated in the AWS Hyderabad region, with Route 53 weighted-routing tested in a shadow traffic configuration
- **Observability stack:** We established a Prometheus + Grafana + Alertmanager stack with 42 curated dashboards and 87 alerting rules targeting PagerDuty, ensuring the right person gets the right signal before an incident escalates
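Scaling on several signals at once usually means computing a desired capacity per signal and taking the largest. The sketch below shows one way that decision could look; every threshold and limit in it is a hypothetical placeholder, and in practice this logic would live in the platform's scaling policies rather than in application code.

```python
import math


def desired_tasks(current_tasks, consumer_lag, p99_ms, pool_saturation,
                  lag_per_task=5_000, p99_target_ms=750, saturation_target=0.7,
                  min_tasks=2, max_tasks=50):
    """Scale to the most demanding of three signals (all thresholds hypothetical)."""
    # Enough consumers that each one owns at most `lag_per_task` queued messages
    by_lag = math.ceil(consumer_lag / lag_per_task)
    # Proportional scaling: grow the fleet by the ratio of observed to target value
    by_latency = math.ceil(current_tasks * p99_ms / p99_target_ms)
    by_pool = math.ceil(current_tasks * pool_saturation / saturation_target)
    wanted = max(by_lag, by_latency, by_pool)
    return max(min_tasks, min(max_tasks, wanted))


print(desired_tasks(10, consumer_lag=20_000, p99_ms=600, pool_saturation=0.5))
print(desired_tasks(10, consumer_lag=120_000, p99_ms=600, pool_saturation=0.5))
print(desired_tasks(10, consumer_lag=0, p99_ms=1_500, pool_saturation=0.5))
```

The second and third calls show the point of custom metrics: consumer lag or latency can demand more capacity even when CPU would look healthy.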
### Phase 3: Hardening & Launch (Weeks 11–12)
The final phase was purpose-built de-risking. We ran a series of validated load tests — not synthetic benchmarks, but realistic traffic traces captured from the previous Diwali season — to stress-test the new architecture before any live traffic touched it.
#### Load Test Results at 5× Peak
| Metric | Before | After (5× Peak) | Change |
|--------|--------|-----------------|--------|
| p99 latency | 8.7s | 820ms | ↓ 91% |
| Payment success rate | 90.1% | 99.4% | ↑ 9.3pp |
| DB connections used | 890/1000 | 186/1000 | ↓ 79% |
| CPU utilization (slower tier) | 84% | 31% | ↓ 63% |
| Memory utilization | 78% | 41% | ↓ 47% |
We ran the load test three consecutive times with a 30-minute cool-down between runs to ensure the results were repeatable and not the product of temporary optimizations like OS-level caching.
#### Gradual Rollout Strategy
Rather than a big-bang switchover, we rolled out features using feature flags controlled by LaunchDarkly. The rollout was staged as follows:
1. **Warm shadow (Week 11):** 5% of production traffic routed to the new async path; responses were mirrored and compared, with no user-visible changes
2. **Canary (Week 12, Day 1–3):** 10% production traffic; monitored error rates, merchant feedback, and latency distributions
3. **Staged ramp (Week 12, Day 4–7):** 50%, then 75%, then 100% — with daily review checkpoints before each increment
The entire production rollout was completed across 7 days with zero merchant-facing incidents and zero latency SLA violations.
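Percentage-based rollouts depend on each merchant landing in the same cohort every time the flag is evaluated. The sketch below shows a generic hash-bucketing approach to that problem; it is not LaunchDarkly's algorithm or SDK, and the flag name and merchant IDs are hypothetical.

```python
import hashlib


def bucket(merchant_id: str, flag: str) -> int:
    """Map a merchant to a stable bucket in [0, 100) for a given flag."""
    digest = hashlib.sha256(f"{flag}:{merchant_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def uses_new_path(merchant_id: str, rollout_percent: int,
                  flag: str = "async-payment-path") -> bool:
    # A merchant is enabled once the rollout percentage passes its bucket,
    # so raising the percentage only ever adds merchants, never removes them.
    return bucket(merchant_id, flag) < rollout_percent


merchants = [f"merchant-{i}" for i in range(10_000)]
for percent in (10, 50, 75, 100):
    enabled = sum(uses_new_path(m, percent) for m in merchants)
    print(f"{percent}% target -> {enabled / len(merchants):.1%} of merchants enabled")
```

Because buckets are derived from a hash rather than a random draw, a merchant enabled at 10% stays enabled at 50% and 75%, which keeps canary observations comparable across stages.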
## Implementation Details
### Technology Decisions and Rationale
#### Why Kafka Instead of RabbitMQ or SQS?
We evaluated three message queue options. The decision came down to throughput requirements and delivery semantics. FinFlow's Kafka cluster was configured with idempotent producer settings and transactional writes, providing at-least-once delivery with idempotent consumer processing, which is the pragmatic equivalent of exactly-once in distributed payment systems. At 50,000 messages/second in the soak test, Kafka delivered 40% higher throughput than SQS and 3× the throughput of RabbitMQ in our benchmarks.
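The "at-least-once delivery plus idempotent processing" combination is easiest to see on the consumer side. The sketch below is a simplified illustration with hypothetical field names (`event_id`, `merchant_id`, `amount_paise`); it keeps the set of processed IDs in memory, whereas a production consumer would persist it durably, typically in the same database transaction as the ledger write.

```python
class LedgerConsumer:
    """Applies each event exactly once, even if the broker delivers it repeatedly."""

    def __init__(self):
        self.balances = {}
        self.processed = set()  # in-memory for the sketch; must be durable in production

    def handle(self, event):
        if event["event_id"] in self.processed:
            return False        # duplicate delivery: skip all side effects
        merchant = event["merchant_id"]
        self.balances[merchant] = self.balances.get(merchant, 0) + event["amount_paise"]
        self.processed.add(event["event_id"])
        return True


consumer = LedgerConsumer()
first = {"event_id": "evt-1", "merchant_id": "m-42", "amount_paise": 49_900}
second = {"event_id": "evt-2", "merchant_id": "m-42", "amount_paise": 10_000}

# The broker redelivers evt-1, for example after a consumer restart.
results = [consumer.handle(e) for e in (first, first, second)]
print(results, consumer.balances)
```

The duplicate delivery is acknowledged but changes nothing, so the ledger ends up with the same balance it would have had under true exactly-once delivery.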
#### Why Read Replicas Over Vertical Scaling?
A common instinct in FinFlow's situation would have been to vertically scale their database — moving from an r5.2xlarge to r5.4xlarge. We rejected this approach because it's a one-time solution to a growing problem. Read replicas, by contrast, provide near-linear scalability: adding a fourth replica at ₹12 lakh/year versus purchasing a single r5.8xlarge at ₹38 lakh/year produced equivalent read capacity at a fraction of the cost. More importantly, read replicas provide regional distribution capability without requiring a full data center build-out.
#### Why Not Full Microservices Immediately?
This was a common suggestion. We recommended a phased approach: first fix the core latency and resilience issues in the existing codebase, then extract services based on actual operational boundaries identified through production workload analysis. From the soak test data, we identified three natural service boundaries — Payment Gateway, Ledger, and Notifications — which are now being extracted over an 18-week program using strangler-fig patterns.
### Database Partitioning
One of the most impactful yet least visible changes was the partitioning of the transactions table by `payment_date` into monthly partitions. Instead of scanning millions of rows in a single heap table for reconciliation and reporting queries, the planner could now target a single partition — a change that reduced reconciliation query times from 28 seconds to 900ms. Index-only scans on the partitions further drove read performance toward the sub-50ms end for indexed lookups.
| Query Type | Before | After | Reduction |
|------------|--------|-------|-----------|
| Transaction lookup (by ID) | 1.8s | 42ms | 98% |
| Reconciliation batch (24h) | 28s | 900ms | 97% |
| Merchant P&L (monthly) | 6.4s | 210ms | 97% |
| Active settlement count | 8.1s | 380ms | 95% |
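For readers unfamiliar with PostgreSQL declarative partitioning, the sketch below generates the kind of DDL a monthly scheme needs. The table name and `payment_date` key come from the text; the other column names and the helper functions are hypothetical, and the script only builds SQL strings rather than connecting to a database.

```python
from datetime import date

# Hypothetical parent table; on a partitioned table the primary key
# must include the partition key column.
PARENT_DDL = """CREATE TABLE transactions (
    id            bigint NOT NULL,
    payment_date  date   NOT NULL,
    amount_paise  bigint NOT NULL,
    PRIMARY KEY (id, payment_date)
) PARTITION BY RANGE (payment_date);"""


def next_month(d: date) -> date:
    """First day of the month after `d`."""
    return date(d.year + (d.month == 12), d.month % 12 + 1, 1)


def partition_ddl(parent: str, first: date, count: int):
    """Yield CREATE TABLE statements for `count` consecutive monthly partitions."""
    start = first
    for _ in range(count):
        end = next_month(start)
        name = f"{parent}_{start:%Y_%m}"
        # Range bounds are inclusive at FROM and exclusive at TO
        yield (f"CREATE TABLE {name} PARTITION OF {parent} "
               f"FOR VALUES FROM ('{start}') TO ('{end}');")
        start = end


print(PARENT_DDL)
for statement in partition_ddl("transactions", date(2025, 11, 1), 3):
    print(statement)
```

A query that filters on `payment_date` lets the planner prune to the matching partition, which is the mechanism behind the reconciliation speed-up described above.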
## Results
The FinFlow team confirmed the results during the first full month of Diwali-season traffic following the deployment.
### Festival Season Performance
During the Diwali 2025 sale window (October 12–18), FinFlow processed 89.4 million transactions across 7 days. The numbers told a remarkable story:
- **Payment success rate: 99.63%**, which cut the failure rate from 4.9% on the worst day of the previous Diwali to 0.37%, a reduction of roughly 92%
- **p99 API latency: 780ms** — under the 2-second UPI optimal threshold by a wide margin
- **Zero SLA violations** — the ₹12 crore merchant SLA penalty clause that had been a source of board-level risk was never triggered
- **Merchant support tickets: 198/day** — down from the projected 420/day, a reduction of 53%
On the merchant side, FinFlow's NPS (Net Promoter Score) moved from 31 to 58 in the quarter following the deployment, the single largest quarterly jump in the company's history.
### Infrastructure Cost
Perhaps the most counterintuitive finding was the impact on infrastructure spend. Despite handling 5× more peak traffic at Diwali 2025 compared to Diwali 2024, monthly infrastructure costs actually decreased:
| Component | Pre-migration Cost | Post-migration Cost | Change |
|------------|-------------------|--------------------|--------|
| Compute (EC2/ECS) | ₹18.2L/month | ₹11.4L/month | ↓ 37% |
| Database (RDS) | ₹6.8L/month | ₹3.2L/month | ↓ 53% |
| Cache (ElastiCache) | ₹1.1L/month | ₹1.8L/month | ↑ 64% |
| Kafka + Messaging | ₹0 | ₹1.4L/month | new |
| Monitoring (Datadog) | ₹0.9L/month | ₹2.1L/month | ↑ 133% |
| **Total** | **₹27.0L/month** | **₹19.9L/month** | ↓ 26% |
The 26% total cost reduction compares with a projected 140% increase that would have been required to handle 5× traffic without the architecture changes. The margin improvement on this alone was equivalent to adding ₹2.5 crore in annual revenue, without signing a single new merchant.
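The totals in the table can be reproduced directly from its rows. The short script below does only that arithmetic; the component figures are copied from the table, and nothing else is assumed.

```python
# component: (pre-migration, post-migration), in ₹ lakh per month
costs_lakh = {
    "Compute (EC2/ECS)": (18.2, 11.4),
    "Database (RDS)": (6.8, 3.2),
    "Cache (ElastiCache)": (1.1, 1.8),
    "Kafka + Messaging": (0.0, 1.4),
    "Monitoring": (0.9, 2.1),
}

pre_total = sum(pre for pre, _ in costs_lakh.values())
post_total = sum(post for _, post in costs_lakh.values())
change_pct = (post_total - pre_total) / pre_total * 100
annual_saving_lakh = (pre_total - post_total) * 12

print(f"pre:  ₹{pre_total:.1f}L/month")
print(f"post: ₹{post_total:.1f}L/month")
print(f"change: {change_pct:.0f}%")
print(f"annual saving: ₹{annual_saving_lakh:.1f}L")
```

The direct saving works out to ₹7.1 lakh per month, or about ₹85 lakh per year. The revenue-equivalent figure quoted in the text depends on a margin assumption that the table does not state, so it is not reproduced here.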
## Metrics Summary
The complete dashboard view of key performance indicators, comparing the 30-day baseline (pre-migration) against the first 30 days post-deployment, gives the full picture:
| KPI | Baseline | Post-deployment | Change |
|-----|----------|----------------|--------|
| Payment success rate | 97.3% | 99.61% | ↑ 2.31pp |
| p99 latency | 2.8s | 750ms | ↓ 73% |
| p95 latency | 1.9s | 420ms | ↓ 78% |
| Infrastructure cost | ₹27.0L/mo | ₹19.9L/mo | ↓ 26% |
| Merchant NPS | 31 | 58 | ↑ 87% |
| Support tickets (avg/day) | 483 | 203 | ↓ 58% |
| Circuit breaker activations | N/A | 47 incidents | proactively mitigated 37 |
| Deployment frequency | 0.8/week | 4.2/week | ↑ 425% |
| Time to detect incidents | 18 min | 4 min | ↓ 78% |
| Mean time to resolve | 47 min | 19 min | ↓ 60% |
## Lessons Learned
This 12-week engagement produced a set of hard-won lessons that we have applied to subsequent projects and believe are broadly applicable:
### 1. Measure Before You Build
The temptation to rewrite the payment engine from scratch was real. The engineering team was inspired by the greenfield vision. But our baseline profiling revealed that most of the latency problems traced to a single, recently added logging middleware — a 3-line fix that cost ₹0 and would have been buried forever under a rewrite. Always measure before you build.
### 2. Async at the Edge Delivers Latency Wins That Synchronous Patterns Cannot
The difference between 8 seconds and 780ms was not a better algorithm. It was a single reordering of operations — making settlement and notifications asynchronous consumers of a dispatched event. When users don't need to wait for ledger updates to complete to see a payment confirmation, don't make them wait.
### 3. Infrastructure Cost Is a Design Constraint, Not an Afterthought
FinFlow's primary concern was performance, but cost was the metric that won board-level adoption. The fact that a 73% latency improvement came with a 26% cost reduction was the reason the engineering team got budget approval for the microservices extraction that followed. Performance and cost optimization are not competing concerns — they are the same concern, viewed from different angles.
### 4. Load Tests Must Mirror Reality
During our load testing, we discovered that our synthetic traffic script was generating uniformly distributed payment gateway selectors. Real traffic had a 76% concentration on 3 gateways (Razorpay, Paytm, and PhonePe). Running the realistic traffic model, we found that connection saturation on those three gateways was the actual bottleneck — not the total transaction rate. Synthetic load tests consistently underestimate real-world congestion.
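The difference between the two traffic models is easy to reproduce. The sketch below compares uniform gateway selection with a skewed model in which three gateways receive 76% of requests. The total of twelve gateways, the generic tail names, and the equal split within each group are assumptions made for illustration.

```python
import random
from collections import Counter

TOP_GATEWAYS = ["Razorpay", "Paytm", "PhonePe"]
TAIL_GATEWAYS = [f"gateway-{i}" for i in range(1, 10)]  # hypothetical long tail
ALL_GATEWAYS = TOP_GATEWAYS + TAIL_GATEWAYS


def uniform_weights():
    return {g: 1 / len(ALL_GATEWAYS) for g in ALL_GATEWAYS}


def realistic_weights(top_fraction=0.76):
    # Assumption: traffic is split evenly within the top group and within the tail
    weights = {g: top_fraction / len(TOP_GATEWAYS) for g in TOP_GATEWAYS}
    weights.update({g: (1 - top_fraction) / len(TAIL_GATEWAYS) for g in TAIL_GATEWAYS})
    return weights


def simulate(weights, requests=100_000, seed=7):
    """Draw a gateway for each simulated request and count the picks."""
    rng = random.Random(seed)
    names = list(weights)
    picks = rng.choices(names, weights=[weights[n] for n in names], k=requests)
    return Counter(picks)


def share_on_top_gateways(counts):
    return sum(counts[g] for g in TOP_GATEWAYS) / sum(counts.values())


uniform_counts = simulate(uniform_weights())
realistic_counts = simulate(realistic_weights())
print(f"uniform model:   {share_on_top_gateways(uniform_counts):.1%} on the top three")
print(f"realistic model: {share_on_top_gateways(realistic_counts):.1%} on the top three")
```

Under the uniform model each of the three busiest gateways sees about a twelfth of the load; under the skewed model each sees about a quarter, which is why per-gateway connection limits are reached long before the aggregate transaction rate looks alarming.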
### 5. Gradual Rollout and Observability Are Not Optional
The launch day incident at a major Indian payment company in March 2024 — where a simultaneous full-region migration caused 4 hours of downtime during a festival sale — was fresh in stakeholders' minds. Our shadow→canary→staged rollout with 42 dedicated dashboard panels ensured that every percentile of traffic was observed and every degradation threshold was tested in the canary phase before full rollout. Zero live incidents at launch was not luck. It was a deliberate rollout design.
### 6. Team Enablement Compounds Success
The 3-month engagement included a structured knowledge transfer program: 20 hours of architecture deep-dives, 12 hours of load-test execution training, and rolling code reviews during the implementation. Three weeks after go-live, the FinFlow team was independently handling incidents, conducting capacity reviews, and planning the microservice extraction without Webskyne editorial support. Sustainable success requires building capability, not just fixing infrastructure.
---
*Webskyne editorial partnered with FinFlow over a 12-week engagement to design and implement this architecture overhaul. The project involved 2 principal architects, 3 platform engineers, and 1 DevOps specialist, with 12 FinFlow engineers embedded throughout the engagement for knowledge transfer.*