Scaling Fintech at Velocity: How FinFlow Processed 3M+ Daily Transactions Without Breaking a Sweat
When FinFlow's payment platform hit 2.9 million daily transactions in just 18 months, the engineering team knew they needed more than a patch: they needed an entirely new architecture. This case study walks through the end-to-end transformation, from a Rails monolith buckling under its own success to a cloud-native, event-driven system that now handles 4.7 million transactions per day at 99.98% uptime. Every decision, metric, lesson learned, and misstep is documented.
Case Study · Fintech · Microservices · System Design · Payment Architecture · Cloud Engineering · Observability · Event-Driven Architecture · DevOps
## Overview
FinFlow is a mid-sized digital payments platform that processes peer-to-peer transfers, merchant payouts, bill payments, and wallet top-ups across 12 countries in Southeast Asia and the Middle East. Founded in 2019 by a team of ex-bank engineers and fintech product builders, FinFlow's stated mission has always been straightforward: make money movement as frictionless and invisible as sending a text message.
By early 2024, FinFlow had onboarded 4.8 million active users and 180,000 registered merchant partners. Their product suite included wallet accounts, virtual cards, scheduled payouts, invoice payments, cross-border remittances, and a developer-friendly Payouts API for businesses integrating FinFlow into their own platforms.
The company was growing fast. Too fast.
Transactions had grown from roughly 400,000 per day in January 2023 to 2.9 million per day by June 2024 — a 7× increase in 18 months. Yet the engineering team had barely touched the core architecture since launching the MVP. What had started as a pragmatic Ruby-on-Rails monolith had slowly accumulated layers of emergency patches, on-call fire drills, and post-incident writeups that were never fully acted upon.
By May 2024, FinFlow's platform was in a state engineers internally called “controlled chaos” — and they needed to fix it before it broke for real.
This case study tracks that transformation in full: how a team of 14 engineers diagnosed the problem, designed a new architecture, migrated 12 million user accounts and all live payment traffic without downtime, and emerged with a platform that now runs at 99.98% uptime and processes more than 4.7 million daily transactions at a p99 latency of 127ms.
## Challenge
FinFlow's problems were not a single point of failure. They were a constellation.
The first and most urgent issue was **database contention**. The primary PostgreSQL instance was shared across every service call: user reads, wallet balances, transaction writes, audit logs, webhooks, and reconciliation jobs. During peak hours (roughly 7–10 PM local time across active time zones), query latency spiked from a baseline of 12ms to over 2,400ms. The database server's CPU utilization regularly hit 94–97% for 3–4 hours every evening. Connection pool saturation was common; the ORM layer frequently returned “too many connections” errors that cascaded into retry storms.
Second was **transaction reliability**. The Rails monolith processed every payment through a single code path using synchronous HTTP calls to 7 different downstream services and banking partners. Payment success rates were hovering around 96.2%, well below the 99.9% needed to satisfy merchant and regulatory commitments. Payment requests were rarely idempotent, so retrying a failed transaction sometimes created a duplicate payout. The manual reconciliation team was spending 60+ hours per week untangling these duplicates.
Third was **observability and debugging**. The three-person on-call rotation was receiving an average of 17 alerts per night. Of those, engineers estimate that only about 2 were genuine actionable incidents; the rest were false positives from poorly configured alerting. Root-cause analysis sessions regularly took 3–6 hours because traces, logs, and metrics simply weren't connected. Mean Time To Recovery (MTTR) on actual incidents was averaging 4.2 hours — unacceptable in a payments platform.
Fourth was **deployment risk**. Deploys were scheduled weekly, often on Thursday evenings, and took 45–90 minutes due to database fixtures, cache warming, and the need to manually verify feature flags. Every deploy carried genuine fear. Rollbacks were manual. The team had developed a culture of avoiding deploys and stacking changes in ever-larger batches, which made each deployment riskier than the last.
And finally, **regulatory pressure** was mounting. FinFlow was expanding into new countries with stricter reporting requirements. The legacy codebase had audit trails that were partially written to text files and partially stored in Postgres, making regulatory reporting increasingly painful and potentially non-compliant as requirements sharpened.
In short: the platform was at real risk of an incident that could damage user trust or trigger regulatory scrutiny — and there was no clear path forward.
## Goals
The leadership team and the engineering leads met over two full days in June 2024 to agree on the objectives before any technical work began. The goals were deliberately ambitious because the alternative — firefighting forever — was simply not sustainable.
**Reliability targets:** Achieve 99.9% payment success first, then 99.95%. Reduce MTTR from 4.2 hours to under 30 minutes. Eliminate duplicate transactions entirely. Achieve 99.99% uptime SLA.
**Performance targets:** Reduce p99 transaction latency from 3,400ms to under 200ms. Sustain 5 million daily transactions without degradation. Handle 10× peak load without cascading failures.
**Operational targets:** Reduce on-call alerts from 17 per night to fewer than 3 actionable alerts per shift. Move from weekly deploys to multiple deploys per day. Achieve zero-downtime deploys.
**Compliance targets:** Generate daily regulatory reports automatically, auditable and immutable, within 10 minutes of the end of each day.
**Team targets:** The 14-engineer team needed to be able to ship independently, own their services end-to-end, and not block each other waiting for infrastructure or coordination.
The key realization during those two days was that these targets were deeply interdependent. You could not reduce MTTR without observability. You could not deploy multiple times per day without a robust, independently-versioned service architecture. You could not achieve compliance targets without restructuring the audit trail layer. The work was a connected whole — not a checklist.
## Approach
The solution was not to patch the monolith. It was to systematically replace it — service by service, boundary by boundary — without taking the platform offline.
### 1. Domain-Driven Design and Service Boundaries
The first month was a deliberate period of reconnaissance before any code changes. The 14-person engineering team — augmented temporarily by a senior DDD consultant for two workshops — held an intensive event-storming exercise over 8 days, mapping all business events, aggregates, domains, and user journeys across the entire payment lifecycle: onboarding, wallet creation, peer-to-peer transfer, merchant payout, inter-country remittance, and reconciliation.
Six core domains emerged clearly: **Identity** (KYC, user accounts, authentication), **Wallet** (balances, ledger, settlements), **Transact** (payment initiation, idempotency, status tracking), **Partner** (banking integrations, payout routing), **Observability** (metrics, logs, traces, alerting), and **Platform** (billing, configuration, feature flags).
The guiding rule from the beginning was: one service, one responsible team. The Identity and Wallet services would be owned by the core infrastructure team (3 engineers). The Transact service, the most critical path, would be owned by the payments team (4 engineers). The Partner integration service would be owned by the banking partnerships team (3 engineers). The Observability and Platform services would span all teams but be owned primarily by the platform team (2 engineers).
Each service boundary was drawn around a clear business capability, not around convenience or history.
### 2. Event-Driven Architecture with a Transaction Outbox
One of the trickiest problems in a payment system is how to avoid the dual-write problem: writing to a transaction log table while simultaneously trying to publish an event to a message broker. Both operations need to succeed or both need to fail — otherwise you have inconsistent state.
The team settled on the **transaction-outbox pattern** as the canonical event publishing mechanism for all production services. Every service writes an event row inside its own database transaction and commits both the business state change and the event row atomically. A small, lightweight polling service (built in Go, minimal footprint) reads unprocessed outbox rows and publishes them to a Kafka cluster with strict ordering guarantees per aggregate ID.
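This pattern guarantees at-least-once delivery: an event can be published more than once if the relay crashes between publishing and marking the row as processed, but it is never lost. Paired with idempotent consumers, this yields effectively-once processing, and the system can recover from any combination of database and broker failures without manual intervention.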
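The outbox mechanics described above can be sketched in a few lines. This is a minimal illustration using an in-memory SQLite database and a plain callback in place of Kafka, not FinFlow's Go relay; the table layout and the names `record_transfer` and `relay_once` are assumptions made for the example.

```python
import json
import sqlite3


def init_schema(conn):
    # Hypothetical schema: one business table plus an outbox table.
    conn.executescript("""
        CREATE TABLE transfers (id TEXT PRIMARY KEY, amount_cents INTEGER NOT NULL);
        CREATE TABLE outbox (
            seq INTEGER PRIMARY KEY AUTOINCREMENT,
            aggregate_id TEXT NOT NULL,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
    """)


def record_transfer(conn, transfer_id, amount_cents):
    """Write the business row and its event row in one database transaction."""
    with conn:  # commits on success, rolls back both inserts on any exception
        conn.execute("INSERT INTO transfers VALUES (?, ?)", (transfer_id, amount_cents))
        payload = json.dumps(
            {"type": "TransferRecorded", "id": transfer_id, "amount_cents": amount_cents}
        )
        conn.execute(
            "INSERT INTO outbox (aggregate_id, payload) VALUES (?, ?)",
            (transfer_id, payload),
        )


def relay_once(conn, publish):
    """Publish unpublished rows in insertion order; return how many were sent."""
    rows = conn.execute(
        "SELECT seq, aggregate_id, payload FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, aggregate_id, payload in rows:
        # A crash after publish() but before the UPDATE re-sends this row on
        # the next poll, which is why delivery is at-least-once.
        publish(aggregate_id, payload)
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)
```

In a real deployment the `publish` callback would hand the payload to a broker client keyed by `aggregate_id`, so that events for one aggregate stay in order.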
### 3. Database Per Service and the CQRS Split
Every new service received its own PostgreSQL instance. For the most latency-sensitive read paths — real-time balance checks, transaction status lookups, merchant payout dashboards — the team implemented **CQRS** (Command Query Responsibility Segregation), separating the write model (normalized PostgreSQL) from the read model (denormalized materialized views refreshed within 5 seconds using CDC — Change Data Capture via Debezium connectors).
Read-heavy dashboards now hit the read model with sub-50ms response times, while the write path remained clean and normalized.
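To make the write-model/read-model split concrete, here is a small sketch of a projector that folds ledger changes into a denormalized balance view. It does not use Debezium's API; `LedgerChange`, `BalanceReadModel`, and the `position` field are hypothetical names, and the change stream is assumed to carry a monotonically increasing position per wallet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LedgerChange:
    wallet_id: str
    delta_cents: int
    position: int  # assumed: increasing offset of this change in the stream


class BalanceReadModel:
    """Denormalized view answering balance lookups without touching the write model."""

    def __init__(self):
        self._balances = {}
        self._last_position = {}

    def apply(self, change):
        # Ignore changes at or before the last applied position, so a
        # replayed stream segment does not double-count.
        if change.position <= self._last_position.get(change.wallet_id, -1):
            return False
        current = self._balances.get(change.wallet_id, 0)
        self._balances[change.wallet_id] = current + change.delta_cents
        self._last_position[change.wallet_id] = change.position
        return True

    def balance(self, wallet_id):
        return self._balances.get(wallet_id, 0)
```

Tracking the last applied position is one way to keep the projection safe under redelivery, which matters when the upstream delivery guarantee is at-least-once.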
### 4. Idempotency Everywhere
The duplicate payment problem was the single most painful legacy issue. The final year of the legacy monolith was dominated by incident reviews, each trying to untangle what had happened with a particular failed transaction.
The decision was made early: every new service endpoint would accept an optional `Idempotency-Key` header. If a request arrived with a key that matched a previous request, the stored result would be returned immediately, with no re-execution. The idempotency key was also enforced as a unique constraint at the database level, so even two concurrent retries that both passed the application-level check could not create a second transaction row.
The results were dramatic: after the migration, duplicate payment incidents dropped to zero across a 9-month observation period, and no subsequent incident report cited duplicate transactions as a contributing factor.
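The combination of a stored result and a database-level unique constraint can be sketched as follows. This is an illustration under stated assumptions, not FinFlow's schema: the `payments` table, its columns, and `create_payment` are hypothetical, and SQLite stands in for PostgreSQL.

```python
import sqlite3


def init_schema(conn):
    # The UNIQUE constraint is the structural guard: a second insert with the
    # same key fails in the database even if application checks are bypassed.
    conn.execute("""
        CREATE TABLE payments (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            idempotency_key TEXT NOT NULL UNIQUE,
            amount_cents INTEGER NOT NULL,
            status TEXT NOT NULL
        )
    """)


def create_payment(conn, idempotency_key, amount_cents):
    """Return (payment_id, status, replayed) for a possibly repeated request."""
    try:
        with conn:
            cursor = conn.execute(
                "INSERT INTO payments (idempotency_key, amount_cents, status) "
                "VALUES (?, ?, 'succeeded')",
                (idempotency_key, amount_cents),
            )
            return cursor.lastrowid, "succeeded", False
    except sqlite3.IntegrityError:
        # Key already used: return the stored result instead of re-executing.
        row = conn.execute(
            "SELECT id, status FROM payments WHERE idempotency_key = ?",
            (idempotency_key,),
        ).fetchone()
        return row[0], row[1], True
```

A retry with the same key returns the original payment ID with `replayed=True`, and the table never holds more than one row per key.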
### 5. Observability as a First-Class Concern
Observability was not an afterthought appended at the end of the migration. It was designed alongside each service.
The platform adopted **OpenTelemetry** for distributed tracing, with every HTTP call and every Kafka message instrumented from day one. Correlation IDs flowed end-to-end from the user-facing API gateway all the way through to banking partner webhooks and back. Any failed transaction could be traced across 6 services in under 60 seconds by searching the correlation ID — a dramatic improvement over the 3–6 hour root-cause sessions.
For metrics, **Prometheus** collected system and application-level gauges, counters, and histograms. **Grafana** dashboards were standardized across every service — adoption was high partly because every new service shipped with its own pre-built dashboard.
For logging, **Loki** collected structured JSON logs (every field typed at serialization time with schema enforcement) with labels for service name, environment, and outcome. Query isolation between production and staging environments was enforced at the infrastructure level.
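As a rough illustration of how a correlation ID can ride along with structured JSON logs, the sketch below uses only the Python standard library. It is not the OpenTelemetry or Loki API; the field names and the helpers `make_logger` and `handle_request` are assumptions chosen for the example.

```python
import contextvars
import io
import json
import logging

# Holds the correlation ID for the request currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a fixed set of fields."""

    def format(self, record):
        return json.dumps({
            "service": record.name,
            "level": record.levelname,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
            "outcome": getattr(record, "outcome", None),
        })


def make_logger(service, stream):
    logger = logging.getLogger(service)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger, incoming_id):
    # Bind the incoming ID for the duration of the request, then restore.
    token = correlation_id.set(incoming_id)
    try:
        logger.info("payment accepted", extra={"outcome": "success"})
    finally:
        correlation_id.reset(token)


buffer = io.StringIO()
log = make_logger("transact", buffer)
handle_request(log, "req-123")
print(buffer.getvalue(), end="")
```

Because every log line carries the same `correlation_id` field, a single search on that value returns the request's lines from every service that logged it.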
Alerting was reconfigured aggressively. Noisy, cause-based alerts were eliminated. Alerts were built solely around **symptom-based** conditions: “payment success rate below 99.5%” rather than “database CPU above 90%”. On-call alerts dropped from an average of 17 per night to a stable 1.8 per night over the nine months following the migration, and most of those 1.8 alerts were genuinely actionable.
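A symptom-based rule like the one quoted above can be expressed as a check over recent payment outcomes. The sketch below is a simplified in-process version, not a Prometheus alerting rule; the class name, the sliding-window size, and the minimum-sample guard are assumptions added for illustration.

```python
from collections import deque


class SuccessRateAlert:
    """Fires when the success rate over the last `window` payments drops below `threshold`."""

    def __init__(self, threshold=0.995, window=1000, min_samples=100):
        self.threshold = threshold
        self.min_samples = min_samples
        self._outcomes = deque(maxlen=window)  # oldest outcomes fall off automatically

    def record(self, succeeded):
        self._outcomes.append(bool(succeeded))

    def success_rate(self):
        if not self._outcomes:
            return None
        return sum(self._outcomes) / len(self._outcomes)

    def firing(self):
        # Stay quiet until there is enough data to avoid paging on a handful
        # of requests right after startup.
        if len(self._outcomes) < self.min_samples:
            return False
        return self.success_rate() < self.threshold
```

The rule pages only when users are actually affected; a busy database that still completes payments successfully produces no alert.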
### 6. CI/CD, Deploy Pipelines, and Feature Flags
The deployment story was rebuilt from the ground up using **GitHub Actions**, with each service having its own pipeline triggered by PR or tag events. Each service had its own staging environment for automatic integration testing. After merging, a canary deployment routed 5% of production traffic to the new build and automatically rolled back if the success rate declined by more than 2% or p99 latency increased by more than 50ms.
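The canary rollback rule can be written as a small pure function. This is a sketch only: the `Snapshot` type and `should_roll_back` are hypothetical names, and the 2% threshold is interpreted here as two percentage points of success rate, which is an assumption since the source does not say whether the figure is absolute or relative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    success_rate: float    # fraction between 0 and 1
    p99_latency_ms: float


def should_roll_back(baseline, canary, max_success_drop=0.02, max_p99_increase_ms=50.0):
    """Return True when the canary is worse than the baseline beyond either limit."""
    success_drop = baseline.success_rate - canary.success_rate
    p99_increase = canary.p99_latency_ms - baseline.p99_latency_ms
    return success_drop > max_success_drop or p99_increase > max_p99_increase_ms
```

Comparing the canary against a baseline measured over the same time window, rather than against a fixed number, keeps ordinary traffic swings from triggering rollbacks.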
**Feature flags** (using the open-source Unleash platform) were introduced at every boundary where behavioral changes could affect users. This gave the team the ability to turn features off in production instantly — without a deploy — if something went wrong.
## Implementation Timeline
The full migration took 10 months, from July 2024 to April 2025. The project was managed in a rolling fashion, with two sprints per month, never pausing production development entirely.
**Months 1–2 (July–August 2024):** Event-storming workshops, boundary definition, observability platform setup, Kafka cluster provisioning. No production code changes yet — but the foundation was being laid.
**Months 3–4 (September–October 2024):** Identity and Wallet services were extracted from the monolith using the **strangler fig pattern**: critical identity and wallet-read paths were reimplemented as the first new services, and traffic was moved to them behind a proxy layer. This was the team's first deployment of a greenfield microservice. Payment transactions were still processed through the monolith, but wallet top-ups and KYC calls now hit the new services via the proxy.
**Months 5–6 (November–December 2024):** The Transact service — the core payment orchestration engine — was built and deployed as a new service. The monolith was reconfigured to route all new payment initiation requests to the Transact service via the API Gateway. A dual-write mechanism kept the old PostgreSQL replica in sync during a parallel running period.
**Months 7–8 (January–February 2025):** The Partner and banking integrations service was migrated. The audit-trail and compliance layer was replaced with an immutable log store using event streams for backfill. Regulatory reporting automation tools were built and stood up for all 12 countries.
**Months 9–10 (March–April 2025):** The final sunset. Remaining read-only monolith queries were routed to the new read replicas and CQRS views. The monolith was taken to read-only mode. After two weeks of zero-alert operations, the monolith was permanently decommissioned.
Throughout the entire period, the team ran **chaos engineering** sessions every two weeks using custom fault injection to test graceful degradation, circuit breaker behavior, and recovery under real conditions. Internal post-mortems were written and shared in full — no blame, only learning.
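The proxy layer used during the strangler-fig months can be pictured as a routing table that grows as services are extracted. The sketch below is illustrative only; the path prefixes, service names, and `choose_backend` are hypothetical and not taken from FinFlow's gateway configuration.

```python
# Hypothetical routing table: paths already migrated map to new services.
ROUTES = {
    "/kyc/": "identity-service",
    "/wallet/top-ups": "wallet-service",
}


def choose_backend(path, routes=ROUTES, default="monolith"):
    """Pick the backend for a request path; the longest matching prefix wins."""
    matches = [prefix for prefix in routes if path.startswith(prefix)]
    if not matches:
        return default  # anything not yet migrated still goes to the monolith
    return routes[max(matches, key=len)]
```

Migrating another path means adding one entry to the table, and reverting it means removing that entry, which is what makes each step of the migration reversible.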
## Results
The numbers tell the story.
| Metric | Before | Target | Actual |
|--------|--------|--------|--------|
| Daily Transactions | 2.9M | 5M+ | 4.7M+ |
| Payment Success Rate | 96.2% | 99.9% | 99.93% |
| p99 Transaction Latency | 3,400ms | <200ms | 127ms |
| MTTR | 4.2 hours | <30 min | 18 min |
| On-Call Alerts / Night | 17 | <3 | 1.8 |
| Monthly Deploy Count | 2 | 20+ | 36 |
| System Uptime | 99.5% | 99.99% | 99.98% |
| Duplicate Payment Incidents | 3–4 / month | 0 | 0 |
| Regulatory Report Generation | Manual, 4–6h | <10 min automated | 6 min |
**Reliability and performance wins** were the most immediately visible. p99 latency dropped from 3,400ms to 127ms, a roughly 27× improvement. Payment success rate climbed from 96.2% to 99.93%, clearing the 99.9% target. MTTR dropped 93%, from 4.2 hours to 18 minutes. Monthly deploys went from 2 to 36, a 1,700% increase in release velocity. Duplicate payment incidents went from 3–4 per month before the migration to exactly zero in the 9 months following it.
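The headline ratios can be recomputed directly from the Before and Actual columns of the table. The helper names below are introduced only for this check.

```python
def fold_improvement(before, after):
    return before / after


def percent_reduction(before, after):
    return (before - after) / before * 100


def percent_increase(before, after):
    return (after - before) / before * 100


latency_fold = fold_improvement(3400, 127)         # p99 latency in ms
mttr_reduction = percent_reduction(4.2 * 60, 18)   # MTTR in minutes
deploy_increase = percent_increase(2, 36)          # deploys per month

print(f"{latency_fold:.1f}x, {mttr_reduction:.1f}%, {deploy_increase:.0f}%")
# prints: 26.8x, 92.9%, 1700%
```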
The **operational burden** on the team was dramatically reduced, and this had a noticeable impact on team morale. Engineers who had been burning out from alert fatigue reported feeling calm again. On-call became something to tolerate rather than dread.
**Regulatory compliance** was achieved for all 12 currently-active jurisdictions. The immutable event log store and automated daily reports meant regulatory auditors received fully-auditable data within minutes of the end of the day. The compliance lead estimated saving 120 hours per month previously spent on manual report generation and validation.
The business impact was real: merchant acquisition accelerated by 40% in the two quarters following the migration, in part because the platform could now support high-volume enterprise merchant contracts that required strict SLA commitments. Monthly active users grew from 4.8 million to 7.2 million — a 50% increase in nine months — and the team was able to ship cross-border remittance features that had been sitting in the roadmap for over a year, blocked by infrastructure capacity.
## Lessons Learned
Rewriting a core payments platform is a high-risk undertaking that should not be entered into lightly. The FinFlow team spent a full 10 months on careful, measured, intentional change, and still learned hard lessons along the way.
**Lesson 1: Domain maps are worth the effort.** The 8-day event-storming exercise felt expensive at the time — 14 people in a room for nearly two weeks, no shipping. In retrospect, no investment was more valuable. The resulting domain boundaries were the single most important decision made, and without the mapping exercise, they almost certainly would not have gotten it right across six domains simultaneously. The guideline that worked was: draw every event on a physical sticky note, move it around together, debate ferociously, and do not decide until everyone agrees the services map makes sense. Uncomfortable in the moment; indispensable forever.
**Lesson 2: Idempotency is not optional — it is the responsibility layer.** The legacy duplicate-payment problem was one of the single most expensive problems in terms of engineering hours, user trust, and regulatory scrutiny. Making idempotency a first-class requirement on every new endpoint — including a database-level unique constraint — eliminated an entire category of incidents. Treat idempotency as a structural/systems problem, not a “just add a header” problem.
**Lesson 3: Observability must be designed with the service, not appended later.** Stitching together distributed tracing, structured logging, and standardized dashboards *after* all services have shipped is far harder than building it in as each service is designed. The team found that services that shipped with full instrumentation required 30% less time to provision and configure than services that inherited it later. Bake observability into the service template rather than adding it service by service.
**Lesson 4: Strangler fig beats big-bang migration.** The team evaluated a full big-bang cutover during initial planning and abandoned it when the risk assessment showed it would require a 72-hour deployment window with no rollback path. The strangler fig pattern, which routes traffic to new services incrementally, meant the team could ship services, validate them in production, and take a step back if a service was not performing as expected. It also meant that if the migration had failed at any point, they could have stopped, reverted, and tried again without losing the production platform.
**Lesson 5: Chaos engineering is a confidence multiplier.** Running controlled fault injection sessions every two weeks during the migration meant that the team trusted their own system. When real incidents eventually happened during those months, the team had practiced recovering from the exact same scenarios. The feeling of knowing you can handle failures before they happen pays enormous dividends in execution speed.
**Lesson 6: Compliance needs to be a service concern, not a compliance team concern later.** Building the immutable event log and automated reporting into the service architecture as it was being built — not retrofitted onto an existing system six months later — made the compliance handoff a formality rather than a multi-month project. Designing for auditability from day one saves vastly more time than retrofitting it.
**Lesson 7: Culture is the hardest migration.** The code migration was hard. The infrastructure migration was hard. The operational tooling migration was hard. But the cultural shift — from one person feeling personally responsible for everything to shared ownership across independently-functioning teams — was the most important and most difficult change the organization made. Blameless post-mortems, generous documentation, formal SR (Service Review) processes, and team-level autonomy were as important as any technology decision. A platform that is technically sound but culturally dysfunctional will not last. The team made sure the people part was prioritized alongside the technical part from the very beginning.
Moving forward, the team is already working on the next phase: stream processing with materialized views for real-time analytics dashboards for merchants, further expansion into additional countries, and a platform SDK to make it easier for third-party developers to integrate directly with the new service APIs without touching the legacy integration layer.
The FinFlow story is a reminder that in fintech architecture, as in most things, structural problems need structural solutions. When you are growing at 7× in 18 months, “just ship it” is not a strategy. But ship it you must. The middle path, a deliberate, well-planned, risky but carefully managed migration, is the one that has turned FinFlow from a growing startup with problems into the kind of mature platform the rest of the industry studies when it needs to understand what real scale looks like. Versioned service boundaries, guaranteed idempotency, observability by default, and an event-driven backbone: that is the architecture that held up as daily volume grew more than tenfold, from 400,000 to 4.7 million transactions, and kept payments reliable.