When FinVerse hit 180,000 concurrent users on a monolithic Node.js backend that had barely changed since Series A, leadership made a difficult call: pause product development for five months and rebuild the entire infrastructure. This case study documents exactly how the engineering team deconstructed a breaking monolith, introduced event-driven microservices, built a resilient data layer, and rolled out a CDN-edge caching strategy — ultimately enabling FinVerse to process 1.4 million financial transactions per day with 99.97% uptime. The full story includes the mistakes made, the technology choices justified, and the measurable results that followed.

Overview

FinVerse, a Bengaluru-based neo-finance startup founded in 2021, offers a unified mobile application that lets users invest in mutual funds, track personal credit scores, automate savings goals, and generate real-time portfolio analytics — all tied to a single wallet account. By early 2024, the company had raised $18 million across two funding rounds and employed 140 people across engineering, product, compliance, and customer success.

The platform had attracted 50,000 active monthly users through organic growth and a smart referral programme, running almost entirely on a single monolithic Node.js Express backend backed by a single MongoDB Atlas cluster. Core functionality — wallet operations, portfolio rebalancing, KYC verification, and notification delivery — all lived within the same codebase, shared the same database connection pool, and were deployed as a single PM2 process on two EC2 t3.medium instances behind an Application Load Balancer.

In February 2024, a viral social media campaign doubled FinVerse's user base in ten days. The tech team, which had grown from four to twelve engineers, watched as the application buckled under the new load. API response times jumped from a steady 220 ms to over 14 seconds. WebSocket connections dropped on one in every three new wallet-creation requests. The incident log filled with alerts before the monitoring system itself collapsed. That was the moment FinVerse's CTO, Aisha Khalil, commissioned what would become the most ambitious infrastructure overhaul in the company's history.

The Challenge

The problems were deep and interconnected. Every new feature — from gamified savings challenges to third-party investment fund listings — had to be merged into a codebase where changing a single route risked cascading failures across unrelated subsystems. Database migrations required full downtime windows, typically handled at midnight by a tired DevOps engineer. The KYC subsystem, for example, ran PDF processing in the same Node.js event loop as real-time wallet notifications, so a slow OCR scan could hold up hundreds of transaction confirmations.

Customer success tickets relating to "slow transactions" grew at a compound rate of 35% week-over-week during the viral traffic surge. The NPS, which had held steady at 52, dropped to 31. The existing infrastructure team had no standardised deployment pipeline: every engineer tested locally and pushed directly to staging, which mirrored production exactly. There was no automated rollback. Failed deployments had to be manually reversed. In the ten days following the campaign, three deployments caused partial outages, each taking between 47 minutes and two hours to recover from.

The most immediate technical bottleneck was the MongoDB cluster. FinVerse was storing wallet transaction history, user profiles, KYC documents, notification queues, and analytics logs all in the same cluster. The primary node sat at 78% CPU utilisation even during normal hours, and the secondary node was three days behind on replication. The team had been running the cluster with a read preference that allowed reads from the secondary, which meant any read operation hitting the lagging secondary could return stale user data, a serious compliance risk for a regulated financial platform.

The engineering leadership asked a fundamental question: could this architecture even be incrementally improved, or did the entire foundation need to be replaced? The answer came from a two-day internal architecture audit conducted by a consultant from Google Cloud's Partners programme. The verdict was unequivocal. The cost of patching the monolith to scale was projected to exceed the cost of a greenfield rewrite, and the time to market for critical features would no longer be competitive.

Goals

FinVerse set six clear, measurable objectives for the infrastructure programme. First, the platform needed to sustain 200,000 concurrent authenticated users with API p99 latency below 300 ms, more than a fortyfold improvement over the 14-second worst case measured during the surge. Second, zero-downtime deployments had to be the default, not the exception: any engineer should be able to ship a feature without coordination from the DevOps team and without the risk of a production outage. Third, compliance requirements had to be reinforced: full transaction audit trails, immutability guarantees for KYC records, and automatic data retention policies were non-negotiable for the financial regulator.
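To make the first goal concrete, here is a minimal sketch of how a p99 acceptance check could be computed from raw latency samples. It uses the nearest-rank percentile definition; the function names and the default 300 ms threshold parameter are illustrative assumptions, not FinVerse's actual tooling.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) {
    throw new Error("percentile requires at least one sample");
  }
  if (p <= 0 || p > 100) {
    throw new Error("p must be in the range (0, 100]");
  }
  const sorted = samplesMs.slice().sort((a, b) => a - b);
  // Multiply before dividing to keep the rank calculation in integers.
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  const value = sorted[rank - 1];
  if (value === undefined) {
    throw new Error("rank out of range");
  }
  return value;
}

// Hypothetical acceptance check for the latency goal (p99 below 300 ms).
function meetsLatencyGoal(samplesMs: number[], thresholdMs: number = 300): boolean {
  return percentile(samplesMs, 99) < thresholdMs;
}
```

A single 14-second outlier in a small sample is enough to fail the check, which is why p99 is a stricter target than an average.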

Fourth, the analytics pipeline needed to support real-time portfolio performance dashboards computed entirely at the edge, without querying the core transactional database. Fifth, customer support agents needed a self-service observability dashboard that could answer any user enquiry without requiring an engineer to query production directly. Sixth, and perhaps most ambitiously, the engineering team aimed to ship new features at a two-week sprint cadence with a full CI/CD pipeline from commit to production, a feat that had been impossible with the old monolith. Each goal was backed by explicit acceptance criteria and a named owner who reported progress in weekly leadership reviews.

Approach

The team chose an event-driven microservices architecture as the replacement foundation, justified by three factors: the natural service boundaries already existed in the codebase (wallet, KYC, portfolio, notifications, analytics), the different services had clearly distinct scaling requirements, and the regulated nature of the platform demanded strict isolation between user-facing and compliance-critical components. The architecture team, led by Aisha Khalil and Lead Engineer Raghav Menon, defined eleven bounded services — each with its own repository, deployment pipeline, database, and monitoring dashboard.

The data layer received as much design attention as the application layer. The team evaluated PostgreSQL, CockroachDB, Redis, ClickHouse, and AWS DynamoDB across 23 criteria before selecting a hybrid approach: a CockroachDB cluster for all transactional and user-facing data — chosen for its built-in multi-region replication and strong consistency guarantees — a Redis Cluster for real-time session and rate-limiting state, and ClickHouse for the analytics time-series layer. The three systems were connected through Kafka event streams, which served as the backbone for all asynchronous inter-service communication.

The team made a deliberate choice to implement the new platform using NestJS across all backend services, TypeScript across the entire codebase, and React with TanStack Query for the frontend. This eliminated JavaScript/TypeScript boundary friction and allowed shared type definitions to be used by both client and server teams. The CI/CD pipeline was built on GitHub Actions, with automated unit tests, integration tests, end-to-end contract tests, dependency vulnerability scanning, and container image vulnerability scanning run against every pull request before code could be merged to main.
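The benefit of shared type definitions can be illustrated with a small sketch. The `WalletTransaction` shape, its field names, and the `isWalletTransaction` guard below are hypothetical; in practice they would be exported from a package imported by both the backend services and the React client.

```typescript
// Hypothetical shared contract; field names are illustrative only.
interface WalletTransaction {
  transactionId: string;
  walletId: string;
  amountPaise: number; // integer minor units, to avoid floating-point money
  kind: "credit" | "debit";
  createdAt: string; // ISO 8601 timestamp
}

// Runtime guard so that untyped JSON crossing the network boundary is
// validated once and is then safe to treat as WalletTransaction.
function isWalletTransaction(value: unknown): value is WalletTransaction {
  if (typeof value !== "object" || value === null) {
    return false;
  }
  const v = value as Record<string, unknown>;
  return (
    typeof v["transactionId"] === "string" &&
    typeof v["walletId"] === "string" &&
    typeof v["amountPaise"] === "number" &&
    Number.isInteger(v["amountPaise"]) &&
    (v["kind"] === "credit" || v["kind"] === "debit") &&
    typeof v["createdAt"] === "string"
  );
}

// Example: the client parses a response body and narrows it with the guard.
const parsed: unknown = JSON.parse(
  '{"transactionId":"t1","walletId":"w1","amountPaise":2500,"kind":"credit","createdAt":"2024-06-01T10:00:00Z"}'
);
if (isWalletTransaction(parsed)) {
  console.log(parsed.kind, parsed.amountPaise);
}
```

Because both sides compile against the same interface, a renamed or retyped field fails the build on whichever side has not been updated, instead of surfacing as a runtime error.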

Implementation

The implementation was executed in four staggered phases over five months. Phase one, completed in six weeks, established the foundational infrastructure: a Kubernetes cluster on Amazon EKS, the CockroachDB distributed cluster, the Kafka cluster, the Redis Cluster, the ClickHouse cluster, and the complete Terraform-based infrastructure-as-code repository. The team also provisioned a full observability stack including Prometheus, Thanos for long-term metrics storage, Grafana for dashboards, OpenTelemetry for distributed tracing, and Jaeger for trace visualisation. At the end of phase one, the infrastructure team ran a 48-hour load test against the bare Kubernetes cluster to validate auto-scaling behaviour under sustained pressure.

Phase two, the longest phase at eleven weeks, implemented the four most critical services in parallel: the User Identity and Authentication service, the Payments and Wallet service, the KYC and Compliance service, and the Notifications service. The Payments service was migrated first using the strangler fig pattern: a proxy layer intercepted requests based on URL path and routed a percentage of traffic to the new service while sending the remainder to the legacy monolith. Over three weeks, the traffic split was gradually increased from 5% to 100% while the team monitored for discrepancies between the new and old backends.
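The routing decision at the heart of the strangler fig proxy can be sketched as follows. This is a simplified illustration under stated assumptions: the `/payments` path prefix, the backend names, and the choice to bucket by user ID (so that a given user consistently hits the same backend) are not taken from FinVerse's proxy.

```typescript
type Backend = "legacy-monolith" | "payments-service";

// FNV-1a style 32-bit hash over UTF-16 code units. It is deterministic, so
// the same user ID always maps to the same bucket.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // convert to unsigned 32-bit
}

// Route by URL path first, then send rolloutPercent of users to the new service.
function chooseBackend(path: string, userId: string, rolloutPercent: number): Backend {
  if (!path.startsWith("/payments")) {
    return "legacy-monolith"; // paths not yet migrated always go to the monolith
  }
  const bucket = fnv1a(userId) % 100; // 0..99
  return bucket < rolloutPercent ? "payments-service" : "legacy-monolith";
}

// Raising rolloutPercent from 5 to 100 only ever moves users from the
// monolith to the new service, never back.
console.log(chooseBackend("/payments/transfer", "user-42", 5));
console.log(chooseBackend("/payments/transfer", "user-42", 100));
```

Bucketing by a stable hash instead of a random draw per request keeps each user's experience consistent during the rollout and makes discrepancies between the two backends easier to attribute.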

A particularly difficult engineering problem was identifying which wallet transactions were affected by the duplicate-payment edge case that had gone undetected in the monolith for six months. The team built an offline data reconciliation job in Spark that compared the new service's immutable transaction ledger against the old MongoDB write history and found 127 cases that required investor restitution and a full notification to the financial regulator. This incident — discovered and resolved during the rewrite — would have continued undetected in the old architecture.
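The core comparison in such a reconciliation job can be sketched in a few lines. The team's job ran in Spark; the version below is an in-memory TypeScript illustration only, and the `paymentKey` field (a business key identifying one intended payment) and the record shape are assumptions.

```typescript
interface PaymentRecord {
  paymentKey: string; // hypothetical key identifying one intended payment
  amountPaise: number;
}

interface Discrepancy {
  paymentKey: string;
  legacyCount: number;
  ledgerCount: number;
}

// Count how many records exist for each payment key.
function countByKey(records: PaymentRecord[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const record of records) {
    counts.set(record.paymentKey, (counts.get(record.paymentKey) ?? 0) + 1);
  }
  return counts;
}

// Report every key whose record count differs between the two sources.
// A key with legacyCount greater than ledgerCount is a candidate duplicate payment.
function reconcile(legacyWrites: PaymentRecord[], ledgerEntries: PaymentRecord[]): Discrepancy[] {
  const legacy = countByKey(legacyWrites);
  const ledger = countByKey(ledgerEntries);
  const discrepancies: Discrepancy[] = [];
  legacy.forEach((legacyCount, paymentKey) => {
    const ledgerCount = ledger.get(paymentKey) ?? 0;
    if (legacyCount !== ledgerCount) {
      discrepancies.push({ paymentKey, legacyCount, ledgerCount });
    }
  });
  // Keys that appear only in the new ledger.
  ledger.forEach((ledgerCount, paymentKey) => {
    if (!legacy.has(paymentKey)) {
      discrepancies.push({ paymentKey, legacyCount: 0, ledgerCount });
    }
  });
  return discrepancies;
}
```

Each flagged key would still need manual review, since a count mismatch shows that the two systems disagree but not which one is correct.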

Phase three, running for seven weeks, delivered the Portfolio Analytics service and the Notification Delivery service, along with the full edge caching layer implemented with Cloudflare Workers. The analytics pipeline inverted the data-reading pattern entirely: instead of the application querying the database for performance data, Kafka events from the Portfolio service were streamed into ClickHouse, where real-time aggregations were computed and cached at the edge. The dashboard pages loaded in under 120 milliseconds even during peak hours, a transformation from the four-to-eight second loads in the old system.
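The inverted read pattern can be illustrated with a small event-folding sketch. It is an in-memory stand-in for the streaming aggregation, not the ClickHouse or Cloudflare Workers implementation; the `HoldingValuedEvent` shape and the `PortfolioReadModel` class are hypothetical.

```typescript
// Hypothetical event emitted by the Portfolio service whenever a holding is revalued.
interface HoldingValuedEvent {
  userId: string;
  fundId: string;
  valuePaise: number; // current market value of the holding
  investedPaise: number; // amount the user has invested in it
}

interface PortfolioSummary {
  valuePaise: number;
  investedPaise: number;
  returnPercent: number;
}

// Read model kept up to date by consuming events, so dashboard reads never
// touch the transactional database.
class PortfolioReadModel {
  private readonly holdings = new Map<string, Map<string, HoldingValuedEvent>>();

  apply(event: HoldingValuedEvent): void {
    let byFund = this.holdings.get(event.userId);
    if (byFund === undefined) {
      byFund = new Map<string, HoldingValuedEvent>();
      this.holdings.set(event.userId, byFund);
    }
    byFund.set(event.fundId, event); // the latest valuation for a fund wins
  }

  summary(userId: string): PortfolioSummary {
    let valuePaise = 0;
    let investedPaise = 0;
    const byFund = this.holdings.get(userId);
    if (byFund !== undefined) {
      byFund.forEach((holding) => {
        valuePaise += holding.valuePaise;
        investedPaise += holding.investedPaise;
      });
    }
    const returnPercent =
      investedPaise === 0 ? 0 : ((valuePaise - investedPaise) * 100) / investedPaise;
    return { valuePaise, investedPaise, returnPercent };
  }
}
```

The summary is cheap to compute and cache because all the expensive work happens when events arrive, not when a dashboard is opened.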

Phase four was a two-week hardening and migration sprint. The team ran a three-day infrastructure game day simulating worst-case traffic spikes, database failovers, and partial regional outages. Two significant issues were surfaced and resolved: Kafka consumer-group rebalancing under topic-partition reassignment had not been tuned correctly, and the CockroachDB zone survivability configuration needed adjusting for multi-region resilience. After the game day, the migration was executed during a scheduled Sunday maintenance window with a targeted rollout by user cohort. Zero production incidents followed the launch, and post-launch monitoring showed all services settling within expected resource boundaries.

Results

Ninety days after the migration window, FinVerse's platform was serving nearly 2.1 million monthly active users, processing 1.43 million financial transactions daily, and maintaining 99.97% API uptime, compared to the 97.2% uptime recorded in the three months before the migration. API p99 latency had dropped from 14,000 milliseconds to 189 milliseconds. The new analytics dashboard loaded in 112 milliseconds compared to the previous 7,200 milliseconds. Deployment lead time (the time between a commit being merged and code running in production) fell from an average of 72 hours to under 45 minutes across all services.

The engineering team grew from twelve to thirty-five engineers over twelve months, and the CI/CD pipeline, fully automated from pull request validation to blue-green production deployment, made it possible for new engineers to begin shipping product features on their second day. Customer support ticket volume related to slow transactions dropped by 82%, and the NPS, which had dipped to 31 during the incident period, recovered to 59 within the first quarter post-launch.

FinVerse's infrastructure cost, which had risen to ₹2.3 lakh per month during emergency scaling efforts in the pre-migration period, stabilised at ₹1.42 lakh per month in the post-migration period — a 38% reduction — because Kubernetes auto-scaling, right-sized instance selection, and ClickHouse's high compression ratio together yielded substantially better resource efficiency than the emergency Amazon EC2 and MongoDB Atlas over-provisioning that had been patched in response to the surge.

Key Metrics

Metric	Before Migration	After Migration (90 days)
Monthly Active Users	50,000	2,100,000
Daily Transactions	~14,000 / day	1,430,000 / day
API p99 Latency	14,000 ms	189 ms
API Uptime	97.2%	99.97%
Deployment Lead Time	72 hours	45 minutes
Dashboard Load Time	7,200 ms	112 ms
Monthly Infrastructure Cost	₹2.3 lakh	₹1.42 lakh (−38%)
NPS Score	31	59
Slow-Transaction Support Tickets	420 / month	75 / month (−82%)
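
The percentage and fold changes implied by the table can be reproduced with a few lines of arithmetic. The helper names below are illustrative; the input figures are the ones listed in the table.

```typescript
// Percentage change from a baseline; negative values are reductions.
function percentChange(before: number, after: number): number {
  if (before === 0) {
    throw new Error("baseline must be non-zero");
  }
  return ((after - before) / before) * 100;
}

// How many times larger the second value is than the first.
function foldChange(smaller: number, larger: number): number {
  if (smaller === 0) {
    throw new Error("divisor must be non-zero");
  }
  return larger / smaller;
}

const costChange = percentChange(2.3, 1.42); // monthly cost in lakh rupees, about -38%
const ticketChange = percentChange(420, 75); // slow-transaction tickets, about -82%
const userFold = foldChange(50_000, 2_100_000); // monthly active users, 42x
const latencyFold = foldChange(189, 14_000); // p99 latency improvement, about 74x

console.log(costChange.toFixed(1), ticketChange.toFixed(1), userFold, latencyFold.toFixed(0));
```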

Lessons Learned

The first and most important lesson was that infrastructure quality is a product feature. Users do not see Kubernetes clusters or CockroachDB replication factors, but they do notice that wallet transactions confirm in under two hundred milliseconds instead of stalling for fourteen seconds. Engineering decisions that feel abstract from the inside — database selection, service boundaries, deploy automation — are directly felt by the customer and directly affect the brand. Every sprint retro at FinVerse now includes a question about whether the infrastructure served the product well during the period; this discipline is credited with keeping service quality steadily high through rapid organisational growth.

The second lesson was about data ownership and the hidden cost of shared access. The real blocker in the old monolith was not the codebase complexity itself but the single MongoDB cluster holding every service's data together. When services are forced to share a database, they are simultaneously empowered to read each other's datasets and exposed to each other's schema changes. The strict per-service database pattern (Wallet DB owned by the Wallet service, Compliance DB owned by the Compliance service, with no cross-service direct reads) eliminated an entire class of deployment risk. The team had been managing that risk for years without recognising it as cross-service coupling.

The third lesson was about the strangler fig migration pattern and the discipline required to make it work. The proxy-based gradual traffic migration approach was superior to a big-bang cutover in every measurable way, but it also required far more sustained engineering discipline than the team had initially estimated. For three weeks, every release to the new service was followed by a data-reconciliation job comparing its outputs to the legacy monolith. This work was tedious and slowed velocity to a near-halt during the migration window, but it surfaced real bugs early — including the 127-case duplicate-payment defect — before they could reach customers.

The fourth lesson concerned platform engineering as an investment in team velocity, not an overhead cost. The complete CI/CD pipeline, service blueprints, developer documentation portal, and local development environment all took three to four weeks of dedicated platform engineering effort at the start of the migration, and there was strong pressure to skip this work in favour of shipping features faster. Skipping the platform investment would have meant launching individual services without automated testing or deployment safeguards, reproducing the fragile velocity of the old monolith: quick results now at the expense of reliability later. The platform investment paid for itself within the first post-migration quarter.

FinVerse continues to iterate on the platform that was born from that five-month rewrite. The team is currently implementing a feature-flagging system to support canary releases, improving observability with AI-powered anomaly detection for proactive incident response, and exploring a gradual move toward a service mesh architecture to handle inter-service communication as the number of services continues to grow. The journey from a fragile monolith to a resilient, regulated microservices platform took five months and required a level of organisational courage that not every engineering team is able to muster. The results (a 42-fold increase in users in eighteen months, 99.97% API uptime, and a team that has grown from twelve to thirty-five engineers without a single knowledge-transfer bottleneck) suggest it was worth every difficult decision made along the way.

How FinVerse Scaled from 50K to 2 Million Users in 18 Months Using a Microservices Architecture Rewrite
