Webskyne

20 May 2026 · 13 min read

How Migrating to Microservices Saved a ₹500 Cr Fintech Platform from Collapse

When a leading Indian fintech platform's monolithic architecture began buckling under 12 million monthly active users, the engineering team faced a choice: patch the cracks or reinvent the stack. What followed was an 18-month microservices migration that didn't just stabilise the platform — it slashed latency by 62%, cut infrastructure costs by 43%, and set a new industry benchmark for zero-downtime transitions in a heavily regulated environment.

Case Study, Microservices, Cloud Migration, Fintech, System Architecture, DevOps, Performance, Kubernetes
## Overview

Founded in 2017, PayVault had grown from a disruptive peer-to-peer payments startup into one of India's most trusted digital financial platforms. By mid-2023, it served over 12 million monthly active users, processed more than ₹4,200 crore in annual transaction volume, and employed a workforce of over 1,200. Its success rested on a sophisticated platform that handled everything from UPI transfers and wallet payments to investment products, insurance disbursements, and merchant settlements.

The platform was ambitious. But beneath its polished UI and impressive marketing, the foundations were cracking.

## Challenge

By the first quarter of 2023, PayVault's engineering leadership had compiled a growing dossier of incidents that told a worrying story. The monolithic backend application, a 2.8 million-line Java monolith built on a traditional Spring framework, had become the single greatest source of operational risk in the entire business.

### The Symptoms

The first warning sign was latency. During peak hours, typically weekday evenings between 7 and 9 PM, average API response times climbed from a target of under 200 milliseconds to over 650 milliseconds. Enterprise merchants processing high-volume transactions reported intermittent timeouts that delayed payouts and created cash-flow disruptions for small businesses. Customer support tickets related to transaction failures rose 43% quarter-on-quarter, and social sentiment analysis revealed that 62% of negative mentions were directly tied to platform reliability rather than product features or pricing.

The technical debt was systemic. Every feature deployment carried a 30 to 45 minute window of operational vulnerability because the entire application had to be restarted, a procedure that left the platform partially unavailable for the majority of users.
A single database connection pool exhaustion in the wallet module would cascade across the settlement, notifications, and fraud-detection subsystems, making even partial outages disproportionately expensive and hard to diagnose.

Compliance introduced another dimension of complexity. The Reserve Bank of India's digital payments guidelines and Know Your Customer mandates required immutable audit trails, fine-grained access controls, and strict data residency rules. The monolith's tightly coupled data model made it nearly impossible to apply differential access policies without risking operational regressions: any database schema change could break any of the 200-plus API endpoints.

The engineering team estimated that by mid-2024 the monolith would reach a theoretical ceiling: growth beyond 20 million MAUs would require a complete rewrite. In Q3 2023, a database migration gone wrong caused a 72-minute full-platform outage during Diwali weekend, the single worst incident in the company's history, and the leadership team recognised that continuing on the current trajectory was no longer an option.

## Goals

Chief Technology Officer Arjun Mehta, who had joined PayVault ten months before the crisis, framed the migration challenge against four explicit strategic priorities.

The first goal was operational resilience. The platform needed to tolerate the failure of any individual service without cascading impact to users. That meant designing a system where a fault in the settlements module could not prevent wallet top-ups or identity verification.

The second goal, platform observability, demanded granular telemetry across every service (metrics, traces, and logs) so that incidents could be diagnosed and remediated without the lengthy post-mortems the team had become accustomed to.

The third goal centred on development velocity. Feature teams were spending 40 to 50 percent of their sprint capacity on release coordination, integration testing, and rollback procedures.
Decoupling the services and establishing clear ownership boundaries would allow teams to ship independently and continuously without orchestrating a global release train.

Finally, cost efficiency had to be an outcome, not an afterthought. The target was a 35 to 40 percent reduction in cloud infrastructure spend within 18 months, driven by right-sizing compute resources, eliminating redundant read replicas, and adopting a more intelligent auto-scaling strategy.

Netflix's well-publicised journey with microservices and chaos engineering provided significant conceptual inspiration, but the team recognised that simply importing a proven model from a $300 billion company was insufficient. Their regulatory environment, their user scale, and their transaction profile demanded a synthesis of modern cloud-native practices with the operational discipline of financial services.

## Approach

The migration strategy adopted by PayVault's Platform Engineering team was incremental and non-disruptive by design, built from the very beginning to avoid the big-bang rewrite anti-pattern.

### Strangler Fig Pattern

Rather than launching a big-bang rewrite, the team applied Martin Fowler's strangler fig pattern. Over an 18-month window, key capabilities would be extracted from the monolith into independently deployable services, with the monolith acting as a routing proxy for any functionality not yet migrated. Anti-corruption layers, adapter services that translated between the old and new worlds, ensured that new services did not inherit the monolith's poor architectural decisions.

The extraction sequence was prioritised by three factors: business criticality, coupling density, and failure mode. The wallet module, handling user balances and transaction history, was the first choice: it was highly visible to users, clearly bounded, and generated measurable latency improvements when moved independently.
The fraud-detection pipeline followed, because its batch-processing workload was well-suited to asynchronous, event-driven execution. Auth and KYC, the identity management system, came third, given the strict compliance requirements around user data.

### Service Mesh and Observability

Choosing the infrastructure stack required careful calibration against the team's existing skill set. After evaluating Istio, HashiCorp Consul, and Linkerd, the team selected Istio as the service mesh backbone, primarily for its mature traffic management and mTLS capabilities, a critical requirement given the PCI DSS compliance obligations that governed payment data handling.

Tracing was built around Jaeger and OpenTelemetry, with every service contributing span data back to a central observability platform. Prometheus and Grafana were integrated for time-series metrics, and the team invested heavily in structured logging with correlation IDs that allowed engineers to follow a transaction through every service boundary. The SLI/SLO framework defined four key signals (availability, latency, throughput, and error rate), with specific thresholds for all user-facing APIs: 99.95 percent availability, P99 latency under 300 milliseconds, and error rates below 0.1 percent.

### Domain-Driven Design

The software engineering methodology guiding each extraction was Domain-Driven Design, with the team spending six to ten weeks in each bounded context before writing production code. Event storming workshops with product managers, compliance officers, and finance stakeholders produced a shared understanding of the platform's core domain, which the team internally referred to as their "contract map."

Event-driven communication between services was enforced through Apache Kafka. Idempotent consumers and dead-letter queues kept at-least-once delivery safe even during infrastructure turbulence: a redelivered event was applied only once, and an event that could not be processed was set aside rather than blocking the stream.
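
The idempotent-consumer and dead-letter-queue combination described above can be sketched in a few lines. This is a minimal in-memory illustration, not PayVault's implementation and not a Kafka client: the `Event` fields, the `IdempotentConsumer` class, the retry limit of three, and the `credit` handler are all hypothetical names chosen for the example. It assumes each event carries a unique ID assigned by the producer.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Event:
    event_id: str       # hypothetical unique ID assigned by the producer
    account: str
    amount_paise: int


@dataclass
class IdempotentConsumer:
    handler: Callable[[Event], None]
    max_attempts: int = 3                      # assumed retry budget
    processed_ids: set = field(default_factory=set)
    dead_letters: list = field(default_factory=list)

    def consume(self, event: Event) -> str:
        # At-least-once delivery means the same event may arrive twice;
        # a known ID is acknowledged without running the handler again.
        if event.event_id in self.processed_ids:
            return "duplicate"
        for _ in range(self.max_attempts):
            try:
                self.handler(event)
            except Exception:
                continue  # failed attempt: retry until the budget runs out
            self.processed_ids.add(event.event_id)
            return "processed"
        # Retries exhausted: park the event instead of blocking the stream.
        self.dead_letters.append(event)
        return "dead-lettered"


balances = {}


def credit(event: Event) -> None:
    # Validate before mutating, so a failed attempt leaves no partial state.
    if event.amount_paise <= 0:
        raise ValueError("non-positive amount")
    balances[event.account] = balances.get(event.account, 0) + event.amount_paise


consumer = IdempotentConsumer(handler=credit)
first = Event("evt-1", "acct-42", 5000)
results = [
    consumer.consume(first),                            # first delivery
    consumer.consume(first),                            # redelivery
    consumer.consume(Event("evt-2", "acct-42", -1)),    # always fails
]
print(results, balances, len(consumer.dead_letters))
```

In a real deployment the set of processed IDs would need to live in durable storage and be updated in the same transaction as the side effect; otherwise a crash between the two steps reintroduces the duplicate the pattern is meant to prevent.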
## Implementation

### Phase Zero: Infrastructure Foundation (Months 1–3)

The first three months were unglamorous but foundational. The team provisioned a dedicated Kubernetes cluster, configured a GitOps-based deployment pipeline using Argo CD, and established three distinct environments: staging, pre-production, and production. Secrets management migrated to HashiCorp Vault, retiring the previous practice of environment variable-based configuration management that had contributed to an incident involving exposed database credentials in a public CI job three years prior.

Load testing against the new infrastructure used synthetically generated traffic patterns matching peak-hour production volumes. The baseline performance of the existing monolith on the new infrastructure served as a continuous comparison point throughout the project.

### Phase One: Wallet Extraction (Months 4–6)

The wallet module extraction went live in month six. The original monolith handled 18,000 queries per second at peak wallet loads. The new Wallet Service, written in Go for its lightweight concurrency model, handled equivalent loads at half the CPU utilisation with database connection pool sizes reduced by 70 percent.

The dual-write strategy that ensured zero data loss during cutover deserves emphasis. Rather than switching users abruptly, the team employed dual writes, sending every wallet update to both the legacy PostgreSQL database and the new CockroachDB cluster simultaneously. Once comparisons against read replicas confirmed equivalent data on 100 percent of wallet transactions tracked over a two-week canary period, the migration team flipped the wallet read path to the new service while keeping write operations dual-path for an additional month. The dual-path phase caught three data-integrity edge cases before any user was exposed to them.

### Phase Two: Fraud Detection Pipeline (Months 7–10)

The fraud-detection extraction introduced a significantly different architectural pattern.
The real-time scoring engine, responsible for evaluating transaction risk in under 200 milliseconds, was extracted as a stateless event-processing service. The batch-processing engine, responsible for nightly model retraining and historical pattern analysis, was extracted as a separate Kubernetes Job orchestration pipeline. This separation allowed the team to scale the real-time service horizontally with 20 replicas during peak periods while running the batch pipeline on a fixed, cost-optimised compute tier.

The event-driven pattern using Kafka allowed the fraud system to consume events asynchronously from the rest of the platform, eliminating the direct temporal coupling that had previously made fraud-processing delays cascade into payment-processing delays for downstream services.

### Phase Three: Identity and KYC (Months 11–18)

Auth and KYC were completed over an eight-month window given the compliance complexity. The auth service was extracted first, establishing the JWT token infrastructure and mTLS gateway enforcement before KYC state was migrated. The KYC data model was redesigned around an append-only event log, ensuring that any modification to user verification status produced an immutable audit record and satisfying one of the RBI's most stringent traceability requirements.

By month 18, every significant capability had been extracted. The monolith remained in production solely to handle legacy API endpoints maintained for a small cohort of partners with long contract windows, approximately 4 percent of total transaction volume. The monolith was officially decommissioned in month 20, once all remaining partner integrations had migrated.

## Results

### Platform Performance

The platform redesign delivered measurable results across every dimension of the original goal.
Case study results summary:

| Metric | Before Migration | After Migration | Change |
|---|---|---|---|
| Average API Latency | 620 ms | 238 ms | ↓ 62% |
| Peak-hour System Error Rate | 2.7% | 0.08% | ↓ 97% |
| Monthly Infrastructure Cost | ₹2.84 Cr | ₹1.61 Cr | ↓ 43% |
| Developer Release Frequency | Monthly | 12+/week | ↑ 48x |
| MTTR (mean time to recover) | 87 min | 13 min | ↓ 85% |
| Deployment Risk Window | 45 min | 2 min | ↓ 96% |

Average API latency fell from 620 milliseconds to 238 milliseconds, a 62 percent improvement that translated into measurable improvements in user-experience scores across five app stores. The system error rate during peak-hour transactions dropped from 2.7 percent to 0.08 percent, a 97 percent reduction. The platform's NPS improved from +31 to +47 over the six months following the final service extraction.

Cloud infrastructure costs declined from ₹2.84 crore per month to ₹1.61 crore per month, a 43 percent reduction driven by right-sizing individual services, eliminating the 18 redundant read replicas that had accumulated as a workaround for monolith read contention, and implementing burst-capable Kubernetes auto-scaling that allowed the platform to scale compute resources dynamically rather than running a static 72-node cluster at all hours.

### Business Impact

The operational improvements translated directly into commercial outcomes. Merchant churn declined by 18 percent in the quarter following the migration's completion, as enterprise merchants reported that the improved platform reliability had removed one of their primary concerns about long-term dependency. Customer support costs declined by 12 percent quarter-on-quarter, driven by a 68 percent reduction in platform-related support tickets.
The engineering team completed 156 incremental improvements in the first quarter post-migration, compared to 7 improvements in the same quarter of the previous year, demonstrating the compound velocity benefit of independent service deployment.

One particularly notable development was the emergence of the fraud-detection service as a product in its own right. Separated from the core platform and initially exposed through an internal API, the fraud-scoring capability was packaged by the risk team as an independent product to power risk-aware onboarding flows for B2B customers, generating approximately ₹3.2 crore in incremental revenue within three months of being offered externally.

### Technology and Engineering Achievements

From an engineering culture standpoint, the migration delivered improvements that transcended raw performance numbers. The shift to feature-flag-driven progressive rollouts gave engineers the confidence to release small changes multiple times per day, and the observability platform gave them the data to understand the effect of each change. The platform-engineering team trained 87 engineers on Kubernetes, Istio, and OpenTelemetry over the course of the migration, embedding cloud-native skill sets across the broader technology organisation in a way that had not existed before.

Site reliability engineers, who had spent much of 2023 responding to incidents, shifted their time from reactive fire-fighting to proactive SLO management and platform reliability automation. The number of P0 incidents requiring executive escalation dropped from 11 in the final quarter of 2023 to zero in the third quarter of 2024, the first quarter in PayVault's history in which no platform incident required chief-level attention.

## Key Metrics at a Glance

The following metrics capture the most important dimensions of the migration's success, compared at the same point in the technology lifecycle.
Platform at a glance:

| Category | Key Result |
|---|---|
| Platform Performance | Average latency ↓ 62%, P99 latency ↓ 71%, system error rate ↓ 97% |
| Cost and Efficiency | Monthly cloud spend ↓ 43% (₹2.84 Cr to ₹1.61 Cr), roughly ₹14.8 Cr saved per year |
| Operational Excellence | MTTR ↓ 85%, P0 escalations ↓ 100%, release velocity ↑ 48x |
| Developer Velocity | Independent deployment cycles, features per quarter ↑ 2x |
| Business Outcomes | NPS ↑ 16 pts, merchant churn ↓ 18%, compliance costs ↓ 30% |

## Lessons Learned

The PayVault migration generated a set of lessons that the team has thoroughly internalised and shared across the broader fintech and engineering community.

The first lesson, and perhaps the most important, is that incremental strangler fig migration is superior to big-bang approaches in regulated, high-volume environments. The dual-write and canary testing gates caught data-integrity edge cases that would have been catastrophic in production under a cutover model. The incremental approach also allowed the business to continue operating normally throughout, which was a non-negotiable constraint given the 12 million users depending on continuous service access.

Second, invest in the team before investing in the technology. Some 62 percent of unexpected delays during the migration stemmed not from technical complexity but from gap-filling training: engineers who had only ever deployed monoliths wrestling with Kubernetes manifests and Istio configuration at 2 AM. Earlier investment in training and documentation, and a more measured pace of capability building, would have shortened the overall timeline by an estimated three months.

Third, the observability investment pays compound interest. The team's decision to instrument every service with OpenTelemetry and correlation IDs from day one of the first extraction, rather than retrofitting them later, meant that incident investigation time shrank consistently, not just once.
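
To make the correlation-ID idea concrete, here is a minimal sketch of structured JSON logging in which every log line emitted while handling one request carries the same ID. It uses only the Python standard library as a stand-in; it is not the OpenTelemetry API and not PayVault's logging policy. The service names, the `handle_request` function, and the field names in the JSON payload are assumptions made for illustration.

```python
import contextvars
import io
import json
import logging
import uuid

# Holds the ID of the request currently being handled ("-" when there is none).
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object that includes the correlation ID."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "service": record.name,
            "correlation_id": correlation_id.get(),
            "message": record.getMessage(),
        })


def make_logger(service: str, stream) -> logging.Logger:
    logger = logging.getLogger(service)
    logger.handlers.clear()       # keep the example repeatable
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


stream = io.StringIO()
wallet_log = make_logger("wallet-service", stream)   # hypothetical service names
fraud_log = make_logger("fraud-service", stream)


def handle_request(incoming_id=None) -> None:
    # Reuse the caller's ID when one is supplied; otherwise mint a new one.
    token = correlation_id.set(incoming_id or uuid.uuid4().hex)
    try:
        wallet_log.info("debit accepted")
        fraud_log.info("risk score computed")
    finally:
        correlation_id.reset(token)   # do not leak the ID into the next request


handle_request("txn-123")
lines = [json.loads(line) for line in stream.getvalue().splitlines()]
print(lines)
```

Filtering the combined log stream on a single `correlation_id` value then yields every step of one transaction across services, which is the property that makes repeated incident investigations faster.
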
The structured logging and metrics policy established during the migration became the default standard for every subsequent service developed at PayVault.

Fourth, regulatory constraints surface design decisions earlier and more forcefully than architectural concerns alone. The RBI's data-residency and audit-trail requirements forced the team to adopt append-only event logging, event sourcing patterns, and differential access controls on their data platform as functional requirements rather than optional software abstractions. The resulting architecture met regulatory requirements more comprehensively than the compliance team had initially anticipated.

The final lesson is that microservices are not a goal; they are a means. The team deliberately approached the migration not as an academic exercise in software architecture but as a response to genuine operational failures that were threatening business and customer outcomes. The architecture that emerged was neither particularly pure nor textbook-correct, but it was fit for the specific context in which it was designed, and that is more important than architectural fidelity.

PayVault's migration is now used as a case study by regulators and industry bodies as an example of how financial technology platforms can modernise critical infrastructure at scale without compromising operational continuity or regulatory compliance, a model that continues to inform how India's fintech sector approaches infrastructure decisions in an era of unprecedented scale.
