Webskyne

23 May 2026 • 11 min read

How FinPulse Migrated 2.4 Million Users Off a Monolith in 90 Days Without Downtime

When FinPulse's legacy billing platform started buckling under Black Friday traffic, the engineering team faced a classic dilemma: patch the monolith and hope it holds, or bet the company on a zero-downtime migration. This is the full story of how they pulled it off: on time, under budget, and without losing a single transaction.

Tags: Case Study, microservices, architecture, migration, fintech, software-engineering, kubernetes, zero-downtime, legacy-systems

Overview

FinPulse is a digital banking and payments platform serving 2.4 million active users across Southeast Asia. In 2024, its billing and recurring-payment engine, a 12-year-old Rails monolith, processed an average of 180,000 transactions per day and spiked to over 400,000 during major promotional windows. The monolith had become not just a performance liability but a genuine business risk.

This case study documents the 90-day migration of FinPulse's core billing platform from a monolithic architecture to a production-grade microservices stack, executed with zero service interruptions and measurable improvements in user experience and operational cost.

The Challenge

By mid-2024, every stakeholder at FinPulse agreed that the legacy billing monolith had reached the end of its useful life. The problems were multiple and compounding.

Performance Under Load

During the 2023 Black Friday period, the platform experienced three distinct cascading failures. Each failure originated in a single database lock within the monolith's payment scheduling module, propagated through shared connection pools, and caused checkout failures for approximately 18,000 users over a four-hour window. Post-incident reviews revealed that the root problem, an unindexed JOIN query that had never been optimized, was masked because the code path was exercised only during peak load.

Deployment Velocity

Any deployment to the billing monolith required a full regression test cycle lasting approximately 72 hours, plus a manual sign-off from three department leads. This meant that critical bug fixes were held in a queue, and urgent compliance patches (regulatory requirements for payment audit trails changed twice in 2024) took up to three weeks to ship. Meanwhile, the engineering team was growing rapidly, but deployment coordination overhead was growing faster.

Data Inconsistency

The monolith handled five distinct payment workflows (card payments, bank transfers, e-wallet top-ups, subscription billing, and merchant settlement) within a single database. These workflows shared tables for user profiles, transaction records, and audit logs without clear ownership boundaries. Over time, subtle race conditions had emerged, producing a monthly reconciliation discrepancy of approximately ¥48 million that was resolved manually by the operations team. The manual work consumed roughly 40 person-hours per month.

Goals and Non-Negotiables

Before any technical work began, the leadership team codified four non-negotiable goals that would govern every architectural decision throughout the project.

Zero downtime. The migration had to occur without any user-visible interruption. Given that FinPulse processes payments continuously, even a 30-minute outage during off-peak hours would have exceeded regulatory SLA thresholds and incurred penalty clauses with two enterprise payment partners.

Complete data integrity. Every transaction processed during the migration window had to be captured, validated, and stored with a verifiable audit trail. The team explicitly ruled out any approach that involved bulk data transfers or overnight cutovers, given the risk of dropped transactions.

90-day delivery window. The timeline was driven by a regulatory filing deadline: FinPulse's new payment processing license required submission of updated infrastructure documentation within 100 days of license approval. The 90-day window left a 10-day buffer for compliance review.

No increase in operational headcount. The migration had to be completed using the existing 22-person engineering team, which was also maintaining the live product and responding to customer escalations. There was no option to hire a dedicated migration team.

Architectural Approach

The team chose a strangler-fig migration pattern over a big-bang cutover. Rather than attempting to replace the monolith in a single event, new services would be built alongside it, traffic would be gradually routed to the new services, and the old code would be decommissioned incrementally as confidence grew. This approach aligned naturally with the zero-downtime requirement.

Service Decomposition

The first architectural decision was which services to extract and in what order. Using a combination of change-frequency analysis and dependency mapping, the team identified five bounded contexts to extract as independent microservices: Transaction Orchestrator, Payment Gateway Integration, Subscription Billing, Merchant Settlement, and Audit & Compliance Tracker. The Transaction Orchestrator was selected first because it had the clearest API boundary, the least shared state, and the highest impact on the cascade-failure problem.

API Gateway and Service Mesh

Kong was selected as the API gateway, sitting in front of both the legacy monolith and the newly extracted services. This provided a single interception point for authentication, rate limiting, and request routing, allowing the team to write routing rules that could gradually shift traffic percentages from the monolith to the new service without any code change on the client side. Istio was introduced as a lightweight service mesh for observability (distributed tracing and service-level metrics) rather than enforcement, keeping the operational overhead manageable for the existing team.
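The idea of shifting a percentage of traffic without touching clients can be sketched in a few lines. This is not Kong's API or FinPulse's configuration; `chooseUpstream`, `fnv1a`, and the hash-bucket scheme are hypothetical names and choices made for illustration. The sketch uses a sticky, hash-based variant so that a given user stays on the same backend while the percentage is unchanged.

```typescript
// Hypothetical sketch: deterministic percentage-based routing between two backends.
// These names are illustrative and are not part of Kong's API.

type Upstream = "monolith" | "new-service";

// FNV-1a: maps a string to a stable unsigned 32-bit integer.
function fnv1a(key: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

// Routes roughly `percentToNew` percent of routing keys to the new service.
function chooseUpstream(routingKey: string, percentToNew: number): Upstream {
  if (percentToNew < 0 || percentToNew > 100) {
    throw new RangeError("percentToNew must be between 0 and 100");
  }
  const bucket = fnv1a(routingKey) % 100; // stable bucket in 0..99
  return bucket < percentToNew ? "new-service" : "monolith";
}

// Usage: the same (made-up) user evaluated at increasing percentages.
for (const percent of [5, 20, 50, 80]) {
  console.log(percent, chooseUpstream("user-12345", percent));
}
```

Because a key's bucket never changes, raising the percentage only ever moves keys from the monolith to the new service, never back, which keeps a gradual rollout predictable.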

Data Migration Strategy

Data migration was the most technically complex part of the project. Given the no-downtime requirement, the team could not perform a bulk data transfer followed by a cutover. Instead, they implemented a dual-write pattern: during the extraction phase, every write would be persisted to both the legacy database and the new service's database. Read operations would initially be served from the legacy database, and only after a validation period confirming identical read results would readers be switched to the new database. The dual-write logic was contained in a thin abstraction layer that was later removed entirely once both services were independently serving reads and writes.
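A minimal sketch of what such a dual-write layer might look like follows. The article does not describe the real implementation, so everything here is an assumption: the `TxStore` interface, the in-memory stores standing in for the two databases, and in particular the failure policy (the legacy write must succeed, while a failed write to the new store is queued for repair).

```typescript
// Hypothetical sketch of a dual-write layer. In-memory maps stand in for the
// two databases, and calls are synchronous for brevity (real clients are async).

interface TxRecord {
  id: string;
  amountMinor: number; // amount in minor currency units
  status: "pending" | "settled" | "failed";
}

interface TxStore {
  put(record: TxRecord): void;
  get(id: string): TxRecord | undefined;
}

class InMemoryStore implements TxStore {
  private rows = new Map<string, TxRecord>();
  put(record: TxRecord): void {
    this.rows.set(record.id, { ...record });
  }
  get(id: string): TxRecord | undefined {
    const row = this.rows.get(id);
    return row ? { ...row } : undefined;
  }
}

type ReadSource = "legacy" | "new";

class DualWriteStore implements TxStore {
  // Ids whose write to the new store failed and need to be replayed later.
  readonly pendingRepair: string[] = [];
  private legacy: TxStore;
  private next: TxStore;
  private readFrom: ReadSource = "legacy";

  constructor(legacy: TxStore, next: TxStore) {
    this.legacy = legacy;
    this.next = next;
  }

  // Assumed policy: the legacy store is the source of truth during migration.
  put(record: TxRecord): void {
    this.legacy.put(record); // if this throws, the caller sees the failure
    try {
      this.next.put(record);
    } catch {
      this.pendingRepair.push(record.id); // repaired out of band
    }
  }

  get(id: string): TxRecord | undefined {
    return this.readFrom === "legacy" ? this.legacy.get(id) : this.next.get(id);
  }

  // Flipped only after validation shows both stores return identical reads.
  switchReadsTo(source: ReadSource): void {
    this.readFrom = source;
  }
}

// Usage
const legacyDb = new InMemoryStore();
const newDb = new InMemoryStore();
const billingStore = new DualWriteStore(legacyDb, newDb);
billingStore.put({ id: "txn-1", amountMinor: 125000, status: "pending" });
billingStore.switchReadsTo("new");
console.log(billingStore.get("txn-1"));
```

Keeping the read switch separate from the write path mirrors the sequence in the text: writes go to both stores first, and reads move only after the validation period.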

Implementation

Weeks 1–4: Foundation and Service Extraction

The first phase focused on building the foundational infrastructure: the Kubernetes cluster, the Kong gateway, the Istio mesh, and the CI/CD pipeline for the new services. Simultaneously, the Transaction Orchestrator service was extracted from the monolith. This service had clear input-output contracts (it accepted payment initiation requests and returned transaction status), which made it a clean first extraction. The new service was written in Node.js with TypeScript, connected to a dedicated PostgreSQL instance, and instrumented with OpenTelemetry spans from day one.

By the end of week four, the team had validated the Transaction Orchestrator in a production-like shadow environment, running 10% of live traffic through it while the monolith continued to serve all responses. No response discrepancies were found over approximately four million shadow requests.
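The shadow setup described above can be sketched as a wrapper that always answers from the primary handler and, for a sampled fraction of requests, also calls the candidate and records any difference. All names here (`withShadow`, `InitiationRequest`, `Discrepancy`) are hypothetical, and the sketch assumes the shadow service runs with side effects such as real charges disabled.

```typescript
// Hypothetical sketch of shadow-traffic comparison. The caller only ever sees
// the primary (monolith) response; the shadow result is compared and discarded.

interface InitiationRequest { requestId: string; amountMinor: number; }
interface InitiationResponse { status: "accepted" | "rejected"; reason?: string; }
type PaymentHandler = (req: InitiationRequest) => InitiationResponse;

interface Discrepancy {
  requestId: string;
  primary: InitiationResponse;
  shadow: InitiationResponse | { error: string };
}

function sameResponse(a: InitiationResponse, b: InitiationResponse): boolean {
  return a.status === b.status && a.reason === b.reason;
}

function withShadow(
  primary: PaymentHandler,
  shadow: PaymentHandler,
  sampleRate: number, // 0.1 mirrors 10% of requests
  report: (d: Discrepancy) => void,
  random: () => number = Math.random,
): PaymentHandler {
  return (req: InitiationRequest): InitiationResponse => {
    const primaryResponse = primary(req);
    if (random() < sampleRate) {
      try {
        const shadowResponse = shadow(req);
        if (!sameResponse(primaryResponse, shadowResponse)) {
          report({ requestId: req.requestId, primary: primaryResponse, shadow: shadowResponse });
        }
      } catch (err) {
        // A crash in the shadow path is itself a finding, never a user-facing error.
        report({ requestId: req.requestId, primary: primaryResponse, shadow: { error: String(err) } });
      }
    }
    return primaryResponse;
  };
}

// Usage with two toy handlers that disagree on large amounts.
const foundDiscrepancies: Discrepancy[] = [];
const monolithHandler: PaymentHandler = () => ({ status: "accepted" });
const candidateHandler: PaymentHandler = (req) =>
  req.amountMinor > 1000 ? { status: "rejected", reason: "limit" } : { status: "accepted" };
const mirrored = withShadow(monolithHandler, candidateHandler, 1, (d) => foundDiscrepancies.push(d));
mirrored({ requestId: "r1", amountMinor: 500 });
mirrored({ requestId: "r2", amountMinor: 5000 });
console.log(foundDiscrepancies.map((d) => d.requestId));
```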

Weeks 5–8: Gradual Traffic Cutover

With shadow validation complete, the team began the gradual traffic cutover using Kong's weighted routing plugin. Traffic to the Transaction Orchestrator endpoint was increased incrementally: 5%, then 20%, then 50%, then 80%. At each milestone, the team monitored key metrics (error rate, P99 latency, and transaction throughput) for a full 24-hour window before proceeding to the next step. This proved critical when, at the 50% mark, a subtle connection pool exhaustion issue emerged in the new service's PostgreSQL driver configuration that had not surfaced in testing. The issue was resolved and the cutover proceeded without customer impact.
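The hold-or-advance decision at each milestone can be expressed as a small gate function. This is a sketch under assumptions: the article names the three metrics and the rollout steps, but the threshold values, the `evaluateGate` function, and the shape of its inputs are invented for illustration.

```typescript
// Hypothetical sketch of a rollout gate evaluated after each observation window.

interface StepMetrics {
  errorRate: number;        // fraction of failed requests, e.g. 0.001
  p99LatencyMs: number;
  throughputPerMin: number;
}

interface GateThresholds {
  maxErrorRate: number;
  maxP99LatencyMs: number;
  minThroughputRatio: number; // observed throughput relative to baseline
}

type GateDecision =
  | { action: "advance"; nextPercent: number }
  | { action: "hold"; reasons: string[] }
  | { action: "complete" };

function evaluateGate(
  currentPercent: number,
  steps: readonly number[],
  observed: StepMetrics,
  baselineThroughputPerMin: number,
  limits: GateThresholds,
): GateDecision {
  const reasons: string[] = [];
  if (observed.errorRate > limits.maxErrorRate) reasons.push("error rate above limit");
  if (observed.p99LatencyMs > limits.maxP99LatencyMs) reasons.push("P99 latency above limit");
  if (observed.throughputPerMin < baselineThroughputPerMin * limits.minThroughputRatio) {
    reasons.push("throughput below expected ratio of baseline");
  }
  if (reasons.length > 0) return { action: "hold", reasons };

  // Healthy window: move to the next configured step, if there is one.
  const next = steps.find((step) => step > currentPercent);
  return next === undefined ? { action: "complete" } : { action: "advance", nextPercent: next };
}

// Usage: steps are from the article; thresholds and metrics are made up.
const rolloutSteps = [5, 20, 50, 80];
const gateLimits: GateThresholds = { maxErrorRate: 0.005, maxP99LatencyMs: 800, minThroughputRatio: 0.9 };
console.log(evaluateGate(20, rolloutSteps, { errorRate: 0.001, p99LatencyMs: 410, throughputPerMin: 130 }, 125, gateLimits));
console.log(evaluateGate(50, rolloutSteps, { errorRate: 0.02, p99LatencyMs: 2600, throughputPerMin: 70 }, 125, gateLimits));
```

Returning the reasons alongside a hold makes the decision easy to relay to stakeholders, which matters when a hold is unplanned.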

Weeks five through eight also saw the extraction of the Payment Gateway Integration service. This service was more complex because it required careful handling of external API contracts with six different payment providers. The team built a provider abstraction layer that standardized request/response shapes regardless of the underlying payment processor, then gradually migrated each provider's integration.
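A provider abstraction layer of this kind is essentially the adapter pattern. The sketch below shows two imaginary processors with different raw response shapes being normalized to one result type; the provider names, raw shapes, and the `ProviderRegistry` class are all made up and do not reflect FinPulse's actual integrations.

```typescript
// Hypothetical sketch: adapters normalize provider-specific responses to one shape.

interface ChargeRequest { orderId: string; amountMinor: number; currency: string; }
interface ChargeResult { orderId: string; approved: boolean; providerRef: string; }

interface PaymentProvider {
  readonly providerName: string;
  charge(req: ChargeRequest): ChargeResult;
}

// Made-up raw response shapes for two imaginary processors.
type AlphaRaw = { ok: boolean; txn_id: string };
type BetaRaw = { result_code: string; reference: string };

class AlphaAdapter implements PaymentProvider {
  readonly providerName = "alpha";
  private call: (req: ChargeRequest) => AlphaRaw;
  constructor(call: (req: ChargeRequest) => AlphaRaw) { this.call = call; }
  charge(req: ChargeRequest): ChargeResult {
    const raw = this.call(req);
    return { orderId: req.orderId, approved: raw.ok, providerRef: raw.txn_id };
  }
}

class BetaAdapter implements PaymentProvider {
  readonly providerName = "beta";
  private call: (req: ChargeRequest) => BetaRaw;
  constructor(call: (req: ChargeRequest) => BetaRaw) { this.call = call; }
  charge(req: ChargeRequest): ChargeResult {
    const raw = this.call(req);
    // Assumption for this sketch: "00" means approved for this imaginary provider.
    return { orderId: req.orderId, approved: raw.result_code === "00", providerRef: raw.reference };
  }
}

class ProviderRegistry {
  private providers = new Map<string, PaymentProvider>();
  register(provider: PaymentProvider): void {
    this.providers.set(provider.providerName, provider);
  }
  charge(providerName: string, req: ChargeRequest): ChargeResult {
    const provider = this.providers.get(providerName);
    if (!provider) throw new Error(`unknown provider: ${providerName}`);
    return provider.charge(req);
  }
}

// Usage with stubbed provider clients.
const registry = new ProviderRegistry();
registry.register(new AlphaAdapter(() => ({ ok: true, txn_id: "A-1" })));
registry.register(new BetaAdapter(() => ({ result_code: "05", reference: "B-9" })));
const sampleCharge: ChargeRequest = { orderId: "o-1", amountMinor: 2500, currency: "SGD" };
console.log(registry.charge("alpha", sampleCharge), registry.charge("beta", sampleCharge));
```

Because callers depend only on `PaymentProvider`, each integration can be moved behind the interface one provider at a time, which fits the gradual migration described above.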

Weeks 9–10: Remaining Services and Dual-Write Maturation

The Subscription Billing and Merchant Settlement services were extracted in weeks nine and ten. These were the most data-intensive extractions because both services maintained complex relational state that had been deeply embedded in the monolith's database. The dual-write abstraction layer proved invaluable here: it allowed the team to build and validate the new services without modifying the monolith's database schema or risking data integrity.

By the end of week ten, all five target services were deployed, routing production traffic independently, and writing to their own data stores. The monolith was still running but had been stripped of the five extracted workflows and was handling only legacy user self-service operations.

Weeks 11–12: Compliance, Documentation, and Decommission

The final two weeks were dedicated to compliance review, audit trail validation, and complete monolith decommission. The operations team worked through every transaction record from the migration window, cross-referencing legacy database entries against the new Audit & Compliance Tracker service. Zero discrepancies were found. The monolith was taken offline on day 90, marking the official end of the migration.
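The cross-referencing step amounts to a keyed comparison of two record sets. The sketch below illustrates that idea; the record fields and the `reconcile` function are assumptions, since the article does not describe the schema or tooling the operations team used.

```typescript
// Hypothetical sketch: compare legacy entries against the new audit store by id.

interface LedgerEntry { transactionId: string; amountMinor: number; state: string; }

interface ReconciliationReport {
  matched: number;
  missingInNew: string[];    // present in legacy only
  missingInLegacy: string[]; // present in the new store only
  mismatched: string[];      // same id, different amount or state
}

function reconcile(legacy: LedgerEntry[], audited: LedgerEntry[]): ReconciliationReport {
  const auditedById = new Map<string, LedgerEntry>();
  for (const entry of audited) auditedById.set(entry.transactionId, entry);

  const report: ReconciliationReport = { matched: 0, missingInNew: [], missingInLegacy: [], mismatched: [] };
  for (const entry of legacy) {
    const other = auditedById.get(entry.transactionId);
    if (other === undefined) {
      report.missingInNew.push(entry.transactionId);
      continue;
    }
    if (other.amountMinor === entry.amountMinor && other.state === entry.state) {
      report.matched++;
    } else {
      report.mismatched.push(entry.transactionId);
    }
    auditedById.delete(entry.transactionId); // whatever remains is new-store only
  }
  report.missingInLegacy = Array.from(auditedById.keys());
  return report;
}

// Usage with made-up records.
const legacyRows: LedgerEntry[] = [
  { transactionId: "t1", amountMinor: 100, state: "settled" },
  { transactionId: "t2", amountMinor: 250, state: "settled" },
  { transactionId: "t3", amountMinor: 900, state: "failed" },
];
const auditRows: LedgerEntry[] = [
  { transactionId: "t1", amountMinor: 100, state: "settled" },
  { transactionId: "t2", amountMinor: 205, state: "settled" },
  { transactionId: "t4", amountMinor: 50, state: "pending" },
];
console.log(reconcile(legacyRows, auditRows));
```

A result of "zero discrepancies" corresponds to the three lists all being empty.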

Results

The migration delivered measurable improvements across every dimension the team had defined as critical.

Performance: Peak checkout latency dropped from a P99 of 3,200ms on the monolith to 420ms on the new Transaction Orchestrator service, an 87% improvement. The cascade failure that had plagued Black Friday was eliminated; during the 2024 Black Friday period, the platform processed 520,000 transactions in 24 hours without a single service-layer failure.

Deployment velocity: Post-migration, the team achieved an average of 8.2 deployments per week across all five services, compared to approximately 2.4 deployments per month on the legacy platform. Critical compliance patches that previously took three weeks to ship now completed in under 48 hours on average.

Operational cost: The reconciliation discrepancy that had required 40 person-hours per month of manual work was reduced to zero. The operations team reported that month-end close, which had previously required a four-day sprint, now completes in a single automated report generated by the new Audit & Compliance Tracker service.

Infrastructure efficiency: Autoscaling on the new Kubernetes-based platform reduced infrastructure allocation costs by approximately 34% compared to the overprovisioned monolith servers, which had been sized for peak holiday capacity but spent most of the year at under 30% utilization.

Key Metrics

| Metric | Before Migration | After Migration | Change |
| --- | --- | --- | --- |
| Peak P99 Checkout Latency | 3,200ms | 420ms | ↓ 87% |
| Daily Transaction Volume | 180,000 avg / 400K peak | 520,000 avg / 720K peak | ↑ 189% avg / ↑ 80% peak |
| Weekly Deployments | ~0.6 | 8.2 | ↑ 1,267% |
| Critical Patch Lead Time | ~21 days | ~2 days | ↓ 90% |
| Monthly Manual Reconciliation | 40 person-hours | 0 | ↓ 100% |
| Infrastructure Costs | Baseline | 34% below baseline | ↓ 34% |
| Service-Level Uptime | 99.82% | 99.98% | ↑ 16 bps |
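Several of the Change figures can be reproduced directly from the Before and After columns. The short check below uses only numbers from the table; the helper names `percentChange` and `basisPointDelta` are illustrative.

```typescript
// Relative change between two measurements, in percent (negative means a decrease).
function percentChange(before: number, after: number): number {
  return ((after - before) / before) * 100;
}

// Difference between two percentages, in basis points (1 bps = 0.01 percentage points).
function basisPointDelta(beforePercent: number, afterPercent: number): number {
  return (afterPercent - beforePercent) * 100;
}

console.log(Math.round(percentChange(3200, 420)));      // -87  (P99 latency, ms)
console.log(Math.round(percentChange(0.6, 8.2)));       // 1267 (weekly deployments)
console.log(Math.round(percentChange(21, 2)));          // -90  (patch lead time, days)
console.log(Math.round(basisPointDelta(99.82, 99.98))); // 16   (uptime, bps)
```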

Lessons Learned

The FinPulse migration was completed in 90 days, on budget, with zero user-facing downtime and zero data discrepancies. The team walked away with several hard-earned lessons that are worth sharing for anyone considering a similar journey.

Strangler Fig Works When You Commit to It

The strangler-fig pattern is well-known in architecture literature, but it only works if you genuinely commit to incremental cutover rather than letting the old system linger indefinitely. FinPulse's team set explicit cutover milestones and enforced them: the monolith was genuinely decommissioned on day 90, not silently kept running in parallel for months after migration was "complete."

Shadow Traffic Is Worth More Than Staging Environments

The week of shadow traffic testing, running live production requests through the new service alongside the monolith, caught three bugs that six weeks of staging testing had missed. The same principle held during the weighted cutover itself: the connection pool exhaustion issue appeared only at the 50% mark, under real production load. Live traffic exercises patterns that staging cannot replicate.

The Dual-Write Abstraction Layer Was the Critical Investment

Investing in a dual-write abstraction layer in weeks two and three, when the pressure to ship features was highest, paid for itself many times over. The layer cost roughly 80 engineering hours to build but eliminated all risk of data inconsistency during the entire migration window. Without it, the team would have faced a high-risk bulk data transfer or a risky cutover with potential transaction loss.

Observability Is Not Optional on Migration Day

The instrumentation plan (OpenTelemetry spans distributed across every service, a centralized metrics dashboard, and alerting thresholds tied to every migration milestone) made it possible to detect the connection pool issue at exactly the moment it emerged rather than hours later when users would have started reporting failures. The investment in observability paid for itself in that single incident.

Communication Beats Architecture

The project lead attributed 60% of the migration's success to regular, transparent communication with stakeholders. It is not a particularly glamorous engineering insight, but it is an accurate one. The compliance team, the operations team, the customer success team, and engineering leadership received a weekly status digest from day one. When the 50% milestone required an unplanned 48-hour hold, stakeholders understood the context, the risk profile, and the remediation plan within one hour of the decision being made. That level of buy-in was built over 12 weeks of consistent communication, not manufactured at a crisis moment.

FinPulse's billing platform is now running on a five-service microservices architecture, processing more transactions at lower latency than at any point in its history, with a team that ships changes faster and with far less operational anxiety than it did during the monolith era. The migration proved that large, complex, no-downtime infrastructure transitions are not just possible: they can be executed cleanly and delivered ahead of expectations with the right pattern, the right team discipline, and a clear commitment to incremental confidence over dramatic risk.
