17 June 2026 • 5 min read
Zero-Downtime Migration: How We Rebuilt a 200K-User Fintech Platform on NestJS, Flutter & AWS
When our client's monolith hit 200,000 active users, every deployment turned into a five-hour risk event. This case study documents how we architected a zero-downtime migration path, broke the monolith into independent NestJS services, rebuilt the mobile experience in Flutter, and absorbed peak traffic without user-facing downtime. The result: 40% faster deployments, 99.99% uptime, and a platform ready to scale fivefold.
Overview
In early 2025, a payments platform processing transactions across India and Southeast Asia was quietly becoming a victim of its own success. With 200,000 active users, transaction throughput had tripled year-over-year, but the underlying architecture was still a single Rails monolith backed by a tightly coupled SQL database layer. What began as fast deployments had morphed into after-deployment firefights. Engineering leadership did the right thing and brought in our team to design a migration path that didn't involve a rewrite, a 'big bang' cutover, or, most dangerously, downtime during high-traffic windows.
Challenge
The monolith served three primary concerns: payment processing, customer identity, and notifications. Each was owned by a separate team, but they all deployed as one artifact. A bug in the notification module meant a full pipeline halt. Database connection pools were exhausted during flash sales. Latency on the mobile app spiked because synchronous blocking calls inside the monolith serialized parallel requests across unrelated features.
Management demanded two things that seemed in tension: ship new features without regressions, and don't touch the production system directly. Investors were watching closely. The compliance team added another constraint: every schema migration required a weekend window and three sign-offs.
Goals
- Zero unplanned downtime throughout the migration, including cutover weekends.
- Independent deployment pipelines for payments, identity, and notifications.
- Latency reduction of 30% on mobile checkout flows.
- Database decoupling so teams could own their schemas without broadcast approvals.
- Mobile parity: feature-complete transition to Flutter while maintaining the existing React Native app.
Approach
We adopted an incremental strangler-fig pattern. Instead of rewriting the monolith, we intercepted traffic at the API gateway and routed feature-specific requests to new NestJS services. Each bounded context was extracted with its own PostgreSQL instance, its own caching layer backed by Redis, and its own event contract published to Kafka.
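The routing idea behind the strangler-fig pattern can be sketched as a prefix table with the monolith as the default. This is a minimal illustration, not the gateway configuration used in the project: the path prefixes, upstream names, and the `resolveUpstream` helper are all hypothetical.

```typescript
// Hypothetical upstream names; the article does not list real hostnames.
type Upstream =
  | "monolith"
  | "identity-service"
  | "payments-service"
  | "notifications-service";

interface RouteRule {
  prefix: string;     // path prefix owned by an extracted service (assumed layout)
  upstream: Upstream; // where matching requests are sent
}

// Rules are added one bounded context at a time as services are extracted.
const extractedRoutes: RouteRule[] = [
  { prefix: "/api/identity/", upstream: "identity-service" },
  { prefix: "/api/payments/", upstream: "payments-service" },
];

// Pick the longest matching prefix; anything unmatched stays on the monolith.
function resolveUpstream(
  path: string,
  rules: RouteRule[] = extractedRoutes
): Upstream {
  let best: RouteRule | undefined;
  for (const rule of rules) {
    const matches = path.indexOf(rule.prefix) === 0;
    if (matches && (!best || rule.prefix.length > best.prefix.length)) {
      best = rule;
    }
  }
  return best ? best.upstream : "monolith";
}
```

Because the fallback is always the monolith, removing a rule is enough to send a feature's traffic back to the old code path.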
For mobile, we migrated the checkout and KYC flows first (the highest-value, highest-friction areas) and shipped the Flutter rewrite behind a remote-config flag. The React Native client remained live for users outside the rollout percentage, giving us instant rollback capability.
Implementation
Phase 1: Strangling the Monolith
We introduced an Envoy-based API gateway and began routing new feature requests to a NestJS identity service. The gateway used header-based canary routing, sending 5% of requests to the new service and observing error rates before increasing load. This gave engineering confidence to ship changes without touching the monolith's deploy pipeline.
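The "observe error rates before increasing load" step could be expressed as a small promotion function like the one below. It is a sketch under stated assumptions: only the 5% starting weight comes from the article, while the later steps, the 1% error threshold, and the minimum sample size are illustrative values. In practice the weights would be applied through the gateway's own configuration.

```typescript
// Only the 5% starting step comes from the article; the rest are assumptions.
const CANARY_STEPS: number[] = [5, 25, 50, 100]; // percent of requests to the new service
const MAX_ERROR_RATE = 0.01;                      // assumed error budget per window

interface CanaryWindow {
  requests: number; // requests the canary served in the observation window
  errors: number;   // failed requests in the same window
}

// Returns the next traffic weight.
// 0 means roll back; an unchanged weight means keep observing.
function nextCanaryWeight(
  current: number,
  window: CanaryWindow,
  minRequests: number = 1000
): number {
  if (window.requests < minRequests) {
    return current; // not enough data to decide yet
  }
  const errorRate = window.errors / window.requests;
  if (errorRate > MAX_ERROR_RATE) {
    return 0; // send all traffic back to the monolith
  }
  for (const step of CANARY_STEPS) {
    if (step > current) {
      return step; // promote to the next step
    }
  }
  return current; // already at full traffic
}
```

Keeping the decision in one pure function makes the promotion rule easy to test independently of the gateway.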
Phase 2: Event-Coupled Payments
The payments service consumed existing order events and emitted new settlement events. Because the monolith still wrote to the same PostgreSQL tables during cutover, we ran both systems in parallel for a full billing cycle, comparing balances and settlement records nightly. Any mismatch triggered a PagerDuty alert to the finance team.
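A nightly comparison of this kind might look like the sketch below. The record shape, field names, and the `reconcile` function are hypothetical, since the article does not describe the actual schema. Amounts are assumed to be stored as integers in minor currency units so that equality checks are exact.

```typescript
// Hypothetical record shape; the real settlement schema is not given in the article.
interface SettlementRecord {
  settlementId: string;
  amountMinor: number; // integer minor units (e.g. paise) to avoid float drift
}

interface Mismatch {
  settlementId: string;
  monolith: number | null; // null when the record is absent on that side
  service: number | null;
}

interface Pair {
  monolith: number | null;
  service: number | null;
}

// Compare both systems' records and return every disagreement, sorted by ID.
function reconcile(
  monolith: SettlementRecord[],
  service: SettlementRecord[]
): Mismatch[] {
  const byId: Record<string, Pair> = {};
  for (const r of monolith) {
    byId[r.settlementId] = { monolith: r.amountMinor, service: null };
  }
  for (const r of service) {
    const entry = byId[r.settlementId];
    if (entry) {
      entry.service = r.amountMinor;
    } else {
      byId[r.settlementId] = { monolith: null, service: r.amountMinor };
    }
  }
  const mismatches: Mismatch[] = [];
  for (const id of Object.keys(byId).sort()) {
    const e = byId[id];
    if (e !== undefined && e.monolith !== e.service) {
      mismatches.push({ settlementId: id, monolith: e.monolith, service: e.service });
    }
  }
  return mismatches;
}
```

A non-empty result is what would feed the alert. Records missing on one side are reported as well as differing amounts, because a silently dropped settlement is as serious as a wrong one.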
Phase 3: Flutter Mobile Takeover
The Flutter application was built using a shared kernel module for cryptographic operations and network clients. This allowed the same code to be unit-tested across mobile and a soon-to-be-released tablet kiosk experience. We used feature flags from LaunchDarkly to roll out in precisely controlled segments: 5% of users, then 25%, then 100%.
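Staged percentages like these are usually backed by deterministic bucketing, so a user who receives the new build at 5% keeps it at 25% and 100%. The sketch below illustrates that idea only. It is not LaunchDarkly's algorithm or SDK, and the hash function, the `isInRollout` helper, and the flag key are assumptions.

```typescript
// Simple 32-bit polynomial string hash; a stand-in for whatever the flag provider uses.
function hashToBucket(key: string): number {
  let h = 0;
  for (let i = 0; i < key.length; i++) {
    // ">>> 0" keeps the running value within unsigned 32-bit range.
    h = (h * 31 + key.charCodeAt(i)) >>> 0;
  }
  return h % 100; // bucket in [0, 99]
}

// A user is in the rollout when their bucket falls below the percentage.
// Including the flag key means different flags bucket users independently.
function isInRollout(userId: string, flagKey: string, percent: number): boolean {
  return hashToBucket(flagKey + ":" + userId) < percent;
}

// Hypothetical flag key and user ID, for illustration only.
const seesFlutterCheckout = isInRollout("user-123", "flutter-checkout", 25);
```

Because the bucket depends only on the flag key and user ID, raising the percentage only ever adds users, and lowering it to 0 returns everyone to the React Native client.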
Phase 4: Observability & Runbooks
Every NestJS service emitted structured logs to OpenSearch, metrics to Prometheus, and distributed traces to Tempo. We built migration-specific dashboards showing database replication lag between primary and read replicas, schema-migration drift detection, and a side-by-side mobile APM comparison of the React Native and Flutter builds.
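One simple way to think about schema-drift detection is as a diff between the columns a service expects and the columns actually present. The sketch below assumes both sides are available as plain name-to-type maps. How the real dashboards gathered this data is not described in the article, and `detectDrift` and the column names are hypothetical.

```typescript
type ColumnMap = Record<string, string>; // column name -> SQL type

interface Drift {
  missing: string[];     // expected but absent in the live schema
  unexpected: string[];  // present in the live schema but not expected
  typeChanged: string[]; // same name, different type
}

// Own-property check so inherited keys such as "toString" are never matched.
function has(map: ColumnMap, key: string): boolean {
  return Object.prototype.hasOwnProperty.call(map, key);
}

function detectDrift(expected: ColumnMap, actual: ColumnMap): Drift {
  const drift: Drift = { missing: [], unexpected: [], typeChanged: [] };
  for (const col of Object.keys(expected)) {
    if (!has(actual, col)) {
      drift.missing.push(col);
    } else if (actual[col] !== expected[col]) {
      drift.typeChanged.push(col);
    }
  }
  for (const col of Object.keys(actual)) {
    if (!has(expected, col)) {
      drift.unexpected.push(col);
    }
  }
  return drift;
}
```

The length of each list could be exported as a gauge metric, so a dashboard panel turns non-zero as soon as the two schemas diverge.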
Results
The migration completed in 14 weeks with no user-facing incidents. Deployment frequency went from twice per month to twice per day for independent services, and mean time to recover dropped from 90 minutes to under 12 minutes. Mobile checkout latency improved by 38%, well past the original 30% target. The new Flutter app received a 4.8-star rating on iOS and 4.7 on Android, with support tickets related to checkout dropping by 22%.
Metrics
- Uptime: 99.98% during migration window (vs. 99.75% baseline)
- Deployment frequency: 2x per month → 2x per day per service
- Mobile latency: 38% reduction on checkout flow
- Support tickets: -22% post Flutter full rollout
- Schema approval lead time: Reduced from 5 days to < 2 hours for independent services
- Transaction throughput: 3x increase handled without provisioning new database instances
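To put the uptime percentages in concrete terms, the arithmetic below converts them into minutes of unavailability. It assumes the "migration window" equals the 14-week project duration, which the article does not state explicitly, so the figures are illustrative only.

```typescript
// Minutes of unavailability implied by an uptime percentage over a window.
function downtimeMinutes(uptimePercent: number, windowDays: number): number {
  const totalMinutes = windowDays * 24 * 60;
  return totalMinutes * (1 - uptimePercent / 100);
}

// Assumption: the measurement window is the 14-week migration.
const windowDays = 14 * 7; // 98 days = 141,120 minutes

const duringMigration = downtimeMinutes(99.98, windowDays); // about 28 minutes
const atBaseline = downtimeMinutes(99.75, windowDays);      // about 353 minutes

console.log(duringMigration.toFixed(1), atBaseline.toFixed(1));
```

Under that assumption, the improvement from 99.75% to 99.98% is the difference between nearly six hours and under half an hour of unavailability over the same period.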
Lessons Learned
1. Strangler fig beats rewrite. Attempting to rebuild the entire platform from scratch would have introduced unknown unknowns. Incremental replacement let us validate business logic at each stage.
2. Mobile migration needs the same rigor as backend. We initially underestimated the Flutter effort because we treated it as another UI layer. Treating shared kernel modules as first-class engineering artifacts saved us weeks of rework when deadlines shifted.
3. Observability is a feature. Building bespoke dashboards before the first traffic shift meant we could spot schema drift before our finance team did. Transparency builds trust during migration fatigue.
4. Parallel-run is non-negotiable for payments. Ledger correctness cannot be compromised. Running both monolith and service side-by-side for a full billing cycle protected us from silent data corruption.
5. Design for rollback before design for rollout. Remote config flags, canary percentages, and feature gates let us recover in minutes instead of days. Time spent on rollback design is insurance that pays out.
