Migrating 2.8M Users to a Serverless Architecture: How We Cut Infrastructure Costs by 72% Without Downtime
When a high-growth fintech platform hit the 2.8M-user milestone, its five-year-old NestJS monolith, running on a burst-capacity EC2 fleet, was no longer keeping pace. In nine months, our team restructured that monolith into four serverless bounded contexts, reduced monthly AWS spend by 72%, dropped API p99 latency from 3,200 ms to 210 ms, and walked away with three formal compliance attestations, all without a single minute of customer-visible downtime. Here is how we did it, step by step.
Case Study · serverless · AWS · fintech · cloud migration · DynamoDB · Lambda · PCI DSS · cost optimization

## Overview
LegacyCore, a Series B fintech platform serving 2.8 million registered users across 12 countries, found itself at a crossroads. After five years of sustained growth (registered users up 340% in that time), the engineering team was spending more time firefighting infrastructure issues than building features. Database p99 latency had drifted past 3,200 ms during peak hours. Monthly AWS spend had climbed to $47,000. And the next round of funding required a bulletproof, auditable architecture before the board-level review.
The five-person architecture team at Webskyne was brought in to audit, design, and execute a full-stack migration — without triggering a single minute of unplanned downtime for a platform that processes $180M in transaction volume every month.
---
## The Challenge
The problems ran deeper than slow queries and rising bills. Three compounding factors made this migration uniquely difficult:
**1. Regulatory lock-in.** LegacyCore operates under three overlapping regulatory frameworks: PCI DSS for credit-card processing, GDPR for EU users, and MAS for Singapore operations. Every architecture decision needed to survive compliance scrutiny from the legal and security teams simultaneously.
**2. Zero-downtime mandate.** The board explicitly forbade any deployment window longer than 30 seconds — and the legacy load balancer had no graceful-drain support. A naive cutover would have orphaned in-flight transactions mid-swap.
**3. Knowledge silos.** The original monolith was built by a team that no longer worked at the company. Four of the six engineers who understood the data-pipeline internals had left in the previous six months.
The migration needed to be surgical.
---
## Goals
Before writing a single line of migration code, the team defined four explicit, measurable goals — each with gates signed off by the board's technical committee.
| Goal | Target | Significance |
|---|---|---|
| Infrastructure cost reduction | ≥ 60% in 9 months | Burn-rate management, investor confidence |
| P99 API latency | Drop below 250 ms | NPS, user acquisition funnel |
| Deployment cycle time | 90th percentile < 8 min | Team velocity, engineering morale |
| Zero unplanned downtime | Through migration | Legal, regulatory, and brand risk |
Every migration decision was evaluated against these four metrics.
---
## Approach
### Phase 1 — Observability Before Architecture
The team began with a **strangler-fig audit**, instrumenting every existing API endpoint, queue consumer, and cron job with OpenTelemetry spans across 90 production instances. The goal: understand the live system's actual behaviour rather than relying on legacy team folklore.
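As a flavor of what that instrumentation can look like, here is a minimal sketch, assuming an Express-style middleware chain and the `@opentelemetry/api` package; the tracer name and custom attribute are illustrative, not the team's actual configuration.

```typescript
// Minimal sketch: one OpenTelemetry span per request, recorded by an
// Express-style middleware. Tracer and attribute names are illustrative.
import { trace, SpanStatusCode } from "@opentelemetry/api";
import type { Request, Response, NextFunction } from "express";

const tracer = trace.getTracer("legacycore-monolith");

export function traceMiddleware(req: Request, res: Response, next: NextFunction) {
  const span = tracer.startSpan(`${req.method} ${req.path}`, {
    attributes: {
      "http.method": req.method,
      "http.route": req.path,
      // Feeds the read/write split analysis described below.
      "app.read_only": req.method === "GET",
    },
  });
  res.on("finish", () => {
    span.setAttribute("http.status_code", res.statusCode);
    if (res.statusCode >= 500) span.setStatus({ code: SpanStatusCode.ERROR });
    span.end();
  });
  next();
}
```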
The observability sprint revealed three surprises:
- **62% of API calls** were served by read-only endpoints that were being hit at roughly 40 requests/second but contributed nothing to transaction logic.
- The **primary write bottleneck** was an unintentional N+1 query in the transaction audit log, introduced by a 2022 schema change.
- **34% of EC2 spend** was serving idle background workers that had not processed a message in 72+ hours.
These findings completely changed the migration order of operations — read-only endpoints would be migrated first, not last.
### Phase 2 — Domain-Driven Strangulation
Rather than attempting a risky "lift-and-shift," the team applied **Domain-Driven Design** to identify four bounded contexts within the monolith:
1. **Transactions** — core write path, PCI-scoped
2. **Users** — identity, GDPR-scoped
3. **Ledger** — immutable audit logs (legal hold)
4. **Notifications** — e-mails, SMS, push
Each bounded context was wrapped behind an API gateway acting as an anti-corruption layer. Calls to each context were proxied either to the new service or to the monolith based on a feature flag, allowing traffic to shift gradually.
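A minimal sketch of that routing layer, assuming a generic feature-flag client and the `undici` HTTP client; the flag key, internal URLs, `FlagClient` interface, and bucketing scheme are assumptions for illustration:

```typescript
// Sketch of the anti-corruption routing: a deterministic per-user bucket is
// compared to a rollout percentage read from the flag system. Flag key,
// URLs, and the FlagClient interface are assumptions for illustration.
import { request } from "undici";

interface FlagClient {
  // Percentage of traffic (0-100) currently routed to the new service.
  rolloutPercent(flagKey: string, userId: string): Promise<number>;
}

const MONOLITH_URL = "http://monolith.internal";
const NEW_SERVICE_URL = "http://transactions.internal";

// Stable hash so a given user always hits the same backend mid-rollout.
function bucket(userId: string): number {
  let hash = 0;
  for (const ch of userId) hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  return hash % 100;
}

export async function routeRequest(flags: FlagClient, userId: string, path: string) {
  const percent = await flags.rolloutPercent("transactions-serverless", userId);
  const target = bucket(userId) < percent ? NEW_SERVICE_URL : MONOLITH_URL;
  const { body } = await request(`${target}${path}`, { headers: { "x-user-id": userId } });
  return body.json();
}
```

Deterministic bucketing matters here: a rollout that re-randomized per request would bounce a user between the monolith and the new service mid-session.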
### Phase 3 — Serverless by Context
The team chose a context-by-context serverless migration over container-based microservices for three concrete reasons: simpler compliance attestations for PCI-scoped services, per-execution billing that aligns cost directly with load, and scale-from-zero that removes the need for speculative over-provisioning.
| Service | AWS Tech | Peak TPS | EC2 Fleet Replaced |
|---|---|---|---|
| Transactions | Lambda + DynamoDB | 1,200 | 8 burst EC2 instances |
| Users | Lambda + DynamoDB | 850 | 4 EC2 instances |
| Ledger | S3 + Lambda + EventBridge | 60 | 3 EC2 instances |
| Notifications | SNS + SQS + Lambda | 4,000 | 12 EC2 instances |
Traces were replayed against staging environments 8,000–50,000 times per service before promotion.
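A hedged sketch of what such a replay pass can look like, assuming recorded production calls are stored as simple JSON records; the record shape and the status-code pass criterion are simplifications of whatever the team actually diffed:

```typescript
// Hedged sketch of a trace-replay pass: recorded production calls are fired
// at staging and checked against the recorded outcome. The record shape and
// the status-code pass criterion are simplifications.
import { request } from "undici";

interface RecordedCall {
  method: "GET" | "POST" | "PUT" | "DELETE";
  path: string;
  body?: unknown;
  expectedStatus: number;
}

export async function replay(stagingBase: string, calls: RecordedCall[]) {
  let mismatches = 0;
  for (const call of calls) {
    const res = await request(`${stagingBase}${call.path}`, {
      method: call.method,
      headers: { "content-type": "application/json" },
      body: call.body === undefined ? undefined : JSON.stringify(call.body),
    });
    await res.body.dump(); // drain the response so connections are reused
    if (res.statusCode !== call.expectedStatus) mismatches++;
  }
  return { total: calls.length, mismatches };
}
```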
---
## Implementation
### The Migration Orchestra
The implementation spanned **22 weeks** across four parallel workstreams:
**Week 1–6 — Observability and Feature Flagging**
The feature flag system (LaunchDarkly) was instrumented across all four contexts. A shadow traffic mode (100% duplicate read traffic mirrored to new services, 0% production effect) was enabled before any service went live.
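Conceptually, the shadow mode looked something like the following sketch, assuming an Express-style middleware and the `undici` client; the internal URL and header name are illustrative. The key property is that the mirrored call can never block or fail the production request:

```typescript
// Sketch of shadow-read mirroring: duplicate every GET to the new service,
// fire-and-forget, so production behaviour is untouched. The internal URL
// and header name are illustrative.
import type { Request, Response, NextFunction } from "express";
import { request } from "undici";

const SHADOW_BASE = "http://users-serverless.internal"; // assumed internal endpoint

export function shadowReads(req: Request, _res: Response, next: NextFunction) {
  if (req.method === "GET") {
    request(`${SHADOW_BASE}${req.originalUrl}`, { headers: { "x-shadow": "true" } })
      .then(({ body }) => body.dump()) // drain and discard the mirrored response
      .catch(() => {
        /* shadow failures are recorded by comparison tooling, never surfaced */
      });
  }
  next(); // the production path continues regardless of the mirror
}
```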
**Week 7–12 — Transactions Service**
PCI scope required VPC isolation with dedicated network policies. DynamoDB Streams + Lambda provided the write-sequencing guarantee that the legacy monolith's synchronous transaction log had relied on. The team spent four weeks replaying six months of historical production traffic through the new write path before enabling canary traffic.
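A minimal sketch of that stream consumer, using the `aws-lambda` type definitions; the attribute names (`transactionId`, `amount`) are illustrative. The relevant guarantee is that DynamoDB Streams delivers changes for a given partition key in order, which is what stood in for the monolith's synchronous transaction log:

```typescript
// Sketch of the stream consumer, using the aws-lambda type definitions.
// Attribute names (transactionId, amount) are illustrative. Streams deliver
// changes for a given partition key in order, so processing records in
// sequence preserves the write ordering the old transaction log provided.
import type { DynamoDBStreamEvent } from "aws-lambda";

export async function handler(event: DynamoDBStreamEvent): Promise<void> {
  for (const record of event.Records) {
    if (record.eventName !== "INSERT") continue;
    const image = record.dynamodb?.NewImage;
    if (!image) continue;
    // Append to the downstream audit sink here; throwing makes Lambda retry
    // the batch, so ordering within the shard is never skipped past.
    console.log("audit-append", image.transactionId?.S, image.amount?.N);
  }
}
```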
**Week 13–16 — Users and Ledger Services**
The users service — GDPR-flagged from day one — went live with per-request data residency routing (EU tokens routed to Paris, ASEAN tokens routed to Singapore). The ledger service moved to an append-only S3 bucket with per-partition EventBridge triggers, eliminating the last direct database write from the monolith.
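A sketch of the residency router's core decision, with an assumed token prefix standing in for whatever signed claim the real service decoded; `eu-west-3` (Paris) and `ap-southeast-1` (Singapore) are the actual AWS regions for those cities:

```typescript
// Sketch of the residency decision. The token prefix stands in for whatever
// signed claim the real router decoded; the endpoints use the actual AWS
// regions for Paris (eu-west-3) and Singapore (ap-southeast-1).
const REGION_ENDPOINTS: Record<"EU" | "ASEAN", string> = {
  EU: "https://users.eu-west-3.internal",
  ASEAN: "https://users.ap-southeast-1.internal",
};

function residencyOf(token: string): "EU" | "ASEAN" {
  // Assumption: an "eu_" prefix marks EU-resident users.
  return token.startsWith("eu_") ? "EU" : "ASEAN";
}

export function residencyEndpoint(token: string): string {
  return REGION_ENDPOINTS[residencyOf(token)];
}
```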
**Week 17–20 — Notifications Service and Load Reduction**
The notifications service was migrated to SNS/SQS/Lambda fan-out, replacing 12 long-running message queue consumers. This was where the largest infrastructure cost reduction occurred.
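The publish side of such a fan-out is small; here is a sketch using the AWS SDK v3 SNS client, with a placeholder topic ARN and assumed message attributes. Each subscribed SQS queue can use a filter policy on the `channel` attribute so its Lambda consumer only sees relevant messages:

```typescript
// Sketch of the publish side using the AWS SDK v3 SNS client. The topic ARN
// is a placeholder; the channel attribute lets each SQS subscription filter
// for only the messages its Lambda consumer handles.
import { SNSClient, PublishCommand } from "@aws-sdk/client-sns";

const sns = new SNSClient({});
const TOPIC_ARN = "arn:aws:sns:ap-southeast-1:123456789012:notifications"; // placeholder

export async function publishNotification(
  userId: string,
  channel: "email" | "sms" | "push",
  payload: object,
): Promise<void> {
  await sns.send(
    new PublishCommand({
      TopicArn: TOPIC_ARN,
      Message: JSON.stringify(payload),
      MessageAttributes: {
        channel: { DataType: "String", StringValue: channel },
        userId: { DataType: "String", StringValue: userId },
      },
    }),
  );
}
```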
**Week 21–22 — Decommission**
Synchronous batch jobs were rerouted, metrics validated in production, and the final monolith instances deprovisioned with zero dropped connections — the load balancer draining all remaining sessions before termination.
---
## Results
### 49% Regression Test Pass Rate at Week 12
Migration tests hit a catastrophic 49% pass rate during the transactions service canary phase. The root cause was a DynamoDB on-demand throughput ceiling that triggered the SDK's implicit exponential back-offs under load — something staging traces with synthetic load could not reproduce.
The workaround: applying DynamoDB adaptive capacity and pre-warming request units ahead of planned traffic ramp-ups. By week 14 the pass rate had recovered to 98%.
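One client-side complement to that workaround, sketched here under the assumption that the services used the AWS SDK v3 DynamoDB client (this is not the documented fix), is to configure retries explicitly so throttling shows up as a signal instead of silent latency:

```typescript
// Client-side complement (an assumption, not the documented fix): configure
// the AWS SDK v3 DynamoDB client so throttling is retried adaptively but
// still surfaces quickly rather than hiding as silent latency.
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";

const ddb = new DynamoDBClient({
  region: "ap-southeast-1",
  maxAttempts: 3,        // keep low so sustained throttling fails fast and is visible
  retryMode: "adaptive", // SDK rate-limits the client under repeated throttles
});
```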
### Infrastructure Cost Reduction (Final Numbers)
Within **9 months of launch**, the weekly infrastructure snapshot told a clear story:
| Metric | Pre-Migration | Post-Migration | Change |
|---|---|---|---|
| Monthly AWS spend | $47,000 | $13,000 | -72% |
| Peak p99 latency | 3,200 ms | 210 ms | -93% |
| EC2 instances | 90 | 14 (Lambda-managed) | -84% |
| Deployment cycle (p90) | 42 min | 6 min | -86% |
Beyond the headline numbers, decommissioning the idle background workers flagged in Phase 1 eliminated $9,400/month of waste on its own — a straightforward operational fix enabled by the new observability layer.
### Compliance Wins
Three formal compliance attestations were obtained in the post-migration audit (Q3 2026) that had either failed or been deferred during 2024 and 2025:
- **PCI DSS AOC** for the Transactions service
- **GDPR Art. 35 DPIA** for the Users service's data residency router
- **MAS TRM attestation** for Ledger append-only immutability
Migrating the Notifications service off its long-running EC2 consumers also eliminated a critical finding from the 2025 MAS audit: the queue consumers had been running with stale IAM credentials, a practice that would have failed an upcoming MAS requirement.
---
## Metrics
### Pre-Migration Baseline (April 2025)
- Average daily API p95: 1,840 ms
- Monthly AWS bill: $47,000
- Deployments per engineer per week: 0.8
- Unplanned downtime incidents per quarter: 4.2
### Post-Migration Snapshot (Q1 2026)
- Average daily API p95: 187 ms
- Monthly AWS bill: $13,000
- Deployments per engineer per week: 3.4
- Unplanned downtime incidents per quarter: 0
### Engineering Velocity Gains
Post-migration pull-request merge rate increased by 41% in the two full quarters following go-live — attributable primarily to reduced review friction from small, domain-scoped service changes replacing risky, cross-cutting monolith PRs.
---
## Lessons Learned
### 1. Observability is the migration plan
The team's instinct to instrument before architecting proved decisive. The discovery that 62% of API calls were read-only — revealed only because tracing was in place before source analysis began — changed the entire migration order. No migration should start without live-traffic traces running with full context.
### 2. Shadow traffic is not optional
The shadow-traffic layer was implemented out of an abundance of caution. In practice, it caught three high-severity correctness bugs and one PCI-scoped encoding regression in week 12 that would have reached production in a standard canary rollout. At the scale of the system, those four bugs would have cost the team weeks of incident response.
### 3. Feature flags pay for themselves before coding starts
LaunchDarkly licensing was $4,800/year. The first regression it helped detect returned an estimated $87,000 in avoided incident costs — on the legacy system, rolling back a failed deprecation had cost $110,000 in transaction reversals and customer support.
### 4. Compliance must lead architecture, not follow it
Bringing the security team into structured architecture reviews starting in week 2 — before any service was built — meant the Transactions service could carry its PCI flag from initial design. Had compliance review been left to the end, rebuilding the VPC, subnet, and DynamoDB encryption-key topology would have added 8–10 weeks and forced a re-review of the PCI AOC after GA.
### 5. Infrastructure cost savings are a team-building metric
Sharing the monthly cost-savings dashboards with the platform team changed the org culture. Engineering managers found that team members were proactively hunting for idle resources to remove — because the savings now had a visible face. The weekly cost report became the most-read internal dashboard in the engineering organization.
---
## Final Thought
The LegacyCore migration was not just an infrastructure story. It was a story about **trust** — the board's trust in the engineering team's ability to deliver without regulatory harm, the security team's trust in a new compliance posture, and the engineering team's trust in their own ability to build something better than what they had replaced.
Twenty-two weeks later, that trust had been earned, measured, and formally documented in three compliance attestations, a 72% bill reduction, and zero minutes of unplanned downtime.
That return on engineering investment is measurable, rare, and absolutely reproducible.