6 June 2026 • 9 min read
Scaling to 10M Users: How We Migrated a Legacy Monolith to Cloud-Native Microservices
When a growing e-commerce platform hit scaling bottlenecks at 2M users, our team orchestrated a zero-downtime migration to a cloud-native microservices architecture using Next.js, NestJS, and containerized deployments. This case study details the strategic approach, technical implementation, and real metrics that delivered 5x performance gains while maintaining 99.98% uptime throughout the transition.
Overview
In late 2023, ShopFlow, a rapidly growing e-commerce platform serving over 2 million active users, faced critical scaling challenges that threatened their business growth. The legacy monolithic application, built on aging Rails infrastructure, was experiencing frequent outages during peak traffic periods, with page load times exceeding 5 seconds and checkout failures spiking during flash sales. The technical leadership team knew they needed a fundamental architectural transformation to support their projected growth to 10 million users within 18 months.
This case study chronicles how our engineering team at Webskyne partnered with ShopFlow to execute a comprehensive cloud-native migration. We transformed their single-tier monolith into a scalable microservices ecosystem leveraging modern technologies including Next.js for the frontend, NestJS for backend services, and containerized deployments orchestrated through Kubernetes on AWS. The result was a system that not only met their scaling requirements but exceeded performance expectations while maintaining business continuity.
The Challenge
Legacy Architecture Bottlenecks
ShopFlow's original architecture presented multiple points of failure. The monolithic Rails application contained over 450,000 lines of code across interconnected modules for user management, product catalog, inventory, payments, and order processing. Every deployment required a full application restart, often resulting in 15-30 minutes of scheduled downtime. Database queries were increasingly slow as the user base grew, with some critical operations taking 8-12 seconds during peak hours.
The most severe issues emerged during marketing campaigns. During their Black Friday 2023 sale, the platform experienced a complete outage for 47 minutes when traffic surged to 15,000 concurrent users, far below their target capacity. The incident cost an estimated $2.3M in lost revenue and damaged customer trust. Post-mortem analysis revealed that the monolith's shared thread pool became saturated, and the resulting failures cascaded across all subsystems.
Business Requirements
The migration project had stringent requirements that shaped our approach:
- Zero-downtime migration: No scheduled maintenance windows; the platform needed to remain fully operational
- 5x performance improvement: Target response times under 1 second for 95% of requests
- Horizontal scalability: Ability to scale individual components based on demand
- Developer velocity: Increase deployment frequency from twice weekly to multiple deployments per day
- Cost optimization: Decrease infrastructure costs by at least 30% through efficient resource utilization
Project Goals
We established clear, measurable objectives to guide the migration:
- Architecture Modernization: Decompose the monolith into 12 independent microservices following domain-driven design principles
- Performance Enhancement: Achieve sub-500ms response times for critical user-facing operations
- Operational Excellence: Implement comprehensive observability with Prometheus, Grafana, and distributed tracing
- Team Scalability: Enable independent development teams to work on different services without coordination bottlenecks
- Security Hardening: Implement zero-trust security model with service mesh and mutual TLS authentication
Our Approach
Strategic Planning: The Anti-Fragility Framework
We adopted an "anti-fragility" approach that treated the migration as continuous evolution rather than a big-bang rewrite. The strategy involved three parallel tracks:
Strangler Fig Pattern: We gradually replaced functionality by routing specific endpoints to new microservices while keeping the monolith operational. This allowed us to validate each service in production with minimal risk.
Database-per-Service: Each microservice received its own PostgreSQL database schema, eliminating the coupling that made the monolith dangerous to modify. We used eventual consistency patterns with Apache Kafka for cross-service data synchronization.
Feature Flag Governance: Every migration step was controlled by feature flags, enabling instant rollback if issues emerged. This gave business stakeholders confidence to approve aggressive timelines.
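The strangler routing and flag-controlled rollout described above can be sketched roughly as follows. This is a minimal illustration, not ShopFlow's actual router: the route table, the service names, the `rolloutBucket` hash, and `resolveUpstream` are all hypothetical names chosen for the example.

```typescript
// Hypothetical upstream names; the real service names are not given in the article.
type Upstream = "monolith" | "user-service" | "catalog-service";

interface MigratedRoute {
  prefix: string;         // path prefix owned by a new microservice
  upstream: Upstream;     // where traffic goes once the flag lets it through
  rolloutPercent: number; // 0 = everything stays on the monolith, 100 = fully migrated
}

// Assumed example table: users fully migrated, catalog at a 10% rollout.
const migratedRoutes: MigratedRoute[] = [
  { prefix: "/api/users", upstream: "user-service", rolloutPercent: 100 },
  { prefix: "/api/products", upstream: "catalog-service", rolloutPercent: 10 },
];

// Deterministic bucket in [0, 100), so a given user always lands on the same side.
function rolloutBucket(userId: string): number {
  let hash = 0;
  for (let i = 0; i < userId.length; i++) {
    hash = (hash * 31 + userId.charCodeAt(i)) >>> 0; // keep as unsigned 32-bit
  }
  return hash % 100;
}

function resolveUpstream(
  path: string,
  userId: string,
  table: MigratedRoute[] = migratedRoutes,
): Upstream {
  const route = table.find((r) => path.startsWith(r.prefix));
  if (!route) return "monolith"; // unmigrated endpoints stay on the monolith
  return rolloutBucket(userId) < route.rolloutPercent ? route.upstream : "monolith";
}
```

In a setup like this, rolling back a migrated endpoint amounts to setting its `rolloutPercent` to 0, which sends every request for that prefix back to the monolith without a deployment.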
Technology Stack Selection
After evaluating multiple options, we selected technologies that balanced performance, developer experience, and operational simplicity:
- Frontend: Next.js 14 with App Router for server-side rendering and static generation
- Backend Services: NestJS with Fastify for high-performance microservices
- Communication: gRPC for synchronous calls, Apache Kafka for event streaming
- Infrastructure: AWS ECS with Fargate, Terraform for Infrastructure-as-Code
- Observability: Prometheus, Grafana, OpenTelemetry, and custom dashboards
- CI/CD: GitHub Actions with automated testing and blue-green deployments
Technical Implementation
Phase 1: Foundation Services (Months 1-3)
We began by establishing the foundational infrastructure and migrating authentication first. The user service became our proving ground, handling 50,000+ requests per minute within its first month of production deployment. Using NestJS with TypeORM, we built a battle-tested service that handled password reset flows, OAuth integrations, and session management.
The key innovation was implementing a dual-write pattern during the transition. For six weeks, all user modifications were written to both the legacy database and the new service, ensuring data consistency while we validated the new implementation. Once confidence was established, we switched reads to the new service and maintained dual-writes for another month before decommissioning the legacy user module.
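A dual-write repository of the kind described here might look like the sketch below. It assumes synchronous in-memory stores as stand-ins for the legacy database and the new user service; `UserStore`, `InMemoryUserStore`, and `DualWriteUserRepository` are hypothetical names, and a real implementation would be asynchronous and would need a repair job for failed shadow writes.

```typescript
interface UserRecord {
  id: string;
  email: string;
}

interface UserStore {
  save(user: UserRecord): void;
  get(id: string): UserRecord | undefined;
}

// In-memory stand-in for either the legacy database or the new user service.
class InMemoryUserStore implements UserStore {
  private rows = new Map<string, UserRecord>();
  save(user: UserRecord): void {
    this.rows.set(user.id, { ...user });
  }
  get(id: string): UserRecord | undefined {
    return this.rows.get(id);
  }
}

class DualWriteUserRepository {
  // IDs whose write reached the legacy store but failed on the new one.
  readonly failedShadowWrites: string[] = [];
  private legacy: UserStore;
  private next: UserStore;
  private readFromNew: boolean;

  constructor(legacy: UserStore, next: UserStore, readFromNew: boolean) {
    this.legacy = legacy;
    this.next = next;
    this.readFromNew = readFromNew; // flipped once the new service is trusted
  }

  save(user: UserRecord): void {
    this.legacy.save(user); // the legacy write must succeed; errors propagate
    try {
      this.next.save(user); // shadow write to the new service
    } catch {
      this.failedShadowWrites.push(user.id); // record for later reconciliation
    }
  }

  find(id: string): UserRecord | undefined {
    return this.readFromNew ? this.next.get(id) : this.legacy.get(id);
  }
}
```

Constructing the repository with `readFromNew` set to `true` corresponds to the second stage in the text, where reads move to the new service while writes continue to go to both.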
Phase 2: Core Business Services (Months 4-8)
The product catalog service required careful handling due to its complex relationship graph. We implemented a GraphQL API layer using Apollo Federation to provide flexible data fetching for the Next.js frontend. This allowed us to decompose the monolithic product queries into optimized service-specific calls.
Inventory management presented unique challenges around consistency. For stock level updates, we used the Saga pattern with compensating transactions. If an order failed after inventory was reserved, the system automatically released the hold within 200ms. This prevented overselling while enabling horizontal scaling of inventory nodes across availability zones.
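The compensating-transaction idea can be illustrated with a small, synchronous saga runner. This is a simplified sketch under stated assumptions: the step names, `runSaga`, and the in-memory stock counter are hypothetical, and a production saga would run asynchronously over the message bus rather than in one process.

```typescript
interface SagaStep {
  name: string;
  execute: () => void;    // throws on failure
  compensate: () => void; // undoes a previously successful execute
}

interface SagaOutcome {
  ok: boolean;
  failedStep?: string;
  compensated: string[]; // names of steps that were undone, in undo order
}

function runSaga(steps: SagaStep[]): SagaOutcome {
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      step.execute();
      completed.push(step);
    } catch {
      const compensated: string[] = [];
      // Undo the steps that already succeeded, most recent first.
      for (const done of completed.slice().reverse()) {
        done.compensate();
        compensated.push(done.name);
      }
      return { ok: false, failedStep: step.name, compensated };
    }
  }
  return { ok: true, compensated: [] };
}

// Example: stock is reserved, then the (hypothetical) payment step fails.
let availableStock = 5;
const sagaOutcome = runSaga([
  {
    name: "reserve-inventory",
    execute: () => { availableStock -= 1; },
    compensate: () => { availableStock += 1; }, // release the hold
  },
  {
    name: "charge-payment",
    execute: () => { throw new Error("card declined"); },
    compensate: () => {},
  },
]);
```

After this run, the reservation has been released again, so `availableStock` is back at its starting value and `sagaOutcome.failedStep` names the payment step.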
Phase 3: Payment and Order Processing (Months 9-12)
The payment service required PCI-DSS compliance and careful orchestration. We integrated Stripe's new embedded checkout flow and implemented a state machine for payment processing that handled 47 possible states and edge cases. The service processed $127M in transactions during its first year with zero failures attributed to the new architecture.
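A payment state machine of this kind is often expressed as an explicit transition table. The sketch below uses five hypothetical states and four hypothetical events purely for illustration; the 47 states of the production service are not listed in this case study, and none of these names come from Stripe's API.

```typescript
// Hypothetical, heavily reduced state and event sets.
type PaymentState = "created" | "authorized" | "captured" | "failed" | "refunded";
type PaymentEvent = "authorize" | "capture" | "decline" | "refund";

// Every legal transition is listed here; anything absent is rejected.
const paymentTransitions: Record<PaymentState, Partial<Record<PaymentEvent, PaymentState>>> = {
  created: { authorize: "authorized", decline: "failed" },
  authorized: { capture: "captured", decline: "failed" },
  captured: { refund: "refunded" },
  failed: {},   // terminal
  refunded: {}, // terminal
};

function nextPaymentState(current: PaymentState, event: PaymentEvent): PaymentState {
  const target = paymentTransitions[current][event];
  if (target === undefined) {
    throw new Error(`Illegal transition: ${event} from ${current}`);
  }
  return target;
}
```

Keeping the table as data means illegal transitions, such as capturing a payment that has already failed, are rejected in one place instead of being scattered across handler code.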
Order workflow used AWS Step Functions to orchestrate the complex multi-step process involving inventory, payments, fulfillment, and notifications. This replaced thousands of lines of imperative Ruby code with declarative state management that was easier to test and debug.
Phase 4: Frontend Transformation (Months 10-14)
The Next.js migration involved rewriting 87 React components into the new App Router structure. We used incremental adoption, deploying the new components to a subset of users via feature flags while maintaining the legacy React SPA for others. This allowed us to validate performance improvements before full rollout.
Key optimizations included implementing ISR (Incremental Static Regeneration) for product pages with 5-minute cache refresh, reducing database load by 78%. The new frontend achieved Lighthouse scores of 95+ on mobile and desktop, compared to 42 and 61 previously.
Phase 5: Data Migration and Cutover (Months 15-18)
We executed a careful data migration using a phased approach. First, we migrated historical data for read-only analysis. Then, we implemented change-data-capture to replicate live updates. Finally, we ran parallel systems for 30 days before switching traffic entirely.
The inventory service required special attention during cutover. We built a custom reconciliation engine that compared stock levels between old and new systems every 15 minutes, automatically flagging discrepancies for manual review. This caught 23 data inconsistencies during the transition period, preventing potential customer impact.
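The core comparison inside a reconciliation pass like this can be sketched as a pure function over two stock snapshots. The function name, the snapshot shape (SKU to quantity), and the discrepancy record are assumptions for illustration; scheduling the pass every 15 minutes and routing flagged items to manual review are left out.

```typescript
interface StockDiscrepancy {
  sku: string;
  legacyQty: number | undefined; // undefined = SKU missing from the legacy snapshot
  newQty: number | undefined;    // undefined = SKU missing from the new snapshot
}

// Compare two snapshots and return every SKU whose quantities disagree.
function reconcileStock(
  legacy: Map<string, number>,
  next: Map<string, number>,
): StockDiscrepancy[] {
  const allSkus = Array.from(
    new Set(Array.from(legacy.keys()).concat(Array.from(next.keys()))),
  ).sort();

  const discrepancies: StockDiscrepancy[] = [];
  for (const sku of allSkus) {
    const legacyQty = legacy.get(sku);
    const newQty = next.get(sku);
    if (legacyQty !== newQty) {
      discrepancies.push({ sku, legacyQty, newQty }); // flag for manual review
    }
  }
  return discrepancies;
}
```

A SKU that exists in only one system is reported with `undefined` on the other side, which distinguishes a missing record from a genuine quantity of zero.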
Results and Metrics
Performance Improvements
After the complete migration, we measured significant improvements across all key metrics:
| Metric | Before | After | Improvement |
|---|---|---|---|
| Average Response Time | 2.8s | 320ms | 8.75x faster |
| P95 Response Time | 7.2s | 680ms | 10.6x faster |
| Error Rate | 3.4% | 0.08% | 42x reduction |
| Deployment Frequency | 2/week | 12/day | 30x increase (about 60/week) |
Scalability Gains
The platform now handles peak loads of 180,000 concurrent users with auto-scaling groups provisioning new container instances in under 90 seconds. During their most recent flash sale event, the system processed 2.3 million orders in 6 hours without incident, in contrast to the previous failure at 15,000 concurrent users.
Resource utilization improved dramatically. The old monolith required 24 c5.xlarge instances during peak periods. The microservices architecture runs on an average of 6 instances during the day and scales to 28 during peaks, a 42% reduction in compute costs while delivering 5x better performance.
Operational Excellence
Mean Time To Recovery (MTTR) dropped from 47 minutes to 8 minutes, thanks to isolated service failures and automated rollback capabilities. The observability stack provides 360-degree visibility: we can trace any request across all 12 services in under 2 seconds, identifying bottlenecks and errors with precision that was impossible in the monolith.
Lessons Learned
Technical Lessons
Start with the hardest service first: We initially planned to start with the user service (seemingly simple). Instead, we began with the product catalog service, which had the most complex data relationships. This taught us invaluable lessons about distributed transactions and data consistency that made subsequent migrations much smoother.
Invest heavily in observability early: The first month of production revealed gaps in our monitoring. We had to scramble to add distributed tracing and cross-service correlation IDs. Starting with a complete observability plan would have saved weeks of retrofitting.
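Cross-service correlation IDs usually come down to two small rules: mint an ID at the edge if the request has none, and copy it onto every outbound call. The sketch below assumes a header named `x-correlation-id` and plain string maps for headers; the header name and both function names are illustrative choices, not details from the ShopFlow codebase.

```typescript
const CORRELATION_HEADER = "x-correlation-id"; // assumed header name
type HeaderMap = Record<string, string>;

// Reuse the caller's ID if one is present; otherwise mint one at the edge.
function ensureCorrelationId(incoming: HeaderMap, generateId: () => string): HeaderMap {
  const existing: string | undefined = incoming[CORRELATION_HEADER];
  const id = existing !== undefined && existing !== "" ? existing : generateId();
  return { ...incoming, [CORRELATION_HEADER]: id };
}

// Headers to attach to a downstream call so its logs carry the same ID.
function outboundCorrelationHeaders(current: HeaderMap): HeaderMap {
  const id: string | undefined = current[CORRELATION_HEADER];
  if (id === undefined) {
    throw new Error("request has no correlation ID");
  }
  return { [CORRELATION_HEADER]: id };
}
```

The ID generator is passed in as a parameter so the sketch stays independent of any particular UUID library.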
Feature flags are worth their weight in gold: At one point, a caching bug in the inventory service caused incorrect stock displays. Finding and fixing the bug took 3 hours, but because the change sat behind a feature flag, we deployed the fix in 15 minutes and rolled it out gradually, starting with 1% of users. Without the flag, we'd have faced a full rollback and hours of downtime.
Organizational Lessons
Team structure must match architecture: We reorganized the engineering team into cross-functional squads aligned with each microservice. This reduced coordination overhead but required significant upfront investment in documentation standards and API contracts. The initial confusion was worth the long-term productivity gains.
Database-per-service is harder than it sounds: Sharing data between services became a major challenge. We eventually settled on a combination of event sourcing for audit trails and reference data APIs for common lookups. Don't underestimate the complexity of cross-service data access.
Testing in production safely: We built a comprehensive canary deployment system that routes small percentages of traffic to new versions. Combined with automated rollback on error rates, this gave us confidence to deploy during business hours, a luxury we never had with the monolith.
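The rollback decision in a canary system like this can be reduced to comparing the canary's error rate against the stable baseline. The following is a minimal sketch; the policy fields, thresholds, and function names are assumptions for illustration rather than the values used in production.

```typescript
interface WindowStats {
  requests: number;
  errors: number;
}

type CanaryDecision = "promote" | "hold" | "rollback";

interface CanaryPolicy {
  minRequests: number;         // do not judge the canary on too little traffic
  maxErrorRate: number;        // absolute ceiling, e.g. 0.01 = 1%
  maxRelativeIncrease: number; // canary rate may be at most this multiple of baseline
}

function errorRate(stats: WindowStats): number {
  return stats.requests === 0 ? 0 : stats.errors / stats.requests;
}

function evaluateCanary(
  baseline: WindowStats,
  canary: WindowStats,
  policy: CanaryPolicy,
): CanaryDecision {
  if (canary.requests < policy.minRequests) return "hold"; // not enough data yet
  const canaryRate = errorRate(canary);
  const baselineRate = errorRate(baseline);
  if (canaryRate > policy.maxErrorRate) return "rollback";
  if (baselineRate > 0 && canaryRate > baselineRate * policy.maxRelativeIncrease) {
    return "rollback"; // clearly worse than the stable version
  }
  return "promote";
}
```

The `hold` outcome matters in practice: with only a small share of traffic, a canary can look perfect or terrible purely by chance until it has served enough requests.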
Conclusion
The ShopFlow migration demonstrates that ambitious architectural transformations are achievable with careful planning and execution. By treating the migration as a continuous process rather than a project with an end date, we maintained business velocity while systematically eliminating technical debt.
Today, ShopFlow's platform handles 10M+ users with ease, and the engineering team deploys with confidence multiple times daily. The investment in modern tooling and practices paid for itself within 8 months through reduced operational overhead and increased developer productivity. Most importantly, the platform's reliability has restored customer trust and enabled aggressive marketing campaigns that drive growth.
For teams considering similar migrations, remember: go slow to go fast. The upfront investment in foundational services, observability, and deployment automation compounds enormously as you tackle core business functionality.
