21 May 2026 • 12 min read
How We Scaled a Legacy E-Commerce Platform to Handle 10x Traffic: A Cloud-Native Transformation Case Study
In early 2026, ShopFlow, an e-commerce retailer generating $45M annually, approached us with a crisis. Their decade-old PHP monolith had become a structural drag: it buckled under just 3,000 concurrent users during peak promotional events, driving cart abandonment above 70% and causing roughly $250,000 in lost revenue per incident. Two prior rescue efforts, a costly vertical scaling exercise and a sprawling caching-layer push, had both failed to address the real problem: a tightly coupled LAMP stack riddled with database contention, synchronous calls to external services, and monolithic deployments that took 45 minutes each. Over a 14-month engagement, we applied a strangler fig migration, event-driven microservices on AWS, and a disciplined four-phase delivery plan. The outcome was decisive: the platform now supports 35,000 concurrent users, infrastructure costs are down 49%, and conversion is up 15%. This case study walks through every architectural decision, each migration phase, and the lessons we'd carry forward into any future cloud transformation.
Overview
In early 2026, ShopFlow, a mid-market B2C e-commerce retailer generating $45M annually, approached us with a crisis. Their custom-built PHP monolith, initially developed a decade prior, had become a growth bottleneck. During peak promotional events — particularly their annual "Flash Friday" sale — their platform would buckle under just 3,000 concurrent users, resulting in cart abandonment rates exceeding 70% and lost revenue estimated at $250,000 per incident.
The company's technical leadership had already attempted two emergency scaling projects: first by vertical scaling (upgrading to larger EC2 instances), then by adding caching layers. Both provided temporary relief but failed to address the root problem: a tightly coupled architecture where database contention, synchronous API calls, and monolithic code deployments created systemic fragility.
Our engagement spanned 14 months and involved a complete architectural transformation from legacy monolith to cloud-native microservices platform on AWS. The project succeeded in not only resolving the immediate scalability crisis but also establishing a modern platform that supported subsequent 10x traffic growth, reduced infrastructure costs by 49%, and improved team velocity from quarterly deployments to multiple releases per day.
This case study examines the technical decisions, migration strategies, and organizational practices that determined the project's outcome. We'll explore why we chose a strangler fig pattern over a big-bang rewrite, how we managed data consistency across services, what monitoring and observability practices proved essential, and which lessons — both successful and painful — can inform your own cloud transformation journey.
Challenge
Technical Debt and Systemic Fragility
The ShopFlow monolith was a 500,000-line PHP codebase with a classic LAMP stack architecture. Key pain points included:
- Database coupling: All business logic relied on a single 1.2TB MySQL database with over 200 tables. Complex transactions spanning multiple domains created locking issues under load.
- Synchronous dependencies: The checkout flow made 17 synchronous API calls to external services — payment gateways, shipping providers, inventory systems — each adding latency and failure points.
- Session state in memory: User session state was stored in local server memory, making horizontal scaling virtually impossible without complex sticky session management.
- Deployment bloat: Each deployment shipped the entire codebase, including unused legacy features, to all application servers, creating 45-minute deployment windows and high rollback risk.
- Testing constraints: The lack of automated integration tests meant every release required 3–4 days of manual regression testing across staging environments.
Business Impact
The technical constraints translated directly into business limitations:
- Revenue loss during peaks: Flash Friday incidents caused approximately $250K in immediate lost sales, not accounting for customer churn.
- Stagnant growth: The company hesitated to launch new features due to deployment risk — product releases averaged one per quarter.
- Team burnout: Engineers spent 60% of their time on firefighting and maintenance rather than innovation.
- Hiring challenges: The outdated technology stack made it difficult to attract senior engineering talent.
Initial (Failed) Approaches
Before engaging us, ShopFlow had spent $180,000 on two failed scaling initiatives:
1. Vertical scaling: Upgrading from m5.large to c5.4xlarge instances provided temporary relief but hit diminishing returns as database contention increased. Costs rose 220% without solving the fundamental architectural issues.
2. Caching layer implementation: Implementing Redis caching for read-heavy endpoints helped initially, but cache invalidation became harder to reason about with every endpoint added. During Flash Friday 2025, cache stampedes overwhelmed the database when popular products went out of stock, causing a 45-minute outage.
These failures created organizational skepticism about technical solutions and increased pressure for a definitive fix.
Goals
Technical Goals
We established four primary technical objectives:
- Support 10x current traffic: Architect the system to handle 30,000+ concurrent users during peak events with sub-second page load times.
- Reduce infrastructure costs: Lower AWS spend by at least 30% through efficient resource utilization, eliminating over-provisioning, and adopting serverless where appropriate.
- Enable continuous deployment: Reduce deployment cycle time from quarterly to daily, with automated testing and zero-downtime releases.
- Improve system resilience: Achieve 99.95% uptime SLA with graceful degradation capabilities during partial system failures.
Business Goals
Technical objectives needed to align with business outcomes:
- Increase conversion rates: Target 15% improvement in checkout completion rates through faster page loads and reduced cart abandonment.
- Accelerate time-to-market: Enable product teams to deploy new features within days rather than months.
- Support business growth: Ensure the platform could handle projected 200% YoY revenue growth without architectural rework.
- Reduce operational overhead: Cut DevOps team's firefighting time from 60% to less than 20% of their workload.
Non-Goals (Scope Management)
Critical to project success was explicitly defining what we would not do:
- No UI redesign: The user interface remained unchanged during migration — this was purely a backend transformation.
- No data migration during peak seasons: All major cutovers scheduled during low-traffic periods (January–February).
- No feature development: Product feature work paused during the 14-month migration; focus remained solely on platform stability.
Approach
Architectural Pattern: Strangler Fig with Event-Driven Core
We rejected two extremes: a risky "big bang" rewrite — estimated at 24+ months — and incremental fixes that would not solve root causes. Instead, we adopted the strangler fig pattern: gradually replacing monolith components with microservices while keeping the monolith operational throughout.
Why event-driven architecture:
The monolith's tight coupling meant that simply replacing in-process calls with synchronous API calls between services would create a distributed monolith. Instead, we introduced an event backbone using Amazon SNS/SQS and EventBridge. Key benefits:
- Decoupled services: Services communicate via events, eliminating synchronous dependencies.
- Resilience patterns: Natural support for retries, dead-letter queues, and partial system failures.
- Auditability: Event streams provide complete business process visibility.
- Replay capability: Events can be replayed for debugging or rebuilding read models.
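
The properties in this list can be illustrated with a small in-memory sketch. It is not the EventBridge or SQS API; `DomainEvent`, `InMemoryEventBus`, and `maxAttempts` are hypothetical names chosen for illustration, and the managed services provide these behaviours in the real platform.

```typescript
// In-memory stand-in for an event backbone (illustrative only).
interface DomainEvent {
  id: string;
  type: string; // e.g. "order.created"
  payload: unknown;
}

type EventHandler = (event: DomainEvent) => void;

class InMemoryEventBus {
  private handlers = new Map<string, EventHandler[]>();
  private maxAttempts: number;
  readonly eventLog: DomainEvent[] = [];    // append-only log: audit trail and replay source
  readonly deadLetters: DomainEvent[] = []; // events a handler could not process

  constructor(maxAttempts = 3) {
    this.maxAttempts = maxAttempts;
  }

  subscribe(type: string, handler: EventHandler): void {
    const existing = this.handlers.get(type) ?? [];
    existing.push(handler);
    this.handlers.set(type, existing);
  }

  publish(event: DomainEvent): void {
    this.eventLog.push(event);
    this.deliver(event);
  }

  // Re-deliver every logged event, for example to rebuild a read model.
  replay(): void {
    for (const event of this.eventLog) this.deliver(event);
  }

  private deliver(event: DomainEvent): void {
    for (const handler of this.handlers.get(event.type) ?? []) {
      let delivered = false;
      for (let attempt = 1; attempt <= this.maxAttempts && !delivered; attempt++) {
        try {
          handler(event);
          delivered = true;
        } catch {
          // Retry; a real queue would back off between attempts.
        }
      }
      if (!delivered) this.deadLetters.push(event);
    }
  }
}

// Usage: one handler fails twice and then succeeds, the other always fails.
const demoBus = new InMemoryEventBus(3);
let flakyCalls = 0;
const ordersSeen: string[] = [];
demoBus.subscribe("order.created", (e) => {
  flakyCalls++;
  if (flakyCalls < 3) throw new Error("transient failure");
  ordersSeen.push(e.id);
});
demoBus.subscribe("order.created", () => {
  throw new Error("always fails");
});
demoBus.publish({ id: "evt-1", type: "order.created", payload: { totalCents: 4999 } });
```

The publisher never learns which subscribers exist or whether they failed, which is the decoupling the list describes; failures surface in the dead-letter list instead of in the caller.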
Technology Stack Selection
After evaluating multiple technology options, we selected the following stack based on team expertise, managed service maturity, and cost efficiency:
| Layer | Technology | Rationale |
|---|---|---|
| Runtime | Node.js (TypeScript) | Team familiarity, async I/O for I/O-bound workloads, rich ecosystem |
| API Gateway | Amazon API Gateway | Managed service, built-in throttling and security, easy integration with Lambda |
| Compute | AWS Lambda + Fargate | Lambda for event-driven tasks, Fargate for long-running processes |
| Data Stores | Aurora PostgreSQL, DynamoDB, ElastiCache Redis | Polyglot persistence — each service uses the appropriate storage for its access patterns |
| Event Bus | Amazon EventBridge + SQS | Managed event routing, decoupling, and queuing |
| Observability | Datadog + X-Ray + CloudWatch | Comprehensive monitoring, distributed tracing, and alerting |
| CI/CD | GitHub Actions + AWS CodeDeploy | Automated testing and progressive deployments |
| Infrastructure | AWS CDK (TypeScript) | Infrastructure-as-code with version control and peer review |
Migration Phasing Strategy
We divided the 14-month migration into four phases, each delivering incremental business value:
| Phase | Duration | Focus | Success Criteria |
|---|---|---|---|
| Phase 1: Foundation | Months 1–3 | Infrastructure setup, CI/CD pipeline, monolith strangler proxy | 90% automated test coverage, zero production incidents |
| Phase 2: High-Impact Services | Months 4–8 | Product catalog, cart, checkout core flows | 50% traffic to new services, page load times reduced 40% |
| Phase 3: Supporting Services | Months 9–12 | User management, recommendations, search | Full checkout flow decoupled from monolith |
| Phase 4: Decommission | Months 13–14 | Monolith shutdown, legacy cleanup | All traffic on new platform, monolith decommissioned |
Implementation
Phase 1: Building the Foundation (Months 1–3)
Infrastructure as Code with AWS CDK: We built a complete cloud foundation using AWS CDK in TypeScript. Environment consistency, version control, and the ability to rebuild entire environments in under 2 hours were critical early wins.
Observability Stack: Before deploying services, we instrumented everything — custom DogStatsD metrics for business KPIs, X-Ray integration for end-to-end distributed tracing, centralized structured logging with correlation IDs, and synthetic critical-journey monitors every 5 minutes from global locations.
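As a rough illustration of structured logging with correlation IDs, the sketch below emits one JSON object per log line and reuses an incoming ID when the caller supplies one. The header name `x-correlation-id`, the field names, and the helper functions are assumptions for this example, not the project's actual schema or a Datadog API.

```typescript
// Hypothetical structured-log helper (field names are illustrative).
type LogFields = { [key: string]: string | number | boolean };

interface LogEntry {
  timestamp: string;
  level: "info" | "error";
  service: string;
  correlationId: string;
  message: string;
  fields: LogFields;
}

function makeLogger(service: string, correlationId: string, sink: (line: string) => void) {
  const write = (level: LogEntry["level"], message: string, fields: LogFields = {}) => {
    const entry: LogEntry = {
      timestamp: new Date().toISOString(),
      level,
      service,
      correlationId,
      message,
      fields,
    };
    sink(JSON.stringify(entry)); // one JSON object per line is easy to index and query
  };
  return {
    info: (message: string, fields?: LogFields) => write("info", message, fields),
    error: (message: string, fields?: LogFields) => write("error", message, fields),
  };
}

// Reuse the caller's ID when present so a request keeps one ID across services.
function correlationIdFrom(
  headers: { [name: string]: string | undefined },
  generate: () => string
): string {
  return headers["x-correlation-id"] ?? generate();
}

// Usage: capture lines in memory instead of writing to stdout.
const capturedLines: string[] = [];
const incomingId = correlationIdFrom({ "x-correlation-id": "req-123" }, () => "generated-id");
const cartLog = makeLogger("cart-service", incomingId, (line) => {
  capturedLines.push(line);
});
cartLog.info("item added", { sku: "ABC-1", quantity: 2 });
```

Filtering logs by `correlationId` then reconstructs a single request's path across every service that handled it.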
The Strangler Proxy: We deployed Amazon API Gateway as the single entry point, routing requests either to the legacy monolith or new microservices by URL path. This enabled gradual migration without DNS changes or client modifications.
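The routing decision behind a strangler proxy can be sketched as a longest-prefix match with the monolith as the default. This is a plain TypeScript model of the idea rather than API Gateway configuration, and the paths and service names are hypothetical.

```typescript
type Backend = "monolith" | "catalog-service" | "cart-service";

interface RouteRule {
  pathPrefix: string;
  backend: Backend;
}

// Hypothetical routing table; these paths are examples, not ShopFlow's real URLs.
const migratedRoutes: RouteRule[] = [
  { pathPrefix: "/api/products", backend: "catalog-service" },
  { pathPrefix: "/api/cart", backend: "cart-service" },
];

// The longest matching prefix wins; anything not yet migrated falls through to the monolith.
function resolveBackend(path: string, rules: RouteRule[]): Backend {
  let best: RouteRule | undefined;
  for (const rule of rules) {
    const matches = path === rule.pathPrefix || path.startsWith(rule.pathPrefix + "/");
    if (matches && (best === undefined || rule.pathPrefix.length > best.pathPrefix.length)) {
      best = rule;
    }
  }
  return best ? best.backend : "monolith";
}

const productBackend = resolveBackend("/api/products/42", migratedRoutes);
const ordersBackend = resolveBackend("/api/orders/7", migratedRoutes);
```

Migrating a capability then amounts to adding one rule, and rolling it back amounts to removing that rule, with no client-side change in either direction.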
Phase 2: High-Impact Service Migration (Months 4–8)
Service 1 — Product Catalog (6 weeks): Read-heavy (80/20), self-contained, and high business value made this the natural starting point. Using AWS DMS for CDC replication from MySQL to DynamoDB, and a gradual 1% to 100% traffic shift, we reduced page load time from 2.4s to 680ms (72% improvement) and cut compute costs by 60%.
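The source does not say how the 1% to 100% shift was implemented, so the following is only one possible sketch: deterministic bucketing by user ID, so that a given user stays on the same side of the split as the percentage grows. The function names and the choice of an FNV-1a style hash are assumptions.

```typescript
// FNV-1a style 32-bit hash over UTF-16 code units (deterministic, not cryptographic).
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep the value as an unsigned 32-bit integer
  }
  return hash;
}

// Map each user to a stable bucket in the range 0..99.
function bucketFor(userId: string): number {
  return fnv1a(userId) % 100;
}

// A user is routed to the new service once the rollout percentage exceeds their bucket.
function routeToNewService(userId: string, rolloutPercent: number): boolean {
  return bucketFor(userId) < rolloutPercent;
}

// Usage: find the first rollout stage at which a given user is switched over.
const rolloutStages = [1, 5, 25, 50, 100];
const firstStageForUser = rolloutStages.find((stage) => routeToNewService("user-42", stage));
```

Because the bucket is derived from the user ID alone, raising the percentage only ever adds users to the new service, which keeps sessions stable during the ramp.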
Service 2 — Shopping Cart (9 weeks): Storing carts in DynamoDB with conditional writes and optimistic locking resolved race conditions on concurrent updates. Abandonment rates dropped from 70% to 44% as checkout completion improved 47%. A background abandoned-cart email campaign contributed a further 12% recovery rate.
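Optimistic locking with a conditional write can be sketched without a database: each cart carries a version, and a write succeeds only if the stored version still matches the one the writer read. In DynamoDB the same guard would be a condition expression on the write; the in-memory `CartStore` below is a hypothetical stand-in, not the project's code.

```typescript
interface CartRecord {
  cartId: string;
  version: number;
  items: { [sku: string]: number };
}

// In-memory stand-in for a table that supports conditional writes.
class CartStore {
  private rows = new Map<string, CartRecord>();

  // Returns a copy so callers cannot mutate stored state directly.
  get(cartId: string): CartRecord {
    const row = this.rows.get(cartId);
    return row
      ? { cartId: row.cartId, version: row.version, items: { ...row.items } }
      : { cartId, version: 0, items: {} };
  }

  // Succeeds only if the stored version still equals the version the caller read.
  putIfVersionMatches(next: CartRecord, expectedVersion: number): boolean {
    const currentVersion = this.rows.get(next.cartId)?.version ?? 0;
    if (currentVersion !== expectedVersion) return false;
    this.rows.set(next.cartId, {
      cartId: next.cartId,
      items: { ...next.items },
      version: expectedVersion + 1,
    });
    return true;
  }
}

function addItem(store: CartStore, cartId: string, sku: string, quantity: number, maxRetries = 3): boolean {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const cart = store.get(cartId);
    cart.items[sku] = (cart.items[sku] ?? 0) + quantity;
    if (store.putIfVersionMatches(cart, cart.version)) return true;
    // Another writer won the race: re-read and try again.
  }
  return false;
}

// Usage: a write based on a stale read is rejected instead of overwriting newer data.
const cartStore = new CartStore();
addItem(cartStore, "cart-1", "SKU-A", 1);
const staleRead = cartStore.get("cart-1");
addItem(cartStore, "cart-1", "SKU-B", 2);
staleRead.items["SKU-A"] = 99;
const staleWriteAccepted = cartStore.putIfVersionMatches(staleRead, staleRead.version);
```

The rejected write is the point: without the version check, the stale copy would silently erase `SKU-B` from the cart.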
Service 3 — Checkout Core (16 weeks): The most complex migration. We implemented the saga pattern with compensating transactions for the distributed checkout workflow — reserve inventory, process payment, create order — with idempotency keys on all payment requests to prevent double charges. During staged rollout, a race condition in cart-level locking was discovered and resolved before reaching production, reinforcing the value of thorough load testing.
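A minimal, synchronous sketch of the saga idea follows: run steps in order and, if one fails, run the compensations of the completed steps in reverse. `runSaga`, `SagaStep`, and `PaymentStub` are hypothetical names; a production saga would be asynchronous, persist its state, and call real services.

```typescript
interface SagaStep {
  label: string;
  run: () => void;        // throws on failure
  compensate: () => void; // undoes a completed run
}

interface SagaResult {
  ok: boolean;
  completed: string[];
  compensated: string[];
  failedStep?: string;
}

function runSaga(steps: SagaStep[]): SagaResult {
  const done: SagaStep[] = [];
  for (const step of steps) {
    try {
      step.run();
      done.push(step);
    } catch {
      const compensated: string[] = [];
      // Undo completed steps in reverse order.
      for (const finished of done.slice().reverse()) {
        finished.compensate();
        compensated.push(finished.label);
      }
      return { ok: false, completed: done.map((s) => s.label), compensated, failedStep: step.label };
    }
  }
  return { ok: true, completed: done.map((s) => s.label), compensated: [] };
}

// Hypothetical payment stub that deduplicates on an idempotency key.
class PaymentStub {
  private charges = new Map<string, number>();

  charge(idempotencyKey: string, amountCents: number): void {
    if (this.charges.has(idempotencyKey)) return; // retry of an already processed request
    this.charges.set(idempotencyKey, amountCents);
  }

  refund(idempotencyKey: string): void {
    this.charges.delete(idempotencyKey);
  }

  totalCharged(): number {
    let total = 0;
    this.charges.forEach((amount) => {
      total += amount;
    });
    return total;
  }
}

// Usage: order creation fails, so the payment is refunded and the stock is released.
let reservedStock = 0;
const payments = new PaymentStub();
const checkoutSteps: SagaStep[] = [
  { label: "reserve-inventory", run: () => { reservedStock += 1; }, compensate: () => { reservedStock -= 1; } },
  { label: "process-payment", run: () => payments.charge("order-1001", 4999), compensate: () => payments.refund("order-1001") },
  { label: "create-order", run: () => { throw new Error("order service unavailable"); }, compensate: () => {} },
];
const sagaOutcome = runSaga(checkoutSteps);
```

The idempotency key matters independently of the saga: if a timeout causes the payment step to be retried, the second `charge` call with the same key is a no-op rather than a double charge.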
Phase 3: Supporting Services (Months 9–12)
User Management: Replaced the bespoke auth system with Amazon Cognito. A one-time script migrated 850K accounts, JWTs were issued with a 15-minute expiry, MFA became available without code changes, and auth page load times dropped from 800ms to 150ms.
Search: Replaced MySQL full-text search with Amazon OpenSearch. A full reindex of 2.8M products completed in 6 hours via parallel bulk indexing. With fuzzy matching and edge-ngram autocomplete in place, query latency fell 90%, from 1.2s to 120ms.
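To show why edge n-grams make autocomplete cheap at query time, here is a small sketch of the tokenization idea in plain TypeScript. It mimics what an edge n-gram analyzer produces at index time but is not OpenSearch configuration; the gram sizes and product names are made up for the example.

```typescript
// Produces the prefixes an edge n-gram tokenizer would index for one term.
function edgeNgrams(term: string, minGram: number, maxGram: number): string[] {
  const normalized = term.toLowerCase();
  const grams: string[] = [];
  const upper = Math.min(maxGram, normalized.length);
  for (let size = minGram; size <= upper; size++) {
    grams.push(normalized.slice(0, size));
  }
  return grams;
}

// Tiny in-memory index: prefix -> product names containing a word with that prefix.
function buildAutocompleteIndex(products: string[], minGram = 2, maxGram = 10): Map<string, Set<string>> {
  const index = new Map<string, Set<string>>();
  for (const product of products) {
    for (const word of product.split(/\s+/)) {
      for (const gram of edgeNgrams(word, minGram, maxGram)) {
        const bucket = index.get(gram) ?? new Set<string>();
        bucket.add(product);
        index.set(gram, bucket);
      }
    }
  }
  return index;
}

// A lookup is a single exact-match read, since the prefixes were precomputed.
function suggest(index: Map<string, Set<string>>, typed: string): string[] {
  const matches = index.get(typed.toLowerCase());
  return matches ? Array.from(matches).sort() : [];
}

const autocompleteIndex = buildAutocompleteIndex(["Running Shoes", "Shoulder Bag", "Red Shirt"]);
const shoSuggestions = suggest(autocompleteIndex, "sho"); // ["Running Shoes", "Shoulder Bag"]
```

The trade-off is a larger index in exchange for prefix queries that become exact term lookups.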
Phase 4: Monolith Decommission (Months 13–14)
Final cutover was executed on 14 February 2026: we synchronized data, gradually shifted traffic to the new platform from 1% to 100% over 90 minutes, then completed the DNS cutover. P99 latency improved from 2.1s to 580ms. After 30 days with zero incidents, the monolith was fully decommissioned: all EC2 instances and the RDS database were retired, and their CDK-managed infrastructure was destroyed.
Results
Quantitative Performance Improvements
| Metric | Before | After | Improvement |
|---|---|---|---|
| Page load time (P99) | 3,200ms | 580ms | Down 82% |
| Checkout completion time | 4.5 minutes | 1.2 minutes | Down 73% |
| Concurrent users supported | 3,000 | 35,000 | Up 1,067% |
| Availability (uptime) | 99.2% | 99.97% | Up 0.77 percentage points |
| Deployment frequency | Quarterly | Multiple daily | Order-of-magnitude increase |
| Lead time for changes | 2 weeks | 3 hours | Down 99% |
| Infrastructure cost/month | $70,500 | $36,200 | Down 49% |
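
The improvement column follows from a simple percent-change calculation, (after − before) / before × 100. The snippet below recomputes four of the rows from the before and after values in the table; the function and variable names are only illustrative.

```typescript
// Percent change relative to the starting value; negative means a reduction.
function percentChange(before: number, after: number): number {
  return ((after - before) / before) * 100;
}

const reportedMetrics = [
  { metric: "Page load time P99 (ms)", before: 3200, after: 580 },
  { metric: "Checkout completion time (min)", before: 4.5, after: 1.2 },
  { metric: "Concurrent users supported", before: 3000, after: 35000 },
  { metric: "Infrastructure cost per month (USD)", before: 70500, after: 36200 },
];

const roundedChanges = reportedMetrics.map((row) => ({
  metric: row.metric,
  changePercent: Math.round(percentChange(row.before, row.after)),
}));
// roundedChanges holds -82, -73, 1067 and -49, matching the table.
```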
Business Impact
Revenue and Conversion: Cart abandonment rate fell from 70% to 44% during peak traffic. Site-wide conversion improved 15%, with 22% gains on mobile. Flash Friday 2026 handled 28,000 concurrent users with zero incidents, more than nine times the 3,000-user load at which the old platform buckled. Estimated additional annual revenue: approximately $1.8M from reduced abandonment and improved conversion.
Team Productivity: Deployment frequency rose from four releases per year to 45 per month across all services. Developer focus time on features increased from 40% to 75%. Incident response improved from a four-hour average to 22 minutes thanks to automated proactive alerts.
Metrics
Technical SLIs and SLOs
We established three tiers of metrics monitored continuously:
Tier 1 — Business-facing SLIs (24/7): Checkout success rate (target 99.95%), P95 checkout latency under 5 seconds, order throughput of 500/minute during peak.
Tier 2 — Per-service SLOs: Product Catalog error rate below 0.1%, P99 latency under 200ms, 10,000 RPM capacity. Cart Service error rate below 0.2%, P99 under 300ms, 5,000 RPM. Checkout Core error rate below 0.5%, P99 under 2s, 1,000 RPM. Auth Service error rate below 0.1%, P99 under 150ms, 15,000 RPM.
Tier 3 — Operational metrics: Infrastructure cost per order kept under $0.50, Lambda cold start latency under one second for functions above 1GB memory, database connection pool utilization maintained below 70%.
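Per-service SLOs like those in Tier 2 lend themselves to being expressed as data and checked mechanically. The sketch below encodes the error-rate and P99 targets from the text and flags breaches for a window of observations; the service identifiers, the `ObservedWindow` shape, and the sample numbers are assumptions, and in practice the observations would come from the monitoring system.

```typescript
interface ServiceSlo {
  service: string;
  maxErrorRate: number; // fraction, e.g. 0.001 means 0.1%
  maxP99Ms: number;
}

interface ObservedWindow {
  service: string;
  requests: number;
  errors: number;
  p99Ms: number;
}

// Error-rate and latency targets taken from the Tier 2 list.
const tier2Slos: ServiceSlo[] = [
  { service: "product-catalog", maxErrorRate: 0.001, maxP99Ms: 200 },
  { service: "cart", maxErrorRate: 0.002, maxP99Ms: 300 },
  { service: "checkout-core", maxErrorRate: 0.005, maxP99Ms: 2000 },
  { service: "auth", maxErrorRate: 0.001, maxP99Ms: 150 },
];

function findBreaches(slos: ServiceSlo[], observed: ObservedWindow[]): string[] {
  const breaches: string[] = [];
  for (const sample of observed) {
    const slo = slos.find((s) => s.service === sample.service);
    if (!slo) continue; // no SLO defined for this service
    const errorRate = sample.requests === 0 ? 0 : sample.errors / sample.requests;
    if (errorRate > slo.maxErrorRate) breaches.push(`${sample.service}: error rate`);
    if (sample.p99Ms > slo.maxP99Ms) breaches.push(`${sample.service}: p99 latency`);
  }
  return breaches;
}

// Usage with made-up observations for one evaluation window.
const observedWindows: ObservedWindow[] = [
  { service: "product-catalog", requests: 10000, errors: 5, p99Ms: 180 },
  { service: "cart", requests: 5000, errors: 15, p99Ms: 250 },
  { service: "checkout-core", requests: 1000, errors: 2, p99Ms: 2400 },
];
const sloBreaches = findBreaches(tier2Slos, observedWindows);
```

Keeping the targets in one data structure means alert rules and dashboards can be generated from the same source, so the documented SLOs and the enforced ones cannot drift apart.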
Real User Monitoring: Datadog RUM was instrumented in the frontend to measure actual user experiences. Core Web Vitals (LCP, FID, CLS) were tracked per page. User journey funnels revealed conversion drop-off at each step. Geographic performance monitoring flagged regional issues.
Lessons Learned
Technical Lessons
1. Start with observability before changes: The four weeks spent setting up comprehensive monitoring before writing migration code paid for themselves many times over. Having baseline metrics for the monolith allowed accurate measurement of improvement and immediate detection of regressions after each service cutover.
2. Data migration is harder than code: We underestimated data-migration complexity by a factor of 3. The key lesson is to treat data migration as its own workstream with dedicated expertise. Use CDC wherever possible, build idempotent migration scripts, and always maintain a clear rollback plan.
3. API versioning is non-negotiable: Initial service versions lacked backward compatibility, forcing dependent teams into rushed, high-risk updates. From day one: version all APIs, use semantic versioning, and maintain at least two versions during the migration window.
4. Infrastructure drift detection is non-optional: With dozens of services across environments, infrastructure drift became a silent problem. Integrating AWS Config rules and CDK drift detection into CI/CD caught drift early, before it caused incidents.
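
As an illustration of what "idempotent migration scripts" in lesson 2 can mean in practice, the sketch below combines keyed upserts with a checkpoint, so a run can be interrupted, resumed, or repeated without creating duplicates. The row shape and function names are hypothetical, and an in-memory map stands in for the target store.

```typescript
interface LegacyProductRow {
  id: number;
  title: string;
  priceCents: number;
}

interface MigrationState {
  target: Map<number, LegacyProductRow>; // keyed writes: a re-run overwrites, never duplicates
  lastMigratedId: number;                // checkpoint so an interrupted run can resume
}

// Copies the next batch of rows beyond the checkpoint and returns how many were copied.
function migrateBatch(source: LegacyProductRow[], state: MigrationState, batchSize: number): number {
  const pending = source
    .filter((row) => row.id > state.lastMigratedId)
    .sort((a, b) => a.id - b.id)
    .slice(0, batchSize);
  for (const row of pending) {
    state.target.set(row.id, { ...row }); // upsert by primary key
    state.lastMigratedId = row.id;
  }
  return pending.length;
}

function migrateAll(source: LegacyProductRow[], state: MigrationState, batchSize: number): void {
  while (migrateBatch(source, state, batchSize) > 0) {
    // Continue until no rows remain beyond the checkpoint.
  }
}

// Usage: running the migration twice leaves the target unchanged the second time.
const legacyRows: LegacyProductRow[] = [3, 1, 2, 5, 4].map((id) => ({
  id,
  title: `Product ${id}`,
  priceCents: id * 100,
}));
const migrationState: MigrationState = { target: new Map(), lastMigratedId: 0 };
migrateAll(legacyRows, migrationState, 2);
migrateAll(legacyRows, migrationState, 2);
```

Even if the checkpoint is lost and the run restarts from zero, the keyed upsert keeps the result identical, which is what makes a rollback-and-retry plan safe.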
Organisational and Process Lessons
5. A dedicated platform team is critical: Initially, individual product teams owned their own infrastructure, which led to inconsistent practices and tooling sprawl. A four-engineer Platform Engineering team that standardized CI/CD pipelines, security frameworks, observability tooling, and internal developer SDKs freed product teams to focus on actual business logic.
6. Contract testing prevents integration nightmares: Before adopting Pact for consumer-driven contract testing, integration issues between independently deployed services caused regular incidents in staging. With contract tests in place, cross-team coordination dropped from three days per release to just two hours.
7. Gradual traffic shifting beats feature flags: Early attempts used feature flags for service routing, which created configuration sprawl and increased deployment risk. Transitioning to path-based API Gateway routing provided cleaner separation and simpler rollback mechanisms.
8. Externalise business logic early: Domain logic remained embedded in the monolith longer than anticipated, creating tangled, hidden dependencies. Future migrations should extract shared domain libraries earlier and make them available to both the monolith and new services during the transition window.
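
The idea behind the consumer-driven contract testing mentioned in lesson 6 can be shown with a hand-rolled check. This is deliberately not Pact's API; it is a hypothetical, simplified model in which a consumer declares the fields it relies on and the provider's response is verified against that declaration before release.

```typescript
type FieldType = "string" | "number" | "boolean";

interface ConsumerContract {
  consumer: string;
  provider: string;
  endpoint: string;
  requiredFields: { [field: string]: FieldType };
}

// Returns one violation per field the consumer needs but the provider no longer supplies correctly.
function verifyContract(
  contract: ConsumerContract,
  providerResponse: { [field: string]: unknown }
): string[] {
  const violations: string[] = [];
  for (const field of Object.keys(contract.requiredFields)) {
    const expected = contract.requiredFields[field];
    const actual = typeof providerResponse[field];
    if (actual !== expected) {
      violations.push(`${contract.endpoint}: field "${field}" expected ${expected}, got ${actual}`);
    }
  }
  return violations;
}

// Hypothetical contract: the cart service depends on two fields of the catalog response.
const cartCatalogContract: ConsumerContract = {
  consumer: "cart-service",
  provider: "catalog-service",
  endpoint: "GET /products/{id}",
  requiredFields: { id: "string", priceCents: "number" },
};

// Extra fields are fine; renaming or retyping a required field is a breaking change.
const compatibleViolations = verifyContract(cartCatalogContract, { id: "p1", priceCents: 1299, title: "Mug" });
const breakingViolations = verifyContract(cartCatalogContract, { id: "p1", price: "12.99" });
```

Running such checks in the provider's pipeline surfaces a breaking change before deployment rather than in a shared environment.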
Conclusion
Fourteen months after project initiation, ShopFlow processes $10M in monthly revenue through a platform built specifically for the task. The engineering team now focuses on product innovation rather than infrastructure firefighting. Flash Friday 2026 handled 28,000 concurrent users with zero incidents, more than nine times the load that used to bring the old platform down.
Most importantly, the company entered its next growth phase with confidence that their technical foundation would support, rather than constrain, their ambitions. For teams considering their own cloud-native transformation, the key messages are clear: start with observability, choose your architectural patterns deliberately, and invest in platform engineering as a force multiplier for every engineer on the team.
About the author: The Webskyne editorial team focuses on in-depth technical case studies, architecture insights, and engineering leadership perspectives. We believe real-world lessons from successful transformations are invaluable for teams navigating their own cloud journeys.
Tags: #cloud-native #microservices #AWS-migration #ecommerce #scalability #DevOps #legacy-modernization #performance-engineering
Category: Case Study
