
6 June 2026 · 12 min read

Scaling to Millions: How We Migrated a Legacy E-commerce Platform to Modern Cloud Architecture

When a mid-sized e-commerce retailer approached us with their decade-old monolithic platform struggling under peak traffic, we faced a critical choice: incremental improvements or complete rearchitecture. This case study details our strategic migration from legacy PHP/MySQL to a microservices-based cloud architecture on AWS, reducing page load times from 8-12 seconds to under 200ms and enabling seamless scaling to handle Black Friday traffic spikes exceeding 50,000 concurrent users. From database sharding to event-driven architecture, we'll walk through the technical decisions, implementation challenges, and lessons learned during this 8-month transformation journey. Our phased approach using the Strangler Fig pattern allowed continuous business operations while we systematically replaced each component. The results speak for themselves: 87% improvement in conversion rates, 98% reduction in response times, and a platform that now handles 6x more traffic with complete reliability. This comprehensive transformation demonstrates how thoughtful architecture decisions can rescue a business from technical debt and position it for sustainable growth.

Case Study, Cloud Migration, Microservices, AWS, Performance Optimization, E-commerce, Database Sharding, Node.js, DevOps
# Scaling to Millions: How We Migrated a Legacy E-commerce Platform to Modern Cloud Architecture

## Overview

In early 2024, ShopMart, a regional e-commerce retailer with annual revenue of $45M, found itself at a digital crossroads. Its legacy platform, built on a monolithic PHP 5.6 codebase with MySQL 5.7, was buckling under the weight of increasing traffic and modern customer expectations. The holiday season had become a recurring nightmare: site outages, abandoned carts, and frustrated customers. What started as a simple performance optimization request evolved into a comprehensive platform modernization initiative that would redefine the company's technical infrastructure and business trajectory.

This case study explores how we transformed ShopMart's monolithic architecture into a scalable, cloud-native microservices ecosystem, delivering a 98% improvement in page load times and establishing a foundation for sustainable growth.

![Cloud architecture transformation diagram showing migration from legacy monolith to modern microservices](https://images.unsplash.com/photo-1551650975-87deedd944c3?auto=format&fit=crop&w=1200&q=80)

## Challenge

ShopMart's challenges were typical yet severe. The company had experienced steady growth over eight years, expanding from a small online boutique to a regional powerhouse. However, its technical infrastructure had not evolved at the same pace, creating a dangerous gap between business aspirations and technical reality.

**Performance Degradation**: Homepage load times averaged 8-12 seconds during peak hours, with product pages taking even longer. Conversion rates plummeted to 1.2% during high-traffic periods, well below industry benchmarks of 2.5-3.5%. The performance issues were particularly pronounced on mobile devices, where the monolithic frontend struggled to adapt to varying screen sizes and network conditions.
Customers routinely abandoned their carts after waiting too long for the checkout page to load, representing millions in lost revenue annually.

**Scalability Bottlenecks**: The monolith couldn't scale horizontally. Adding more servers yielded diminishing returns due to session affinity requirements and database connection pool exhaustion. The single database instance had reached its maximum capacity for concurrent connections, causing request timeouts during flash sales and promotional events. Vertical scaling had reached physical limits: CPU utilization consistently exceeded 90% during business hours, and memory pressure was causing frequent garbage collection pauses.

**Maintenance Complexity**: Technical debt had accumulated over eight years. The original developers had moved on, leaving behind undocumented code and tribal knowledge that was slowly being lost. New features took months to implement because every change required understanding the intricate web of dependencies throughout the codebase. The lack of automated tests meant that any modification risked breaking existing functionality, creating a culture of fear around deployments.

**Business Impact**: During Black Friday 2023, the site experienced 4.5 hours of downtime, resulting in an estimated $2.3M in lost revenue. Customer service complaints increased by 340%, and mobile conversion rates were effectively zero. The marketing team had invested heavily in driving traffic to the site, only to watch those efforts translate into error pages and abandoned sessions. The CEO demanded answers, and the engineering team knew that incremental fixes would not be sufficient.

The client's initial request was for a caching layer to improve performance. However, our assessment revealed that the fundamental architecture itself was the constraint, not just its symptoms. Every hour spent optimizing the monolith would be technical debt that needed repayment when the inevitable rewrite occurred.

## Goals

Our objectives were clear and measurable, backed by concrete business requirements:

1. **Performance**: Reduce average page load time to under 500ms for 95% of requests, with 99th percentile under 1 second
2. **Scalability**: Handle 50,000+ concurrent users without degradation, with automatic scaling response within 30 seconds
3. **Reliability**: Achieve 99.95% uptime with automated failover capabilities and a recovery time objective under 5 minutes
4. **Developer Experience**: Reduce new feature deployment time from weeks to days, with rollback capability within 10 minutes
5. **Cost Efficiency**: Optimize infrastructure costs while improving performance, maintaining or reducing total cost of ownership

Beyond technical metrics, we needed to ensure zero-downtime migration during business hours, preserve all historical data dating back to 2015, maintain SEO rankings, maintain PCI compliance throughout the transition, and provide comprehensive training for the client's 12-person engineering team. We also committed to documenting all architectural decisions and creating runbooks for common operational scenarios.

## Approach

We adopted a phased migration strategy, recognizing that a big-bang rewrite would introduce unacceptable risk. The approach balanced technical excellence with business constraints, ensuring continuous value delivery throughout the transformation.

### Phase 1: Discovery & Planning (Weeks 1-3)

Our architectural assessment began with comprehensive monitoring. We deployed four different APM tools to trace every request through the monolith and get a holistic view, identifying the top 20 endpoints that consumed 80% of resources. Database performance analysis revealed missing indexes, inefficient queries, and a single table with 47 million rows lacking proper partitioning. The users table, orders table, and product_catalog had become unwieldy, with some queries taking over 30 seconds to complete.

Infrastructure profiling showed that the existing setup relied on three on-premises servers running Ubuntu 18.04, each hosting multiple components in a fragile configuration. Network diagrams were outdated, and configuration drift meant that production, staging, and development environments differed in subtle but critical ways.

The tech stack decisions emerged from this analysis:

- **Frontend**: React 18 with Next.js for SSR/SSG capabilities, enabling better SEO and performance
- **Backend**: Node.js microservices with TypeScript, providing better type safety and developer productivity
- **Database**: PostgreSQL with Citus for horizontal sharding, replacing the aging MySQL 5.7
- **Infrastructure**: AWS with ECS Fargate, RDS, ElastiCache, and SQS for managed services
- **CDN**: CloudFront with edge caching strategies for global performance
- **Event Bus**: Redis Streams for inter-service communication and event-driven architecture
- **Monitoring**: Prometheus for metrics collection, Grafana for visualization, and OpenTelemetry for tracing
- **Security**: HashiCorp Vault for secrets management and Let's Encrypt for TLS certificates

Each technology choice was evaluated against four criteria: maintainability, team expertise, total cost of ownership, and future-proofing potential. We conducted proof-of-concepts for each component, documenting performance baselines and operational considerations.

### Phase 2: Architecture & Pilot (Weeks 4-8)

We designed the microservices boundaries around business capabilities, following domain-driven design principles.
The goal was to create services that were neither too granular nor too coarse, striking the right balance for maintainability and performance:

- User Service (authentication, profiles, preferences): handling 2.3M user accounts and session management
- Product Service (catalog, inventory, search): managing 150K SKUs with variants and categories
- Order Service (cart, checkout, order management): processing 50K orders daily during peak periods
- Payment Service (PCI-compliant payment processing): integrating with 8 payment providers and gateways
- Recommendation Service (personalization, ML-powered): serving real-time product suggestions
- Analytics Service (event processing, reporting): handling 2M events per day for business intelligence
- Notification Service (email, SMS, push): managing customer communications at scale
- Inventory Service (stock management, warehouse integration): connecting to 3 fulfillment centers

The pilot phase focused on the Product Service, representing the highest traffic endpoint with 60% of all requests. We built a parallel service, migrated read traffic first, then gradually shifted writes while maintaining dual-write consistency. This approach allowed us to validate our architectural assumptions with real production traffic while minimizing risk.

We established a feature flag system using LaunchDarkly, enabling us to control traffic routing between old and new services with millisecond precision. Health checks and circuit breakers ensured graceful degradation if either system experienced issues. The pilot ran for two weeks with no customer-facing incidents, building confidence for broader rollout.

### Phase 3: Core Migration (Weeks 9-24)

Migration proceeded service by service, using the Strangler Fig pattern described by Martin Fowler. Each service had its own deployment pipeline, database schemas, and caching strategies.
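
The gradual traffic shift between the monolith and a new service can be sketched as a small routing function. This is a minimal illustration under stated assumptions, not the project's actual gateway: `routeRequest`, `bucketFor`, the `newServicePercent` flag, and the handler objects are hypothetical names, and a real setup would read the rollout percentage from the feature-flag service rather than from a local object.

```javascript
// Hypothetical Strangler Fig router: sends a configurable share of traffic
// to the new service and falls back to the legacy monolith on failure.

// Deterministic bucket in [0, 100), so a given user always lands on the same side.
function bucketFor(userId) {
  let hash = 2166136261; // FNV-1a 32-bit offset basis
  for (const ch of String(userId)) {
    hash ^= ch.charCodeAt(0);
    hash = Math.imul(hash, 16777619) >>> 0; // FNV prime, kept as unsigned 32-bit
  }
  return hash % 100;
}

async function routeRequest(request, flags, handlers) {
  const useNew = bucketFor(request.userId) < flags.newServicePercent;
  if (!useNew) {
    return { servedBy: 'legacy', body: await handlers.legacy(request) };
  }
  try {
    return { servedBy: 'new', body: await handlers.modern(request) };
  } catch (err) {
    // Graceful degradation: a failure in the new service should not reach the customer.
    return { servedBy: 'legacy-fallback', body: await handlers.legacy(request) };
  }
}
```

Hashing the user ID instead of picking a random number keeps each customer on one side of the split for the whole rollout, which makes incidents easier to trace back to the service that handled them.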

We implemented a service mesh using AWS App Mesh for traffic routing, retries, and circuit breaking, abstracting networking concerns from application code.

Database sharding was particularly challenging. Historical order data dating back to 2015 needed to be migrated to the new sharded PostgreSQL cluster. The migration involved 2.3TB of data across 57 tables. We developed custom migration scripts, combining logical replication with our own tooling, to move the dataset while maintaining referential integrity across shards.

Shard allocation was based on a hash of customer_id, ensuring even distribution while keeping customer-specific queries efficient. Each shard was configured with read replicas for analytics queries, preventing reporting workloads from impacting transactional performance. The sharding gateway abstracted this complexity from application services, routing queries to appropriate shards based on request context.

We implemented a comprehensive data consistency model using the Saga pattern for distributed transactions. Order creation, for example, involved multiple services: creating an order record, reserving inventory, processing payment, and sending notifications. If any step failed, compensating transactions would roll back the previous steps, ensuring data consistency across service boundaries.

Cross-shard analytics presented another challenge. Business intelligence queries often required aggregating data across all customers, which meant querying all 16 shards and combining results. We developed a federated query engine that could parallelize these queries and merge results efficiently, reducing analytics query times from minutes to seconds.
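
The compensating-transaction flow described above for order creation can be sketched as a list of steps, each paired with an undo action. This is an illustrative orchestration only: `runSaga`, `buildOrderSaga`, and the step names are hypothetical, the steps are in-memory stand-ins rather than real service calls, the notification step is left out for brevity, and the production system may well have coordinated the saga through events instead of a central orchestrator.

```javascript
// Hypothetical saga orchestrator: runs steps in order and, if one fails,
// runs the compensations of the already-completed steps in reverse order.
// Failures inside a compensation are not handled in this sketch.
async function runSaga(steps, context) {
  const completed = [];
  for (const step of steps) {
    try {
      await step.action(context);
      completed.push(step);
    } catch (error) {
      for (const done of completed.reverse()) {
        await done.compensate(context);
      }
      return { ok: false, failedStep: step.name, error: error.message };
    }
  }
  return { ok: true };
}

// Stand-in steps that only record what happened; real steps would call services.
function buildOrderSaga(log, { failPayment = false } = {}) {
  return [
    {
      name: 'createOrder',
      action: async () => { log.push('order created'); },
      compensate: async () => { log.push('order cancelled'); },
    },
    {
      name: 'reserveInventory',
      action: async () => { log.push('inventory reserved'); },
      compensate: async () => { log.push('inventory released'); },
    },
    {
      name: 'processPayment',
      action: async () => {
        if (failPayment) throw new Error('card declined');
        log.push('payment captured');
      },
      compensate: async () => { log.push('payment refunded'); },
    },
  ];
}
```

Calling `runSaga(buildOrderSaga(log, { failPayment: true }), {})` leaves the log with the reservation released and the order cancelled, in that order, because compensations run newest-first.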

### Phase 4: Optimization & Go-Live (Weeks 25-32)

Final preparations included implementing comprehensive observability with Prometheus and Grafana, setting up automated scaling policies, and conducting load testing with 100,000 concurrent users using k6. We also migrated the admin dashboard, search functionality, and recommendation engine, completing the full platform transformation.

Performance optimization focused on three areas: database query optimization, caching strategy refinement, and frontend bundle reduction. We implemented lazy loading for React components, reducing initial bundle size by 60% and improving Time to Interactive metrics. Database connection pooling was tuned based on production load patterns, and Redis cache warming scripts minimized cold cache penalties during cache refreshes.

Load testing revealed several bottlenecks we hadn't anticipated. Redis Streams message processing became a bottleneck under extreme load, requiring us to implement batching for high-volume events. We also discovered that our initial auto-scaling policies were too conservative, leading to brief capacity issues during rapid traffic spikes.

Security hardening completed our checklist, with penetration testing validating our implementation and security scanning integrated into the CI/CD pipeline. The transition to the new platform required coordination with the client's PCI compliance auditors, who reviewed our updated architecture documentation and security controls. All changes were implemented without disrupting the existing compliance posture.

## Implementation

Here are the key technical decisions that defined our success:

### Database Sharding Strategy

We implemented horizontal sharding based on a hash of customer_id, distributing customers across 16 shards. With 2.3M user accounts in total, each shard held roughly 144,000 customers, providing good query performance while enabling independent scaling.
The choice of 16 shards was based on extensive load testing, finding the sweet spot between query efficiency and operational complexity.

```sql
-- Shard distribution logic: each shard owns a contiguous range of hash values
CREATE TABLE customer_shards (
  shard_id   SMALLINT PRIMARY KEY,
  hash_range INT4RANGE NOT NULL
);

-- Find the shard whose range contains the hash of the customer ID ($1).
-- hashtext() is used here as a stand-in for the gateway's hash function.
SELECT shard_id
FROM customer_shards
WHERE hash_range @> hashtext($1::text);

-- Cross-shard analytics query, run once per shard in parallel;
-- the gateway substitutes %s with the shard number and merges the results
SELECT shard_id, COUNT(*), SUM(total_amount)
FROM order_summary_current_date_shard_%s
GROUP BY shard_id;
```

The sharding gateway kept this complexity out of application services. Its routing logic considered both the data being accessed and the user making the request, ensuring good performance for both customer-facing and admin queries.

### Event-Driven Architecture

Using Redis Streams as our event bus, we decoupled services and enabled asynchronous processing. Order events triggered inventory updates, recommendation model retraining, and analytics pipeline updates without creating tight coupling. The choice of Redis Streams over alternatives like Kafka was driven by operational simplicity and integration with our existing caching infrastructure.

```javascript
// Assumes the ioredis client; connection options are omitted
const Redis = require('ioredis');
const redis = new Redis();

// Event publishing pattern with idempotency
const publishOrderEvent = async (eventType, orderId, payload) => {
  const eventId = `${eventType}:${orderId}:${Date.now()}`;
  // Stream entries are flat field/value pairs, so the payload is serialized as JSON
  await redis.xadd(
    'events:orders', '*',
    'id', eventId,
    'type', eventType,
    'payload', JSON.stringify(payload),
    'timestamp', String(Date.now()),
    'source', 'order-service'
  );
  return eventId;
};

// Consumer with manual acknowledgment and dead letter queue.
// Assumes the consumer group already exists (XGROUP CREATE) and that
// processEvent is the service's own handler, defined elsewhere.
const consumeOrderEvents = async () => {
  while (true) {
    // Reply is null on timeout, else [[stream, [[entryId, [field, value, ...]], ...]]]
    const streams = await redis.xreadgroup(
      'GROUP', 'ORDER_CONSUMER_GROUP', 'worker-1',
      'COUNT', 10, 'BLOCK', 5000,
      'STREAMS', 'events:orders', '>'
    );
    if (!streams) continue;

    for (const [entryId, fields] of streams[0][1]) {
      const eventId = fields[fields.indexOf('id') + 1];
      try {
        // SADD returns 1 only the first time an id is seen, so duplicates are skipped
        if (await redis.sadd('processed-events', eventId)) {
          await processEvent(fields);
        }
      } catch (error) {
        await redis.xadd('events:dead-letter', '*', ...fields, 'error', String(error.message));
      }
      // Acknowledge in both cases so a failing message is not redelivered forever
      await redis.xack('events:orders', 'ORDER_CONSUMER_GROUP', entryId);
    }
  }
};
```

Event schema versioning ensured backward compatibility during service upgrades. We implemented a schema registry pattern where each event type maintained a version history, allowing consumers to gracefully handle events from older service versions.

### Caching Layers

Multi-tier caching significantly reduced database load:

- **Edge Cache**: CloudFront for static assets and product pages (TTL: 1 hour), with cache invalidation webhooks on inventory changes
- **Application Cache**: Redis for session data and frequently accessed objects (TTL: 24 hours), with LRU eviction for memory management
- **Database Cache**: Query result caching for complex analytics (TTL: 1 week), with materialized view refresh strategies

Cache warming scripts ensured that high-traffic pages remained hot in cache during peak periods. We monitored cache hit ratios closely, aiming for >85% for static content and >70% for dynamic content. Cache warming was particularly important for product pages, which received the majority of traffic during promotional events.
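
A read path for the application-cache tier might look like the sketch below: check the cache first, fall back to the database on a miss, and store the result with a TTL. The `Map`-based `TtlCache` stands in for Redis so the example runs on its own, and `getProduct`, `loadFromDb`, and the default TTL are illustrative assumptions rather than the project's code.

```javascript
// Minimal cache lookup sketch with per-entry TTLs and hit-ratio tracking.
class TtlCache {
  constructor(now = () => Date.now()) {
    this.now = now; // injectable clock, which makes expiry easy to test
    this.entries = new Map();
    this.hits = 0;
    this.misses = 0;
  }

  get(key) {
    const entry = this.entries.get(key);
    if (entry && entry.expiresAt > this.now()) {
      this.hits += 1;
      return entry.value;
    }
    this.entries.delete(key); // expired or absent
    this.misses += 1;
    return undefined;
  }

  set(key, value, ttlMs) {
    this.entries.set(key, { value, expiresAt: this.now() + ttlMs });
  }

  // Would be called from an event handler when the source of truth changes.
  invalidate(key) {
    this.entries.delete(key);
  }

  hitRatio() {
    const total = this.hits + this.misses;
    return total === 0 ? 0 : this.hits / total;
  }
}

async function getProduct(cache, productId, loadFromDb, ttlMs = 60000) {
  const key = `product:${productId}`;
  const cached = cache.get(key);
  if (cached !== undefined) return cached;
  const fresh = await loadFromDb(productId); // miss: go to the source of truth
  cache.set(key, fresh, ttlMs);
  return fresh;
}
```

Passing a short `ttlMs` for volatile data such as prices and a long one for descriptions gives the tiered behaviour without changing the lookup code, and `hitRatio()` provides the number to compare against the hit-ratio targets.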

We implemented a cache-aside pattern with automatic refresh for critical data. Product pricing, inventory levels, and promotional eligibility were cached with short TTLs, while product descriptions and images used longer TTLs. Cache invalidation was handled through event-driven updates, ensuring consistency between cached data and the source of truth in the database.

### CI/CD Pipeline

GitHub Actions powered our deployment pipeline with automated testing, security scanning, and progressive rollouts. Each service maintained independent versioning while sharing common infrastructure components. The pipeline included stages for security scanning, unit testing, integration testing, performance testing, and gradual production rollout.

Deployment strategies varied by service criticality. The User Service used blue-green deployments to eliminate downtime risk, while the Product Service used canary deployments with 5% traffic routing for the first hour. All deployments were gated by automated health checks that triggered an automatic rollback if error rates exceeded thresholds.

Infrastructure-as-Code using Terraform ensured consistent environments across development, staging, and production. Database migration scripts were version-controlled and idempotent, preventing configuration drift between environments. The pipeline also included automated rollback procedures, tested monthly to ensure reliability.

## Results

The transformation delivered remarkable outcomes across all dimensions:

**Performance**: Average response time dropped from 8-12 seconds to 180ms (roughly a 98% improvement). Mobile page load times improved by 95%, driving a 23% increase in mobile conversions. Desktop conversion rates improved by 15%, with cart abandonment decreasing from 78% to 44%. The improvements were particularly pronounced for returning visitors, who benefited from cached session data and personalized recommendations.

**Scalability**: The platform successfully handled Black Friday 2024 with 52,000 peak concurrent users and zero downtime, a stark contrast to the previous year's 4.5-hour outage. Auto-scaling responded within 45 seconds to traffic spikes, adding capacity seamlessly. The platform maintained consistent performance even during flash sales that generated 10x normal traffic in minutes.

**Business Impact**: Conversion rates increased by 87%, from 1.2% to 2.25%, generating an estimated $3.8M in additional annual revenue. Cart abandonment decreased by 42% due to improved checkout performance. Mobile revenue increased by 180%, finally capturing the mobile-first audience that had been underserved. The average order value increased by 12%, likely due to improved product discovery and recommendation quality.

**Operational Excellence**: Deployment frequency increased from monthly to daily, with mean time to recovery dropping from 2.3 hours to 8 minutes. The on-call team received 60% fewer alerts, with infrastructure issues largely eliminated by managed services. Developer velocity improved significantly, with feature implementation time dropping from 3-4 weeks to 5-7 days on average.

**Cost Efficiency**: Monthly infrastructure costs decreased by 13% despite improved performance and scalability. The managed services architecture reduced operational overhead, eliminating the need for 2 full-time DevOps engineers. Reserved instances and spot pricing strategies further optimized costs, though the client chose to maintain some on-demand capacity for flexibility.

## Metrics

| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| Avg Response Time | 9.2s | 180ms | 98.0% |
| Peak Concurrent Users | 8,500 | 52,000 | 512% |
| Conversion Rate | 1.2% | 2.25% | 87.5% |
| Mobile Conversion | 0.1% | 1.8% | 1700% |
| Monthly Infrastructure Cost | $8,200 | $7,100 | 13.4% reduction |
| Deployment Frequency | 1x/month | 22x/month | 22x |
| MTTR | 2.3 hours | 8 minutes | 94.2% |
| Uptime | 98.7% | 99.97% | +1.27 percentage points |
| Error Rate | 3.4% | 0.2% | 94.1% |
| Cache Hit Ratio | 45% | 87% | 93.3% |

Performance breakdown by endpoint type showed consistent improvements:

- Homepage: 12.1s → 150ms (98.8% improvement)
- Product pages: 15.3s → 195ms (98.7% improvement)
- Checkout: 8.7s → 210ms (97.6% improvement)
- Search: 6.4s → 280ms (95.6% improvement)
- Admin dashboard: 18.2s → 320ms (98.2% improvement)

API response times showed dramatic improvement across all endpoints, with p95 latency consistently under 300ms. Database query performance improved by an average of 12x, with complex analytics queries running in seconds rather than minutes. The most dramatic improvement was in the order history endpoint, which went from 23 seconds to 140ms by leveraging cached aggregated data.

## Lessons

### Technical Lessons

1. **Start with Monitoring**: The monolith's performance issues weren't where we initially expected. Comprehensive observability using multiple APM tools revealed that only 5% of code paths caused 80% of the problems. Without proper data, we would have optimized the wrong components.
2. **Plan for Data Consistency**: Dual-write patterns during migration are deceptively complex. Implement distributed transactions or accept eventual consistency from the start; don't discover this mid-migration. We underestimated the complexity of maintaining consistency between old and new systems, leading to a two-week delay in the second phase.
3. **Microservice Boundaries Matter**: Services that are too granular create network overhead; services that are too coarse recreate monolith problems. Domain-driven design principles helped establish the right boundaries. We initially split the Order Service into three separate services, then merged them back after realizing the overhead wasn't worth the separation.
4. **Database Migration Complexity**: Migration isn't just copying data; it's preserving relationships, constraints, and consistency. Our custom migration tooling took longer than expected to get right. We learned to allocate 30% more time for database migration than initially estimated.
5. **Security Integration**: Security can't be an afterthought in microservices architecture. We integrated security scanning and compliance checks early, avoiding last-minute surprises. PCI compliance drove architectural decisions around payment data handling that proved beneficial for overall security posture.

### Business Lessons

1. **Communicate Progress Visibly**: Weekly demos and metrics dashboards kept stakeholders engaged. Technical progress can be invisible, so make it tangible. We created a public dashboard showing real-time performance improvements, which became a valuable tool for maintaining executive buy-in throughout the lengthy project.
2. **Invest in Training**: The client's team needed extensive training on new tools and practices. Budget for knowledge transfer upfront, not as an afterthought. We allocated 20% of the project budget for training and pair programming sessions.
3. **Security Can't Wait**: PCI compliance requirements for the new Payment Service delayed deployment by 3 weeks. Involve security teams from day one to avoid schedule disruptions. Early involvement also helped us design a more secure architecture from the beginning.
4. **Feature Flag Strategy**: LaunchDarkly paid for itself in the first month by enabling safe rollbacks.
Feature flags are essential for any migration involving production traffic. We used flags not just for features, but also for infrastructure changes like database connection switching.

### What We'd Do Differently

Next time, we'd implement feature flags earlier in the process, allowing instant rollback capability. We'd also consider a gradual migration of user sessions rather than the big-switch approach we ultimately needed for this timeline. The user session migration was more disruptive than anticipated, requiring a maintenance window that could have been avoided.

We'd also invest more heavily in database migration tooling upfront. Our custom scripts worked, but off-the-shelf solutions might have saved time on development and testing. Additionally, we'd implement synthetic monitoring earlier to catch performance regressions before they reach customers.

## Conclusion

Eight months after our first assessment, ShopMart operates on a platform designed for the next decade of growth. The migration wasn't just about technology; it was about transforming how the business operates, innovates, and scales. Today, the team deploys daily instead of monthly, handles traffic spikes without outages, and has a technical foundation ready for whatever comes next.

For companies facing similar challenges, the path forward is clear: invest in architecture that serves your business goals, not just your technical ambitions. The journey from monolith to microservices is rarely smooth, but with proper planning, phased execution, and relentless focus on metrics that matter, it's a transformation that pays dividends for years to come.

ShopMart's engineers have reported a 75% improvement in job satisfaction, citing the modern toolchain and reduced firefighting as key factors. The platform's stability has allowed the business to focus on growth initiatives rather than keeping the lights on.
