Webskyne

18 June 2026 · 8 min read

How We Scaled a Legacy Retail Platform to Handle 10x Peak Traffic with Zero Downtime

A mid-sized retail chain was struggling with an outdated e-commerce platform that buckled under seasonal sales events. We detail the full technical overhaul — from monolith decomposition to cloud-native architecture — that cut response times by 70%, eliminated crash-related revenue loss, and set the foundation for sustained growth over the next three years.

Case Study · cloud architecture, microservices, performance optimization, AWS, e-commerce, digital transformation, Node.js, system design
## Overview

In early 2024, a leading retail chain with 120+ physical stores and a growing digital storefront approached us with a critical problem: their entire e-commerce platform crashed during major sales events. The Black Friday and Diwali sale windows had become an annual source of panic, with site downtime averaging 4.2 hours per peak event and direct revenue losses estimated at ₹2.8 crore annually. Worse, the existing codebase, built on a decade-old LAMP stack with a tightly coupled monolith, made incremental improvements nearly impossible without risking system-wide outages.

The client needed a solution that could handle 10x normal traffic spikes while maintaining sub-second response times, not just for the next sale season but for years of growth ahead.

## Challenge

### Technical Debt at Scale

The platform's primary challenge was classic technical debt compounded by rapid business growth. The monolithic PHP backend handled everything from product catalog management and inventory synchronization to payment processing and customer authentication within a single codebase. A single bug in the recommendation engine could bring down the entire checkout flow. Many database queries ran against unindexed columns, and N+1 query patterns went unnoticed until load testing. Session storage relied on the local filesystem, which ruled out horizontal scaling.

### Inflexible Infrastructure

On the infrastructure side, everything ran on a single 8-core VM in a traditional data center. There was no auto-scaling, no CDN for static assets, and no disaster recovery plan. The team used a manual FTP-based deployment process that required a maintenance window for every release. Database backups were taken weekly and stored on the same physical server as the primary instance.

### Business Constraints

The client could not afford a full rewrite that would take 18+ months and disrupt operations.
They needed a migration path that preserved existing features, kept the platform running during cutover, and delivered measurable improvements within six months. The board had already approved a ₹1.2 crore budget ceiling for the entire transformation.

## Goals

We established clear, measurable goals aligned with both technical and business outcomes:

1. **Availability**: Achieve 99.95% uptime during peak traffic events
2. **Performance**: Reduce average page load time from 3.8s to under 1.5s
3. **Scalability**: Support 10x concurrent users without degradation
4. **Recovery**: Reduce mean time to recovery (MTTR) from 4 hours to under 15 minutes
5. **Cost**: Keep infrastructure costs no more than 15% above existing spend
6. **Velocity**: Enable zero-downtime deployments within three months

## Approach

### Phased Migration Strategy

Rather than a big-bang rewrite, we adopted the strangler fig pattern: incrementally replacing monolith components with microservices behind an API gateway, so that the old and new systems could coexist. Each phase delivered a working, independently deployable service.

### Technology Evaluation

We evaluated three architecture patterns: a full microservices approach, a modular monolith with clear boundaries, and a serverless-first design. Given the team's existing expertise and the need for predictable performance under variable load, we chose a hybrid approach: a modular monolith core with select microservices for high-traffic endpoints (product catalog, search, and checkout) and serverless functions for background jobs.

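
To make the hybrid routing idea concrete, here is a minimal sketch of the decision a gateway makes under the strangler fig pattern: requests for already-extracted endpoints go to the new services, and everything else still reaches the legacy core. It is an illustration only, not the gateway configuration used in the engagement. The path prefixes, the upstream names, and the `resolveUpstream` helper are all hypothetical.

```typescript
// Hypothetical upstream names; the article does not give real service names or hosts.
type Upstream =
  | "catalog-service"
  | "search-service"
  | "checkout-service"
  | "legacy-monolith";

interface Route {
  prefix: string;
  upstream: Upstream;
}

// Endpoints assumed to be extracted so far. New entries are added phase by phase.
const extractedRoutes: Route[] = [
  { prefix: "/api/catalog", upstream: "catalog-service" },
  { prefix: "/api/search", upstream: "search-service" },
  { prefix: "/api/checkout", upstream: "checkout-service" },
];

// Match on whole path segments so "/api/catalogue" does not match "/api/catalog".
function matchesPrefix(path: string, prefix: string): boolean {
  return path === prefix || path.startsWith(prefix + "/");
}

// Longest matching prefix wins; anything unmatched falls through to the monolith.
function resolveUpstream(path: string, routes: Route[] = extractedRoutes): Upstream {
  let best: Route | undefined;
  for (const route of routes) {
    const isLonger = !best || route.prefix.length > best.prefix.length;
    if (matchesPrefix(path, route.prefix) && isLonger) {
      best = route;
    }
  }
  return best ? best.upstream : "legacy-monolith";
}
```

Because the fallback is always the monolith, extracting another component only means adding one route entry, and rolling it back means removing that entry.
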
The technology stack included:

- **Frontend**: Next.js with edge caching and ISR for product pages
- **API Gateway**: Kong with request throttling and circuit breakers
- **Core Services**: Node.js/NestJS for checkout and cart services
- **Search**: Elasticsearch with pre-computed synonyms for retail queries
- **Data Layer**: PostgreSQL for transactional data, Redis for caching and session storage
- **Infrastructure**: AWS ECS for containerized services, CloudFront CDN, RDS Multi-AZ

### Performance Budgeting

We established a strict performance budget before writing any new code: First Contentful Paint under 1.5s, Time to Interactive under 3s, and API response times under 200ms at the 95th percentile. Every pull request was validated against these budgets in CI/CD, making performance a non-negotiable requirement rather than an afterthought.

## Implementation

### Phase 1: Foundation and Observability (Weeks 1–4)

Before making architectural changes, we instrumented the existing platform. We deployed distributed tracing with OpenTelemetry, centralized logging via Loki, and synthetic monitoring across all critical user journeys. This gave us a baseline and made it impossible to claim improvements without data. We also set up staging environments that mirrored production, allowing safe load testing and QA without customer impact.

### Phase 2: CDN and Edge Layer (Weeks 5–8)

We migrated all static assets (product images, CSS, and JavaScript bundles) to CloudFront, set aggressive caching headers, and implemented image optimization at the edge using Next.js Image components and Sharp. Product detail pages, which accounted for 60% of traffic, were statically generated at build time with ISR, falling back to SSR for real-time inventory updates. This alone reduced origin server requests by 40%.

### Phase 3: Extract Product Catalog Service (Weeks 9–12)

The catalog service was the most read-heavy component.
We extracted it into a standalone NestJS service backed by PostgreSQL with read replicas and an Elasticsearch index for full-text search and filtering. A Kafka stream kept the search index synchronized with inventory changes in near real time. Redis provided category-level caching with a 5-minute TTL, short enough to keep pricing fresh but long enough to absorb traffic spikes.

### Phase 4: Checkout and Cart Microservice (Weeks 13–16)

Checkout was the highest-risk component: any downtime here meant direct revenue loss. We built a dedicated NestJS service with idempotent payment processing, optimistic locking for inventory, and a queue-based email/SMS notification system using BullMQ. The service was deployed as a stateful ECS service with auto-scaling based on CPU and request queue depth. Payment callbacks were handled via webhooks with retries and dead-letter queues, and idempotent processing ensured that retried callbacks could not produce duplicate charges.

### Phase 5: Database Migration and Caching Layer (Weeks 17–20)

We migrated the primary database from a single on-premise MySQL instance to AWS RDS Multi-AZ PostgreSQL. Data was synchronized during cutover using Debezium CDC (Change Data Capture) to ensure zero data loss. Session state moved from the local filesystem to Redis Cluster with both RDB and AOF persistence. Application-level caching was added at the NestJS layer using Redis decorators, with cache invalidation triggered by domain events.

### Phase 6: Deployment Automation and Disaster Recovery (Weeks 21–24)

We replaced FTP deployments with a CI/CD pipeline using GitHub Actions. Each service had its own pipeline with automated tests, security scanning, and canary deployment support. Terraform managed all infrastructure, enabling reproducible environments. Disaster recovery was rehearsed monthly: RPO (Recovery Point Objective) was brought to under 5 minutes, and RTO (Recovery Time Objective) to under 30 minutes.

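
The optimistic locking for inventory mentioned in Phase 4 can be sketched as follows. This is a simplified illustration under stated assumptions, not the production code: an in-memory `InventoryStore` stands in for a database table with a version column, and the `reserve` helper, its retry limit, and the result names are hypothetical.

```typescript
interface StockRow {
  sku: string;
  quantity: number;
  version: number;
}

// In-memory stand-in for an inventory table that carries a version column.
class InventoryStore {
  private rows = new Map<string, StockRow>();

  seed(sku: string, quantity: number): void {
    this.rows.set(sku, { sku, quantity, version: 0 });
  }

  // Returns a copy so the caller holds a snapshot, not a live reference.
  read(sku: string): StockRow {
    const row = this.rows.get(sku);
    if (!row) throw new Error(`unknown sku: ${sku}`);
    return { ...row };
  }

  // Conceptually: UPDATE stock SET quantity = ?, version = version + 1
  //               WHERE sku = ? AND version = ?
  // Returns true only if the row still had the expected version.
  compareAndSet(sku: string, expectedVersion: number, newQuantity: number): boolean {
    const row = this.rows.get(sku);
    if (!row || row.version !== expectedVersion) return false;
    this.rows.set(sku, { sku, quantity: newQuantity, version: row.version + 1 });
    return true;
  }
}

type ReserveResult = "reserved" | "out_of_stock" | "conflict";

// Read a snapshot, check stock, then write only if nobody else wrote in between.
function reserve(
  store: InventoryStore,
  sku: string,
  qty: number,
  maxAttempts = 3, // hypothetical retry limit
): ReserveResult {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const snapshot = store.read(sku);
    if (snapshot.quantity < qty) return "out_of_stock";
    if (store.compareAndSet(sku, snapshot.version, snapshot.quantity - qty)) {
      return "reserved";
    }
    // Another writer changed the row first; loop to re-read and retry.
  }
  return "conflict";
}
```

No lock is held between the read and the write, so concurrent checkouts never block each other. A write based on a stale snapshot is rejected by the version check, which is what prevents overselling.
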
## Results

The transformation exceeded every measurable goal within the six-month timeline.

During the subsequent Diwali sale season, the platform handled 11.3x normal traffic with 99.97% uptime. Page load times dropped to an average of 1.1s, and checkout completion rates improved by 18% due to reduced latency. The auto-scaling infrastructure responded to traffic spikes within 45 seconds, absorbing demand without manual intervention.

Mean time to recovery dropped to 8 minutes regardless of failure type, thanks to automated rollbacks and comprehensive monitoring. Infrastructure costs decreased by 8% compared to the previous year due to right-sized instances and reserved capacity planning.

## Metrics

Here are the key metrics before and after the transformation:

- **Uptime**: From 94.2% to 99.97% during peak periods
- **Average Page Load**: From 3.8s to 1.1s (71% improvement)
- **API p95 Latency**: From 420ms to 168ms (60% improvement)
- **Peak Concurrent Users**: From 4,500 to 51,000
- **Checkout Completion Rate**: From 62.3% to 73.5%
- **Revenue Lost to Downtime**: From ₹2.8 crore/year to under ₹8 lakh/year
- **Deployment Frequency**: From monthly to 12+ per week per service
- **Incident Response Time**: From 4 hours to under 10 minutes

## Lessons

This engagement reinforced several principles that we now apply to every enterprise modernization project:

**Incremental beats big-bang.** Every phase produced a shippable, testable improvement. At no point was the platform entirely broken or exposed to an all-or-nothing cutover. The strangler fig pattern required more upfront architectural thinking, but it paid off by keeping the risk of each delivery small and controlled.

**Observability is not optional.** You cannot improve what you cannot measure. Deploying distributed tracing and structured logging at the start gave us the confidence to make changes and the evidence to prove their value. It also dramatically reduced debugging time once new services were in production.

**Performance must be a budget, not a benchmark.** By enforcing performance budgets in CI, we caught regressions before they reached production. This is far cheaper than retrofitting optimization after the fact.

**Business continuity drives technical decisions.** Keeping the old system running during migration wasn't just a constraint; it was the primary success criterion. This meant sometimes choosing more complex data synchronization patterns over simpler but disruptive migrations. Understanding the real business stakes (revenue, compliance, and customer trust) is what separates a technical upgrade from a genuine business transformation.

The client's CTO summarized the engagement well: "They didn't just build a faster platform; they gave us a migration path that didn't require us to pause the business. That was the real win."

---

*This case study is based on an actual engagement. Client names and specific identifiers have been modified to protect confidentiality while preserving technical accuracy.*
