How a Mid-Size E-Commerce Platform Scaled to 2M+ Monthly Users with a Full-Stack Cloud Migration

When a fast-growing e-commerce brand hit a performance ceiling that threatened its Black Friday sales, the engineering team embarked on a four-month transformation spanning infrastructure, architecture, CI/CD, and observability. This case study traces every decision — from the initial load-test failure that kicked it off, to the day the platform handled 142,000 concurrent shoppers without a blip. Along the way, we cover the missteps, the debates, the rollback plan that never needed to fire, and the specific infrastructure choices that made the difference. If you are running a growing platform and wondering whether a migration is worth the cost, this is the inside story of one team that bet big and came out ahead.

Overview

Verdant Supply Co. is a direct-to-consumer home goods brand that launched in 2019. By early 2025, the company had grown into a thirty-person engineering organization serving roughly 1.2 million monthly active users across two storefronts, a mobile app, and a B2B wholesale portal. Revenue had grown sixfold in three years. The technical platform, however, had not grown with it.

The backend stack in use at the start of 2025 was a monolith built on a managed LAMP stack, fronted by a Content Delivery Network that had been configured once during platform setup in 2020 and never revisited. The database was a single instance of MySQL on a virtual machine with 4 CPU cores and 16 GB of RAM — an extremely modest specification in 2025, especially given the transactional volume it was handling every day. The API layer had no formal rate-limiting strategy beyond what the web server offered out of the box, and the mobile app was consuming the same internal REST endpoints as the storefront directly, with no gateway or caching layer between it and the database.

What followed is a detailed account of the four-month initiative that brought Verdant's platform from a precarious state to one capable of handling 142,000 concurrent shoppers on the busiest shopping day of the year, reducing monthly infrastructure cost by 34 percent, and cutting mean time to recovery from ninety minutes to under eight minutes. The story includes the early failures, the technical disagreements that consumed weeks of review time, the specific architecture decisions that mattered most, and the organizational changes that made the transformation stick.

The Challenge

The signal that something was genuinely wrong arrived on a Thursday afternoon in December 2024, during a load test the team had scheduled before the peak holiday season. The load test was meant to simulate 50,000 concurrent users browsing product pages, adding items to cart, and initiating checkout flows. The infrastructure buckled at 32,000 simultaneous sessions.

The failure mode was classic but brutal. At approximately 32,000 concurrent sessions, the application and the MySQL database server entered an error-correction loop. The MySQL instance exhausted its available connections as new queries arrived faster than it could complete existing ones. The web server responded with 502 errors as its upstream connection pool saturated. CDN edge nodes received mismatch signals and began queuing responses, compounding the problem. The result was a platform that became unusably slow for roughly twenty minutes, during which the load test team had to terminate the run manually.

The week following the failed load test brought a series of alarming retrospectives. The engineering team had been incrementally adding features (a new wishlist, inventory forecasting, a B2B bulk ordering module) without a coordinated infrastructure assessment. Each feature had been reviewed in isolation, and no one had maintained a holistic view of what the combined load would do under peak traffic. The CI/CD pipeline had a median deployment time of forty-five minutes per environment, making every deploy a small commitment of engineering bandwidth and discouraging the kind of small, frequent iterations that keep a platform healthy.

Beyond the technical picture sat an organizational one. Verdant's engineering practice was organized as a single squad with full-stack ownership but no dedicated DevOps, SRE, or platform engineering roles. Infrastructure configuration lived in a shared administrative Google Drive folder as a collection of annotated PDFs. Secrets management was a shared LastPass vault with nineteen entries and no formal access review process. Database migrations were run manually during deployment windows by the most senior engineer on call. No one disputed the urgency of the situation. The disagreement, which consumed almost three weeks of review cycles, was over exactly what should be done and in what order.

Goals

Before the team committed to a plan, it defined a set of measurable, time-bound goals that guided every subsequent decision. These goals were deliberately structured to avoid scope creep and keep the team focused on outcomes that could be quantified.

Performance Goal: The platform must sustain 100,000 concurrent user sessions with a p99 response time of less than 400 milliseconds for product browse endpoints and less than 1,200 milliseconds for the full checkout flow. This target was set conservatively: the company's Black Friday traffic had historically peaked at 68,000 concurrent sessions, and the 100,000 figure represented a 47 percent safety margin above that peak.

Reliability Goal: Infrastructure incidents that require manual intervention by an on-call engineer must occur no more than twice per quarter, and mean time to recovery for any incident rated Sev-2 or higher must not exceed twenty minutes. At the time, Verdant was averaging approximately one Sev-2 incident per month, and MTTR was measured at ninety minutes across the previous twelve months.

Scalability Goal: The platform must be able to accommodate a 2x traffic increase without requiring a significant architectural overhaul. The team wanted a platform that could absorb rapid growth without repeating the kind of performance review that had preceded this initiative.

Cost Goal: Total monthly infrastructure cost must not increase above the pre-migration baseline of $18,500 per month, despite the substantially more capable architecture being built. The team considered this a non-negotiable constraint — professionalizing infrastructure should reduce cost, not expand it.

Operational Goal: The median CI/CD deployment time must drop from forty-five minutes to under five minutes, and deployments must require no manual approval gates during normal business hours. This was a productivity goal as well as a reliability goal: faster, lower-friction deployments reduce the risk of merge conflicts and configuration errors that accumulate during long-lived feature branches.

Approach

The team debated two overarching approaches. One faction, led by the principal architect, argued for a comprehensive replatforming to Kubernetes, citing it as the industry standard for platforms at Verdant's scale. The opposing faction, made up of mid-senior engineers, argued that Kubernetes introduced substantial conceptual and operational overhead that the team was not, at the time, equipped to manage reliably — particularly under peak-load conditions when the platform would be most stressed.

The debate lasted three weeks. It ended not in a decisive argument but in a mediated compromise: the team committed to the fundamentally simpler path of a well-structured managed infrastructure migration on AWS, with a container strategy on Amazon ECS that offered sufficient orchestration power without the full Kubernetes surface area. The reasoning, ultimately, was pragmatic. Verdant was designing a platform for a mid-size commercial operation, not an infrastructure engineering team at a hyperscaler. The right architecture was the one the team could operate confidently under pressure, not the one that checked the most boxes on a technology leaderboard.

The approach was then structured into four sequenced phases: Stabilize and Instrument, Decouple Services, Optimize for Scale, and Automate Operations. Each phase was designed to produce a shippable, independently valuable increment, so that the team could ship improvements continuously rather than accumulating risk in a single large release.

Phase 1: Stabilize and Instrument

The first phase focused on two things simultaneously: stabilizing the existing platform far enough to survive a peak-load event, and instrumenting the platform to measure the specific behaviors that would indicate whether subsequent changes were working.

On the stabilization side, the team implemented Redis-backed session management to offload session data from MySQL, applied connection pool tuning that reduced the maximum concurrent MySQL connections from 150 to 80 (a counterintuitive move that prevented the runaway connection storm under load), and introduced a comprehensive object-caching layer in front of product catalog queries using Redis, which had already been provisioned but was used only trivially. These changes were production-validated, deployed via CD, and immediately reduced average response time by 38 percent on load-test profile runs.
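
The object-caching layer described above follows what is usually called the cache-aside pattern. The sketch below illustrates the idea under stated assumptions: an in-memory dictionary stands in for Redis, and the names TTLCache, get_product, and load_from_db are hypothetical, not taken from Verdant's codebase.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory stand-in for Redis (assumption): values expire after a fixed TTL."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, dict]] = {}

    def get(self, key: str) -> Optional[dict]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired entries count as a miss
            return None
        return value

    def set(self, key: str, value: dict) -> None:
        self._store[key] = (self._clock() + self._ttl, value)


def get_product(product_id: str, cache: TTLCache, load_from_db: Callable[[str], dict]) -> dict:
    """Cache-aside read: check the cache first, query the database only on a miss."""
    key = f"product:{product_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    product = load_from_db(product_id)  # hypothetical database accessor
    cache.set(key, product)
    return product
```

Repeated reads of the same product within the TTL never reach the database, which is the effect the team was after for catalog browse traffic.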

On the instrumentation side, the team deployed a managed observability stack (Prometheus for metric collection, Grafana for dashboards, and infrastructure-aware alerting rules) and instrumented the application with structured logging and OpenTelemetry tracing, producing a full request trace from the API gateway through application code and into database query timing. Within the first week of full instrumentation, the team identified a cache-miss pattern in the inventory endpoint: for high-demand products during peak hours, every page refresh was generating unnecessary MySQL queries at three-second intervals. This purely operational waste had been completely invisible before tracing was in place.
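
To make the idea of a request trace concrete, here is a minimal sketch of nested, timed spans emitted as structured records. It uses only the Python standard library as a stand-in and is not the OpenTelemetry API; SpanRecorder and its field names are assumptions for illustration.

```python
import json
import time
import uuid
from contextlib import contextmanager
from typing import Iterator, List


class SpanRecorder:
    """Collects timed, nested spans as plain dictionaries (single-threaded sketch)."""

    def __init__(self) -> None:
        self.records: List[dict] = []
        self._stack: List[str] = []  # ids of the spans that are currently open

    @contextmanager
    def span(self, name: str, **attributes: object) -> Iterator[None]:
        span_id = uuid.uuid4().hex[:8]
        parent_id = self._stack[-1] if self._stack else None
        self._stack.append(span_id)
        started = time.perf_counter()
        try:
            yield
        finally:
            self._stack.pop()
            record = {
                "span": name,
                "span_id": span_id,
                "parent_id": parent_id,
                "duration_ms": round((time.perf_counter() - started) * 1000, 3),
            }
            record.update(attributes)
            self.records.append(record)

    def to_json_lines(self) -> str:
        """One JSON object per line, a common structured-logging layout."""
        return "\n".join(json.dumps(r, sort_keys=True) for r in self.records)


recorder = SpanRecorder()
with recorder.span("GET /inventory", product_id="sku-1"):
    with recorder.span("mysql.query", table="inventory"):
        pass  # the database call would run here
```

Because each record carries its parent's id, a slow request can be broken down into the time spent in each layer, which is how a pattern like the repeated inventory queries becomes visible.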

Phase 2: Decouple Services and Migrate to Managed Infrastructure

Phase 2 was the most architecturally complex phase. The team followed a strangler fig pattern — wrapping new services around the old and routing traffic gradually, rather than pursuing a big-bang migration. The major structural changes included:

API Gateway implementation: An Amazon API Gateway layer was introduced in front of all incoming traffic — both from the storefront and the mobile app. This gateway handled rate-limiting, request validation, and circuit breaking, eliminating the direct coupling between the mobile app's internal calls and the backend services. Crucially, it also allowed the team to deploy rate-limiting policies that throttled runaway request flows automatically before they reached the application tier.

Database migration to Amazon RDS: The MySQL instance was migrated to an Amazon RDS deployment on a db.r6g.2xlarge instance (8 virtual CPUs, 64 GB of RAM) with a provisioned IOPS SSD volume configured for 16,000 IOPS, representing roughly a twenty-fold increase in database IO throughput over the old VM. The migration was handled via a snapshot followed by binary log replication, with less than four minutes of observed downtime during the cutover window, which was executed at 3 AM local time on a Saturday.

Application migration to Amazon ECS: The application tier was containerized and deployed on Amazon ECS using Fargate compute — a serverless container runtime that eliminates the overhead of managing individual EC2 instances while providing the full isolation and scheduling benefits of container orchestration. The ECS cluster was configured with auto-scaling policies that targeted 65 percent CPU utilization, enabling the platform to scale compute capacity horizontally in response to load without human intervention.

Storage migration to Amazon S3: All static asset storage (product images, user-generated content, B2B catalog PDFs) was moved from the existing EBS block-storage volumes to Amazon S3 behind a CloudFront distribution. This stopped asset requests from passing through the application layer, which had contributed to slow image load times, and reduced the storage throughput burden on the application servers.
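
The 65 percent CPU target on the ECS service described above can be read as a proportional rule: scale the task count by the ratio of observed to target utilization. The sketch below is an approximation for intuition only, not the actual AWS target-tracking algorithm; the function name and the min_tasks and max_tasks bounds are assumptions that do not come from the migration itself.

```python
import math


def desired_task_count(
    current_tasks: int,
    observed_cpu_percent: float,
    target_cpu_percent: float = 65.0,
    min_tasks: int = 2,   # hypothetical lower bound
    max_tasks: int = 50,  # hypothetical upper bound
) -> int:
    """Estimate how many tasks bring average CPU back to the target."""
    if current_tasks < 1:
        raise ValueError("current_tasks must be at least 1")
    # Total CPU demand is roughly current_tasks * observed utilization;
    # dividing by the target gives the task count that would absorb it.
    raw = math.ceil(current_tasks * observed_cpu_percent / target_cpu_percent)
    return max(min_tasks, min(max_tasks, raw))


print(desired_task_count(10, 91.0))  # ten tasks at 91% CPU
```

In this model, ten tasks running at 91 percent CPU call for fourteen tasks, while a service sitting exactly at the target keeps its current size.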

One of the most important design decisions in Phase 2 was the introduction of an idempotency layer in the checkout service, using DynamoDB with a time-to-live of twenty-four hours on idempotency keys. This change addressed a class of failures that had previously produced duplicate charges during incident recovery — duplicated orders triggered by partial failures during the payment confirmation path, where the order confirmation had been committed but the payment callback had not. Before the idempotency layer, each such failure required a manual reversal and customer service reconciliation. After, the platform recovered silently without human intervention.
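
The idempotency behavior described above can be sketched as follows. This is an illustration under assumptions, not Verdant's implementation: a dictionary stands in for the DynamoDB table, and IdempotentCheckout, submit, and the charge callback are hypothetical names.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict

IDEMPOTENCY_TTL_SECONDS = 24 * 60 * 60  # the twenty-four-hour TTL from the text


@dataclass
class StoredResult:
    expires_at: float
    response: dict


class IdempotentCheckout:
    """Replays the stored response when the same idempotency key is retried."""

    def __init__(self, charge: Callable[[dict], dict], clock: Callable[[], float] = time.time) -> None:
        self._charge = charge  # hypothetical payment call
        self._clock = clock
        self._results: Dict[str, StoredResult] = {}  # stand-in for the DynamoDB table

    def submit(self, idempotency_key: str, order: dict) -> dict:
        now = self._clock()
        stored = self._results.get(idempotency_key)
        if stored is not None and stored.expires_at > now:
            return stored.response  # retry of a completed request: no second charge
        response = self._charge(order)
        self._results[idempotency_key] = StoredResult(now + IDEMPOTENCY_TTL_SECONDS, response)
        return response
```

A production version would also need an atomic "insert if absent" step (for example a conditional write) so that two concurrent retries cannot both reach the payment call; this single-threaded sketch leaves that out.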

Phase 3: Optimize for Scale

With the migration complete and data flowing through the new infrastructure, Phase 3 focused on the performance tuning and architectural refinements that moved the platform from simply stable to genuinely scalable.

The team introduced a read-replica architecture for the database, provisioning two Aurora MySQL-compatible read replicas behind a proxy that balanced read query load automatically. Product catalog browse endpoints were migrated to consume read replicas directly, removing them from contention on the primary database instance and reducing write-latency pressure during peak-traffic events. Combined with improvements to query indexing that reduced the average execution time of the most frequently executed browse queries by 44 percent, this change alone reduced primary database CPU utilization from a peak of 87 percent during load tests to 41 percent.
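
The read/write split described above comes down to a routing decision per statement. The following sketch shows that decision in its simplest form, assuming round-robin balancing and made-up endpoint names; the managed proxy the team used makes this choice itself, and QueryRouter is a hypothetical name.

```python
import itertools
from typing import Iterator, List


class QueryRouter:
    """Sends plain SELECTs to replicas in round-robin order, everything else to the primary."""

    def __init__(self, primary: str, replicas: List[str]) -> None:
        if not replicas:
            raise ValueError("at least one replica is required")
        self._primary = primary
        self._replicas: Iterator[str] = itertools.cycle(replicas)

    def endpoint_for(self, sql: str) -> str:
        statement = sql.strip().rstrip(";").rstrip().upper()
        # Locking reads must see and lock current rows, so they stay on the primary.
        is_read = statement.startswith("SELECT") and not statement.endswith("FOR UPDATE")
        if is_read:
            return next(self._replicas)
        return self._primary


router = QueryRouter("primary", ["replica-1", "replica-2"])
print(router.endpoint_for("SELECT name FROM products WHERE id = 42"))
print(router.endpoint_for("UPDATE inventory SET qty = qty - 1 WHERE id = 42"))
```

Replicas lag the primary slightly, so this split suits browse traffic that tolerates slightly stale data; reads that must reflect a write made moments earlier are safer on the primary.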

The team also implemented a deduplication strategy for the inventory service. Under peak load, inventory queries for the same product originating from concurrent frontend instances were arriving as a burst of redundant queries within a single second. By introducing a short-lived, in-memory deduplication cache keyed on product ID with a fifteen-second TTL, the service eliminated approximately 67 percent of redundant inventory lookups during traffic spikes without sacrificing inventory accuracy within the acceptable SLA.
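
One way to picture the deduplication cache is as a per-key lock combined with a short TTL: the first caller performs the lookup, and concurrent callers for the same product wait for and reuse that result. The sketch below assumes a thread-based service and uses hypothetical names (CoalescingCache, fetch); it is not the team's actual code.

```python
import threading
import time
from typing import Callable, Dict, Optional, Tuple


class CoalescingCache:
    """Collapses concurrent lookups for one key into a single fetch, cached for a short TTL."""

    def __init__(
        self,
        fetch: Callable[[str], int],
        ttl_seconds: float = 15.0,  # the fifteen-second TTL from the text
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._values: Dict[str, Tuple[float, int]] = {}
        self._key_locks: Dict[str, threading.Lock] = {}
        self._guard = threading.Lock()

    def _lock_for(self, key: str) -> threading.Lock:
        with self._guard:
            return self._key_locks.setdefault(key, threading.Lock())

    def _fresh(self, key: str) -> Optional[Tuple[float, int]]:
        entry = self._values.get(key)
        if entry is not None and entry[0] > self._clock():
            return entry
        return None

    def get(self, product_id: str) -> int:
        entry = self._fresh(product_id)
        if entry is not None:
            return entry[1]
        with self._lock_for(product_id):
            # Re-check: another thread may have filled the cache while we waited.
            entry = self._fresh(product_id)
            if entry is not None:
                return entry[1]
            value = self._fetch(product_id)
            self._values[product_id] = (self._clock() + self._ttl, value)
            return value
```

The trade-off is that a stock level can be up to fifteen seconds old, which is why the approach only works where that staleness fits the accuracy SLA.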

The mobile app was refactored to consume the API gateway through a dedicated GraphQL aggregation layer rather than directly calling REST endpoints. This allowed the mobile team to request exactly the data fields needed for a given screen in a single round-trip, reducing the average number of API calls per mobile session from 4.3 to 1.2 and cutting mobile page load times by 52 percent. The GraphQL layer was built using AWS AppSync, which removed the need to manage a dedicated GraphQL compute tier, reducing the operational overhead of the layer to the equivalent of a single managed service.

Phase 4: Automate Operations and Build Reliability Culture

The fourth phase addressed head-on the organizational and process dimensions that had contributed to the platform's instability. The team refactored the CI/CD pipeline from a forty-five-minute single-stage build to a two-stage pipeline — build and test on pull request, deploy with automated canary analysis on merge to main — reducing median deployment time from forty-five minutes to approximately three minutes while simultaneously reducing failure rates due to environment configuration drift.
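
Automated canary analysis generally means comparing a small slice of traffic on the new version against the current version before promoting it. The sketch below shows one simple form of that decision, based on error rates alone. The thresholds, names, and three-way verdict are assumptions for illustration; the source does not describe which signals Verdant's pipeline evaluated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowStats:
    """Request and error counts observed over one analysis window."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(
    baseline: WindowStats,
    canary: WindowStats,
    min_requests: int = 500,           # hypothetical: wait until the canary has enough traffic
    max_rate_increase: float = 0.005,  # hypothetical: tolerate up to +0.5 percentage points
) -> str:
    """Return 'wait', 'rollback', or 'promote' for the canary deployment."""
    if canary.requests < min_requests:
        return "wait"
    if canary.error_rate > baseline.error_rate + max_rate_increase:
        return "rollback"
    return "promote"


print(canary_verdict(WindowStats(10_000, 20), WindowStats(1_000, 3)))
```

Requiring a minimum request count before judging keeps a handful of early errors from triggering a rollback on noise alone.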

Infrastructure as code using Terraform modules was introduced, replacing the shared-drive collection of PDFs and ad-hoc EC2 console edits with a version-controlled, peer-reviewed Terraform codebase. Every infrastructure change, from a new security group rule to a full environment redeployment, flowed through pull request review and was applied automatically to a staging environment before promotion to production. The secrets story was addressed by migrating all API keys, database credentials, application secrets, and third-party service tokens to AWS Secrets Manager with rotation policies configured on a thirty-day basis. This replaced a LastPass vault that had no rotation policy and no formal access review.

Incident management was reorganized around the concept of blameless postmortems. The team committed to holding a formal postmortem within seventy-two hours of any Sev-2 incident, with a written document that specified a timeline, contributing factors, and at least two concrete improvement actions. Over the twelve months following the implementation of this practice, incident frequency dropped to under one Sev-2 incident per quarter, and MTTR fell to four minutes and forty-seven seconds, based on the single incident documented at the time of writing.

Results

In June 2025, approximately three months after the migration was declared fully live, Verdant Supply Co. conducted a formal load test to assess the new platform's capacity. The target was 100,000 concurrent users. The platform reached 142,000 concurrent user sessions before the load test was declared a success. The p99 product browse response time at peak load was measured at 287 milliseconds. The p99 checkout flow response time at 120,000 concurrent sessions was 973 milliseconds, comfortably inside the 1,200 millisecond performance target. The platform had exceeded every one of its performance targets by a meaningful margin.

The business impact was direct and immediate. At the Summer Sale event in July 2025, Verdant's platform sustained 89,000 concurrent shoppers — a new organizational record — with zero downtime and zero manual intervention. Customer support tickets related to checkout failures were down 62 percent year-over-year. Mobile app crash rate dropped from 2.8 percent to 0.3 percent following the GraphQL migration and CDN optimizations. None of these outcomes were predicted at the start of the project: they were discovered in the months that followed as customers experienced a visibly improved platform.

The infrastructure cost story surprised the team as much as anyone. The migration to managed services on AWS, combined with the right-sizing of compute and the move of static assets to S3 and CloudFront, brought total monthly cloud spend from $18,500 in January 2025 to $12,200 by July 2025, a reduction of 34 percent, or $6,300 in monthly savings. The logic is straightforward but instructive: migrating to managed services often adds security and reliability value at little or no extra cost, because the provider carries much of the reliability burden, and right-sizing compute after migration frequently reveals over-provisioned capacity that was carried for years because no one had systematically reviewed it.

Metrics Dashboard

Metric	Before Migration	After Migration	Change
Max concurrent users supported	32,000	142,000	+344%
p99 browse response	1,420 ms	287 ms	-80%
p99 checkout response	4,800 ms	973 ms	-80%
Monthly infrastructure cost	$18,500	$12,200	-34%
Mean time to recovery (MTTR)	90 min	8 min	-91%
Sev-2 incidents per quarter	~3	<1	-67%
Median deploy time	45 min	3 min	-93%
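
The Change column above can be reproduced from the before and after values. The short sketch below does that arithmetic for the rows with exact figures; percent_change is a helper name introduced here for illustration.

```python
def percent_change(before: float, after: float) -> int:
    """Relative change from before to after, rounded to a whole percent."""
    return round((after - before) / before * 100)


rows = {
    "Max concurrent users supported": (32_000, 142_000),
    "p99 browse response (ms)": (1_420, 287),
    "p99 checkout response (ms)": (4_800, 973),
    "Monthly infrastructure cost ($)": (18_500, 12_200),
    "MTTR (min)": (90, 8),
    "Median deploy time (min)": (45, 3),
}

for metric, (before, after) in rows.items():
    print(f"{metric}: {percent_change(before, after):+d}%")
```

The Sev-2 row is left out because its values (~3 and <1) are approximate bounds rather than exact figures.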

Lessons Learned

The Verdant case study yields a number of lessons that the team thinks are broadly applicable to organizations in a similar situation. Each of them was learned through direct experience, not from reading technology analyst reports.

1. Stabilize before you innovate. The most valuable work done in Phase 1 — Redis session offloading, connection pool tuning, basic instrumentation — did not require any new technology or architectural change. It was just good engineering done carefully. The team discovered that a modest amount of disciplined optimization on the existing platform frequently reveals headroom that makes a complete migration less urgent, which in turn gives you the time to plan the migration well rather than rushing through it.

2. The governance cost of technical debt accumulates faster than infrastructure costs. The shared Google Drive folder containing all infrastructure configuration had grown, at its peak, to 74 PDF files and two spreadsheets. The cost of this arrangement in terms of configuration drift, human error in manual changes, and incident recovery difficulty was substantially higher than the immediate cost of replacing it with Terraform modules. Organizations tend to tolerate this kind of unmeasured cost for far longer than they should.

3. Observability is not an overhead investment — it is a prerequisite for scale. The first measure of a well-instrumented platform is whether you can answer the question "why is this slow" without manually parsing logs. The team doubts any migration of Verdant's complexity could have been completed safely — or honestly evaluated for its effectiveness — without a tracing and metrics layer in place before the migration work began. Build instrumentation as early as you can, before it becomes a crisis.

4. Incremental deliverability is safety. The strangler fig pattern (wrapping new services around old ones, migrating traffic gradually, shipping every phase as a production-validated increment) dramatically reduced the risk of the migration as a whole. Every phase delivered something that was immediately useful and independently justifiable. This meant that even if the team had run out of time or budget mid-migration, the work already done would have produced genuine business value.

5. The team you build during a crisis is the team that handles the next one. Perhaps the most important outcome of the migration project was not technical at all. The emergency load-test failure and the subsequent urgency of the migration effort forged a more cohesive engineering organization with shared context about infrastructure constraints, platform behavior, and what actually matters during production incidents. The postmortem culture, the on-call rotation, the Terraform review process — these practices outlasted the migration itself and became the permanent operating style of the engineering team at Verdant.

Verdant's platform today is not immune to failure. No platform is. But it is resilient to the kind of acute capacity failure that nearly cost the company its most important sales day of the year. The migration delivered capacity for 142,000 concurrent users without incident, infrastructure costs reduced by a third, dependable observability, a trained on-call team, and a platform capable of absorbing further growth without the architectural anxiety that preceded the December 2024 load-test failure. For a mid-size e-commerce operator in a similar position, Verdant's experience is an existence proof that the path through is achievable, and that the work, properly scoped and sequenced, pays for itself faster than many engineers assume.