From Monolith to Cloud-Native: How Meridian Retail Cut Checkout Latency by 62% in 90 Days

Meridian Retail, a $40M direct-to-consumer brand, was losing 28% of mobile shoppers at checkout due to a legacy monolith running on a single AWS EC2 instance. Over a 90-day engagement, we migrated their storefront to Flutter, refactored the backend into a NestJS microservices mesh on Amazon ECS, and introduced event-driven inventory sync. The result was a 62% reduction in P95 checkout latency, an increase in mobile checkout conversion from 2.4% to 3.2%, and zero downtime during production cutover, all while keeping the increase in monthly cloud spend under 8%.

Overview

Meridian Retail operates a $40 million direct-to-consumer catalog across apparel, accessories, and home goods. Before our engagement, every customer request—browse, search, cart management, checkout, payment, and order fulfillment—routed through a single Ruby on Rails monolith deployed on one burstable EC2 instance. The architecture had been incrementally patched for six years. By late 2024, the system that once handled 2,000 orders per day was buckling under Black Friday traffic spikes, returning 14% 5xx responses during peak hours and pushing mobile checkout latency above 4.2 seconds.

This case study documents the business context, technical approach, implementation decisions, and measurable outcomes of a 90-day modernization sprint led by Webskyne consulting. We migrated the storefront layer to a Flutter web and mobile codebase, refactored the backend into a NestJS microservices mesh deployed on Amazon ECS with Fargate, and introduced an event-driven inventory synchronization pipeline using Amazon EventBridge. The project was executed with zero planned downtime and resulted in a 62% reduction in end-to-end checkout latency.

Challenge

The primary challenge was not a lack of ambition—Meridian's engineering team understood the monolith was unsustainable. The obstacles were operational, organizational, and technical simultaneously.

Performance debt: The Rails monolith made synchronous calls across eight domain tables for every checkout event. Database connection pooling was misconfigured, causing P99 query latency to spike to 380 ms during promotional campaigns. Memcached clusters existed but were sparsely populated because the application logic did not distinguish between cacheable product catalog reads and personalization-sensitive session data.

Deployment risk: Every release required a full application restart, with rollback times averaging 22 minutes. The last three releases introduced regressions that took more than four hours to remediate. The team had moved to a weekly release cadence to reduce blast radius, but business stakeholders needed faster feature turnaround for seasonal campaigns.

Talent constraints: Meridian had two backend engineers and one frontend contractor. The timeline was driven by fiscal-year planning: the board had approved Q4 marketing spend but tied it to a guarantee that the platform could support 5× baseline traffic.

Data integrity concerns: Inventory counts were updated through a series of database triggers that did not account for distributed timing. During traffic spikes, oversell rates reached 1.8%, causing chargebacks and customer service escalations that staff were not equipped to handle in real time.

Goals

We defined four measurable goals before the engagement began:

Latency: Reduce P95 checkout latency from 4,200 ms to under 1,600 ms.
Availability: Achieve 99.95% availability during peak traffic windows without manual intervention.
Conversion: Increase mobile checkout conversion rate from 2.4% to at least 3.5%.
Cost discipline: Keep incremental cloud spend under 10% month-over-month.

A fifth, non-negotiable constraint was zero unplanned downtime during the migration. Given the seasonal calendar, any multi-hour outage would have exceeded the cost of the entire project within a single weekend.

Approach

Rather than attempting a big-bang rewrite, we followed an incremental strangler fig pattern, a term coined by Martin Fowler and applied to microservice migrations in Sam Newman's writing, adapted for AWS-native infrastructure. The monolith would remain live, serving the majority of traffic, while new capabilities were built behind API facades that routed traffic based on feature flags.
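The facade's code is not part of this write-up. As a rough sketch of the idea, assuming a hypothetical rule table, flag name, and backend labels, a flag-driven facade might pick a backend like this:

```typescript
// Hypothetical names throughout: the rule shape, the flag "new-checkout",
// and the backend labels are assumptions made for illustration.
type Backend = "monolith" | "checkout-service";

interface RouteRule {
  pathPrefix: string; // request paths this rule covers
  flag: string;       // feature flag that must be enabled to use the new backend
  target: Backend;
}

const routeRules: RouteRule[] = [
  { pathPrefix: "/checkout", flag: "new-checkout", target: "checkout-service" },
];

// Pick a backend for a request path given the set of enabled flags.
// Anything not claimed by an enabled rule stays on the monolith.
function resolveBackend(path: string, enabledFlags: Set<string>): Backend {
  for (const rule of routeRules) {
    if (path.startsWith(rule.pathPrefix) && enabledFlags.has(rule.flag)) {
      return rule.target;
    }
  }
  return "monolith";
}
```

Defaulting to the monolith means that turning a flag off is enough to send a path back to the legacy code.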

We chose a hexagonal architecture for each new service: a thin domain core surrounded by adapters for persistence, messaging, and external APIs. This made individual services independently testable and allowed us to iterate on one bounded context—checkout—without touching catalog, user profiles, or fulfillment.
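To illustrate the shape of that architecture, here is a minimal sketch with one port and an in-memory adapter. The names (InventoryPort, CheckoutCore, InMemoryInventory) are hypothetical and are not taken from the Meridian codebase.

```typescript
interface OrderLine {
  sku: string;
  quantity: number;
}

// Port: what the domain core needs from inventory, free of AWS or database types.
interface InventoryPort {
  // Reserve every line or none; returns false if any line cannot be satisfied.
  reserveAll(lines: OrderLine[]): boolean;
}

// Domain core: depends only on the port, so it can be tested without infrastructure.
class CheckoutCore {
  private readonly inventory: InventoryPort;

  constructor(inventory: InventoryPort) {
    this.inventory = inventory;
  }

  placeOrder(lines: OrderLine[]): "confirmed" | "rejected" {
    if (lines.length === 0) return "rejected";
    return this.inventory.reserveAll(lines) ? "confirmed" : "rejected";
  }
}

// Adapter used in tests: an in-memory stand-in for a real persistence adapter.
// Assumes each SKU appears on at most one order line.
class InMemoryInventory implements InventoryPort {
  private readonly stock: Map<string, number>;

  constructor(stock: Map<string, number>) {
    this.stock = stock;
  }

  reserveAll(lines: OrderLine[]): boolean {
    const canFill = lines.every((l) => (this.stock.get(l.sku) ?? 0) >= l.quantity);
    if (!canFill) return false;
    for (const l of lines) {
      this.stock.set(l.sku, (this.stock.get(l.sku) ?? 0) - l.quantity);
    }
    return true;
  }
}
```

Swapping InMemoryInventory for an adapter backed by a database or event bus would leave CheckoutCore unchanged, which is the property the pattern is meant to provide.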

The technology selection was constrained by Meridian's existing talent pool and budget:

Flutter for frontend consolidation across web, iOS, and Android, reducing three separate codebases to one.
NestJS for backend services, chosen for its opinionated structure, built-in dependency injection, and strong TypeScript alignment—all of which would accelerate onboarding for the existing Rails developers.
Amazon ECS with Fargate as the compute layer, avoiding the operational overhead of Kubernetes while still providing task-level isolation and auto-scaling.
Amazon EventBridge for inventory events, replacing database triggers with a durable, replayable event log.

Implementation

The implementation was split into three parallel tracks, each with its own milestone and rollback criteria.

Track 1: Frontend Foundation — Flutter Web + Mobile Shell

We began by rebuilding the product detail page and cart interface in Flutter. The first prototype ran in a shadow mode: 5% of organic traffic was routed to the new Flutter frontend while the Rails monolith continued serving the remaining 95%. This allowed us to compare Core Web Vitals, JavaScript error rates, and conversion metrics in production while limiting exposure to a small share of sessions.
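How the 5% was selected is not described here. One common approach, sketched below under the assumption that each visitor carries a stable ID, is to hash that ID into a bucket so the same visitor always sees the same frontend. The function names are hypothetical.

```typescript
// FNV-1a-style 32-bit hash over UTF-16 code units: small, dependency-free,
// and stable across runs, so a visitor keeps the same bucket between requests.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Map a visitor ID to a bucket in [0, 100) and compare it with the rollout percentage.
function servesNewFrontend(visitorId: string, rolloutPercent: number): boolean {
  return fnv1a(visitorId) % 100 < rolloutPercent;
}
```

Because the bucket is fixed per visitor, raising the percentage from 5 to 80 only adds visitors to the new frontend; nobody who already had it is moved back.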

Within three weeks, Flutter's product detail page was registering a 38% lower Largest Contentful Paint (LCP) than the equivalent Rails-rendered page, primarily because the Flutter bundle pre-cached hero images ahead of the user's scroll position. Mobile bounce rate on the new frontend dropped 11%.

By week six, we had promoted Flutter to 80% traffic share. The remaining 20% was legacy Safari on iOS 14, which lacked the WebGL capabilities required by Flutter's canvas renderer. We shipped a native iOS 14 compatibility shim in week eight and completed the rollout by day 50.

Track 2: Backend Decomposition — NestJS Checkout Service

The checkout domain was the highest-risk, highest-impact bounded context. We built a NestJS Checkout Service that accepted validated cart payloads via a gRPC gateway, applied discount logic, reserved inventory through EventBridge, and created payment intents with the existing Stripe integration.

The biggest implementation hazard was session consistency. The Rails monolith still managed user sessions and shopping carts. To avoid a hard dependency, we introduced a read-through cache backed by DynamoDB that replicated session data from Rails in near-real time. Rails continued writing sessions as before; a DynamoDB Streams consumer captured writes and updated the cache within 200 ms. The Checkout Service read from this cache, eliminating direct coupling.
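The consumer's logic is not shown in this case study. As a hedged sketch, assuming each change record carries a monotonically increasing version (an assumption, as are all names below), a replica can ignore late or redelivered records like this:

```typescript
// Hypothetical record shape; the real session schema is not given in the source.
interface SessionChange {
  sessionId: string;
  version: number; // assumed to increase with every write to the session
  cart: string[];
}

// In-memory stand-in for the replicated session store.
class SessionReplica {
  private readonly rows = new Map<string, SessionChange>();

  // Apply a change only if it is newer than what the replica already holds,
  // so out-of-order or duplicate records cannot roll a session backwards.
  apply(change: SessionChange): boolean {
    const current = this.rows.get(change.sessionId);
    if (current !== undefined && current.version >= change.version) return false;
    this.rows.set(change.sessionId, change);
    return true;
  }

  read(sessionId: string): SessionChange | undefined {
    return this.rows.get(sessionId);
  }
}
```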

We also extracted the payment webhook handler into a standalone NestJS webhook service. This service subscribed to EventBridge inventory events and Stripe webhook events, maintaining an idempotency ledger in DynamoDB to prevent duplicate fulfillment calls. The idempotency key was derived from Stripe's event.id combined with a namespace prefix, ensuring that retried webhooks were silently deduplicated.
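The deduplication step can be sketched as follows. This is an illustration rather than the production handler: the ledger is an in-memory stand-in for the DynamoDB table, and the namespace value "stripe-webhook" and the "#" separator are assumptions.

```typescript
// In-memory stand-in for the idempotency ledger (the real one lives in DynamoDB).
class IdempotencyLedger {
  private readonly seen = new Set<string>();

  // Returns true the first time a key is recorded and false on every retry.
  recordIfNew(key: string): boolean {
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }
}

// Namespace prefix plus Stripe's event.id; prefix value and separator are hypothetical.
function idempotencyKey(namespace: string, stripeEventId: string): string {
  return `${namespace}#${stripeEventId}`;
}

// Runs the fulfilment callback at most once per Stripe event ID.
// The key is recorded before fulfilment; a production handler would also need
// a policy for failures that happen after the key has been written.
function handleStripeWebhook(
  ledger: IdempotencyLedger,
  stripeEventId: string,
  fulfil: () => void,
): "processed" | "duplicate" {
  if (!ledger.recordIfNew(idempotencyKey("stripe-webhook", stripeEventId))) {
    return "duplicate";
  }
  fulfil();
  return "processed";
}
```

With DynamoDB, the check-and-record step would typically be a single conditional write using an attribute_not_exists condition, so two concurrent retries cannot both pass the check.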

Track 3: Event-Driven Inventory Synchronization

Inventory was the root cause of oversell incidents. We replaced the Rails database triggers with an EventBridge event bus. Every inventory mutation—stock receipt, sale, return, transfer, adjustment—was published as a typed event to a central bus. Three downstream consumers processed these events: the website inventory API, the warehouse management integration, and a near-real-time analytics aggregator for the merchandising team.
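The event schema is not reproduced in this write-up. A minimal sketch of what "typed events" can look like in TypeScript, with hypothetical field names, is a discriminated union that every consumer folds into stock levels the same way:

```typescript
// One variant per mutation type named above; field names are assumptions.
type InventoryEvent =
  | { kind: "receipt"; sku: string; quantity: number }
  | { kind: "sale"; sku: string; quantity: number }
  | { kind: "return"; sku: string; quantity: number }
  | { kind: "transfer"; sku: string; quantity: number; direction: "in" | "out" }
  | { kind: "adjustment"; sku: string; delta: number };

// Signed effect of one event on stock on hand. The switch is exhaustive, so
// adding a new variant without handling it becomes a compile-time error.
function stockDelta(e: InventoryEvent): number {
  switch (e.kind) {
    case "receipt":
    case "return":
      return e.quantity;
    case "sale":
      return -e.quantity;
    case "transfer":
      return e.direction === "in" ? e.quantity : -e.quantity;
    case "adjustment":
      return e.delta;
  }
}

// Fold a stream of events into per-SKU stock levels.
function applyEvents(events: InventoryEvent[]): Map<string, number> {
  const onHand = new Map<string, number>();
  for (const e of events) {
    onHand.set(e.sku, (onHand.get(e.sku) ?? 0) + stockDelta(e));
  }
  return onHand;
}
```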

To maintain correctness during the transition, we ran both the legacy triggers and the EventBridge pipeline in parallel for 14 days. We reconciled totals nightly using an AWS Glue job; any discrepancy above 0.01% triggered an alert. After 14 days of perfect alignment, we decommissioned the triggers.
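The Glue job itself is not shown. The comparison it performs can be sketched in a few lines, assuming that "discrepancy" means the summed absolute per-SKU difference as a percentage of the legacy total; that definition and all names below are assumptions.

```typescript
interface SkuMismatch {
  sku: string;
  legacy: number;
  eventDriven: number;
}

interface ReconciliationReport {
  driftPercent: number;
  mismatches: SkuMismatch[];
  alert: boolean;
}

// Compare per-SKU totals from the two pipelines and flag drift above the threshold.
function reconcile(
  legacy: Record<string, number>,
  eventDriven: Record<string, number>,
  thresholdPercent: number = 0.01,
): ReconciliationReport {
  const skus = Object.keys({ ...legacy, ...eventDriven });
  const mismatches: SkuMismatch[] = [];
  let legacyTotal = 0;
  let absoluteDrift = 0;
  for (const sku of skus) {
    const a = legacy[sku] ?? 0;
    const b = eventDriven[sku] ?? 0;
    legacyTotal += a;
    if (a !== b) {
      mismatches.push({ sku, legacy: a, eventDriven: b });
      absoluteDrift += Math.abs(a - b);
    }
  }
  // With no legacy stock at all, any difference is treated as 100% drift.
  const driftPercent =
    legacyTotal === 0 ? (absoluteDrift === 0 ? 0 : 100) : (absoluteDrift / legacyTotal) * 100;
  return { driftPercent, mismatches, alert: driftPercent > thresholdPercent };
}
```

For example, a one-unit difference across 15,000 units of stock is roughly 0.007% drift and stays below the 0.01% threshold, while a ten-unit difference is roughly 0.07% and raises the alert.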

CI/CD and Deployment Strategy

Each service was packaged as a Docker image and pushed to Amazon ECR. ECS task definitions included resource limits, health checks, and log configuration. We used AWS CodePipeline to orchestrate builds and deployments, with manual approval gates for production promotion. Checkout Service deployments used a blue-green strategy: ECS ran two target groups behind an ALB, with traffic shifted in 10% increments over 15 minutes. Any increase in error rate above 0.5% within a 5-minute evaluation window triggered an automatic rollback to the previous task set.
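The shift schedule and rollback rule reduce to simple arithmetic. The sketch below is not the pipeline configuration; it assumes the steps are evenly spaced with the final step landing at the end of the window, and it reads the rollback rule as "error rate over the evaluation window exceeds 0.5%". Function names are hypothetical.

```typescript
interface ShiftStep {
  minute: number;            // minutes after the deployment starts
  newTaskSetPercent: number; // share of traffic on the new task set
}

// Evenly spaced steps: 10% increments over 15 minutes gives one step every 1.5 minutes.
function shiftSchedule(stepPercent: number = 10, windowMinutes: number = 15): ShiftStep[] {
  const steps = Math.ceil(100 / stepPercent);
  const interval = windowMinutes / steps;
  const schedule: ShiftStep[] = [];
  for (let i = 1; i <= steps; i++) {
    schedule.push({ minute: i * interval, newTaskSetPercent: Math.min(100, i * stepPercent) });
  }
  return schedule;
}

// Roll back when the error rate over the evaluation window exceeds the threshold.
// Cross-multiplying avoids dividing and keeps the comparison exact at the boundary.
function shouldRollBack(errors: number, requests: number, thresholdPercent: number = 0.5): boolean {
  if (requests === 0) return false;
  return errors * 100 > thresholdPercent * requests;
}
```

With these defaults, 6 errors in 1,000 requests (0.6%) triggers a rollback, while 5 errors in 1,000 (exactly 0.5%) does not.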

Infrastructure was defined using Terraform modules stored in a private registry. This ensured that staging and production environments were reproducible and that the configuration changes required for the migration were reviewable in pull requests rather than made through the AWS console.

Data Migration and Observability

We introduced structured JSON logging via Pino, with request IDs propagated through all service boundaries using OpenTelemetry instrumentation. A Grafana dashboard tracked checkout funnel drop-off, EventBridge event throughput, DynamoDB throttling events, and ECS task CPU/memory utilization. Alerts were routed to the on-call engineer via PagerDuty with runbook links for each failure mode.
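As an illustration of what propagating an ID across a service boundary involves, the sketch below parses a W3C Trace Context traceparent header, which is the default propagation format in OpenTelemetry, and attaches the trace ID to a JSON log line. It uses plain JSON.stringify as a stand-in rather than Pino itself, accepts only version 00, and all function and field names are hypothetical.

```typescript
interface TraceContext {
  traceId: string;  // 32 lowercase hex characters
  parentId: string; // 16 lowercase hex characters
  sampled: boolean;
}

// Parse a traceparent header of the form 00-<trace-id>-<parent-id>-<flags>.
// Returns null for anything malformed, including all-zero IDs.
function parseTraceparent(header: string): TraceContext | null {
  const match = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header.trim());
  if (match === null) return null;
  const traceId = match[1] ?? "";
  const parentId = match[2] ?? "";
  const flags = match[3] ?? "00";
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Build one structured log line. Fixed keys are written after the caller's
// fields so that a caller cannot overwrite the trace ID or service name.
function logLine(
  ctx: TraceContext,
  service: string,
  msg: string,
  fields: Record<string, unknown> = {},
): string {
  return JSON.stringify({ ...fields, level: "info", time: Date.now(), service, traceId: ctx.traceId, msg });
}
```

Because every service logs the same traceId for a given request, a single query on that field reconstructs the request's path through the system.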

Database migration was handled with zero downtime by using Amazon RDS blue-green deployments: the new schema was applied to the green environment, validated, and then promoted as the primary. This approach required careful handling of foreign key constraints but eliminated the need for application-level dual-write logic during the transition.

Results

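The migration met three of the four primary goals within the 90-day window. Latency, availability, and cost targets were all achieved; mobile conversion improved by a third but finished below the 3.5% target: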

Checkout latency: P95 checkout latency dropped from 4,200 ms to 1,590 ms—a 62% reduction.
Availability: The platform recorded 99.97% availability across the holiday weekend, with the only interruption being a 90-second network blip unrelated to the application layer.
Conversion: Mobile checkout conversion rose from 2.4% to 3.1% within the first two weeks of the Flutter rollout, and continued climbing to 3.2% by day 90.
Cost: Monthly AWS spend increased 7.7%, well within the 10% cap, primarily driven by Fargate tasks and increased data transfer.

Beyond the headline metrics, several second-order benefits emerged:

Velocity: The engineering team went from one production release per week to an average of 3.2 releases per week in the month following the migration. Deployment lead time—measured from commit to production—fell from 4.2 hours to 18 minutes.

Developer experience: TypeScript coverage across the new backend codebase exceeded 92%. The team reported that the NestJS module structure cut onboarding time for a new backend engineer from three weeks to one.

Operational resilience: The EventBridge inventory pipeline absorbed three accidental-loop incidents during testing—scenarios where an oversell event triggered a restock adjustment that triggered another oversell event. The pipeline's dead-letter queue caught these within 12 seconds, and the team was able to patch the logic without impacting customers.
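The mechanics of dead-lettering can be sketched independently of EventBridge. The example below is a generic illustration with hypothetical names, not the pipeline's actual consumer: a failing event is set aside with its failure reason, and the rest of the batch keeps flowing.

```typescript
interface DeadLetter<T> {
  payload: T;
  reason: string;
}

interface ProcessingOutcome<T> {
  processed: number;
  deadLetters: DeadLetter<T>[];
}

// Run the handler over a batch. An item whose handler throws is diverted to
// the dead-letter list instead of stopping the items that come after it.
function drainWithDeadLetter<T>(batch: T[], handler: (item: T) => void): ProcessingOutcome<T> {
  const deadLetters: DeadLetter<T>[] = [];
  let processed = 0;
  for (const item of batch) {
    try {
      handler(item);
      processed++;
    } catch (err) {
      deadLetters.push({ payload: item, reason: err instanceof Error ? err.message : String(err) });
    }
  }
  return { processed, deadLetters };
}
```

Keeping the original payload next to the failure reason is what makes later replay possible once the handler logic has been patched.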

Key Metrics

Metric	Before	After	Change
P95 checkout latency	4,200 ms	1,590 ms	−62%
Mobile checkout conversion	2.4%	3.2%	+33%
5xx error rate (peak)	14%	0.2%	−98.6%
Releases per week	1.0	3.2	+220%
Deployment lead time	4.2 hrs	18 min	−93%
Monthly cloud spend delta	baseline	+7.7%	within cap
Availability (holiday weekend)	99.1%	99.97%	+0.87 pp
Oversell rate	1.8%	0.04%	−97.8%
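The Change column follows from the before and after values reported in the Results section. As a quick arithmetic check, with hypothetical helper and property names:

```typescript
// Relative change in percent; for example 4200 -> 1590 is about -62.1.
function relativeChangePercent(before: number, after: number): number {
  return ((after - before) / before) * 100;
}

// Difference between two percentages, expressed in percentage points.
function percentagePointDelta(beforePercent: number, afterPercent: number): number {
  return afterPercent - beforePercent;
}

const keyMetricChanges = {
  p95LatencyMs: relativeChangePercent(4200, 1590),      // about -62.1
  mobileConversion: relativeChangePercent(2.4, 3.2),    // about +33.3
  peak5xxRate: relativeChangePercent(14, 0.2),          // about -98.6
  releasesPerWeek: relativeChangePercent(1.0, 3.2),     // about +220
  leadTimeMinutes: relativeChangePercent(4.2 * 60, 18), // about -92.9
  oversellRate: relativeChangePercent(1.8, 0.04),       // about -97.8
  availabilityPp: percentagePointDelta(99.1, 99.97),    // about +0.87 pp
};
```

Availability is reported in percentage points rather than as a relative change because both values are already percentages close to 100.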

Lessons Learned

Nine lessons stand out as transferable to similar modernization efforts:

1. Measure shadow traffic before you commit. Running Flutter in shadow mode for three weeks gave us statistically significant performance and conversion data before any production risk was assumed. If you cannot measure the new behavior against the old, you are guessing.

2. Strangle the monolith at the edges. Checkout was the right first domain because it was the most customer-visible and because it had relatively few shared database dependencies compared to user profiles or catalog. Start with a domain that creates a natural facade boundary.

3. Design for failure in event pipelines. Our first EventBridge prototype did not include a dead-letter queue. During load testing, a single inventory event with a malformed payload stopped the entire processing chain for 47 minutes. Adding a DLQ and replay capability took very little development time and prevented a production outage on day one.

4. Keep sessions simple. The DynamoDB session cache introduced eventual consistency whose effects were not initially obvious. A user who updated their cart and immediately clicked checkout sometimes saw stale cart contents. We solved this by writing cart updates through to the cache and applying a 100 ms max-age directive to session reads.

5. Budget observability from day one. Structured logging, distributed tracing, and a shared dashboard were not afterthoughts. They were built alongside each service. During the holiday weekend, these tools allowed us to identify and fix a routing misconfiguration in 22 minutes that would otherwise have taken hours to diagnose.

6. Match the compute model to the team. ECS with Fargate was a deliberate choice. The team had limited container orchestration experience. ECS removed the Kubernetes control-plane burden while still providing auto-scaling, task isolation, and rolling deployments. Do not default to the most sophisticated platform; default to what your team can operate safely under pressure.

7. Roll back at the network layer, not the application layer. Blue-green deployments behind an ALB meant we could shift traffic back to the previous target group in seconds. Application-level feature flags were still useful, but the network rollback was our fastest recovery mechanism during the migration.

8. Parallelize verification, not just development. Running the legacy triggers and EventBridge pipeline in parallel for 14 days was expensive in log storage but cheap relative to the cost of a single oversell incident during a promotional sale. Dual-write and reconciliation should be standard practice when replacing critical data flows.

9. Communicate cost early and often. The 7.7% spend increase was tracked weekly and shared with finance before they asked. Proactive cost transparency built trust with stakeholders and made the final conversation about value rather than surprise.

Modernization is not a technical exercise. It is an organizational one. The Meridian engagement succeeded because the engineering team, finance department, and marketing stakeholders aligned on a shared set of constraints and a single definition of success before a line of new code was written.