Webskyne

22 May 2026 · 7 min read

How Meridian Retail Replaced a Fragile Monolith with a Cloud-Native Microservices Platform

Meridian Retail spent 18 months migrating from a 350,000-line PHP monolith to an event-driven microservices architecture on AWS — led by Webskyne. Platform uptime jumped from 99.4% to 99.95%, deployment cycles fell from 4–6 weeks to under one week, and infrastructure costs dropped 42%. Here's the full story: the challenge, the architecture, the implementation phases, the results, and the hard-won lessons every engineering leader should read.

Case Study · microservices · cloud migration · AWS · event-driven architecture · microservices architecture · e-commerce · platform engineering · Kafka

Overview

Meridian Retail, a mid-sized omnichannel fashion retailer operating across the UK and EU, managed an aging monolithic e-commerce platform built on PHP 7 and a legacy jQuery frontend. As the business scaled — spanning brick-and-mortar stores, a B2B wholesale channel, and a direct-to-consumer web store — the monolith became an increasingly expensive and risky source of technical debt. Over an 18-month engagement, Webskyne led a complete re-platforming to a modern, event-driven microservices architecture on AWS, enabling Meridian to scale seamlessly, improve platform reliability, and accelerate their feature velocity.

The Challenge

Meridian's monolithic platform was a single 350,000-line PHP codebase handling every business concern: product catalogues, search, cart, checkout, inventory management, loyalty, order processing, and reporting. The team of 12 engineers was spending over 60% of its sprint capacity on maintenance: patching security vulnerabilities, resolving deployment conflicts, and manually debugging cascading failures. A single badly timed deployment touching the checkout code could take down the entire website, a risk amplified by major campaigns and seasonal peaks like Black Friday, where downtime could cost upwards of £200,000 per hour.

The monolithic design also meant that scaling was all-or-nothing: the entire application had to be scaled together, resulting in 3–5× over-provisioning of compute during low-traffic periods and costly auto-scaling misses at peak. The technical debt accrued over five years had become self-reinforcing — the team was afraid to refactor because the test suite was incomplete and rollbacks were risky, which meant bugs accumulated, which made development slower.

Adding to the problem, Meridian's search indexing ran during business hours and was so resource-intensive that it regularly caused checkout bottlenecks through lock contention. Marketing campaigns demanded rapid feature launches, but each release cycle took 4–6 weeks, making Meridian reactive rather than proactive in the highly competitive fashion e-commerce space.

Goals

The engagement was framed around four primary outcomes:

  • Decouple the monolith into independently deployable microservices, starting with high-bloat, high-change areas: product search, checkout, and inventory.
  • Achieve 99.9% uptime during peak traffic, up from the historically volatile 99.4% platform availability achieved on the monolith.
  • Reduce full-cycle deployment time from 4–6 weeks to under 1 week for most feature branches.
  • Cut infrastructure costs by at least 30% through right-sized autoscaling, eliminating the monolith's chronic over-provisioning.

Our Approach

Webskyne adopted a Strangler Fig pattern instead of a big-bang rewrite. Rather than shutting the monolith off overnight, we incrementally routed more traffic to new services while keeping the monolith running as a stable backbone. This approach minimised business risk while allowing the team to continuously ship value.
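As a rough illustration of incremental routing, the sketch below sends a configurable share of requests for each migrated path to a new service and leaves everything else on the monolith. The route table, service names, and percentages are hypothetical, and the real routing would have lived at the gateway rather than in application code.

```python
import hashlib

# Hypothetical routing table: path prefix -> (new service name, % of traffic migrated).
MIGRATED_ROUTES = {
    "/search": ("search-service", 100),
    "/cart": ("cart-service", 25),
}


def bucket(request_id: str) -> int:
    """Map a request ID to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def route(path: str, request_id: str) -> str:
    """Return the backend that should handle this request."""
    for prefix, (service, percent) in MIGRATED_ROUTES.items():
        if path.startswith(prefix) and bucket(request_id) < percent:
            return service
    # Anything not yet migrated (or outside the rollout share) stays on the monolith.
    return "monolith"
```

Hashing the request ID instead of picking at random keeps the decision repeatable, so the same request (or, if keyed by session, the same customer) always lands on the same backend while the percentage is raised.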

Our technical architecture was designed as follows:

  • Event-Driven Backbone: An Apache Kafka cluster on Amazon MSK formed the central nervous system of the new architecture. All domain events (for example OrderPlaced, StockReserved, and ProductUpdated) were published as immutable records, enabling asynchronous, loosely coupled inter-service communication.
  • API Gateway: Amazon API Gateway with a Lambda authoriser sat at the edge, routing incoming requests to the appropriate service or the existing monolith, depending on the migration phase.
  • Service Mesh: Istio on Amazon EKS managed inter-service traffic and provided automatic retries, circuit breaking, and distributed tracing via OpenTelemetry, all with little or no change to application code.
  • CI/CD Pipelines: Each service was containerised with Docker and deployed via Argo CD following GitOps principles, with feature flags managed by LaunchDarkly for canary releases.
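To make the idea of immutable domain events concrete, here is a minimal sketch of how an OrderPlaced event might be modelled and serialised before being handed to a Kafka producer. The field names and the to_record helper are assumptions for illustration, not Meridian's actual schema.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class OrderPlaced:
    order_id: str
    total_pence: int
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    occurred_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_record(event: OrderPlaced) -> tuple[str, bytes]:
    """Return a (key, value) pair suitable for a message producer.

    Keying by order ID is an assumption here; with Kafka's default
    partitioner it keeps one order's events on one partition, in order.
    """
    payload = {"type": type(event).__name__, **asdict(event)}
    return event.order_id, json.dumps(payload, sort_keys=True).encode("utf-8")
```

The unique event_id is what lets downstream consumers recognise and discard a redelivered event.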

Implementation

Phase 1 (Months 1–3) focused on foundational infrastructure and the first extraction: the product search service. The existing search logic relied on raw MySQL table joins with no faceting or relevance scoring. We introduced Amazon OpenSearch Service (AWS's managed service for OpenSearch, the open-source fork of Elasticsearch) and built a Node.js-based search API, implementing full-text search, attribute faceting, personalised ranking via a lightweight XGBoost model, and an asynchronous indexer that read Kafka product events. The search service achieved sub-150ms p99 latency during initial load testing with 10,000 concurrent users.

Phase 2 (Months 4–9) tackled the checkout and cart services, which were the most critical business capabilities. The checkout domain was decomposed into two services: cart-service (session-based, Redis-backed cart state) and checkout-service (orchestrating payment, shipping, and order finalisation). The biggest challenge was maintaining data consistency between the legacy monolith database and the new services during the transition period, when both systems needed to see every change. We solved this using an outbox pattern: every state change was committed together with an outbox record in the same database transaction, and a relay then published that record to Kafka. Because this gives at-least-once delivery, consumers were made idempotent so that each event took effect exactly once. Payment gateway integration (Stripe and Adyen) was handled via a saga-orchestration layer that could compensate on failure, eliminating the need for distributed two-phase commits.
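The saga idea mentioned above can be sketched as an orchestrator that runs steps in order and, if one fails, undoes the completed steps in reverse. This is a simplified in-memory illustration; the step names and the ctx dictionary are hypothetical, and a production orchestrator would also persist saga state and retry compensations.

```python
from typing import Callable

# A step is (name, action, compensation); both callables receive a shared context.
Step = tuple[str, Callable[[dict], None], Callable[[dict], None]]


def run_saga(steps: list[Step], ctx: dict) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: list[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action(ctx)
            completed.append(step)
        except Exception as exc:
            ctx["log"].append(f"{name} failed: {exc}")
            for done_name, _, compensate in reversed(completed):
                compensate(ctx)
                ctx["log"].append(f"compensated {done_name}")
            return False
    return True


def reserve_stock(ctx: dict) -> None:
    ctx["stock_reserved"] = True


def release_stock(ctx: dict) -> None:
    ctx["stock_reserved"] = False


def charge_payment(ctx: dict) -> None:
    if ctx.get("card_declined"):
        raise RuntimeError("card declined")
    ctx["charged"] = True


def refund_payment(ctx: dict) -> None:
    ctx["charged"] = False


CHECKOUT_SAGA: list[Step] = [
    ("reserve_stock", reserve_stock, release_stock),
    ("charge_payment", charge_payment, refund_payment),
]
```

Each service commits its own local transaction, and a failure is handled by explicit undo steps, which is why no lock has to be held across services as it would be under two-phase commit.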

Phase 3 (Months 10–15) delivered the inventory and order orchestration services. We replaced the old polling-based inventory reservation model with a Kafka Streams-based real-time deduplication engine, reducing inventory double-booking incidents to zero within 30 days of go-live. Order state management was moved to a durable event store (EventStoreDB), giving Meridian's support team an auditable timeline of every customer order through its entire lifecycle.
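The production engine was built on Kafka Streams; the sketch below is only an in-memory Python illustration of the underlying idea, namely that a reservation event is applied at most once per reservation ID, so a redelivered or duplicated event cannot book the same stock twice. The class and field names are hypothetical.

```python
class InventoryReservations:
    """Applies each reservation at most once, keyed by reservation ID."""

    def __init__(self, stock: dict[str, int]) -> None:
        self.stock = dict(stock)            # sku -> units still available
        self.seen: dict[str, bool] = {}     # reservation_id -> recorded outcome

    def handle(self, reservation_id: str, sku: str, qty: int) -> bool:
        # A duplicate event returns the original outcome without touching stock.
        if reservation_id in self.seen:
            return self.seen[reservation_id]
        ok = self.stock.get(sku, 0) >= qty
        if ok:
            self.stock[sku] -= qty
        self.seen[reservation_id] = ok
        return ok
```

In a streaming deployment the seen map would live in a durable, partitioned state store rather than process memory, so the deduplication survives restarts.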

Throughout the project, we ran chaos engineering experiments using Gremlin on staging environments, validating that services could withstand the loss of individual nodes and message brokers without data loss. Load testing with Locust confirmed that the new architecture could sustain a 20× traffic spike during a simulated Black Friday event while keeping p99 latency under 300ms.

Results

Meridian's new microservices architecture went fully live 18 months after project kickoff, replacing the last monolith routes with service equivalents on Black Friday of the following year. The business impact was immediate and significant:

  • Platform uptime improved from 99.4% to 99.95%, equating to roughly 4 hours and 23 minutes of planned and unplanned downtime per year, compared to about 52.6 hours under the old platform.
  • Feature deployment velocity increased 6×, from an average of 4–6 weeks per release to under 1 week, thanks to independent service deployability and feature flag controls.
  • Infrastructure costs reduced by 42%, primarily from a shift from static EC2 provisioning to Kubernetes autoscaling and spot instance usage for stateless background workers, saving an estimated £127,000 per year in AWS spend.
  • Checkout conversion rate increased 3.8% in relative terms (from 2.1% to 2.18%), attributed to reduced latency and the elimination of the previous indexing lock contention during peak hours.
  • Search abandonment rate fell by 28%, facilitated by faster, more relevant search results with facet filtering and real-time index updates.
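For readers who want to check the availability figures, the conversion from an uptime percentage to annual downtime is simple arithmetic. The snippet assumes a 365-day year.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def annual_downtime_hours(availability_percent: float) -> float:
    """Hours per year a platform is unavailable at the given availability."""
    return (100.0 - availability_percent) / 100.0 * HOURS_PER_YEAR


before = annual_downtime_hours(99.4)    # old monolith
goal = annual_downtime_hours(99.9)      # original target
after = annual_downtime_hours(99.95)    # new platform

print(f"{before:.2f} h, {goal:.2f} h, {after:.2f} h")  # 52.56 h, 8.76 h, 4.38 h
```

A gain of just over half a percentage point therefore removes about 48 hours of downtime a year.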

Key Metrics

Metric                   | Before Migration | After Migration | Change
Platform uptime          | 99.4%            | 99.95%          | +0.55 pp
Mean deployment cycle    | 4–6 weeks        | <1 week         | 6× faster
Infrastructure cost      | ~£302K/yr        | ~£175K/yr       | -42%
Search p99 latency       | 820ms            | 148ms           | -82%
Checkout p99 latency     | 1,240ms          | 285ms           | -77%
Conversion rate          | 2.1%             | 2.18%           | +3.8%
Inventory booking errors | ~12/week         | ~0/week         | -100%
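The Change column mixes percentage points (uptime) with relative changes (cost, latency, conversion). A short check of the relative figures, using the before and after values from the table:

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from before to after, as a percentage of before."""
    return (after - before) / before * 100.0


rows = {
    "Infrastructure cost (£K/yr)": (302, 175),
    "Search p99 latency (ms)": (820, 148),
    "Checkout p99 latency (ms)": (1240, 285),
    "Conversion rate (%)": (2.1, 2.18),
}

changes = {name: round(percent_change(b, a), 1) for name, (b, a) in rows.items()}
```

Note that the conversion rate rose by 0.08 percentage points, which is a 3.8% relative increase on the 2.1% baseline.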

Architecture Diagram: High-Level Service Mesh

The new architecture can be summarised at a high level as follows:

  • Edge Layer: CloudFront → API Gateway → Lambda Authoriser → Service Mesh (Istio/Ingress)
  • Services Layer: Product Search (Node.js + OpenSearch), Cart (Redis), Checkout (Node.js + Stripe/Adyen), Inventory (Go + Kafka Streams), Order Store (EventStoreDB), Notification Service (Python)
  • Data Layer: Kafka MSK (event backbone), RDS Aurora (monolith read replica with persistence), OpenSearch (search index), Redis Cluster (cart session store), S3 + CloudFront (media assets)
  • Observability: Grafana + Prometheus + Jaeger/OpenTelemetry + CloudWatch

Lessons Learned

1. The Strangler Fig pattern beats big-bang rewrites every time.

Big-bang rewrites almost inevitably overrun schedules and budgets, and many never ship. By strangling the monolith progressively and shipping incremental value every sprint, Meridian's leadership team could see ROI at every phase, not just at the finish line.

2. Data consistency between legacy and new systems is the hardest problem — solve it early.

The outbox event pattern was our most impactful architectural decision. It eliminated the dual-write race condition without introducing distributed transactions, and gave Meridian a complete, timestamped audit trail of every change.
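A minimal sketch of the outbox pattern, using SQLite in place of the real database and a plain callable in place of a Kafka producer, shows why the race disappears: the business row and its event are written in one local transaction, a relay publishes afterwards, and consumers ignore duplicates. Table, column, and function names are hypothetical.

```python
import json
import sqlite3


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
    db.execute(
        "CREATE TABLE outbox (seq INTEGER PRIMARY KEY AUTOINCREMENT,"
        " event_id TEXT UNIQUE, payload TEXT, published INTEGER DEFAULT 0)"
    )
    return db


def place_order(db: sqlite3.Connection, order_id: str) -> None:
    # One local transaction: the order row and its event commit or roll back together.
    with db:
        db.execute("INSERT INTO orders VALUES (?, 'PLACED')", (order_id,))
        payload = json.dumps({"type": "OrderPlaced", "order_id": order_id})
        db.execute(
            "INSERT INTO outbox (event_id, payload) VALUES (?, ?)",
            (f"OrderPlaced:{order_id}", payload),
        )


def relay(db: sqlite3.Connection, publish) -> int:
    """Publish unsent outbox rows in order; a crash before the UPDATE causes a resend."""
    rows = db.execute(
        "SELECT seq, event_id, payload FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, event_id, payload in rows:
        publish(event_id, payload)
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)


class IdempotentConsumer:
    """Remembers processed event IDs so a resent event has no second effect."""

    def __init__(self) -> None:
        self.processed: set[str] = set()
        self.handled: list[str] = []

    def __call__(self, event_id: str, payload: str) -> None:
        if event_id in self.processed:
            return
        self.processed.add(event_id)
        self.handled.append(json.loads(payload)["order_id"])
```

The relay delivers at least once, and the consumer's duplicate check is what turns that into an effectively-once outcome.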

3. Observability cannot be a Phase-4 afterthought.

We instrumented services from day one (OpenTelemetry + distributed tracing + structured logging), which meant we were never flying blind. This reduced mean-time-to-resolution (MTTR) for production incidents to under 15 minutes, compared to 3.5 hours on the old platform.

4. Feature flags unlock true progressive delivery.

LaunchDarkly flags on every service meant we could dark-launch features to internal users before any customer traffic, giving us rapid A/B testing velocity. The marketing team ran 12 feature experiments in the first quarter post-launch alone.

5. Test from chaos, not just from the happy path.

Chaos engineering sessions with Gremlin on staging revealed a subtle Kafka consumer rebalancing issue that would have caused a ~12-minute search degradation event under an unlikely but non-zero failure scenario. Catching it pre-launch avoided what would have been a Black Friday headline event.

Conclusion

Meridian Retail's migration from a fragile monolith to a resilient, event-driven cloud-native platform is an example of how deliberate architecture, discipline, and incremental delivery can transform a business's technical trajectory. The team is now working on autonomous ML-powered search ranking and a real-time personalisation engine — capabilities that were architecturally impossible on the old monolith. For any mid-market retailer facing a similar monolith burden, the Meridian playbook provides a repeatable, low-risk path to cloud-native modernity.
