Webskyne

20 May 2026 · 9 min read

From 80 RPS to 2,000 RPS: How FreshGrowth E-Commerce Re-Architected Their Platform in 90 Days

FreshGrowth, a fast-growing direct-to-consumer grocery startup, was on the verge of collapse. Their Ruby on Rails monolith, which had served them well through early growth, was grinding to a halt at just 80 requests per second, more than an order of magnitude below what their Black Friday surge demanded. This is the story of how a small senior engineering team tore down a legacy monolith and rebuilt it as an event-driven, multi-region microservices platform in 90 days, going from chronic downtime to 2,800 sustained RPS and a 99.97% uptime record.

Case Study · microservices · event-driven architecture · e-commerce · digital transformation · performance optimization · system design · e-commerce platform · software architecture

1. Overview

FreshGrowth launched in 2019 as a direct-to-consumer grocery delivery startup. Their pitch was simple: fresh, organic produce delivered to customers' doors within hours of harvest, undercutting traditional grocery retail by cutting out the middleman entirely. By mid-2022, the company had scaled to 180,000 active subscribers across three European markets, was processing over 40,000 orders a week, and was on track to reach a projected £120M ARR by the end of the fiscal year.

What no one outside the engineering team knew was that the platform underpinning that growth was a ticking time bomb. Built in early 2019 on Ruby on Rails 5.2 with a monolithic PostgreSQL database and a hand-rolled caching layer, the architecture had never been meaningfully revisited. Every subsequent feature, promotion, or market expansion had been bolted on top of the original foundation. By late 2022, the system was operating at a fragile edge that was becoming impossible to ignore.

What follows is a forensic, step-by-step account of how FreshGrowth’s four-person senior engineering team executed one of the most aggressive and technically demanding re-architecture programs in modern SaaS history — in 90 working days, under real commercial pressure, without a single week-long outage.

2. The Challenge

To understand the magnitude of the problem, it helps to appreciate the state of the platform at the start of the engagement.

2.1 The Monolith Under Pressure

The FreshGrowth monolith was, in architectural terms, what you get when you never draw a boundary line between services and let a Rails app grow organically for four years. It housed:

  • Customer identity and authentication (Devise, bespoke OAuth stores, session state)
  • Product catalog and inventory management (80,000+ SKUs across perishable and non-perishable lines)
  • Order lifecycle management (cart, checkout, payment, fulfillment, invoicing)
  • Delivery orchestration (last-mile routing, driver assignment, ETA prediction)
  • Subscription and billing engine (recurring charges, proration, voucher redemption)
  • Admin and analytics dashboards

All of this lived in a single Rails codebase, backed by a single PostgreSQL primary with read replicas, and cached by a Redis instance that had become, in many cases, the only thing preventing the database from flatlining.

The platform was peaking at approximately 80 requests per second (RPS) per region during normal traffic, and would spike to 350–400 RPS during promotional events. Beyond that threshold, latency ballooned from sub-200ms median to over 8 seconds, and database connections would saturate, leading to cascading 503 errors across the entire platform.

2.2 What Was Actually Breaking

The lead-up to Black Friday 2022 made the problem impossible to ignore. In three scheduled load tests, conducted in September and October, the platform hit hard limits at:

  • 406 RPS during the first test: database CPU spiked to 98% within 90 seconds
  • 412 RPS during the second test: Redis evicted 62% of the cache layer, forcing cold-cache penalties on every subsequent request
  • 389 RPS during the third test: the slow query log captured 47 separate queries running for more than 5 seconds, including a single checkout query that joined across 11 tables

On November 3rd, during a flash promotional campaign, the platform crashed completely at 8:17 AM. Orders were failing in cart and checkout. Support tickets flooded in. The incident post-mortem ran to 38 pages and included the stark admission that “overall platform capacity is approximately 350 RPS and we have no credible path to increasing it in time for peak weeks.”

3. Goals

The engineering and executive teams agreed on four hard constraints before any line of code was rewritten.

3.1 Functional Goals

  • Support 2,500 sustained RPS with graceful degradation under peaks of up to 10,000 RPS
  • Hold operational overhead flat or reduce it, with no headcount increase for platform operations
  • Preserve data integrity — zero data loss, zero duplicate orders during the migration
  • Keep the platform live through the entire re-architecture period

3.2 Non-Negotiable Non-Functional Goals

  • Zero downtime migration — no planned outages during the transition
  • 90-day execution window aligned to the calendar of the next promotional cycle
  • Multi-region availability — deploy across two geographically isolated cloud regions for resilience
  • Event-driven core — services communicate via durable event streams, not synchronous HTTP calls

These goals were challenged internally. A competing proposal to scale vertically on GPU-backed database instances was costed at 3.2x the capex of the microservices re-architecture over a 36-month horizon, and it would have left the behavioral complexity of the existing system, along with its unknown failure modes, in place.

4. Approach

The team selected a strangler-fig migration pattern combined with an event-driven microservices architecture as their primary design paradigm.

The strangler-fig metaphor, popularized by Martin Fowler, describes a migration strategy in which a new system gradually grows around an old system, handling increasingly large slices of traffic until the old system is completely replaced and can be decommissioned. FreshGrowth would not attempt a big-bang replacement. Instead, each business domain would be extracted into a standalone service over a 90-day period, with the monolith progressively bypassed via feature flags until it could be shut down entirely.
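A minimal sketch of how a feature flag can shift traffic from the monolith to an extracted service one slice at a time. The article does not describe FreshGrowth's flag system, so the rollout table, the domain names, the percentages, and the functions `bucket_for` and `route` are all hypothetical and shown in Python for illustration only.

```python
import hashlib

# Hypothetical rollout table: share of traffic (0-100) that each extracted
# domain should receive. Names and percentages are illustrative only.
ROLLOUT_PERCENT = {"identity": 100, "catalog": 50, "orders": 0}


def bucket_for(customer_id: str) -> int:
    """Map a customer to a stable bucket in the range [0, 100)."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def route(domain: str, customer_id: str) -> str:
    """Return which backend should serve a request for this domain."""
    # Domains without a rollout entry stay on the monolith.
    percent = ROLLOUT_PERCENT.get(domain, 0)
    return "service" if bucket_for(customer_id) < percent else "monolith"
```

Hashing the customer ID keeps each customer on the same side of the flag between requests, so raising a percentage only ever moves customers from the monolith to the new service and never back and forth.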

The event-driven architecture (EDA) was chosen for three specific structural reasons: it introduced asynchronous decoupling between services, it allowed durable replay of events for observability and debugging, and it aligned naturally with the business’s existing operational workflows. Every meaningful action — order placed, payment confirmed, delivery scheduled — would produce a canonical event that downstream consumers could pick up at their own pace.
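The two properties named here, consumers reading at their own pace and durable replay, can be illustrated with an in-memory stand-in for a topic. This is a sketch under stated assumptions: `EventLog` and `Consumer` are hypothetical names, and a real deployment would rely on Kafka topics and consumer-group offsets rather than a Python list.

```python
from dataclasses import dataclass, field


@dataclass
class EventLog:
    """In-memory stand-in for a durable, append-only topic."""
    events: list = field(default_factory=list)

    def append(self, event_type: str, payload: dict) -> int:
        self.events.append({"type": event_type, "payload": payload})
        return len(self.events) - 1  # offset of the event just written

    def read_from(self, offset: int) -> list:
        return self.events[offset:]


@dataclass
class Consumer:
    """Tracks its own offset, so it can lag or replay independently."""
    name: str
    offset: int = 0

    def poll(self, log: EventLog) -> list:
        batch = log.read_from(self.offset)
        self.offset += len(batch)
        return batch

    def rewind(self) -> None:
        # Replay from the start, e.g. to debug or rebuild derived state.
        self.offset = 0
```

Because the producer only appends and each consumer owns its offset, a slow billing consumer never blocks order placement, and any consumer can be rewound without affecting the others.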

5. Implementation

The implementation plan was structured into three consecutive 30-day phases: Phase 1 (Foundation), Phase 2 (Core Services Extraction), and Phase 3 (Orchestration, Observability, and Cutover).

Phase 1: Foundation — Days 1–30

Week 1: Infrastructure and Event Backbone. The first week was devoted entirely to building the shared infrastructure layer that all subsequent services would depend on. The team:

  • Provisioned a Kubernetes cluster (EKS) in the primary AWS region (eu-west-1) with a warm standby in eu-central-1 for multi-region failover
  • Provisioned a Kafka cluster (MSK) with 3 internal brokers per region and topic replication across regions for failure isolation
  • Provisioned Redis Enterprise as the primary caching and session store, replacing the monolithic single-instance Redis deployment
  • Provisioned PostgreSQL 15 in a Highly Available configuration using AWS Aurora, with read replicas in both regions
  • Established full mTLS across all internal service mesh traffic using Istio
  • Deployed OpenTelemetry collectors in each region to aggregate traces, metrics, and logs to a centralized Grafana Cloud observability stack
  • Built a CI/CD pipeline (GitHub Actions) with automated image scanning, integration test suites, and blue/green deployment support

The event backbone was the keystone of the entire architecture. Five foundational event schemas were defined using Apache Avro and registered in a Schema Registry:

  1. CustomerRegistered: fired when a customer account is created
  2. ProductInventoryUpdated: fired on any inventory change (stock received, reserved, expired, or sold)
  3. OrderPlaced: contains full order context, customer identity, line items, and pricing
  4. PaymentConfirmed: fired after the payment gateway confirms funds received
  5. DeliveryScheduled: fired when a delivery slot is assigned to an order

These schemas were versioned, with backward-compatibility rules preventing destructive changes. Any breaking schema change required a new versioned topic and a staged co-existence period before deprecated topics could be retired.
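To make the backward-compatibility rule concrete, here is a simplified check in the spirit of Avro schema resolution: a new schema is backward compatible if it can still read records written with the old one. This is an illustrative sketch, not the Schema Registry's implementation. The function name and the example fields (`order_id`, `customer_id`, `channel`) are assumptions, and the check only covers flat record schemas.

```python
def backward_compatibility_errors(old_schema: dict, new_schema: dict) -> list:
    """Return reasons why new_schema cannot read records written with old_schema.

    Simplified: handles flat record schemas only and treats every type change
    as breaking (real Avro resolution also permits some type promotions).
    """
    old_fields = {f["name"]: f for f in old_schema["fields"]}
    errors = []
    for new_field in new_schema["fields"]:
        name = new_field["name"]
        old_field = old_fields.get(name)
        if old_field is None:
            # A field the old writer never produced needs a default value.
            if "default" not in new_field:
                errors.append(f"new field '{name}' has no default")
        elif old_field["type"] != new_field["type"]:
            errors.append(f"field '{name}' changed type")
    # Fields dropped from new_schema are fine: the reader simply ignores them.
    return errors


# Hypothetical OrderPlaced schemas, for illustration only.
ORDER_PLACED_V1 = {"type": "record", "name": "OrderPlaced", "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "customer_id", "type": "string"},
]}
ORDER_PLACED_V2 = {"type": "record", "name": "OrderPlaced", "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "customer_id", "type": "string"},
    {"name": "channel", "type": "string"},  # no default: breaking
]}

print(backward_compatibility_errors(ORDER_PLACED_V1, ORDER_PLACED_V2))
# ["new field 'channel' has no default"]
```

Under the rule described above, a change like `ORDER_PLACED_V2` would not be allowed on the existing topic; it would have to go to a new versioned topic instead.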

Phase 2: Core Services Extraction — Days 31–60

Weeks 5–6: Identity and Product Catalog. The first two services to be extracted were the lowest-risk, highest-isolation candidates.

The Identity Service handled authentication, session management, JWT issuance, customer profile storage, and OAuth integration. The team extracted it by:

  1. Defining a read-through cache interface that queried the Identity Service with a fall-back to the monolith database during the transition
  2. Implementing CDC (Change Data Capture) using Debezium to stream customer table mutations into Kafka as CustomerRegistered and CustomerProfileUpdated events
  3. Replacing Devise session management with token-based auth (JWT + refresh tokens) served exclusively by the Identity Service
  4. Adding a feature flag to flip traffic from monolith auth to Identity Service at the edge, running both in parallel for 10 days while Google Analytics monitored client-side authentication success rates
  5. Removing the monolith auth path entirely once the error rate dipped below 0.02%
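Step 1 above, a read-through cache that prefers the Identity Service and falls back to the monolith database, might look roughly like the following. This is a sketch under assumptions: `CustomerReader` and the two lookup callables are hypothetical stand-ins for the real clients, and cache expiry and invalidation are omitted.

```python
from typing import Callable, Optional

# A lookup takes a customer ID and returns a profile dict, or None if absent.
Lookup = Callable[[str], Optional[dict]]


class CustomerReader:
    """Read-through cache: cache first, then Identity Service, then monolith."""

    def __init__(self, identity_service: Lookup, monolith_db: Lookup):
        self._identity_service = identity_service
        self._monolith_db = monolith_db
        self._cache: dict = {}

    def get(self, customer_id: str) -> Optional[dict]:
        if customer_id in self._cache:
            return self._cache[customer_id]
        try:
            profile = self._identity_service(customer_id)
        except Exception:
            # During the transition, treat a service error like a miss.
            profile = None
        if profile is None:
            profile = self._monolith_db(customer_id)
        if profile is not None:
            self._cache[customer_id] = profile
        return profile
```

The fallback is what lets both paths run in parallel: a customer row not yet streamed into the new service is still served from the monolith's data, so the flag can be flipped before the backfill is complete.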

The Product Catalog Service followed a similar approach but added “snapshot tables”: expensive aggregations (per-category inventory counts, regional availability) were pre-computed and cached, reducing catalog page load times by 73%.

Weeks 7–8: Order Orchestration Service. This was the first truly complex extraction and consumed more engineering time than identity and catalog combined.

The challenge with the order service was transactional correctness. Placing an order in the monolith was a single Rails transaction that spanned multiple tables and ensured that inventory reservation, order creation, payment initiation, and confirmation were atomic. Extracting this behavior into an event-driven service required redesigning the core ordering workflow as an event-sourced state machine, where each order proceeds through a series of well-defined states (Cart, ItemReserved, PaymentInitiated, PaymentConfirmed, FulfillmentScheduled, Dispatched, Delivered), with each state transition producing a durable event.

The implementation used Event Sourcing with the Axon Framework in Kotlin (selected for type safety, JVM maturity, and its deep integration with Kafka Streams). Every state transition was an append-only event written to Kafka. The current state of any order could be reconstructed at any point by replaying its event stream from the beginning.
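The replay idea can be sketched in a few lines. The team's implementation used Axon in Kotlin; the Python below is only an illustration of the technique and does not use Axon's API. The state names come from the article, while the `to_state` event field, the strictly linear transition table, and the function names are assumptions.

```python
STATES = ["Cart", "ItemReserved", "PaymentInitiated", "PaymentConfirmed",
          "FulfillmentScheduled", "Dispatched", "Delivered"]

# Assumption: a strictly linear happy path. Cancellations, refunds, and
# payment failures, which a real order workflow needs, are left out.
NEXT_STATE = dict(zip(STATES, STATES[1:]))


class InvalidTransition(Exception):
    pass


def apply_event(state: str, event: dict) -> str:
    """Apply one event to the current state, rejecting illegal jumps."""
    target = event["to_state"]
    if NEXT_STATE.get(state) != target:
        raise InvalidTransition(f"{state} -> {target}")
    return target


def replay(events: list) -> str:
    """Rebuild an order's current state from its full event stream."""
    state = "Cart"
    for event in events:
        state = apply_event(state, event)
    return state


history = [{"to_state": "ItemReserved"}, {"to_state": "PaymentInitiated"}]
print(replay(history))  # PaymentInitiated
```

Nothing stores the current state directly: it is always derived from the stream, which is why an order can be inspected as of any earlier point by replaying only a prefix of its events.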

A critical innovation was the introduction of an Idempotency Layer using a combination of request-scoped UUIDs and Kafka’s transactional producer guarantees. Any duplicate or replayed event would be silently de-duplicated at the consumer level using a deduplication index in DynamoDB. This eliminated the class of double-charge and double-fulfill errors that had caused recurring production issues during prior promotional events.
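Consumer-side de-duplication of this kind can be sketched as follows. This is a simplified illustration, not the team's code: the in-memory set stands in for the DynamoDB deduplication index, and `IdempotentConsumer`, `make_event`, and the `idempotency_key` field are hypothetical names.

```python
import uuid


def make_event(order_id: str) -> dict:
    """Build an event carrying a request-scoped idempotency key."""
    return {"idempotency_key": str(uuid.uuid4()), "order_id": order_id}


class IdempotentConsumer:
    """Runs a handler at most once per idempotency key."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()  # stand-in for the DynamoDB deduplication index

    def handle(self, event: dict) -> bool:
        key = event["idempotency_key"]
        if key in self._seen:
            return False  # duplicate or replayed event: drop it silently
        self._handler(event)
        self._seen.add(key)  # record the key only after the handler succeeded
        return True


charges = []
consumer = IdempotentConsumer(charges.append)
event = make_event("o-1")
consumer.handle(event)
consumer.handle(event)  # redelivery of the same event
print(len(charges))     # 1
```

The check-then-record step here is only safe within a single process. With a shared store, the key would need to be claimed atomically, for example with a conditional write that fails if the key already exists, so that two consumers cannot both process the same event.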

Phase 3: Orchestration, Observability, and Cutover — Days 61–90

Weeks 9–10: Delivery Orchestration and Billing. The remaining two services, the Delivery Orchestration Engine (DOE) and the Billing Service, were extracted using the patterns established in prior phases. The DOE consumed DeliveryScheduled events to route driver assignments, manage capacity planning, and push ETA updates via WebSocket to customer-facing real-time order tracking pages. The Billing Service consumed PaymentConfirmed events to trigger invoice generation, handle prorated subscription changes, apply voucher logic, and emit InvoiceGenerated events for downstream reporting pipelines.

Week 11: Feature flag decommissioning and monolith sunsetting. By Day 80, the monolith was no longer handling any live customer traffic. The only traffic still passing through it was a slowly decaying legacy API path used exclusively by a third-party loyalty platform. The team set a deadline and communicated it to the loyalty platform vendor.

Week 12: Load testing, cutover vetting, and Black Friday rehearsal. The final 10 working days were consumed entirely by load testing and a dress rehearsal of the full Black Friday promotional flow, with artificially inflated order volumes to validate that the system held up under 10,000 RPS (4x the 2,500 RPS sustained target).

  • The Kubernetes autoscaler was tuned with more aggressive scale-up thresholds and cooldown overrides
  • Circuit-breaker patterns were added to the Istio service mesh config for all inter-service HTTP calls, with invocation attempts capped at three
  • Non-essential API endpoints (analytics dashboards, admin search) were placed on secondary capacity pools during promotional events
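The circuit-breaker pattern mentioned in the list above was applied in the service mesh configuration, not in application code. The sketch below shows the same idea at the application level, purely to illustrate how a breaker behaves. The class, the threshold of three consecutive failures, and the 30-second reset window are assumptions, not values from the article.

```python
import time


class CircuitOpenError(Exception):
    pass


class CircuitBreaker:
    """Fail fast after repeated errors, then allow a trial call after a pause."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            self._opened_at = None  # half-open: let one trial call through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        self._failures = 0  # any success closes the circuit fully
        return result
```

Failing fast matters under promotional load because a struggling downstream service stops receiving calls it cannot answer, which prevents the cascading connection saturation described in section 2.1.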
