From Monolith to Microservices: How We Cut API Response Times by 80%

When our e-commerce platform buckled under Black Friday traffic, we knew the monolith had to go. This is the story of how a small engineering team pulled off a nine-month microservices migration, rebuilt critical paths, and emerged with systems that could handle 10x peak traffic without flinching.

## Overview

At the close of 2022, Webskyne clients were processing some of the highest transaction volumes in their history. Black Friday and Cyber Monday pushed our monolithic e-commerce backend (known internally as "the beast") well beyond anything it had ever handled. API response times climbed past 3,200 milliseconds during peak windows. Error rates hit 2.7%. Cart abandonment surged 18% year-over-year during the busiest hours of the sale.

The board wanted answers. Engineering knew what the answer had to be: the monolith had to go.

This case study documents the nine-month microservices migration we executed between January and September 2023. We moved a critical-path PHP / Laravel monolith processing $4.8M in monthly transactions into a distributed, containerised system built on Node.js and Go, orchestrated with Kubernetes and fronted by an API gateway. The result: 80% faster API responses on average, a 35% drop in monthly infrastructure spend, and a system that handled Black Friday 2023, our highest traffic event to date, without a single critical incident.

## The Challenge

Our monolith had been a faithful workhorse. Five years of continuous development had layered features, integrations, and patches one on top of another. By 2023, the codebase spanned roughly 220,000 lines of PHP, with controllers, models, and business logic densely interwoven. New features that should have taken days took weeks. Deployments were a weekly ritual requiring a 47-person approval chain and a 90-minute maintenance window. Testing? End-to-end test suites took over four hours to run. Pair that with business pressure to ship new payment methods, loyalty programs, and international checkout flows at a blistering pace, and the monolith had become the single most dangerous constraint on revenue growth.

The technical debt was visible in every metric that mattered. Cold-start response times for the product catalog spiked above 8 seconds during flash sales.
Order processing throughput peaked at around 1,200 requests per second, well below the 4,500 RPS we required for the holiday season. Database connection utilisation reached 78% under load, with frequent connection pool exhaustion errors that brought the checkout flow to a standstill. The PostgreSQL primary was running on a single RDS instance with no read replicas, and every query, from product detail pages to order history, hammered that single node.

Adding personnel did not help in the conventional way. Each new engineer spent weeks navigating unfamiliar, undocumented sections of the codebase before contributing meaningfully. Code review times stretched to five days for complex pull requests. The team's morale was suffering because they could not ship work at the pace the business demanded.

## Goals

We established a clear set of objectives before writing a single line of new architecture. The goals were non-negotiable and measurable:

1. **Performance.** Reduce p95 API response times from 3,200ms to under 400ms. Increase order processing throughput to support 6,000 RPS sustained.
2. **Availability.** Achieve 99.95% uptime during peak traffic events, which allows less than 22 minutes of downtime per month.
3. **Developer velocity.** Reduce feature deployment cycle time from 14 days to 3 days or fewer for non-infrastructural changes.
4. **Cost efficiency.** Reduce monthly infrastructure spend by at least 20% at peak scale through right-sizing and elimination of over-provisioned resources.
5. **Team autonomy.** Enable squads to own and deploy services independently, without waiting weeks on dependency gates from other squads.

Every architectural decision was reviewed against these five goals. If a proposed approach did not move us toward one or more of them, it was reconsidered.

## Approach

Our migration philosophy was driven by a single principle: decompose without breaking the live product.
The Strangler Fig pattern provided our migration backbone: wrap the existing monolith, incrementally redirect traffic to new services, and then retire the old code path. This allowed the platform to remain fully operational throughout the entire nine-month migration window.

We chose an event-driven architecture with Apache Kafka as the backbone of the new system. Events such as `product_published`, `order_created`, `payment_confirmed`, and `inventory_reserved` became the fundamental units of communication between services. This decoupled services from direct dependency on one another and gave us an audit trail of every state transition by design.

On the data layer we implemented the database-per-service pattern. Each microservice owned its own data store, eliminating the shared-schema coupling that had made the monolith so difficult to change. The Catalog service used PostgreSQL with read replicas, the Cart service was backed by Redis for sub-millisecond read/write latency, and the Order service used PostgreSQL with CDC (Change Data Capture) for eventual consistency with adjacent services.

On the infrastructure side we containerised every service in Docker and ran 120 replicas across a seven-node Kubernetes cluster with horizontal pod autoscaling (CPU-triggered at 60% utilisation). An Istio service mesh provided mTLS, circuit breaking, and fine-grained observability without requiring changes to application code. Redis was deployed on a three-node managed cluster, and Kafka ran on three brokers across availability zones.

This was a deliberate, staged rollout rather than a big-bang rewrite. We split the migration into three phases: infrastructure foundations and shared services (Weeks 1–4), cart and order optimisation (Weeks 5–16), and payment, notification, and full switchover (Weeks 17–36). This cadence allowed us to prove the platform with each migrated service before proceeding.
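
The incremental redirect at the heart of the Strangler Fig pattern can be sketched as a small routing decision. This is an illustration only, not our Kong configuration: the route table, the service names, the rollout percentages, and the use of a session ID as the bucketing key are all assumptions made for the example.

```python
import hashlib

# Hypothetical route table: path prefix -> (new service name, rollout percent).
# Prefixes and percentages are illustrative, not production values.
MIGRATED_ROUTES = {
    "/cart": ("cart-service", 25),
    "/orders": ("order-service", 0),
}


def bucket(key: str) -> int:
    """Map a stable key (for example a session ID) to a bucket in [0, 100)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def choose_backend(path: str, session_id: str) -> str:
    """Return the backend that should serve this request."""
    for prefix, (service, percent) in MIGRATED_ROUTES.items():
        if path == prefix or path.startswith(prefix + "/"):
            # The same session always hashes to the same bucket, so a user
            # does not flip between old and new code paths mid-session.
            return service if bucket(session_id) < percent else "monolith"
    # Anything not yet migrated still goes to the monolith.
    return "monolith"
```

Raising a prefix's percentage from 0 to 100 moves its traffic over gradually; once it reaches 100 and stays healthy, the matching code path in the monolith can be retired.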
## Implementation

Phase 1 (Weeks 1–4) focused on building the platform so the team could ship at speed. We provisioned the Kubernetes cluster, configured CI/CD pipelines using GitHub Actions, and stood up the foundational shared services that no feature could ship without: a central Auth service with JWT-based authentication, an API gateway using Kong for request routing and rate limiting, and a structured logging pipeline shipping JSON logs to a centralised Loki + Prometheus + Grafana observability stack.

By the end of Phase 1, every developer on the team had a local Kubernetes cluster, and our deployment pipeline was running integration tests against a staging environment that closely mirrored production. More importantly, the observability stack told us within seconds whether a deployment was healthy or needed a rollback, information that had previously taken hours to infer from logs and alerting dashboards.

Phase 2 (Weeks 5–16) shipped the first revenue-impacting services: Cart and Order. The Cart service was a natural first target: it was read-heavy, latency-sensitive, and based on session cookies that made isolation straightforward. We chose Redis as the backing store for the new Cart service, mapping session IDs to cart records with a 30-day TTL. The rollout was gradual rather than a hard cutover: we mirrored incoming cart operations to both the new service and the monolith, compared responses, and then shifted a growing percentage of production traffic to the new service before committing fully.

The Order service required more careful thought. Orders are the heart of the business, and a data integrity failure here would be catastrophic. We implemented an async order pipeline flowing from Inventory Reservation → Order Creation → Payment Intent → Order Confirmation → Notification, with Kafka topics as the durable channel between each step. Each stage had its own database, its own dead-letter queue, and its own retry policy.
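
The per-stage retry and dead-letter behaviour described above can be sketched in memory as follows. This is a simplified illustration, not our production code: the `Stage` class, the handler functions, and the attempt counts are hypothetical, plain Python lists stand in for Kafka topics, and a real implementation would add backoff between attempts.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Stage:
    """One pipeline step with its own retry policy and dead-letter queue."""
    name: str
    handler: Callable[[dict], dict]
    max_attempts: int = 3
    dead_letters: List[dict] = field(default_factory=list)

    def process(self, event: dict) -> Optional[dict]:
        """Run the handler, retrying on failure; park the event if every attempt fails."""
        last_error: Optional[Exception] = None
        for _ in range(self.max_attempts):
            try:
                return self.handler(event)
            except Exception as exc:  # sketch only: real code would catch narrower errors
                last_error = exc
        # Nothing is silently dropped: the failed event is kept for inspection or replay.
        self.dead_letters.append(
            {"event": event, "error": str(last_error), "attempts": self.max_attempts}
        )
        return None


def run_pipeline(stages: List[Stage], event: dict) -> Optional[dict]:
    """Pass the event through each stage in order; stop at the first dead-lettered stage."""
    current: Optional[dict] = event
    for stage in stages:
        current = stage.process(current)
        if current is None:
            return None
    return current


# Hypothetical handlers for the first two steps of the order flow.
def reserve_inventory(event: dict) -> dict:
    return {**event, "reserved": True}


def create_order(event: dict) -> dict:
    return {**event, "order_id": "ord-1"}
```

Calling `run_pipeline([Stage("inventory", reserve_inventory), Stage("order", create_order)], {"sku": "A1"})` returns the enriched event. If a stage exhausts its attempts, the event ends up in that stage's `dead_letters` list and later stages never see it.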
The monolith was updated to publish order events to Kafka, so that the old and new order pipelines could coexist during the transition.

By Week 16, both Cart and Order services were handling 100% of production traffic. The result was immediate and compelling: catalog page p95 response time dropped from 3,200ms to 210ms. Order creation latency fell from 1,800ms to 145ms during peak traffic.

Phase 3 (Weeks 17–36) tackled Payment, Notification, and the final monolith decomposition. Payment integration was the highest-risk surface in the entire migration. Webhook reliability, idempotency keys, multi-gateway routing, and PCI-compliance constraints required a service that was robust out of the gate. We chose Go for the Payment service because its performance profile, static typing, and minimal memory footprint aligned well with the low-latency requirements of PCI Token Vault operations. The Notification service, built in NestJS, consumed Kafka order events to send transactional email, SMS, and push notifications, with built-in retry logic and a dead-letter queue for persistent failures.

The most technically demanding piece of Phase 3 was the monolith decomposition. Rather than abruptly redirecting all monolith traffic, we used feature flags to gradually migrate product catalog and customer profile reads to the Catalog and Profile services. Non-revenue paths (admin dashboards, analytics jobs, reporting) were retired first, freeing the monolith of non-critical traffic before we handled the final redirects. The full switchover from the monolith to the microservices architecture happened during a planned four-hour maintenance window on September 17th, 2023. During that window, the remaining monolith data was synchronised to the new services via a Lua-based CDC script that streamed changes from the PostgreSQL write-ahead log into MongoDB staging stores, which each service consumed during startup.

## Results

The metrics tell the story clearly.
Performance improvements were transformative. The p95 API response time across all endpoints fell from 3,200ms to 450ms, an 86% reduction. Peak throughput increased from 1,200 RPS to 6,400 RPS sustained, a 433% improvement. Checkout conversion rate, a direct revenue metric driven by page speed, climbed from 1.8% to 3.1%, a 72% improvement in the metric business leaders cared about most.

Availability and reliability benefits were equally impressive. We achieved 99.97% monthly uptime across the new platform, above our goal of 99.95%, with no critical incidents during Black Friday 2023, the highest-traffic event in company history. Circuit breakers in the service mesh let services degrade gracefully under load. The dead-letter queue architecture meant no order was silently lost during payment provider outages. We processed 9,200 transactions per minute at peak during the sale, 283% above the previous Black Friday record, with median response time holding at 380ms.

Infrastructure efficiency surprised even us. Monthly cloud spend dropped from $24,700 to $16,100, a 35% reduction, despite the massive increase in compute capacity. Kubernetes horizontal pod autoscaling, right-sized instance types, and the move to Spot instances for non-critical workloads combined to deliver these savings. We were running more load for a smaller bill.

Developer velocity metrics moved unmistakably in the right direction. New features shipped in an average of 2.8 days from planning to production deployment, down from an average of 14 days. Pull request review times fell from a median of five days to under 24 hours. Ninety percent of deployments required no approval chain: automated tests, automated canary analysis, and automated rollback on failed health checks had eliminated most manual gates.
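
The automated canary analysis mentioned above boils down to comparing a canary's metrics against the stable baseline and choosing between promote, wait, and rollback. The sketch below shows one plausible shape for that decision. The `WindowStats` type, the function name, and every threshold are assumptions chosen for illustration, not the values we ran in production.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowStats:
    """Aggregated metrics for one deployment over an observation window."""
    requests: int
    errors: int
    p95_ms: float

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(
    baseline: WindowStats,
    canary: WindowStats,
    min_requests: int = 500,          # hypothetical: minimum sample before judging
    max_error_delta: float = 0.005,   # hypothetical: allowed error-rate increase
    max_latency_ratio: float = 1.2,   # hypothetical: allowed p95 slowdown factor
) -> str:
    """Return "wait", "rollback", or "promote" for the canary."""
    if canary.requests < min_requests:
        # Too little traffic to draw a conclusion either way.
        return "wait"
    if canary.error_rate - baseline.error_rate > max_error_delta:
        return "rollback"
    if baseline.p95_ms > 0 and canary.p95_ms > baseline.p95_ms * max_latency_ratio:
        return "rollback"
    return "promote"
```

In a pipeline, a `rollback` verdict would trigger the automated rollback and a `promote` verdict would shift the remaining traffic; wiring those actions to a deployment tool is outside the scope of this sketch.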
## Metrics at a Glance

| Metric | Pre-Migration | Post-Migration | Change |
|---|---|---|---|
| p95 API Response Time | 3,200ms | 450ms | -86% |
| Checkout Conversion Rate | 1.8% | 3.1% | +72% |
| Checkout Page Load (p95) | 4.8s | 0.6s | -87.5% |
| Monthly Uptime | 99.2% | 99.97% | +0.77pp |
| Monthly Cloud Spend | $24,700 | $16,100 | -35% |
| Feature Deploy Cycle Time | 14 days | 2.8 days | -80% |
| Team Scrum Velocity (pts) | 28 | 62 | +121% |
| Authenticated RPS Sustained | 1,200 | 6,400 | +433% |
| Error Rate (production) | 2.7% | 0.08% | -97% |
| DB Write Primary CPU | 78% avg / 100% peak | 31% avg / 57% peak | -60% |

## Implementation Notes and Technical Debt

No migration survives contact with reality entirely intact, and ours was no exception. Three specific decisions required course-correction, and each taught us something that now shapes how we evolve the platform.

The first was the Kafka topic design. Early in the migration, we created one topic per service per event type. This achieved event isolation but left us with 48 Kafka topics across five services. Managing topic configuration, partitioning strategy, and retention policies became a CI/CD coordination problem. We later consolidated to 12 logical groups by domain entity, cutting topic management overhead by 75% while maintaining service-level data isolation through partition ownership. The lesson: protocol overhead multiplies faster than team overhead as a system scales.

The second was the Cart service Redis migration. Our initial Redis cluster sizing underestimated cart write concurrency during flash sales by a factor of four. The first two flash sales after launch triggered connection pool saturation in Redis, with latency climbing past 800ms during write spikes.
We replaced the client-side connection pooling with an evented Redis client, added Redis pipelining for bulk cart operations, and moved to a six-node Redis cluster with automatic failover configured for 50ms RPO and 60-second RTO. The lesson: load-test with traffic profiles that exceed planned peak, then provision at 3x.

The third was the team structure problem. Microservices delivered on the promise of deployment independence, but without aligned ownership they started creating invisible dependencies. Two sprint cycles went by before we realised that a Profile service schema change required coordinated releases of three downstream services. We introduced RFCs for any contract change impacting more than one service, a schema registry (Apicurio Registry) for event schemas with backward compatibility enforcement, and a shared dependency graph accessible from the internal dashboard. The lesson: microservices free teams from deployment coupling but not from coordination coupling; the latter is a people problem, not an architecture problem.

## Key Lessons Learned

After nine months, three incident-free holiday peaks, and one complete platform transition, several lessons have become guiding principles for how we work.

**1. Strangler Fig beats big-bang every time.** Starting with the most painful, high-traffic services (Cart, Order) gave us early, large-scale wins that built internal confidence and delivered measurable product value before the project was half-done. A big-bang rewrite would have deferred all returns to a single point of complete cutover, a risk we could not justify.

**2. Observability is not optional, and it is not a distraction.** The four weeks before the Loki + Prometheus + Grafana stack went live were the most expensive of the migration. Without real-time visibility, we were debugging in the dark, deploying slowly, and over-provisioning everything because we had no way of knowing what utilisation looked like.
The observability stack paid for itself in deployment risk reduction within the first month.

**3. Event-driven architecture demands schema discipline.** In a synchronous request/response world, an API contract is a contract. In an event-driven world, the contract lives in both the producer schema and the consumer schema, and a producer emitting one schema version while a consumer expects another is a recipe for silent data corruption. We now require every Kafka event schema to be registered in Apicurio Registry before the producer touches production.

**4. Smaller defects, faster recovery.** The new platform can tolerate individual service failures because circuit breakers, retries, and dead-letter queues protect against cascading errors. It does not, however, tolerate architectural coupling masked by shared databases or synchronous cross-service requests. The discipline of event boundaries is what makes graceful degradation work.

**5. Infrastructure as code is the foundation everything else rests on.** Terraform modules for the Kubernetes cluster, Redis, Kafka, and all cloud resources made environment parity between staging and production reproducible and auditable. When we needed to spin up a complete replica of the production environment for a disaster recovery test, it took 38 minutes, and the only manual step was approving the Terraform plan.

## Looking Forward

The migration to microservices was never an end state. It was a foundation for what came next. The event-driven platform has opened several new capabilities we could not have imagined in the monolith: real-time inventory sync with suppliers, personalised recommendation engines driven by Kafka stream processing, and a gradual shift toward GraphQL federation at the API layer to support our new mobile-first customer experience.

The platform is healthier, faster, and cheaper, but the sustained benefit comes not from the technology choices themselves.
It comes from the team's ability to reason about services, own them, change them, and deploy them independently. That is the real win. Technology is the vehicle; the capability is the destination.
