Webskyne

19 May 2026 · 9 min read

How UrbanCart Rebuilt Its Platform in 12 Weeks: From Legacy Monolith to AWS Microservices

When UrbanCart's monolithic e-commerce system began buckling under Black Friday traffic, the engineering team had three months to rearchitect a platform handling ₹2.4 Crores in monthly GMV—without taking the site offline. Here is how they moved from a failing monolith to a fault-tolerant AWS microservices fabric, and the surprising lessons along the way. This in-depth case study covers the full journey from diagnosis to production rollout.

Case Study · Tags: microservices, AWS, cloud-architecture, node-js, devops, e-commerce, software-engineering, digital-transformation

Overview

UrbanCart is a mid-sized consumer electronics marketplace operating across India, serving approximately 1.8 million registered users and 12,000 active merchants. Before the migration described in this case study, the entire platform (catalog, cart, checkout, payments, inventory, order tracking, and notifications) ran as a single 140,000-line Node.js monolith deployed on a pair of EC2 instances behind a Classic Load Balancer. The architecture had served the company well through four years of steady growth, but by late 2025 it was showing the unmistakable signs of strain that technical debt produces when left unattended.

The engineering team, then comprising 17 full-time developers, made the decision to pursue a full microservices migration in early January 2026. Twelve weeks later, UrbanCart was running six decoupled services on Amazon ECS, backed by RDS Aurora, Redis Cluster, SNS/SQS, and CloudFront, operating at approximately 38% lower infrastructure cost per transaction while absorbing a 280% spike in peak traffic without a single deployment-related incident. This case study reconstructs that journey in detail.

The Challenge

The immediate catalyst for the migration was the impending Black Friday–Christmas shopping window. In November 2025, UrbanCart experienced a cascading failure during a promotional sale: the monolith's database connection pool was exhausted under a sustained load of 3,500 concurrent users, bringing down cart, checkout, and inventory simultaneously. Recovery took 47 minutes. The finance team later estimated the revenue lost in that window at approximately ₹14 lakhs.

But the problem extended beyond a single incident. Deployments had become a source of genuine anxiety: any push to the monolith required full regression testing across all modules, with a mandatory 90-minute maintenance window. Hotfixes to one subsystem risked introducing regressions in unrelated domains. New engineers took an average of six weeks to become productive on the codebase. And the lead time for new feature development had grown so long that the product team had begun deprioritizing agreed-upon roadmap items simply because engineering couldn't deliver them in a practical timeline.

Goals

The migration program—internally codenamed "Project Skyline"—was scoped around four explicit, measurable goals before a single line of new code was written:

Resilience under load: Achieve 99.9% uptime during peak promotional windows with graceful degradation. A failure in the payments service, for example, must not take down product recommendations or search.

Deployment independence: Enable any service to be deployed without requiring coordination across the entire team. Target time-to-production for low-risk changes: under two hours from PR merge to live.

Scalability by service: The checkout service, historically the most resource-intensive component, must scale independently of the catalog or notification services—reducing wasteful overprovisioning.

Minimal disruption: The migration must not require a full platform downtime event. All changes must be deployed incrementally, preserving the live production environment throughout.

Approach: Strangler Fig Architecture

Rather than attempting a "big bang" rewrite, a strategy that has destroyed more engineering teams than it has saved, the team adopted the Strangler Fig pattern: new services would be built alongside the monolith, feature by feature, with a reverse proxy (Amazon API Gateway) routing traffic to either the new service or the legacy system depending on the feature being accessed. This allowed incremental migration with the ability to roll back instantly to the monolith path at any point.
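
The routing idea behind the Strangler Fig pattern can be illustrated with a minimal Python sketch. It is not UrbanCart's API Gateway configuration: the path prefixes, the service names, and the `StranglerRouter` class are hypothetical, and a plain dictionary stands in for the proxy's route table.

```python
from dataclasses import dataclass, field

MONOLITH = "monolith"  # hypothetical name for the legacy upstream


@dataclass
class StranglerRouter:
    """Maps URL path prefixes to the upstream that currently owns them."""

    migrated: dict[str, str] = field(default_factory=dict)

    def migrate(self, prefix: str, service: str) -> None:
        # Start sending this feature's traffic to the new service.
        self.migrated[prefix] = service

    def roll_back(self, prefix: str) -> None:
        # Dropping the entry sends the feature back down the monolith path.
        self.migrated.pop(prefix, None)

    def resolve(self, path: str) -> str:
        # Longest matching prefix wins; unmatched paths stay on the monolith.
        matches = [p for p in self.migrated if path.startswith(p)]
        if not matches:
            return MONOLITH
        return self.migrated[max(matches, key=len)]


router = StranglerRouter()
router.migrate("/catalog", "catalog-service")
print(router.resolve("/catalog/sku/123"))
print(router.resolve("/checkout"))
router.roll_back("/catalog")
print(router.resolve("/catalog/sku/123"))
```

Because the monolith is the default for every unmatched path, a rollback is a single route deletion rather than a redeploy.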

The six core services were identified, prioritized, and sequenced based on business impact and technical complexity:

  1. Catalog & Search — Elasticsearch-backed product search, category navigation, product detail.
  2. Cart & Wishlist — Session-backed cart, wishlist management, merge on login.
  3. Checkout & Payments — Address entry, order summary, payment gateway orchestration.
  4. Inventory — Stock tracking, low-stock alerts, reservation and release.
  5. Order Management — Order lifecycle, status transitions, history, returns.
  6. Notifications — Email, SMS, and in-app push via SNS.

Implementation

Implementation was structured in four delivery phases spanning six weeks, as detailed below, followed by two weeks of soak testing and performance validation before the final cutover.

Week 1 – Foundation and Infrastructure as Code

The first week was entirely infrastructure work: no features were built. The team used Terraform to define all AWS resources as code, committing the configuration to the same repository that held the application stack. This created an auditable, reproducible infrastructure definition that could spin up identical staging environments in under 20 minutes.

Key infrastructure decisions made this week included: ECS Fargate for container orchestration (eliminating the overhead of managing EC2 fleets), Aurora PostgreSQL 15 for the primary data store, ElastiCache Redis for session storage and caching hot product data, and CloudFront in front of API Gateway for global edge routing.

Weeks 2–3 – Catalog and Cart Services

The catalog service was the natural first production target: it is read-heavy (over 85% read traffic), had the simplest data model, and was responsible for the largest share of monolith CPU usage during peak load. A full-text search index was built on Amazon OpenSearch Service (the managed successor to Amazon Elasticsearch Service), fed by a CDC (Change Data Capture) pipeline from Aurora using Debezium. Product detail pages were served from a Redis cache with a 15-minute TTL, warmed on application startup with the 5,000 most-viewed SKUs.
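
The cache-aside read path with a TTL and a startup warm-up might look roughly like the Python sketch below. An in-memory dictionary stands in for Redis, and `TTLCache`, `warm`, `get_product`, and the `load_product` callback are hypothetical names introduced for illustration only.

```python
import time
from typing import Callable, Optional


class TTLCache:
    """In-memory stand-in for a Redis cache with per-key expiry."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: dict[str, tuple[float, dict]] = {}

    def set(self, key: str, value: dict) -> None:
        self._store[key] = (self.clock() + self.ttl, value)

    def get(self, key: str) -> Optional[dict]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired entries count as a miss
            return None
        return value


def warm(cache: TTLCache, top_skus: list[str], load_product: Callable[[str], dict]) -> None:
    # Pre-populate the cache at startup with the most-viewed SKUs.
    for sku in top_skus:
        cache.set(sku, load_product(sku))


def get_product(cache: TTLCache, sku: str, load_product: Callable[[str], dict]) -> dict:
    cached = cache.get(sku)
    if cached is not None:
        return cached
    product = load_product(sku)  # miss: fall through to the database
    cache.set(sku, product)
    return product


PRODUCT_TTL_SECONDS = 15 * 60  # the 15-minute TTL from the text
```

Injecting the clock keeps expiry testable without sleeping; a real deployment would rely on Redis's own key expiry instead.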

The cart service was more complex because it required session affinity and merge logic when anonymous users logged in. The team adopted Redis Hash structures to store cart data, enabling O(1) item lookup and atomic increment/decrement operations for quantity changes. The cart service communicates with the catalog service via gRPC for product metadata lookups, keeping the REST gateways free of cross-service synchronous blocking calls.
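
As an illustration of the hash-per-cart layout and the merge on login, here is a small Python sketch. It uses an in-memory dictionary in place of Redis, so it shows the data shape rather than real atomicity; the `CartStore` class, the key names, and the policy of summing quantities on merge are all assumptions not stated in the source.

```python
from collections import defaultdict


class CartStore:
    """In-memory stand-in for Redis hashes: one hash per cart, field = SKU, value = quantity."""

    def __init__(self) -> None:
        self._carts: dict[str, dict[str, int]] = defaultdict(dict)

    def change_quantity(self, cart_id: str, sku: str, delta: int) -> int:
        # Mirrors the Redis HINCRBY command: add delta, treating a missing field as 0.
        cart = self._carts[cart_id]
        new_qty = cart.get(sku, 0) + delta
        if new_qty <= 0:
            cart.pop(sku, None)  # mirrors HDEL once nothing is left
            return 0
        cart[sku] = new_qty
        return new_qty

    def items(self, cart_id: str) -> dict[str, int]:
        return dict(self._carts.get(cart_id, {}))

    def merge_on_login(self, anon_id: str, user_id: str) -> dict[str, int]:
        # Assumed policy: quantities for the same SKU are summed.
        for sku, qty in self.items(anon_id).items():
            self.change_quantity(user_id, sku, qty)
        self._carts.pop(anon_id, None)  # the anonymous cart is discarded
        return self.items(user_id)


store = CartStore()
store.change_quantity("cart:anon:42", "SKU-1", 2)
store.change_quantity("cart:user:7", "SKU-1", 1)
print(store.merge_on_login("cart:anon:42", "cart:user:7"))
```

Other merge policies (keep the larger quantity, or prefer the logged-in cart) are equally plausible; the source does not say which one UrbanCart chose.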

Both services were instrumented from day one: OpenTelemetry traces were captured with AWS X-Ray as the backend, and structured logs were shipped to CloudWatch Logs, queryable through Logs Insights, with a consistent JSON schema across all services. This instrumentation effort paid enormous dividends during performance tuning in later sprints.
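
A consistent JSON log schema can be enforced with a shared formatter. The Python sketch below uses only the standard `logging` module; the field names (`timestamp`, `level`, `service`, `message`, `trace_id`) are assumed for illustration, since the source does not list UrbanCart's actual schema.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Renders every record with the same set of keys, whatever the service."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
            # trace_id is attached per call via `extra`; None when absent.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


def build_logger(service: str) -> logging.Logger:
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid stacking duplicate handlers on re-import
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter(service))
        logger.addHandler(handler)
    return logger


log = build_logger("cart-service")
log.info("item added", extra={"trace_id": "abc123"})
```

Keeping the key set identical across services is what lets one query filter every service's logs by the same field.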

Weeks 4–5 – Checkout, Payments, and Inventory

Checkout was the highest-risk migration and therefore executed with the most deliberate testing rigour. The team built a feature flag system using AWS AppConfig that allowed routing individual users—or percentage-bucketed traffic slices—between the monolith checkout flow and the new service, enabling canary testing with live shoppers rather than synthetic test data.
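
Percentage-bucketed routing generally relies on a stable hash of the user ID, so that a given shopper always lands on the same side of the flag. The Python sketch below shows one common way to do this; it does not call AWS AppConfig, and the flag name, the `bucket_for` helper, and the allowlist parameter are hypothetical.

```python
import hashlib


def bucket_for(user_id: str, flag: str) -> int:
    """Deterministically map a user to a bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def use_new_checkout(
    user_id: str,
    rollout_percent: int,
    allowlist: frozenset[str] = frozenset(),
) -> bool:
    # Individually targeted users always get the new flow.
    if user_id in allowlist:
        return True
    # Everyone else is routed by bucket; raising the percentage only adds users.
    return bucket_for(user_id, "new-checkout") < rollout_percent


print(use_new_checkout("user-123", rollout_percent=10))
```

Including the flag name in the hash input keeps bucket assignments independent between flags, so the same users are not always the first to receive every canary.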

Payment routing was implemented as a resilient orchestration layer: the service attempts the primary payment gateway (Razorpay) with a 4-second timeout, falls back to a secondary provider (Cashfree) on timeout, and surfaces a graceful offline-payment option if both integrations fail. Eventual consistency is maintained through SNS/SQS sagas, so a payment success event reliably triggers inventory deduction and order creation even under partial system failure.
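
The timeout-and-fallback chain can be sketched with `asyncio.wait_for` in Python. The gateway callables here are placeholders rather than the Razorpay or Cashfree SDKs, and the 4-second timeout on the secondary provider and the treatment of `ConnectionError` as a failure are assumptions; the source specifies only the primary timeout.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical gateway signature: (order_id, amount_paise) -> payment reference
Gateway = Callable[[str, int], Awaitable[str]]


async def charge_with_fallback(
    order_id: str,
    amount_paise: int,
    primary: Gateway,
    secondary: Gateway,
    primary_timeout: float = 4.0,
    secondary_timeout: float = 4.0,  # assumed; not given in the text
) -> dict:
    attempts = (
        ("primary", primary, primary_timeout),
        ("secondary", secondary, secondary_timeout),
    )
    for name, gateway, timeout in attempts:
        try:
            reference = await asyncio.wait_for(gateway(order_id, amount_paise), timeout)
            return {"status": "paid", "via": name, "reference": reference}
        except (asyncio.TimeoutError, ConnectionError):
            continue  # try the next provider
    # Both providers failed: offer the offline-payment option instead of an error.
    return {"status": "offline_payment_offered", "via": None, "reference": None}
```

A production version would also need idempotency keys, so that a primary request that times out but later succeeds does not result in a double charge.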

The inventory service adopted a pessimistic-lock pattern for high-demand SKUs to prevent overselling, with a compensating release transaction triggered if a checkout flow is abandoned mid-purchase. Inventory levels are cached in Redis with a 10-second TTL, reducing read pressure on Aurora during flash sales.
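
Reservation with a compensating release can be modelled as in the Python sketch below. A per-SKU `threading.Lock` stands in for the database row lock that a pessimistic scheme would typically take (for example `SELECT ... FOR UPDATE` in PostgreSQL); the `Inventory` class and its method names are hypothetical.

```python
import threading


class InsufficientStock(Exception):
    pass


class Inventory:
    def __init__(self, stock: dict[str, int]) -> None:
        self._available = dict(stock)
        self._reserved: dict[str, tuple[str, int]] = {}  # reservation_id -> (sku, qty)
        self._locks = {sku: threading.Lock() for sku in stock}  # stand-in for row locks

    def available(self, sku: str) -> int:
        return self._available[sku]

    def reserve(self, reservation_id: str, sku: str, qty: int) -> None:
        with self._locks[sku]:  # only one checkout at a time may touch this SKU
            if self._available[sku] < qty:
                raise InsufficientStock(sku)
            self._available[sku] -= qty
            self._reserved[reservation_id] = (sku, qty)

    def confirm(self, reservation_id: str) -> None:
        # Payment succeeded: the stock is permanently deducted.
        self._reserved.pop(reservation_id, None)

    def release(self, reservation_id: str) -> None:
        # Compensating action for an abandoned checkout; safe to call twice.
        entry = self._reserved.pop(reservation_id, None)
        if entry is None:
            return
        sku, qty = entry
        with self._locks[sku]:
            self._available[sku] += qty
```

Making `release` idempotent matters because queue-driven compensations can be delivered more than once.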

Week 6 – Order Management and Notifications, Migration Cutover

The final two production services—order management and notifications—were implemented in week 6. Order management handles the complete order lifecycle through a state machine driven by Amazon SNS event notifications. When an order transitions to "shipped," an event is published to SNS, which fans out to the notifications service and an analytics pipeline in parallel without coupling the two downstream consumers.
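
The state machine and the fan-out can be sketched together in Python. Only the "shipped" status is named in the text; the remaining states and transitions are assumed for illustration, and the `Topic` class is an in-memory stand-in for an SNS topic that delivers sequentially rather than in parallel.

```python
from typing import Callable

# Assumed lifecycle; only "shipped" appears in the source.
TRANSITIONS: dict[str, set[str]] = {
    "created": {"paid", "cancelled"},
    "paid": {"shipped", "cancelled"},
    "shipped": {"delivered"},
    "delivered": {"returned"},
    "cancelled": set(),
    "returned": set(),
}


class Topic:
    """In-memory stand-in for an SNS topic with multiple subscribers."""

    def __init__(self) -> None:
        self._subscribers: list[Callable[[dict], None]] = []

    def subscribe(self, handler: Callable[[dict], None]) -> None:
        self._subscribers.append(handler)

    def publish(self, event: dict) -> None:
        for handler in self._subscribers:
            handler(dict(event))  # each consumer gets its own copy


class Order:
    def __init__(self, order_id: str, topic: Topic) -> None:
        self.order_id = order_id
        self.status = "created"
        self._topic = topic

    def transition(self, new_status: str) -> None:
        if new_status not in TRANSITIONS[self.status]:
            raise ValueError(f"illegal transition {self.status} -> {new_status}")
        self.status = new_status
        # The publisher knows nothing about who consumes the event.
        self._topic.publish({"type": f"order.{new_status}", "order_id": self.order_id})
```

Adding a third consumer is a new subscription on the topic; the order service itself does not change.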

Cutover to the new architecture was executed on a Saturday night with a two-hour maintenance window—allowing the team to drain live sessions from the monolith, point the API Gateway routing at the new services, and monitor for anomalies before re-opening the platform to all users. Over the following 48 hours, the team ran in "hypercare" mode with all senior engineers on call. Four small regressions were found and patched within hours; none caused customer-facing downtime.

Results

The migration delivered outcomes that exceeded every pre-defined goal. Below is a breakdown of the most significant results.

Uptime and traffic handling. UrbanCart ran through its first peak event post-migration, the New Year's Eve sales, with zero platform downtime. The checkout service scaled from 200 task replicas at baseline to 2,400 under peak load, driven automatically by ECS Service Auto Scaling responding to CPU utilization. P99 latency improved from 820ms on the monolith to 210ms on the new services.

Deployment velocity. Average time-to-production for feature changes dropped from 22 days (constrained by the monolith's release cycle) to approximately 72 hours across all six services. Deployment frequency increased 6x, with zero production incidents attributable to deployment changes in the following three months.

Operating cost. Fargate's serverless container model, combined with the ability to scale individual services independently, reduced overall infrastructure spend by approximately 38% compared to the prior fixed-infrastructure arrangement—despite handling nearly 3x the transaction volume.

Engineering experience. New-hire ramp-up time for contributing to service code fell from an average of six weeks to two weeks, driven by smaller, focused codebases and the shared infrastructure documentation built during the IaC phase of the project.

Key Metrics Summary

Metric | Before | After | Change
Platform uptime | 99.1% | 99.94% | ▲ +0.84pp
P99 latency | 820ms | 210ms | ▼ 74%
Deploy frequency | 2/month | 12/month | ▲ 6x
Mean time to recovery | 47min | 4.2min | ▼ 91%
Infrastructure cost/transaction | ₹4.12 | ₹2.55 | ▼ 38%
New hire ramp time | 6 weeks | 2 weeks | ▼ 67%
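
The Change column follows directly from the Before and After values. The short Python check below recomputes each figure from the numbers in the summary; the `pct_change` helper is just an illustrative name.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


rows = {
    "P99 latency (ms)": (820, 210),
    "Mean time to recovery (min)": (47, 4.2),
    "Infrastructure cost/transaction (INR)": (4.12, 2.55),
    "New hire ramp time (weeks)": (6, 2),
}

for name, (before, after) in rows.items():
    print(f"{name}: {pct_change(before, after):+.0f}%")

# Uptime is reported as a difference in percentage points, not a relative change.
print(f"Platform uptime: {99.94 - 99.1:+.2f}pp")
# Deploy frequency is reported as a multiple.
print(f"Deploy frequency: {12 / 2:.0f}x")
```

Rounded to whole percentages, the results match the figures reported in the summary.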

Lessons Learned

The most important lesson was also the most unexpected: the hardest part of the migration was not the technical work, but the organizational coordination required to run two systems in parallel for five weeks. Decision ownership across the old and new flows blurred, leading to disputes on several occasions, and having a single migration lead with formal authority to adjudicate routing decisions was worth more than the additional architecture-review hours invested in data contracts.

A second lesson worth emphasizing: invest in event schemas early. The team drafted and versioned event payloads for all inter-service communication—order.created, payment.succeeded, inventory.reserved—before writing a single consumer. This eliminated an entire class of data compatibility issues during the integration phase that teams in earlier migrations had lost weeks to.
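
A versioned event envelope with a small schema registry is one way to put this into practice. In the Python sketch below, the three event names come from the text, while the payload fields, the envelope attributes, and the `validate` helper are assumed for illustration.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Required payload fields per (event type, schema version). Field names are assumed.
SCHEMAS: dict[tuple[str, int], set[str]] = {
    ("order.created", 1): {"order_id", "user_id", "total_paise"},
    ("payment.succeeded", 1): {"order_id", "payment_reference"},
    ("inventory.reserved", 1): {"order_id", "sku", "quantity"},
}


@dataclass(frozen=True)
class Event:
    type: str
    version: int
    payload: dict
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    occurred_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def validate(event: Event) -> Event:
    """Reject events whose type/version is unknown or whose payload is incomplete."""
    required = SCHEMAS.get((event.type, event.version))
    if required is None:
        raise ValueError(f"unknown schema {event.type} v{event.version}")
    missing = required - set(event.payload)
    if missing:
        raise ValueError(f"{event.type} v{event.version} missing fields: {sorted(missing)}")
    return event


event = validate(Event("payment.succeeded", 1, {"order_id": "o-1", "payment_reference": "pay-9"}))
print(event.type, event.version)
```

Validating on the producer side means a malformed event fails at publish time, before any consumer has to cope with it.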

Finally, monitoring coverage was never perfect, and that is acceptable. The team accepted 80% instrumentation coverage at launch rather than delaying the cutover to achieve 100%. The missing 20% was filled incrementally over the following two months, without a single incident arising from an uninstrumented path. Perfect monitoring is the enemy of shipping—even for a high-stakes platform migration.

Conclusion

UrbanCart's migration from monolith to microservices on AWS was completed in 12 weeks with zero customer-facing downtime, under budget, and delivering results that exceeded every stated goal. The case demonstrates that with a disciplined Strangler Fig approach, feature-flagged canary routing, and a firm commitment to infrastructure as code, even a legacy platform carrying significant revenue risk can be rearchitected without unplanned downtime.

The most honest observation is perhaps this: the engineering team shipped the last service faster than expected, and the new platform has already quietly outlasted the original monolith in terms of expected longevity—a statistic the original architects never could have predicted in January 2026.
