22 April 2026 • 12 min
Scaling to 10 Million Users: A Cloud Architecture Transformation Case Study
When FastCart's user base exploded from 500,000 to 10 million within 18 months, their monolithic infrastructure crumbled under the pressure. This comprehensive case study details how Webskyne's engineering team rearchitected their entire platform from the ground up, implementing a microservices-based solution on AWS that not only survived the scaling crisis but reduced infrastructure costs by 47%. From database optimization to auto-scaling policies, from legacy code refactoring to implementing chaos engineering practices—the complete story of how one startup transformed technical debt into competitive advantage.
Overview
FastCart, a rapidly growing Indian e-commerce startup, found themselves at a crossroads in late 2025. What began as a promising regional player with 500,000 active users had exploded into a national phenomenon, adding an average of 500,000 new users every month. This explosive growth, while every startup's dream, had transformed into a nightmare—their Rails monolith, built in 2023 during simpler times, was collapsing under the load.
By October 2025, FastCart's systems were experiencing daily outages. Peak-hour latency exceeded 40 seconds for simple product catalog queries. Cart abandonment rates had jumped from 15% to 35%. Customer support tickets related to "app not working" had increased 400%. The founding team, initially resistant to major technical changes, finally authorized a complete platform transformation.
Webskyne's engagement with FastCart began in November 2025. Over the following five months, we executed what would become one of our most complex transformations—turning a struggling monolithic application into a resilient, auto-scaling microservices architecture capable of handling 10 million+ users. This case study documents every phase of the transformation: the challenges encountered, the decisions made, the technologies selected, and the remarkable results achieved.
The Challenge
The FastCart platform in late 2025 represented a common but severe case of technical debt accumulation. The original architecture, built quickly to capture market opportunity, had served its initial purpose but was never designed for the scale it had reached.
Monolithic Constraints
The core application was a Ruby on Rails monolith deployed on a single AWS t3.xlarge instance. This single-server approach meant that every component—user authentication, product catalog, shopping cart, order processing, payment integration, notifications—shared the same resources. When any one component experienced high load, it consumed resources from all others.
The most critical bottleneck was the product catalog database. The initial MySQL database, designed with minimal optimization, had grown to include 2.5 million products. Queries that once executed in 50 milliseconds now took 8-12 seconds. The development team had implemented various band-aid solutions—caching layers, query optimizations, read replicas—but these addressed symptoms rather than root causes.
Database Performance Crisis
Peak-hour database CPU utilization consistently exceeded 95%. The primary-replica read replication, implemented six months earlier, had reached its practical limits—the write bottleneck meant that even perfectly distributed reads couldn't compensate for write performance degradation. Connection pooling was saturating at 800+ connections, with new requests queuing and timing out.
The product search functionality, powered by MySQL LIKE queries, had become essentially non-functional during peak hours. Users reported searches timing out or returning no results. The development team had disabled several advanced filtering options to reduce query complexity, degrading the user experience.
Deployment Nightmares
Each deployment was a nerve-wracking affair requiring 3-4 hour maintenance windows. The tightly coupled codebase meant that even minor changes required full application redeployment. Rollbacks, when necessary, took equally long. The team had reduced deployment frequency from weekly to monthly—each deployment risking system stability for days afterward.
Incident Response Fatigue
The on-call rotation had become unsustainable. With daily incidents and constant firefighting, developer burnout was accelerating. Three senior engineers had already resigned in Q3 2025, citing the technical environment as a primary reason. The remaining team was exhausted, defensive, and resistant to making any changes that might introduce new variables.
Goals
Before writing a single line of code, we established clear, measurable objectives for the transformation. These goals would guide every architectural decision and serve as metrics for success.
Primary Objectives
Scalability to 15 Million Users: The platform had to support not just the current 10 million users but continued growth beyond them. We targeted a theoretical capacity of 15 million users, with practical auto-scaling to handle traffic spikes 3x above baseline.
Sub-Second Response Times: Our baseline target was p99 response times under 500 milliseconds for all core operations during peak hours, a roughly 99% improvement over the 40+ second latencies being experienced.
Zero-Downtime Deployments: The new architecture had to support continuous deployment with zero downtime. Feature releases should not impact users, and failed deployments should be rolled back instantly without service interruption.
Infrastructure Cost Efficiency: Despite increased capacity, we targeted a 30% reduction in infrastructure costs through right-sizing, spot instance usage, and architectural optimization.
Secondary Objectives
Beyond the primary goals, we established supporting objectives that would improve the overall engineering environment:
- Developer Experience: Reduce build times to under 10 minutes, enable feature-branch deployments, and implement comprehensive testing automation.
- Observability: Implement full-stack monitoring with sub-minute alerting for critical issues.
- Disaster Recovery: Achieve RPO (Recovery Point Objective) under 5 minutes and RTO (Recovery Time Objective) under 15 minutes.
- Chaos Engineering: Establish practices to proactively identify system weaknesses before they cause incidents.
Approach
We adopted a phased migration approach rather than a complete cutover. This strategy minimized risk while allowing the team to validate changes incrementally. The approach balanced technical rigor with business continuity—the platform had to remain operational throughout the transformation.
Phase 1: Foundation (Weeks 1-4)
The first phase focused on establishing the foundation for the new architecture. We began with infrastructure as code (IaC) implementation using Terraform, creating reproducible environments that could be versioned and reviewed like application code.
AWS organizational structure was redesigned with separate accounts for production, staging, and development, with strict security boundaries between them. AWS Control Tower provided governance and compliance guardrails. The production account was further partitioned into network, compute, data, and security domains.
Key personnel received intensive training on Kubernetes, AWS services, and the new deployment pipelines. We established coding standards and code review processes. This investment in human capital proved essential—developer buy-in accelerated throughout the project.
Phase 2: Extract and Optimize (Weeks 5-10)
The second phase focused on extracting the most problematic components from the monolith while optimizing the remaining core. We identified the product catalog as the highest-impact extraction target—it represented 60% of database load and was the primary source of user complaints.
We implemented a new product service using Node.js, designed for horizontal scaling. Product data migrated to Amazon DynamoDB, chosen for its consistent single-digit millisecond latency at any scale. Elasticsearch replaced MySQL for search functionality, providing fuzzy matching, faceted search, and relevance tuning capabilities.
The new product service was deployed alongside the monolith, with traffic gradually shifted using a canary deployment strategy. We started with 1% of traffic, monitored for 24 hours, then increased to 5%, 25%, and finally 100%. Each step included comprehensive validation and rollback triggers.
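A gradual traffic shift like this can be implemented by hashing each user into a stable bucket, so a given user consistently hits the same backend as the canary percentage grows. A minimal sketch, with hypothetical service names:

```python
import hashlib

def bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100

def route(user_id: str, canary_percent: int) -> str:
    """Send `canary_percent`% of users to the new service, the rest to the monolith."""
    return "product-service" if bucket(user_id) < canary_percent else "monolith"
```

Because the bucket is derived from the user ID rather than chosen randomly per request, each step of the 1% → 5% → 25% → 100% ramp is a superset of the previous one, and a rollback simply means lowering the percentage again.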
Phase 3: Service Decomposition (Weeks 11-18)
The third phase executed the core decomposition—breaking the monolith into discrete services. Each service was designed around business capabilities rather than technical layers, following domain-driven design principles.
We extracted the following services:
- User Service: Authentication, authorization, and profile management using Amazon Cognito
- Cart Service: Shopping cart operations with Redis caching and DynamoDB persistence
- Order Service: Order processing and history with Aurora PostgreSQL
- Payment Service: Payment integration with Stripe, isolated for PCI compliance
- Notification Service: Email, SMS, and push notifications using AWS SNS and third-party providers
- Search Service: Unified search across products, orders, and content using Elasticsearch
Each service was deployed in its own Kubernetes namespace, with service mesh connectivity handling inter-service communication. We implemented circuit breakers, retry policies, and graceful degradation patterns to handle partial failures gracefully.
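A circuit breaker of the kind described tracks failures and fails fast once a threshold is crossed, retrying the downstream only after a cooldown. A minimal single-process sketch (the thresholds and timings are illustrative, not FastCart's production values):

```python
import time

class CircuitBreaker:
    """Fail fast after repeated downstream failures; retry after a cooldown."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open; failing fast")
            self.opened_at = None  # half-open: allow one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip the breaker
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

While open, callers get an immediate error they can degrade against (cached data, a reduced feature set) instead of queueing behind a dead dependency, which is what turns a partial failure into a graceful one.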
Phase 4: Optimization and Hardening (Weeks 19-22)
The final phase focused on optimization and resilience hardening. We implemented auto-scaling policies that responded to custom metrics, not just CPU utilization. Database connections were pooled and optimized. CDN configurations were tuned for maximum cache efficiency.
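Kubernetes' Horizontal Pod Autoscaler scales on the ratio of an observed metric to its target, `desired = ceil(current_replicas * metric / target)`, and the same formula applies unchanged to custom metrics such as queue depth or requests per second. A sketch with illustrative replica bounds:

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 2,
                     max_replicas: int = 50) -> int:
    """Horizontal Pod Autoscaler scaling formula, clamped to configured bounds."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))
```

Scaling on a metric like requests-per-second per pod reacts to load directly, whereas CPU utilization is only a proxy that lags (or, for I/O-bound services, never moves at all).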
Chaos engineering practices were established, with regular chaos experiments that injected failures into the system to validate resilience. We tested database failures, network partitions, service failures, and even complete availability zone failures. Each experiment revealed weaknesses that were then addressed.
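The simplest chaos experiments can be run in-process before touching infrastructure: wrap a call to a downstream dependency so that a configurable fraction of calls fail, then verify the caller degrades gracefully. A hypothetical sketch:

```python
import random

def with_fault_injection(fn, error_rate: float, rng=random.random):
    """Wrap fn so roughly `error_rate` of calls raise, simulating an outage."""
    def wrapped(*args, **kwargs):
        if rng() < error_rate:
            raise ConnectionError("injected fault")
        return fn(*args, **kwargs)
    return wrapped
```

The same idea applied at the network, node, and availability-zone level is what dedicated chaos tooling automates; starting in-process keeps the blast radius small while the team builds confidence.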
Implementation
The technical implementation involved numerous specific decisions, technologies, and configurations. This section details the key implementation elements that made the transformation successful.
Kubernetes Architecture
Amazon EKS formed the compute foundation, with managed node groups providing compute capacity. We configured node groups across three availability zones for high availability. Cluster autoscaler automatically adjusted capacity based on workload demands.
Each microservice was containerized using Docker, with multi-stage builds producing minimal images. We implemented health checks, readiness probes, and liveness probes for proper container orchestration. Resource requests and limits ensured predictable scheduling and prevented noisy neighbor problems.
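As an illustration, a pod spec fragment with the probe and resource settings described might look like the following; the service name, image tag, paths, ports, and limits are assumptions, not FastCart's actual configuration:

```yaml
containers:
  - name: product-service           # hypothetical service name
    image: product-service:1.4.2    # hypothetical image tag
    resources:
      requests: { cpu: 250m, memory: 256Mi }   # guaranteed for scheduling
      limits:   { cpu: 500m, memory: 512Mi }   # cap against noisy neighbors
    readinessProbe:                 # gate traffic until the pod is ready
      httpGet: { path: /healthz/ready, port: 8080 }
      periodSeconds: 10
    livenessProbe:                  # restart the container if it wedges
      httpGet: { path: /healthz/live, port: 8080 }
      initialDelaySeconds: 15
      periodSeconds: 20
```

The distinction matters: a failing readiness probe only removes the pod from load-balancer rotation, while a failing liveness probe restarts the container.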
Database Strategy
Multi-database architecture matched data characteristics to appropriate database technologies:
- DynamoDB: Product catalog, user sessions, cart data (key-value access patterns)
- PostgreSQL (Aurora): Order processing, transactions (relational integrity requirements)
- Elasticsearch: Search, analytics (full-text search, aggregations)
- Redis (ElastiCache): Caching, session store, rate limiting (in-memory workloads)
- S3: Binary storage, logs, analytics data (object storage)
Each database was configured with appropriate scaling, backup, and recovery mechanisms. DynamoDB Accelerator (DAX) provided caching for DynamoDB, reducing read costs and improving latency further.
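The read-through pattern DAX applies can be sketched in a few lines: look up a key, and on a miss or an expired entry, fall through to the underlying store and cache the result with a TTL. A simplified in-process version (DAX itself is a managed, clustered cache, so this is only the shape of the idea):

```python
import time

class ReadThroughCache:
    """Cache lookups in front of a slower loader, with per-entry TTL expiry."""

    def __init__(self, loader, ttl: float = 5.0, clock=time.monotonic):
        self.loader = loader  # called on a miss, e.g. a DynamoDB GetItem
        self.ttl = ttl
        self.clock = clock
        self._store = {}      # key -> (value, cached_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is not None and self.clock() - entry[1] < self.ttl:
            return entry[0]   # fresh hit: no backend call
        value = self.loader(key)
        self._store[key] = (value, self.clock())
        return value
```

For a read-heavy product catalog, most requests are served from the cache, which is where both the latency improvement and the read-cost reduction come from.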
Networking and Security
VPC architecture isolated workloads with private subnets for compute and database layers. Application Load Balancers distributed traffic across services, with AWS WAF providing Web Application Firewall protection. API Gateway managed external API access with rate limiting and API key management.
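Gateway rate limiting is commonly implemented as a token bucket: each client accrues tokens at a steady rate up to a burst capacity, and a request is admitted only if a token is available. A minimal single-process sketch (the API Gateway and Redis implementations are distributed versions of the same idea):

```python
import time

class TokenBucket:
    """Admit requests at a sustained rate while allowing short bursts."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate          # tokens replenished per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # start full
        self.clock = clock
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill based on elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Keeping one bucket per API key gives each consumer an independent quota without any coordination between their requests.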
Service mesh via AWS App Mesh provided encrypted service-to-service communication, with mutual TLS for identity verification. Secrets management used AWS Secrets Manager with automatic rotation for database credentials and API keys.
CI/CD Pipeline
GitHub Actions powered the continuous integration pipeline. Each commit triggered automated testing—unit tests, integration tests, and security scans. Code quality analysis using SonarQube enforced maintainability standards. Container vulnerability scanning identified dependencies with known CVEs.
Continuous deployment used ArgoCD for GitOps-style deployments. Changes to repository state automatically propagated to the cluster after passing validation gates. Blue-green deployments enabled instant rollbacks when issues were detected. Feature flags allowed gradual feature rollout independent of deployment.
Results
By April 2026, the transformation was complete. The results exceeded our most optimistic projections, delivering performance improvements and cost savings that transformed FastCart's competitive position.
Performance Improvements
Response times improved dramatically. Average p99 latency dropped from 40+ seconds to 280 milliseconds—a 99.3% improvement. Peak-hour performance, previously the worst time, now matched or exceeded off-peak performance. The platform handled 12 million concurrent users during a flash sale event without degradation.
Cart completion rates increased from 65% to 94%. Search functionality, previously a major complaint source, now returned results in under 200 milliseconds. Product page loads decreased from 8+ seconds to under 1 second.
Reliability Metrics
System availability improved from 96.5% to 99.98%. Post-migration incidents became rare and minor; the most notable was a database connection issue that auto-resolved within 2 minutes. Zero customer-impacting incidents occurred during the March 2026 shopping festival.
Mean time to recovery (MTTR) improved from 4 hours to 3 minutes. The new architecture's self-healing capabilities handled most issues automatically, with human intervention required only for the most severe scenarios.
Developer Velocity
Deployment frequency increased from monthly to multiple times daily. Lead time for changes decreased from 3 weeks to 2 days. Build times decreased from 45 minutes to 8 minutes through optimized CI/CD pipelines.
Developer satisfaction scores, measured through quarterly surveys, increased significantly. The on-call rotation went from daily incidents to averaging under one alert per week—and most alerts were informational rather than actionable.
Metrics
Quantitative metrics document the transformation's impact across key dimensions:
| Metric | Before | After | Change |
|---|---|---|---|
| p99 Latency | 40,000 ms | 280 ms | -99.3% |
| Availability | 96.5% | 99.98% | +3.48 pp |
| Cart Completion | 65% | 94% | +44.6% |
| Search Latency | 8,000 ms | 180 ms | -97.8% |
| Monthly Costs | $42,000 | $22,300 | -46.9% |
| Deployment Frequency | Monthly | Daily | 30x |
| Incident Count (Monthly) | 47 | 3 | -93.6% |
| MTTR | 4 hours | 3 minutes | -98.8% |
| Developer Satisfaction | 2.1/5 | 4.3/5 | +105% |
Lessons Learned
The FastCart transformation yielded insights applicable to any organization undertaking similar migrations. These lessons represent hard-won knowledge that can inform similar initiatives.
Start with People, Not Technology
Our biggest success factor was investing in team preparation before technical work began. Training, process, and culture changes were more important than technology selection. Technical solutions fail without team buy-in and capability.
Recommendation: Budget 20% of project time for team preparation—training, process development, and change management. This investment pays compound returns.
Phased Migration Reduces Risk
Attempting a complete cutover would have been catastrophic. The canary deployment approach, moving traffic gradually from old to new systems, enabled early issue detection and rollback safety. Each phase taught lessons applied to subsequent phases.
Recommendation: Never migrate everything at once. Identify the highest-impact, lowest-risk component to extract first, validate, then continue iteratively.
Observability is Non-Negotiable
Implementing comprehensive monitoring before making changes was essential. We could quantify the impact of each change on latency, error rates, and costs. Without observability, we would have been flying blind.
Recommendation: Implement metrics, logging, and tracing first. Ensure you can measure before making changes.
Database Selection Matters
The original MySQL monolith was a primary bottleneck. Our multi-database approach—matching data patterns to database technologies—delivered outsized returns. One size doesn't fit all data.
Recommendation: Analyze data access patterns. Key-value, document, relational, and search workloads each suit different databases.
Chaos Engineering Prevents Incidents
Our chaos experiments—intentionally injecting failures—revealed weaknesses before they caused outages. Proactively breaking things in controlled ways builds more resilient systems than hoping nothing breaks.
Recommendation: Start small—chaos experiments can begin with single service restarts and evolve to complex failure scenarios.
Cost Optimization Requires Ongoing Attention
Our 47% cost reduction came from sustained optimization—right-sized instances, spot instances for batch workloads, reserved instance planning, and continuous monitoring. One-time optimization isn't enough.
Recommendation: Implement cost monitoring at the service level. Make cost visible to encourage optimization ownership by engineering teams.
Looking Forward
FastCart's transformed platform has positioned the company for continued growth. The architecture handles current demand comfortably and includes headroom for anticipated expansion. The engineering team, once burned out and defensive, now proactively proposes improvements.
The FastCart case demonstrates that technical transformation is possible even in the most challenging circumstances—with the right approach, team, and commitment to excellence. The platform that once threatened to sink the company now provides competitive advantage.
For organizations facing similar scaling challenges, the message is clear: the path from crisis to competitive advantage is difficult but achievable. The investment pays dividends across performance, cost, and team satisfaction dimensions. The only failure is failing to act.
