Cloud Migration at Scale: How RetailFlow Reduced Infrastructure Costs by 67% While Doubling Traffic Capacity

RetailFlow provides real-time inventory and sales analytics for over 2,400 online retailers, processing more than 15 million data points daily across Shopify, WooCommerce, Magento, and custom storefronts. By late 2025, monthly AWS bills had climbed to $42,000, yet the system struggled during peak traffic periods. Black Friday 2025 saw 15 minutes of degraded performance affecting 340 customers and resulting in $12,000 in SLA penalties. The company launched Project Phoenix in January 2026, a six-month initiative to migrate from legacy AWS EC2 to a serverless architecture on Lambda and DynamoDB. This case study details the technical challenges, strategic decisions, and execution framework that enabled significant cost reduction and scalability improvements. The team implemented a strangler-fig pattern migration, rearchitected data storage with DynamoDB single-table design, and built automated deployment pipelines. Results exceeded targets with 67% cost savings, 6x latency improvement, and zero downtime during Black Friday 2026. The migration demonstrates how careful planning, incremental execution, and data-driven decision making can successfully modernize complex legacy systems while maintaining customer trust and business continuity.

Overview

RetailFlow provides real-time inventory and sales analytics for over 2,400 online retailers, processing more than 15 million data points daily across multiple e-commerce platforms including Shopify, WooCommerce, Magento, and custom storefronts. Founded in 2019, the company had grown rapidly from a 12-person startup to a 68-person organization, yet found itself operating on an aging infrastructure stack that was becoming both expensive and operationally burdensome. By late 2025, monthly AWS bills had climbed to $42,000, yet the system struggled during peak traffic periods, requiring emergency scaling operations that cost additional overtime and created reliability concerns that threatened customer relationships.

The leadership team — CTO Marcus Chen, Lead Engineer Sarah Kim, and Platform Architect David Torres — launched Project Phoenix in January 2026: a six-month initiative to migrate their entire stack to a serverless architecture while maintaining 99.9% uptime and reducing operational costs by at least 50%. The project required careful coordination across engineering, product, and customer success teams to ensure zero disruption to existing customers while building a foundation for future growth.

RetailFlow's core challenge was not just technical debt, but architectural debt accumulated through three years of rapid feature development. Each new feature had been built as an endpoint within the monolithic application, creating a complex web of dependencies that made any change risky. Database queries that had started as simple lookups had grown into multi-table joins involving 15 tables, with average query times climbing from 45ms in the first year to over 300ms by late 2025. This degradation was masked by extensive caching, but cache misses during traffic spikes exposed the underlying performance issues.

Challenge

The existing infrastructure presented several critical pain points that required simultaneous resolution:

Infrastructure Costs. The monolithic Node.js application ran on 42 EC2 instances (t3.large and m5.xlarge) with auto-scaling groups that rarely scaled down effectively due to inconsistent health checks and connection pooling issues. Database queries on the primary PostgreSQL instance were growing slower, requiring three expensive read replicas that operated at under 10% utilization 90% of the time. The RDS instance itself was provisioned at db.r5.4xlarge with 5TB of provisioned IOPS storage, costing $8,400 monthly despite utilization rarely exceeding 30%.

Scalability Limitations. During Black Friday 2025, traffic spiked 5x above normal levels as retailers scrambled to track inventory across flash sales and supply chain disruptions. The auto-scaling policies triggered, but provisioning new instances took 8-12 minutes due to AMI customization scripts and security hardening processes. This caused a 15-minute window of degraded performance that affected 340 retail customers, resulting in 89 support tickets and two SLA penalty payments totaling $12,000.

Operational Overhead. The DevOps team of four engineers spent 35 hours weekly on routine maintenance tasks: patch management on operating systems, log rotation and archival, backup verification across multiple regions, and manual capacity planning for anticipated traffic spikes. This took up a significant share of the team's productive engineering time, delaying critical feature development and technical improvements. Security compliance audits consumed an additional 15 hours monthly, with quarterly penetration testing requiring 40+ hours of preparation and remediation.

Technical Debt. The codebase contained over 12,000 lines of infrastructure-specific logic, tightly coupling business logic with deployment concerns. Startup scripts, health check endpoints, and cache warming routines were scattered throughout the application code. Database migrations required scheduled maintenance windows, typically lasting 2-3 hours, limiting release frequency to twice monthly and creating bottlenecks when urgent fixes were needed.

Data Consistency Issues. The caching layer used Redis with TTL-based expiration, leading to race conditions where multiple requests for the same data could trigger simultaneous database queries. This caused occasional data inconsistency in customer dashboards, particularly during high-traffic periods. Cache invalidation was manual and error-prone, requiring developer intervention several times weekly.

Monitoring Blind Spots. While the team used Datadog for infrastructure monitoring, the alerting configuration was overly broad, generating an average of 23 false-positive alerts daily. Engineers had become desensitized to pages, leading to delayed response during an actual incident in March 2026 where disk space exhaustion caused partial data loss.

Goals and Success Metrics

The team established four primary objectives with concrete success criteria, each tied to measurable outcomes:

Cost Reduction: Reduce monthly infrastructure spend from $42,000 to under $20,000 (target: 52% reduction, stretch goal: 65%)
Scalability: Handle 10x traffic spikes without manual intervention, eliminating cold-start provisioning delays and ensuring sub-second response times
Operational Efficiency: Reduce routine maintenance tasks to under 10 hours weekly, freeing engineering for product development and allowing team reallocation
Reliability: Achieve 99.9% uptime during and after migration, with automated rollback capability within 5 minutes if needed

Success would be measured using three data sources: AWS Cost Explorer for infrastructure costs, Datadog SLOs for uptime and latency, and Jira time-tracking for operational hours. Each metric would be reviewed weekly in leadership syncs, with quarterly business reviews comparing actual performance against projections. The team also committed to documenting learnings continuously, creating a migration playbook for future infrastructure projects.

Approach and Strategy

The migration followed a strangler-fig pattern: building new serverless services alongside the existing monolith rather than attempting a big-bang replacement that could disrupt customer operations. The team established a three-phase framework with clear entry and exit criteria, rollback procedures, and success metrics for each stage. Risk mitigation was paramount, as RetailFlow's reputation with enterprise clients depended on reliable service delivery.

Phase 1: Foundation (Weeks 1-4)

The team began by establishing the serverless foundation using AWS SAM (Serverless Application Model) and Terraform for infrastructure-as-code. This dual-tool approach was intentional: SAM provided rapid iteration capabilities for Lambda functions, while Terraform managed cross-service dependencies and networking. Critical architectural decisions were made in week 1, documented in Architecture Decision Records (ADRs), and reviewed by external AWS Solutions Architects.

Migrating from PostgreSQL to DynamoDB required extensive modeling. The team spent three weeks analyzing query patterns, identifying which queries were read-heavy versus write-heavy, and understanding the cardinality of their data. They settled on a single-table design pattern to minimize query complexity and maximize performance, accepting the trade-off of eventual consistency for most reads. A separate analytics table using a different primary key structure handled time-series queries.

Event-driven architecture using EventBridge replaced direct service calls, enabling decoupled service communication and retry handling. Each service emitted events to a central bus, with other services subscribing to events relevant to their domain. This pattern allowed independent scaling and deployment while maintaining clear data flow boundaries.
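The publish/subscribe flow described above can be illustrated with a small in-process sketch. This is a stand-in for a central event bus, not the EventBridge API; the class name InMemoryBus and the event names in the usage note are hypothetical, and retries are deliberately left out.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class InMemoryBus:
    """In-process stand-in for a central event bus (not the EventBridge API)."""

    def __init__(self) -> None:
        # Maps an event's detail-type to the handlers subscribed to it.
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, detail_type: str, handler: Handler) -> None:
        """Register a handler for one kind of event."""
        self._subscribers[detail_type].append(handler)

    def publish(self, source: str, detail_type: str, detail: dict) -> int:
        """Deliver an event to every subscriber; return how many received it."""
        event = {"source": source, "detail-type": detail_type, "detail": detail}
        handlers = self._subscribers.get(detail_type, [])
        for handler in handlers:
            handler(event)
        return len(handlers)
```

In this sketch the publisher never references its consumers directly: an ingestion component would call publish("ingestion", "OrderIngested", {...}) and any number of services could subscribe to "OrderIngested" without the publisher changing.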

Lambda functions were grouped by bounded contexts: authentication handled all identity and access management; data ingestion accepted API requests and queued them for processing; analytics processing transformed raw events into actionable insights; reporting generated customer-facing dashboards and scheduled exports. Each service had its own CloudWatch log group, IAM role, and deployment pipeline.

Observability was prioritized before feature development. The team set up CloudWatch-based monitoring with custom dashboards for cost and performance tracking. They attached correlation IDs to requests and carried them across all services through EventBridge events, enabling end-to-end request tracking through the distributed system. Distributed tracing revealed that Lambda cold starts were the primary latency contributor, informing later optimization decisions.
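One way correlation IDs might be propagated is sketched below. The field name correlation_id and both helper functions are assumptions made for illustration; the source does not describe the team's actual event schema.

```python
import uuid


def with_correlation_id(event: dict) -> dict:
    """Return a copy of the event that is guaranteed to carry a correlation ID."""
    enriched = dict(event)
    # Keep an existing ID; mint a new one only at the edge of the system.
    enriched.setdefault("correlation_id", str(uuid.uuid4()))
    return enriched


def child_event(parent: dict, detail_type: str, detail: dict) -> dict:
    """Build a downstream event that inherits the parent's correlation ID."""
    return {
        "detail-type": detail_type,
        "detail": detail,
        "correlation_id": parent["correlation_id"],
    }
```

Because every downstream event copies the ID from its parent, log lines from different services can be joined on that single value to reconstruct one request's path.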

A significant investment was made in testing infrastructure: they built a parallel staging environment that ran 10% of production traffic continuously, allowing for real-world validation of the serverless architecture before any customer impact. The staging environment used synthetic data that mimicked production patterns while ensuring PII was never present. Load testing tools simulated traffic up to 20x normal levels, revealing bottlenecks in the DynamoDB access patterns.

Phase 2: Migration Waves (Weeks 5-16)

Instead of migrating by component, the team migrated by user segment following a dark-launch approach. They used traffic-mirroring with a weighted load balancer that routed a percentage of requests to the new architecture while maintaining dual writes to both systems. Each wave increased the percentage by 5%, with two-week stabilization periods between waves for metric analysis and issue resolution.

User segmentation prioritized low-risk customers: small retailers with basic setups and fewer integrations were migrated first, allowing the team to identify issues with minimal customer impact. High-value enterprise customers with complex integrations were migrated last, after patterns had been validated and hardened. This approach required careful coordination with the customer success team to manage expectations and provide clear communication.
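A percentage-based split of this kind is often implemented with stable hashing, so that a customer who has moved to the new stack stays there as the percentage grows. The sketch below assumes hashing on a tenant identifier; it does not model the risk-based ordering of customers or the dual writes described above, and the function names are hypothetical.

```python
import hashlib


def bucket(tenant_id: str) -> int:
    """Map a tenant to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def route(tenant_id: str, serverless_percent: int) -> str:
    """Return which stack serves this tenant at the current wave percentage."""
    return "serverless" if bucket(tenant_id) < serverless_percent else "monolith"
```

Since a tenant's bucket never changes, raising serverless_percent only ever moves tenants from the monolith to the serverless stack, never back.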

The most complex migration involved the analytics pipeline, which processed 50,000 events per minute during peak hours and generated insights updated every 15 seconds. The team implemented a batch processor using Step Functions that could handle backpressure gracefully, automatically extending execution time when queues backed up. For sub-second processing requirements, they used Lambda with provisioned concurrency, keeping a pool of warmed instances ready for immediate execution.

Database migration required special attention to ensure data integrity. They used AWS DMS (Database Migration Service) with CDC (Change Data Capture) to keep DynamoDB synchronized during the transition, capturing every write operation and replicating it to the new system. A custom reconciliation tool compared record counts and checksums between systems hourly, flagging discrepancies for manual review. The tool was built using Step Functions with parallel comparison workers, completing full-table validation in under 6 hours.
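The count-and-checksum comparison can be sketched as follows. This is a simplified single-process illustration over in-memory snapshots keyed by primary key, not the team's Step Functions tool; the function names and report fields are hypothetical.

```python
import hashlib
import json


def record_checksum(record: dict) -> str:
    """Hash a record in a canonical form so key order does not matter."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def reconcile(source: dict, target: dict) -> dict:
    """Compare two {primary_key: record} snapshots and report discrepancies."""
    missing = sorted(k for k in source if k not in target)
    extra = sorted(k for k in target if k not in source)
    mismatched = sorted(
        k for k in source
        if k in target and record_checksum(source[k]) != record_checksum(target[k])
    )
    return {
        "source_count": len(source),
        "target_count": len(target),
        "missing_in_target": missing,
        "unexpected_in_target": extra,
        "checksum_mismatch": mismatched,
    }
```

Matching counts alone can hide offsetting errors (one record missing, one duplicated), which is why the per-record checksum is compared as well.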

API Gateway configuration presented unexpected challenges. The original monolith used a simple ELB with path-based routing, but API Gateway required explicit route definitions and CORS handling. The team used Terraform modules to manage routes, with OpenAPI specifications defining request/response schemas. Rate limiting was implemented at the gateway level, replacing application-level throttling that had been inconsistent.

Caching strategy evolved from Redis to a multi-tier approach: API Gateway response caching for frequently-requested aggregations, Lambda container reuse for warm function instances, and DynamoDB Accelerator (DAX) for complex query caching. This reduced cache-related code in the application by 70% while improving hit rates and reducing inconsistency.

Phase 3: Optimization and Cutover (Weeks 17-24)

With 95% of traffic on the serverless stack, the team focused on optimization to meet their aggressive cost targets. They implemented Lambda Power Tuning to optimize memory allocation for each function, discovering that many functions were over-provisioned. The power tuning process involved running each function at multiple memory levels with representative workloads, measuring duration and cost to find the optimal configuration.
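The selection step of power tuning reduces to a small calculation: compute cost is roughly memory times duration times a per-GB-second price, so more memory only pays off if it shortens duration enough. The sketch below assumes a price of 0.0000166667 USD per GB-second as a default parameter and ignores per-request charges; the measurements in the usage note are invented for illustration.

```python
def invocation_cost(memory_mb: int, duration_ms: float, price_per_gb_second: float) -> float:
    """Approximate compute cost of one invocation (request charges ignored)."""
    return (memory_mb / 1024) * (duration_ms / 1000) * price_per_gb_second


def cheapest_memory(measurements: dict, price_per_gb_second: float = 0.0000166667) -> int:
    """Pick the memory size (MB) with the lowest cost per invocation.

    measurements maps memory in MB to the average measured duration in ms.
    """
    return min(
        measurements,
        key=lambda mb: invocation_cost(mb, measurements[mb], price_per_gb_second),
    )
```

For example, with hypothetical measurements {512: 400.0, 1024: 210.0, 2048: 200.0}, the 512MB setting wins because doubling memory does not halve duration. A latency-sensitive function might still justify a larger setting than the cheapest one.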

DynamoDB Auto Scaling policies were refined based on actual usage patterns rather than theoretical projections. Initial configurations used AWS-recommended settings that proved too conservative during traffic spikes. The team analyzed six months of traffic data to establish baselines, then implemented custom scaling with faster response curves. Reserved capacity was purchased for predictable baseline load, further reducing costs by 20%.

Automated cost alerts using AWS Budgets integrated with Slack notifications, alerting the team when daily spend exceeded thresholds. These alerts were configured per-service, enabling quick identification of runaway costs. Monthly cost reviews compared actual spend against projections, adjusting configurations as needed.
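The per-service threshold check behind such alerts can be sketched in a few lines. This is plain comparison logic rather than the AWS Budgets API, and the service names and dollar figures in the usage note are hypothetical.

```python
def services_over_budget(daily_spend: dict, thresholds: dict) -> list:
    """Return (service, overage) pairs for breached thresholds, largest first."""
    breaches = [
        (service, round(spend - thresholds[service], 2))
        for service, spend in daily_spend.items()
        if service in thresholds and spend > thresholds[service]
    ]
    return sorted(breaches, key=lambda pair: pair[1], reverse=True)
```

Given spend of {"ingestion": 120.0, "reporting": 45.0} against thresholds of {"ingestion": 100.0, "reporting": 40.0}, the function reports ingestion first with an overage of 20.0, which is the ordering a notification would likely want.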

A blue-green deployment system was built using Lambda aliases and weighted routing at the API Gateway level. Each deployment created a new alias pointing to updated function versions, with traffic gradually shifted over 30 minutes. Automated health checks monitored error rates and latency, triggering automatic rollback if metrics degraded beyond thresholds. This system handled 23 rollback events during the optimization period, each completing successfully without customer impact.
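The rollback decision itself can be expressed as a comparison between the stable version's metrics and the new version's metrics. The thresholds below (1% error rate, 1.5x the baseline P99 latency) are assumed values for illustration; the source does not state the team's actual limits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Health:
    error_rate: float        # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float    # 99th percentile latency in milliseconds


def should_roll_back(
    baseline: Health,
    candidate: Health,
    max_error_rate: float = 0.01,
    max_latency_ratio: float = 1.5,
) -> bool:
    """Return True when the new version degrades beyond the allowed thresholds."""
    if candidate.error_rate > max_error_rate:
        return True
    return candidate.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio
```

Comparing latency against the live baseline rather than a fixed number keeps the check meaningful when overall traffic conditions change during the shift.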

The final cutover weekend spanned 72 hours: Friday evening began with 95% traffic on serverless, Saturday gradually shifted the remaining 5%, and Sunday decommissioned the EC2 fleet. The process was deliberately slow to allow quick rollback if any issues emerged, with engineers on-call throughout the weekend. Only one minor issue surfaced — a timezone handling bug in reporting — resolved within 45 minutes using the blue-green rollback system.

Implementation Details

Architecture Transformation

The original monolith served 42 endpoints from a single codebase using Express.js with extensive middleware for authentication, rate limiting, and request logging. The new architecture decomposed this into 18 microservices, each deployed independently using AWS SAM templates that defined function configuration, IAM roles, and event sources. API Gateway handled routing between services, with Cognito managing authentication and session state.

The Auth Service replaced custom session management with Cognito-based authentication, using Lambda triggers for custom workflows like passwordless login and MFA enrollment. Session tokens were validated at the API Gateway level, reducing function invocations by 40% for authenticated requests. Custom claims in JWT tokens eliminated database lookups for user permissions, further reducing latency.

The Ingestion Service handled API requests through API Gateway endpoints, writing to Kinesis Firehose for buffering and automatic batching. Firehose delivered records to Lambda functions every 60 seconds or 5MB, whichever came first, providing natural load balancing across processing instances. Dead letter queues captured failed records for later analysis, with automatic reprocessing after fixes.
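The "60 seconds or 5MB, whichever came first" rule is a size-or-age flush policy. The sketch below models it in-process to show the logic; it is not Firehose itself. It assumes 5MB means 5 * 1024 * 1024 bytes and, unlike a real timer-driven buffer, only checks age when a record arrives.

```python
class SizeOrTimeBuffer:
    """Batch records and flush when either the size or the age limit is hit."""

    def __init__(self, max_bytes: int = 5 * 1024 * 1024, max_age_seconds: float = 60.0) -> None:
        self.max_bytes = max_bytes
        self.max_age_seconds = max_age_seconds
        self._records = []
        self._size = 0
        self._opened_at = None  # arrival time of the first record in the batch

    def add(self, record: bytes, now: float) -> list:
        """Append a record; return the flushed batch, or [] if nothing flushed."""
        if not self._records:
            self._opened_at = now
        self._records.append(record)
        self._size += len(record)
        too_big = self._size >= self.max_bytes
        too_old = now - self._opened_at >= self.max_age_seconds
        if too_big or too_old:
            batch, self._records, self._size = self._records, [], 0
            return batch
        return []
```

Under heavy traffic the size limit triggers first and batches arrive frequently; under light traffic the age limit bounds how stale a record can become before processing.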

Processing Service Lambda functions were triggered by Firehose for real-time analysis, transforming raw events into aggregated metrics stored in DynamoDB. Functions were organized by data type: product events, order events, and customer events each had dedicated processing logic. Step Functions orchestrated complex multi-step workflows, maintaining state across function invocations without external storage.

Reporting Service Lambda functions generated daily and hourly aggregates on scheduled triggers, querying DynamoDB for raw data and producing compressed reports for customer download. Reports were stored in S3 with pre-signed URLs for secure access, eliminating the need for persistent storage in DynamoDB. Large reports used S3 Transfer Acceleration for faster downloads.

Notification Service used SNS for alerting on threshold breaches and system events, with subscriptions to Slack channels, email lists, and SMS for critical alerts. Message templates were defined in DynamoDB, allowing non-engineering team members to update notification content. The service also handled webhook delivery to customer integrations, with automatic retry and dead-letter handling.

The single-table DynamoDB design used composite keys strategically following the adjacency list pattern. The primary partition key was pk = tenant_id, with sort keys organized by entity type and timestamp. This enabled efficient queries for both current data and historical analytics. Global Secondary Indexes (GSIs) provided alternate access patterns, with projected attributes minimizing additional read costs.
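A composite key layout along these lines might look like the sketch below. Only pk = tenant_id and the entity-type-then-timestamp ordering come from the text; the "#" separator, the trailing entity ID, and the ISO-8601 timestamp format are assumptions, and the query function emulates a key-condition query over an in-memory list rather than calling DynamoDB.

```python
def make_key(tenant_id: str, entity_type: str, timestamp: str, entity_id: str) -> dict:
    """Build the composite key for one item in the single table."""
    return {"pk": tenant_id, "sk": f"{entity_type}#{timestamp}#{entity_id}"}


def query(items: list, tenant_id: str, sk_prefix: str) -> list:
    """Emulate a query with pk equality and a begins-with condition on sk."""
    matches = [
        item for item in items
        if item["pk"] == tenant_id and item["sk"].startswith(sk_prefix)
    ]
    # Sort keys compare lexicographically, so ISO-8601 timestamps sort by time.
    return sorted(matches, key=lambda item: item["sk"])
```

With this layout, a prefix such as "ORDER#2026-03" would select one tenant's orders for a single month in time order, covering both "current data" and "historical" reads with one access pattern.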

CI/CD Pipeline Overhaul

The team adopted GitHub Actions with a multi-stage deployment pipeline that minimized manual approval steps while maintaining safety. Unit tests and linting ran on every PR against the changed code only, using Jest with parallel execution. Integration tests deployed to a temporary staging environment with a production data subset, validating end-to-end flows for 15 minutes before automatic teardown.

Canary deployment routed 1% of traffic for 30 minutes before proceeding, with automated promotion on successful health checks. The canary system used CloudWatch Synthetics to validate key user flows, checking API responses and dashboard accuracy. Manual approval was required for the first three deployments of any service, after which automated promotion was enabled.

Gradual rollout used Lambda aliases with weighted traffic shifting, moving 10% of traffic every 15 minutes while monitoring metrics. Automated rollback on CloudWatch alarm triggers provided safety for urgent issues. The rollback system used Step Functions to revert multiple services atomically, preventing partial rollback states that could cause data inconsistency.
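The shift schedule implied by "10% of traffic every 15 minutes" can be worked out with a short helper. The sketch assumes the first 10% step happens at minute 0, which the text does not specify; the function name is hypothetical.

```python
def rollout_schedule(step_percent: int = 10, interval_minutes: int = 15) -> list:
    """Return (minutes_elapsed, percent_on_new_version) pairs up to 100%."""
    if step_percent <= 0 or interval_minutes <= 0:
        raise ValueError("step_percent and interval_minutes must be positive")
    schedule = []
    percent, minute = step_percent, 0
    while percent < 100:
        schedule.append((minute, percent))
        percent += step_percent
        minute += interval_minutes
    schedule.append((minute, 100))  # final step is capped at 100%
    return schedule
```

Under these assumptions a full rollout takes ten steps and reaches 100% at minute 135, which gives the alarm-driven rollback more than two hours of partial exposure in which to trigger.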

Infrastructure changes used Terraform Cloud with policy checks preventing destructive operations during business hours. Security scans ran on every PR using tfsec and checkov, blocking deployments with critical vulnerabilities. Network changes required additional approval from the security team, with temporary exceptions for emergency fixes.

Data Migration Strategy

The PostgreSQL database contained 3.2TB of data across 45 tables, accumulated over three years of operation. Rather than migrate all at once and risk extended downtime, they used a selective approach based on access patterns. Hot data (last 90 days) migrated immediately using DMS full load plus CDC, completing in 72 hours with ongoing synchronization.

Cold data (older than 90 days) was archived to S3 Glacier and loaded on-demand when requested. The archival process involved compressing data into columnar format for efficient querying, with Athena used for infrequent analytical queries. A metadata layer tracked archive locations, abstracting the distinction from application code.
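The hot/cold split is a cutoff on record age. A minimal sketch follows, assuming each record carries a timezone-aware creation timestamp and that a record exactly 90 days old still counts as hot; the field names id and created_at are hypothetical.

```python
from datetime import datetime, timedelta, timezone

HOT_WINDOW = timedelta(days=90)


def storage_tier(record_time: datetime, now: datetime) -> str:
    """Classify a record as hot (within the last 90 days) or cold (older)."""
    return "hot" if now - record_time <= HOT_WINDOW else "cold"


def partition(records: list, now: datetime) -> dict:
    """Group record IDs by tier so each group can follow its own migration path."""
    plan = {"hot": [], "cold": []}
    for record in records:
        plan[storage_tier(record["created_at"], now)].append(record["id"])
    return plan
```

Passing the current time in as an argument, rather than reading the clock inside the function, keeps the classification reproducible across a multi-day migration run.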

Frequently accessed reference data was denormalized into DynamoDB items, eliminating joins that had become performance bottlenecks. Merchant categories, product catalogs, and shipping zones were replicated across partitions to enable single-query access. This denormalization increased storage costs by 15% but reduced query costs by 60% and improved latency.

An A/B testing framework compared old vs. new system results during transition, randomly assigning users to view data from each system and comparing dashboard outputs. Discrepancies triggered alerts for manual investigation, with tolerance thresholds set at 0.1% difference for numerical values and exact matching for categorical data. The testing framework ran continuously for six weeks, identifying edge cases in aggregation logic.
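The tolerance rule (0.1% for numerical values, exact matching for categorical data) can be sketched as a field-by-field comparison. The function names and the dashboard fields in the usage note are hypothetical, and treating booleans as categorical is an assumption.

```python
import math


def values_match(old, new, rel_tolerance: float = 0.001) -> bool:
    """Numbers may differ by up to 0.1%; everything else must match exactly."""
    if isinstance(old, bool) or isinstance(new, bool):
        return old == new  # booleans are treated as categorical
    if isinstance(old, (int, float)) and isinstance(new, (int, float)):
        return math.isclose(old, new, rel_tol=rel_tolerance)
    return old == new


def diff_dashboards(old: dict, new: dict) -> list:
    """Return the sorted field names whose values disagree between systems."""
    keys = sorted(set(old) | set(new))
    return [
        key for key in keys
        if key not in old or key not in new or not values_match(old[key], new[key])
    ]
```

Comparing {"revenue": 1000.0, "orders": 40, "top_category": "shoes"} with {"revenue": 1000.5, "orders": 41, "top_category": "Shoes"} flags orders and top_category but not revenue, since 0.5 on 1000 is within 0.1%.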

Data validation used checksums at multiple levels: record counts, field values, and calculated aggregates. The validation service ran hourly comparisons during migration, producing reports for engineering review. Automated remediation fixed common issues like timezone conversions and decimal precision, while complex discrepancies were flagged for manual intervention.

Results and Metrics

After six months of execution, the migration delivered results exceeding all targets. Monthly infrastructure costs dropped to $14,000, a 67% reduction from the original $42,000. Gross savings came primarily from eliminated EC2 instances ($28,000/month) and optimized database costs ($10,000/month), partly offset by new spend on serverless services, for a net reduction of $28,000/month. The remaining costs were further justified by increased development productivity and reduced incident response effort.

P99 latency improved from 840ms to 142ms — a 6x improvement exceeding the team's expectations. The serverless architecture scaled automatically to handle Black Friday 2026 traffic, which exceeded projections by 30%, with zero performance degradation. Database queries that previously timed out during peaks now completed in under 50ms, with DAX caching handling 85% of requests.
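The headline figures can be checked with two one-line formulas, using only numbers stated in this case study. The helper names are hypothetical.

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage by which a value fell from before to after."""
    return (before - after) / before * 100


def speedup(before_ms: float, after_ms: float) -> float:
    """How many times faster the new latency is than the old one."""
    return before_ms / after_ms


print(round(percent_reduction(42_000, 14_000), 1))  # 66.7, reported as 67%
print(round(percent_reduction(42_000, 20_000), 1))  # 52.4, the original target
print(round(speedup(840, 142), 1))                  # 5.9, reported as 6x
```

Both reported results are therefore rounded values: the cost reduction is about 66.7% and the P99 improvement about 5.9x.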

Routine maintenance tasks fell from 35 hours weekly to just 7 hours, achieved through automation and elimination of manual processes. The DevOps team redeployed those hours to building a customer-facing analytics dashboard that contributed to a 12% increase in user engagement. Security compliance became largely automated, with infrastructure-as-code ensuring consistent configurations and automated scanning catching vulnerabilities before deployment.

The system achieved 99.95% uptime during the migration period and has maintained 99.98% uptime since completion. The rollback capability was exercised during a database misconfiguration in May 2026 and completed in 2.3 minutes, automatically restoring service. Customer-reported incidents dropped from 12/month to 2/month, with mean time to resolution improving from 4 hours to 26 minutes.

Deployment frequency increased dramatically: from twice monthly to an average of 18 deployments daily, with zero downtime. Mean time to recovery improved from 4 hours to under 30 minutes, with automated rollback handling most issues. The team attributed this improvement to smaller, focused changes and better observability through distributed tracing.

Customer satisfaction scores improved across all measured dimensions. NPS increased from 32 to 58, with reliability being the top driver of improvement. Churn dropped from 4.2% monthly to 1.8% monthly, attributed to improved performance and fewer service interruptions. The customer success team reported spending 30% less time on troubleshooting and 40% more time on proactive engagement.

Lessons Learned

1. Start with Observability. Investing in comprehensive monitoring before migration saved weeks of debugging. The team instrumented both old and new systems with identical metrics, making comparison trivial. CloudWatch dashboards displayed side-by-side metrics for latency, error rate, and cost per request. Without this visibility, optimization would have been guesswork rather than data-driven decisions.

2. Embrace Incrementalism. The strangler-fig approach and user-segment migration waves eliminated the risk of a catastrophic failure. Each wave provided learnings that improved subsequent deployments. The team documented these learnings in a shared playbook that accelerated later waves. Attempting a big-bang migration would have been impossible given the system's complexity and customer expectations for reliability.

3. Right-Size Lambda Memory. Initial deployment used arbitrary memory settings based on intuition carried over from sizing containers. Power Tuning revealed that 12 of 18 functions could run on 512MB instead of 2GB, cutting Lambda costs by 40%. The tuning process took two weeks but saved $40,000 annually. They scheduled quarterly power tuning reviews to catch configuration drift as code evolved.

4. Prepare for Cold Starts. Provisioned concurrency added $3,000/month in costs that exceeded the Infrastructure team's budget. The team restructured functions to minimize initialization time and used scheduled warm-up invocations instead, saving $2,200 monthly. They also implemented runtime caching for expensive initialization operations, reducing cold start impact for functions that couldn't be provisioned.

5. Test with Real Data. The staging environment with 10% production traffic revealed compatibility issues that synthetic testing missed. Budgeting for this infrastructure is essential for successful migrations. The synthetic data, while useful for basic functionality, didn't capture the data volume and access patterns that caused real issues. They recommended allocating 20% of migration budget to parallel infrastructure for testing.

6. Database Design is Everything. The single-table DynamoDB design was initially met with skepticism, but proved essential for performance and cost optimization. Alternative approaches using multiple tables or relational databases would have required expensive provisioned capacity and complex sharding. They invested heavily in training and pair-programming to ensure all team members understood the access patterns and could contribute effectively.

7. Communicate Proactively with Customers. Regular updates during migration prevented customer anxiety and built confidence in the process. The customer success team provided weekly status emails with technical details appropriate to each audience. Enterprise customers received personal check-ins during their migration waves, contributing to zero churn among those customers during the transition period and positive testimonials for the company blog.

Project Phoenix concluded in July 2026, positioning RetailFlow for sustainable growth through 2027 and beyond. The team has since begun applying the same patterns to their mobile backend, with similar cost and performance targets. The migration playbook has been adopted company-wide, with three additional projects already in planning stages.

The success of Project Phoenix demonstrates that even complex legacy systems can be successfully modernized with careful planning, incremental execution, and data-driven decision making. The 67% cost reduction and 6x performance improvement positioned RetailFlow competitively against both legacy competitors and newer cloud-native entrants, while freeing engineering resources for product innovation rather than infrastructure management.