Enterprise Cloud Migration: Scaling to 2M+ Users on AWS Infrastructure

How Webskyne transformed a legacy on-premises architecture into a modern, scalable AWS deployment supporting more than 2 million active users while reducing infrastructure costs by 42% and raising system availability from 99.2% to 99.98%. This case study details the strategic approach, technical implementation, and lessons learned during a complex 18-month migration spanning multiple service domains and deployment pipelines.

Overview

In 2024, Webskyne partnered with a leading fintech platform to execute a comprehensive cloud migration from legacy on-premises infrastructure to AWS. The client, serving over 2 million active users across Southeast Asia, faced mounting scalability challenges as their monolithic architecture struggled to handle peak transaction loads exceeding 50,000 concurrent users. With regulatory compliance requirements tightening and user expectations for instant transaction processing rising, the organization needed a modern, resilient infrastructure that could scale horizontally while maintaining strict security standards.

The migration project represented one of our most ambitious undertakings: consolidating 15 years of legacy systems, migrating 2.3 TB of transactional data, and transitioning critical payment processing services with zero downtime. The scope encompassed rearchitecting the core platform into microservices, implementing event-driven workflows for real-time notifications, and establishing a robust CI/CD pipeline for continuous deployment across multiple AWS regions.

Our approach combined a phased migration strategy with blue-green deployment patterns, allowing for gradual transition while maintaining operational excellence. The project required deep collaboration between our cloud architects, DevOps engineers, and security specialists to ensure compliance with PCI-DSS, ISO 27001, and regional financial regulations throughout the transition.

Challenge

The client's legacy infrastructure presented several critical challenges that necessitated the migration. Their primary database, running on aging Oracle servers, was experiencing performance degradation during peak hours, with query response times exceeding 5 seconds for 15% of transactions. The monolithic application architecture meant that scaling any component required scaling the entire stack, resulting in inefficient resource utilization and increased costs.

Security vulnerabilities had been identified during routine audits, particularly around data encryption at rest and in transit. The on-premises setup lacked the redundancy necessary for high availability, with single points of failure in their load balancer and primary application servers. Disaster recovery procedures were manual and time-consuming, requiring 4-6 hours to restore full service in a failure scenario.

Development velocity had slowed significantly due to the rigid deployment process, which required coordinated downtime windows every two weeks for releases. This bottleneck prevented the product team from responding quickly to market demands and iterating on user feedback. Additionally, integrating with modern payment gateways and third-party services proved increasingly difficult due to outdated API frameworks and deprecated libraries.

Goals

The project established four primary objectives that guided every decision throughout the 18-month engagement. First, achieve 99.98% system availability with automated failover capabilities across multiple AWS regions, reducing mean time to recovery from hours to under 5 minutes. Second, reduce infrastructure costs by at least 35% while providing 10x the current compute capacity to handle projected user growth.
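
To put the availability and recovery targets in concrete terms, the sketch below converts an availability percentage into a yearly downtime budget. It is illustrative arithmetic only: the function names are our own, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; assumes a non-leap year


def downtime_budget_minutes(availability_pct: float) -> float:
    """Return the yearly downtime (in minutes) permitted by an availability target."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR


def max_incidents(availability_pct: float, mttr_minutes: float) -> int:
    """Return how many full-length incidents fit in the yearly budget at a given MTTR."""
    return int(downtime_budget_minutes(availability_pct) // mttr_minutes)


if __name__ == "__main__":
    print(round(downtime_budget_minutes(99.98), 2))  # yearly budget in minutes
    print(max_incidents(99.98, 5))                   # incidents at a 5-minute MTTR
```

A 99.98% target leaves about 105 minutes of downtime per year, which at the 5-minute recovery ceiling is room for roughly 21 incidents. This is why the availability and recovery objectives have to be planned together.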

Third, implement a microservices architecture that would enable independent scaling of at least 12 service domains including user management, transaction processing, notification systems, and analytics. Finally, establish a fully automated CI/CD pipeline capable of deploying updates to any service with zero downtime and rollback capabilities within 60 seconds.
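
The zero-downtime deployment and fast rollback goal can be pictured as a blue-green switch: a new version is promoted only if it passes a health check, and a rollback simply flips traffic back. The following is a minimal in-memory sketch under that assumption. The class and method names are hypothetical, and a real pipeline would move load balancer or DNS weights rather than a Python attribute.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class BlueGreenService:
    """Tracks which of two identical environments currently receives traffic."""

    active: str = "blue"
    previous: Optional[str] = None  # environment to return to on rollback

    def idle(self) -> str:
        """Return the environment that is not serving traffic."""
        return "green" if self.active == "blue" else "blue"

    def promote(self, health_check: Callable[[str], bool]) -> bool:
        """Switch traffic to the idle environment only if it is healthy."""
        candidate = self.idle()
        if not health_check(candidate):
            return False  # traffic stays where it is
        self.previous, self.active = self.active, candidate
        return True

    def rollback(self) -> bool:
        """Flip traffic back to the environment that served before the last promote."""
        if self.previous is None:
            return False  # nothing to roll back to
        self.active, self.previous = self.previous, None
        return True
```

Because the previous environment keeps running until the next release, a rollback is a pointer flip rather than a redeploy, which is what makes a recovery target measured in seconds plausible.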

Performance benchmarks were set at sub-200ms response times for 95% of API calls, with transaction processing throughput of at least 10,000 requests per second during peak loads. Security goals included achieving SOC 2 Type II compliance, implementing zero-trust network architecture, and establishing automated security scanning within the deployment pipeline.
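
A percentile target such as "sub-200ms for 95% of API calls" can be checked directly against a sample of measured latencies. The sketch below uses the nearest-rank percentile definition, which is an assumption on our part (monitoring tools may interpolate differently), and the function names are illustrative.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def meets_latency_target(samples_ms: Sequence[float], pct: float = 95, target_ms: float = 200) -> bool:
    """Return True when the pct-th percentile latency is below target_ms."""
    return percentile(samples_ms, pct) < target_ms
```

With 20 samples, one slow outlier still passes a 95th percentile check, while two slow requests fail it. This is the intended behavior of a percentile target: it tolerates rare spikes but not a systematic tail.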

Approach

Our methodology centered on the Strangler Fig pattern, gradually replacing legacy components with cloud-native services while maintaining operational continuity. We began with a comprehensive audit of all 47 existing services, categorizing them into four migration waves based on criticality, complexity, and dependencies.
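
The core mechanic of the Strangler Fig pattern is a routing facade: requests for already-migrated paths go to the new service, and everything else continues to hit the legacy system. The following is a minimal sketch of that idea, not the routing layer used in the project. The class name, path prefixes, and backend labels are hypothetical.

```python
from dataclasses import dataclass, field


def _matches(prefix: str, path: str) -> bool:
    """Match whole path segments so '/users' does not capture '/users-archive'."""
    prefix = prefix.rstrip("/")
    return path == prefix or path.startswith(prefix + "/")


@dataclass
class StranglerRouter:
    """Facade that sends migrated path prefixes to new backends, all else to legacy."""

    legacy_backend: str
    migrated: dict[str, str] = field(default_factory=dict)  # prefix -> new backend

    def migrate(self, prefix: str, backend: str) -> None:
        """Start routing a path prefix to its cloud-native replacement."""
        self.migrated[prefix] = backend

    def rollback(self, prefix: str) -> None:
        """Send a prefix back to the legacy system."""
        self.migrated.pop(prefix, None)

    def route(self, path: str) -> str:
        """Return the backend for a request path; the longest matching prefix wins."""
        matches = [p for p in self.migrated if _matches(p, path)]
        if not matches:
            return self.legacy_backend
        return self.migrated[max(matches, key=len)]
```

Each migration wave then amounts to adding prefixes to the facade, and a problematic cutover can be undone by removing one entry without touching the other services.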

The infrastructure design followed AWS Well-Architected Framework principles: containerized microservices running on ECS with Fargate for compute, Aurora PostgreSQL for transactional data, and DynamoDB for session state. EventBridge and SQS formed the backbone of our asynchronous processing architecture, enabling decoupled service communication and improved fault tolerance.
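
To illustrate what decoupled communication over EventBridge looks like from a producer's side, the sketch below builds a single event entry in the shape accepted by the put_events call of the boto3 events client (Source, DetailType, Detail as a JSON string, EventBusName). The source name, bus name, and payload envelope are assumptions made for illustration. No AWS call is made here.

```python
import json
import uuid
from datetime import datetime, timezone


def build_event_entry(
    detail_type: str,
    detail: dict,
    source: str = "app.transactions",  # hypothetical source name
    bus_name: str = "platform-bus",    # hypothetical event bus name
) -> dict:
    """Build one entry for EventBridge PutEvents; Detail must be a JSON string."""
    payload = {
        "eventId": str(uuid.uuid4()),  # lets consumers de-duplicate redeliveries
        "occurredAt": datetime.now(timezone.utc).isoformat(),
        "data": detail,
    }
    return {
        "Source": source,
        "DetailType": detail_type,
        "Detail": json.dumps(payload),
        "EventBusName": bus_name,
    }


# With AWS credentials configured, the entry would be sent as:
#   boto3.client("events").put_events(Entries=[entry])
```

The producer only knows the bus and the event type. Which queues or services consume the event is decided by rules on the bus, so consumers can be added or replaced without changing the publishing service.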

Data migration employed a dual-running approach during the transition period, with a custom-built change data capture system tracking modifications in real-time. We implemented Terraform-based infrastructure as code with automated testing environments spun up for each pull request, ensuring infrastructure changes could be validated before production deployment.
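
The internals of the change data capture system are not described here, but the essential behavior of any dual-running setup can be sketched: change events are applied to the new store idempotently, and the two stores are periodically compared. The event shape (seq, op, key, row) and the function names below are assumptions, and plain dictionaries stand in for the databases.

```python
import hashlib
import json


def apply_change(target: dict, applied_upto: int, change: dict) -> int:
    """Apply one change event to target and return the new high-water mark.

    Events at or below applied_upto are skipped, so replays are harmless.
    """
    if change["seq"] <= applied_upto:
        return applied_upto  # duplicate or replayed event
    if change["op"] == "delete":
        target.pop(change["key"], None)
    else:  # inserts and updates are both treated as upserts
        target[change["key"]] = change["row"]
    return change["seq"]


def checksum(table: dict) -> str:
    """Order-independent fingerprint of a table, used to compare both stores."""
    canonical = json.dumps(table, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()
```

Replaying the same batch of events twice leaves the target unchanged, and matching checksums give a cheap signal that the legacy and new stores agree before traffic is cut over.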

Implementation

The implementation unfolded across six distinct phases over 18 months. Phase 1 focused on establishing the foundational infrastructure: VPC configuration with public and private subnets across three availability zones, implementing security groups and NACLs following zero-trust principles, and setting up centralized logging with CloudWatch and security monitoring via GuardDuty.
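
The subnet layout described above (public and private tiers across three availability zones) can be planned with the standard ipaddress module before any infrastructure code is written. The VPC CIDR, the /20 subnet size, and the availability zone names below are assumptions for illustration. The actual address plan is not given in this case study.

```python
import ipaddress
from typing import Dict, Sequence, Tuple


def plan_subnets(
    vpc_cidr: str,
    az_names: Sequence[str],
    new_prefix: int = 20,
) -> Dict[Tuple[str, str], str]:
    """Carve one public and one private subnet per availability zone out of a VPC CIDR."""
    network = ipaddress.ip_network(vpc_cidr)
    blocks = network.subnets(new_prefix=new_prefix)  # generator of equal-sized blocks
    plan = {}
    for tier in ("public", "private"):
        for az in az_names:
            try:
                plan[(tier, az)] = str(next(blocks))
            except StopIteration:
                raise ValueError("VPC CIDR is too small for the requested layout") from None
    return plan


if __name__ == "__main__":
    zones = ["ap-southeast-1a", "ap-southeast-1b", "ap-southeast-1c"]  # assumed region
    for (tier, az), cidr in plan_subnets("10.0.0.0/16", zones).items():
        print(f"{tier:8} {az}  {cidr}")
```

A /16 split into /20 blocks yields 16 subnets, so six are used and the rest remain free for later tiers such as isolated database subnets.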

Phase 2 involved migrating user authentication and profile services. We implemented Cognito for identity management with Lambda triggers for custom business logic, enabling multi-factor authentication and adaptive authentication policies. The user service transitioned from a single Oracle instance to a multi-AZ Aurora cluster with read replicas for improved performance.
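
As an example of custom business logic in a Cognito Lambda trigger, the sketch below shows a pre sign-up handler. Cognito passes the user's attributes under event["request"]["userAttributes"] and expects the event to be returned; raising an exception rejects the sign-up. The blocked-domain rule and its list are hypothetical and do not represent the client's actual policy.

```python
# Hypothetical blocklist used only for illustration.
BLOCKED_EMAIL_DOMAINS = {"example.invalid"}


def pre_sign_up_handler(event: dict, context: object) -> dict:
    """Cognito pre sign-up trigger: reject sign-ups from blocked email domains."""
    email = event["request"]["userAttributes"].get("email", "")
    _, separator, domain = email.rpartition("@")
    if not separator or not domain:
        raise ValueError("Sign-up rejected: a valid email address is required")
    if domain.lower() in BLOCKED_EMAIL_DOMAINS:
        raise ValueError("Sign-up rejected: email domain is not allowed")
    # Returning the event unchanged lets Cognito continue its normal confirmation flow.
    return event
```

Keeping the rule in a trigger means the identity service stays managed while policy decisions remain in code the team can version, test, and deploy like any other service.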

Phase 3 tackled the core transaction processing system. We architected an event-sourced model using EventBridge to capture all state changes, with Step Functions orchestrating complex multi-step transaction workflows. Payment processing integrated with Stripe and local payment providers through a unified adapter pattern, with all sensitive data encrypted using AWS KMS with customer-managed keys.
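
The unified adapter pattern mentioned above gives every payment provider the same interface, so the transaction workflow never depends on a specific provider's SDK. The following is a minimal sketch of that structure. The adapter classes are stand-ins that call no real provider, and all names, method keys, and reference formats are hypothetical.

```python
from dataclasses import dataclass
from typing import Mapping, Protocol


@dataclass(frozen=True)
class ChargeResult:
    provider: str
    reference: str
    approved: bool


class PaymentAdapter(Protocol):
    """Interface every provider adapter must satisfy."""

    def charge(self, amount_minor: int, currency: str, token: str) -> ChargeResult: ...


class FakeCardAdapter:
    """Stand-in for a card provider adapter; a real one would wrap the provider SDK."""

    def charge(self, amount_minor: int, currency: str, token: str) -> ChargeResult:
        return ChargeResult("card", f"card-{token}", approved=amount_minor > 0)


class FakeWalletAdapter:
    """Stand-in for a local e-wallet provider adapter."""

    def charge(self, amount_minor: int, currency: str, token: str) -> ChargeResult:
        return ChargeResult("wallet", f"wallet-{token}", approved=amount_minor > 0)


class PaymentRouter:
    """Selects an adapter by payment method and delegates the charge to it."""

    def __init__(self, adapters: Mapping[str, PaymentAdapter]) -> None:
        self._adapters = dict(adapters)

    def charge(self, method: str, amount_minor: int, currency: str, token: str) -> ChargeResult:
        try:
            adapter = self._adapters[method]
        except KeyError:
            raise ValueError(f"unsupported payment method: {method}") from None
        return adapter.charge(amount_minor, currency, token)
```

Adding a new local provider then means writing one adapter class and registering it with the router. Amounts are passed in minor units (for example cents) to avoid floating-point rounding in money calculations.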

The final phases addressed notification systems, analytics pipelines, and administrative interfaces. We implemented a real-time notification service using WebSocket APIs backed by DynamoDB for connection state, and built an analytics data lake using Kinesis Data Streams feeding into Redshift for business intelligence reporting.

Results

Upon completion, the migration delivered substantial improvements across all key metrics. System availability increased from 99.2% to 99.98%, with automated failover successfully tested during three planned maintenance windows and one unplanned regional outage. User-facing response times improved by 85%, with 95th percentile API response dropping from 2.3 seconds to 180 milliseconds.
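
Improvement percentages like these are easiest to interpret alongside the arithmetic behind them. The sketch below recomputes the relative reduction and speedup factor from a before and after latency; the function names are illustrative.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a metric dropped, e.g. 0.25 for a 25% reduction."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


def speedup(before: float, after: float) -> float:
    """How many times faster the new figure is compared with the old one."""
    if after <= 0:
        raise ValueError("after must be positive")
    return before / after


if __name__ == "__main__":
    p95_before_ms, p95_after_ms = 2300, 180  # 95th percentile figures quoted above
    print(f"{relative_reduction(p95_before_ms, p95_after_ms):.1%}")
    print(f"{speedup(p95_before_ms, p95_after_ms):.1f}x")
```

Applied to the 95th percentile figures, the drop from 2.3 seconds to 180 milliseconds is a reduction of roughly 92%, or about 12.8 times faster. The 85% figure presumably describes a different aggregate, such as overall user-facing response time, which is not broken down further here.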

Infrastructure costs decreased by 42% compared to the previous year, while compute capacity expanded significantly. The auto-scaling configuration now handles peak loads exceeding 80,000 concurrent users without performance degradation. Database query performance improved by 92%, with 99th percentile transaction processing times under 500 milliseconds.

Development velocity increased dramatically with the new CI/CD pipeline. Deployment frequency rose from one release every two weeks to an average of 15 deployments per day, with rollbacks consistently completing within 45 seconds. The microservices architecture enabled teams to deploy independently, reducing cross-team coordination overhead by 60%.

Metrics

Infrastructure cost reduction: 42%
System availability: 99.98%
API response time improvement: 85%
Deployment frequency: 15 per day (up from one release every two weeks)
Database performance: 92% faster queries
Mean time to recovery: 3.2 minutes
Peak concurrent users: 80,000+ (legacy platform struggled beyond 50,000)
Security compliance: Achieved SOC 2 Type II

Lessons Learned

Several key insights emerged during this complex migration. First, investing in comprehensive observability early pays dividends throughout the project. Our implementation of distributed tracing with X-Ray and custom CloudWatch dashboards enabled rapid debugging when issues arose during the dual-running phase.

Second, maintaining psychological safety within the legacy system teams proved essential. Including the original developers in the migration, rather than treating the legacy systems purely as technical debt to be eliminated, preserved institutional knowledge that was invaluable for edge cases.

Third, regulatory compliance cannot be retrofitted into cloud architectures. Early engagement with compliance stakeholders and iterative validation against requirements prevented costly rework in later phases. The automated compliance checking within our deployment pipeline now serves as a template for future projects.
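
Automated compliance checking in a pipeline usually comes down to evaluating planned resources against explicit rules and failing the build on any finding. The sketch below illustrates the idea with two example rules (encryption at rest, and no internet-facing ports other than 443). The resource format, the rules, and the function names are assumptions and do not represent the checks used in this project.

```python
from typing import Iterable, List, Tuple


def check_resource(resource: dict) -> List[str]:
    """Return human-readable findings for one planned resource."""
    findings = []
    name = resource.get("name", "<unnamed>")
    # Rule 1 (example): data stores must be encrypted at rest.
    if resource.get("type") in {"database", "bucket"} and not resource.get("encrypted_at_rest", False):
        findings.append(f"{name}: encryption at rest is not enabled")
    # Rule 2 (example): only HTTPS may be reachable from the whole internet.
    for rule in resource.get("ingress", []):
        if rule.get("cidr") == "0.0.0.0/0" and rule.get("port") != 443:
            findings.append(f"{name}: port {rule.get('port')} is open to the internet")
    return findings


def gate(resources: Iterable[dict]) -> Tuple[bool, List[str]]:
    """Pipeline gate: passes only when no resource produces a finding."""
    findings = [f for resource in resources for f in check_resource(resource)]
    return (not findings, findings)
```

Running such a gate on every pull request turns compliance requirements into failing builds instead of late audit findings, which is the practical meaning of not retrofitting compliance.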

Finally, the technical debt accumulated during hasty initial migrations can compound over time. Building in refactoring time during the migration process, rather than treating it as purely a lift-and-shift operation, resulted in cleaner, more maintainable systems that will serve the client well beyond the initial project scope.