# From Legacy to Cloud-Native: How We Helped a Fintech Startup Scale from 10K to 500K Users in 18 Months
When a fast-growing fintech startup hit a wall with its legacy monolithic architecture, it came to us with a clear mandate: scale without downtime. This case study walks through the end-to-end cloud migration, microservices redesign, and engineering process overhaul that took the team from daily outages and 4 AM deployment windows to 99.99% uptime and multiple deployments per day. Over 18 months of close collaboration, we extracted services using the Strangler Fig pattern, built an EKS-based internal developer platform, and reorganized the team around domain-driven boundaries. P95 API latency dropped from 3.2 seconds to 185 milliseconds, the user base grew tenfold, and the engineering team reclaimed about 15 hours per week previously spent on incident response. We cover the technical decisions, organizational challenges, performance gains, compliance constraints, and the lessons that any engineering lead can apply to their own scaling journey.
Tags: Case Study, cloud-native, microservices, AWS, scalability, fintech, Kubernetes, PCI compliance, engineering culture
## Overview
In early 2024, a Series B fintech startup approached Webskyne with a familiar but urgent problem. Their payment processing platform was built on a legacy monolithic Node.js application deployed on a single AWS EC2 instance. What started as a lean MVP had grown into a business processing over ₹200 crore in annual transactions. Their user base had exploded from 10,000 to nearly 500,000 active users—but their infrastructure hadn't evolved to keep pace.
The symptoms were classic: daily 503 errors during peak hours, deployment windows that required 4 AM wake-up calls, and a growing backlog of feature requests that the engineering team couldn't ship fast enough. The CTO made it clear—they needed to scale, and they needed to do it without losing their fast-paced engineering culture.
This is the story of how we partnered with their team over 18 months to transform their architecture, their delivery pipeline, and their engineering organization.
---
## The Challenge
When we first engaged, the situation was more complex than a simple infrastructure upgrade. Here's the full picture:
**Technical Debt:** The monolith had grown organically for three years. It had no clear module boundaries, mutable state shared across modules (everything ran in a single process), and database queries that spanned 50+ tables in a single transaction.
**Performance Bottlenecks:** Their primary PostgreSQL database was hitting 15,000 concurrent connections during peak hours. Response times had degraded from 200ms to over 3 seconds for critical payment workflows. The caching layer was non-existent.
**Operational Overhead:** Deployments took 45 minutes and required a full system restart. There was no blue-green deployment, no canary testing, and rollbacks meant restoring from a backup taken the previous night.
**Team Velocity:** The engineering team of 12 was spending 60% of their time on firefighting and maintenance. New features took 3-4 weeks to ship, and the on-call rotation was burning out senior engineers.
**Compliance Requirements:** As a regulated payment processor, the startup needed to maintain PCI DSS compliance while modernizing their infrastructure—a constraint that eliminated many "easy" cloud-native patterns.
---
## Our Goals
We established a clear, measurable set of objectives:
1. **Zero-Downtime Deployments**: Enable continuous deployment with automated rollbacks
2. **10x Performance Improvement**: Reduce P95 API response times from 3s to under 300ms
3. **50x Scalability**: Support 500,000+ concurrent users with room to grow
4. **Team Velocity**: Increase feature shipping speed by 3x while reducing operational overhead
5. **Compliance**: Maintain PCI DSS Level 2 certification throughout the migration
6. **Cost Optimization**: Right-size infrastructure to handle 50x traffic at 1.5x current cost
---
## Our Approach
We didn't want to be another vendor that dropped a blueprint and left. This required deep collaboration with their team. Our approach combined four key pillars:
### 1. Strangler Fig Pattern for Incremental Migration
We rejected the "big bang" rewrite. Instead, we used the Strangler Fig pattern—gradually routing traffic from the monolith to new services while keeping the system running. This minimized risk and allowed us to deliver incremental value from day one.
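The routing decision at the heart of the pattern can be sketched in a few lines. This is a minimal illustration, not the production router: the `StranglerRouter` class, the path prefixes, and the percentage-based rollout table are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class StranglerRouter:
    """Decides per request whether the monolith or a new service should answer."""

    # Hypothetical rollout table: path prefix -> percentage (0-100) of users
    # whose requests are sent to the extracted service.
    rollout: dict[str, int] = field(default_factory=dict)

    def _bucket(self, user_id: str) -> int:
        # Stable hash, so a given user always falls in the same bucket (0-99).
        digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
        return int(digest, 16) % 100

    def route(self, path: str, user_id: str) -> str:
        for prefix, percent in self.rollout.items():
            if path.startswith(prefix) and self._bucket(user_id) < percent:
                return "new-service"
        # Anything not explicitly migrated keeps going to the monolith.
        return "monolith"


# Notifications fully migrated, payments not yet started (illustrative values).
router = StranglerRouter(rollout={"/notifications": 100, "/payments": 0})
```

Raising a prefix's percentage in small steps is what lets traffic shift gradually while the monolith stays the default for everything else.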
### 2. Domain-Driven Design for Service Boundaries
Before writing any code, we spent two weeks in workshops with their product and engineering teams to map bounded contexts. This gave us clear boundaries for microservices: User Management, Transaction Processing, Ledger, Notifications, and Reporting.
### 3. Platform Engineering Foundation
Building individual services was pointless without the platform to run them. We built a complete internal developer platform (IDP) using:
- **Kubernetes on EKS**: For container orchestration
- **Terraform**: Infrastructure as code with complete environment reproducibility
- **ArgoCD**: GitOps-based continuous delivery
- **Prometheus + Grafana**: Full observability from day one
- **Datadog APM**: Distributed tracing across services
### 4. Organizational Change Management
Architecture changes fail when teams don't evolve with them. We embedded a senior engineer from our team into theirs for the first six months, ran regular architecture review sessions, and created a "platform guild" that met weekly to share patterns and decisions.
---
## Implementation Details
Here are the key technical decisions and how we executed them.
### Phase 1: Foundation (Months 1-3)
The first quarter was about establishing the platform and starting the strangulation.
**Infrastructure:** We migrated from a single EC2 instance to a multi-AZ EKS cluster with auto-scaling node groups. The cluster started with 3 worker nodes and could scale to 50+ nodes under load. We used Amazon Aurora (PostgreSQL-compatible), which gave us automated failover and read replicas without manual replication management.
**Observability:** We instrumented every service with OpenTelemetry traces, structured logging (JSON to CloudWatch), and custom business metrics. The team could finally answer "what's slow?" with data instead of guesswork.
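To make "structured logging" concrete, here is a minimal sketch of a JSON log formatter built on the Python standard library. The production services ran on Node.js, so treat this purely as an illustration of the idea; the field names (`ts`, `service`, `context`) are assumptions, not the schema we shipped.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Renders each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
        }
        # Values passed as extra={"context": {...}} become a record attribute.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


def build_logger(service: str) -> logging.Logger:
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()  # a log shipper would forward this stream
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


log = build_logger("notifications")
log.info("notification sent", extra={"context": {"channel": "sms", "latency_ms": 42}})
```

Because every line is machine-parseable, a question like "which channel is slow?" becomes a query over `latency_ms` rather than a grep through free text.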
**First Service Extraction:** The Notifications service was the obvious first candidate—it had minimal dependencies and clear boundaries. We extracted it in 6 weeks, using an anti-corruption layer that translated between the monolith's internal API and the new service's interface.
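An anti-corruption layer is essentially a translator that sits at the boundary. The sketch below shows the shape of one, under invented assumptions: the legacy field names (`usr_id`, `notif_type`, `tmpl`, `params`) and the numeric channel codes are hypothetical stand-ins, not the client's actual payload.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NotificationRequest:
    """Shape the new Notifications service expects (hypothetical)."""

    recipient_id: str
    channel: str
    template: str
    variables: dict


# Hypothetical mapping from the monolith's numeric codes to explicit names.
_CHANNELS = {1: "sms", 2: "email", 3: "push"}


def from_monolith(legacy: dict) -> NotificationRequest:
    """Translate a legacy payload so monolith quirks never leak into the new service."""
    channel_code = legacy["notif_type"]
    if channel_code not in _CHANNELS:
        raise ValueError(f"unknown legacy channel code: {channel_code}")
    return NotificationRequest(
        recipient_id=str(legacy["usr_id"]),
        channel=_CHANNELS[channel_code],
        template=legacy["tmpl"].strip().lower(),
        variables=dict(legacy.get("params") or {}),
    )


request = from_monolith(
    {"usr_id": 9012, "notif_type": 1, "tmpl": " PAYMENT_OK ", "params": {"amount": "500"}}
)
```

Keeping the translation in one place means the new service's model can stay clean, and the layer can be deleted once the monolith stops calling it.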
### Phase 2: Core Services (Months 4-9)
With the platform proven, we tackled the heart of the system.
**Transaction Processing:** This was the highest-stakes service. We couldn't afford to get payments wrong. Our approach:
- Implemented the Saga pattern for distributed transactions
- Added idempotency keys to all payment webhooks
- Built a dead-letter queue for failed transactions with automated retry logic
- Created a reconciliation service that compared internal ledger state with bank statements daily
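Two of the items above, idempotency keys and the dead-letter queue, can be sketched together. This is an in-memory illustration under stated assumptions: the `WebhookProcessor` class is hypothetical, a real system would need a durable store with an atomic "insert if absent" for keys, and retries would normally use backoff rather than an immediate loop.

```python
from typing import Callable, Optional


class WebhookProcessor:
    """Processes each webhook at most once and parks repeated failures."""

    def __init__(self, handler: Callable[[dict], str], max_attempts: int = 3):
        self.handler = handler
        self.max_attempts = max_attempts
        self.results: dict[str, str] = {}   # idempotency key -> stored result
        self.dead_letters: list[dict] = []  # events that exhausted their retries

    def process(self, idempotency_key: str, event: dict) -> Optional[str]:
        # A replayed webhook gets the stored result instead of a second charge.
        if idempotency_key in self.results:
            return self.results[idempotency_key]

        last_error: Optional[Exception] = None
        for _ in range(self.max_attempts):
            try:
                result = self.handler(event)
            except Exception as exc:  # sketch only: retry on any failure
                last_error = exc
                continue
            self.results[idempotency_key] = result
            return result

        # Out of attempts: keep the event for inspection and later replay.
        self.dead_letters.append(
            {"key": idempotency_key, "event": event, "error": str(last_error)}
        )
        return None
```

The key point is ordering: the idempotency check happens before any side effect, and a result is recorded only after the handler succeeds, so a failed attempt can be retried safely.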
The migration involved running both systems in parallel for 60 days, comparing outputs for every transaction. This parallel-run period gave us confidence to cut over.
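The comparison step of a parallel run can be as simple as a field-by-field diff. The sketch below assumes both systems can be called with the same transaction and return a dict; the fee calculations and field names are invented to show how a rounding difference would surface.

```python
from typing import Callable, Iterable


def compare_outputs(
    transactions: Iterable[dict],
    legacy: Callable[[dict], dict],
    candidate: Callable[[dict], dict],
    fields: tuple = ("status", "fee_paise"),
) -> list:
    """Run both implementations on every transaction and collect mismatches."""
    discrepancies = []
    for txn in transactions:
        old, new = legacy(txn), candidate(txn)
        diff = {f: (old.get(f), new.get(f)) for f in fields if old.get(f) != new.get(f)}
        if diff:
            discrepancies.append({"txn_id": txn["id"], "diff": diff})
    return discrepancies


# Hypothetical implementations that differ only in how a 2% fee is rounded.
def legacy_charge(txn: dict) -> dict:
    return {"status": "ok", "fee_paise": txn["amount_paise"] * 2 // 100}


def candidate_charge(txn: dict) -> dict:
    return {"status": "ok", "fee_paise": (txn["amount_paise"] * 2 + 50) // 100}


sample = [{"id": "t1", "amount_paise": 1000}, {"id": "t2", "amount_paise": 1075}]
report = compare_outputs(sample, legacy_charge, candidate_charge)
# report == [{"txn_id": "t2", "diff": {"fee_paise": (21, 22)}}]
```

Amounts are kept as integer paise so that the comparison never trips over floating-point noise, which would otherwise bury real discrepancies.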
**Database Strategy:** We moved from a single shared database to database-per-service for most services, with event-driven data synchronization using Kafka. For services that still needed read access to other services' data, we created materialized views updated via change-data-capture.
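A CDC-fed read model boils down to applying change events to a local copy. The sketch below is a simplified illustration: the event shape (`op`, `key`, `version`, `row`) is an assumption for the example and not the envelope of any particular CDC tool, and a Kafka consumer loop would sit in front of `apply`.

```python
class ReadModel:
    """Local, read-only copy of another service's rows, fed by change events."""

    def __init__(self) -> None:
        self.rows: dict[str, dict] = {}
        self.versions: dict[str, int] = {}

    def apply(self, event: dict) -> bool:
        """Apply one change event; return False if it was stale or a duplicate."""
        key, version = event["key"], event["version"]
        # Redelivered or out-of-order events must not overwrite newer state.
        if version <= self.versions.get(key, -1):
            return False
        if event["op"] == "delete":
            self.rows.pop(key, None)
        else:  # "insert" and "update" are assumed to carry the full row image
            self.rows[key] = event["row"]
        self.versions[key] = version
        return True


users = ReadModel()
users.apply({"op": "insert", "key": "u1", "version": 1, "row": {"tier": "basic"}})
users.apply({"op": "update", "key": "u1", "version": 2, "row": {"tier": "premium"}})
users.apply({"op": "update", "key": "u1", "version": 1, "row": {"tier": "basic"}})  # ignored
```

The version check is what makes at-least-once delivery safe: replaying the same event twice leaves the view unchanged.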
### Phase 3: Optimization and Scale (Months 10-18)
The final phase was about squeezing performance and preparing for growth.
**Caching Architecture:** We implemented a multi-layer caching strategy:
- **L1:** In-memory cache in application workers (user sessions, static config)
- **L2:** Redis Cluster for hot data (transaction rates, user preferences)
- **L3:** CDN for static assets and API responses where appropriate
Cache hit rates went from 0% to 94% for read-heavy endpoints.
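The L1 and L2 lookup order can be sketched as a read-through cache. In this illustration a plain dict stands in for Redis, and the `LayeredCache` class, TTL value, and hit counters are assumptions for the example rather than the production design.

```python
import time
from typing import Any, Callable


class LayeredCache:
    """Checks in-process memory first, then a shared cache, then the source."""

    def __init__(
        self,
        l2: dict,
        loader: Callable[[str], Any],
        l1_ttl_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.l1: dict[str, tuple[float, Any]] = {}  # key -> (expiry, value)
        self.l2 = l2          # stand-in for a shared cache such as Redis
        self.loader = loader  # falls through to the database or owning service
        self.l1_ttl = l1_ttl_seconds
        self.clock = clock
        self.hits = {"l1": 0, "l2": 0, "miss": 0}

    def get(self, key: str) -> Any:
        entry = self.l1.get(key)
        if entry is not None and entry[0] > self.clock():
            self.hits["l1"] += 1
            return entry[1]
        if key in self.l2:
            value = self.l2[key]
            self.hits["l2"] += 1
        else:
            value = self.loader(key)
            self.l2[key] = value
            self.hits["miss"] += 1
        # Refresh the short-lived local copy on every L2 hit or load.
        self.l1[key] = (self.clock() + self.l1_ttl, value)
        return value

    def hit_rate(self) -> float:
        total = sum(self.hits.values())
        return (self.hits["l1"] + self.hits["l2"]) / total if total else 0.0
```

A short L1 TTL keeps per-worker memory from serving stale data for long, while the shared layer absorbs most of the load that would otherwise reach the database.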
**Database Optimization:** Query performance improvements included:
- Adding composite indexes for common query patterns
- Adding read replicas (read traffic split 70/30 between primary and replicas)
- Introducing TimescaleDB for time-series transaction data
- Pooling connections via PgBouncer (reduced connection overhead by 80%)
**Auto-scaling:** We configured both cluster-level and service-level auto-scaling:
- Horizontal Pod Autoscaler based on CPU, memory, and custom metrics
- Request-rate-based scaling for predictable traffic patterns
- Scheduled scaling for known peak periods (month-end salary processing)
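For readers unfamiliar with how the Horizontal Pod Autoscaler picks a replica count, Kubernetes documents the core calculation as `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`, taking the largest result when several metrics are configured. The sketch below reproduces only that arithmetic; the real controller also handles missing metrics, unready pods, and stabilization windows, and the replica bounds and metric values here are illustrative assumptions.

```python
import math


def desired_replicas(
    current_replicas: int,
    metrics: list,
    min_replicas: int = 1,
    max_replicas: int = 50,
    tolerance: float = 0.1,
) -> int:
    """Simplified HPA arithmetic; `metrics` is a list of (current, target) pairs."""
    proposals = []
    for current_value, target_value in metrics:
        ratio = current_value / target_value
        if abs(ratio - 1.0) <= tolerance:
            # Close enough to target: this metric proposes no change.
            proposals.append(current_replicas)
        else:
            proposals.append(math.ceil(current_replicas * ratio))
    # With several metrics, the largest proposal wins, then bounds are applied.
    return max(min_replicas, min(max_replicas, max(proposals)))


# 4 pods, CPU at 90% against a 60% target, 800 req/s per pod against a 400 target.
replicas = desired_replicas(4, [(90, 60), (800, 400)])  # -> 8
```

Here the request-rate metric asks for 8 pods and CPU asks for 6, so the autoscaler would settle on 8. This is why a custom metric that tracks real demand often drives scaling earlier than CPU alone.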
### Phase 4: Decommissioning (Month 18+)
The monolith was never switched off in one cutover; it was gradually strangled. By month 18, it was handling only 2% of traffic (legacy admin endpoints and a few batch jobs), and the team has since decided to sunset it.
---
## Results
The numbers tell the story:
| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| API P95 Latency | 3,200ms | 185ms | **17.3x faster** |
| System Uptime | 99.2% | 99.99% | **Annual downtime: ~70h → ~53min** |
| Deployment Frequency | Weekly | Multiple per day | **10x+ more deploys** |
| Lead Time | 18 days | 2.5 days | **7.2x faster** |
| Change Failure Rate | 15% | 2% | **87% reduction** |
| Monthly Active Users | 50,000 | 500,000 | **10x growth** |
| Infrastructure Cost per Transaction | ₹0.45 | ₹0.08 | **82% reduction** |
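The figures in the Improvement column follow from simple ratios. As a quick way to recheck them, here is a small sketch; the helper names are ours, and the inputs are taken from the table above.

```python
def speedup(before: float, after: float) -> float:
    """How many times smaller the 'after' value is."""
    return before / after


def percent_reduction(before: float, after: float) -> float:
    """Relative drop from 'before' to 'after', in percent."""
    return (before - after) / before * 100


def annual_downtime_minutes(uptime_percent: float) -> float:
    """Downtime implied by an uptime percentage over a 365-day year."""
    return (100 - uptime_percent) / 100 * 365 * 24 * 60


latency_gain = round(speedup(3200, 185), 1)               # 17.3
lead_time_gain = round(speedup(18, 2.5), 1)               # 7.2
failure_rate_drop = round(percent_reduction(15, 2), 1)    # 86.7
cost_drop = round(percent_reduction(0.45, 0.08), 1)       # 82.2
downtime_at_four_nines = round(annual_downtime_minutes(99.99), 1)  # 52.6
```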
**Business Impact:**
- Customer churn dropped by 34% after the performance improvements
- Support tickets related to "app is slow" decreased by 78%
- The engineering team reclaimed 15 hours per week from incident response
- Product shipped 3x more features in the final quarter compared to the first quarter of the engagement
- The company raised their Series C six months after completing the migration, citing technical stability as a key investor requirement
---
## Key Metrics (Extended)
**Performance Metrics:**
- API P50 latency: 45ms → 12ms
- Database query P95: 2200ms → 95ms
- Cache hit rate: 0% → 94%
- Concurrent connections supported: 2,000 → 50,000+
- Throughput: 200 TPS → 12,000 TPS sustained
**Operational Metrics:**
- Mean Time to Recovery (MTTR): 4 hours → 18 minutes
- Deployment rollback time: 45 minutes → 90 seconds
- On-call pages per week: 15 → 2
- Incident post-mortems requiring executive summary: monthly → quarterly
**Business Metrics:**
- Payment success rate: 96.3% → 99.97%
- User-reported errors: 1,200/week → 45/week
- NPS score: 42 → 68
- Time-to-market for new features: 3-4 weeks → 1 week
---
## Lessons Learned
This engagement taught us as much as it taught the client. Here are the lessons we'd take into every similar engagement:
### 1. Parallel Runs Are Worth the Investment
Running both systems in parallel during the Transaction Processing migration added 60 days of overhead. But we caught 23 discrepancies during that period—bugs that would have cost thousands in financial reconciliation and customer trust. Parallel validation is non-negotiable for high-stakes systems.
### 2. Observability Isn't Optional
We couldn't optimize what we couldn't measure. The first action in month one wasn't code—it was instrumentation. Those metrics became the north star for every decision that followed.
### 3. Bounded Contexts Come Before Services
Two weeks of domain modeling saved us months of painful refactoring. Getting the service boundaries right upfront meant we didn't have migration-induced data inconsistencies.
### 4. Platform Teams Make Product Teams Faster
Investing in the internal developer platform paid for itself within three months. What looked like "infrastructure overhead" was actually a force multiplier for the entire engineering organization.
### 5. Organizational Change Matters as Much as Technical Change
The most technically sound migration fails without team buy-in. Our embedded engineer, weekly guild meetings, and transparent decision logs kept the team aligned and invested in the outcome.
### 6. Compliance Doesn't Have to Slow You Down
We disproved the myth that cloud-native and PCI compliance are incompatible. By building compliance into the platform from day one (encrypted data stores, audit logging, access controls), the annual audit went from a 3-week project to a 1-week review.
### 7. Start with the Team's Biggest Pain Point
We didn't start with the hardest technical problem. We started with the Notifications service—the team's most visible daily frustration. Early wins built the trust needed for harder migrations later.
---
## What Comes Next
This engagement is now entering its "mature cloud-native" phase. The team is:
1. Migrating remaining monolith components to event-driven architecture
2. Implementing AI-assisted anomaly detection for fraud prevention
3. Building a real-time analytics pipeline with Kafka and ClickHouse
4. Expanding the platform to support international markets (multi-region, multi-currency)
The infrastructure is no longer the constraint—the team's ambition is.
---
## Conclusion
Scaling from 10K to 500K users isn't just an infrastructure problem. It's a systems thinking problem. The fintech's success came from treating the entire stack—not just the code, but the team, the processes, the observability, and the culture—as something that needed to evolve together.
At Webskyne, we've run this play multiple times now, and the pattern is consistent: start with platform foundations, move incrementally with parallel validation, and invest equally in the people using the system. Technology scales when the organization is ready to scale with it.
---
## About This Engagement
- **Duration:** 18 months (ongoing platform evolution)
- **Team Size:** 2–4 Webskyne engineers embedded with the client team
- **Technologies:** AWS EKS, Kubernetes, Node.js, PostgreSQL, Kafka, Redis, Terraform, ArgoCD, Datadog
- **Client:** Leading Indian fintech startup (Series B → Series C)
- **Engagement Model:** Deep collaboration with embedded engineering partnership