From Monolith to Micro-FaaS: How We Rebuilt a Legacy Payment Platform to Handle 1.2 Million Transactions a Month

A ten-month engineering transformation turned a fragile, single-server payment processor into a distributed, event-driven system resilient to a 10x traffic spike. This case study traces the planning, the painful migrations, and the architecture decisions that cut the payment failure rate from 4.2% to 0.03% while reducing infrastructure costs by roughly 40%.

# From Monolith to Micro-FaaS: Transforming a Legacy Payment Platform

## Overview

In early 2024, a Series B fintech startup was processing nearly 250,000 transactions per month across Southeast Asia. Their infrastructure was a monolithic Node.js application hosted on a single EC2 instance, backed by a primary PostgreSQL database with a lagging read replica. One bad deployment or a Black Friday–level traffic surge could bring the entire payment stack down for hours.

The engagement began as a performance audit and ended as a full architectural modernization. Over ten months, the engineering team redesigned the system using serverless principles, event-driven choreography, and rigorous observability. By the end of the engagement, the platform was processing over 1.2 million transactions per month with 99.97% uptime and a far lower operational burden.

## Challenge

The client’s engineering team had grown quickly, but the codebase had not kept pace. Three core problems defined the engagement:

1. **Single point of failure:** The monolith ran on one EC2 instance behind an Application Load Balancer. A failed deployment or a hung process meant total downtime.
2. **Database bottlenecks:** Every transaction hit the primary database for both reads and writes. The read replica ran fifteen minutes behind and was used only for nightly analytics. Connection pooling was misconfigured, and long-running transactions regularly caused deadlocks.
3. **No observability:** Logs were written to local disk. There were no structured metrics and no distributed tracing, and alerts fired only after customers complained on social media.

The operational cost of these problems was concrete: in Q4 2023, a bad deployment caused 47 minutes of downtime during a regional holiday sale. The client estimated direct revenue loss at $180,000. The engineering team also reported a 35% burnout rate due to frequent 2 a.m. pages and on-call anxiety.

## Goals

The modernization was guided by five explicit goals:

- **Availability:** Achieve 99.95% uptime or better during peak loads.
- **Scalability:** Support a 10x increase in transaction volume without a proportional infrastructure cost increase.
- **Cost efficiency:** Reduce monthly cloud spend by at least 30%.
- **Developer experience:** Reduce deployment lead time from an average of three days to under four hours.
- **Observability:** Ensure any production incident can be diagnosed within five minutes without relying on log scraping.

Every technical decision was measured against these goals. When there were trade-offs, the team explicitly documented the reason for the exception.

## Approach

The engagement used a phased, risk-first methodology rather than a big-bang rewrite. The team spent three weeks on discovery before writing a single line of production code.

### Discovery Assessments

The first step was a deep-dive assessment across four dimensions:

- **System architecture:** Dependency graphs, call chains, and data flow diagrams
- **Observability maturity:** Existing logging, metrics, tracing, and alerting
- **Team processes:** CI/CD pipelines, incident response, on-call rotation
- **Business criticality:** Revenue impact per minute of downtime, transaction SLAs

The assessment revealed that 60% of the monolith’s API surface was read-only reporting endpoints that could be safely extracted. The remaining 40% handled payments, settlements, and compliance checks.
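
To make the business-criticality numbers concrete, the following sketch converts the 99.95% availability goal into a downtime budget and compares it with the Q4 2023 outage. The 47 minutes and $180,000 come from the case study. The 30-day month and the even spread of the loss across the outage are simplifying assumptions for illustration, and the function name is hypothetical.

```typescript
// Minutes of downtime permitted in a period for a given uptime target (0..1).
function downtimeBudgetMinutes(uptimeTarget: number, periodDays: number): number {
  const totalMinutes = periodDays * 24 * 60;
  return totalMinutes * (1 - uptimeTarget);
}

// Figures taken from the case study.
const outageMinutes = 47;
const estimatedLossUsd = 180000;

// Assumptions: a 30-day month, and loss spread evenly across the outage.
const monthlyBudgetMinutes = downtimeBudgetMinutes(0.9995, 30);
const lossPerMinuteUsd = estimatedLossUsd / outageMinutes;

console.log(`Budget at 99.95%: ${monthlyBudgetMinutes.toFixed(1)} min per 30 days`);
console.log(`Q4 2023 outage used ${(outageMinutes / monthlyBudgetMinutes).toFixed(1)}x that budget`);
console.log(`Average loss: about $${Math.round(lossPerMinuteUsd)} per minute`);
```

Under these assumptions the budget works out to about 21.6 minutes per 30 days, so the single 47-minute outage consumed roughly twice a full month's allowance, at an average of about $3,830 per minute.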

### Target Architecture

Based on the assessment, the team proposed a "middleware" layer between the legacy system and the new architecture:

- **API Gateway** for traffic routing, authentication, and rate limiting
- **Payment Service** as a serverless function handling transaction validation
- **Settlement Service** for asynchronous reconciliation jobs
- **Event Bus** using Amazon EventBridge for decoupled communication
- **Read Model** backed by DynamoDB for fast aggregations and dashboards
- **Distributed Tracing** with AWS X-Ray for full visibility

This approach allowed the team to migrate traffic gradually, feature flag by feature flag, keeping the legacy system live as a fallback.

## Implementation

The implementation was carried out in six two-week sprints, each of which shipped changes to production traffic.

### Sprint 1: Observability Foundation

Before any architectural changes, the team deployed centralized logging via Amazon CloudWatch, added structured JSON logs, and configured custom metrics for database connections, queue depths, and API latency. Alerts were rewritten to use anomaly detection instead of static thresholds. Within two weeks, the team had identified a previously unknown memory leak that had been causing weekly restarts.

### Sprint 2: API Gateway and Authentication

The API Gateway replaced the direct ALB-to-monolith path. Rate limiting, IP allowlisting, and OAuth2 token validation moved to the edge. This sprint reduced the attack surface and unblocked the rest of the migration, because all subsequent services could trust the gateway-validated identity.

### Sprint 3: Payment Service Extraction

The highest-traffic, lowest-risk path was extracted first: transaction validation and PCI-compliant tokenization. The Payment Service was implemented as an AWS Lambda function behind API Gateway. Because it was stateless, scaling was automatic.
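
As a rough illustration of what "stateless" means for a validation function, the sketch below keeps nothing between invocations: every decision depends only on the incoming event. The request fields (`amountMinor`, `currency`, `cardToken`), the `handler` signature, the status codes, and the response shape are all assumptions made for this example. The case study does not describe the actual payload or code, and tokenization is not shown.

```typescript
// Hypothetical request shape; field names are illustrative only.
interface PaymentRequest {
  amountMinor: number; // amount in minor currency units, e.g. cents
  currency: string;    // three-letter uppercase code, e.g. "SGD"
  cardToken: string;   // opaque token produced by a tokenization step
}

// Minimal response shape, loosely modeled on an HTTP proxy-style result.
interface HandlerResponse {
  statusCode: number;
  body: string;
}

// Pure validation: returns a list of problems, empty when the request is acceptable.
function validatePayment(input: unknown): string[] {
  if (typeof input !== "object" || input === null || Array.isArray(input)) {
    return ["body must be a JSON object"];
  }
  const req = input as Partial<PaymentRequest>;
  const errors: string[] = [];
  if (typeof req.amountMinor !== "number" || !Number.isInteger(req.amountMinor) || req.amountMinor <= 0) {
    errors.push("amountMinor must be a positive integer");
  }
  if (typeof req.currency !== "string" || !/^[A-Z]{3}$/.test(req.currency)) {
    errors.push("currency must be a three-letter uppercase code");
  }
  if (typeof req.cardToken !== "string" || req.cardToken.length === 0) {
    errors.push("cardToken must be a non-empty string");
  }
  return errors;
}

// Stateless handler: everything it needs arrives in the event; nothing is kept between calls.
function handler(event: { body: string | null }): HandlerResponse {
  let parsed: unknown;
  try {
    parsed = JSON.parse(event.body ?? "");
  } catch {
    return { statusCode: 400, body: JSON.stringify({ errors: ["body is not valid JSON"] }) };
  }
  const errors = validatePayment(parsed);
  if (errors.length > 0) {
    return { statusCode: 422, body: JSON.stringify({ errors }) };
  }
  return { statusCode: 200, body: JSON.stringify({ valid: true }) };
}

// Example call with a well-formed (made-up) request.
console.log(handler({ body: '{"amountMinor":1250,"currency":"SGD","cardToken":"tok_example"}' }).statusCode); // 200
```

Because the function holds no module-level mutable state, any number of copies can run side by side, which is the property that lets a function platform scale it without coordination.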

The team used the Strangler Fig pattern: 5% of traffic was routed to the new service first, then 25%, then 100%, with the legacy system as a dark fallback.

Database access was changed from synchronous ORM calls to event-driven commands. Instead of writing directly to PostgreSQL, the Payment Service emitted a `TransactionInitiated` event. Downstream consumers decided how to react.

### Sprint 4: Settlement and Reconciliation

Settlement logic was the most complex part of the system. It involved idempotent database writes, retry logic, and compensation transactions. The team implemented the Saga pattern using EventBridge rules and Lambda functions. Each settlement step emitted an event, and the next step reacted to it. If any step failed, a compensating event reversed the previous action.

This decoupled the settlement logic from the Payment Service entirely. The old monolith no longer needed to know about settlement retries or reconciliation rules.

### Sprint 5: Read Model and Analytics

The reporting endpoints were extracted into a separate read model using DynamoDB and Athena. Instead of hitting the primary PostgreSQL database, reporting queries read from pre-aggregated tables updated in near-real-time by EventBridge events. This eliminated the reporting load on the OLTP database and reduced report generation time from 45 seconds to under 200 milliseconds.

### Sprint 6: Deprecation

With the new system handling 100% of production traffic, the legacy monolith was retired. Database migrations were reversed: the old application database became a read-only archive. The team ran a final chaos test by terminating the old instance and confirming zero customer impact.

## Results

The transformation delivered measurable improvements across every goal.

**Availability:** Uptime improved from 99.1% to 99.97% over six months. The event-driven architecture eliminated several classes of failure that had plagued the monolith.
Service restarts became zero-downtime because Lambda cold starts were mitigated via provisioned concurrency during peak hours.

**Scalability:** The platform now handles 1.2 million transactions per month. During a marketing campaign, traffic spiked from 30 to 350 transactions per second. The system scaled automatically within seconds; no engineers were paged.

**Cost efficiency:** Monthly infrastructure spend dropped from $18,400 to $10,900, a 41% reduction. The serverless model meant the client paid only for compute during actual usage, not for idle capacity.

**Developer experience:** Deployment lead time dropped from an average of 72 hours to under 2 hours. The team adopted trunk-based development with feature flags, and every service had its own CI/CD pipeline. Pull requests were merged the same day they were opened.

**Observability:** Mean time to detection (MTTD) fell from 90 minutes to 4 minutes. Mean time to recovery (MTTR) fell from 4 hours to 12 minutes. Engineers could trace a transaction end-to-end in X-Ray and see exactly where latency was introduced.

## Metrics

| Metric | Before | After | Change |
|--------|--------|-------|--------|
| Monthly transactions | 250,000 | 1,200,000 | +380% |
| Uptime | 99.1% | 99.97% | +0.87 pp |
| Payment failure rate | 4.2% | 0.03% | -99.3% |
| P95 latency | 820 ms | 140 ms | -83% |
| Monthly cloud spend | $18,400 | $10,900 | -41% |
| Deployment frequency | 2 per month | 60 per month | +2900% |
| Lead time for changes | 72 hours | 2 hours | -97% |
| MTTR | 4 hours | 12 minutes | -95% |
| Developer satisfaction (survey NPS) | +12 | +68 | +56 points |

These metrics were tracked continuously using CloudWatch dashboards shared across engineering, product, and finance. Transparent dashboards meant the entire company could see progress in real time.

## Lessons Learned

The engagement produced insights that shaped future consulting engagements and the client’s internal engineering culture.

### 1. Start with Observability, Not Architecture

Many teams jump straight into new technology. This engagement showed that observability is the prerequisite for any modernization. Without good logs, metrics, and traces, you cannot safely refactor because you cannot measure whether the new system behaves like the old one. The first sprint was the highest-ROI sprint of the entire project.

### 2. Extract by Business Value, Not by Technical Convenience

The team could have started with the database layer or the authentication layer. Instead, they extracted the highest-business-value, lowest-risk path first: transaction validation. This built trust with stakeholders and provided early wins that funded the remaining sprints.

### 3. The Strangler Fig Pattern Works

Running old and new systems in parallel is intimidating. The Strangler Fig pattern (routing incremental traffic percentages to the new system while keeping the old as a dark fallback) reduced fear and provided instant rollback capability. The team never needed to roll back.

### 4. Event-Driven Architecture Requires Contract Discipline

Moving from synchronous calls to events required strict schema governance. The team introduced event versioning from day one and enforced backward compatibility. Without this discipline, small teams can easily create event-driven monoliths where every producer knows about every consumer.

### 5. Serverless Is Not Free, but It Matches Demand

The serverless model shifted the cost curve from capacity planning to usage-based pricing. For workloads with spiky traffic, the savings were dramatic. For steady-state workloads, reserved instances or Fargate Spot may be more economical. The lesson is to model your traffic pattern before choosing a compute model, not after.

### 6. Include the On-Call Team in Design Decisions

The team included the on-call engineer in every architecture review.
This gave them direct ownership of the alerting and runbooks they would inherit. Post-migration, the on-call rotation became dramatically quieter, and team morale improved.

### 7. Chaos Engineering Validates Confidence

The final chaos test (terminating the old monolith with the new system at 100% traffic) gave the team and leadership tangible proof that the migration was complete. Without a formal validation step, teams often leave legacy systems running indefinitely "just to be safe," accruing technical debt.

## Conclusion

Architectural modernization is often treated as a technical exercise. This case study demonstrates that it is equally a business, cultural, and operational transformation. By grounding the work in explicit goals, phased delivery, and rigorous observability, the team turned a brittle, expensive monolith into a resilient, cost-effective distributed system.

The client is now exploring further optimizations, including predictive capacity planning using machine learning on historical transaction patterns. The increased confidence in their infrastructure has allowed them to pursue aggressive regional expansion without fearing platform failure.

---

*Case study prepared by Webskyne editorial for internal review and client presentation.*
