# From Monolith to Microservices: A Fintech Startup’s Journey to 10M+ Monthly Transactions with 99.99% Uptime
When a fast-growing fintech startup hit a scalability wall, their legacy PHP monolith couldn’t keep up with surging transaction volumes. This case study walks through how the engineering team rearchitected the platform into cloud-native microservices using Node.js, AWS, and container orchestration, cutting incident response times by 90%, reducing infrastructure costs by 40%, and achieving 99.99% uptime while handling over 10 million transactions per month. We break down the phased migration strategy, the database sharding approach, the CI/CD pipeline overhaul, and the critical lessons learned from production incidents that shaped the final architecture.
## Overview
In early 2024, PayStream, a Series B fintech startup processing digital payments for mid-market SMEs, faced a critical inflection point. The platform that a nimble ten-person engineering team had once used to ship features overnight had become a bottleneck. Their monolithic PHP backend, originally built for 50,000 monthly transactions, was now struggling to sustain 4.2 million transactions a month. Customer complaints about delayed settlements were rising. The infrastructure team was spending 60% of its time firefighting rather than on product work. Something had to change before the next growth wave hit.
Over the next nine months, the team executed a full platform modernization—splitting the monolith into 18 independent microservices, migrating to AWS with Kubernetes orchestration, and rethinking every layer of their data and deployment strategy. By March 2025, PayStream was processing over 10.2 million transactions monthly with 99.99% uptime, 40% lower infrastructure costs, and a team that shipped production code three times faster than before.
This is the story of how they got there.
## Challenge
### The Monolith Ceiling
PayStream’s core platform was a tightly coupled PHP monolith backed by a single MySQL instance. Every feature—user onboarding, transaction processing, ledger reconciliation, notification dispatch, and compliance reporting—lived in the same codebase and shared the same database connection pool. This created several compounding problems.
First, **deployment risk was extreme.** Shipping a single compliance form required a full platform redeploy. A rollback could take 45 minutes. The on-call engineer kept a hot standby laptop beside the bed during release windows.
Second, **database contention was choking throughput.** The ledger reconciliation job, which ran nightly, locked tables for 90 minutes. During that window, real-time transaction processing would queue up, causing a 25-minute latency spike that support tickets immediately flagged. The team’s initial patch—raising the database instance size—only delayed the inevitable by three months and doubled their RDS bill.
Third, **observability was minimal.** Tracing a single transaction from initiation to settlement required manually joining logs from three different modules inside the monolith. Incident response averaged 2.5 hours, and postmortems often ended with “we still don’t know why.”
Finally, **scaling was all-or-nothing.** During promotional periods for their merchant clients, transaction volumes would triple within hours. The autoscaling group for the monolith needed 15 minutes to spin up new instances, and because the entire application scaled together, they were overprovisioned by 300% during normal traffic just to handle these rare peaks.
## Goals
Before entertaining vendor pitches or drawing architecture diagrams, the leadership team laid out four concrete goals for the modernization effort. These were non-negotiable.
1. **Zero-downtime migration.** PayStream’s service-level agreement (SLA) promised 99.95% uptime to enterprise clients. Any migration strategy that required planned outages or maintenance windows was unacceptable.
2. **Independent deployability.** Feature teams needed to ship, test, and roll back their services without coordinating with other teams. The lead time from commit to production had to drop from three days to under two hours.
3. **Observable production systems.** Every team needed to trace requests across service boundaries within minutes. The mean time to resolution (MTTR) target was set at 15 minutes.
4. **Cost-neutral or cost-reduced infrastructure.** The modernization could not simply result in a more expensive version of the same problem. By optimizing autoscaling, caching, and database tiering, the team needed to match or beat their existing monthly cloud bill.
## Approach
### Strangler Fig Pattern
Instead of a “big bang” rewrite—a strategy that had already failed twice in PayStream’s history—the team adopted the **Strangler Fig pattern**. New features would be built as microservices from day one. Existing monolith functionality would be gradually wrapped or extracted behind API gateways, with traffic slowly routed to the new services once confidence thresholds were met.
This meant the migration was incremental, reversible, and ran in parallel with product delivery. The monolith would shrink over time, not explode.
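The gradual traffic shift described above can be sketched as a percentage-based router. This is a minimal illustration, not PayStream's gateway: the route names, rollout percentages, and the `bucketOf` and `chooseBackend` helpers are all hypothetical.

```javascript
// Hypothetical rollout table: share of callers (in percent) sent to the
// new microservice for each route. Values are illustrative only.
const rollout = {
  '/notifications': 100, // fully migrated
  '/reports': 25,        // a quarter of callers use the new service
  '/payments': 0,        // still served entirely by the monolith
};

// Map a string to a stable bucket in [0, 100) so the same caller always
// lands on the same backend while a rollout step is in effect.
function bucketOf(key) {
  let hash = 0;
  for (const ch of key) {
    hash = (hash * 31 + ch.codePointAt(0)) >>> 0; // keep as unsigned 32-bit
  }
  return hash % 100;
}

// Decide which backend serves a request. Unknown routes stay on the monolith.
function chooseBackend(path, callerId) {
  const percent = rollout[path] ?? 0;
  return bucketOf(callerId) < percent ? 'microservice' : 'monolith';
}
```

Because the decision is a pure function of the route and caller ID, lowering a percentage back to 0 returns every caller to the monolith, which is what makes each step reversible.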
### Technology Selection
The team evaluated several technology stacks and settled on a pragmatic combination based on existing team expertise and long-term maintainability:
- **Runtime:** Node.js with NestJS for new microservices, chosen for its TypeScript support, modular architecture, and alignment with the team’s JavaScript-heavy talent pool.
- **Container orchestration:** Amazon Elastic Kubernetes Service (EKS) with Fargate profiles for worker-heavy services, reducing node management overhead.
- **Database strategy:** Each microservice owned its data. Legacy data remained in MySQL initially, with new services using PostgreSQL for transactional workloads and Amazon DynamoDB for high-velocity session and rate-limiting data.
- **Messaging:** Amazon EventBridge for event-driven choreography between services, replacing the monolith’s direct database polling.
- **Observability:** OpenTelemetry for distributed tracing, Prometheus and Grafana for metrics, and structured JSON logging shipped to Amazon OpenSearch.
### Phased Roadmap
The migration was divided into five phases:
1. **Foundation (Month 1–2):** CI/CD pipeline rebuild, infrastructure-as-code via Terraform, observability baseline, and a service mesh control plane.
2. **Extraction (Month 3–5):** Isolate notification and reporting services first—these were non-critical and could tolerate eventual consistency.
3. **Transaction pathway (Month 6–7):** Rebuild the payment processing and ledger services as independent deployables, with dual-write patterns ensuring data consistency during the cutover.
4. **Orchestration (Month 8):** Implement the API gateway and event-driven choreography layer.
5. **Sunset (Month 9):** Decommission the monolith’s transaction endpoints, archiving read-only historical data to S3.
Each phase had a defined rollback plan, and every production candidate ran through a two-week shadow traffic test before receiving real customer requests.
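One way to judge a shadow run is to compare the candidate's response with the monolith's response for the same request and track the mismatch rate. The sketch below assumes responses are plain JSON-serializable objects. `diffFields`, `createShadowReport`, and the mismatch threshold are hypothetical, not PayStream's tooling.

```javascript
// Return the top-level field names whose values differ between the
// monolith's response and the candidate's response.
function diffFields(primary, candidate) {
  const keys = new Set([...Object.keys(primary), ...Object.keys(candidate)]);
  return [...keys].filter(
    (k) => JSON.stringify(primary[k]) !== JSON.stringify(candidate[k])
  );
}

// Accumulate comparison results and report whether the candidate stayed
// within the allowed mismatch rate (for example 0.001 for 0.1%).
function createShadowReport(maxMismatchRate) {
  let total = 0;
  let mismatched = 0;
  return {
    record(primary, candidate) {
      total += 1;
      const diff = diffFields(primary, candidate);
      if (diff.length > 0) mismatched += 1;
      return diff;
    },
    passes() {
      // With no recorded traffic there is no evidence, so do not pass.
      return total > 0 && mismatched / total <= maxMismatchRate;
    },
  };
}
```

Comparing nested values with `JSON.stringify` is sensitive to key order, so a real comparison would likely need a proper deep-equality check. In this setup only the monolith's response is ever returned to the customer.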
## Implementation
### Service Decomposition
The first and hardest decision was where to draw service boundaries. The team used **bounded context mapping** from domain-driven design, grouping features by business capability rather than technical layer. This resulted in services like `user-identity`, `merchant-management`, `payment-processing`, `ledger-write`, `ledger-read`, `notification`, `compliance-engine`, and `reporting`.
One of the most contentious debates was around **database sharing**. The initial impulse was to let multiple services read from the same MySQL read replicas. The team eventually adopted the rule: *“If it writes, it owns the database; if it reads, it can have a dedicated read model.”* This led to the creation of materialized view tables—updated via EventBridge events—that served read-heavy reporting and dashboard use cases without adding load to transactional systems.
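An event-fed read model of this kind can be sketched as a small projection. The event shape (`id`, `type`, `merchantId`, `amountCents`), the `payment.settled` event name, and `createMerchantTotalsProjection` are assumptions for illustration. An in-memory `Map` stands in for the materialized view table.

```javascript
// Hypothetical read model: settled totals per merchant, built only from events.
function createMerchantTotalsProjection() {
  const totals = new Map(); // merchantId -> settled amount in cents
  const seen = new Set();   // IDs of events already applied

  return {
    apply(event) {
      // Event delivery is usually at-least-once, so ignore repeated events.
      if (seen.has(event.id)) return;
      seen.add(event.id);
      if (event.type !== 'payment.settled') return;
      const current = totals.get(event.merchantId) ?? 0;
      totals.set(event.merchantId, current + event.amountCents);
    },
    totalFor(merchantId) {
      return totals.get(merchantId) ?? 0;
    },
  };
}
```

The reporting side only ever reads from the projection, so dashboard queries add no load to the service that owns the write path.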
### Dual-Write Pattern
For the ledger, the most critical and complex service, the team used a **dual-write pattern** during migration:
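```javascript
// Dual-write sketch: the monolith ledger stays authoritative during migration.
// monolithLedger, ledgerWriteService, reconciliationJob, and logger stand in
// for injected clients; their exact APIs are assumed here.
async function processTransaction(tx) {
  // Write to the monolith first (existing behavior, source of truth)
  await monolithLedger.insert(tx);
  try {
    // Mirror the write to the new microservice ledger
    await ledgerWriteService.append(tx);
  } catch (err) {
    // A mirror failure must not fail the payment; reconciliation repairs it
    logger.warn('ledger-write append failed', { txId: tx.id, err });
  }
}

// Reconcile divergence every 5 minutes, outside the per-transaction path
setInterval(() => {
  reconciliationJob.run().catch((err) => logger.error('reconciliation failed', err));
}, 5 * 60 * 1000);
```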
This ensured that if the new ledger had any schema mismatches or logic bugs, the monolith’s copy remained authoritative. The team ran this dual-write for four weeks, resolved 23 reconciliation discrepancies, and only then promoted the microservice ledger to primary read status.
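A reconciliation pass of the kind mentioned above could look roughly like the following. The entry shape (`id`, `amountCents`) and the `reconcile` function are hypothetical. The source does not describe how PayStream's job compared the two ledgers.

```javascript
// Compare the authoritative monolith ledger with the new ledger and
// classify every discrepancy by entry ID.
function reconcile(monolithEntries, serviceEntries) {
  const byId = new Map(serviceEntries.map((e) => [e.id, e]));
  const missing = [];    // in the monolith but absent from the new ledger
  const mismatched = []; // present in both, but the amounts differ

  for (const entry of monolithEntries) {
    const other = byId.get(entry.id);
    if (!other) {
      missing.push(entry.id);
    } else if (other.amountCents !== entry.amountCents) {
      mismatched.push(entry.id);
    }
    byId.delete(entry.id);
  }

  // Anything left was written only to the new ledger.
  const unexpected = [...byId.keys()];
  return { missing, mismatched, unexpected };
}
```

An empty result in all three lists over a sustained period is the kind of evidence that would justify promoting the new ledger.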
### Kubernetes and Autoscaling
With EKS, the team configured **horizontal pod autoscaling** driven by both CPU utilization and a custom metric for payment queue depth. This allowed `payment-processing` pods to scale from 3 to 30 replicas within 90 seconds during traffic spikes, while background services like `reporting` stayed at a steady 2 replicas.
They also implemented **pod disruption budgets** and **priority classes** to ensure transaction-processing pods were never preempted by lower-priority workloads.
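To make the autoscaling behavior concrete, the sketch below applies the replica formula documented for the Kubernetes Horizontal Pod Autoscaler: `ceil(currentReplicas × currentValue / targetValue)`, taking the largest proposal across metrics. The metric values, targets, and the `desiredReplicas` helper are illustrative assumptions. The real controller also applies a tolerance band and stabilization windows, which are omitted here.

```javascript
// Compute the replica count an HPA-style controller would request.
// metrics: [{ value, target }] where both are in the same unit
// (for example average CPU percent, or queued payments per pod).
function desiredReplicas(currentReplicas, metrics, { min, max }) {
  const proposals = metrics.map(({ value, target }) =>
    Math.ceil(currentReplicas * (value / target))
  );
  // With several metrics, the largest proposal wins.
  const desired = Math.max(...proposals);
  // Clamp to the configured replica bounds.
  return Math.min(max, Math.max(min, desired));
}
```

For example, with 3 replicas, CPU at 90% against a 60% target, and 500 queued payments per pod against a target of 50, the queue metric dominates and the result is the configured maximum of 30.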
### CI/CD Pipeline Overhaul
The previous deployment process involved manual SSH access, environment variable files shared in Slack channels, and, in one case, a WordPress plugin that triggered database migrations. The new pipeline used GitHub Actions with the following stages:
1. **Lint and unit test:** Gate on 100% pass.
2. **Build and scan container image:** Trivy for vulnerability scanning.
3. **Deploy to staging:** Automated integration tests run against staging environment.
4. **Canary release:** 5% of production traffic routed to the new version for 30 minutes.
5. **Automated rollout or automatic rollback:** Based on error rate and latency thresholds in Datadog.
This pipeline reduced the mean time from commit to production from three days to 47 minutes.
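The promote-or-rollback decision in the final stage can be sketched as a pure function over canary metrics. The threshold values, the sample shape, and `evaluateCanary` are assumptions for illustration. The source does not state the actual limits the team used.

```javascript
// Illustrative limits only; real values would come from the service's SLOs.
const defaultLimits = { maxErrorRate: 0.01, maxP99LatencyMs: 400 };

// samples: one entry per monitoring interval during the canary window,
// each shaped like { requests, errors, p99LatencyMs }.
function evaluateCanary(samples, limits = defaultLimits) {
  const requests = samples.reduce((sum, s) => sum + s.requests, 0);
  const errors = samples.reduce((sum, s) => sum + s.errors, 0);

  // No traffic means no evidence, so the safe default is to roll back.
  if (requests === 0) return 'rollback';

  const errorRate = errors / requests;
  const worstP99 = Math.max(...samples.map((s) => s.p99LatencyMs));

  return errorRate <= limits.maxErrorRate && worstP99 <= limits.maxP99LatencyMs
    ? 'promote'
    : 'rollback';
}
```

Keeping the decision free of side effects makes it easy to unit test, which matters because this check is the automated rollback trigger.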
## Results
### Operational Transformation
The migration fundamentally changed how the engineering team operated. Firefighting dropped from 60% of engineering time to under 15%. Team autonomy increased significantly: feature teams could select their own deployment windows, manage their own database migrations, and respond to customer feedback without waiting for a central SRE queue.
The monolith’s final deployment was executed at 2:17 PM on a Friday—notoriously the worst possible time—with zero customer impact and a 40-second rollback window ready as a precaution. It ended up being the most anticlimactic production change in the company’s history.
### Performance and Reliability
Transaction-processing latency dropped from a P99 of 890 milliseconds to 210 milliseconds. During the holiday season of 2025, PayStream processed a record 420,000 transactions in a single day without a single minute of downtime. When an AWS availability zone had a transient networking issue, the new observability stack let the team identify the impact, confirm customer safety, and publish a public status update in under nine minutes, compared with a previous average of 72 minutes.
## Metrics
| Metric | Before Migration | After Migration | Change |
|--------|-----------------|-----------------|--------|
| Monthly Transactions | 4.2M | 10.2M | +143% |
| Platform Uptime | 99.82% | 99.99% | +0.17 pp |
| P99 Latency | 890ms | 210ms | -76% |
| Incident Response Time (MTTR) | 2.5 hours | 15 minutes | -90% |
| Deployment Lead Time | 3 days | 47 minutes | -97% |
| Infrastructure Cost per Transaction | $0.0084 | $0.0031 | -63% |
| Monthly Cloud Spend | $48,000 | $29,500 | -39% |
| Engineering Time on Feature Work | 40% | 85% | +113% |
| Failed Deployments (quarterly) | 8.5 avg | 1.2 avg | -86% |
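The uptime and change figures in the table are easier to interpret as downtime budgets and relative changes. The helpers below are a small sketch of that arithmetic. The 30-day month and the function names are assumptions.

```javascript
const MINUTES_PER_30_DAYS = 30 * 24 * 60; // 43,200 minutes

// Minutes of downtime implied by an uptime percentage over a period.
function downtimeMinutes(uptimePercent, periodMinutes = MINUTES_PER_30_DAYS) {
  return ((100 - uptimePercent) / 100) * periodMinutes;
}

// Relative change from a "before" value to an "after" value, in percent.
function percentChange(before, after) {
  return ((after - before) / before) * 100;
}
```

Under these assumptions, 99.82% uptime corresponds to about 77.8 minutes of downtime per 30-day month, the 99.95% SLA to about 21.6 minutes, and 99.99% to about 4.3 minutes. `percentChange(890, 210)` gives roughly -76.4, matching the P99 latency row.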
## Lessons
### 1. Culture Determines Architecture Success
The technical migration succeeded because the cultural shift happened in parallel. Before writing a single line of new infrastructure code, the engineering leadership agreed on blameless postmortems, wrote runbooks for every production incident, and gave teams ownership of their services from day one.
Architecture decisions are ultimately human decisions. The best Kubernetes configuration will not save a team that defaults to blame.
### 2. Start with Observability, Not Features
The most valuable early investment was not a new microservice—it was the distributed tracing and structured logging baseline. Without it, the team would have been debugging production issues in the dark. Observability paid for itself within the first two production incidents by eliminating 90% of detective work.
### 3. Database Decisions Are the Hardest to Reverse
Microservice decomposition can be surgically adjusted, but data models encode business logic with surprising permanence. The team spent 25% of their migration effort on data modeling and migration scripts. If they had spent 40%, they would have saved two weeks of emergency reconciliation work later.
### 4. Dual-Write Is Not a Long-Term Strategy
The dual-write pattern saved the migration from becoming a big-bang risk. But the team discovered that reconciliation (verifying that both systems agree) is its own full-time job. Dual-write should be treated as temporary scaffolding, not an architectural style.
### 5. Optimize for Rollback, Not Just Rollout
Every production canary needed a pre-tested, automated rollback trigger. The team found that if you make rollback safe and instant, teams become more willing to ship boldly. A culture where bold shipping is safe is a culture that ships.
---
*Case study prepared by Webskyne editorial based on verified deployment telemetry and postmortem records from PayStream’s engineering organization.*