From Monolith to Microservices: How a Fintech Startup Scaled to 10 Million Users on AWS

PayStream, a Bangalore-based fintech startup, faced a critical infrastructure crisis in late 2023. Its original monolithic Node.js backend, built in a six-week sprint to capture market share, was buckling under production load, with deployment risks, database contention, scaling inefficiencies, and team friction threatening both engineering velocity and business growth. During the Diwali 2023 peak, connection pool exhaustion caused 2,300 failed payout attempts, and the breaking point came in November 2023, when a routine schema migration caused 47 minutes of downtime across all payment flows. This case study documents the full 18-month architectural transformation of PayStream's infrastructure, from a fragile monolith to a Kubernetes-native microservices architecture on AWS. Over the course of the migration, the engineering team decomposed 47 business capabilities into domain-aligned services, introduced event-driven communication via Kafka, re-architected the data layer from a single PostgreSQL instance to a distributed system, and implemented comprehensive observability with distributed tracing. The results were decisive: 99.95% monthly uptime, a 3.75x increase in peak transaction throughput, a 42% reduction in per-transaction infrastructure costs, and the operational capacity to support 10x user growth. This is the story of how PayStream rebuilt its foundation, not as a marketing exercise but as a genuine engineering transformation that changed how the entire organization builds software.

Overview

PayStream is a Bangalore-based fintech startup that provides instant payouts, wallet infrastructure, and payment orchestration for gig economy platforms, e-commerce marketplaces, and SaaS companies across India and Southeast Asia. Founded in 2021, the company grew rapidly—by early 2024 it was processing over 2 million transactions monthly for 800+ business customers. But beneath that growth lay a critical infrastructure problem: the original monolithic backend, built in a six-week sprint to capture market share, was buckling under production load.

This case study walks through the full journey of that transformation. It covers the technical debt that forced the migration, the strategic decisions made along the way, the step-by-step extraction of 47 business capabilities into services, and the measurable outcomes that validated the investment. Throughout, the emphasis is on the engineering decisions: not just what was built, but why each choice was made.

Challenge

The monolith was a classic Node.js Express application with a single PostgreSQL database. All business logic—user management, wallet ledger, payout orchestration, KYC verification, notification dispatch, webhook delivery, and admin tooling—lived in one codebase with shared models and direct database access. Deployment was a single Docker container behind an NGINX reverse proxy, scaled horizontally behind an AWS Application Load Balancer.

In practice, this created a cascade of problems:

Deployment risk: Any change to the payout flow required deploying the entire application. A bug in the admin dashboard could theoretically disrupt live payment processing.
Database contention: The PostgreSQL instance handled OLTP for the ledger, OLAP-style reporting queries, and full-text search for transaction history. At peak Diwali 2023, connection pool exhaustion caused 2,300 failed payout attempts in a 90-minute window.
Scaling inefficiency: The team had to over-provision the monolith because every instance carried all capabilities. During low-traffic overnight hours, 80% of the deployed compute was idle.
Team friction: With 18 backend engineers working on the same deployable unit, merge conflicts were weekly events. Code review turnaround averaged 3.2 days.
Observability gaps: Tracing a failed transaction required grepping through aggregated logs. There was no per-capability latency breakdown, no targeted autoscaling, and no ability to roll back a single feature independently.

The breaking point came in November 2023, when a routine schema migration on the shared database caused 47 minutes of downtime across all payment flows. The incident post-mortem was unambiguous: the monolith had become an organizational and technical liability that was constraining both engineering velocity and business growth.

Goals

The leadership team and engineering leads aligned on four concrete goals for the migration:

Decouple deployment boundaries so that individual business capabilities could be released, rolled back, and scaled independently.
Eliminate single points of failure in both the application and data layers, targeting 99.95% monthly uptime.
Reduce infrastructure cost per transaction by at least 30% through right-sized compute and storage.
Improve engineering team autonomy by assigning clear domain ownership: each of the four squads would own end-to-end delivery for a set of capabilities.
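
To make the 99.95% uptime target concrete, it helps to convert it into an error budget. The short sketch below does that arithmetic, assuming a 30-day month; the function name is illustrative and not part of PayStream's tooling.

```python
def downtime_budget_minutes(uptime_target_pct: float, days: int = 30) -> float:
    """Minutes of downtime allowed over `days` at the given uptime target."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - uptime_target_pct / 100)


budget = downtime_budget_minutes(99.95)
print(f"{budget:.1f} minutes per 30-day month")
```

Under that assumption the budget is roughly 21.6 minutes a month, so the 47-minute outage of November 2023 would by itself have exceeded two full months of budget.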

A non-goal was also explicitly stated: do not rebuild the product. The migration had to be zero-downtime for existing users, with no feature regression.

Approach

The team chose an incremental strangler-fig pattern rather than a big-bang rewrite. The monolith would remain in production throughout the migration, with new capabilities built as services from day one and existing capabilities extracted one by one.
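
The core mechanic of a strangler-fig migration is a routing decision per capability: send traffic to the monolith until a replacement service has taken over. The sketch below is one minimal way to model that decision; the class, capability names, and service names are hypothetical and do not describe PayStream's actual routing layer.

```python
from dataclasses import dataclass, field
from typing import Dict

MONOLITH = "monolith"


@dataclass
class StranglerRouter:
    """Routes each capability to the monolith until it has been extracted."""

    # Maps an extracted capability to the service that now owns it.
    extracted: Dict[str, str] = field(default_factory=dict)

    def cut_over(self, capability: str, service: str) -> None:
        self.extracted[capability] = service

    def roll_back(self, capability: str) -> None:
        # Rolling back simply restores the monolith as the backend.
        self.extracted.pop(capability, None)

    def backend_for(self, capability: str) -> str:
        return self.extracted.get(capability, MONOLITH)


router = StranglerRouter()
router.cut_over("transaction-history", "ledger-query-service")
print(router.backend_for("transaction-history"))
print(router.backend_for("kyc-verification"))
```

Keeping the monolith as the default backend means an unfinished or rolled-back extraction degrades to the old behavior rather than to an error.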

The high-level approach had five pillars:

1. Domain-Driven Design to Define Boundaries

Before writing any infrastructure code, the team spent three weeks in event-storming workshops to map existing business capabilities. The outcome was a bounded-context diagram with 47 distinct capabilities grouped into six domains: Identity & KYC, Wallet & Ledger, Payout Orchestration, Notifications, Webhook Delivery, and Admin & Compliance.

This upfront investment paid immediate dividends: it eliminated the "where does this logic belong?" debates that had slowed development for months, and it gave each squad a clear ownership boundary.

2. Service Mesh and Infrastructure Standardization

The team standardized on Amazon EKS for container orchestration, Istio for service-to-service communication (with mutual TLS and automatic retries), and AWS Secrets Manager for credential distribution. Every new service followed a common contract: REST for synchronous external APIs, Kafka for internal async events, and OpenTelemetry for distributed tracing from day one.

3. Data Architecture: Identifying the Right Decomposition

Data was the hardest problem. The team applied the "database-per-service" principle selectively:

Services with strict consistency needs (ledger, KYC) got dedicated PostgreSQL instances with read replicas.
Event-sourced services (webhook delivery, notification dispatch) used Kafka with compacted topics as the source of truth.
Reporting and analytics were extracted into a dedicated Amazon Redshift cluster, fed by CDC from the operational databases via AWS DMS.

This hybrid approach preserved ACID guarantees where they mattered while allowing other services to move faster with eventual consistency.

4. API Gateway and Backend-for-Frontend

An Amazon API Gateway instance became the single entry point for mobile apps and web dashboards. Behind it, backend-for-frontend (BFF) services aggregated calls to the microservices layer, preventing the client apps from needing to orchestrate multiple service calls. This also allowed the mobile team to iterate independently of backend changes.
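
A backend-for-frontend endpoint typically fans out to several services concurrently and returns one combined payload. The following is a minimal sketch of that aggregation using asyncio; the two fetch functions are hypothetical stand-ins for real service calls, and the field names are invented for illustration.

```python
import asyncio


async def fetch_wallet_balance(user_id: str) -> dict:
    # Hypothetical stand-in for a call to a wallet service.
    await asyncio.sleep(0)
    return {"balance_paise": 125_000}


async def fetch_recent_payouts(user_id: str) -> dict:
    # Hypothetical stand-in for a call to a payout service.
    await asyncio.sleep(0)
    return {"payouts": [{"id": "p1", "status": "SETTLED"}]}


async def dashboard_view(user_id: str) -> dict:
    # Both downstream calls run concurrently; the client makes one request.
    wallet, payouts = await asyncio.gather(
        fetch_wallet_balance(user_id),
        fetch_recent_payouts(user_id),
    )
    return {"user_id": user_id, **wallet, **payouts}


print(asyncio.run(dashboard_view("u-42")))
```

Because the aggregation lives server-side, a change in how the downstream services are split does not require a new client release.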

5. Observability and Incident Response

The team invested heavily in observability from the start: structured JSON logs shipped to Amazon OpenSearch Service, metrics to Amazon Managed Service for Prometheus, and distributed traces visualized in Jaeger. Every service had a runbook, a health-check endpoint, and a defined SLO, targeting <200ms p95 latency for synchronous payment flows and <5 minutes of allowed downtime per quarter for async workflows.
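
A p95 latency target is easy to state and easy to misread, so a small worked check can help. The sketch below computes a nearest-rank percentile over a batch of latency samples and compares it with the 200ms threshold; it is an illustration only, since a production setup would normally derive percentiles from histogram metrics rather than raw samples.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the value at rank ceil(p% of n) in sorted order."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def meets_latency_slo(latencies_ms, threshold_ms=200, p=95):
    return percentile(latencies_ms, p) < threshold_ms


# 95 fast requests and 5 slow ones: the p95 value is still a fast request.
latencies = [120] * 95 + [450] * 5
print(percentile(latencies, 95), meets_latency_slo(latencies))
```

Shifting just one more request into the slow group (94 fast, 6 slow) moves the p95 value to 450ms and fails the check, which is why tail latency deserves its own alerting.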

Implementation

The migration ran in four distinct phases over 18 months, with parallel tracks for infrastructure, service extraction, and data migration.

Phase 1: Foundation (Months 1–3)

The first step was building the platform layer: EKS cluster, Istio mesh, CI/CD pipelines with GitHub Actions, shared libraries for logging and tracing, and the API gateway. The team also created an anti-corruption layer in the monolith that intercepted writes to critical tables (users, wallets, transactions) and published domain events to Kafka. This gave the new services real-time data without direct coupling.
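
The interception idea can be sketched as a thin wrapper around an existing write path that emits a domain event once the write succeeds. Everything named below (the interceptor class, the `WalletCredited` event, the in-memory topic standing in for Kafka) is an assumption made for illustration, not PayStream's code.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class InMemoryTopic:
    """Stand-in for a Kafka topic so the sketch runs without a broker."""

    messages: List[bytes] = field(default_factory=list)

    def publish(self, key: str, value: dict) -> None:
        self.messages.append(json.dumps({"key": key, "value": value}).encode())


class WalletWriteInterceptor:
    """Wraps the monolith's wallet write and publishes an event after it succeeds."""

    def __init__(self, write_row: Callable[[str, int], None], topic: InMemoryTopic):
        self._write_row = write_row
        self._topic = topic

    def credit(self, wallet_id: str, amount_paise: int) -> None:
        self._write_row(wallet_id, amount_paise)  # existing monolith write path
        self._topic.publish(
            wallet_id,
            {"type": "WalletCredited", "wallet_id": wallet_id, "amount_paise": amount_paise},
        )


table: Dict[str, int] = {}


def write_row(wallet_id: str, amount_paise: int) -> None:
    table[wallet_id] = table.get(wallet_id, 0) + amount_paise


topic = InMemoryTopic()
WalletWriteInterceptor(write_row, topic).credit("w-1", 500)
print(table, len(topic.messages))
```

Note that writing to the database and then publishing is not atomic: a crash between the two steps loses the event. Patterns such as a transactional outbox or change data capture are common ways to close that gap.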

By the end of Phase 1, the platform could run production services, but no capability had yet been extracted from the monolith. To validate the platform and train the squads on the new patterns, two services, Notification Dispatch and Webhook Delivery, were built from scratch rather than extracted. Both went live in Week 10 with zero incidents.

Phase 2: High-Value Extractions (Months 4–9)

The team identified the most impactful services to extract first. The Wallet and Ledger domain was the backbone of the business and the source of most production incidents, so it was extracted in three incremental steps:

Read path split: Reporting queries were moved to the Redshift cluster, and the new Ledger Query Service began handling transaction history lookups via read replicas, reducing primary database load by 35%.
Write path split: New wallet operations were routed to the Ledger Write Service, which wrote to its own PostgreSQL instance. The monolith was modified to write to both databases temporarily.
Monolith cutover: Once data consistency was verified over a 30-day shadow period, read traffic was fully migrated to the Ledger Query Service, and the monolith stopped serving transaction history. The write cutover happened two weeks later after reconciliation reports showed zero discrepancies.
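
The reconciliation step during the dual-write period amounts to comparing the two stores and listing every mismatch. The sketch below shows that comparison on per-wallet balances; representing each ledger as a dictionary of balances is a simplifying assumption, and a real report would more likely compare individual ledger entries.

```python
from typing import Dict, List, Optional, Tuple

Discrepancy = Tuple[str, Optional[int], Optional[int]]


def reconcile(monolith: Dict[str, int], service: Dict[str, int]) -> List[Discrepancy]:
    """Return (wallet_id, monolith_balance, service_balance) for every mismatch."""
    discrepancies: List[Discrepancy] = []
    for wallet_id in sorted(monolith.keys() | service.keys()):
        old, new = monolith.get(wallet_id), service.get(wallet_id)
        if old != new:
            # None means the wallet is missing entirely on that side.
            discrepancies.append((wallet_id, old, new))
    return discrepancies


monolith_ledger = {"w-1": 500, "w-2": 1200, "w-3": 75}
service_ledger = {"w-1": 500, "w-2": 1100}
print(reconcile(monolith_ledger, service_ledger))
```

An empty list is the "zero discrepancies" condition that gates the write cutover.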

This careful approach meant the migration was transparent to users. There was no downtime, no transaction loss, and no visible performance degradation at any point.

Phase 3: Mass Extraction and Async Hardening (Months 10–15)

With the platform proven, the team accelerated. The Identity & KYC, Payout Orchestration, and Admin & Compliance domains were extracted using the same pattern: introduce an anti-corruption layer, build the new service, shadow traffic, validate, and cut over. Kafka was used extensively to decouple cross-domain communication—for example, a successful KYC verification now published a KYCVerified event that the Wallet service consumed to unlock full account capabilities, eliminating the synchronous HTTP call that had been a latency bottleneck.
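
The KYCVerified flow can be illustrated with a publisher and a subscriber that know nothing about each other beyond the event type. This is a minimal sketch with an in-memory bus standing in for Kafka; the event fields and class names are assumptions, and handlers here run inline, whereas with a real broker they would run in a separate consumer process.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List, Set

Handler = Callable[[Dict[str, str]], None]


class InMemoryEventBus:
    """Stand-in for Kafka topics so the sketch runs without a broker."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event: Dict[str, str]) -> None:
        # The publisher only knows the event type, not who consumes it.
        for handler in self._handlers[event["type"]]:
            handler(event)


class WalletService:
    def __init__(self) -> None:
        self.full_access_users: Set[str] = set()

    def on_kyc_verified(self, event: Dict[str, str]) -> None:
        self.full_access_users.add(event["user_id"])


bus = InMemoryEventBus()
wallet = WalletService()
bus.subscribe("KYCVerified", wallet.on_kyc_verified)
bus.publish({"type": "KYCVerified", "user_id": "u-42"})
print(wallet.full_access_users)
```

The KYC side never calls the Wallet service directly, so adding another consumer of the same event requires no change to the publisher.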

During this phase, the team also introduced chaos engineering experiments using AWS Fault Injection Simulator. The team deliberately killed Ledger Write Service pods, disrupted Kafka brokers, and saturated the API gateway to validate autoscaling rules and circuit breakers. Each experiment surfaced gaps—most critically, that the retry logic in the Payout Orchestration service was not idempotent, which would have caused duplicate transfers in a real outage. Fixing this before it caused a production incident was a direct win from the chaos practice.
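
The duplicate-transfer risk comes from retrying a request whose first attempt actually succeeded but whose response was lost. One common remedy is an idempotency key, sketched below with a simulated flaky gateway; the gateway class and its behavior are hypothetical, and the source does not say which fix the team chose.

```python
class FlakyPayoutGateway:
    """Hypothetical downstream that records at most one transfer per idempotency key."""

    def __init__(self):
        self.transfers = {}  # idempotency key -> (account, amount_paise)
        self.calls = 0

    def transfer(self, idempotency_key, account, amount_paise):
        self.calls += 1
        # A repeated key is a no-op, so a retry cannot move money twice.
        self.transfers.setdefault(idempotency_key, (account, amount_paise))
        if self.calls == 1:
            # Simulated failure: the transfer was recorded but the response was lost.
            raise TimeoutError("response lost")
        return idempotency_key


def pay_with_retry(gateway, payout_id, account, amount_paise, attempts=3):
    # The payout ID doubles as the idempotency key, so every retry reuses it.
    for attempt in range(1, attempts + 1):
        try:
            return gateway.transfer(payout_id, account, amount_paise)
        except TimeoutError:
            if attempt == attempts:
                raise


gateway = FlakyPayoutGateway()
pay_with_retry(gateway, "payout-001", "acct-9", 50_000)
print(gateway.calls, len(gateway.transfers))  # prints "2 1"
```

Two calls reach the gateway, but only one transfer is recorded, which is the property the chaos experiment showed was missing.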

Phase 4: Decommissioning and Hardening (Months 16–18)

The final phase had three tracks:

Monolith decommissioning: After all capabilities were extracted, the monolith was reduced to a thin compatibility shim that proxied legacy webhook calls. That shim was disabled after 60 days of zero traffic, and the monolith container was removed from the EKS cluster entirely.
Cost optimization: Kubernetes Horizontal Pod Autoscalers were tuned using historical traffic patterns, and spot instances were introduced for non-critical async workers. Redshift reserved instances replaced on-demand clusters.
Operational maturity: Runbooks were updated, on-call rotations were formalized, and a quarterly architecture review process was established to prevent new services from becoming monoliths in their own right.
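
When tuning Horizontal Pod Autoscalers, it helps to keep the core scaling rule in view. The Kubernetes documentation gives it as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), with no change when the ratio is within a tolerance (0.1 by default) of 1.0. The sketch below reproduces only that rule; the real controller also handles stabilization windows, missing metrics, and pod readiness, and the replica bounds here are arbitrary examples.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Simplified form of the documented HPA calculation."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: do not scale
    wanted = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, wanted))


# 4 pods at 90% average CPU with a 60% target scale out to 6.
print(desired_replicas(4, 90, 60))
# 4 pods at 63% are within tolerance of the 60% target, so nothing changes.
print(desired_replicas(4, 63, 60))
```

Replaying historical traffic through a function like this is one inexpensive way to sanity-check a target value before applying it to a live workload.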

Results

Eighteen months after starting, PayStream's engineering organization and infrastructure looked fundamentally different. The quantitative results speak for themselves:

Metrics

Availability: Monthly uptime improved from 99.2% (pre-migration) to 99.95%, meeting the original target. The number of P1 incidents dropped from 3.4 per quarter to 0.5 per quarter.
Transaction throughput: Peak TPS increased from 3,200 to 12,000, a 3.75x improvement, thanks to independent scaling of the Ledger Write Service and Payout Orchestration Service. The architecture now handles the highest-traffic days (salary disbursements, festival shopping peaks) without queuing or connection exhaustion.
Infrastructure cost: Per-transaction infrastructure cost dropped by 42%, beating the 30% target. The combination of right-sized EKS nodes, spot instance workers for async tasks, and reserved Redshift capacity created meaningful savings at scale.
Deployment frequency: Deployments per week increased from 1.2 to 8.4. Mean time to recover from failed deployments dropped from 47 minutes to 12 minutes, because rollbacks targeted individual services rather than the entire platform.
Engineering velocity: Mean review turnaround dropped from 3.2 days to 18 hours. With domain-aligned squads, knowledge silos shrank dramatically. Engineers reported 60% less time spent on coordination overhead in internal surveys.
Latency: End-to-end p95 latency for the payment initiation flow decreased from 820ms to 340ms, primarily because the Ledger Write Service no longer competed with reporting queries for database resources.
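
The headline ratios follow directly from the before and after figures listed above. The snippet below reproduces that arithmetic so the derived numbers can be checked; the helper names are illustrative.

```python
def fold_change(before: float, after: float) -> float:
    return after / before


def pct_reduction(before: float, after: float) -> float:
    return (before - after) / before * 100


print(f"peak throughput:  {fold_change(3200, 12000):.2f}x")
print(f"p95 latency:      {pct_reduction(820, 340):.1f}% lower")
print(f"recovery time:    {pct_reduction(47, 12):.1f}% lower")
print(f"P1 incidents:     {pct_reduction(3.4, 0.5):.1f}% lower")
print(f"deploy frequency: {fold_change(1.2, 8.4):.1f}x")
```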

These metrics were achieved without any regression in product functionality. PayStream's business customers—800+ merchants and platforms—saw no disruption during the migration, and new customer onboarding actually increased by 18% during the extraction period, driven by improved reliability and faster feature delivery.

Lessons Learned

The migration was not without setbacks. Several lessons emerged that shaped both the final architecture and the team's approach to future work:

1. Eventual Consistency Is a Product Decision, Not Just an Architecture Decision

Early in the migration, the team defaulted to eventual consistency for cross-domain communication because it was technically simpler. This led to a situation where a user's KYC status appeared updated in the Admin portal before the Wallet service had processed the event, causing support tickets and user confusion. The lesson was that consistency models must be discussed with product and support teams, not just chosen by engineers. The eventual fix was an explicit "read model projection" that gave each service its own eventually consistent view of foreign domain data, with a UI indication when synchronization lag exceeded two seconds.
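
A read model projection with a staleness indicator might look like the sketch below. The two-second threshold comes from the text; everything else, including how lag is measured (here, the gap between the newest known source event and the newest applied event), is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict

LAG_WARNING_SECONDS = 2.0  # threshold mentioned in the text


@dataclass
class KycProjection:
    """A service-local, eventually consistent view of KYC status."""

    statuses: Dict[str, str] = field(default_factory=dict)
    last_applied_at: float = 0.0

    def apply(self, event: dict) -> None:
        self.statuses[event["user_id"]] = event["status"]
        self.last_applied_at = max(self.last_applied_at, event["occurred_at"])

    def read(self, user_id: str, latest_source_time: float) -> dict:
        # Lag is how far the projection trails the newest known source event.
        lag = max(0.0, latest_source_time - self.last_applied_at)
        return {
            "status": self.statuses.get(user_id, "UNKNOWN"),
            "lag_seconds": lag,
            "show_sync_notice": lag > LAG_WARNING_SECONDS,
        }


projection = KycProjection()
projection.apply({"user_id": "u-42", "status": "VERIFIED", "occurred_at": 100.0})
print(projection.read("u-42", latest_source_time=101.0))
print(projection.read("u-42", latest_source_time=103.5))
```

The first read trails the source by one second and shows no notice; the second trails by 3.5 seconds and sets the flag the UI would use to warn the user.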

2. The Strangler Fig Requires Strict Governance

The incremental approach worked well, but it also created a prolonged hybrid state where the monolith and services coexisted for over a year. Without strict rules—specifically, a ban on new features in the monolith and a 90-day SLA for extracting any capability touched in production—the migration could have stalled. The team created a "monolith feature freeze" policy two months in, which was initially unpopular but ultimately kept the project on track.

3. Observability Must Come Before Scale

The investment in OpenSearch, Prometheus, and Jaeger during Phase 1 was controversial. Several engineers argued the platform was "too small" to need that level of observability. The turning point came during Phase 2, when a latency spike in the Ledger Query Service was diagnosed in 20 minutes using trace data rather than the multi-hour log-scouring sessions that had been routine with the monolith. Observability paid for itself twice over in that single incident.

4. Database Migrations Are People Problems

The technical work of migrating data was largely straightforward. The harder problem was managing the knowledge transition—ensuring that squads understood their new data ownership, that the DBA team adapted to a distributed model, and that backup and recovery procedures were defined for 15+ databases instead of one. The team created a new "Data Reliability" guild to own this domain, meeting weekly to share patterns and runbooks. That social infrastructure was as important as the technical migration.

Looking Forward

The microservices architecture is now the foundation for PayStream's next phase: expanding into cross-border payouts, launching a white-label wallet SDK, and building a merchant-facing analytics dashboard. Each of these products will build on the service boundaries established during the migration.

The team is also evaluating serverless alternatives for specific async workloads—particularly webhook delivery and notification dispatch—to further reduce operational overhead. And a new effort to introduce GraphQL federation across service boundaries is underway, aiming to give client applications a unified query API without re-introducing the coupling that made the original monolith so brittle.

The lesson PayStream's engineering team takes from this work is not that microservices are universally better. It is that the right architectural choice depends on the scale of the problem you are solving. At two million transactions a month, a monolith is perfectly reasonable. At twelve million, with engineering teams distributed across time zones and product requirements that demand independent deployment cycles, a decomposed architecture is not just a technical upgrade—it is an organizational necessity.