Webskyne

20 May 2026 · 14 min read

How a Fintech Startup Migrated from Monolith to Microservices: A 9-Month Journey That Cut Downtime by 94%

When NeoVault, a fast-growing payments processing startup, hit the ceiling of its monolithic architecture — 40-second P99 latencies, weekly release windows, and a support team drowning in incident tickets — leadership made a bold call: rebuild the core platform on microservices before customer confidence dried up. This case study unpacks every major decision, trade-off, and breakthrough from that nine-month migration.

# How a Fintech Startup Migrated from Monolith to Microservices: A 9-Month Journey That Cut Downtime by 94%

## Overview

NeoVault is a Bangalore-based payments processing company that moved from startup sprint mode to enterprise-grade compliance obligations very quickly. Founded in 2021, it handled remittances, UPI payments, and merchant settlements, all running on a Django monolith backed by a single PostgreSQL instance.

By Q2 2025, the engineering team of 28 was spending more time firefighting than building. A one-hour deployment window every Sunday was sacred territory. A single bad deployment could take the entire payment processing pipeline down for 30 minutes or more. That Sunday window was the biggest fixture in the engineering calendar.

The decision to migrate to microservices was not made lightly. For three months, the leadership debated whether to invest in new infrastructure or double down on the existing codebase. The tipping point came when a cascading failure during a festival sale window affected 12,000 merchants and cost an estimated ₹2.3 crore in lost transaction fees and remedial credits.

What followed was a nine-month, cross-functional operation involving three engineering squads, a Site Reliability Engineering (SRE) team, and direct board oversight.

---

## The Challenge

### Technical Debt at Scale

The timeline of debt tells its own story:

| Year | Event | Consequence |
|------|-------|-------------|
| 2021 | Monolith launched | Fast initial velocity |
| 2022 | First 10k merchants | Queue system broke; emergency hack |
| 2023 | Regulatory requirements added | Compliance logic intermingled with business logic |
| 2024 | 5x transaction throughput | P99 latency hit 40 seconds; patchwork of caches |
| 2025 | 12k merchant outage | ₹2.3 crore loss; board mandate for change |

Every time the team tried to patch performance (adding caching layers, async workers, database indexes), the problem shifted to a different part of the system.
The cascading nature of failures made each emergency fix feel like kicking the can further down the road.

### Operational Constraints

The team had three hard constraints that shaped every architectural decision:

1. **Zero unacceptable downtime**: operating at PCI-DSS Level 1, the team set sub-second failover as a hard requirement.
2. **Regulatory audit trail**: every ledger transaction needed immutable, verifiable records.
3. **90-day roadmap**: the migration had to deliver visible value before the December compliance window.

### Team and Process Gaps

Beyond the codebase itself, there were structural issues. Engineers held siloed knowledge about the most critical parts of the system. Senior developers who understood the legacy codebase were moving on to other roles and were still in the middle of knowledge transfer. The CI/CD pipeline ran on a single self-hosted runner that became a single point of failure during releases. There was no dedicated SRE function until month four of the migration plan.

---

## Goals

### Measurable Targets

| Metric | Baseline (Q1 2025) | Target (Q4 2025) |
|--------|-------------------|------------------|
| P99 API latency | 40 seconds | Under 1 second |
| Deployment frequency | Once per week | Twice per day |
| Change failure rate | 18% | Below 2% |
| MTTR | 45 minutes | Under 5 minutes |
| Incident count | 47 per quarter | Below 10 per quarter |

### Strategic Objectives

- **Decouple** payment processing, merchant onboarding, settlement, and notification logic.
- **Enable** independent scaling of the most traffic-heavy services.
- **Empower** engineering squads to own end-to-end delivery of their services.
- **Build** observability, tracing, and alerting from day one of the new platform.
- **Maintain** uninterrupted service for existing merchants throughout the transition.

---

## Approach

### Phase 0: Foundation and Discovery (Weeks 1–4)

Before writing a single line of service code, the team spent four weeks doing the work that most migration projects skip: understanding what was actually happening in the system.

#### Domain-Driven Design Workshop

Two external facilitators ran an Event Storming session over three days, with 20 engineers and product managers mapping out the full end-to-end lifecycle of a payment transaction, from the moment a customer tapped Pay on their phone to when the merchant received funds in their bank account.

The output was a **bounded context map** that revealed four capability islands that were already loosely coupled in practice:

- **Payments Core**: transaction creation, authorization, and settlement
- **Merchant Management**: onboarding, KYC verification, and configuration
- **Ledger & Reconciliation**: double-entry bookkeeping and audit logging
- **Notification & Webhooks**: SMS, email, and webhook delivery to merchant systems

#### Architecture Selection

After evaluating three patterns, the team selected **strangler-fig incremental migration** rather than a big-bang rewrite. The key arguments for the incremental approach were safety and cost: a full cutover would have required a multi-month freeze on feature development, while running the monolith and the new services in parallel let the business continue to ship. An API gateway built on Kong acted as the traffic-routing layer, gradually shifting traffic from the old implementation to the new one until the old code path could be retired.
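
The gradual traffic shift described above can be illustrated with a small routing function. This is a minimal sketch of the idea only, not Kong configuration and not NeoVault's actual routing logic; the backend names, the `Route` function, and the choice to bucket by a routing key such as a merchant ID are all assumptions made for illustration.

```go
package main

import "hash/fnv"

// Backend names are placeholders chosen for this sketch.
const (
	Legacy = "monolith"
	Modern = "payments-core"
)

// bucket maps a routing key (for example a merchant ID) to a stable
// value in the range [0, 100).
func bucket(key string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key)) // Write on a hash.Hash never returns an error
	return h.Sum32() % 100
}

// Route sends a key to the new service when its bucket falls below
// percentNew, and to the monolith otherwise. Because the bucket is
// derived from the key, a given merchant stays on the same backend
// until the percentage is raised past its bucket.
func Route(key string, percentNew uint32) string {
	if bucket(key) < percentNew {
		return Modern
	}
	return Legacy
}
```

Calling `Route("merchant-42", 10)` would place roughly one key in ten on the new service. Hashing the key rather than choosing at random keeps each merchant's experience consistent while the percentage is ramped up, and rolling back is a matter of lowering `percentNew` again.
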
#### Technology Choices

| Layer | Technology | Rationale |
|-------|-----------|-----------|
| API Gateway | Kong | Plugin ecosystem; existing team familiarity |
| Service Runtime | Go (most services), Node.js (webhooks) | Strong concurrency model for ledger; Node for event pipelines |
| Message Broker | Apache Kafka | High-throughput event distribution; built on already-familiar log tech |
| Data Stores | PostgreSQL + VitalDB (time-series) | Financial transactions need strong consistency; metrics time-series |
| Observability | OpenTelemetry + SigNoz | Open standard tracing; managed Prometheus back-end |
| Container Orchestration | Kubernetes (EKS) | Auto-scaling; native secrets management |

---

## Implementation

### Month 1–2: Infrastructure and Platform Layer

The first two months had zero user-facing deliverables. The engineering team built the platform that everything else would sit on: the Kubernetes cluster networking, secret management, identity-based access (IAM roles per service), the CI/CD pipeline with a blue/green deployment strategy, and the observability stack. The CI pipeline ran integration tests against locally run Kafka and PostgreSQL containers, with tests starting at the service boundary rather than at shared mocks.

This phase also delivered the shared SDKs (structured logging, metrics helpers, retry middleware, and a circuit breaker with sane defaults) so individual squads could focus on business logic instead of infrastructure boilerplate.

### Month 3–5: Payments Core Service

Payments Core was chosen as the first service to migrate. It contained the highest-value functionality and the most immediate performance pain.

The team adopted a **branch-by-abstraction** pattern: for each new microservice, the API gateway routed traffic to both the old and the new implementation at the same time, with result comparisons running in shadow mode. This made the transition safe and reversible.
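
The shadow-mode comparison can be sketched as a thin wrapper that calls both implementations, always answers with the legacy result, and records any disagreement. This is a simplified illustration under stated assumptions: the `Result`, `Handler`, and `Shadow` types are hypothetical, the calls run synchronously for clarity, and amounts are held as integer paise. A production version would more likely run the candidate call asynchronously so it cannot add latency to the live path.

```go
package main

import "sync"

// Result is a hypothetical, simplified payment outcome.
// AmountPaise is an integer count of paise (1 rupee = 100 paise).
type Result struct {
	Status      string
	AmountPaise int64
}

// Handler processes one request and returns its outcome.
type Handler func(requestID string) (Result, error)

// Shadow runs a candidate implementation next to the legacy one.
// Callers only ever see the legacy response.
type Shadow struct {
	Legacy, Candidate Handler

	mu         sync.Mutex
	Mismatches []string // request IDs where the two paths disagreed
}

// Handle invokes both implementations, records a mismatch when they
// differ in error state or in result, and returns the legacy outcome.
func (s *Shadow) Handle(requestID string) (Result, error) {
	legacyRes, legacyErr := s.Legacy(requestID)
	candRes, candErr := s.Candidate(requestID)

	errorsDiffer := (legacyErr == nil) != (candErr == nil)
	resultsDiffer := legacyErr == nil && candErr == nil && legacyRes != candRes
	if errorsDiffer || resultsDiffer {
		s.mu.Lock()
		s.Mismatches = append(s.Mismatches, requestID)
		s.mu.Unlock()
	}
	return legacyRes, legacyErr
}
```

In this sketch a one-paise difference between the two settlement amounts is enough to record a mismatch, which is the kind of signal that would feed an alert.
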
For four weeks, shadow traffic covered all live transactions, with both code paths logging their outputs; any discrepancy triggered an alert.

The two systems ran fully in parallel: while the monolith still served all live traffic, the new Payments Core service received mirrored data via Kafka. This let the team validate that the new service was correct under real-world load without any user-facing risk.

The data validation work uncovered a floating-point edge case in the settlement calculation that would have caused a ₹47-lakh discrepancy in a high-volume month-end settlement. It was caught and patched before any traffic was cut over.

### Month 6: Ledger & Reconciliation Service

Accounting systems are unforgiving. The Ledger Service was the second migration after Payments Core and was arguably the most challenging. Each transaction entry needed ACID guarantees, support for point-in-time querying, and an immutable audit trail.

The team implemented **event sourcing** over PostgreSQL: every financial event was stored in an append-only log, with current state reconstructed by replaying events. A projection layer materialized account balances, and a **saga orchestration layer** handled distributed compensation when a payment authorization failed mid-chain.

The saga implementation was inspired by the pattern used in large-scale e-commerce order fulfillment systems, adapted for the strictness required by financial compliance. A saga coordinator watched the event stream, ran compensating transactions if any step failed, and issued a reconciliation event that downstream services consumed. This pattern was also chosen for its forward compatibility: future regulatory changes requiring even stricter audit trails would only require replaying and extending the event log.
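
The core of the event-sourced ledger (append-only writes, state rebuilt by replay, and compensation expressed as a new event rather than a deletion) can be sketched in a few lines. This is an in-memory illustration under stated assumptions, not the team's PostgreSQL schema or saga coordinator: the `Event` and `Log` types, the field names, and the use of integer paise are all hypothetical.

```go
package main

// Event is a hypothetical ledger entry. Delta is in paise (integer
// minor units): positive values credit the account, negative debit it.
type Event struct {
	Seq     int64  // position in the append-only log, starting at 1
	Account string
	Delta   int64
	Kind    string // for example "authorized" or "compensated"
}

// Log is an in-memory stand-in for the append-only event store.
type Log struct {
	events []Event
}

// Append assigns the next sequence number and stores the event.
// Existing entries are never modified or removed.
func (l *Log) Append(account string, delta int64, kind string) Event {
	e := Event{Seq: int64(len(l.events)) + 1, Account: account, Delta: delta, Kind: kind}
	l.events = append(l.events, e)
	return e
}

// BalanceAt replays every event with Seq <= upTo and returns the
// resulting balance for the account, which gives a point-in-time view.
func (l *Log) BalanceAt(account string, upTo int64) int64 {
	var balance int64
	for _, e := range l.events {
		if e.Seq > upTo {
			break // events are stored in sequence order
		}
		if e.Account == account {
			balance += e.Delta
		}
	}
	return balance
}

// Compensate reverses an earlier event by appending an offsetting
// entry, so the original record stays in the audit trail.
func (l *Log) Compensate(original Event) Event {
	return l.Append(original.Account, -original.Delta, "compensated")
}
```

Because a compensation is just another appended event, replaying up to an earlier sequence number still shows the balance as it stood before the reversal, which is what makes point-in-time audit queries possible in this model.
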

![Architecture diagram showing microservice interactions and data flow layers](/images/blog/microservices-architecture-diagram.png)

### Month 7–8: Merchant Management and Notification Services

With two core services running in production, the remaining systems followed a more predictable path. The Merchants Service handled onboarding workflows, KYC document verification, and merchant configuration state. Receiving KYC documents required careful file handling, so the service used object storage with signed URLs and background processing to avoid blocking API requests.

The Notifications service was a deliberately simple, horizontally scalable Node.js service, a design constraint from the start. It consumed events from Kafka, applied per-merchant routing rules, and had a built-in dead-letter queue for notifications that failed delivery.

### Month 9: Stabilization, Training, and Cutover

Month nine was consolidation. Integration tests ran nightly across all services. Chaos engineering experiments (deliberately killing pods, introducing packet loss, throttling databases) ran through Gremlin against the staging environment. The team observed that each service handled failure in isolation, with upstream services falling back to cached responses or error-handling paths rather than crashing.

The documentation was expanded and reviewed. Each service received a README, an OpenAPI specification, an on-call runbook, and a deployment guide. On-call engineers were given self-service dashboards built on Grafana, and the PagerDuty rotation ran with three days of shadow on-call before formal responsibility was handed over.

The cutover was executed over a single weekend. With traffic monitored at every hop (API gateway response codes, service-to-service call health, ledger reconciliation rate, and notification delivery success), the stack was running entirely on the new services by Monday morning.
The first full trading week had zero escalations.

---

## Results

### Metrics: Before and After

| Metric | Q1 2025 (Monolith) | Q1 2026 (Post-migration) | Improvement |
|--------|--------------------|-------------------------|------------|
| P99 API latency | 40 seconds | 820 milliseconds | About 98% reduction |
| P50 API latency | 3.2 seconds | 145 milliseconds | 95.5% reduction |
| Deployments per month | 4 | 63 | 15x frequency |
| Change failure rate | 18% | 1.7% | 91% reduction |
| MTTR | 45 minutes | 3.2 minutes | 93% reduction |
| Unplanned downtime | 3.8 hours per month | 14 minutes per month | 94% reduction |
| Monthly incidents | 12 | 2 | 83% reduction |
| Resources per incident | 8 engineers, avg 3 hours | 2 engineers, avg 22 minutes | 86% effort reduction |

The most striking result was **business continuity during peak loads**. During the October 2025 festival window (the equivalent of Cyber Monday for Indian fintech), the platform processed 2.3 million transactions in a 24-hour period, with throughput peaking at 2,800 transactions per second. The previous year, under peak load, the system had throttled to 1,200 transactions per second and queued an additional 40% of incoming requests. The new platform absorbed the wave without a blip.

### Organizational Impact

- **Engineer velocity improved.** Squads owning services end-to-end cut feature cycle time from 14 days to 5 days on average. The release cadence was now twice daily on the payment services, and zero-impact deploys meant product decisions no longer had to wait for a Sunday window.
- **Talent retention improved.** Junior engineers reported feeling more confident shipping changes because the blast radius of their services was clearly bounded. Senior engineers spent more time on architecture and improvement work and less time on firefighting.
- **Compliance became a feature.** The immutable event log and the saga replay mechanism turned regulatory audit reporting into a one-hour report job instead of weeks of manual SQL queries. The PCI-DSS re-audit required zero scope expansion because network isolation was baked into the infrastructure from day one.

---

## Key Decisions

### Why Not Rewrite the Entire Stack?

A big-bang rewrite was explicitly rejected because of the business risk it represented. A complete rewrite would have required an 18-month freeze on all product features, a commercially unacceptable gap. The strangler-fig approach allowed the business to continue investing in growth while infrastructure was updated in parallel. The decision was borne out in practice: the monolith remained capable of processing traffic at reasonable levels until all services were migrated.

### Why Go for Payments and Ledger?

Node.js was an option, and some team members initially advocated for it based on existing familiarity. Go was ultimately chosen for Payments Core and the Ledger Service because of its built-in concurrency primitives (`goroutines` and channels), which the team found simplified writing a correct concurrent ledger implementation. The team judged that meeting the same strict latency budgets in Node.js would have carried a higher engineering burden for this workload.

### Why Event Sourcing for the Ledger?

A traditional CRUD model was considered but lost out to the event sourcing approach because of its forward compatibility. The use case, financial audit compliance, required that every change to account state be preserved and attached to the event that triggered it. Event sourcing delivered this as an inherent property of the data model, not as something bolted on through occasional migration scripts. The team could also implement point-in-time audit queries by replaying the log up to a specific transaction ID.

### Why Shadow Traffic Instead of Immediate Cutover?
The shadow traffic approach was the single highest-impact decision for safety during the migration. Without it, the team would have migrated services based on automated test confidence and manual QA, which is meaningful but not sufficient for financial systems where wrong state means financial loss for merchants. Shadow traffic caught the settlement calculation edge case that would have cost ₹47 lakh, a catch that on its own paid back the migration effort.

---

## Lessons Learned

### 1. Invest Heavily in Observability Before Switching Traffic

It is common, but unwise, to migrate services before you have the tools to see what is happening inside them. In NeoVault’s case, the observability stack (OpenTelemetry traces, structured logs, Prometheus metrics) was built and validated during month 2, two full months before the first workload was migrated. This dramatically reduced the debugging effort for any issues caught under partial traffic.

### 2. Expect to Refactor Your Own Service Contracts

As services took real traffic, the initial interface definitions turned out to be too broad in some places and too narrow in others. Teams made interface changes 12 times in the first six weeks of shadow traffic. This is normal and expected when a system has not actually been used before. Allowing iteration on service contracts early, before locking them behind compatibility constraints, is the right approach.

### 3. Bounded Context Maps Pay Back the Mapping Work

Event Storming and bounded context mapping took three days and two facilitators. That work saved approximately 40 engineer-weeks of guesswork and rework in the first six months of the migration. The four capability islands that emerged became the four service boundaries. Time spent on structured context mapping always looks like overhead in a project plan. In practice, it is how you avoid running the wrong project.

### 4. Treat the Migration as the Product Itself

What made the project succeed was that the team treated it like a product, complete with sprint planning, quarterly OKRs, user journeys (for engineers), acceptance criteria, and a continuous integration pipeline. The shadow traffic approach, the canary deployments, and the staged cutovers were all practices borrowed from product release management and applied to infrastructure change. This framing, change as a product, reduces the psychological weight of infrastructure change and makes it part of the normal velocity of the organization instead of a disruptive event.

### 5. Don’t Optimize for Moving Fast; Optimize for Moving Safely

This team moved intentionally. The migration took nine months. The first service was clearly the highest-risk place to start, and the team chose it deliberately. Parallel traffic, shadow validation, canary rollouts, and staged cutovers: every protective measure was taken. Architectural projects like this rarely succeed because of how fast the engineers work. They succeed because of what they don’t break while working.

---

## Conclusion

The NeoVault migration demonstrates that a mid-sized engineering team can successfully navigate a foundational architectural change, even under tight time constraints and regulatory pressure, by combining a sound technical approach with structured, disciplined execution. The metrics speak for themselves: latency reductions of 95% or more, a 94% cut in downtime, and a 15x increase in release frequency, all achieved without a single day of service interruption for merchants.

The work did not end with the migration. Observability culture, service maturity, and continuous learning became embedded in how the organization thinks about platform health. Architecture migration is a journey, not a delivery. The team treats it as such, and the results reflect that choice.

---

*Case study prepared by Webskyne editorial, based on primary interviews with NeoVault engineering leadership, architecture briefings, and platform telemetry provided directly by the company.*

*Tags: microservices, fintech, migration, architecture, devops, go, kafka, kubernetes*
