From Monolith to Cloud-Native: How FinServe Labs Cut Loan Processing Time by 87%

When FinServe Labs, a Bengaluru-based B2B fintech serving 180+ NBFCs, inherited a 12-year-old Rails monolith that timed out under anything above light traffic, its engineering team faced a stark fork in the road: keep patching a sinking ship, or undertake one of the riskiest kinds of migration a regulated financial platform can attempt. By the time the migration was complete in late September 2024, that same team was shipping 16 deploys a week without breaking a sweat, had cut end-to-end loan processing from 47.2 seconds to 5.8 (an 87% improvement), and had reduced infrastructure costs by 60%. How did a 28-person team achieve what many thought impossible without losing a single client or slipping a single SLA? This case study walks through the full migration, from the painstaking discovery phase through the infrastructure build-out, service extraction, and the deliberate order in which services went to production: the architecture decisions, the hidden traps, the raw numbers, and the real lessons learned along the way.

Overview

FinServe Labs is a B2B fintech platform headquartered in Bengaluru, India, serving over 180 regulated NBFCs (Non-Banking Financial Companies) across South and Southeast Asia. Its flagship product, a white-labeled loan origination and management platform, handles origination requests, credit scoring, KYC verification, disbursal, and collections within a unified interface. By mid-2024, the platform was handling an estimated 240,000 loan applications per quarter across its client base. Internally, FinServe's engineering team was running a monolith: a Ruby on Rails backend, a single PostgreSQL database, and VirtualBox-hosted worker nodes managed through Capistrano. The team was 28 people strong: 12 backend engineers, 5 DevOps, 4 frontend, 4 data, and 3 QA. The monolith had been in continuous service since 2012, with no architectural review, no service boundary enforcement, and no budget for significant re-architecture. Technical debt was compounding faster than anyone could pay it down.

The Challenge

Symptoms of a Monolith Under Strain

By January 2024, the trouble signs were impossible to ignore — and impossible to keep from clients.

Processing Timeouts at Scale: During the last quarter's peak season, 18% of loan applications timed out mid-processing. Worst of all, one client — Midland Credit, a ₹2,800 crore NBFC — reported they had re-initiated over ₹90 crore in disbursals manually because the platform's credit-check step simply never returned. The embarrassment was compounded by the financial and reputational cost.

Deployment Fear Culture: Any release that touched more than 2% of the monolith required a full freeze of the platform for at least 45 minutes — and deployment frequency had dropped to once every three weeks because the QA team couldn't cycle fast enough. A codebase of 380,000 lines, including tests, made even the smallest change high-risk. The DevOps team kept a rollback runbook that was five pages long and still often wrong.

AWS Cost Inefficiency: Because the application could not scale horizontally and demand piled onto a single machine, the infrastructure team was running its primary app server, an oversized AWS r5.4xlarge instance, at 85% CPU utilization, at a cost of approximately ₹2.1 lakh (~$2,470 USD) per month. During peak hours, auto-scaling kicked in erratically, ramping up to 4 instances before the scheduler corrected; one unscheduled capacity event produced a ₹80,000 cost spike in a single night.

Specific Performance Targets at Stake

Metric	Pre-Migration Baseline	Client SLAs Committed
Loan origination end-to-end	47.2 seconds (avg)	15 seconds
Credit score API p99	8,200ms	2,000ms
System availability	97.8%	99.9%
Deployment frequency	1 per 3 weeks	1 per week
Mean Time To Recovery (MTTR)	127 minutes	30 minutes

The risk of losing Midland Credit alone — a client contributing over ₹1.8 crore in annual revenue — made inaction untenable.

Goals

Primary Non-Negotiable Objectives

Before any line of code was written, the CTO, Priya Nair, ran a series of whiteboard workshops with engineering, product, and leadership to define what success looked like. Three non-negotiables stood out:

Goal 1 — Break the Monolith Along Business Domains: Identify five clearly bounded business domains — Loan Origination, Credit Scoring, KYC Verification, Disbursal, and Collections — and extract each as an independently deployable service with its own data store.

Goal 2 — Reduce End-to-End Loan Processing to <15 Seconds: Meet or beat the SLAs published in all client contracts. At 47 seconds, the system was in breach of its own service agreements for 82% of clients for at least 3 out of 5 SLA metrics.

Goal 3 — Enable Zero-Downtime Deployments: Every service must be deployable independently, with canary and blue-green strategies, without platform-wide outages.

Secondary/Stretch Objectives

Reduce infrastructure spend by at least 40% (over 12 months)
Achieve 10x improvement in MTTR through independent service observability
Enable growth to a 50-person engineering team by end of year without compounding complexity
Replace PostgreSQL with CockroachDB for regionally-distributed data resilience

The Approach

Architecture Strategy: Strangler Fig + Event-Driven, Not Big-Bang Rewrite

After a two-week discovery sprint that burned through four standing lamps in the team's war room, it became clear that a "big-bang" rewrite (write everything, switch everything over in one weekend) was a non-starter. The codebase had too many unknown unknowns, client expectations were too inflexible, and a three-week freeze would likely break relationships the company had spent years building.

Instead, the team adopted the Strangler Fig pattern, a method originally described by Martin Fowler: wrap legacy endpoints behind a proxy layer, route new feature development through service boundaries, and gradually "strangle" the old system out of existence. This meant the new services and the old monolith would coexist for a transition period of approximately three to four months.

The chosen stack reflected the team's priorities: Kubernetes on AWS EKS for orchestration, Go (gRPC for inter-service communication, REST for client endpoints) for the core services, Apache Kafka for event streaming, CockroachDB for the primary data store, and Redis for distributed caching. The frontend — a React single-page application — would receive a new Backend-for-Frontend (BFF) layer to decouple it from backend contract changes.

Service Extraction Order

The team chose a deliberate order of service extraction designed to minimize blast radius:

Credit Scoring: the most orthogonal business domain, with the lowest coupling to other services and the highest client SLA pressure
KYC Verification: well-defined data boundaries, high volume, and a crudely managed queue inside the monolith
Loan Origination: the orchestrator, the most complex service, and the one most clients call directly
Disbursal: payment orchestration, tightly coupled to the origination service but with clear external contracts
Collections: owned by the smallest team, moved last with support from the by-then-free Credit Scoring team

This order ensured that despite extracting services incrementally, each service had the ability to reach full performance targets before the next was cut over — creating a feedback loop of speed and confidence.

Implementation

Phase 1: Infrastructure Foundation (Weeks 1–3)

The first three weeks were unglamorous infrastructure work, and critical to everything that followed.

EKS + Terraform Infrastructure as Code

Priya's team provisioned three EKS clusters (dev, staging, production) using Terraform modules stored in a new dedicated GitHub org. Each cluster was provisioned with dual-AZ, multi-region awareness, with Karpenter for automatic node scaling replacing the existing Cluster Autoscaler, reducing wasted capacity by up to 35%. GitHub Actions replaced the existing Jenkins pipeline, with build times dropping from 19 minutes to approximately 4 minutes for the majority of services.

Observability Stack Overhaul

The legacy Prometheus + Grafana setup had 72 dashboards, 63% of which no longer received any data. The team rebuilt observability on OpenTelemetry for traces, Prometheus with Thanos for long-term metric storage, Grafana Alerting with PagerDuty integration, and structured JSON logging via Loki. Correlation IDs were introduced across all inter-service calls, allowing engineers to trace a single loan application end-to-end across five services without guessing which service produced which error.
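To make the correlation-ID idea concrete, here is a minimal sketch in Python (chosen for brevity; the team's services were written in Go). It assumes a hypothetical X-Correlation-ID header and a hypothetical field layout for the JSON log lines; neither is taken from FinServe's implementation.

```python
import contextvars
import json
import logging
import uuid

# Hypothetical header name; the source does not say which header was used.
HEADER = "X-Correlation-ID"

# Holds the correlation ID of the request currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        })


def start_request(headers):
    """Reuse the caller's correlation ID, or mint a new one at the edge."""
    cid = headers.get(HEADER) or uuid.uuid4().hex
    correlation_id.set(cid)
    return cid


def outbound_headers():
    """Headers to attach to every downstream call made for this request."""
    return {HEADER: correlation_id.get()}


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("credit-scoring")
    log.addHandler(handler)
    log.setLevel(logging.INFO)

    start_request({HEADER: "loan-app-123"})
    log.info("bureau lookup started")
    print(outbound_headers())
```

Because every service logs the same ID and forwards it on outbound calls, a single query on that ID in the log store returns the whole path of one loan application.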

CI/CD Pipeline for Each Service

Every new service got an isolated GitHub Actions pipeline with linting, unit tests, integration tests, and a Chaos Monkey-style deployment check. No service went to staging with less than 85% unit test coverage. All services were containerized with Docker and scanned with Trivy for known CVEs on every push.

Phase 2: Strangler Proxy and Data Migration (Weeks 4–8)

The team built a Kong API gateway layer that served as the "strangler proxy," routing client requests either to the monolith or to the new services based on URL path, HTTP method, or headers, with the traffic split controlled by feature flags managed through LaunchDarkly. No client endpoint was disrupted; read traffic to the monolith could be transparently shadowed to the new services before any production routing was committed.
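The routing decision itself is simple to sketch. The Python below is an illustration only: it does not use the Kong or LaunchDarkly APIs, and the path prefixes, upstream names, and percentage-based rollout table are assumptions made for the example. It shows one common way to keep a gradual traffic split stable, by hashing the client ID into a fixed bucket so the same client always lands on the same side.

```python
import hashlib


def bucket(client_id: str) -> int:
    """Map a client ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def route(path: str, client_id: str, rollout: dict) -> str:
    """Pick an upstream for a request.

    rollout maps a path prefix to (upstream name, percent of clients
    that should be sent to the new service). Anything not matched, or
    not yet rolled out for this client, stays on the monolith.
    """
    for prefix, (upstream, percent) in rollout.items():
        if path.startswith(prefix) and bucket(client_id) < percent:
            return upstream
    return "monolith"


if __name__ == "__main__":
    # Hypothetical rollout table: 25% of clients use the new scoring service.
    rollout = {"/v1/credit-score": ("credit-scoring-svc", 25)}
    for client in ["nbfc-001", "nbfc-002", "nbfc-003"]:
        print(client, route("/v1/credit-score/app-42", client, rollout))
```

Raising the percentage only ever moves clients from the monolith to the new service, never back and forth, which keeps behaviour predictable for each client during the rollout.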

CockroachDB was installed and configured for strongly consistent multi-AZ operation. A CDC (Change Data Capture) pipeline built on Debezium and Kafka Connect initially streamed changes one way, from the legacy PostgreSQL database to CockroachDB, keeping the new database in sync with the old during the transition. Once a service broke away from the monolith, it directed its writes to CockroachDB, and those writes were replicated back to PostgreSQL until that service was fully cut over.
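As a rough illustration of what the consuming side of such a pipeline has to do, the sketch below applies Debezium-style change events to a target table, modelled here as a plain Python dict. The op, before, and after fields follow Debezium's event envelope; the primary-key name and the sample rows are assumptions, and a real sink would write to the database rather than to memory.

```python
def apply_change(table: dict, event: dict, key: str = "id") -> None:
    """Apply one change event to an in-memory copy of a table.

    op is "c" (create), "u" (update), "r" (snapshot read) or "d" (delete).
    Creates and updates are applied as upserts, so replaying an event
    that was already applied leaves the table unchanged.
    """
    op = event["op"]
    if op in ("c", "u", "r"):
        row = event["after"]
        table[row[key]] = dict(row)
    elif op == "d":
        table.pop(event["before"][key], None)
    else:
        raise ValueError(f"unknown op: {op!r}")


if __name__ == "__main__":
    target = {}
    events = [
        {"op": "c", "before": None, "after": {"id": 1, "grade": "B"}},
        {"op": "u", "before": {"id": 1, "grade": "B"}, "after": {"id": 1, "grade": "A"}},
        {"op": "u", "before": {"id": 1, "grade": "B"}, "after": {"id": 1, "grade": "A"}},
    ]
    for e in events:
        apply_change(target, e)
    print(target)
```

Treating every write as an upsert matters because Kafka consumers normally get at-least-once delivery, so the same change can arrive more than once after a restart.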

Phase 3: Building the Credit Scoring Service (Weeks 8–16)

Credit Scoring was the first service extracted, chosen for its clean business boundaries — it accepts a loan application ID with borrower data, scores the borrower against internal and third-party bureau data, and returns a deterministic risk grade. The team identified three key failure modes in the legacy implementation:

Database lock contention on the scoring table during peak hours
Synchronous calls to external bureau APIs with no timeout or fallback policy
Cache stampede on bureau data during market events (festival season loan surges)

The new Credit Scoring service eliminated the lock contention by partitioning the scoring table by borrower region, isolated external bureau calls behind a resilient HTTP client with circuit breakers (Hystrix) and a Redis-based result cache with a 12-hour TTL, and handled cache stampedes with a "single-flight" pattern that coalesced identical concurrent requests during cache expiration. The result was visible before the service had even been released to production: in load testing, the SLA compliance rate rose from 62.1% to 98.4%.
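The single-flight idea can be sketched in a few lines. The version below is a thread-based Python illustration, not the team's Go code: the first caller for a key runs the expensive fetch, and any caller that arrives for the same key while it is in flight waits and shares that result. The class and method names are hypothetical.

```python
import threading


class _Call:
    """Bookkeeping for one in-flight fetch."""

    def __init__(self):
        self.done = threading.Event()
        self.result = None
        self.error = None


class SingleFlight:
    """Coalesce concurrent calls that share a key into one execution."""

    def __init__(self):
        self._lock = threading.Lock()
        self._calls = {}

    def do(self, key, fn):
        with self._lock:
            call = self._calls.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._calls[key] = call

        if leader:
            try:
                call.result = fn()
            except Exception as exc:
                call.error = exc
            finally:
                # Forget the key first, then wake the waiting callers.
                with self._lock:
                    del self._calls[key]
                call.done.set()
        else:
            call.done.wait()

        if call.error is not None:
            raise call.error
        return call.result


if __name__ == "__main__":
    sf = SingleFlight()
    print(sf.do("borrower-42", lambda: {"grade": "A"}))
```

Single-flight does not cache anything by itself; it only collapses the burst of identical requests that arrives between a cache entry expiring and the first refill completing.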

Phase 4: KYC Verification (Weeks 16–24)

KYC (Know Your Customer) verification was the second extraction. The legacy implementation in the monolith processed verification through a single-threaded ActiveJob worker that serialized all requests; at peak hours, a queue of 12,000 verification jobs was common, creating a 45-minute lag. The team migrated to an HTTP-based worker pool of 200 goroutines in the new service, backed by Redis-based rate limiting and a Horizontal Pod Autoscaler (HPA) targeting 70% CPU utilization.
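A back-of-the-envelope sketch shows why a bounded pool changes the picture. The Python below uses a thread pool as a stand-in for the goroutine pool and adds an idealised drain-time estimate. It assumes the 45-minute lag was the time a single worker needed to clear 12,000 jobs (about 0.225 seconds per job), that jobs are independent and I/O-bound, and that downstream providers can absorb the parallel load; none of these assumptions is stated in the source.

```python
from concurrent.futures import ThreadPoolExecutor


def verify_all(jobs, verify, workers=200):
    """Run verify(job) for every job on a bounded pool.

    At most `workers` jobs run at once; results come back in job order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(verify, jobs))


def drain_seconds(queue_len, per_job_seconds, workers):
    """Idealised time to clear a backlog with perfectly parallel workers."""
    return queue_len * per_job_seconds / workers


if __name__ == "__main__":
    per_job = 45 * 60 / 12_000  # assumed serial cost per job, in seconds
    print(f"1 worker:    {drain_seconds(12_000, per_job, 1):.0f} s")
    print(f"200 workers: {drain_seconds(12_000, per_job, 200):.1f} s")
    print(verify_all(["doc-1", "doc-2"], lambda d: f"{d}:verified", workers=2))
```

The estimate is an upper bound on the benefit: rate limits on third-party verification APIs, which is presumably why rate limiting sat alongside the pool, will usually cap throughput before the worker count does.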

They also extracted third-party identity document OCR processing (previously an inline method in the monolith) into an asynchronous sliding-window queue with a 30-second polling window, so that a slower-than-expected OCR response would not block downstream disbursal.

Phase 5: Loan Origination (Weeks 24–34)

Loan Origination was the most complex extraction, since it was the monolith's core orchestration service. The team rewrote the state machine driving loan application progression, splitting it into domain events (ApplicationSubmitted, CreditScored, KycCompleted, UnderwritingApproved, Disbursed) published to Kafka. Each domain event triggered idempotent handlers that updated the application state, decoupling the state machine from direct database writes.
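What "idempotent handler" means in practice can be shown with a small sketch. The event names below come from the text; the state names, the event_id field, and the in-memory set of processed IDs are assumptions for illustration (a real service would persist the processed IDs alongside the application state).

```python
from dataclasses import dataclass, field

# Event names are from the case study; the state names are hypothetical.
EVENT_TO_STATE = {
    "ApplicationSubmitted": "SUBMITTED",
    "CreditScored": "SCORED",
    "KycCompleted": "KYC_DONE",
    "UnderwritingApproved": "APPROVED",
    "Disbursed": "DISBURSED",
}


@dataclass
class Application:
    app_id: str
    state: str = "NEW"
    seen_event_ids: set = field(default_factory=set)


def handle(app: Application, event: dict) -> bool:
    """Apply a domain event at most once.

    Returns True if the event changed the application, False if it was
    a redelivery of an event that had already been processed.
    """
    if event["event_id"] in app.seen_event_ids:
        return False
    app.state = EVENT_TO_STATE[event["type"]]
    app.seen_event_ids.add(event["event_id"])
    return True


if __name__ == "__main__":
    app = Application("app-42")
    submitted = {"event_id": "e1", "type": "ApplicationSubmitted"}
    scored = {"event_id": "e2", "type": "CreditScored"}
    for e in (submitted, scored, submitted):  # last one is a redelivery
        print(e["type"], handle(app, e), app.state)
```

Deduplicating on an event ID means a consumer restart that replays part of a Kafka partition cannot move an application backwards or apply a transition twice.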

This event-driven rewrite also enabled an immediately useful new feature: real-time client dashboards showing loan application status without polling, something Midland Credit had been requesting for 18 months. The team used WebSocket connections managed by a BFF service and backed by Redis pub/sub, so each frontend client received push updates without hitting the loan service API directly.

Results

Performance Metrics

When the full migration was completed in late September 2024, the numbers came in better than the team expected.

End-to-end loan origination time dropped from 47.2 seconds to 5.8 seconds, a reduction of 87.7%. Credit scoring p99 latency fell from 8,200ms to 312ms, a 26× improvement. System availability climbed from 97.8% to 99.96%, which cuts the implied downtime from 2.2% to 0.04% of the year, roughly a 55-fold reduction. Mean Time To Recovery dropped from 127 minutes to 9 minutes, a 14× improvement, primarily because PagerDuty alerts now included precise service-level context and the relevant service owner was auto-paged, rather than the entire on-call rotation.

Infrastructure costs dropped from ₹2.1 lakh/month to ₹83,000/month, a 60.5% reduction. The savings were driven by better resource targeting via Karpenter, the removal of oversized instances, and the ability to burst to spot instances during unexpected traffic spikes without significant cost. Three-year AWS Savings Plan commitments drove an additional 18% discount on top of this.

Operationally

Deployment frequency climbed from once every three weeks to 16 deploys per week across all services, with mean deploy time of approximately 9 minutes per service and zero service-impacting deployment incidents in the 90-day post-migration observation window. Each service maintained independent release schedules, so a bug fix to the credit scoring service required no freeze of disbursal or origination — and vice versa.

Business Impact

Client adoption of the new features was immediate and substantial. The real-time application tracking dashboard, built as a byproduct of the event-driven origination rewrite, was adopted by 94% of active clients within three weeks. Midland Credit formally renegotiated its contract, increasing its committed annual spend by 23% after seeing the improvement in processing reliability and speed.

Customer support ticket volume on performance-related complaints dropped by 73% within 30 days of full migration, which translated to an estimated savings of approximately ₹10 lakh a year in support labor costs.

Metric	Before (Jan 2024)	After (Sep 2024)	Change
End-to-end processing time	47.2s	5.8s	⬇ −87.7%
Credit scoring p99	8,200ms	312ms	⬇ −96.2%
System availability	97.8%	99.96%	⬆ +2.16 pts (≈55× less downtime)
Infrastructure cost	₹2.1L/month	₹83k/month	⬇ −60.5%
MTTR	127 min	9 min	⬇ −92.9%
Deployment frequency	1/3 weeks	16/week	⬆ 48× more frequent
Support tickets (performance)	423/month	113/month	⬇ −73%
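
The ratios in the table can be re-derived from the before and after figures. The short Python sketch below does that arithmetic; the only interpretive step is treating availability as an uptime fraction and comparing the implied downtime (100% minus availability), which is an assumption about how the figure was measured.

```python
def pct_reduction(before, after):
    """Percentage drop from before to after."""
    return (before - after) / before * 100


def fold_change(before, after):
    """How many times larger `before` is than `after`."""
    return before / after


if __name__ == "__main__":
    print(f"Processing time:   -{pct_reduction(47.2, 5.8):.1f}%")
    print(f"Credit score p99:  -{pct_reduction(8200, 312):.1f}% "
          f"({fold_change(8200, 312):.1f}x faster)")
    # Downtime is the complement of availability.
    print(f"Downtime:          {fold_change(100 - 97.8, 100 - 99.96):.0f}x lower")
    print(f"Infrastructure:    -{pct_reduction(210_000, 83_000):.1f}%")
    print(f"MTTR:              -{pct_reduction(127, 9):.1f}% "
          f"({fold_change(127, 9):.1f}x faster)")
    # One deploy per three weeks is 1/3 of a deploy per week.
    print(f"Deploy frequency:  {fold_change(16, 1 / 3):.0f}x")
    print(f"Support tickets:   -{pct_reduction(423, 113):.1f}%")
```

Run as written, this gives roughly 87.7%, 96.2% (26.3x), 55x, 60.5%, 92.9% (14.1x), 48x, and 73.3%.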

Lessons Learned

1. Don't Underestimate the Strangler Fig Transition Cost

Originally, the team expected the strangler-fig migration to take 12 weeks per service. In reality, the first service (Credit Scoring) took 10 weeks because of unexpected coupling to legacy state stored in the monolithic database — forcing the team to duplicate data synchronization logic across the Debezium pipeline boundaries. Future migrations started with a dedicated two-week coupling discovery phase, which reduced actual extraction time by 35%.

2. Invest in Observability Before You Build

The team's investment in structured logging, distributed tracing, and correlation IDs in Phase 1 paid for itself dozens of times over. During the Loan Origination extraction, the team identified a bug in the Kafka event schema in 18 minutes using a single trace ID — an equivalent investigation in the old system would have taken an estimated 6–8 hours through log files and debugger breakpoints.

3. Infrastructure Partners: Don't Fly Solo

The DevOps team chose CockroachDB partly because the vendor offered direct account support in FinServe's region, a decision that paid off when an authorized-story feature flag accidentally triggered a regional CockroachDB outage in staging. The vendor's on-call engineer was in the Slack channel with deployment logs in under four minutes and had a staged fix rolling out within 11. The choice of vendor support tier matters far more than anyone who has never been on call at 2 a.m. would readily believe.

4. Parallel Development Requires Parallel Communication

With five services under simultaneous active development, the weekly all-hands was not sufficient to surface cross-service API contract conflicts. The team introduced bi-weekly "API Composition" reviews with all engineers to maintain shared schema references, a practice that prevented at least three production-breaking schema mismatches in the first five months. Technical excellence requires deliberate communication architecture.

5. Know When to Stop — and What to Defer

The stretch objectives around CockroachDB regional failover and real-time ML fraud detection were explicitly deferred to the post-migration roadmap. The team pushed these items out several times during the development window to protect delivery on the core five services. Disciplined scope management, with transparent stakeholder tracking of what was deferred and why, was one of the most important decisions of the project.

Final Reflection

Six months after full migration, FinServe Labs had more than a faster system. They had a culture change: a pattern of confident engineers deploying without fear, of leadership making technical decisions grounded in data rather than politics, of a platform that was finally doing what the business needed rather than the other way around.

The numbers are remarkable (87% faster processing, 60% infrastructure savings, 99.96% availability), but the real win is something harder to quantify: a team that moved from doubt to confidence, that stopped asking "Will we make our SLA?" and started asking "What's next on the roadmap?"

For any engineering team staring at a monolith and wondering if they'll survive the rewrite: the migration wasn't easy, but it was always going to be harder to not do it. The question wasn't whether to migrate — it was whether to do it when they could choose, or when they couldn't.