How FinEdge Financial Migrated 200+ Microservices to AWS and Cut Infrastructure Costs by 42%

In early 2024, FinEdge Financial faced a looming infrastructure crisis: rising cloud costs, growing operational debt, and fragile reliability guarantees across more than 200 microservices. This case study traces the six-month AWS migration, the architecture decisions that reduced spend without sacrificing uptime, and the lasting operational practices that keep the system stable. FinEdge achieved a 42% cost reduction while improving SLO adherence from 83% to 98.7%. If you are planning a large-scale cloud migration, the approach described here may serve as a blueprint.

Overview

FinEdge Financial is a mid-sized digital banking platform serving roughly 1.2 million active users across Southeast Asia. Between 2021 and 2023, the company scaled aggressively, adding payment processing, fraud detection, analytics, and customer-loyalty features as independent microservices. By late 2023, the estate had grown to 217 production services running across multiple cloud regions and a convoluted network of legacy contracts.

This case study documents the end-to-end migration of that estate to a unified AWS platform, sponsored by the CTO and driven by a cross-functional migration squad. Over six months, the team re-architected networking, standardized deployment pipelines, introduced centralized observability, and retired eight legacy applications that no longer provided meaningful business value. The migration was completed with zero customer-facing downtime and a 42% reduction in monthly infrastructure spend.

Challenge

The problems were not theoretical. In a November 2023 review, FinEdge leadership identified three compounding issues that threatened both customer trust and financial sustainability.

1. Uncontrolled cloud spend. Monthly cloud bills had grown from $87,000 to $164,000 in 18 months. The growth was not driven by user volume alone; a third of the spend came from abandoned resources, oversized instances, redundant multi-region deployments, and unoptimized data-transfer patterns. Every engineering lead owned their own cloud footprint, which made accountability nearly impossible.

2. Reliability erosion. As service count grew, failure domains multiplied. In the four weeks leading up to the migration decision, the platform recorded 23 customer-visible incidents. Post-incident reviews consistently pointed to the same root causes: inconsistent retry policies, missing circuit breakers, and ad-hoc networking rules that allowed dependency cascades across services.

3. Observability debt. Logs, metrics, and traces were scattered across four vendors and several open-source stack choices. Teams routinely shipped debugging dashboards alongside new features, resulting in hundreds of one-off Grafana panels with no shared definition of critical signals. On-call engineers spent 40% of their time hunting for data rather than resolving root causes.

The executive mandate was unambiguous: the platform had to become cheaper, more reliable, and easier to operate within two quarters.

Goals

Before drafting a technical plan, the migration squad translated business needs into measurable outcomes. The goals were:

1. Reduce total infrastructure cost by at least 35% within eight months without reducing product capability or harming customer experience.
2. Eliminate single points of failure in payment processing and customer authentication, the two highest-value user journeys.
3. Establish a single pane of glass for monitoring by consolidating all metrics, logs, and traces into one vendor-managed observability pipeline.
4. Reduce mean time to recovery (MTTR) from 42 minutes to under 10 minutes for P1 incidents.
5. Document all infrastructure decisions in a decision log that could be audited six months later, reducing institutional knowledge loss.

Every subsequent technical choice was judged against these five criteria. When trade-offs arose, the goals became the tie-breaker.

Approach

Rather than attempting a single-day cutover, the team selected a strangler-fig migration pattern. Production traffic would remain on the existing environment while each service was re-platformed incrementally. The migration was organized into three sequential waves.

Wave one: Networking and identity. Before moving individual services, the team rebuilt the foundational layer. They introduced a Transit Gateway to replace dozens of peering connections, implemented IAM Identity Center for unified access control, and migrated DNS to Route 53 with weighted routing. Network latency between regions dropped by 28% during this wave, validating the redesign early.
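Weighted routing is what lets a strangler-fig migration shift traffic gradually: each target receives a share of requests equal to its weight divided by the sum of all weights. The TypeScript below is a minimal sketch of that selection logic only. The target labels, the 90/10 split, and the function names are illustrative assumptions, not FinEdge's DNS configuration or the Route 53 implementation.

```typescript
// Hypothetical sketch of weighted target selection. Labels, weights, and
// function names are illustrative assumptions, not FinEdge's configuration.
interface WeightedTarget {
  label: string;  // e.g. "legacy" or "aws"
  weight: number; // non-negative relative weight
}

// Sum of all weights; at least one weight must be positive.
function totalWeight(targets: WeightedTarget[]): number {
  const total = targets.reduce((sum, t) => sum + t.weight, 0);
  if (total <= 0) throw new Error("at least one target needs a positive weight");
  return total;
}

// Expected share of traffic per target: weight / sum of weights.
function trafficShares(targets: WeightedTarget[]): Record<string, number> {
  const total = totalWeight(targets);
  const shares: Record<string, number> = {};
  for (const t of targets) shares[t.label] = t.weight / total;
  return shares;
}

// Pick a target for one request. `roll` is a number in [0, 1), normally
// Math.random(); it is a parameter so the choice can be tested deterministically.
function pickTarget(targets: WeightedTarget[], roll: number = Math.random()): string {
  let remaining = roll * totalWeight(targets);
  for (const t of targets) {
    if (remaining < t.weight) return t.label;
    remaining -= t.weight;
  }
  // Floating-point edge case: fall back to the last target.
  return targets[targets.length - 1].label;
}

// Assumed early-cutover split: 90% of traffic stays on the old environment.
const earlyCutover: WeightedTarget[] = [
  { label: "legacy", weight: 90 },
  { label: "aws", weight: 10 },
];
console.log(trafficShares(earlyCutover));
console.log(pickTarget(earlyCutover, 0.95));
```

Raising the weight of the new target step by step, and watching error rates between steps, moves a service across without a single-day cutover.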

Wave two: Application services. The squad migrated the 50 highest-traffic services first, including the payment engine and fraud detection pipeline. Each service was containerized, profiled for resource usage, and repackaged into optimized Fargate tasks or EC2 Auto Scaling groups. Database connections were pooled, and read replicas were introduced where query patterns justified them.

Wave three: Observability and incident response. Only after the core services were stable in the new environment did the team build out unified dashboards, runbooks, and on-call rotations. This sequencing prevented the common mistake of instrumenting platforms that change faster than the instrumentation can keep up.

A six-person migration guild met daily. Kanban boards tracked each service from discovery through deprecation, and a weekly 30-minute leadership sync resolved cross-team blockers.

Implementation

The implementation phase lasted four months and touched almost every part of the technology stack.

Infrastructure as code. The team adopted AWS CDK for all new resources, replacing hand-rolled CloudFormation templates and manual console changes. CDK allowed the team to define networking, security, and compute abstractions in TypeScript, catch policy violations at build time, and publish reusable constructs that any service team could import. Within three weeks, 80% of new infrastructure was codified and subject to peer review before deployment.
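One example of a policy that can be checked at build time is required tagging, using the environment, owner, and project tags the team applied to every resource. The sketch below shows the idea in plain TypeScript. It deliberately does not use the AWS CDK API; the ResourceSpec shape and the function names are hypothetical stand-ins for whatever the real pipeline inspects.

```typescript
// Plain-TypeScript sketch of a build-time tag check. The ResourceSpec shape and
// function names are hypothetical; this does not use the AWS CDK API.
const REQUIRED_TAGS = ["environment", "owner", "project"];

interface ResourceSpec {
  logicalId: string;
  tags: Record<string, string>;
}

interface TagViolation {
  logicalId: string;
  missing: string[];
}

// Return one violation per resource that has a missing or blank required tag.
function findTagViolations(resources: ResourceSpec[]): TagViolation[] {
  const violations: TagViolation[] = [];
  for (const r of resources) {
    const missing = REQUIRED_TAGS.filter((key) => !(r.tags[key] || "").trim());
    if (missing.length > 0) {
      violations.push({ logicalId: r.logicalId, missing });
    }
  }
  return violations;
}

// Throw so that a CI step fails the build when any resource is under-tagged.
function assertTagged(resources: ResourceSpec[]): void {
  const violations = findTagViolations(resources);
  if (violations.length > 0) {
    const detail = violations
      .map((v) => `${v.logicalId}: missing ${v.missing.join(", ")}`)
      .join("; ");
    throw new Error(`Tag policy violated: ${detail}`);
  }
}

// Hypothetical resources: the second one lacks an owner tag.
const sampleResources: ResourceSpec[] = [
  { logicalId: "PaymentsQueue", tags: { environment: "prod", owner: "payments", project: "migration" } },
  { logicalId: "ScratchBucket", tags: { environment: "prod", project: "migration" } },
];
console.log(findTagViolations(sampleResources));
```

Running a check like this before deployment turns a tagging convention into a gate rather than a guideline.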

Database and caching strategy. Most services used provisioned RDS instances with size-on-demand, a euphemism for slightly too large and never right-sized. The team ran Performance Insights queries over 90-day windows, applied AWS Compute Optimizer recommendations, and moved seasonal workloads to Aurora Serverless v2. For caching, they introduced DAX clusters in front of three high-traffic DynamoDB tables, reducing read latency by 60% and eliminating stale-read retry storms.

CI/CD and deployment gates. To keep the migration safe, every service deployment was wrapped in a CD pipeline with mandatory security scanning, artifact signing, and canary validation. Canary releases compared latency, error rate, and business metrics against baseline for five minutes before promoting traffic. This added roughly ninety seconds to each deployment but caught three regressions that would otherwise have reached production customers.
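The canary comparison can be pictured as a simple gate: collect the same metrics for the baseline and the canary over the observation window, then promote only if the canary stays within tolerance on each one. The following TypeScript is a minimal sketch under assumptions. The metric names, the use of conversion rate as the business metric, and all threshold values are hypothetical, since the case study does not state the actual ones.

```typescript
// Hypothetical canary gate. Metric names and tolerance values are assumptions
// for illustration; the case study does not publish the real thresholds.
interface WindowMetrics {
  p95LatencyMs: number;
  errorRate: number;      // failed requests / total requests, 0..1
  conversionRate: number; // stand-in business metric, 0..1
}

interface CanaryTolerance {
  maxLatencyIncrease: number;   // relative, e.g. 0.10 allows 10% slower
  maxErrorRateIncrease: number; // absolute, e.g. 0.005 allows +0.5 points
  maxConversionDrop: number;    // relative, e.g. 0.05 allows a 5% drop
}

interface CanaryVerdict {
  promote: boolean;
  reasons: string[];
}

// Compare canary metrics against baseline and list every regression found.
function evaluateCanary(
  baseline: WindowMetrics,
  canary: WindowMetrics,
  tol: CanaryTolerance
): CanaryVerdict {
  const reasons: string[] = [];
  if (canary.p95LatencyMs > baseline.p95LatencyMs * (1 + tol.maxLatencyIncrease)) {
    reasons.push("latency regression");
  }
  if (canary.errorRate > baseline.errorRate + tol.maxErrorRateIncrease) {
    reasons.push("error-rate regression");
  }
  if (canary.conversionRate < baseline.conversionRate * (1 - tol.maxConversionDrop)) {
    reasons.push("business-metric regression");
  }
  return { promote: reasons.length === 0, reasons };
}

// Assumed example values for one observation window.
const baselineWindow: WindowMetrics = { p95LatencyMs: 200, errorRate: 0.01, conversionRate: 0.3 };
const tolerance: CanaryTolerance = { maxLatencyIncrease: 0.1, maxErrorRateIncrease: 0.005, maxConversionDrop: 0.05 };
const slowCanary: WindowMetrics = { p95LatencyMs: 260, errorRate: 0.03, conversionRate: 0.3 };
console.log(evaluateCanary(baselineWindow, slowCanary, tolerance));
```

Returning the list of reasons, not just a boolean, gives the deploying engineer something concrete to investigate when promotion is blocked.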

Cost visibility. The team enabled Cost Explorer at the service level and tagged every resource with environment, owner, and project. A weekly cost review meeting examined the top 20 spenders and decided whether to resize, terminate, or renegotiate reserved capacity. These reviews explicitly excluded the CTO to keep the conversation tactical and blameless.
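Once every resource carries an owner tag, producing the list for a weekly review is mostly a grouping and sorting exercise. The sketch below illustrates that step in TypeScript. The CostLineItem shape, the choice to rank by owner, and the sample figures are assumptions for illustration; the case study does not say whether the top 20 were ranked by resource, service, or team, and this code does not call the Cost Explorer API.

```typescript
// Hypothetical cost roll-up by owner tag. The data shape and sample figures
// are illustrative; this does not call the AWS Cost Explorer API.
interface CostLineItem {
  resourceId: string;
  monthlyCostUsd: number;
  tags: { environment?: string; owner?: string; project?: string };
}

interface OwnerSpend {
  owner: string;
  monthlyCostUsd: number;
}

// Sum cost per owner tag; untagged cost is grouped under "(untagged)".
function costByOwner(items: CostLineItem[]): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const item of items) {
    const owner = item.tags.owner || "(untagged)";
    totals[owner] = (totals[owner] || 0) + item.monthlyCostUsd;
  }
  return totals;
}

// Highest spenders first, truncated to `limit` entries (20 by default).
function topSpenders(items: CostLineItem[], limit: number = 20): OwnerSpend[] {
  const totals = costByOwner(items);
  return Object.keys(totals)
    .map((owner) => ({ owner, monthlyCostUsd: totals[owner] }))
    .sort((a, b) => b.monthlyCostUsd - a.monthlyCostUsd)
    .slice(0, limit);
}

// Made-up line items for demonstration only.
const sampleItems: CostLineItem[] = [
  { resourceId: "db-1", monthlyCostUsd: 100, tags: { owner: "payments", environment: "prod" } },
  { resourceId: "svc-2", monthlyCostUsd: 50, tags: { owner: "payments", environment: "prod" } },
  { resourceId: "ml-3", monthlyCostUsd: 120, tags: { owner: "fraud", environment: "prod" } },
  { resourceId: "tmp-4", monthlyCostUsd: 10, tags: {} },
];
console.log(topSpenders(sampleItems, 2));
```

Keeping an explicit "(untagged)" bucket makes orphaned spend visible in the same report instead of silently dropping it.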

Observability unification. Logs were centralized in CloudWatch Logs with subscription filters fanning out to the security team and an external SIEM. Metrics were exported to Amazon Managed Service for Prometheus, and traces were ingested into X-Ray. A single Grafana dashboard template was published for all teams, built around the four golden signals: latency, traffic, errors, and saturation.
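To make the four golden signals concrete, the sketch below derives them from one window of raw samples in TypeScript. It is an illustration under assumptions: the sample shapes, the choice of p95 for latency, and the use of peak utilization for saturation are not taken from FinEdge's dashboard template, which the case study does not describe in detail.

```typescript
// Hypothetical golden-signal summary for one time window. Field names and the
// p95 / peak-utilization choices are assumptions, not FinEdge's template.
interface RequestSample {
  durationMs: number;
  failed: boolean;
}

interface GoldenSignals {
  latencyP95Ms: number; // latency
  trafficRps: number;   // traffic
  errorRatio: number;   // errors
  saturation: number;   // saturation, as peak utilization 0..1
}

// Nearest-rank percentile over an unsorted list; returns 0 for an empty list.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  return sorted[index];
}

// Summarize one window of request samples plus utilization readings
// (for example CPU or connection-pool usage, each between 0 and 1).
function summarizeWindow(
  samples: RequestSample[],
  windowSeconds: number,
  utilizationReadings: number[]
): GoldenSignals {
  const failures = samples.filter((s) => s.failed).length;
  return {
    latencyP95Ms: percentile(samples.map((s) => s.durationMs), 95),
    trafficRps: samples.length / windowSeconds,
    errorRatio: samples.length === 0 ? 0 : failures / samples.length,
    saturation: utilizationReadings.reduce((peak, u) => Math.max(peak, u), 0),
  };
}

// Made-up window: 20 requests over 10 seconds, the first two of which failed.
const demoSamples: RequestSample[] = [];
for (let i = 1; i <= 20; i++) {
  demoSamples.push({ durationMs: i * 10, failed: i <= 2 });
}
console.log(summarizeWindow(demoSamples, 10, [0.4, 0.7, 0.55]));
```

Agreeing on one definition per signal is what lets a shared template replace hundreds of one-off panels.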

Results

The migration concluded on schedule. The weekly cost review board declared the migration complete on April 28, 2024, when the final production workload switched from the legacy environment to AWS and the old data center contract was terminated.

From a business perspective, the migration delivered on every stated goal. Customer complaints related to downtime fell by 74%, and the Net Promoter Score for platform reliability rose from 31 to 58. Engineering teams reported a marked improvement in sprint predictability because they spent less time firefighting infrastructure and more time delivering product features.

The finance team celebrated the end of the data center lease, which had been a five-year commitment with early-termination penalties. By exiting early, FinEdge avoided an estimated $320,000 in future costs. The remaining data center assets were recovered, decommissioned, and repurposed within three weeks, generating additional cash through hardware resale.

Perhaps the most underappreciated outcome was cultural. The migration forced every team to document their dependencies. Before the project, only 34% of service dependencies were formally recorded. By the end, 98% had explicit upstream and downstream maps. That documentation became the foundation for an architecture review board that now evaluates new service proposals before any line of code is written.

Metrics

The following list summarizes the key performance changes between the pre-migration baseline (October 2023) and the post-migration steady state (May 2024):

Infrastructure cost: from $164,000/month to $95,000/month, a 42% reduction.
SLO adherence: from 83% to 98.7%, exceeding the original 95% target.
Mean time to recovery (MTTR): from 42 minutes to 7 minutes for P1 incidents.
Customer-visible incidents: from 23 in four weeks to 3 in four weeks.
Deployment frequency: from 2.1 per week to 8.4 per week, a four-fold increase.
Change failure rate: from 18% to 4%, reflecting the impact of canary releases and automated testing.
Tag compliance: from 34% of resources tagged to 99%, enabling accurate cost attribution.
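The headline percentages follow directly from the before and after figures listed above. The short TypeScript sketch below reproduces the arithmetic and checks two of the stated goals. The function names are illustrative, and the derived MTTR and incident reductions are computed here from the listed figures rather than quoted from the case study.

```typescript
// Arithmetic check of the listed metrics. Function names are illustrative;
// the input figures come from the list above.
function percentReduction(before: number, after: number): number {
  return ((before - after) / before) * 100;
}

function foldIncrease(before: number, after: number): number {
  return after / before;
}

// Monthly infrastructure cost: $164,000 -> $95,000.
const costReductionPct = percentReduction(164000, 95000); // about 42.07

// Deployment frequency: 2.1 -> 8.4 per week.
const deployFold = foldIncrease(2.1, 8.4); // about 4

// Derived from the listed figures, not stated in the case study:
const mttrReductionPct = percentReduction(42, 7);     // about 83.33
const incidentReductionPct = percentReduction(23, 3); // about 86.96

// Goal checks: at least 35% cost reduction, MTTR under 10 minutes.
const costGoalMet = costReductionPct >= 35;
const mttrGoalMet = 7 < 10;

console.log(costReductionPct.toFixed(2), deployFold.toFixed(1));
console.log(mttrReductionPct.toFixed(2), incidentReductionPct.toFixed(2));
console.log(costGoalMet, mttrGoalMet);
```

The cost figure works out to (164,000 - 95,000) / 164,000, or roughly 42.07%, which rounds to the 42% quoted throughout.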

One metric that did not improve was build time. The addition of security scanning and CDK synthesis added roughly eight minutes to end-to-end pipelines. The team mitigated this by introducing remote caching and separating fast-fail linting from heavier integration tests, dropping the median build time back to pre-migration levels by mid-June.

Lessons Learned

The migration team recorded observations in a living retrospective document. Here are the lessons that shaped future work at FinEdge and may benefit any organization undertaking a similar journey.

1. Start with networking. The team initially debated moving applications first and networking later. That would have been a mistake. A well-designed Transit Gateway and consistent IAM policy prevented hours of debugging that would otherwise have followed each service move. In networking, a day of prevention is worth weeks of incident response.

2. Tag everything from day one. Delaying tagging by even a week created an orphan population of resources that were invisible to cost dashboards. The team built tagging validation into the CDK pipeline, making it impossible to deploy untagged resources into production accounts.

3. Right-size after migration, not before. Attempting to predict optimal instance sizes before seeing real traffic patterns in the new environment led to over-provisioning during the first two weeks. The team learned to provision conservatively, rely on auto scaling, and apply Compute Optimizer recommendations two weeks after each service cutover.

4. Canary releases are non-negotiable. Three incidents during the migration were caught by canary analysis before any customer impact occurred. The small time cost of canary validation is an insurance policy against brand damage and regulatory scrutiny in financial services.

5. Observability should follow stability. Building dashboards for services that are still changing every day is frustrating and wasteful. The team waited until each wave of services had run in production for at least a week before publishing team-wide observability templates. The resulting dashboards were accurate, adopted quickly, and required almost no rework.

6. Document decisions, not just systems. Architecture Decision Records turned out to be more valuable than runbooks. When a new engineer joined three months after the migration, they could read the ADRs and understand why the Transit Gateway was designed with three availability zones, why RDS was chosen over Aurora for certain workloads, and why the team rejected a multi-cloud strategy that had been proposed during planning.

Six months after go-live, FinEdge Financial continues to refine its AWS environment. Reserved Instance coverage is now above 72%, saving an additional estimated $180,000 annually. A dedicated platform engineering team maintains the shared CDK constructs and observability templates, ensuring that the improvements are institutional rather than individual. The platform is cheaper, faster to ship on, and quieter at night—exactly the outcome the board had demanded.