How We Cut Cloud Spend by 47% While Raising Platform Uptime to 99.98% for a 2M-User Fintech

When a rapidly growing fintech platform faced mounting infrastructure costs and recurring outages at peak trading hours, we didn't just patch the problem; we rebuilt their entire AWS architecture from the ground up. In seven months, we achieved a 47% reduction in monthly cloud spend, 99.98% platform uptime, a 1.6-second improvement in p99 API latency, and a smooth transition to zero-downtime deployments, all without disrupting their 2 million users.

Overview

In early 2024, FinCore — a digital payments and wealth-management platform serving over two million registered users across South and Southeast Asia — approached Webskyne with a problem that felt, at the time, almost unsolvable. Their monthly AWS bill had crossed $128,000, their engineering team was spending nearly 40% of their sprint capacity fighting fires rather than shipping features, and their customers were starting to notice. During peak trading windows — market open at 9:15 AM and close at 3:30 PM local time — the platform experienced an average of three significant incidents per month, ranging from degraded loading times to outright service unavailability that cascaded across their payment gateway, portfolio tracker, and customer support dashboards.

This case study details how our cloud architecture and DevOps team engaged with FinCore's technology leadership over a seven-month engagement, designed and implemented a comprehensive infrastructure modernization program, and delivered measurable outcomes that fundamentally transformed the platform's operational and financial posture.

The Challenge

FinCore's infrastructure had grown organically over four years, piece by piece, without a cohesive architecture strategy. Multiple engineering teams had independently provisioned services using their preferred AWS patterns, resulting in a sprawl of EC2 instances, RDS databases, Lambda functions with overlapping responsibilities, and a handful of third-party SaaS monitoring tools that communicated poorly with one another. There was no centralized logging strategy, no formal incident management process, and no infrastructure-as-code discipline — meaning every environment (development, staging, production) was subtly, dangerously different.

The most visible symptom was cost. Uncapped autoscaling policies and a collection of long-running instances serving low-traffic services inflated the bill, while a complete absence of resource tagging made it nearly impossible to know which team or service was responsible for which portion of it. In addition, several high-traffic APIs were running on burstable t-series instances that were under-provisioned for their actual load, resulting in noisy-neighbor contention and frequent CPU throttling.
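
To make the tagging problem concrete, here is a minimal sketch of cost attribution by tag. The line items, the `team` tag key, and the amounts are hypothetical and are not FinCore billing data; the point is only that anything without a tag ends up in a bucket nobody owns.

```python
from collections import defaultdict

# Hypothetical billing line items; tag names and amounts are illustrative only.
LINE_ITEMS = [
    {"resource": "i-0a1", "cost": 420.0, "tags": {"team": "payments"}},
    {"resource": "i-0b2", "cost": 310.0, "tags": {}},
    {"resource": "db-main", "cost": 980.0, "tags": {"team": "portfolio"}},
    {"resource": "i-0c3", "cost": 150.0, "tags": {"team": "payments"}},
]


def cost_by_tag(items, tag_key="team"):
    """Sum cost per tag value; untagged resources land in 'UNTAGGED'."""
    totals = defaultdict(float)
    for item in items:
        owner = item["tags"].get(tag_key, "UNTAGGED")
        totals[owner] += item["cost"]
    return dict(totals)


totals = cost_by_tag(LINE_ITEMS)
untagged_share = totals.get("UNTAGGED", 0.0) / sum(totals.values())
print(totals)
print(f"Unattributable share: {untagged_share:.1%}")
```

With no tags at all, as in the legacy setup described above, the entire bill would fall into the unattributable bucket.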

The reliability challenge was equally serious. FinCore's primary RDS PostgreSQL instance — serving read traffic for both their portfolio tracker and their historical analytics engine — was a single-AZ configuration with no automated failover. Any AZ-level hardware failure would result in a complete platform outage with no recovery path other than a manual restoration from backup, something their on-call engineers had been forced to perform twice in the preceding quarter.

On the monitoring and observability side, FinCore had adopted four different tools over the years — CloudWatch for basic metrics, Datadog for application tracing, Sentry for error tracking, and a legacy ELK stack for log aggregation — none of which integrated cleanly. Engineers typically diagnosed incidents by manually correlating dashboards across three tabs. Mean Time To Acknowledge (MTTA) for P1 incidents was averaging 18 minutes; Mean Time To Resolve (MTTR) was averaging 47 minutes.
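
For readers less familiar with these two measures, the sketch below shows one common way to compute them from incident timestamps. The incident records are hypothetical, chosen only so the averages match the figures quoted above, and we assume both intervals are measured from the moment the incident is opened; some teams measure MTTR from acknowledgement instead.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical P1 incident records: (opened, acknowledged, resolved).
T0 = datetime(2024, 1, 15, 9, 15)
INCIDENTS = [
    (T0, T0 + timedelta(minutes=20), T0 + timedelta(minutes=50)),
    (T0, T0 + timedelta(minutes=16), T0 + timedelta(minutes=44)),
]


def mean_minutes(incidents, start_idx, end_idx):
    """Average elapsed minutes between two timestamps of each incident."""
    return mean(
        (inc[end_idx] - inc[start_idx]).total_seconds() / 60 for inc in incidents
    )


mtta = mean_minutes(INCIDENTS, 0, 1)  # opened -> acknowledged
mttr = mean_minutes(INCIDENTS, 0, 2)  # opened -> resolved
print(f"MTTA {mtta:.0f} min, MTTR {mttr:.0f} min")
```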

Goals

From our first workshop with FinCore's CTO and engineering leads, we established four non-negotiable goals. Cost reduction had to be meaningful and sustainable — not a one-time trim achieved by shutting down idle resources, but a structural reduction driven by architectural discipline, right-sizing, and pricing model optimization. The target we agreed on was 40% or better reduction in monthly AWS spend, measured on a like-for-like basis over a normalized six-month operational window.

Platform reliability was the second priority. We needed to eliminate single points of failure across the entire request path and achieve at least 99.95% monthly uptime — a significant lift from the approximately 99.2% baseline. An auxiliary goal in this category was reducing MTTA to under five minutes and MTTR to under 20 minutes for P1 incidents, enabled by centralized observability and automated alerting.
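
Uptime percentages are easier to reason about as a downtime budget. The short sketch below converts the baseline and the target into minutes per month, assuming a 30-day month: roughly 346 minutes at 99.2% versus about 22 minutes at 99.95%.

```python
def downtime_minutes(uptime_pct, days=30):
    """Minutes of allowed downtime in a window at a given uptime percentage."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - uptime_pct) / 100


for pct in (99.2, 99.95):
    print(f"{pct}% -> {downtime_minutes(pct):.1f} min per 30-day month")
```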

Performance, measured in API p99 latency, needed to improve by at least 25% across the five highest-traffic endpoints. This was both a user experience requirement and an infrastructure efficiency indicator — faster APIs tend to handle load more economically, compounding the cost-reduction benefit.

Finally, the engineering team needed to own the new platform from day one of handover. We established that every infrastructure change during the engagement would be governed by Terraform, that deployments would follow a GitOps model, and that FinCore's engineers would participate in every sprint review and architecture decision meeting. The sustainability of the solution was as important as the solution itself.

Our Approach

We structured the engagement into six phases, each with clearly defined deliverables and success criteria. The first phase was discovery and assessment. We conducted a comprehensive infrastructure audit — reviewing every AWS resource across all accounts, analyzing three months of CloudWatch and cost data, auditing IAM permissions and security posture across the organization, and running structured interviews with every engineering team to understand their current workflows, pain points, and tooling gaps. We delivered a 48-page assessment report containing a prioritized list of 87 infrastructure improvement items, each with an effort estimate, cost impact, and risk rating.

Phase two was architecture design. We produced a target-state architecture document covering compute, storage, networking, data, security, observability, and deployment strategy. A critical decision made early in this phase was the adoption of a multi-account AWS Organizations structure, with separate accounts for production, staging, development, and shared services — replacing the single-account setup that had been the source of so many environment-discrepancy bugs. We also committed to a container-based compute model using Amazon ECS with Fargate for stateless services, replacing the heterogeneous mix of EC2, ECS on EC2, and Lambda functions.

Phase three was the infrastructure sprint cycle. Rather than attempting a big-bang migration — which history suggests often goes poorly at this scale — we implemented a phase-gated approach. Sprint one established foundational infrastructure: networking with private subnets, security groups hardened to the principle of least privilege, IAM with centralized access management via SSO, and centralized logging with OpenSearch replacing the legacy ELK stack. Sprint two migrated the highest-risk service — the payment processing API — first, as a proof of concept that would validate our tooling, our runbooks, and our incident response process before we moved on to higher-traffic services.

Sprints three through five covered the remaining services in batches of two to three. Each sprint included delivery of infrastructure-as-code using Terraform, a comprehensive operational handover, and a sprint review with FinCore's leadership. Throughout, we kept the legacy architecture running in parallel in production and shifted traffic to the new stack gradually using weighted routing, validating performance at each stage before removing the legacy infrastructure.
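
The control loop behind a staged traffic shift can be sketched in a few lines. Everything here is an assumption for illustration: the weight steps, the 1% error-rate threshold, and the `check_health` callback are hypothetical and are not the values or tooling used in the engagement.

```python
# Hypothetical rollout plan: percentage of traffic sent to the new stack.
STEPS = [5, 25, 50, 100]


def shift_traffic(steps, check_health, max_error_rate=0.01):
    """Advance through weight steps; roll back to 0% on the first failed check.

    check_health(weight) is a caller-supplied function returning the observed
    error rate on the new stack at that weight.
    """
    history = []
    for weight in steps:
        error_rate = check_health(weight)
        history.append((weight, error_rate))
        if error_rate > max_error_rate:
            return 0, history  # send all traffic back to the legacy stack
    return steps[-1], history


# Simulated observations: the new stack degrades once it takes half the load.
def observed(weight):
    return 0.002 if weight < 50 else 0.03


final_weight, history = shift_traffic(STEPS, observed)
print(final_weight, history)
```

In this simulated run the shift stops at the 50% step and returns all traffic to the legacy stack, which is the kind of early exit a big-bang cutover cannot offer.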

Implementation Details

The compute migration to Amazon ECS with Fargate was the heaviest lift of the project. We designed a service-per-container pattern where each stateless service — the payment API, the portfolio tracker, the notifications service, the analytics aggregator — was packaged as a Docker container and deployed as an independent ECS Fargate service. Services were configured with auto-scaling policies driven by both CPU utilization and custom CloudWatch metrics for latency and request queue depth, replacing the static capacity planning that had been the norm.
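
As a rough illustration of scaling on several signals at once, the sketch below sizes a service for whichever metric is furthest over its target. It is a simplified proportional rule in the spirit of target tracking, not the algorithm AWS Application Auto Scaling actually runs, and the metric names, targets, and task limits are all hypothetical.

```python
import math


def desired_tasks(current, metrics, targets, min_tasks=2, max_tasks=20):
    """Proportional scaling: size for whichever metric is furthest over target."""
    ratios = [metrics[name] / targets[name] for name in targets]
    wanted = math.ceil(current * max(ratios))
    return max(min_tasks, min(max_tasks, wanted))


# Hypothetical targets and readings for one service.
TARGETS = {"cpu_pct": 60.0, "p99_ms": 800.0, "queue_depth": 50.0}
reading = {"cpu_pct": 45.0, "p99_ms": 1200.0, "queue_depth": 30.0}

print(desired_tasks(current=4, metrics=reading, targets=TARGETS))
```

Here CPU alone would suggest scaling in, but latency is 50% over target, so the service grows from four tasks to six. That is the practical argument for not relying on CPU utilization alone.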

For stateful services (the primary PostgreSQL database, the analytics data warehouse on Redshift, and the document store for user profiles on DynamoDB), we implemented a data tier separation strategy. OLTP workloads (user authentication, payment transactions, account management) were consolidated onto Amazon RDS PostgreSQL with automated Multi-AZ failover, read replicas for analytics query offloading, and automated backup snapshots retained for seven days. Unstructured and semi-structured data, such as user activity logs, session data, and event streams, moved to DynamoDB, which provided automatic scaling and single-digit-millisecond access latency. Heavy analytical workloads migrated to Redshift Serverless, replacing the long-running Redshift cluster that had been the single largest cost driver on FinCore's bill.

On the networking and security side, we restructured the VPC into a multi-tier private subnet model with no public subnet exposure for compute resources. All outbound IPv4 internet access was routed through NAT gateways, while IPv6 traffic used egress-only internet gateways to keep NAT costs down. Security groups were audited and their number reduced by an average of 60% per service, with a new policy that every group must justify its existence with a documented use case. IAM policies moved to permission boundaries aligned with each service's actual AWS API usage, analyzed via IAM Access Analyzer across three months of production logs. The result: IAM Access Analyzer findings dropped from 127 to 7.
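
The idea of aligning permissions with observed usage can be sketched as follows. This is a conceptual illustration only: it does not call IAM Access Analyzer, the list of observed calls is hypothetical, and the wildcard resource is a placeholder that a real policy would narrow to specific ARNs.

```python
import json

# Hypothetical API calls observed for one service over the review window.
OBSERVED_CALLS = [
    "dynamodb:GetItem",
    "dynamodb:PutItem",
    "s3:GetObject",
    "dynamodb:GetItem",
    "sqs:SendMessage",
]


def build_policy(observed_calls):
    """Build an allow-list policy document containing only observed actions."""
    actions = sorted(set(observed_calls))
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ObservedActionsOnly",
                "Effect": "Allow",
                "Action": actions,
                # Placeholder: a real policy should scope this to specific ARNs.
                "Resource": "*",
            }
        ],
    }


policy = build_policy(OBSERVED_CALLS)
print(json.dumps(policy, indent=2))
```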

The observability overhaul was, for many of FinCore's engineers, the most immediately impactful change. We consolidated four monitoring tools into a unified stack: Amazon Managed Service for Prometheus for metrics collection, AWS X-Ray for distributed tracing across all services, Amazon OpenSearch for centralized log aggregation and search, and a single Grafana dashboard layer that replaced three separate tool dashboards. Alerting was centralized in Amazon EventBridge with on-call routing through PagerDuty, replacing the manual alert triage process. A synthetic monitoring suite using CloudWatch Synthetics runs ten pre-configured user journeys every three minutes, including login, portfolio load, trade execution, deposit, and withdrawal, providing proactive incident detection before real customers notice.
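
To show the shape of a synthetic check without tying it to any vendor API, here is a minimal sketch of a journey runner. It does not use CloudWatch Synthetics; the journey functions, the two-second budget, and the failure format are all hypothetical stand-ins.

```python
import time


def run_journeys(journeys, max_seconds=2.0):
    """Run each journey once; report those that raise or exceed the time budget."""
    failures = {}
    for name, step in journeys.items():
        started = time.perf_counter()
        try:
            step()
        except Exception as exc:  # any error counts as a failed journey
            failures[name] = f"error: {exc}"
            continue
        elapsed = time.perf_counter() - started
        if elapsed > max_seconds:
            failures[name] = f"slow: {elapsed:.2f}s"
    return failures


# Stand-in journeys; real canaries would drive the login or trade flows.
def login():
    return "ok"


def portfolio_load():
    raise TimeoutError("portfolio API did not respond")


failures = run_journeys({"login": login, "portfolio_load": portfolio_load})
print(failures)
```

A scheduler would invoke a runner like this at a fixed interval and forward any non-empty result to the alerting pipeline.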

Results

The cost reduction was immediate and substantial. A comparison between the last full month on the legacy architecture and the first full month on the new stack showed a 47% reduction in AWS infrastructure spend: from a normalized monthly baseline of $124,000 down to $65,600. The savings came from a combination of right-sizing Fargate task CPU and memory allocations using four weeks of actual performance data, moving long-running services to appropriate purchase options, decommissioning provisioned Redshift capacity in favor of Redshift Serverless, and shutting down the orphaned instances that represented over $12,000 monthly in previously untracked spend. FinCore has maintained a cost-monitoring dashboard on their internal Grafana, and spending discipline is now an explicit part of their weekly engineering review.

Platform uptime reached 99.98% within the first month of full production migration, compared to the 99.22% baseline established over the prior six months. The two most frequent incident types, AZ-level database outages and traffic spikes overwhelming the payment API, were eliminated entirely. The payment API lost zero seconds of availability in Q2 2025, compared to 312 seconds of total downtime in Q2 2024. The synthetic monitoring suite detects roughly two potential incidents per week before they reach customers; of those, 80% are resolved before any user impact occurs. The team now treats this forward-looking reliability posture as a competitive differentiator in their sales conversations.

API performance improvements were measurable across the board. The two highest-traffic endpoints, the portfolio summary API and the order execution API, improved average latency by 31% and 29% respectively, driven by connection pooling, in-memory caching for the Fargate services using Amazon ElastiCache for Redis, and query optimization on the PostgreSQL read replicas. Average API response time dropped from 892 milliseconds to 620 milliseconds; the 99th percentile dropped from 2,840 milliseconds to 1,240 milliseconds, a meaningful improvement during peak trading windows when response variance is most visible to users.
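
The caching pattern behind these gains is a read-through cache with a time-to-live. The sketch below uses an in-process dictionary as a stand-in for Redis so that it runs anywhere; the loader function, the five-second TTL, and the key format are hypothetical.

```python
import time


class ReadThroughCache:
    """Tiny in-process TTL cache; a dict stands in for Redis in this sketch."""

    def __init__(self, loader, ttl_seconds=5.0, clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}
        self.misses = 0

    def get(self, key):
        entry = self._store.get(key)
        now = self._clock()
        if entry is not None and entry[1] > now:
            return entry[0]  # fresh hit: no database round trip
        self.misses += 1
        value = self._loader(key)  # fall through to the database
        self._store[key] = (value, now + self._ttl)
        return value


# Hypothetical loader standing in for a read-replica query.
def load_portfolio_summary(user_id):
    return {"user_id": user_id, "total_value": 1000}


cache = ReadThroughCache(load_portfolio_summary)
for _ in range(3):
    summary = cache.get("user-42")
print(summary, "database reads:", cache.misses)
```

Three requests for the same key produce a single database read, which is how a cache lowers both latency and read load on the replicas.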

Key Metrics

Metric	Before	After	Change
Monthly AWS Spend	$124,000	$65,600	▼ 47%
Monthly Uptime	99.22%	99.98%	▲ +0.76pp
Avg API Latency	892ms	620ms	▼ 30.5%
P99 API Latency	2,840ms	1,240ms	▼ 56.3%
MTTA (P1)	18 min	4 min	▼ 78%
MTTR (P1)	47 min	19 min	▼ 60%
Deploy Lead Time	14 days	< 2 hours	▼ >99.4%
IAM Findings	127	7	▼ 94.5%
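
The percentage changes for the rows with exact before and after values can be reproduced with a few lines of arithmetic. The sketch below takes its inputs directly from the table; the helper name is ours.

```python
def pct_reduction(before, after):
    """Percentage reduction from before to after, rounded to one decimal."""
    return round((before - after) / before * 100, 1)


# (before, after) pairs taken from the Key Metrics table.
METRICS = {
    "Monthly AWS Spend ($)": (124_000, 65_600),
    "Avg API Latency (ms)": (892, 620),
    "P99 API Latency (ms)": (2_840, 1_240),
    "MTTA P1 (min)": (18, 4),
    "MTTR P1 (min)": (47, 19),
    "IAM Findings": (127, 7),
}

for name, (before, after) in METRICS.items():
    print(f"{name}: {pct_reduction(before, after)}% lower")
```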

Lessons Learned

The most consequential lesson from this engagement is that architectural debt compounds in a way that billing alone does not reveal. FinCore's infrastructure problems were not simply cost problems; they were visibility problems, security problems, governance problems, and velocity problems packaged in one. The team with the best metrics dashboard and the most disciplined code review process still lost the battle on reliability because the infrastructure beneath them was never designed to support the traffic patterns they were now experiencing. Investing in architecture — and in the governance processes that keep architecture honest — pays dividends that accrue across every dimension of the business simultaneously.

A second lesson concerns human factors in infrastructure modernization. Our phase-gated migration approach, which moved one service at a time with staged traffic shifting, required significantly more planning time than the team had initially hoped. But that extra planning saved weeks of incident management later. Three times during the engagement (twice during the payment API migration and once during the database failover drill), the phased approach revealed edge cases that would have been catastrophic had they emerged in a big-bang cutover. Speed of delivery matters, but not having to revert matters more.

Third, the engineering team's active participation in the infrastructure sprint process was not optional — it was the single biggest predictor of long-term success. Every Terraform module we co-wrote with FinCore's engineers in week three was owned and maintained by them after month seven. The ones we wrote in isolation reflected different naming conventions, different testing approaches, and different deployment rhythms — and those isolated modules were the ones that were hardest to maintain after handover. The lesson is clear: extract knowledge, enforce ownership, and make the handover a learning process, not a knowledge-transfer event.

Finally, and perhaps most counter-intuitively, cost optimization and reliability investment reinforced each other here rather than working in tension. The right-sizing work that drove the 47% spend reduction also eliminated the noisy-neighbor CPU throttling that had been a primary contributor to the platform's earlier reliability problems. The caching layer introduced to reduce API latency also reduced database read load, easing the migration path to cheaper storage tiers. The GitOps deployment pipeline built for velocity also eliminated the last-mile configuration errors that had been causing environment-drift incidents. The right architecture does not force trade-offs; it resolves the tension between competing objectives by changing the problem itself.