
17 May 2026 · 20 min read

Zero-Downtime Migration: How FinFlow Cut Infrastructure Costs by 62% While Serving 2M+ Transactional Users

FinFlow, a high-volume fintech platform processing over 12 million transactions monthly, was drowning in rising AWS bills and fragile manual deployments. After a six-month cloud-native overhaul — including a zero-downtime Kubernetes migration and full observability rebuild — the engineering team cut annual infrastructure spend by roughly $834,000, reduced deployment time from 47 minutes to under 90 seconds, and brought system availability from 99.65% to 99.97%. This is the complete playbook.

Case Study · Cloud Architecture · Kubernetes · Infrastructure as Code · Fintech Engineering · DevOps · Observability · Platform Engineering · Cost Optimization
## Overview

Founded in 2018, FinFlow emerged as a fast-growing neo-banking infrastructure provider, enabling mid-sized financial institutions to launch branded digital banking products without building the entire backend stack. By the end of 2024, the platform was orchestrating more than 12 million transactions per month, serving over 2.1 million end-users across Southeast Asia, with a mandate to expand into three new markets in 2025.

The engineering team had grown from three founding engineers to a 32-person platform squad spanning backend, DevOps, SRE, and data engineering. Yet the underlying infrastructure — originally prototyped on a hastily assembled AWS setup — had not received the systematic architectural treatment its scale now demanded. The platform ran a monolithic backend on a fleet of c5.xlarge EC2 instances behind an Application Load Balancer, a managed PostgreSQL deployment that had outgrown its provisioned IOPS, and an event-processing pipeline built on a handful of long-running Lambda functions sharing a single dead-letter queue with no alerting. Staging mirrored production only loosely, and deployments went out via SSH sessions and manual `pm2 restart` commands on the production servers. Despite — or perhaps because of — its strengths in product and market fit, the FinFlow codebase and infrastructure were quietly approaching a breaking point.

---

## The Challenge

By Q3 2024, FinFlow's leadership team was staring at three converging and compounding pressures.

**Rising infrastructure costs** were the most visible. AWS billing data over the prior 18 months painted a concerning picture: costs had climbed from $47,000/month to $112,000/month, a 138% increase over a period in which MAUs had grown only 81%. A cluster of unused ELB instances left over from crowd-funding experiments at a scrapped subsidiary was contributing roughly $8,400/month, unbeknownst to the DevOps team until a routine tagging audit in July. Over-provisioned EC2 instances sitting at more than 60% idle CPU were also flagged as a key cost driver.

**Fragile deployment practices** were creating real operational risk. The typical deployment — an SSH-to-prod workflow overseen by a single engineer on a call — averaged 47 minutes from initiation to production confirmation. Post-deployment hotfixes were a weekly occurrence, typically triggered by a degraded endpoint that customers noticed before the internal monitoring did. An internal incident in June — in which a faulty database migration was inadvertently re-run on production during a rolling restart, briefly exposing duplicate transaction records to 4,700 users — was the turning point. Financial regulators issued a formal notice, and the engineering team went into remediation lock-down. The incident also revealed that deployment code was duplicated across three internal repositories and managed through text-file instructions in a shared wiki.

**Observability was theoretical.** The platform ran a basic CloudWatch dashboard, a Datadog monitor for the API server, and a PagerDuty trigger for CPU above 90%. There was no tracing, no structured-log ingestion pipeline for the production databases, no alert on failed write transactions, and no shared dashboard for cross-squad incident review. Engineers diagnosing a latency spike during peak transaction windows on a Sunday night spent up to three hours chasing through ad-hoc log queries, AWS Console sessions, and SSH sessions without finding a root cause — only to see the issue resolve itself.
The stakes were not just cost and performance. FinFlow carried regulatory compliance requirements (particularly around transaction tamper-evidence and audit logging) that mandated precise infrastructure controls, audit trails, and the ability to reproduce any state within 24 hours. The existing setup fell short on all three counts.

---

## Strategic Goals

Instead of treating the challenges as a straightforward DevOps modernization exercise, the leadership team defined a set of goals designed to be both backward-compatible with existing commitments and forward-looking for the planned market expansion.

**Primary goal: Zero-downtime cloud migration.** The engineering team was to move the transactional pipeline — consisting of the account API, ledger backend, and settlement processing layer — to a containerized Kubernetes architecture without any user-facing impact or transaction-time degradation. The migration was to be completed within 24 weeks and validated through chaos engineering tests against a production data clone.

**Secondary goal: 50% reduction in infrastructure cost by Q1 2025.** Beyond the obviously wasteful resources, this required right-sizing all compute, migrating batch workloads to spot instances where appropriate, and implementing granular caching layers for read-heavy endpoints. The target was $72,000/month in recurring spend by January 2025.

**Tertiary goal: Production-grade observability and compliance.** Every cross-cutting requirement — end-to-end distributed tracing, structured immutable logging, per-minute database health gating, and hardened audit trails — had to be operationalized to meet the regulator's expectations. The target was to reduce mean time to detect (MTTD) to under 90 seconds and mean time to resolve (MTTR) to under 20 minutes for all P1 incidents by year end.

**Enabling goal: Developer experience transformation.** Engineering manager Priya Nair noted at the project kickoff that the most important outcome was not infrastructure per se — it was the team's ability to ship with confidence. The CD pipeline, standardized infrastructure definitions, and automated test harnesses had to reduce deployment duration to under 120 seconds and make post-deployment hotfixes a rare exception (target: less than 2% of all monthly deployments).

---

## Approach

The team adopted a seven-phase approach, sequenced to minimize risk, maintain business continuity, and allow the team to learn and iterate at each phase before entering the next.

### Phase 1 — Foundation and Discovery (Weeks 1–2)

The first pass was not about building; it was about understanding with sufficient fidelity to build the right thing. The team conducted a comprehensive infrastructure audit using AWS Cost Explorer, CloudAsset, and internal usage logs. They identified **23 separate EC2 instances running at less than 15% average CPU utilization** over a 90-day window, representing approximately $18,000/month in wasteful spend. A review of the deployment pipeline revealed that 62% of the cost-to-deploy overhead was manual QA waiting time on non-scoped releases.

Simultaneously, the SRE team instrumented a proof-of-concept OpenTelemetry collector attached to a single replica, producing end-to-end traces for one critical path and establishing the first baseline for transaction latency: the p99 for a settlement transaction was 340ms — well above the 200ms service-level objective (SLO). With a clear baseline and a prioritized list of wasteful spend, the team could finally quantify the work.
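As a point of reference, a proof-of-concept collector of this kind needs only a few lines of configuration. The sketch below is illustrative rather than FinFlow's actual setup: the receiver, exporter endpoint, and pipeline layout are assumptions.

```yaml
# Minimal OpenTelemetry Collector configuration (illustrative sketch only).
# Receives OTLP spans from a single instrumented replica and forwards them
# to a Jaeger-compatible backend; the endpoints are assumed placeholders.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}

exporters:
  otlp:
    endpoint: jaeger-collector.observability.svc:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```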
### Phase 2 — Infrastructure as Code and Environment Parity (Weeks 3–6)

The second phase was the foundation everything else depended on. The team adopted Terraform for all infrastructure definitions, with OpenTofu as a Terraform-compatible open-source option for module management. Unlike the existing hand-configured AWS resources, every piece of infrastructure — VPC subnets, security groups, EKS node pools, RDS Aurora PostgreSQL instances, ElastiCache Redis clusters, and IAM role bindings — was codified into version-controlled modules with explicit environment promotion gates.

Environment parity was the non-negotiable implementation contract: **staging was required to match the production topology at the resource type, version, and connectivity level before the phase could be signed off.** This was enforced using Terragrunt workspace promotion gates and a linting pass that required at least 95% parity between environments before a PR could be merged.

By the end of Phase 2, the team had migrated all 73 AWS resources to Terraform modules, eliminated all manual infrastructure changes, and brought a six-hour environment reproduction task down to a single 40-minute `plan`/`apply` flow. The biggest immediate win, though, was one of discovery: a forgotten IAM role — `finflow-analytics-reader` — that had been granted public-read access to an S3 bucket containing three years of transaction analytics reports was detected and immediately revoked, closing a potential data-breach vector whose regulatory consequences alone could have been catastrophic.

### Phase 3 — Kubernetes Core and Migration Gate (Weeks 7–12)

With infrastructure defined as code and validated across environments, the Kubernetes core deployment began. The team chose Amazon EKS on ARM-based Graviton3 nodes for the compute layer — an architecture change projected to cut per-vCPU pricing by 20% relative to x86 while maintaining identical workload performance, as validated by a week-long comparison run in staging. The choice also aligned with FinFlow's sustainability commitments, a secondary but non-negligible requirement from the venture capital board.

The migration strategy was a **blue/green sidecar pattern**: the new Kubernetes service was deployed alongside the existing fleet, sharing an identical NLB ingress, with a canary traffic-shifting layer that progressively moved 5% → 25% → 50% → 100% of API traffic over a 10-day window (a sketch of this promotion pattern appears below). At each traffic percentage, comprehensive health checks — including transaction-integrity validation against a shadow write target — ran automatically before the next increment was permitted.

The decision to use canary promotion rather than a hard cut-over was vindicated multiple times. On the seventh day of the blue/green run, the canary layer detected a 3% elevated error rate on a currency-conversion endpoint, triggered by an unhandled edge case in the new container image. Because only 5% of production traffic was affected, the rollback triggered automatically — and the team was able to fix, re-image, and re-deploy the corrected version during the same window. A manual cut-over would have exposed 4,200 users to corrupt exchange-rate data.

By the end of Phase 3 (Week 12), **all production transactional workloads were running on Kubernetes** with zero user-facing downtime recorded.
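The write-up does not name the mechanism that performed the traffic shifting. As one hedged illustration, a progressive promotion of the same shape could be expressed with Argo Rollouts (an assumed tool, chosen because it pairs naturally with the ArgoCD pipeline described below); the service name and image are placeholders.

```yaml
# Illustrative Argo Rollouts manifest mirroring the 5% → 25% → 50% → 100%
# promotion described above. The tool, service name, and image are assumptions.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: account-api                # hypothetical service name
spec:
  replicas: 6
  selector:
    matchLabels:
      app: account-api
  template:
    metadata:
      labels:
        app: account-api
    spec:
      containers:
        - name: account-api
          image: registry.example.com/account-api:v2.4.0   # placeholder image
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: {}                # hold until automated health checks pass
        - setWeight: 25
        - pause: {}
        - setWeight: 50
        - pause: {}
        - setWeight: 100
```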
The deployment pipeline, now configured using ArgoCD with policy-based promotion gates, governed deployments across staging and production from a single GitOps source of truth.

### Phase 4 — Observability, Tracing, and Alerting (Weeks 13–16)

The observability overhaul was technically ambitious and operationally foundational. The team implemented a three-pillar observability stack: **metrics via Prometheus + Thanos for long-term retention**, **logs via Loki + Fluent Bit for inexpensive structured log ingestion**, and **traces via Jaeger + OpenTelemetry auto-instrumentation for distributed request visibility across service boundaries**.

All services were instrumented at the application level (in Go and Node.js) using OpenTelemetry SDKs, with automatic span generation for every database query, I/O operation, and external service call. Each trace was enriched with six standard resource attributes — service name, pod name, cluster, environment, tenant ID, and request ID — enabling engineers to filter directly to specific customer tenants or environment contexts without leaving the Jaeger UI.

Alerting was redesigned entirely. The team adopted an **SLO-alerting methodology** rather than threshold alerting, with per-service error-budget burn-rate alerts evaluated over five-minute, 14-minute, and 30-minute windows (an illustrative rule is sketched below, after the Phase 5 autoscaling discussion). This approach, detailed in Google's SRE literature and championed internally by SRE lead Akshay Patel, meant that alerts only fired when service level objectives were genuinely at risk — not simply when CPU was above 90%. After two weeks of tuning, the false-positive rate dropped from 40% to below 5%, bringing the median time-to-acknowledge for P1 incidents from 18 minutes to 6 minutes.

The observability portal — unified Grafana dashboards covering transaction success rates, infrastructure cost trends, Kubernetes node health, and SLO burn rates — became the default first stop for any post-incident review. During a December stress test simulating Diwali-period transaction volumes (2.8× peak load), the dashboards provided the precise signal that led the team to identify and fix a Redis connection-pool saturation issue 37 minutes into the test — before artificial traffic injection was complete.

### Phase 5 — Right-Sizing, Auto-Scaling, and Caching (Weeks 17–20)

Right-sizing and cost optimization were broken into three parallel workstreams.

**Compute right-sizing.** The team used Thanos query-engine metrics from a 90-day window to calculate each workload's actual resource requirements, replacing the original over-provisioned pod requests with values derived from the 95th-percentile utilization — a margin that absorbs reasonable traffic spikes without headroom waste. This work alone eliminated approximately $18,000/month of wasted compute within six weeks.

**Horizontal and vertical pod auto-scaling.** Kubernetes Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA) policies were tuned using custom metrics from the Prometheus adapter, ensuring that the platform could absorb seasonal traffic spikes — including Black Friday and December settlement windows — without over-provisioning. The EKS Cluster Autoscaler was configured to provision and terminate spot-instance nodes during high-traffic windows, with automatic fallback to on-demand capacity if spot interruption warnings were detected within five minutes.
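A sketch of what one of these custom-metric autoscaling policies could look like follows; the metric name, target value, and workload name are illustrative assumptions rather than FinFlow's actual configuration.

```yaml
# Illustrative HorizontalPodAutoscaler driven by a custom metric exposed
# through the Prometheus adapter. Names and thresholds are assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: settlement-worker                # hypothetical workload
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: settlement-worker
  minReplicas: 4
  maxReplicas: 40
  metrics:
    - type: Pods
      pods:
        metric:
          name: settlement_queue_depth   # served via the Prometheus adapter
        target:
          type: AverageValue
          averageValue: "50"             # scale out above ~50 queued items per pod
```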
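Looping back to the Phase 4 alerting work: a fast-burn rule of the kind described there is typically a short PromQL expression comparing the observed error ratio against a multiple of the error budget. The sketch below assumes a 99.9% availability SLO and illustrative metric, label, and service names; it shows only the five-minute P1 window, and is not FinFlow's actual rule set.

```yaml
# Illustrative Prometheus alerting rule for the P1 fast-burn condition:
# error-budget burn above 14.2x over a five-minute window, assuming a 99.9%
# availability SLO (error budget 0.001). Names are assumptions; production
# rules usually pair this short window with a longer one to avoid flapping.
groups:
  - name: slo-burn-rate
    rules:
      - alert: SettlementAPIFastBurn
        expr: |
          (
            sum(rate(http_requests_total{service="settlement-api", code=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="settlement-api"}[5m]))
          ) > (14.2 * 0.001)
        labels:
          severity: P1
        annotations:
          summary: "settlement-api is burning error budget at more than 14.2x"
```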
**Read-path caching.** FinFlow's API traffic was **62% read-dominated** — account balances, transaction history, and statement downloads — none of which required strictly real-time read consistency. A Redis read-through cache was introduced, with a TTL strategy designed by the SRE team so that reads fall through to PostgreSQL only on a cache miss or eviction. An edge caching layer via CloudFront with aggressive cache-control headers was also applied to public attachment endpoints (statement PDFs, account summaries), diverting approximately **38% of peak API traffic away from backend services entirely**.

### Phase 6 — Security Hardening, Compliance, and Secret Management (Weeks 21–22)

Security and compliance work was deliberately scheduled late in the transformation, after the platform and infrastructure had stabilized — but with the understanding that the migration could not be declared complete until production hardening was. This led to an internal week-long **"security sprint"** in which the SRE team was joined by one of the security auditors, using Amazon Inspector, Trivy for container-image scanning, and IAM Access Analyzer to systematically identify and close gaps before final sign-off.

The key hardening changes included: migration of all secrets from plaintext environment variables and `.env` files to AWS Secrets Manager, with per-service IAM policies and automatic rotation of database credentials every 30 days; enforced mTLS between all internal services via a service mesh inside EKS with per-service certificate issuance; and a hardened audit-logging pipeline streaming all `CREATE`, `UPDATE`, `DELETE`, and `SET SESSION` events from PostgreSQL through Kinesis Data Firehose into an append-only S3 bucket with seven-day immutability via S3 Object Lock.

One of the most impactful changes was arguably the least flashy: the deprecation of all long-lived IAM access keys in favor of mandatory role assumption through identity federation. In the four-week window after enforcement, **zero incidents involving long-lived access keys** were recorded — against four confirmed credential-exposure incidents in 2024 alone.

### Phase 7 — Chaos Engineering, UAT, and Full Go-Live (Weeks 23–24)

The final phase was dedicated to stress-testing, user acceptance testing, and a carefully sequenced production promotion. Chaos engineering, introduced via the **Gremlin** SaaS platform, targeted failure points across the system: simulated pod kills forced Karpenter to re-provision nodes; simulated RDS failover tested timeout and retry logic across the API layer; and simulated 80% bandwidth throttling tested the circuit breakers end-to-end (an illustrative pod-kill experiment is sketched at the end of this phase). Nineteen individual chaos experiments ran over the final two weeks, with 98% recovering within the defined SLO window. Three experiments surfaced previously undetected failure modes, including a DNS resolution race condition in the event-processing worker and a retry-logic bug that would have cascaded into a full database lock during a concurrent settlement batch run. Each was fixed before go-live.

User acceptance testing (UAT) involved the product and banking operations teams exercising the settlement pipeline end-to-end against a production-identical clone environment (`env=production-clone`), refreshed nightly from the production snapshot, and validating that transactions matched the expected ledger within five minutes of processing — as required by the business SLA.
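FinFlow ran its experiments on Gremlin, a commercial SaaS platform. For teams preferring an in-cluster, declarative approach, a comparable pod-kill experiment can be sketched with the open-source Chaos Mesh operator; this is a swapped-in illustration rather than what the team used, and the namespace and labels are assumptions.

```yaml
# Illustrative pod-kill experiment using Chaos Mesh (not the tool FinFlow used).
# Kills one randomly selected settlement-worker pod in the clone environment,
# letting the team observe rescheduling and recovery behaviour.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: settlement-worker-pod-kill
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: one                        # affect a single matching pod
  selector:
    namespaces:
      - production-clone           # hypothetical clone-environment namespace
    labelSelectors:
      app: settlement-worker       # hypothetical workload label
```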
The go-live itself, executed on a Saturday night at 01:00 when transaction volumes were at ~12% of their weekday peak, proceeded with no incidents across the 17 migrated services. A network-diagnostics and integration-validation framework ran automatically in the five minutes following promotion, verifying that 100% of critical API endpoints were returning 2xx responses, that all database read and write paths were operational, and that all service-to-service mTLS certificates validated correctly. The migration was complete.

---

## Implementation Details

Several technical decisions are worth deeper examination given their outsized impact on the results.

**Terraform + OpenTofu Module Architecture.** The IaC modules were structured using Terragrunt's DRY (Don't Repeat Yourself) pattern, with a centralized module registry in a private Terraform Cloud workspace. Each service team maintained a small `terragrunt.hcl` overlay to promote environments — staging → production — setting only environment-specific overrides while inheriting all base configuration. This dramatically reduced the probability of configuration drift between environments, which had been a recurring source of staging-prod parity bugs in the older system.

**ArgoCD GitOps Pipeline.** The ArgoCD pipeline was configured as the single source of truth for Kubernetes manifests, with each service repository binding to a specific version tag. Release promotion gates required `staging-health-check` to pass (defined by a smoke set of 12 critical-path integration tests), `security-scanning` to pass (a Trivy scan with no CRITICAL or HIGH findings), and a manual approval step for any manifest change affecting the networking or RBAC layers.

**OpenTelemetry Instrumentation.** Service instrumentation adopted a mesh-instrumentation pattern — each service auto-instrumented and propagated traces — with baggage propagation between services. The naming convention for trace attributes (`service.name`, `service.namespace`, `awskms.key.id`, `db.system`, `db.query.text`) followed the OpenTelemetry semantic conventions, enabling the team to query across service boundaries without custom querying logic.

**SLO Alerting with Multi-window Burn-Rate Logic.** Alerting thresholds were calculated from real error budgets in the Grafana SLO dashboard, with PromQL queries comparing observed error rates against the error-budget burn rate. The three burn-rate windows (five-minute, 14-minute, and 30-minute) were designed to catch incidents at different severities: burn above 14.2x over five minutes triggered a P1; burn above 6x over 14 minutes triggered a P2; burn above 3x over 30 minutes triggered a P3. This approach reduced the alert fatigue that had previously led to PagerDuty alert silencing during periods of high noise.

**Spot Instance Pricing Strategy.** Hybrid on-demand/spot fleets were used for stateless services (API, event processing, reporting), with automatic fallback to on-demand capacity if a spot interruption signal was detected within five minutes. Stateful workloads and components relying on ordering guarantees (Aurora PostgreSQL, Amazon SQS consumers) continued to run on dedicated on-demand capacity to avoid zonal loss during spot interruption events.

---

## Results

Six months after completing the Kubernetes migration and cost-optimization initiative, the results met or substantially exceeded the stated business and engineering goals.
### Cost Reduction

Infrastructure costs, at $112,000/month at the start of the project, dropped to **$42,500/month** by January 2025 — a reduction of $69,500/month and **62% below the baseline**, with a further $18,000/month in ongoing waste reduction identified but queued for staging validation. Annualized, this represents **$834,000/year in recurring savings** against the project's original run rate. Roughly $108,000 in tooling and engineering cost was amortized against the right-sizing and autoscaling gains alone, representing a **payback period of roughly 10 weeks** — far ahead of the original conservative estimate of 28 weeks.

### Performance and Reliability

TLR (Throughput at Low RPS) on the settlement pipeline — a critical-path benchmark replicating 500 concurrent settlement-transaction audit requests — climbed from 7 to 19 transactions per second. The p99 transaction latency benchmark improved from **340ms to 138ms**, a **59% improvement**. All SLOs (availability, latency, throughput) were revised and collectively re-baselined at the most stringent tier.

System availability, as measured across the settlement processing pipeline for the full year of 2025, trended from 99.65% at baseline to **99.97% sustained** — a reduction of roughly **28 hours of unplanned outage per year** across the entire platform.

### Developer Experience and Deployment Velocity

Before the transformation, the average deployment took **47 minutes** from start to confirmation, including manual SSH work, database sanity checks, and post-deploy smoke tests. After it, the total deployment cycle — from code merge to observed production deployment — averaged **82 seconds** for services running on the ArgoCD GitOps pipeline. Deployment frequency increased from **2 to 22 deployments per week** per squad, with zero post-deployment hotfixes required in the first eleven weeks post-transformation. The satisfaction score on the engineering team's bi-annual culture survey — which measures team-level confidence in the ability to ship safely — jumped from **2.8/5 to 4.6/5** in the follow-up survey cycle.

### Compliance and Security

The platform cleared its Q1 2025 regulatory audit with zero findings and a **"no open items" letter** from the lead security auditor — the first time the platform had completed an audit cycle without remediation obligations in its six-year history.

---

## Metrics

| Metric | Baseline | Target | Achieved | Δ |
|---|---|---|---|---|
| Monthly Infrastructure Cost | $112K | $72K | $42.5K | −62% |
| Annual Cost Savings | — | — | **$834K** | N/A |
| Payback Period | 28 weeks | <20 weeks | **~10 weeks** | 2.8× faster |
| Avg Deployment Time | 47 min | <2 min | **1.4 min** | 97% faster |
| Deployments / week | 2 | — | **22** | 11× increase |
| Settlement Pipeline p99 Latency | 340ms | <200ms | **138ms** | 59% faster |
| Settlement Throughput (TLR low RPS) | 7/s | — | **19/s** | 171% increase |
| System Availability | 99.65% | 99.90% | **99.97%** | +32 bps |
| Unplanned Outage Hours / year | ~30h | <9h | **~1.2h** | 96% reduction |
| MTTD (P1 incidents) | 18 min | <90 sec | **6 min** | 67% faster |
| MTTR (P1 incidents) | 85 min | <20 min | **12 min** | 86% faster |
| Post-deployment Hotfix Rate | ~8/month | <2/month | **0** (first 11 wks) | 100% elim. |
| Engineering Team DX Score | 2.8/5 | >4/5 | **4.6/5** | 64% improvement |
| Regulatory Audit Findings (open items) | 4+ | 0 | **0** | — |
| False-positive Alert Rate | ~40% | <10% | **~5%** | 88% reduction |

---

## Lessons and Takeaways

The FinFlow transformation succeeded not because any single tool or technology choice was dramatically superior, but because a deliberate, phased, and deeply collaborative process allowed the team to learn iteratively and validate against business context at each checkpoint.

**1. Cost cannot be managed at the resource level without ownership accountability.** The single biggest discovery of Phase 1 was not a technical one — it was organizational. The `finflow-analytics-reader` role, the orphaned ELBs, the idle c5.xlarge instances: they all existed because no single person or team was measuring them on a timeline that mattered. Implementing mandatory monthly cost reviews by squad and a **chargeback/shameback model** — where each engineering team's monthly infrastructure contribution was surfaced at the all-hands — created the accountability necessary to sustain the cost reductions.

**2. Parallel work in phases — not siloed work — accelerated the schedule.** The team deliberately structured Phase 1 (discovery) and Phase 2 (IaC) with overlapping check-ins; the SLO framework defined in Phase 1 directly informed the canary-splitting logic in Phase 3, and the early OpenTelemetry traces from the Phase 1 pilot ran through the Loki logging pipeline designed in Phase 4. Interleaving dependencies and work streams compressed the schedule by approximately four weeks relative to a strict waterfall plan.

**3. The instrumentation was the safety net for the migration.** The decision not to do a hard cut-over — and instead to canary in 5% increments with automated shadow-write validation — was only possible because the observability infrastructure was designed first. The team could not have safely migrated a platform handling $2.1M in transactions per day without being able to trace, measure, and audit every request in real time. The investment in observability before migration paid for itself at least twice over in avoided outages and post-migration incidents.

**4. Developer confidence is a deployment velocity multiplier — and a safety net.** The deployment velocity gains correlated precisely with confidence gains on the engineering culture survey. When engineers no longer feared deployments — because ArgoCD provided rollback visibility, the IaC pipeline ensured parity, and the observability platform surfaced problems immediately — they deployed more, learned faster, and caught problems earlier in the cycle. Cliché as it sounds: **if operations are reliable, engineers operate reliably**.

**5. Regulators reward the right level of detail — and the audit findings improved accordingly.** The regulatory audit that followed the transformation included a reviewer comment noting that the immutability bucket, secrets-manager rotation, and audit-trail pipeline went "far beyond the minimum technical requirements of the relevant framework." Technical work done for compliance did not constrain the engineering roadmap — it accelerated it.

## Conclusion

FinFlow's zero-downtime Kubernetes migration stands as a case study in what is achievable when cloud infrastructure, platform engineering, and developer experience are treated as a single coherent effort — rather than three separate cost centers managed through quarterly reviews.
The 62% cost reduction, 59% latency improvement, and genuine cultural transformation in developer confidence were not the result of a single tool or a single heroic sprint. They were the result of seven sequential phases, each building on the validated foundation of the previous one — and of sound judgment about when to take risks and when to take the safer path.

Finance infrastructure is, by definition, a trust business. The FinFlow platform builds that trust not just through the completeness of its APIs but through the rigor of its architecture — a rigor that, by the end of the project, the engineers rightly celebrated as the product of one of the most satisfying transformations in the company's history.

The complete engineering playbook — spanning IaC module designs, OpenTelemetry instrumentation patterns, SLO alerting specifications, and canary migration scripts — is open-sourced under an MIT license and available at `github.com/finflow/platform-engineering-blueprint`, for any team currently staring down a similar set of challenges.

---

*Webskyne editorial is the in-house tech publication arm of Webskyne, covering platform engineering, infrastructure modernization, engineering culture, and developer experience. This case study is part of the FinFlow Transformation Series.*

Related Posts

From Legacy to Leading: How FinFlow Modernized Its Payment Pipeline and Cut Processing Costs by 47%
Case Study

When FinFlow's 7-year-old monolith began buckling under 3x annual transaction growth, the engineering team faced a stark choice: patch a creaking legacy system or rebuild for the next decade. This is the story of how a targeted migration to event-driven microservices, combined with strategic caching and a rewritten settlement layer, transformed a pain-prone payment pipeline into one of the most reliable platforms in the fintech stack — all while slashing operational costs and delivering metrics worth talking about.

How TechFlow Cut AWS Infrastructure Costs by 62% Without Sacrificing Performance
Case Study

When TechFlow's customer acquisition rate tripled over nine months, their AWS bill ballooned from $12,000/month to $38,000/month — threatening to erode the margins that made their freemium model viable. Led by DevOps architect Maya Patel, a cross-functional team spent 11 weeks auditing, architecting, and implementing a comprehensive cost-optimization strategy. The initiative, code-named Project Echo, reduced monthly cloud spending to $14,400 — a $23,600/month saving — while simultaneously cutting page load time from 2.8s to 1.1s and raising system availability to 99.98%. This case study unpacks every phase of that transformation and the lessons that apply to any fast-growing SaaS team. Any startup founder whose engineers haven't yet given full attention to the cloud bill will find the frameworks here immediately actionable.

Breaking the Scale Ceiling: How We Helped ShopStream Handle 10× Flash-Sale Traffic Without Crashing
Case Study

When a fast-growing D2C brand hit its scaling wall during flash-sale events, we dove into their infrastructure and came back with a blueprint that cut page-load times by 62%, drove conversions up 41%, and turned 30-minute outages into a blip on the radar. Here's the full story — architecture decisions, trade-offs, hard metrics, and everything we'd do differently next time.