Enterprise SaaS Modernization: How APEX Financial Cut Infrastructure Costs by 62% While Scaling to 3M Monthly Users
In early 2024, APEX Financial Services was grappling with a decade-old monolithic infrastructure that creaked under growing user demand. With monthly active users climbing toward 3 million and infrastructure costs consuming over 40% of their technology budget, leadership approached Webskyne with a clear mandate: modernize without disruption, reduce operational overhead, and rebuild confidence in their digital platforms. This 18-month case study chronicles the architectural transformation, the strategic trade-offs made along the way, and the measurable outcomes that ultimately reshaped their engineering culture and business trajectory.
## Overview
In early 2024, APEX Financial Services approached Webskyne with a mission-critical modernization initiative. The company, a mid-sized financial technology provider serving retail and institutional clients, had accumulated significant technical debt across its core transaction processing platform, customer portal, and mobile banking suite. Monolithic services written in legacy frameworks ran on over-provisioned virtual machines with minimal observability, while product teams struggled to ship updates within quarterly release windows.
The scope of our engagement was deliberately ambitious: re-architect the customer-facing portal using modern cloud-native patterns, decouple the transaction pipeline into independently deployable services, and instrument comprehensive monitoring and automated remediation across the stack. We worked closely with APEX's internal engineering team over an 18-month period, embedding senior architects and technical leads into their squads throughout the initiative.
Our partnership ultimately delivered not only the technical outcomes APEX leadership had requested, but also a measurable shift in engineering culture: the adoption of trunk-based development, automated compliance checks, and genuine psychological safety in post-incident reviews that had previously devolved into blame cycles.
## Challenge
APEX Financial's digital platform had grown organically over more than a decade. The customer portal alone had been patched and extended by no fewer than seven engineering teams, each leaving behind its own abstractions, database schemas, and deployment configurations. By early 2024, the system exhibited a pattern common to long-lived monoliths buried under layers of business logic: changes that should have taken days took months, deployments carried material risk of cascading failures, and on-call engineers routinely logged 60+ hour weeks during incident response.
The transaction processing layer was perhaps the most critical pain point. Built on a tightly coupled synchronous communication model, a single slow downstream dependency could freeze the entire pipeline. During peak market hours in March 2024, latency spikes exceeding 12 seconds caused failed trades and generated regulatory scrutiny. Customer complaints about the mobile application had risen 48% quarter-over-quarter, and NPS scores in the digital channel had fallen from 52 to 31.
Infrastructure costs told a similar story. Over-provisioned EC2 instances ran at 18% average CPU utilization, while reserved capacity commitments had been made based on growth projections that never materialized. The total annual cloud expenditure exceeded $4.2 million, with nearly half of that spending attributable to idle or underutilized resources.
## Goals
Our joint team established four measurable strategic goals at the start of the engagement:
1. **Reduce infrastructure costs by at least 50% without sacrificing performance or availability.**
2. **Improve system resilience such that critical user journeys sustain 99.95% availability with sub-200ms p95 latency.**
3. **Accelerate deployment velocity from quarterly releases to multiple deployments per day with automated rollback capability.**
4. **Rebuild engineering team confidence and reduce incident-related work to less than 15% of engineering capacity.**
Each goal was underpinned by an operational definition and a set of leading indicators that would be reviewed in weekly steering sessions with APEX's CTO and VP of Engineering.
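To make the availability goal concrete, the sketch below converts an availability percentage into the downtime it permits over a period. The 30-day window and the function name are illustrative assumptions, not APEX's actual operational definition.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: int = 30) -> float:
    """Minutes of downtime permitted in a period at the given availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    # Figures from this case study: baseline, target, and month-18 result.
    for target in (98.4, 99.95, 99.97):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% availability -> {minutes:.1f} min per 30 days")
```

Under these assumptions, the 99.95% target leaves roughly 21.6 minutes of downtime per 30 days, against about 691 minutes at the 98.4% baseline.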
## Approach
Our methodology drew heavily on the strangler fig pattern, a discipline borrowed from legacy architecture migration that favors incremental replacement over big-bang rewrites. We began with a one-week architecture audit that mapped every service dependency, data flow, and operational touchpoint across the monolithic estate. The audit produced a dependency graph covering 94 distinct endpoints, 23 database tables with cross-service foreign key relationships, and a risk heat map that guided our migration sequencing.
Sequencing was deliberately conservative: we targeted low-risk, high-value services first. The customer notification subsystem, responsible for email, SMS, and in-app messaging, was our initial candidate. It was well-bounded with clear API boundaries, had minimal data dependencies, and represented approximately 15% of support incident volume. Success here would build team confidence and establish proven deployment patterns before tackling higher-risk domains.
We also invested heavily in platform engineering. A new internal developer platform was constructed using Infrastructure as Code, providing standardized Terraform modules, pre-configured CI/CD pipelines, and a service catalog that abstracted the underlying cloud infrastructure. Every extracted service would be provisioned through this platform, ensuring that the architectural improvements could be sustained by APEX's team long after Webskyne's engagement concluded.
## Implementation
The implementation phase unfolded across six primary workstreams, each with its own milestones and independently verifiable deliverables.
### Workstream 1: Service Extraction
Using the strangler fig pattern, our first extraction targeted the customer notification service. We introduced an anti-corruption layer using asynchronous message queues to isolate the new service from the monolith's transaction state. Rather than directly querying monolith databases, the extracted service consumed events published through an internal Kafka topic, decoupling uptime requirements between the two systems.
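The translation step of an anti-corruption layer can be sketched as follows. This is a minimal illustration only: the legacy field names (`CUST_ID`, `NOTIF_TYPE`, `TMPL`), the channel codes, and the `NotificationRequest` type are hypothetical, and the Kafka consumer that would deliver the raw bytes is deliberately left out.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class NotificationRequest:
    """The new service's own model, free of monolith naming and codes."""
    customer_id: str
    channel: str  # "email", "sms", or "in_app"
    template: str


# Hypothetical mapping from legacy channel codes to the new vocabulary.
_LEGACY_CHANNELS = {"E": "email", "S": "sms", "A": "in_app"}


def translate(raw_event: bytes) -> NotificationRequest:
    """Convert one monolith event payload into the new service's model."""
    payload = json.loads(raw_event)
    legacy_code = payload["NOTIF_TYPE"]
    if legacy_code not in _LEGACY_CHANNELS:
        # Reject unknown codes at the boundary instead of leaking them inward.
        raise ValueError(f"unknown legacy channel code: {legacy_code!r}")
    return NotificationRequest(
        customer_id=str(payload["CUST_ID"]),
        channel=_LEGACY_CHANNELS[legacy_code],
        template=payload["TMPL"],
    )
```

Keeping the mapping in one function means that a change to the monolith's event shape touches only this boundary, not the service's internal logic.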
Database ownership was carved out carefully: new tables were created in dedicated schemas with clear ownership annotations in our data catalog. Read-only replication from monolith tables persisted only as long as necessary, with a formal sunset date agreed upon by both teams. This prevented the familiar pitfall of creating "permanent" bridges that preserved coupling indefinitely.
### Workstream 2: API Gateway and Front Door
An API Gateway was positioned in front of all customer-facing services, handling authentication, rate limiting, request routing, and anomaly detection. This not only improved security posture but also provided a single telemetry correlation point for debugging and performance analysis.
Route 53 health checks and WAF rules were configured to immediately isolate degraded services from the critical customer path. Circuit breaker configurations were tuned to fail fast on downstream dependency timeouts, returning graceful application-level responses rather than holding requests in poorly managed connection pools.
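The fail-fast behavior described above can be illustrated with a small circuit breaker. This is a sketch under stated assumptions, not the configuration used at APEX: the thresholds, the class name, and the injectable `clock` parameter are all hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Opens after repeated failures and serves a fallback while open."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T], fallback: T) -> T:
        # While open, skip the dependency entirely until the cool-down elapses.
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                return fallback
            self._opened_at = None  # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            return fallback
        self._failures = 0  # a success closes the circuit again
        return result
```

A caller would wrap each downstream request, for example `breaker.call(fetch_quote, fallback=cached_quote)`, so that a slow dependency yields a degraded response rather than a held connection.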
### Workstream 3: Observability and Monitoring
Before extracting services, we instrumented the monolith itself with OpenTelemetry distributed tracing. This established a baseline: we could compare p95 latency, error rates, and system throughput for critical user journeys before and after extraction. The instrumentation also exposed invisible performance bottlenecks, particularly around database connection pool saturation during peak load windows, that had previously gone undetected due to insufficient logging granularity.
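For readers less familiar with the metric, a p95 comparison of this kind can be sketched with the nearest-rank method. The function name and the sample values are illustrative; tracing backends typically compute percentiles from histograms and may interpolate differently.

```python
from typing import Iterable


def p95(samples_ms: Iterable[float]) -> float:
    """Nearest-rank 95th percentile of latency samples, in milliseconds."""
    ordered = sorted(samples_ms)
    if not ordered:
        raise ValueError("at least one sample is required")
    # Integer ceiling of 0.95 * n gives the 1-indexed nearest rank.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


if __name__ == "__main__":
    before = [120, 130, 150, 900, 12000]  # made-up samples, not APEX data
    after = [60, 65, 70, 75, 80]
    print(f"p95 before: {p95(before)} ms, after: {p95(after)} ms")
```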
Prometheus metrics, structured JSON logging, and Grafana dashboards were standardized across all new services. On-call runbooks replaced tribal knowledge, and PagerDuty routing policies were refined to respect working hours limitations that had contributed to engineer burnout.
### Workstream 4: Data Layer Modernization
The transaction processing database represented the highest-risk data migration. Rather than attempting a single data store replacement, we introduced a change data capture pipeline using AWS DMS, streaming transactional changes to a read replica in near-real-time. Application workloads were gradually migrated to query the replica, then progressively shifted to new database instances provisioned through the internal developer platform.
We introduced database connection pooling using PgBouncer for all new services, carefully tuning minimum and maximum pool sizes to reflect actual workload profiles observed during two-week load testing campaigns.
### Workstream 5: CI/CD and Deployment Automation
Our platform engineering team constructed reusable GitHub Actions pipelines that incorporated automated linting, security scanning with Snyk, integration tests against ephemeral environments, and progressive delivery using feature flags controlled through LaunchDarkly. Every code change triggered a pipeline that could graduate from dev to staging to production within hours rather than months.
The deployment strategy for extracted services leveraged blue-green deployments with automated health verification, enabling zero-downtime releases. Rollback automation meant that any deployment raising error rates above baseline could be reversed within five minutes without manual intervention.
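The rollback trigger reduces to a comparison between the new release's error rate and the pre-deployment baseline. The sketch below shows one way to express that decision; the tolerance and minimum-traffic values are hypothetical defaults, not APEX's production thresholds.

```python
def should_roll_back(baseline_error_rate: float, errors: int, requests: int,
                     tolerance: float = 0.005, min_requests: int = 200) -> bool:
    """Return True when the new release's error rate exceeds baseline + tolerance."""
    if requests < min_requests:
        # Too little traffic to judge reliably; keep observing.
        return False
    return errors / requests > baseline_error_rate + tolerance
```

The minimum-traffic guard is a design choice worth keeping in any variant: without it, a single early error on a freshly deployed service could trigger a spurious rollback.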
### Workstream 6: Cost Optimization
We implemented a three-phase cost optimization strategy. The first phase consolidated idle instances using right-sizing recommendations derived from two weeks of CloudWatch metric analysis. The second phase introduced Savings Plans and Reserved Instance commitments tailored to observed baseline workloads. The third phase established FinOps guardrails in the platform catalog: developers previewing infrastructure cost estimates before deployment, automated budget alerts, and monthly cost review meetings with engineering leadership.
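The first phase rests on simple arithmetic: scale capacity so that observed utilization lands near a chosen target. The sketch below is illustrative only; the 60% target and the function name are assumptions, and a real analysis would also weigh memory, network, and burst behavior rather than CPU alone.

```python
import math


def recommended_vcpus(current_vcpus: int, peak_cpu_pct: float,
                      target_cpu_pct: float = 60.0) -> int:
    """vCPUs needed so that the observed peak lands at the target utilization."""
    if not 0 < target_cpu_pct <= 100:
        raise ValueError("target_cpu_pct must be in (0, 100]")
    needed = current_vcpus * peak_cpu_pct / target_cpu_pct
    return max(1, math.ceil(needed))  # never recommend fewer than one vCPU


if __name__ == "__main__":
    # 18% is the fleet-wide average quoted earlier, used here only as an
    # illustration; sizing against peak utilization would be more prudent.
    print(recommended_vcpus(current_vcpus=16, peak_cpu_pct=18.0))
```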
## Results
The results from our 18-month collaboration exceeded the initial targets in every measurable dimension.
**Infrastructure costs dropped 62%**, falling from a projected annual spend of $4.2 million to $1.6 million, without sacrificing compute capacity or workload performance. Right-sizing alone eliminated $820,000 in annualized wasted spend. Savings Plans commitments captured an additional $390,000 in savings, while the migration to containerized workloads with auto-scaling eliminated another $240,000 in idle capacity charges.
**System availability reached 99.97%**, with p95 latency for critical user journeys measured at 142 milliseconds during the final quarter of the engagement. Transaction processing latency, which had peaked at over 12 seconds during March 2024 market disruptions, stabilized at 78 milliseconds p95, a reduction of more than 99%. Customer complaints regarding the mobile application decreased by 73% over the engagement period, and NPS scores recovered from a low of 31 to 67 by month 18.
**Deployment velocity transformed entirely.** At project inception, APEX engineers made approximately four production deployments per quarter, each accompanied by extensive manual regression testing and weekend-long release windows. By month 18, the team was averaging 47 pushes to production per week, with automated canary rollouts catching two genuinely buggy releases before they reached more than 5% of the user base. Mean time to detect regressions fell from 48 hours to under three minutes.
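Limiting a release to a fixed share of users is commonly done by hashing a stable identifier into buckets. The sketch below shows that general technique with the standard library; it is not LaunchDarkly's algorithm, and the function and parameter names are hypothetical.

```python
import hashlib


def in_canary(user_id: str, release: str, percent: int) -> bool:
    """Deterministically place a user in the first `percent` of 100 buckets."""
    # Salting with the release name reshuffles users between rollouts.
    digest = hashlib.sha256(f"{release}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    exposed = sum(in_canary(u, "release-a", 5) for u in users)
    print(f"{exposed / len(users):.1%} of users see the canary")
```

Because the assignment depends only on the identifier and the release name, a given user sees a consistent experience across requests for the duration of the rollout.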
Incident-related work, which had consumed nearly 35% of engineering capacity at the start of the engagement, stabilized below 12% by the final quarter, well within the original 15% target. On-call engineers reported average on-call hours dropping below eight per week, and the team's own internal measurement of "good weeks" (weeks where no severity-one incidents occurred) improved from 38% to 91%.
## Key Metrics
| Metric | Baseline (Feb 2024) | Target | Month 18 Result | Change |
|---|---|---|---|---|
| Annual Infrastructure Cost | $4.2M | <$2.1M | $1.6M | -62% |
| System Availability | 98.4% | 99.95% | 99.97% | +1.57pp |
| p95 Transaction Latency | 12.4s | <200ms | 78ms | -99.4% |
| Deployments Per Week | 1.0 | 10+ | 47.0 | +4600% |
| Incident-Related Work | 35% of cap | <15% | 12% | -66% |
| Customer NPS | 31 | 50+ | 67 | +36 pts |
| On-Call Hours (avg/week) | 22 | <15 | 7.8 | -65% |
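The Change column for the ratio-based rows follows from the standard relative-change formula, which can be checked with a few lines. The row labels and the conversion of 12.4 s to 12,400 ms are the only additions here; the availability and NPS rows report absolute differences and are therefore left out.

```python
def pct_change(baseline: float, result: float) -> float:
    """Relative change from baseline to result, as a percentage."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (result - baseline) / baseline * 100


if __name__ == "__main__":
    rows = {
        "Annual Infrastructure Cost ($M)": (4.2, 1.6),
        "p95 Transaction Latency (ms)": (12_400, 78),
        "Deployments Per Week": (1.0, 47.0),
        "Incident-Related Work (% of capacity)": (35, 12),
        "On-Call Hours (avg/week)": (22, 7.8),
    }
    for name, (baseline, result) in rows.items():
        print(f"{name}: {pct_change(baseline, result):+.1f}%")
```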
## Lessons Learned
Every significant modernization initiative carries lessons that extend beyond its immediate technical scope. After 18 months of close collaboration with APEX's engineering organization, several insights stand out as particularly valuable for teams confronting similar challenges.
**Incremental migration beats big-bang replacement.** The strangler fig pattern, combined with a disciplined investment in the anti-corruption layer, preserved system stability throughout the engagement. Had we attempted a simultaneous rewrite of the transaction pipeline and portal, the business would almost certainly have experienced material disruption during peak usage periods, and the CTO's sponsorship would have evaporated.
**Observability is a prerequisite, not a byproduct.** Instrumenting the monolith before beginning extraction gave us the baselines needed to prove improvements objectively. Without that visibility, we would have been operating on vibes and anecdotes, and APEX's finance team would not have had the data to justify the continued investment.
**Platform engineering compounds over time.** The internal developer platform we constructed amortized its cost quickly: teams deploying new services through it experienced a 70% reduction in time-to-provision compared to manual infrastructure setup. The platform approach also freed senior engineers from repetitive configuration work, redirecting their attention toward higher-impact product engineering.
**Cultural change requires intentional investment.** Technical success was only one pillar of the engagement. The shift from quarterly waterfall releases to continuous delivery demanded changes in approval workflows, compliance documentation, and psychological safety during incident reviews. We embedded a dedicated organizational effectiveness consultant from Webskyne's team alongside our technical staff, ensuring that people and process received the same rigorous attention we paid to architecture.
**Cost optimization is a continuous practice, not a one-time event.** The infrastructure savings realized in month 18 would have eroded within quarters without the FinOps guardrails embedded into the platform. Treating cloud spend as a shared engineering responsibility, not a finance team's problem, creates lasting accountability.
## Conclusion
APEX Financial's modernization journey demonstrates that large-scale architectural transformation is achievable without disruption to the business it serves. The combination of incremental migration, deep investment in observability and platform engineering, and genuine prioritization of engineering culture produced outcomes that exceeded every pre-defined target while rebuilding the trust and confidence of the engineering team itself.
For technology leaders facing similar modernization challenges, the APEX case offers a durable lesson: the most important architectural decision you will make is not the choice of framework or the cloud provider, but the decision to approach change with patience, measurement, and respect for the systems and people already in place.
---
_This case study was authored by Webskyne editorial in collaboration with the APEX Financial digital transformation team. Total engagement duration: 18 months. Infrastructure environment: Amazon Web Services (AWS) with supplemental Azure services for data analytics workloads._