21 June 2026 · 11 min read

From Legacy Monolith to Serverless: How PayStream Cut Infrastructure Costs by 60% and Doubled Deployment Frequency

PayStream, a fast-growing payments platform, was crippled by a legacy monolith that required full-team deployments every two weeks and cost over $48,000 monthly in idle infrastructure. This case study documents the six-month migration to a serverless, event-driven architecture using AWS Lambda, API Gateway, and DynamoDB — a transformation that reduced operational costs by 62 percent, increased deployment frequency from biweekly to multiple times per day, and brought p99 latency under 120 milliseconds. We break down the phased migration strategy, the team’s battle with database refactoring, the observability overhaul, and the hard-won lessons about incremental modernization in highly regulated fintech environments.

Tags: Case Study, serverless, AWS Lambda, microservices, fintech, PCI-DSS, cloud migration, event-driven architecture, digital transformation
In early 2025, PayStream’s engineering leadership faced a familiar but increasingly dangerous reality: their core payments-processing monolith had become the single greatest constraint on growth. What started in 2019 as a well-architected Node.js service had accreted five years of feature work — settlement logic, fraud checks, compliance reporting, and customer dashboards — all tangled into a single deployable unit. Full regression tests took four hours. Production incidents required coordinating across three time zones. And the monthly AWS bill for provisioned EC2 instances that sat idle 80 percent of the time had climbed to nearly $48,000.

This case study reconstructs how PayStream’s CTO, infrastructure lead, and platform engineering team executed a high-stakes migration to serverless without disrupting real-time payment flows or violating PCI-DSS scope. It is a story about risk-managed incrementalism, the obsession with backward compatibility, and the operational discipline required to modernize a system that processes over $2.3 billion in annual transaction volume.

Company and Product Overview

PayStream provides embedded payment infrastructure for mid-market B2B software companies. Its APIs allow SaaS platforms to accept payments, reconcile settlements, and manage refunds without building their own financial rails. The product sits in a demanding slice of the stack: latency-sensitive, compliance-heavy, and unforgiving of data inconsistencies. A single delayed webhook or mismatched settlement record can break a customer’s reconciliation workflow and trigger costly manual audits.

Before the migration, the monolith handled everything: REST API routing, database transactions, fraud scoring via a third-party ML service, real-time webhook dispatching, and nightly batch jobs for settlement reporting. The team had attempted partial microservices extraction twice before — once in 2022 and again in 2023 — but both efforts stalled because the shared PostgreSQL database became a distributed monolith, and deployment coordination proved more painful than the original problem.

The Challenge

PayStream’s challenges were structural, political, and technical simultaneously:

1. Deployment velocity had hit a wall. The monolith required a 45-minute CI/CD pipeline, mandatory load testing, and a cross-functional change-advisory board for every release. In practice, this meant only urgent bug fixes shipped between scheduled biweekly deploys. Feature teams spent 30–40 percent of their time on merge-conflict resolution and environment synchronization.

2. Infrastructural waste. Peak traffic occurred during business hours in three major time zones, roughly 6:00 AM to 10:00 PM UTC. Yet the monolith ran on 24/7 auto-scaling groups with a minimum of 16 EC2 instances to survive Black Friday spikes that never materialized at those scales. The team estimated wasted capacity at 78 percent annually.

3. Observability gaps. When an incident occurred, the on-call engineer had to sift through CloudWatch dashboards, custom log aggregators, and three different APM tools to reconstruct a request path. Mean time to resolution (MTTR) for severity-1 incidents was 47 minutes — far above the 15-minute target the support team had promised customers.

4. Compliance friction. Because payment data touched every module in the monolith, any code change — including a frontend text update — required the entire application to be re-scoped under PCI-DSS. This added two weeks of audit wall time to every release and made the vendor-certification process prohibitively expensive.

5. Talent retention. Senior engineers who had joined to work on modern cloud infrastructure found themselves tuning the Node.js runtime and shipping hotfixes for legacy code they did not own. Three principal engineers left in 2024, citing "architectural stagnation" in exit interviews.

Goals and Success Criteria

The executive team approved a six-month modernization program with four quantifiable targets:

  1. Reduce monthly infrastructure cost by ≥ 50 percent without sacrificing performance or availability.
  2. Deploy to production ≥ 10 times per day with less than 1 percent rollback rate.
  3. Reduce MTTR to ≤ 15 minutes for P1/P2 incidents.
  4. Shrink PCI-DSS scope by ≥ 40 percent so that non-payment modules could ship without quarterly audit cycles.

A fifth, non-functional goal was equally important: zero planned downtime during the migration. PayStream’s senior leadership had witnessed earlier migrations fragment customer data, and they insisted that the cutover strategy treat continuity as the primary constraint, not speed.

Approach: Strangler Fig with Event-Driven Backbone

Rather than attempting a big-bang rewrite, the team adopted the Strangler Fig pattern — gradually wrapping the monolith with new services that intercepted traffic, duplicated functionality, and eventually replaced legacy endpoints. To avoid the distributed-monolith trap that had derailed prior attempts, they introduced an event backbone using Apache Kafka, which decoupled services from direct database dependencies.
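To make the decoupling concrete, here is a minimal in-memory sketch of the publish/subscribe idea behind an event backbone. It is not Kafka and not PayStream's code: the `EventBus` class, the topic names, and the event fields are all hypothetical, chosen only to show that a producer publishes a fact once and never needs to know which services consume it.

```python
from collections import defaultdict


class EventBus:
    """Toy stand-in for an event backbone (assumption: in-process, no persistence)."""

    def __init__(self):
        self._subscribers = defaultdict(list)
        self.log = []  # append-only record of everything published

    def subscribe(self, topic, callback):
        """Register a consumer for a topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, event):
        """Record the event, then hand it to every consumer of that topic."""
        self.log.append((topic, event))
        for callback in self._subscribers[topic]:
            callback(event)


bus = EventBus()
webhook_outbox = []
settlement_totals = []

# Two independent consumers react to the same event without sharing a database.
bus.subscribe("payment.captured", webhook_outbox.append)
bus.subscribe("payment.captured", lambda e: settlement_totals.append(e["amount_cents"]))

bus.publish("payment.captured", {"id": "p1", "amount_cents": 500})
```

Adding a third consumer later means one more `subscribe` call and no change to the producer, which is the property that a shared database tends to take away.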

The migration was sequenced in three phases:

Phase 1: Foundation and Perimeter Services (Months 1–2)

The team began by extracting the simplest, highest-value perimeter endpoints: health checks, rate limiting, and API-key validation. These were migrated to Amazon API Gateway with Lambda authorizers, cutting monolith load by roughly 15 percent. In parallel, they established the event backbone, deployed the Kafka cluster on Amazon MSK, and built a new observability stack on OpenTelemetry, Grafana, and PagerDuty.
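As an illustration of API-key validation at the perimeter, the sketch below shows the general shape of a token-type Lambda authorizer: it receives the caller's token and the method ARN, and returns an IAM policy that allows or denies `execute-api:Invoke`. The key store, the demo key, and the customer identifier are hypothetical placeholders, and the article does not describe how PayStream actually stored or checked keys.

```python
import hashlib
import hmac

# Hypothetical store of SHA-256 digests of valid API keys (illustration only;
# a real system would look these up in a database or secrets store).
VALID_KEY_DIGESTS = {
    hashlib.sha256(b"demo-key-123").hexdigest(): "customer-42",
}


def _policy(principal_id, effect, resource):
    """Build the policy document API Gateway expects back from an authorizer."""
    return {
        "principalId": principal_id,
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {"Action": "execute-api:Invoke", "Effect": effect, "Resource": resource}
            ],
        },
    }


def handler(event, context=None):
    """Allow the call only when the presented token matches a known API key."""
    token = event.get("authorizationToken", "")
    digest = hashlib.sha256(token.encode()).hexdigest()
    for known_digest, customer_id in VALID_KEY_DIGESTS.items():
        # Constant-time comparison avoids leaking digest prefixes via timing.
        if hmac.compare_digest(digest, known_digest):
            return _policy(customer_id, "Allow", event["methodArn"])
    return _policy("anonymous", "Deny", event["methodArn"])
```

Because the check runs before a request reaches any backend, rejected calls never touch the monolith, which is consistent with the load reduction described above.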

This phase also included a critical strategic decision: rather than replatforming incrementally into Kubernetes, the team committed fully to serverless containers (AWS Fargate) for anything requiring long-running processes, and Lambda for request-driven workloads. The rationale was simple — with a small platform team of four engineers, managing a control plane was a distraction. Serverless traded operational complexity for vendor lock-in, and the team accepted that trade-off explicitly.

Phase 2: Core Payment Flow Extraction (Months 3–4)

The payment-intent and webhook-dispatch services were the highest-risk migrations because they touched PCI-sensitive data. The team used dual-write patterns: for two weeks, every new payment event was written to both the monolith PostgreSQL database and a new DynamoDB table. They then ran read-replica comparisons nightly to validate consistency. Once they were satisfied with data parity, they switched the read path to DynamoDB and kept the monolith as a synchronous fallback — meaning if anything went wrong, traffic would revert automatically.
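The dual-write, compare, then switch-reads sequence can be sketched with two dictionaries standing in for the PostgreSQL database and the DynamoDB table. Everything here is an assumption made for illustration (the `DualWriteStore` class, its method names, the record shape); a real implementation would also have to handle a write that succeeds in one store and fails in the other.

```python
class DualWriteStore:
    """Writes go to both stores; reads prefer the new store and fall back to legacy."""

    def __init__(self):
        self.legacy = {}            # stand-in for the monolith's PostgreSQL database
        self.new = {}               # stand-in for the new DynamoDB table
        self.read_from_new = False  # flipped once parity has been confirmed

    def write(self, payment_id, record):
        # During the dual-write window every event lands in both stores.
        self.legacy[payment_id] = dict(record)
        self.new[payment_id] = dict(record)

    def read(self, payment_id):
        if self.read_from_new:
            record = self.new.get(payment_id)
            if record is not None:
                return record
        # Fallback path: serve the record from the legacy store.
        return self.legacy.get(payment_id)

    def parity_report(self):
        """Nightly-style comparison: ids missing from the new store or differing."""
        missing = sorted(set(self.legacy) - set(self.new))
        mismatched = sorted(
            pid for pid in set(self.legacy) & set(self.new)
            if self.legacy[pid] != self.new[pid]
        )
        return {"missing_in_new": missing, "mismatched": mismatched}
```

An empty report on consecutive nights is the signal to set `read_from_new`; until then the legacy store remains the source of truth.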

Fraud scoring, which called an external ML endpoint, was extracted into a Lambda function behind an API Gateway endpoint. Because fraud checks already operated asynchronously in the monolith (via a message queue), the migration required no coordination with customer-facing flows. The team added dead-letter queues and idempotency keys to handle duplicate processing gracefully.
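A dead-letter queue is easiest to see in a small simulation: a message that keeps failing is retried a bounded number of times and then parked for inspection instead of blocking the queue. The `drain` function, the message shape, and the limit of three attempts are assumptions for this sketch; in AWS this behavior is normally configured on the queue or event source rather than hand-written.

```python
from collections import deque


def drain(queue, dead_letters, handler, max_attempts=3):
    """Process messages; after max_attempts failures, move the message to the DLQ."""
    while queue:
        message = queue.popleft()
        try:
            handler(message["body"])
        except Exception as exc:
            message["attempts"] += 1
            if message["attempts"] >= max_attempts:
                # Park the poisoned message with its last error for later analysis.
                dead_letters.append({**message, "error": str(exc)})
            else:
                queue.append(message)  # put it back for another attempt


scored = []


def score(body):
    """Hypothetical fraud-scoring call that always fails for one payload."""
    if body == "bad":
        raise ValueError("scoring failed")
    scored.append(body)


queue = deque(
    {"body": body, "attempts": 0} for body in ("a", "bad", "b")
)
dead_letters = []
drain(queue, dead_letters, score)
```

Healthy messages are processed regardless of the failing one, and the failing one ends up in `dead_letters` with its attempt count and error text.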

Phase 3: Batch Jobs, Reporting, and Sunset (Months 5–6)

Nightly settlement reports and compliance exports were migrated to AWS Step Functions with S3-based intermediate storage. The monolith was repurposed into a read-only reporting engine until all customers had migrated to the new API versions, after which it was decommissioned entirely.
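For readers unfamiliar with Step Functions, a nightly batch job of this kind is described as a state machine in Amazon States Language. The sketch below builds one such definition as a Python dictionary and serializes it to JSON. The state names, the Lambda ARNs, and the retry settings are hypothetical; the article does not publish PayStream's actual workflow.

```python
import json

# Hypothetical three-step workflow; each task is assumed to write its output
# to S3 and pass only the object key to the next state.
definition = {
    "Comment": "Nightly settlement report (illustrative sketch)",
    "StartAt": "ExtractSettlements",
    "States": {
        "ExtractSettlements": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:extract-settlements",
            "Retry": [
                {
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 30,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }
            ],
            "Next": "BuildReport",
        },
        "BuildReport": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:build-report",
            "Next": "PublishExport",
        },
        "PublishExport": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:publish-export",
            "End": True,
        },
    },
}


def dangling_transitions(defn):
    """Return any StartAt/Next targets that do not name a defined state."""
    states = defn["States"]
    targets = [defn["StartAt"]] + [s["Next"] for s in states.values() if "Next" in s]
    return sorted(t for t in targets if t not in states)


definition_json = json.dumps(definition, indent=2)
```

A check like `dangling_transitions` is a cheap guard to run in CI before deploying a definition, since a misspelled `Next` target is an easy mistake to make by hand.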

A dark-launching strategy ensured that customer API keys could route to either the monolith or the new services based on feature flags. The product team gradually shifted traffic in 5 percent increments, monitoring error rates and latency p99s at each step. When the monolith handled less than 2 percent of traffic for two consecutive weeks, the team ran a final cutover to 100 percent new-services routing and archived the monolith codebase.
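One common way to implement percentage-based routing per API key is to hash the key into a stable bucket and compare the bucket with the current rollout percentage. This is a sketch of that general technique, not PayStream's feature-flag system; the function names and the return labels are hypothetical.

```python
import hashlib


def bucket(api_key):
    """Map an API key to a stable bucket in the range 0 to 99."""
    digest = hashlib.sha256(api_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route(api_key, new_stack_percent):
    """Route to the new services when the key's bucket is below the rollout percentage."""
    return "new-services" if bucket(api_key) < new_stack_percent else "monolith"


# Stepping the rollout in 5 percent increments only ever moves keys forward.
keys = [f"key-{i}" for i in range(1000)]
on_new_at_5 = {k for k in keys if route(k, 5) == "new-services"}
on_new_at_10 = {k for k in keys if route(k, 10) == "new-services"}
```

Because the bucket depends only on the key, a customer who has moved to the new stack stays there as the percentage rises, and setting the percentage back to zero returns every key to the monolith.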

Implementation Details

Several implementation choices deserve special attention because they separated success from failure:

Idempotency everywhere. PayStream’s payment domain cannot tolerate duplicates. The team built idempotency keys into every Lambda handler, stored checkpoints in DynamoDB with conditional writes, and added replay protection at the API Gateway layer. This turned what could have been a devastating retry hazard — inherent in serverless architectures — into a non-issue.
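The conditional-write idea can be shown with an in-memory table that supports put-if-absent, which is the behavior a DynamoDB `PutItem` gives when called with a condition expression such as `attribute_not_exists(...)` on the key attribute. The classes and the `charge` function below are hypothetical stand-ins, not PayStream's handlers.

```python
class ConditionalCheckFailed(Exception):
    """Raised when a put-if-absent finds the key already present."""


class IdempotencyTable:
    """In-memory stand-in for a table that supports conditional writes."""

    def __init__(self):
        self._items = {}

    def put_if_absent(self, key, value):
        if key in self._items:
            raise ConditionalCheckFailed(key)
        self._items[key] = value

    def get(self, key):
        return self._items[key]


def charge(table, idempotency_key, amount_cents, ledger):
    """Apply a charge once per idempotency key; replays return the stored result."""
    result = {"status": "captured", "amount_cents": amount_cents}
    try:
        table.put_if_absent(idempotency_key, result)
    except ConditionalCheckFailed:
        return table.get(idempotency_key)  # duplicate delivery: no second charge
    ledger.append(amount_cents)  # side effect runs only for the first delivery
    return result


table = IdempotencyTable()
ledger = []
first = charge(table, "req-1", 500, ledger)
replay = charge(table, "req-1", 500, ledger)
```

The sketch claims the key before performing the side effect, so a crash between the two steps would leave a claimed key with no charge. A production design needs an explicit in-progress state or a transactional write to cover that gap.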

Infrastructure as code with strict boundaries. Using Terraform modules, each service owned its own infrastructure definition. The platform team maintained shared modules for networking, observability, and security, but application teams could not modify cross-cutting concerns. This prevented the "infrastructure drift" that had caused outages in earlier experiments.

Database refactoring with the expand-contract pattern. Rather than trying to split the monolith database into perfectly normalized micro-databases, the team let data duplication exist temporarily. Eventual consistency between the monolith, DynamoDB, and a new read-model in ScyllaDB was acceptable for 90 days. After validation windows closed, they applied schema contracts to prevent legacy reads.

Cost governance baked in. The finance team was integrated into every architectural review. A simple tagging policy tied every Lambda invocation and DynamoDB table to a cost center, and the team set budget alerts at 80 percent of projected spend. This prevented the common serverless pitfall of runaway costs from poorly written functions.
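The tagging and alerting policy amounts to two small computations: roll spend up by cost-center tag, then flag any cost center at or above 80 percent of its budget. The sketch below assumes a simplified line-item format and a `cost-center` tag name; both are illustrative, and the figures are made up for the example.

```python
from collections import defaultdict


def spend_by_cost_center(line_items):
    """Sum spend per cost-center tag; untagged resources are grouped together."""
    totals = defaultdict(float)
    for item in line_items:
        center = item["tags"].get("cost-center", "untagged")
        totals[center] += item["usd"]
    return dict(totals)


def over_threshold(totals, budgets, threshold=0.80):
    """Return cost centers whose spend has reached the alert threshold."""
    return sorted(
        center
        for center, spent in totals.items()
        if center in budgets and spent >= threshold * budgets[center]
    )


# Made-up line items for illustration only.
line_items = [
    {"usd": 300.0, "tags": {"cost-center": "payments"}},
    {"usd": 520.0, "tags": {"cost-center": "payments"}},
    {"usd": 100.0, "tags": {"cost-center": "reporting"}},
    {"usd": 50.0, "tags": {}},
]
budgets = {"payments": 1000.0, "reporting": 500.0}

totals = spend_by_cost_center(line_items)
alerts = over_threshold(totals, budgets)
```

Surfacing the `untagged` bucket is as useful as the alerts themselves, since spend that cannot be attributed to a team is the spend nobody optimizes.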

Results

The migration delivered measurable business value within the first 30 days after full cutover:

Infrastructure costs dropped from an average of $48,100 per month to $18,200 — a 62 percent reduction. The savings came primarily from eliminating idle EC2 capacity, paying only for Lambda execution milliseconds, and consolidating three separate monitoring tools into a single Grafana stack. The CTO redirected 40 percent of the savings into hiring two senior platform engineers to maintain the new architecture.

Deployment frequency jumped from two per sprint (roughly every two weeks) to an average of 14 per day. Rollbacks, which had previously required a full redeployment and DB migration review, became a feature-flag toggle: 90 percent of rollbacks completed in under 90 seconds. The engineering team estimated recovering roughly 120 engineer-hours per week that had been spent on release coordination.

Customer-facing reliability improved noticeably. p99 API latency fell from 420 milliseconds to 118 milliseconds. P1 incident MTTR decreased from 47 minutes to 11 minutes, driven by distributed tracing that showed exactly which Lambda function and database query had failed. The support team began resolving 80 percent of incidents before the engineering team was even paged.

Perhaps the most underappreciated win was the shrinkage of PCI-DSS scope. By isolating cardholder-data processing into three audited Lambda functions and a single DynamoDB table, the security team reduced the number of systems in scope from 47 to 19. The annual audit cycle accelerated from 10 weeks to 5 weeks, and external auditor fees dropped by 30 percent.

Key Metrics

  • Infrastructure cost reduction: 62 percent ($48,100 → $18,200 monthly)
  • Deployment frequency increase: 2 per sprint → 14 per day (roughly 70×, assuming ten working days per sprint)
  • Mean time to recovery (MTTR) for P1/P2: 47 minutes → 11 minutes (77 percent reduction)
  • P99 API latency: 420 ms → 118 ms (72 percent improvement)
  • PCI-DSS scope reduction: 47 systems → 19 systems (60 percent reduction)
  • Audit cycle duration: 10 weeks → 5 weeks (50 percent reduction)
  • Engineer hours recovered per week: ~120 hours shifted to feature work
  • Rollback time: 45 minutes → under 90 seconds (97 percent reduction)
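The percentage reductions in this list can be re-derived from the before and after figures reported in the Results section. The short script below does that arithmetic; the dictionary labels are ours, and the rollback figure uses 90 seconds as the upper bound, so its reduction is a minimum.

```python
def pct_reduction(before, after):
    """Percentage reduction from before to after, rounded to a whole percent."""
    return round((before - after) / before * 100)


# (before, after) pairs taken from the figures reported in this case study.
metrics = {
    "infrastructure cost (USD/month)": (48_100, 18_200),
    "MTTR (minutes)": (47, 11),
    "p99 API latency (ms)": (420, 118),
    "PCI-DSS systems in scope": (47, 19),
    "audit cycle (weeks)": (10, 5),
    "rollback time (seconds)": (45 * 60, 90),
}

reductions = {name: pct_reduction(b, a) for name, (b, a) in metrics.items()}

for name, value in reductions.items():
    print(f"{name}: {value}% reduction")
```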

Lessons Learned

Six months post-migration, the PayStream team identified seven lessons that reshaped how they think about modernization:

1. Strangle, do not rewrite. The most dangerous word in engineering is "rewrite." By strangling the monolith incrementally, the team maintained business continuity, validated assumptions at each step, and retained the ability to roll back. The two prior attempts had extracted services while leaving the shared PostgreSQL database in place, which recreated the monolith’s coupling across a network boundary and stalled both efforts.

2. Events beat shared databases. The decision to introduce Kafka as an event backbone was the single most consequential architectural choice. It freed teams from synchronous database coordination and made the system composable. Without it, the team would have built a distributed monolith in Lambda — slower, harder to debug, and just as brittle as the original.

3. Cost is a feature. Involving finance early, tagging every resource, and setting hard budget alerts prevented the typical serverless cost surprises. Unoptimized Lambda cold starts and unbounded DynamoDB queries turned out to be real cost drivers, but the team could only see that once the spend showed up in cost-center dashboards.

4. Observability precedes migration. Trying to refactor a system you cannot see is like operating on a patient without an MRI. The investment in OpenTelemetry during Phase 1 paid for itself within the first month of production traffic on the new stack.

5. Regulated environments demand idempotency discipline. In payments, duplication is not a minor annoyance — it is a compliance violation. Building idempotency keys into the data model from day one, rather than bolting them on during testing, saved the team from re-architecting the retry path late in the program.

6. Culture eats architecture. The technical migration was the easy part. The harder work was convincing sales, support, and compliance teams to trust a system they could not see in the same data center. Weekly "what we migrated this week" summaries and transparent latency dashboards built that trust gradually.

7. Sunset is part of delivery. Decommissioning the monolith took as much effort as extracting it. The team allocated 20 percent of engineering capacity for three weeks after cutover to handle edge-case customer integrations that still hit legacy endpoints. Treating sunset as an afterthought is the fastest way to pay hidden infrastructure costs forever.

Conclusion

PayStream’s serverless migration is not a story about technology for technology’s sake. It is a story about aligning architecture with business constraints — cost, compliance, talent, and customer trust — and then executing with surgical precision. The team did not adopt serverless because it was trendy; they adopted it because the math demanded it, and the operational model supported it.

For engineering leaders considering a similar journey, the clearest recommendation is this: start small, instrument ruthlessly, and never sacrifice backward compatibility for speed. The organizations that succeed at modernization are not the ones that rewrite the fastest; they are the ones that learn the most from each incremental step.
