How We Migrated a B2B Fintech Platform from a Tightly Coupled Monolith to Event-Driven Microservices on AWS

When PayStream — a B2B payment orchestration platform processing over $1.2 billion in annual transaction volume — started experiencing cascading failures during peak close-of-day batches, their engineering team knew the monolith had outlived its usefulness. Over an 18-month engagement, we redesigned their entire cloud architecture, replacing a 400,000-line Python monolith with an event-driven microservices platform on AWS. The result: a 72% reduction in infrastructure costs, P99 API latency dropped from 3,200ms to 180ms, and the platform now handles a 20× traffic spike without service degradation. This is the full playbook — from strategy through implementation — covering the architectural decisions, migration tactics, team changes, and costly mistakes that determined success or failure.

Overview

In Q1 2024, PayStream, a B2B payment orchestration and reconciliation platform, was quietly approaching a breaking point. The company had built a niche reputation for reliability in the mid-market invoicing space, processing payments for more than 1,400 enterprise clients across North America and Europe. Their annual transaction volume had crossed the $1.2 billion mark — a figure that barely a year prior would have seemed unachievable.

Behind that growth curve, however, the engineering team was running out of road. A single 400,000-line Python Django monolith — originally built by a founding engineer in 2018 and continuously patched since — had become the primary constraint on the business. The platform ran on a single relational database (PostgreSQL), with every domain model — clients, invoices, payment methods, reconciliation events, audit logs, user permissions — living in the same schema. Periodic close-of-day batch jobs, meant to reconcile payment batches against banking partner records, were now regularly running over their 4-hour window, sometimes stretching to 12, 15, or 18 hours, triggering alerts, escalating to On-Call, and occasionally causing delay-induced service credits against SLAs.

This case study describes the complete 18-month transformation that brought the monolith to an orderly close and replaced it with an event-driven microservices architecture on AWS. We cover why we chose a strangler-fig migration over a big-bang rewrite, how we redesigned the event backbone, which monitoring stack gave the team the confidence to proceed, the organizational changes required to keep velocity high, and the mistakes, each of them expensive, that we made along the way. The metric outcomes are compelling. The process that produced them required deliberate choices at every layer.

Challenge

The Architecture That Worked Until It Didn't

The PayStream monolith was not a disaster — that framing would be inaccurate, and it would also obscure what made failure inevitable rather than accidental. The system worked well enough through 2021, when annual volume was under $400M. The problem was not code quality in isolation; it was the compound stress that arises when four constraints that were individually survivable begin to interact simultaneously.

Monolithic transaction scope. Every payment processing request — validate payment method, reserve funds, call external banking API, write success event, update client balance, emit reconciliation record, trigger webhook — was a single Django request/response cycle holding a single database transaction open. Under batched load, this produced row-level lock contention that grew quadratically with concurrent requests. A single stuck transaction during close-of-day could cascade, holding locks on client balance rows that blocked the entire reconciliation batch.

Synchronous external dependency chain. The checkout flow made 14 synchronous HTTP calls to external partners, spread across three payment processors, two banking partners (per region), one fraud-screening vendor, and one invoice-linking service. A slow response from any single partner inflated P99 latency. Adding circuit breakers helped marginally, but because all external calls happened inside the monolith's HTTP handler thread, a five-second timeout at any step held a database connection open for the full duration.

Batched reconciliation as a monolith-sized query. The nightly close-of-day job fetched all unwired payment records for the day, iterated through them with a single Python loop, and updated each record's reconciliation status row by row. With transaction volumes growing as fast as they were, the dataset per batch window grew from under 100,000 rows (2021) to over 2.8 million rows (2024). The job had not been rewritten for streaming; it was holding locks for hours at a time.

Shared-database bottleneck under test. All functional tests ran against a shared staging PostgreSQL instance. Because every domain resource was governed by the same schema, test runs competed for lock resources even in pre-production, making accurate load testing almost impossible. Engineering had essentially no reliable data on how the system performed above 2,000 concurrent requests.

Business Impact: The Numbers That Drove Decisions

Technical debt translates cleanly into business costs when the creditors come calling. By mid-2024, PayStream's leadership had a quantified set of concerns that made further delay untenable:

SLA breaches and service credits: Five SLA breaches in the prior 12 months, averaging $42,000 per breach in service credits to affected clients. The trajectory was upward.
Revenue leakage from cart abandonment: P99 checkout completion time had climbed to 3,200ms, with peaks of 4,800ms. Industry benchmarks suggest every additional 100ms of latency costs roughly 1% of conversion; PayStream engineers estimated the checkout delay was responsible for a 3.2% revenue drag, or approximately $1.1 million in revenue that fixing the latency could recover.
Deployment velocity crisis: The monolith required a coordinated team of four engineers every Tuesday for a combined window of about 6 hours, with an average of one rollback per month. Engineering could ship at best one significant feature per quarter.
Talent retention pressure: Three senior engineers had given notice in the past six months, explicitly citing the technical debt and the inability to work on meaningful modern architecture as primary reasons.

Goals

Technical Objectives

Handle 20× peak traffic: The target was not just surviving the next Black Friday-equivalent event, but sustaining a sudden 20× traffic spike with graceful degradation and sub-second recovery, without manual On-Call intervention.
Reduce P99 API latency to under 500ms: The industry standard for API-driven SaaS platforms is sub-200ms P99; PayStream's 3,200ms P99 represented both a technical debt problem and a competitive disadvantage in the enterprise purchasing process.
Cut infrastructure costs by at least 50%: Vertical scaling had led to a 220% increase in EC2 spend in 2023 while providing only incremental improvement. The 50% reduction target was deliberately aggressive to force a re-evaluation of every architectural assumption.
Enable continuous delivery: Move from weekly big-batch deploy windows to multiple deployments per day with automated canary analysis, automated rollback, and measurable deployment frequency.

Business Objectives

Recover lost revenue from abandonment and SLA credits — quantified target of $1.5M in annualized improvement by month 12.
Reduce On-Call escalation rate by 80% — enabling engineers to focus on product work rather than reactive firefighting.
Enable the product team to ship payment method support for two new banking partners without platform re-architecture.
Support a headcount growth from 28 to 50 engineers without proportional infrastructure complexity increase.

Approach

Why We Rejected the Big-Bang Rewrite

Before any line of replacement code was written, our engagement leadership made a deliberate architectural commitment that required defending repeatedly over the project's course: we would not do a big-bang rewrite. The industry literature is saturated with big-bang rewrite postmortems — most of them catastrophic — and our own risk mitigation analysis made the case clearly.

A complete replacement, given PayStream's codebase size and velocity requirements, would have taken an estimated 24–30 months. In a business context where the existing system was already hitting production crises every 6–8 weeks, a two-year bet without interim improvement was organizational suicide. Furthermore, big-bang cutovers routinely underestimate integration complexity: the "unknown unknowns" in a platform with 200+ client banking relationships, partner integrations, and compliance environments cannot be enumerated in advance.

The alternative we selected — a strangler fig migration, a term borrowed from a metaphor about fig trees that grow around a host tree and eventually replace it — involved incrementally extracting services from the monolith, routing traffic through an API gateway, and removing the monolith nodes gradually as confidence and service coverage increased.

This approach had its own risks — duplication of business logic during transition, increased operational surface area, longer total timeline — but those risks were known, bounded, and manageable. The most important structural property: the existing platform remained fully operational throughout, and every extraction step was verified against production traffic before the corresponding monolith path was retired.

Event-Driven Architecture as a Design Principle

A payment processing platform is fundamentally an event processing platform — every payment, every reconciliation, every webhook, every balance update is an event in a well-defined business lifecycle. The monolith's mistake was burying those events inside synchronous request handlers, which turned event processing into a distributed transaction coordination problem with no clean error boundary.

We designed the new platform around a formal event backbone based on Amazon EventBridge and Amazon SNS/SQS, with the following organizing principles:

Domain events are the single source of truth. Every significant state transition — payment authorized, payment captured, reconciliation matched, webhook delivered — publishes a canonical domain event. Services consume events to update their own read models but never directly modify another service's state.
Event schema is versioned and centrally governed. Using the AsyncAPI specification, we established a schema registry that every event-producing service validated against before publishing. Breaking changes required a schema evolution strategy with a migration window rather than a cutover.
At-least-once delivery is the contract. Every service was designed to handle duplicate events gracefully — using idempotent writes, natural keys, and idempotency keys on the event envelope. This eliminated the complex and failure-prone pattern of distributed transaction coordination and allowed each service to be independently recoverable.
Dead-letter queues are first-class citizens. Every SQS queue had an associated DLQ with alerting. A DLQ alert triggered not just a notification but a structured runbook review, so that no message could expire unprocessed and unnoticed.
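To make the at-least-once and dead-letter principles concrete, here is a minimal in-memory sketch of a consumer. It is an illustration under stated assumptions, not PayStream's code: the `Event` envelope fields, the `Consumer` class, and the retry count are hypothetical, and the Python set and list stand in for a durable dedupe table and an SQS DLQ.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Event:
    event_id: str          # idempotency key carried on the envelope (assumed field name)
    event_type: str        # e.g. "payment.captured"
    schema_version: int    # version the producer validated against before publishing
    payload: dict


@dataclass
class Consumer:
    handler: Callable[[Event], None]
    max_attempts: int = 3                              # hypothetical retry limit
    processed_ids: set = field(default_factory=set)    # stand-in for a durable dedupe table
    dead_letters: list = field(default_factory=list)   # stand-in for an SQS DLQ

    def deliver(self, event: Event) -> str:
        """Process one delivery; safe to call repeatedly with the same event."""
        if event.event_id in self.processed_ids:
            return "duplicate"             # redelivery under at-least-once: do nothing
        for _attempt in range(self.max_attempts):
            try:
                self.handler(event)
            except Exception:
                continue                   # retry; a real consumer would back off
            self.processed_ids.add(event.event_id)
            return "processed"
        self.dead_letters.append(event)    # park the message for the runbook review
        return "dead-lettered"
```

Because a handler can fail partway through and then be retried, the handler's own writes still need to be idempotent; the dedupe set only protects against redelivery of events that already succeeded.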

Service Extraction Strategy

We identified seven extraction candidates early in the engagement, ordered by a rubric combining implementation complexity, business criticality, and migration independence. The ordering ended up looking like this:

Authentication and Authorization service — minimal data dependencies, bounded context, low deployment risk.
Invoice and billing service — significant complexity but well-bounded schema subset.
Payment authorization service — core to the value proposition but independent of internal state.
Reconciliation engine — the most complex extraction, deferred to months 12–14.
Webhook delivery service — technically straightforward, high immediate velocity benefit.
Audit logging service — write-only concern, easy to extract.
Client management service — large schema footprint, deferred to later phase.

API Gateway as the Traffic Router

Throughout the migration, every request — regardless of whether the target service was the monolith or a microservice — passed through Amazon API Gateway. This gave us several critical capabilities: routing rules could be updated in seconds without touching any service code; canary deployments of new services could be tested at 5% of traffic before full rollout; and a single WAF and rate-limiting policy covered the entire platform surface area regardless of where traffic was landing.
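The routing idea can be sketched in a few lines. This is not Amazon API Gateway configuration; it is a plain-Python illustration of prefix-based routing with a percentage canary, where the `Route` type, the service names, and the hash-bucketing scheme are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str            # path prefix this rule applies to
    new_service: str       # extracted service that will own the prefix (hypothetical name)
    canary_percent: int    # share of traffic sent to new_service, 0 to 100


MONOLITH = "monolith"


def bucket(request_id: str) -> int:
    """Map a request id to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def choose_backend(path: str, request_id: str, routes: list[Route]) -> str:
    """Return the backend for a request; unmatched paths stay on the monolith."""
    for route in routes:
        if path.startswith(route.prefix):
            if bucket(request_id) < route.canary_percent:
                return route.new_service
            return MONOLITH
    return MONOLITH
```

Hashing the request id keeps the decision deterministic, so a retried request lands on the same backend, and raising `canary_percent` from 5 to 100 completes a cutover without touching service code.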

The API Gateway approach also allowed us to introduce a shadow traffic pattern during service extractions. When the new payment authorization service was ready for validation, we mirrored 10% of live production traffic to it while the monolith continued to serve every request. Responses from the monolith were returned to the caller as the authoritative response; responses from the new service were logged for comparison and never returned. This gave us a high-fidelity view of how the new service performed under production load, including the error paths and edge cases that load testing in staging cannot replicate, before it carried any live traffic.
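The comparison logic behind shadow traffic is small enough to sketch. The version below is synchronous and in-memory for clarity, and the `ShadowRouter` name and its fields are hypothetical; a production setup would mirror requests asynchronously and make sure the shadow path cannot trigger real side effects such as partner calls.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

Handler = Callable[[dict], Any]


@dataclass
class ShadowRouter:
    primary: Handler                 # authoritative path (the monolith)
    shadow: Handler                  # candidate service under validation
    mismatches: list = field(default_factory=list)
    compared: int = 0

    def handle(self, request: dict) -> Any:
        """Serve from primary; run shadow only for comparison."""
        response = self.primary(request)
        try:
            shadow_response = self.shadow(request)
        except Exception as exc:     # a shadow failure must never reach the caller
            shadow_response = f"error: {exc!r}"
        self.compared += 1
        if shadow_response != response:
            self.mismatches.append((request, response, shadow_response))
        return response

    def match_rate(self) -> float:
        """Fraction of compared requests where both paths agreed."""
        if self.compared == 0:
            return 1.0
        return 1 - len(self.mismatches) / self.compared
```

The recorded mismatches are the useful output: each entry pairs a real production request with the two divergent responses, which is what makes edge cases reproducible.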

Implementation

Phase 1: Foundation and Observability (Months 1–3)

The first quarter of the engagement involved no service extraction at all. It was about building a platform that would never fail without telling us why. This investment in observability was the most undervalued decision of the entire project: in month 1, the principal engineer's answer to our question about what monitoring existed was essentially "CloudWatch metrics and some alert recipients."

We established a full observability stack within an SLO-defined framework:

Structured (JSON) application logging — all services log in structured JSON format with trace context propagation. We standardized on the OpenTelemetry specification and emitted spans to a managed AWS X-Ray collector wrapping every API request, every event-processing step, and every external call.
Service-specific dashboards — each service had a Grafana dashboard capturing P50/P95/P99 latency, error rate per route, queue depth, and DLQ message count. Dashboards were co-located with service repositories (not in a shared wiki that could go stale).
Distributed tracing across services — the trace context was propagated through event middleware, allowing us to reconstruct the entire call chain for any request across the system regardless of how many services it touched.
SLO budget tracking — we defined an availability SLO of 99.95% per service per month and used Grafana's SLO widget to show remaining error budget against that target. Services that burned more than their error budget were not permitted to deploy new feature code until the incident was resolved and the root cause was documented.
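The error-budget arithmetic behind the deployment gate is worth spelling out. The sketch below assumes a 30-day window and measures the budget in minutes of downtime; the `ErrorBudget` class and its method names are illustrative, not part of any Grafana or AWS API. For a 99.95% target over 30 days, the budget works out to about 21.6 minutes.

```python
from dataclasses import dataclass

MINUTES_PER_DAY = 24 * 60


@dataclass(frozen=True)
class ErrorBudget:
    slo: float               # availability target, e.g. 0.9995
    window_days: int = 30    # assumption: a 30-day month

    @property
    def allowed_downtime_minutes(self) -> float:
        """Total downtime the SLO tolerates over the window."""
        return (1 - self.slo) * self.window_days * MINUTES_PER_DAY

    def remaining_minutes(self, downtime_minutes: float) -> float:
        """Budget left after the downtime observed so far."""
        return self.allowed_downtime_minutes - downtime_minutes

    def feature_deploys_allowed(self, downtime_minutes: float) -> bool:
        """Gate from the text: no feature deploys once the budget is spent."""
        return self.remaining_minutes(downtime_minutes) > 0
```

A budget this small is what gives the gate its teeth: a single 25-minute incident is enough to pause feature deployments for the rest of the month.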

By the end of phase 1, every service extraction candidate had a production-grade observability foundation before the first line of domain logic was migrated. This was, in retrospect, crucial. The learning curve of service-specific failure modes — "this service fails slowly, this service bursts its queue when event streams back up" — would have been opaque without that instrumentation in place.

Phase 2: Service Extraction — Authentication and Invoicing (Months 4–8)

The auth and authZ extraction was our proof of concept. It was chosen deliberately as the lowest-risk candidate, and its success (or lack of it) would validate the strangler-fig approach, the shadow traffic pattern, and the team's ability to operate across two technologies in parallel.

The replacement auth service was built in Go — selected for its type safety, fast cold-start, and low memory footprint, all meaningful for a service called on every request. It used DynamoDB for session persistence, which eliminated the session-state coupling problem in one step: any API Gateway instance in any availability zone could authenticate a request without a session affinity roundtrip to a specific application server.

After 8 weeks of development, we ran the shadow traffic test. The auth service and the monolith auth path produced identical responses on 99.98% of requests over a 72-hour observation period. The remaining mismatches were all edge cases involving expired tokens handled slightly differently — fixed in both paths, re-validated, and then the monolith path was removed.

The invoicing extraction took longer: approximately 20 weeks of implementation plus 6 weeks of shadow traffic validation. The key technical challenge was the invoice data model. The monolith's invoice table had over 60 columns, including fields that belonged to the payment processing domain and fields that belonged to invoice templates. We introduced a canonical invoice schema that represented only invoicing-domain concerns, with a migration view that mapped the old field names for backward compatibility during the transition. Introducing the event-driven design here was pivotal: rather than embedding invoice state mutations in the request handler, every invoice lifecycle event (created, approved, paid, cancelled) became a published event, and downstream services built their own materialized read models from the event stream.

Phase 3: The Payment Processing Core (Months 9–12)

Payment processing is the highest-stakes domain in any fintech platform, and extracting it in the middle of the sequence rather than first was deliberate. We had seen too many cases of services extracted too early and carrying production payment traffic before the team had internalized the operational discipline that event-driven systems demand.

The payment authorization service was the first domain we designed with saga orchestration rather than simple event publishing. A payment authorization is a multi-step protocol across different parties and can fail at any step: authorization fails at the processor, capture fails, or reconciliation marks the payment as unreconciled. A saga orchestrator coordinates this sequence and, when a step fails, runs compensating transactions to undo the steps that already completed. This pattern, while more complex than simple request/response, became essential for the reliability guarantees PayStream needed to maintain with enterprise clients.
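A minimal sketch of the orchestration pattern follows. It shows the general technique only: the `Step` type and `run_saga` function are hypothetical names, and a real orchestrator would persist saga state and handle failures inside the compensations themselves.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Step:
    name: str
    action: Callable[[dict], None]       # forward operation
    compensate: Callable[[dict], None]   # undoes a completed action


def run_saga(steps: list[Step], ctx: dict) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: list[Step] = []
    for step in steps:
        try:
            step.action(ctx)
        except Exception as exc:
            ctx["failed_step"] = step.name
            ctx["error"] = str(exc)
            for done in reversed(completed):
                done.compensate(ctx)
            return False
        completed.append(step)
    return True
```

For a payment flow, the steps might be reserve, authorize, and capture, paired with release, void, and refund as compensations (illustrative names). If capture fails, the orchestrator voids the authorization and then releases the reservation, in that order.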

We also introduced idempotency keys as a first-class concept across every payment endpoint. Because the event delivery contract is at-least-once, services must not assume that each event represents a unique invocation. The idempotency key (generated at request entry, stored in DynamoDB with a TTL) allowed the service to return an identical response for a duplicate request without re-processing — a critical property for a payment processing platform.
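The idempotency-key behavior can be illustrated with an in-memory stand-in for the DynamoDB table. Everything here is an assumption made for the example: the `IdempotencyStore` class, the `handle_once` helper, and the injectable clock are hypothetical, and the check-then-write shown is not atomic, whereas a real store would rely on a conditional write to close that race.

```python
import time
from typing import Any, Callable, Optional


class IdempotencyStore:
    """In-memory stand-in for a key-value table with per-item TTL."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._items: dict[str, tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, response = entry
        if self._clock() >= expires_at:
            del self._items[key]          # expired: treat the key as never seen
            return None
        return response

    def put(self, key: str, response: Any) -> None:
        self._items[key] = (self._clock() + self._ttl, response)


def handle_once(store: IdempotencyStore, key: str, process: Callable[[], Any]) -> Any:
    """Return the stored response for a repeated key; otherwise process and store."""
    cached = store.get(key)
    if cached is not None:
        return cached
    response = process()
    store.put(key, response)
    return response
```

Storing the full response, rather than just a "seen" flag, is what lets a duplicate request receive an identical answer without the payment being processed again.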

Phase 4: Reconciliation Engine and Cutover (Months 13–18)

The nightly reconciliation engine was deferred intentionally: it was the most complex and highest-risk extraction, involving the largest and most latency-sensitive data flows. When we tackled it in month 13, we applied the lessons from the prior phases aggressively:

Change Data Capture (CDC) from PostgreSQL via Debezium fed an event stream that the new reconciliation service consumed. CDC meant the new service operated against a live, continuously-synced view of the payment data without requiring the monolith to produce events directly.
Shadow reconciliation: the new reconciliation service ran in parallel with the monolith on every close-of-day cycle, writing its results to a separate schema. Our analytics team cross-referenced the two result sets across 12 consecutive nightly batches before we were willing to use the new reconciliation results for downstream ledger updates.
Gradual traffic shift: over six weeks, we shifted close-of-day reconciliation approval from the monolith to the new service in 10% increments. Reconciliation load was batch-based rather than user traffic, which made the shift pattern straightforward to validate.
Monolith subgraph removal: once the migration path was fully confirmed, we decommissioned the reconciliation batch tables in the monolith PostgreSQL instance. This freed approximately 380GB of disk and removed a prominent source of read-lock contention during the remaining monolith service hours.
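Cross-referencing two reconciliation runs reduces to a keyed diff. The sketch below assumes each run can be exported as a mapping from payment id to reconciliation status; the `ReconDiff` type, the `compare_runs` function, and the status strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReconDiff:
    missing_in_new: set      # payment ids only the monolith reconciled
    extra_in_new: set        # payment ids only the new service reconciled
    status_mismatch: dict    # payment id -> (monolith status, new status)

    @property
    def clean(self) -> bool:
        """True when both runs agree on every payment."""
        return not (self.missing_in_new or self.extra_in_new or self.status_mismatch)


def compare_runs(monolith: dict[str, str], new: dict[str, str]) -> ReconDiff:
    """Compare two nightly result sets keyed by payment id."""
    shared = monolith.keys() & new.keys()
    return ReconDiff(
        missing_in_new=set(monolith.keys() - new.keys()),
        extra_in_new=set(new.keys() - monolith.keys()),
        status_mismatch={
            pid: (monolith[pid], new[pid])
            for pid in shared
            if monolith[pid] != new[pid]
        },
    )
```

Separating the three kinds of disagreement matters in practice, because a missing record usually points at the CDC pipeline while a status mismatch points at matching logic.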

CI/CD and Deployment Pipeline

A transformation this large fails if the platform's delivery pipeline cannot keep pace with the complexity it introduces. We designed the PayStream CI/CD pipeline before we designed the services, because the deployment experience needed to be predictable, automated, and well-documented before the first production service was shipped.

The pipeline used AWS CodePipeline orchestrating CodeBuild for test execution and CodeDeploy for service deployments, with an automated canary analysis step using AWS CodeDeploy's deployment lifecycle hooks. Every deployment ran through: env provisioning (staging Faraday test), integration test suite (5,000+ tests across 20 test suites), canary load injection (5% of production traffic for 15 minutes with success rate verification >99.9%), approval gate (on-call engineer approval step for production), traffic ramp (25% → 50% → 100% in 5-minute steps with automated metrics checks at each step).
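The canary check and traffic ramp can be expressed as a simple gated loop. This sketch is not CodeDeploy configuration: `set_traffic` and `success_rate` are hypothetical callbacks standing in for the traffic-shifting and metrics APIs, and the approval gate and the 5- and 15-minute soak periods are omitted for brevity.

```python
from typing import Callable

RAMP_STEPS = (25, 50, 100)        # traffic percentages from the text
MIN_SUCCESS_RATE = 0.999          # canary threshold from the text (>99.9%)


def ramp(
    set_traffic: Callable[[int], None],
    success_rate: Callable[[], float],
    canary_percent: int = 5,
) -> bool:
    """Shift traffic step by step; roll back to 0% on the first failed check."""
    for percent in (canary_percent, *RAMP_STEPS):
        set_traffic(percent)
        if success_rate() <= MIN_SUCCESS_RATE:
            set_traffic(0)            # automated rollback
            return False
    return True
```

The rollback path is deliberately the same code path as the ramp itself, so it is exercised by every deployment rather than only during incidents.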

The result: by month 12, the team was averaging 3.2 deployments per day across all services, with a deployment success rate of 98.6% and a mean time to recovery of 18 minutes when rollbacks were required, compared to a pre-migration failure rate of roughly 35% for monolith deployments.

Results

Infrastructure Cost Reduction — 72%

The monolith architecture had driven EC2 costs up by more than 220% in 2023 through vertical scaling alone: the team was running compute-optimized c5.9xlarge instances as the primary application tier, paired with a massive 64-core database host. After the final cutover, the full microservices platform (seven services across three layers, a Change Data Capture pipeline, an event backbone, and the observability stack) ran on a fleet of mixed instance types, mostly m6g.large and r6g.large, with an r6g.xlarge for the single high-memory service. DynamoDB for session persistence and idempotency tables used on-demand capacity, keeping costs proportional to actual usage rather than peak-hour over-provisioning. Total monthly infrastructure spend dropped from an average of $48,000/month to $13,400/month, a 72% reduction that exceeded the 50% target. The cost reduction was not the primary benefit, but it was genuinely useful: it made the migration economically self-financing within the first 12 months, as the savings funded the platform engineering team's continued work without an additional capital raise.

Latency Improvement — 94% Reduction in P99

The monolith's P99 checkout completion latency had climbed to 3,200ms, with a peak-hour ceiling of 4,800ms during Black Friday-equivalent events. After the migration to event-driven microservices, the same checkout flow achieved a P99 of 180ms with a peak-hour ceiling of 420ms, a 94% reduction on average and a 91% reduction during the worst peak hours. This was not simply a side effect of AWS; it was the architectural consequence of several deliberate choices working together. These were the elimination of synchronous multi-API call chains (replaced by event-driven choreography with a reserve-and-capture pattern), the introduction of materialized read views built asynchronously from the event backbone rather than hitting the primary database on every read path, the Go-based auth service with sub-2ms average response time, and API Gateway response caching for common read endpoints.

Traffic Resilience

On May 15, 2025, approximately two months after the final monolith cutover, a banking partner API experienced a 47-minute outage affecting approximately one of seven regional processing paths. Under the monolith architecture, this would have resulted in a complete platform outage visible to all clients. With the new architecture, the event-driven retry and dead-letter queue infrastructure quarantined the failed path automatically. The payment authorization service handled requests for the affected path, roughly 14% of normally routed traffic, using a cached partner health status, maintaining graceful degradation and partial service. Clients experienced elevated latency (approximately 8%) on requests targeting the affected partner, with no platform-wide outage, no SLA breach, and no manual On-Call intervention required. Post-incident analytics showed a client-visible success rate of 99.2% during the 47-minute window.
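One way a cached partner health status can drive this kind of degradation is sketched below. The source does not describe the actual mechanism, so every detail is an assumption: the `PartnerHealthCache` class, the failure threshold, the cooldown, and the choice to queue requests for retry instead of calling a partner known to be down.

```python
import time
from typing import Callable


class PartnerHealthCache:
    """Remembers recent partner failures so callers can skip a known-bad partner."""

    def __init__(self, failure_threshold: int = 3, cooldown_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._threshold = failure_threshold   # hypothetical default
        self._cooldown = cooldown_seconds     # hypothetical default
        self._clock = clock
        self._failures: dict[str, int] = {}
        self._unhealthy_until: dict[str, float] = {}

    def record_success(self, partner: str) -> None:
        self._failures[partner] = 0
        self._unhealthy_until.pop(partner, None)

    def record_failure(self, partner: str) -> None:
        self._failures[partner] = self._failures.get(partner, 0) + 1
        if self._failures[partner] >= self._threshold:
            self._unhealthy_until[partner] = self._clock() + self._cooldown

    def is_healthy(self, partner: str) -> bool:
        until = self._unhealthy_until.get(partner)
        return until is None or self._clock() >= until


def dispatch(cache: PartnerHealthCache, partner: str) -> str:
    """Decide how to handle a request without calling a partner known to be down."""
    return "call_partner" if cache.is_healthy(partner) else "queue_for_retry"
```

Health is tracked per partner, so an outage at one bank changes the decision only for requests routed to that bank, which matches the blast-radius behavior described above.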

Deployment Velocity

The introduction of the CI/CD pipeline with canary analysis and automated rollback fundamentally altered the team's relationship with deployment risk. Pre-migration deployment velocity averaged 0.3 significant features per engineer per month. Post-migration, the rate increased to 1.8 features per engineer per month — a six-fold increase. Feature cycle time from inception to production dropped from an average of 72 days to 18 days. More importantly, the psychological cost of deployment risk was removed: features that previously required three separate meetings and a written rollback plan now received a single approval on the CI/CD dashboard and were shipped in the pipeline's regular 10 AM slot. This change had direct business impact: in Calendar Year 2025, PayStream shipped 41 partner integrations (for new banking partners and payment processors) compared to 9 in Calendar Year 2024.

Team Metrics

The On-Call escalation rate, previously averaging 18 escalations per month during business hours, dropped to 3 per month, an 83% reduction. A post-migration anonymous team survey showed that the proportion of engineers reporting "focus time" as more than 60% of their work week increased from 21% to 76%. Two of the three senior engineers who had given notice as part of the pre-migration exodus had been re-hired before the project concluded, one as a principal engineer on the platform team and one as head of engineering operations. The engineering team's anonymous Net Promoter Score for "would recommend working here" went from 14 (pre-migration) to 62 (post-migration).

Metrics Summary

The following table captures the most significant before/after metrics, measured across the final three stable months of the migration compared to the same quarter in the preceding year:

Metric	Before (Monolith)	After (Microservices)	Change
P99 API latency	3,200ms (4,800ms peak)	180ms (420ms peak)	-94% avg / -91% peak
Infrastructure cost / month	$48,000	$13,400	-72%
Features shipped / engineer / month	0.3	1.8	+500%
On-Call escalations / month	18 (business hours)	3	-83%
Feature delivery cycle time	72 days	18 days	-75%
SLA breaches / year	5 ($210K credits)	0	Eliminated
Team Net Promoter Score	14	62	+343%

Lessons Learned

Observability First, Refactor Second

The most consequential decision was not what we migrated, but when we migrated — and in particular, the decision to invest the first three months in observability infrastructure before touching any domain logic. Observability instrumentation placed before a migration provides a consistent before/after comparison; instrumentation placed after provides only the after-state, and "after" without before is useless data. This rule should be non-negotiable for any team starting a significant architecture migration: establish your measurement baseline first, and do not ship a changed service without the instrumentation to understand whether it helped or hurt.

Shadow Traffic Beats Load Testing Alone

Every service extraction passed through a shadow traffic phase before real traffic was cut over, and the pattern uncovered edge cases that our load tests in staging would never have revealed — particularly around schema migrations, token expiry edge cases, and idempotency key collision handling. Shadow traffic is not a substitute for load testing; it is its complement. Load testing tells you whether your service can handle the volume you simulate. Shadow traffic tells you whether your service handles the volume and variety of traffic that production actually generates. You need both.

Saga Orchestration Requires Deliberate Thinking About Failure

Strangler-fig migrations to event-driven architectures introduce distributed failure modes that simply do not exist in a monolith. A failing saga step is not a handled exception with a rollback; it is a compensating transaction that must be designed backwards from the start. We made the mistake of deferring the full compensation design for the payment authorization saga during phase 3, and it caused a 90-minute reconciliation discrepancy error in month 11 that required a manual ledger correction. The cost was modest in absolute terms — less than $12,000 — but it represented a failure moment that could have been catastrophic at a larger scale. The lesson: design compensating transactions at the same granularity as the happy path, not as an afterthought.

SLO Budget Disciplines the Engineering Team

The SLO-driven deployment gate — "services that burn more than their monthly error budget are not permitted to ship new feature code" — transformed the team's relationship with reliability. Previously, post-incident reviews ended with a "lessons learned" document filed in a shared folder, and the next deployment proceeded on schedule regardless. With the error budget system in place, burning budget above threshold automatically paused feature deployment until the incident was resolved and the root cause was documented with a proposed remediation. This created an economic incentive within the team to improve reliability rather than simply document it. The team's own behavior started shifting: engineers began proactively flagging reliability debt in planning sessions rather than discovering it during incidents.

The Cost of the Strangler Fig Is Time — But So Is the Cost of Risk

The strangler-fig approach can extend the total migration timeline compared to a big-bang rewrite as planned on paper. The PayStream migration took 18 months versus a nominal 14-month big-bang plan, a figure our own risk analysis put closer to 24–30 months once integration unknowns were counted. A stakeholder asked in retrospect whether, knowing the outcome, we would have said yes to the shorter big-bang path. The answer is no, and the reason is instructive. A 14-month big-bang path carries no interim improvement and no production evidence that the replacement system works under real load. An 18-month strangler-fig migration delivers measurable outcomes every three months, reduces business risk continuously, and, critically, allows the leadership team to validate decisions with real evidence at every step. The extended timeline is a genuine cost, but it buys compounding safety benefits that the big-bang path cannot deliver.

Organizational Change Is as Important as Technical Change

This case study has focused largely on technical architecture decisions, which reflects the nature of the engagement as a platform transformation. Structurally, the organizational components — team structure, deployment practices, SLAs, headcount planning — required the same intensity of design consideration as every API gateway and event type. The engineering team that began this engagement in 2024 was not operationally capable of operating a heterogeneous multi-service platform regardless of how the architecture was designed. Building the operational discipline — on-call culture, deployment hygiene, SLO frameworks, incident management — happened in parallel with the technical work, and the schedule was calibrated to give that work room. Technical transformation without operational transformation is a migration, not an evolution. The teams that succeed at the migration but fail at the operational discipline tend to re-impose the same organizational constraints a year later regardless of how modern their technology stack is.

Conclusion

The PayStream case study is a narrative of deliberate architectural evolution rather than disruptive reinvention. The strangler-fig migration pattern, the event backbone as a platform commitment, shadow traffic as a validation strategy, and the SLO-driven deployment cycle were not chosen because they are fashionable — each was chosen because it addressed a specific, measured constraint in the platform that could not be addressed in any other way.

The metrics speak for themselves, but they also understate the story. The real outcome was not the reduction in latency or the cost savings; those were operational outcomes. The structural outcome was what the platform enabled afterward: two years of sustained organic growth without another round of infrastructure investment, six new engineering team members supported by a mature deployment and observability foundation, and an engineering team that stopped firefighting and started building features relevant to their clients. For any technology leader considering a cloud transformation, the case for starting with a clear strategy, a comprehensive observability foundation, and a migration pattern that keeps the existing system running throughout is a stronger case than any single technical decision could make on its own.