Webskyne

20 May 2026 • 3 min read

How PayForge Cut Payment Infrastructure Costs by 62% While Scaling to 1.2M Transactions Per Second

When fintech startup PayForge hit 420 million monthly transactions in 2025, its legacy payment rails buckled under the load. Slashing transaction costs by 62% and reclaiming 98% sub-second latency required a systematic overhaul of every layer, from routing logic to observability. This case study breaks down the six-month modernization that rebuilt the company's entire vertical-stack payment orchestration layer.

Case Study · Fintech · Payment Processing · Microservices · Infrastructure · Cloud Architecture · Cost Optimization · PCI DSS · CQRS

---

![Header image: data center server racks with blue-lit cables in rows](https://images.unsplash.com/photo-1558494949-ef010cbdcc31?w=1600&q=85)

---

## Overview

Founded in 2020, **PayForge** emerged as an infrastructure-grade payment orchestration platform meant to abstract the complexity of global acquiring by routing card payments through the cheapest, fastest available rail in real time across 45 countries. Within five years, the company had onboarded 8,400 merchants, processed 420 million monthly transactions, and managed US$4.7 billion in annual total payment volume (TPV).

Yet this scale arrived faster than the engineering org had anticipated. By late 2024 the legacy payment gateway, a monolithic Python/PostgreSQL stack built for 10,000 TPS, was straining under a persistent live load of 120,000+ TPS and suffered fatal cascading failures during Black Friday 2024.

The Series B-funded engineering team, composed of 22 full-stack engineers and a bare-bones SRE squad of two, found itself spending more on operational patching than on roadmap delivery. Root cause analyses after each production incident consistently pointed to the same culprits: synchronous all-or-nothing database transaction locks, tightly coupled acquirer integrations sharing a single code path, and zero end-to-end observability for individual transaction traces.

The situation crystallized into a single executive question: could the company keep growing without a structural rebuild, or had vertical-stack modernization become the condition for survival?

PayForge partnered with Webskyne Consulting, which conducted a six-week audit and led the subsequent six-month overhaul that reshaped the entire payment infrastructure into a resilient, scalable, cost-efficient architecture.

---

## The Challenge

### Challenge 1: Unsustainable Infrastructure Costs

By Q4 2024, monthly cloud and SaaS spend on the payment stack had climbed to **US$187,000**, or 41% of total operating costs, with no path to reduction under the existing monolith. Every additional acquirer card-processing API integration (Stripe, Adyen, Worldpay, and local rails in Southeast Asia and LATAM) added incrementally to per-transaction bloat, and the cost per successful transaction had drifted from a targeted US$0.038 to **US$0.102**. That was an untenable competitive position when fintech rivals processed at half the cost per transaction.

### Challenge 2: Latency Failures at Peak Load

Black Friday 2024 was the breaking point. Peak live throughput hit 312,000 TPS, **3× the design limit**. A single congested PostgreSQL hot shard handled 94% of the ledger writes, and under lock contention sequential-write throughput dropped 80% in under three minutes. Downstream acquirer timeouts cascaded through a synchronous call chain that had no bulkhead or timeout isolation, ultimately taking the entire API offline for 18 minutes. The outage resulted in US$2.6 million in failed settlement transactions and a serious merchant churn risk.

### Challenge 3: Developer Velocity Ceiling

Engineers reported that the monolith required full-stack context to modify a single payment rail path. A single merchant onboarding integration classified as low-complexity typically required a dedicated engineer for 4–6 weeks, and post-incident patching dominated the engineering calendar. CI/CD pipeline duration averaged 52 minutes for a single commit, and a shared staging environment made parallel development nearly impossible.

### Challenge 4: Compliance Exposure

With rapid merchant expansion across 45 countries, the monolith's PCI DSS audit evidence package had become scattered across 13 separate codebases, submodules, and shared libraries.
Each PCI quarterly assessment required a four-person security team for three weeks, and unresolved audit findings accumulated to 23 open items, risking merchant trust and acquisition certifications for new enterprise partners.

---

## Goals

PayForge leadership defined four explicit, measurable goals for the modernization effort:

| Goal | Target | Rationale |
|---|---|---|
| **Reduce TCO (per month)** | ↓65% to ≤US$65k | Compete on margin; fund growth organically |
| **Support 1M+ TPS at P99 < 200 ms** | Infrastructure layer proven | Handle global peak seasons without outage |
| **Cut new integration lead time** | 4 weeks → 5 days | Accelerate merchant acquisition flywheel |
| **Achieve PCI scope reduction** | SAQ D → SAQ A | Lower quarterly compliance burden by 70%+ |

Time horizon: twelve months total, with the majority of infrastructure migration completed in six months.

---

## Approach

The Webskyne team structured the engagement in four phases: **Audit & Discovery**, **Architecture Definition**, **Incremental Implementation**, and **Observability & Stabilization**.

### Phase 1: Audit & Discovery (6 weeks)

The engagement kicked off with a deep-dive technical audit involving every stakeholder from SRE to merchant success:

- Load testing the existing monolith in a shadow environment with synthetic QPS matched to peak 2024 traffic
- Distributed profiling with continuous tracing (OpenTelemetry + Jaeger) across the full transaction lifecycle: API gateway → routing layer → ledger → acquirer rails
- Cost breakdown analysis by service, regional zone, and database table access frequency
- PCI DSS scope mapping across all integration touchpoints
- A structured RACI roadmap aligned with merchant revenue impact

Key finding #1: **The monolithic PostgreSQL database was responsible for 94% of write-path latency.** Every payment attempt needed to hold a row-level lock on the settlement ledger; under concurrency, PostgreSQL serialized these writes and killed throughput no matter how much CPU was added.

Key finding #2: **Eighty-four percent of the monolith's compute time was spent in non-critical path logic** (KYC enrichment, webhook retry queues, and analytics pipelining), all safely extractable into asynchronous sidecars.

### Phase 2: Architecture Definition (4 weeks)

The target architecture decomposed the monolith into a **vertical-stack microservices constellation** with strict dependency boundaries:

```
        API Gateway (Kong/Express)
                 │
            ┌────┴────┐
            │         │
         Router    Auth Service &
            │         IdP
            │
            └─► Read/Write Split
                 ┌───┴───┐
            Write Cmd   Query Cmd
             (CQRS)    (Projection)
                │          │
            Postgres    Redis/
            (Ledger)    ClickHouse
                │
      Acquirer Router(Core)
        ┌─────────────┐
        │ Stripe│Adyen│ → Async Event Bus (Kafka)
        └─────────────┘
                │
        Sidecar Listener(s)
      Webhook/Retry/Enrichment
```

Key architectural shifts:

1. **CQRS command pattern for writes**: Each ledger write became a Commands-as-Payloads message on Kafka with idempotency envelopes.
   The write-throughput limit on Postgres was immediately bypassed by batching inserts asynchronously, leading to a 78% reduction in write-path latency.
2. **Acquirer isolation via bulkhead pattern**: Each acquirer integration was isolated in its own Kubernetes namespace with per-acquirer rate limits and circuit breakers, eliminating cascading failures triggered by a single bad acquirer.
3. **Async sidecars for non-critical paths**: KYC enrichment, webhook retry, and analytics events all moved off the critical path into asynchronous Kafka consumers, reducing the critical-path request time by 53%.
4. **Strategic read-model polyglot persistence**: Ordered ledger state on Postgres (immutable append-only), merchant-facing query projections on Redis (ultra-low latency), and business intelligence warehousing on ClickHouse (columnar analytics). The correct datastore for each use case replaced the one-size-fits-none Postgres approach.
5. **Progressive Strangler Fig migration**: The monolith exposed a controlled facade API; customers were gradually routed from legacy to new services via feature flags, enabling zero-downtime cutovers and a rollback path at every phase.
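
The first shift above depends on commands being safe to process more than once, since at-least-once delivery can hand a consumer the same message twice. A minimal sketch of that idea follows, assuming an in-memory stand-in for the ledger. The names `PaymentCommand`, `AppendOnlyLedger`, and `idempotency_key` are hypothetical and are not taken from PayForge's codebase.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PaymentCommand:
    """Hypothetical command envelope; field names are illustrative."""
    idempotency_key: str
    merchant_id: str
    amount_minor: int  # amount in minor units, e.g. cents
    currency: str


@dataclass
class AppendOnlyLedger:
    """In-memory stand-in for an append-only ledger table (assumption)."""
    entries: list = field(default_factory=list)
    _seen: set = field(default_factory=set)

    def apply(self, cmd: PaymentCommand) -> bool:
        """Append the command once; return False for a duplicate delivery."""
        if cmd.idempotency_key in self._seen:
            return False
        self._seen.add(cmd.idempotency_key)
        self.entries.append(cmd)  # entries are only appended, never updated
        return True

    def apply_batch(self, batch) -> int:
        """Apply a batch of commands and return how many were new."""
        return sum(self.apply(cmd) for cmd in batch)


ledger = AppendOnlyLedger()
cmd = PaymentCommand("key-001", "merchant-42", 1999, "USD")
# Simulate a redelivery: the same command arrives twice in one batch.
applied = ledger.apply_batch([cmd, cmd])
```

In a production system the duplicate check would more plausibly be enforced by a unique constraint in the ledger database than by an in-process set; the sketch only shows why a redelivered command does not produce a second ledger entry.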

---

## Implementation

### Month 1: Foundation Work (Infrastructure, CQRS, Observability)

- Provisioned a greenfield GCP cluster with a shared Prow (CI/CD) pipeline, ArgoCD for declarative deployments, and Vault for secrets management
- Implemented OpenTelemetry tracing across all new services before writing a single business handler; having tracing in place before migration enabled an empirical comparison of old vs new performance
- Shipped the **Payment Command Service**: an idempotent Kafka-backed writer handling all ledger writes with at-least-once guarantees; Postgres became strictly an append-only ledger without update or delete access paths
- Built the **Payment Query Service**: a Redis-clustered read model projecting from the command stream, sub-second for all merchant-visible queries
- Laid the PCI-compliant network boundary: a new isolation VPC with no public database IPs, IAM service accounts with least privilege, and a zero-trust service mesh (Istio) limiting inter-service communication to registered endpoints

### Month 2: Acquirer Router, Bulkhead Isolation, Event-Driven Patterns

- Decomposed the monolithic acquirer integration into **7 isolated microservices**, one per acquirer, with per-namespace resource quotas and circuit-breaker patterns (Hystrix)
- Rolled out the **Smart Acquirer Router Service**, which evaluates cost score, latency, and success rate per acquirer in real time using a predictive least-cost algorithm
- Introduced the **Dead-Letter Queue (DLQ) pattern** for payment commands that failed to settle, providing automatic retry with exponential back-off and merchant-facing webhook notifications for failed transactions

### Month 3: Sidecars and Async Processing

- Extracted KYC enrichment (sanctions screening, BIN lookup, 3DS routing) from the main handler into an async service; client wait times dropped from 88 ms → 12 ms on the critical path
- Built the **Webhook Delivery Service** with at-least-once guarantees, per-tenant sandboxing, and
  a reconciliation API for idempotent webhook retries
- Deployed the first version of the **Analytics Event Stream**, streaming payment events to ClickHouse via Kafka Connect and replacing the 5-hour batch ETL job that had been providing leaders with stale reports

### Month 4: Progressive Rollout and Feature Flagging

- Enabled feature-flag-controlled routing from the monolith facade to the new services using LaunchDarkly (shielded by weighted canary releases starting at 1% of traffic)
- Gradual ramp-up: 5% → 20% → 50% → 100% over three weeks with automated promotion criteria based on error rates, latency P99, and cost-per-transaction thresholds
- At 100% routing to the new services, the monolith was switched to read-only mode, then decommissioned and replaced by a stable archival replica

### Month 5: PCI Scope Reduction and Data Archival

- Migrated all cardholder data handling out of internal databases and into acquirer-hosted vaults, eliminating all cardholder data retention at PayForge
- Confirmed SAQ A eligibility, reducing the per-quarter assessment from 3,000+ hours of engineering time to under 85 hours for internal quarterly self-attestation
- Replaced the Postgres fanout replica with a fully serverless read enforcement layer

![Infrastructure architecture visual showing clean microservices and data flow patterns](https://images.unsplash.com/photo-1558494949-ef010cbdcc31?w=1200&q=80)

### Month 6: Load Testing, Chaos Engineering, Go-Live

- Ran a full-scale synthetic load test mimicking 1.25M TPS; the new architecture sustained 1.4M TPS with P99 latency of **167 ms**, 41% ahead of spec
- Implemented chaos engineering experiments using Gremlin: killed a random acquirer pod, injected latency at the DB layer, and simulated a full AZ outage, confirming all resilience patterns
  held with zero customer-visible impact
- Trained all 22 engineers on the new runbooks, incident SLAs, and canary release process over a two-week internal workshop
- Declared go-live on Dec 2, 2025, six weeks ahead of the original roadmap slot

---

## Results

### Infrastructure Cost Transformation

| Metric | Before (Jan 2025) | After (Jan 2026) | Change |
|---|---|---|---|
| Monthly cloud + SaaS bill | US$187,000 | US$71,200 | **−62%** |
| Cost per successful payment | US$0.102 | US$0.038 | **−63%** |
| Write-path DB CPU utilization (steady) | 91% | 36% | **−60%** |
| Outage duration from an acquirer failure at peak | 18 min | 0 min (≤200 ms error) | **100% elimination** |

A 62% reduction in monthly infrastructure costs freed up **US$1,389,600 per year** (US$115,800 per month × 12), enough to fund a dedicated in-house AI/ML team for fraud prevention, replacing a US$1.2M/year third-party SaaS subscription.

### Performance at Scale

| Metric | Before | After |
|---|---|---|
| Sustained TPS tested | 120,000 | **1,400,000** |
| P99 API latency at 950k TPS | 480 ms | **140 ms** |
| SLO adherence (target 99.95%) | 93.2% | **99.97%** |
| Downstream acquirer isolation failure duration | 18 min | **0** |
| PCI quarterly compliance hours | 3,000+ eng-hours | **85 eng-hours** |

The architecture sustained roughly 11.7× the previous transactions per second (1,400,000 versus 120,000) at a notably lower P99 latency, resolving the TPS ceiling that had been a known constraint for the past two years.

### Developers, Accelerated

| Metric | Before | After |
|---|---|---|
| New acquirer integration lead time | 4–6 weeks | **5 days** |
| PR review-to-deploy time | 2.4 days | **4.2 hrs** |
| Build to production deployment window | 52 min (pipeline) | **6 min (pipeline)** |
| Shared-staging clashes per sprint | ~18 | **0** |
| Post-incident patching time (avg per month) | 68 engineering hrs | **12 engineering hrs** |

With the monolith's complexity behind them, teams were able to operate with significantly higher confidence.
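
The headline percentages in the Results tables follow directly from the before and after figures. As a quick arithmetic check, here is a sketch that uses only numbers quoted in this case study; the helper name `pct_change` is illustrative.

```python
def pct_change(before: float, after: float) -> float:
    """Percentage change from `before` to `after` (negative means a drop)."""
    return (after - before) / before * 100


monthly_before = 187_000  # US$ per month, before
monthly_after = 71_200    # US$ per month, after

monthly_change = pct_change(monthly_before, monthly_after)  # about -61.9
annual_savings = (monthly_before - monthly_after) * 12       # US$ per year
cost_per_txn_change = pct_change(0.102, 0.038)               # about -62.7
throughput_ratio = 1_400_000 / 120_000                       # about 11.67
black_friday_growth = pct_change(312_000, 490_000)           # about +57
```

Rounded, these reproduce the reported −62% monthly bill, −63% cost per transaction, and 57% higher Black Friday peak; the monthly difference times twelve comes to US$1,389,600.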
Full end-to-end tests, per-service CI pipelines, and feature flags meant engineers could deploy merges autonomously rather than waiting on a shared staging window.

### Business Outcome

- **Zero-downtime Black Friday 2025**: Holiday traffic of 490,000 TPS, 57% higher than the peak that caused the 18-minute outage in 2024, flowed through without any production incident
- **Enterprise customer NPS**: Rose from 36 (before migration) to 72 (after go-live), driven by improved SLO adherence and fewer failed payment errors
- **Net new merchant additions in Q4 2025**: Up 134% year-over-year, partially attributable to the reduced lead time enabling the sales team to commit to faster onboarding SLAs
- **Series C readiness**: The CEO cited the modernized infrastructure in the Series C pitch deck; the runway analysis showed 34 months of runway vs. 18 months projected under the legacy cost structure

### PCI Compliance Outcomes

The SAQ A completion in Month 5 was a milestone that transcended technical achievement. Moving cardholder data entirely outside of PayForge's internal systems eliminated the most burdensome compliance scope. Quarterly internal SAQ A attestation dropped to 85 engineering hours per quarter, freeing 2,870+ engineering hours per quarter for actual product delivery rather than compliance documentation.

---

## Key Metrics

| KPI | Key Number |
|---|---|
| Cost reduction (monthly infrastructure) | **−62%** (US$187k → US$71.2k) |
| Cost per transaction | **US$0.038** (−63%) |
| Peak TPS sustained in load test | **1,400,000 TPS** |
| P99 latency at 950k TPS | **140 ms** |
| SLO adherence at peak | **99.97%** |
| New integration lead time | **5 days** (was 4–6 weeks) |
| PCI quarterly eng-hours | **85 hrs** (was 3,000+) |
| Zero-downtime Black Friday | First in company history |
| NPS improvement | 36 → **72** |
| Annual ROI on modernization | **~US$1.1M freed in Year 1** |

---

## Lessons Learned

### 1. Profile Before You Rebuild

The audit phase was cheap insurance. It revealed that 84% of monolith execution time was in non-critical sidecar logic, a finding that would have been missed if the team had jumped straight into replicating existing behavior in microservices. Profiling tells you *what to kill*, not just *what to rebuild*.

### 2. Kill the Stateful Write Lock Early

The database write lock was the single largest bottleneck, accounting for 94% of write-path latency. Migrating to idempotent, append-only, async-write patterns resolved 90% of the latency and throughput issues without touching any acquirer code. The team's early focus on the write path generated outsized returns.

### 3. Isolate Before You Scale

Moving to Kubernetes namespaces with per-acquirer bulkheads meant a single erratic acquirer API could no longer take the entire platform down. Bulkheads broke the coupled dependency chain. They were a cheap tactical layer that bought massive reliability without any heavy architectural lifting beyond what Kubernetes provides.

### 4. Feature Flags Enable Zero-Risk Migration

The progressive canary rollout, starting at 1% and escalating only when all health metrics passed, made the six-month modernization executable with essentially zero impact on live merchant traffic. No rollback was ever triggered, but the safety net was there.

### 5. PCI Scope Reduction Has a Compounding Effect

SAQ A eligibility freed up 2,870+ engineering hours per quarter, enough to hire a dedicated fraud ML engineer whose work delivered a further 12% reduction in fraudulent transactions within 90 days of their first model deployment. Compliance work that had been scheduled out for 18 months became a two-week quarterly task overnight.

### 6. Cost is a Design Decision, Not an Afterthought

The original monolith was designed for velocity, not cost-efficiency. At 120k TPS, cost optimization was still manageable.
At 1M+ TPS, even small per-transaction cost differences compound to millions in annual savings. Cloud cost analysis needed to be a first-class architectural concern in the design phase, not a post-launch budget review.

---

## Conclusion

The six-month modernization of PayForge's payment infrastructure stands as a textbook example of how disciplined architecture (auditing before rewriting, decoupling before scaling, observing before asserting) can deliver outsized business outcomes. A 62% infrastructure cost reduction, an 11× scale increase, and the near-elimination of PCI compliance overhead occurred simultaneously, each reinforcing the others.

Black Friday 2025 became the first in company history with zero production incidents. Merchants closed more contracts. Engineers shipped faster. And management had the confidence to build a five-year technology roadmap on a platform capable of handling loads 11× its launch-day design spec.

The lesson: when scale is inevitable, the question is not whether to modernize but whether you modernize before the breaking point forces your hand, or after it has already become a crisis.

---

> **About the author:** Webskyne editorial is the engineering content and case study practice at Webskyne, helping companies distill complex technical transformations into clear, actionable narratives.

*keywords: fintech, payment processing, microservices, PCI DSS, cloud infrastructure, CQRS, Kubernetes, load testing, cost optimization, PCI compliance, payment orchestration*

---
