How PayCurrent Rebuilt Their Payment Gateway and Cut Latency by 62%

An inside look at how PayCurrent, a payment processing platform handling 2 million+ daily transactions across Southeast Asia, faced mounting reliability issues with its 8-year-old Node.js monolith and chose a pragmatic strangler fig migration to rebuild its gateway. Over six intense months, Webskyne partnered with the PayCurrent engineering team to replace brittle synchronous call chains, opaque observability, and risky in-place deployments with an event-driven, observable microservices architecture built on Kubernetes (AWS EKS) and Kafka. The migration strategy prioritized data boundaries, the outbox pattern for reliable event delivery, and feature flags as the integration seam. The result was dramatic across every dimension: 62% lower API latency, 99.998% uptime, 42% infrastructure cost savings, and a 40% improvement in developer velocity. More than a technical transformation, the project delivered lasting organizational clarity: independent deploy cycles, faster onboarding, and stronger merchant trust. This case study dissects the architecture decisions, the three-phase migration strategy, the chaos engineering hardening phase, and the transferable lessons learned for any team facing a similar modernization crossroads.

# How PayCurrent Rebuilt Their Payment Gateway and Cut Latency by 62%

## Overview

PayCurrent is a payment processing platform serving over 12,000 merchants across Southeast Asia. Founded in 2016, the company grew rapidly from a simple checkout widget into a full-stack payments infrastructure provider supporting recurring billing, risk scoring, and multi-currency settlement.

By 2024, their legacy gateway, built on a Node.js monolith with a patched-together Redis cache, was struggling under a daily load of more than 2 million transactions. Downtime incidents were becoming monthly occurrences. Customer support tickets related to failed transactions rose 18% quarter over quarter. Something had to change.

Partnering with Webskyne, PayCurrent embarked on a six-month modernization effort to rebuild their payment gateway from the ground up. This case study documents the challenge, the architectural decisions, the phased migration approach, and the measurable outcomes.

---

## The Challenge: A Codebase at Its Limit

By early 2024, PayCurrent’s engineering team was spending more time keeping the lights on than shipping features. The monolithic Node.js gateway had accreted eight years of patches, workarounds, and direct database access from multiple internal services. Three core symptoms made the problem impossible to ignore:

1. **Cascading failures:** A single slow query in the settlement module could put backpressure on the entire checkout flow. The team observed this monthly, usually coinciding with promotional spikes from large merchants.
2. **Opaque latency:** The only observability layer was a homegrown statsd wrapper dumping metrics into Grafana without correlation IDs. Tracing a failed transaction often required SSH-ing into production and grepping logs.
3. **Deployment risk:** Releases happened via a custom deployment script that updated the monolith in place. Rollbacks took minutes; during that window, the system was in an inconsistent state.
The last quarter saw two incidents directly tied to half-finished deployments.

The business stakes were equally high. Merchant contracts included strict SLAs around settlement timing and uptime. Repeated incidents triggered financial penalties and eroded trust with enterprise clients. Internally, developer burnout was real: senior engineers were leaving faster than they could be replaced.

---

## Goals

The stakeholder alignment phase produced four concrete, measurable goals:

- **Reduce p99 API latency from 890ms to under 350ms** under peak load. The existing median of 420ms was acceptable in isolation, but tail latency drove timeouts and retries, amplifying load.
- **Achieve 99.99% monthly uptime** with a clear runbook for every failure mode. Payments are infrastructure; downtime is revenue loss and brand damage.
- **Enable independent deploy cycles per domain** (checkout, risk, settlement) so that teams could ship without coordinating synchronized releases.
- **Reduce infrastructure cost by at least 30%** by eliminating overprovisioned VMs, consolidating stateful services, and adopting spot instances for stateless workers.

These goals were non-negotiable. PayCurrent’s board approved a six-month budget and two dedicated engineering pods from Webskyne alongside four internal engineers.

---

## Approach: Strangler Fig Over Big Bang

After evaluating three migration strategies (big-bang rewrite, incremental strangler fig, and blue-green clone), the team chose the strangler fig pattern with some deliberate variations. The reasoning was practical: a big-bang rewrite would mean 18+ months without feature parity, which the business could not stomach. A simple proxy layer would ease the transition but add latency during the cutover.

The chosen approach had these principles:

1. **Route by domain first.** Extract the risk-scoring service into its own deployable unit in month one, because it had the cleanest bounded context and the highest internal demand from other teams.
2. **Share nothing by default.** New services used isolated databases and async communication (Kafka topics). Shared state was treated as a design smell.
3. **Observability before functionality.** Each new service shipped with OpenTelemetry traces, structured logs, and SLO-defined alerts before it carried production traffic.
4. **Feature flags as the integration seam.** The legacy monolith and the new services coexisted behind feature flags managed by a lightweight LaunchDarkly-compatible configuration service.

This approach meant the legacy system shrank every week while new services absorbed traffic. Business stakeholders saw continuous improvement rather than a blackout period.

---

## Implementation: Architecture and Key Decisions

The implementation spanned six months and can be broken into three phases: foundation, extraction, and stabilization.

### Phase 1: Foundation (Weeks 1–4)

The first four weeks established the backbone that every subsequent service would depend on. The team stood up an AWS EKS cluster with strict pod security standards, deployed an Istio service mesh for mutual TLS and traffic shifting, and configured a Kafka cluster for inter-service events. Every new service received a standardized scaffold: Fastify for the HTTP layer, Prisma for database access, and a shared observability SDK.

One critical early decision was the adoption of the **outbox pattern** for all write operations. Instead of updating the database and publishing to Kafka as two separate operations, each write commits its state change together with an event row in a local outbox table, inside a single database transaction; a background poller then reliably publishes those rows to Kafka. This eliminated the dual-write problem. Strictly speaking, an outbox poller provides at-least-once delivery, so the exactly-once behavior the team relied on for settlement and risk events also depends on consumers handling duplicate events idempotently.

### Phase 2: Extraction (Weeks 5–14)

With the foundation in place, teams extracted services in parallel. The risk-scoring module was the first: it had a clear input (transaction JSON with customer and merchant metadata) and output (risk score 0–100).
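The outbox pattern adopted in Phase 1 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not PayCurrent's implementation: it uses Python with SQLite standing in for PostgreSQL and Prisma, a plain callable standing in for a Kafka producer, and hypothetical table and function names (`payments`, `outbox`, `confirm_payment`, `poll_outbox`, `LedgerConsumer`). It shows the state change and the event row committing in one transaction, and an idempotent consumer that tolerates redelivery.

```python
import json
import sqlite3


def init_db(conn: sqlite3.Connection) -> None:
    """Create a hypothetical payments table plus the outbox table."""
    conn.executescript(
        """
        CREATE TABLE payments (id TEXT PRIMARY KEY, status TEXT NOT NULL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            topic TEXT NOT NULL,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
        """
    )


def confirm_payment(conn: sqlite3.Connection, payment_id: str) -> None:
    """Record the state change and its event in one local transaction."""
    with conn:  # commits on success, rolls back if either INSERT fails
        conn.execute(
            "INSERT INTO payments (id, status) VALUES (?, 'confirmed')",
            (payment_id,),
        )
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("payment.confirmed", json.dumps({"payment_id": payment_id})),
        )


def poll_outbox(conn: sqlite3.Connection, publish) -> int:
    """Publish unpublished rows in id order and return how many were sent.

    If publish() raises, the row stays unpublished and is retried on the
    next poll, so delivery is at-least-once rather than exactly-once.
    """
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for event_id, topic, payload in rows:
        publish(event_id, topic, json.loads(payload))
        with conn:
            conn.execute(
                "UPDATE outbox SET published = 1 WHERE id = ?", (event_id,)
            )
    return len(rows)


class LedgerConsumer:
    """Idempotent consumer: ignores event ids it has already applied."""

    def __init__(self) -> None:
        self.seen: set[int] = set()
        self.ledger: list[dict] = []

    def handle(self, event_id: int, topic: str, data: dict) -> None:
        if event_id in self.seen:
            return  # duplicate delivery, already applied
        self.seen.add(event_id)
        self.ledger.append({"topic": topic, **data})


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    init_db(conn)
    consumer = LedgerConsumer()
    confirm_payment(conn, "pay_123")
    print(poll_outbox(conn, consumer.handle))
    print(consumer.ledger)
```

The design choice worth noting is that the poller marks a row as published only after `publish` returns, so a crash between the two steps causes a resend rather than a lost event. Deduplicating on the outbox row id in the consumer is one way to absorb those resends.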
Extracting it took three weeks and involved:

- Mapping all data dependencies and provisioning a dedicated PostgreSQL instance with point-in-time recovery.
- Replacing in-monolith promise chains with a request-reply pattern backed by a timeout circuit breaker.
- Replaying two weeks of production traffic to the new service with shadow reads, verifying that risk scores matched within acceptable variance.

The settlement module followed. This was the hardest extraction because it touched nearly every other bounded context. The team used a **database-per-service** model with event streams for cross-service state changes. For example, when the checkout service confirmed a payment, it published a `payment.confirmed` event; the settlement service consumed it and wrote to its own ledger. The old settlement database was retired gradually by routing reads through a dual-write proxy during the transition.

The checkout and notification modules were extracted last. Checkout required careful client-side coordination because the checkout.js SDK was cached on thousands of merchant storefronts. The team implemented a versioned API endpoint and merchant-specific feature flags so the SDK could be hot-swapped without breaking the user experience.

### Phase 3: Stabilization (Weeks 15–24)

The final ten weeks focused on hardening. Chaos engineering experiments with Gremlin tested the system’s resilience to instance failures, network partitions, and Kafka broker unavailability. Runbooks were written and tested in staging. A formal SLO review set error budgets and paging policies.

The strangler fig was officially retired in week 20 when traffic to the monolith dropped below 2% of peak volume. The remaining legacy traffic was for internal admin tools, which were left untouched due to low ROI. The monolith was decommissioned in week 24 after a final data consistency audit.

---

## Results

The modernization delivered on every stated goal, and in some cases exceeded them.
Here is what PayCurrent achieved:

**Latency Reduction:** The p99 API latency dropped from 890ms to 335ms, a 62% reduction. The mean latency fell from 420ms to 180ms. This was primarily driven by eliminating the synchronous monolith call chain. Where a checkout request previously traversed twelve internal function calls, it now hits a single service gateway and dispatches async events to risk and settlement.

**Reliability:** Uptime reached 99.998% over the twelve months following launch. The team attributes this to three factors: circuit breakers preventing cascading failures, isolated blast radius from the microservice architecture, and the observability layer that surfaced issues before customers reported them.

**Cost Efficiency:** Infrastructure spend decreased by 42%, beating the 30% target. Savings came from replacing overprovisioned virtual machines with autoscaling Kubernetes pods, using spot instances for non-critical batch jobs, and reducing data transfer costs through regional deployment.

**Developer Velocity:** Deployment frequency increased from once per week to three times per day per team. Code review turnaround dropped from three days to under one day because PRs touched smaller, more understandable codebases. Internal surveys showed a 35% improvement in developer satisfaction scores.

---

## Key Metrics

Below is a summary of the most important before-and-after metrics:

- **p99 Latency:** 890ms → 335ms (-62%)
- **Mean Latency:** 420ms → 180ms (-57%)
- **Monthly Uptime:** 99.92% → 99.998%
- **Infrastructure Cost:** -42%
- **Deployment Frequency:** 1×/week → 3×/day per team
- **Failed Transaction Rate:** 3.2% → 0.4%
- **Median Alert Response Time:** 45 minutes → 8 minutes

Merchant satisfaction scores, measured quarterly, rose from 3.6/5 to 4.7/5 over the year following launch. Enterprise renewals increased by 22% compared to the previous year.
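The headline percentages in the metrics list can be reproduced from the raw before-and-after figures, and the uptime numbers become easier to compare when converted into allowed downtime. The short sketch below does both. The helper names are hypothetical, and the downtime conversion assumes a 30-day month, which the case study does not specify.

```python
def pct_reduction(before: float, after: float) -> float:
    """Percentage reduction from `before` to `after`."""
    return (before - after) / before * 100


def downtime_minutes(uptime_pct: float, days: int = 30) -> float:
    """Minutes of downtime implied by an uptime percentage over `days` days."""
    return (1 - uptime_pct / 100) * days * 24 * 60


if __name__ == "__main__":
    print(round(pct_reduction(890, 335)))       # p99 latency: 62
    print(round(pct_reduction(420, 180)))       # mean latency: 57
    print(round(downtime_minutes(99.92), 1))    # before: 34.6 minutes per month
    print(round(downtime_minutes(99.998), 1))   # after: 0.9 minutes per month
```

Read this way, the move from 99.92% to 99.998% uptime corresponds to roughly 35 minutes of monthly downtime shrinking to under one minute, under the 30-day assumption.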

---

## Lessons Learned

The PayCurrent modernization offers several transferable lessons for engineering leaders considering a similar journey:

**Strangler fig beats big bang.** Every time. The incremental approach preserved business continuity and kept engineers motivated with visible progress. Decommissioning the monolith felt like a victory rather than a relief.

**Data boundaries define service boundaries.** The team initially tried to extract services by team structure, not data ownership. This created hidden coupling that slowed progress. Once they redrew boundaries around data lifecycles, extraction velocity doubled.

**Observability is not optional.** The investment in OpenTelemetry and structured logging before writing business logic paid for itself in the first month. A single paging alert caught a memory leak during a simulated traffic spike; had that leak reached production, the resulting incident would have cost more than the entire observability budget.

**Shadow traffic is your safety net.** Replaying production traffic to new services caught subtle logic mismatches that unit tests missed. The risk service, in particular, had edge cases around stale merchant configurations that only surfaced under real-world request patterns.

**Finance should review cost assumptions monthly.** The most surprising cost savings came from autoscaling and spot instances, not just from right-sizing VMs. Keeping finance in the loop ensured the savings were visible and reinforced the business case for continued platform investment.

---

## Conclusion

PayCurrent’s journey from brittle monolith to resilient, event-driven payment gateway demonstrates that modernization is not only possible but profitable. The technical wins (62% lower latency, near-perfect uptime, dramatically reduced infrastructure spend) were substantial. Equally important, the organizational transformation empowered engineering teams to ship with confidence.
For engineering leaders navigating similar modernization challenges, the PayCurrent case study is a reminder that the path forward is incremental, observable, and always grounded in business outcomes.

*This case study was produced by the Webskyne editorial team. For deeper technical guidance on payment system architecture, reach out to our engineering practice.*
