How SkyPay Cut Payment Latency by 68% Without Touching Their Checkout Flow

SkyPay was processing 2 million transactions per month across Southeast Asia when a 380ms p95 payment latency began silently eroding merchant revenue and support resources. The root causes were structural: database contention between fraud analytics and checkout queries, an incoherent Redis caching strategy, regional routing based on DNS ping rather than payment corridor latency, and message broker coupling that let batch reconciliation back up real-time payment streams. This case study details an eight-week strangler-fig migration that reduced p95 latency to 122ms — a 68% improvement — while maintaining 99.98% uptime through automated rollback triggers, dual-write settlement verification, and stratified canary traffic shifting. The team discovered that incremental migration beats big-bang replacement not by making fewer mistakes, but by designing the system so mistakes surface quickly, roll back cleanly, and turn into institutional learning. Three recurring pitfalls — deterministic canary sampling bias, hidden cache boundaries after decoupling, and rollback as a substitute for staging discipline — are examined with concrete mitigations that any platform team can adopt without a greenfield rewrite.

Overview

SkyPay operates a multi-tenant payment orchestration platform serving over 4,200 merchants across Southeast Asia. By early 2025, the platform had grown well beyond its original architecture: transaction volume had doubled in fourteen months, regional edge nodes were added without a unified caching strategy, and the primary PostgreSQL cluster was handling both transactional and analytical reads from the same hot replicas.

The business impact was measurable but diffuse. Merchants reported checkout timeouts, and those timeouts correlated with failed payments rather than successful ones. Support logs showed a consistent "Payment took too long" category that had grown 40% quarter over quarter. Meanwhile, the engineering team was fighting fires without a clear migration path, because any change to the payment core required a full maintenance window.

Challenge

The core challenge was not simply that the system was slow. It was that the slowness was structural, interleaved through four compounding layers:

Database contention: The primary read replica was being queried by both the fraud detection service (heavy analytical queries) and the checkout pipeline (low-latency row lookups). At peak hours, the replication lag spiked to 800ms.
Cache incoherence: Redis was used for session data and recent transaction lookups, but invalidation logic was inconsistent. Merchants saw stale authorization states, and the team had added aggressive TTLs as a workaround — which only made cache misses worse.
Regional routing: Edge nodes in Singapore and Jakarta were routing based on DNS latency, not payment corridor latency. A card-present transaction from Kuala Lumpur would sometimes hit the Jakarta edge even when Singapore had a 20ms faster path to the issuing bank.
Tight coupling: The checkout flow, fraud scoring, and settlement modules shared a single message broker cluster. A burst of batch reconciliation jobs could back up the payment.authorized event stream, introducing unpredictable delays for end users.

The team had explored targeted fixes — read replica splitting, cache TTL tuning — but each improvement delivered single-digit percentage gains. Structural change felt risky. A failed migration could block settlements or double-charge merchants. The business had explicitly told engineering that any downtime window had to be under four minutes, and only during low-traffic UTC windows.

Goals

Before any design was drawn up, the team converged on four measurable goals that would define success:

Reduce p95 payment latency from 380ms to below 150ms — measured at the edge node, covering the full authorize-settle lifecycle for card-present transactions.
Eliminate maintenance windows as a dependency for core changes — using blue-green deployments and feature flags to ship without planned downtime.
Decouple checkout from settlement reconciliation — so batch jobs never again block real-time user flows.
Retain 99.98% uptime SLA through the transition — with automated rollback triggers at failure-rate thresholds above 0.5%.

The goals were deliberately narrow. "Rewrite the platform" was not on the list. The constraint was to improve the system in place, migrating load and responsibility without rebuilding from scratch.

Approach

The chosen approach was a strangler-fig migration combined with progressive traffic shifting. Rather than performing a risky big-bang replacement, the team would incrementally route specific transactions through a new processing pipeline, validate correctness under production load, and grow traffic until the old pipeline drained to zero.

The migration was divided into three phases, each with its own canary group and acceptance criteria:

Phase 1 — Isolation: Spin up the new pipeline alongside the old one. Route a fixed 0.1% of traffic (tokenized, hash-based) to the new stack. Zero change to the user experience beyond a narrower latency distribution.

Phase 2 — Expansion: Ramp to 20% traffic by region. Introduce feature flags for regional rollout. Enable dual-write to both old and new settlement sinks, comparing outputs nightly.

Phase 3 — Cutover: Push to 100% traffic. Decommission old pipeline after a 72-hour burn-in with full observability still attached.

This phased approach addressed the team's primary fear — that a change could silently corrupt settlement data — by making correctness a continuously verified invariant rather than a hoped-for outcome.
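
The hash-based traffic split used across the three phases can be sketched as follows. This is a minimal illustration rather than SkyPay's actual router: the function names, the use of SHA-256, and the choice of a per-transaction token as the hash input are all assumptions. The share is expressed in basis points so that 0.1% can be represented as an integer.

```python
import hashlib

# 0.1% of traffic, expressed in basis points (1 bp = 0.01%).
CANARY_BASIS_POINTS = 10


def bucket_for(token: str, buckets: int = 10_000) -> int:
    """Map a token to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def route(token: str, canary_basis_points: int = CANARY_BASIS_POINTS) -> str:
    """Return 'new' for tokens inside the canary slice, 'old' otherwise."""
    return "new" if bucket_for(token) < canary_basis_points else "old"
```

Because the bucket depends only on the token, the same transaction is routed the same way on every retry. Under these assumptions, ramping to 20% is a matter of calling route(token, 2_000), and the cutover corresponds to route(token, 10_000).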

Implementation

The implementation spanned five workstreams that ran in parallel for eight weeks. Each workstream had a dedicated engineering lead, a product owner, and a daily 15-minute sync with the overall migration program manager. Because the workstreams were independent, the team could progress rapidly without creating bottlenecks. Weekly architecture reviews ensured that individual decisions stayed aligned with the broader migration strategy.

1. Caching Layer Redesign

The team replaced the ad-hoc Redis invalidation logic with a cache that is kept current from a single source of truth: the payment authorization database. It behaves like a write-through cache, except that the update is driven by the database rather than by the application. Every write to the authorizations table now emits a CDC (Change Data Capture) event via Debezium, which updates a dedicated Redis cluster using a structured key schema: tx:{merchantId}:{txnId}.

Read paths were split into two tiers. Hot data — transactions less than 30 minutes old — lives in Redis with a 10-minute sliding TTL. Historical lookups go directly to the read replica. This reduced cache size by 60% and eliminated the memory pressure that had triggered eviction storms during peak hours.
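
The two-tier read path can be sketched as follows. To keep the example self-contained, a small in-memory class stands in for the Redis cluster and a plain dictionary stands in for the read replica; both stand-ins, and every name other than the key schema, are assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Optional, Tuple

HOT_WINDOW_S = 30 * 60   # transactions younger than this count as hot
SLIDING_TTL_S = 10 * 60  # expiry is pushed forward on every cache hit


def cache_key(merchant_id: str, txn_id: str) -> str:
    """Build the tx:{merchantId}:{txnId} key described in the text."""
    return f"tx:{merchant_id}:{txn_id}"


class SlidingTtlCache:
    """In-memory stand-in for the Redis cluster (illustrative only)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[dict, float]] = {}

    def set(self, key: str, value: dict) -> None:
        self._data[key] = (value, self._clock() + SLIDING_TTL_S)

    def get(self, key: str) -> Optional[dict]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]
            return None
        self.set(key, value)  # slide the expiry forward
        return value


def lookup(cache: SlidingTtlCache, replica: dict, merchant_id: str,
           txn_id: str, txn_age_s: float) -> Tuple[Optional[dict], str]:
    """Serve hot transactions from cache, everything else from the replica."""
    if txn_age_s < HOT_WINDOW_S:
        hit = cache.get(cache_key(merchant_id, txn_id))
        if hit is not None:
            return hit, "cache"
    return replica.get((merchant_id, txn_id)), "replica"
```

Historical lookups never touch the cache in this sketch, which is one way the cached working set could stay bounded to recent transactions.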

2. Regional Routing Overhaul

Instead of DNS-based routing, the team introduced a payment-corridor-aware load balancer. Each edge node maintains a 30-second rolling average of latency to each issuing bank corridor. The routing layer selects the edge with the lowest projected corridor latency, not the lowest network ping. Corridor latency is measured via synthetic test transactions that do not affect production financials.

A/B testing on the router showed an 18% reduction in inter-edge hop latency for Malaysia-originated transactions. More importantly, the router self-heals: if an edge node's corridor latency rises above threshold, traffic is drained automatically within three health-check cycles.
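
A simplified version of the corridor-aware selection logic might look like the sketch below. The class and function names, the edge and corridor labels, and the drain threshold value are hypothetical. The sketch also excludes an edge as soon as its rolling average crosses the threshold, whereas the production router is described as draining over three health-check cycles.

```python
from collections import deque
from typing import Deque, Dict, List, Optional, Tuple

WINDOW_S = 30.0  # rolling window for corridor latency samples


class CorridorLatencyTracker:
    """Keeps a rolling average of latency per (edge, corridor) pair."""

    def __init__(self, window_s: float = WINDOW_S) -> None:
        self._window_s = window_s
        self._samples: Dict[Tuple[str, str], Deque[Tuple[float, float]]] = {}

    def record(self, edge: str, corridor: str, latency_ms: float, now: float) -> None:
        self._samples.setdefault((edge, corridor), deque()).append((now, latency_ms))

    def average(self, edge: str, corridor: str, now: float) -> Optional[float]:
        samples = self._samples.get((edge, corridor))
        if not samples:
            return None
        # Drop samples that have aged out of the rolling window.
        while samples and samples[0][0] <= now - self._window_s:
            samples.popleft()
        if not samples:
            return None
        return sum(latency for _, latency in samples) / len(samples)


def pick_edge(tracker: CorridorLatencyTracker, edges: List[str], corridor: str,
              now: float, drain_threshold_ms: float) -> Optional[str]:
    """Choose the edge with the lowest rolling corridor latency.

    Edges with no recent samples, or with an average above the threshold,
    are treated as drained and skipped.
    """
    candidates: Dict[str, float] = {}
    for edge in edges:
        avg = tracker.average(edge, corridor, now)
        if avg is not None and avg <= drain_threshold_ms:
            candidates[edge] = avg
    if not candidates:
        return None
    return min(candidates, key=lambda edge: candidates[edge])
```

Returning None when every edge is drained leaves the fallback policy to the caller, which seemed safer for a sketch than silently picking a degraded edge.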

3. Message Broker Partitioning

The RabbitMQ cluster was partitioned by message priority and domain. A dedicated checkout queue with low prefetch and consumer scaling now handles real-time events. A separate reconciliation queue uses delayed exchanges and dead-letter routing for batch jobs. The two queues share a federation link for audit events but nothing else.

Consumer scaling was automated via KEDA (Kubernetes Event-Driven Autoscaling). During the Lunar New Year traffic spike, the checkout consumer pool scaled from 8 to 47 pods in under ninety seconds. The reconciliation consumers scaled on a fixed daily schedule.
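
The replica arithmetic behind queue-driven autoscaling is simple enough to show directly. The function below approximates the calculation performed by the Kubernetes Horizontal Pod Autoscaler that KEDA drives for a queue-length target: the backlog divided by a per-pod target, rounded up and clamped to the configured bounds. The per-pod target and the upper bound in the usage note are assumptions, since the source does not state SkyPay's actual values.

```python
import math


def desired_replicas(queue_length: int, target_per_replica: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Approximate the replica count for a queue-length scaling target."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(queue_length / target_per_replica)
    # Clamp to the configured floor and ceiling.
    return max(min_replicas, min(max_replicas, wanted))
```

With a hypothetical target of 100 messages per pod, a floor of 8, and a ceiling of 50, a backlog of 4,700 messages yields desired_replicas(4700, 100, 8, 50) == 47, while an empty queue stays at the floor of 8.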

4. Dual-Write Settlement Pipeline

The most sensitive change was settlement. The team built a parallel settlement sink in the new PostgreSQL logical replication cluster. For every payment.settled event, the system writes to both the old financial ledger and the new ledger, marking the record with a source_pipeline tag.

A nightly reconciliation job compares the two ledgers. Any discrepancy triggers a PagerDuty incident with a full diff. In the first three weeks, the diff rate was under 0.02% and exclusively clock-drift artifacts. This gave the team enough confidence to expand traffic ramps.
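
A nightly comparison of this kind can be sketched as below. The record fields, the two-second drift tolerance, and the function name are assumptions; only the source_pipeline tag comes from the description above. The sketch reports timestamp drift separately from amount mismatches so that clock-drift artifacts can be told apart from genuine financial discrepancies.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class LedgerEntry:
    txn_id: str
    amount_minor: int     # amount in minor currency units, e.g. cents
    currency: str
    settled_at: float     # epoch seconds
    source_pipeline: str  # "old" or "new"; ignored when comparing


def diff_ledgers(old: Dict[str, LedgerEntry], new: Dict[str, LedgerEntry],
                 clock_drift_tolerance_s: float = 2.0) -> List[str]:
    """Return one human-readable line per discrepancy between two ledgers."""
    problems: List[str] = []
    for txn_id in sorted(old.keys() | new.keys()):
        a, b = old.get(txn_id), new.get(txn_id)
        if a is None or b is None:
            side = "old" if a is None else "new"
            problems.append(f"{txn_id}: missing in {side} ledger")
            continue
        if (a.amount_minor, a.currency) != (b.amount_minor, b.currency):
            problems.append(
                f"{txn_id}: amount mismatch "
                f"{a.amount_minor} {a.currency} vs {b.amount_minor} {b.currency}"
            )
        elif abs(a.settled_at - b.settled_at) > clock_drift_tolerance_s:
            problems.append(
                f"{txn_id}: timestamp drift of {abs(a.settled_at - b.settled_at):.1f}s"
            )
    return problems
```

An empty result means the two ledgers agree within tolerance; any non-empty result would be the payload attached to the incident.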

5. Observability and Rollback

New pipeline metrics were injected into the existing Datadog dashboard with a pipeline:new tag. Rollback was defined as a feature-flag switch: if the new pipeline's error rate exceeded 0.5% for more than two minutes, or its p95 latency exceeded 200ms for more than five minutes, the system automatically reverted to 100% old-pipeline traffic and raised an alert.

The rollback was tested weekly during off-peak hours. In one exercise, a misconfigured Redis connection pool caused a 4% error rate. Once the trigger fired, the system rolled back in 47 seconds. The team's on-call engineer received the PagerDuty alert 12 seconds after the threshold breach.
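
The two rollback conditions are both "sustained breach" checks, which can be sketched as follows. The class names and the polling model (one call per metrics sample, with an explicit timestamp) are assumptions; the thresholds and hold times are the ones stated above. How the feature flag is actually flipped is outside the scope of the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SustainedBreach:
    """Fires once a value has stayed above a threshold for longer than hold_s."""

    threshold: float
    hold_s: float
    _since: Optional[float] = None

    def update(self, value: float, now: float) -> bool:
        if value <= self.threshold:
            self._since = None  # recovery resets the timer
            return False
        if self._since is None:
            self._since = now
        return now - self._since > self.hold_s


class RollbackTrigger:
    def __init__(self) -> None:
        self.error_rate = SustainedBreach(threshold=0.005, hold_s=120.0)
        self.p95_ms = SustainedBreach(threshold=200.0, hold_s=300.0)

    def should_roll_back(self, error_rate: float, p95_ms: float, now: float) -> bool:
        # Evaluate both checks on every sample so neither timer goes stale.
        error_breach = self.error_rate.update(error_rate, now)
        latency_breach = self.p95_ms.update(p95_ms, now)
        return error_breach or latency_breach
```

Requiring the breach to be sustained trades a slower reaction for fewer spurious rollbacks from momentary spikes, which appears to be the intent of the two-minute and five-minute windows.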

Results

Seventy-two hours after hitting 100% traffic, the metrics were unambiguous:

p95 latency dropped from 380ms to 122ms — a 68% improvement.
p99 latency dropped from 950ms to 210ms — eliminating the tail-of-distribution timeouts that had been frustrating merchants.
Cart abandonment related to payment timeouts fell by 41% — based on anonymized merchant funnel data across the top 50 accounts.
Support tickets in the "Payment took too long" category dropped by 54%.
Database replication lag stabilized at 15ms or below across all regions, because the new pipeline's read profile was now 80% cache hits.
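Zero double-charges and no genuine settlement mismatches: the dual-write verification held through the entire transition, and the residual 0.02% diff rate consisted of clock-drift artifacts rather than financial discrepancies.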

There were secondary gains that the team only realized in retrospect. Alert noise from database-related incidents dropped by roughly 35% because the new pipeline's separation of concerns eliminated many intermediate-failure auto-recovery loops. Team confidence rose: engineers who had previously avoided changes to the payment core began submitting improvements through the normal pull-request process.

Metrics Summary

The full dashboard snapshot is embedded below for reference. Key figures are reproduced in the table:

Metric                    Before       After       Change
p95 Payment Latency       380ms        122ms       -68%
p99 Payment Latency       950ms        210ms       -78%
Payment Timeout Errors    3.2%         0.4%        -87%
DB Replication Lag        340ms avg    12ms avg    -96%
Settlement Mismatches     0.31%        0.02%       -93%
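
The Change column is a plain relative difference, and the latency rows can be checked with a few lines of arithmetic. The helper name below is illustrative.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# Latency rows from the table: (metric, before in ms, after in ms).
rows = [
    ("p95 Payment Latency", 380.0, 122.0),
    ("p99 Payment Latency", 950.0, 210.0),
]
for name, before, after in rows:
    print(f"{name}: {pct_change(before, after):.1f}%")
```

For the p95 row this gives (122 - 380) / 380, or about -67.9%, which rounds to the 68% improvement quoted throughout.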

Lessons Learned

Not everything went smoothly. The team made mistakes that proved as instructive as the successes.

Lesson 1: Canary selection was too deterministic at first

Initial traffic shifting selected 0.1% of transactions based on a hash of the merchant ID. Unfortunately, one merchant that landed in the canary group accounted for 22% of its payloads and had the most aggressive fraud rules. Their traffic dominated the canary metrics, making the new pipeline look worse than it was. The team wasted two days debugging latency spikes that were entirely caused by sampling bias. The lesson: canary traffic should be sampled randomly or stratified by volume class, not selected deterministically on a single field.
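
One way to stratify by volume class is sketched below. The class boundaries, function names, and input shapes are hypothetical, and this is not necessarily the scheme the team adopted. Each class contributes the same fraction of its own traffic, chosen at random per transaction, so canary metrics can be read per class instead of being dominated by whichever merchants a hash happens to select.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def volume_class(monthly_txns: int) -> str:
    """Bucket a merchant by monthly volume (boundaries are illustrative)."""
    if monthly_txns >= 100_000:
        return "large"
    if monthly_txns >= 10_000:
        return "medium"
    return "small"


def stratified_sample(transactions: List[Tuple[str, str]],
                      merchant_volumes: Dict[str, int],
                      rate: float,
                      rng: random.Random) -> Dict[str, List[str]]:
    """Sample roughly `rate` of the transactions within each volume class.

    `transactions` holds (txn_id, merchant_id) pairs. The result maps each
    volume class to the transaction IDs chosen for the canary.
    """
    by_class: Dict[str, List[str]] = defaultdict(list)
    for txn_id, merchant_id in transactions:
        by_class[volume_class(merchant_volumes[merchant_id])].append(txn_id)

    sample: Dict[str, List[str]] = {}
    for cls in sorted(by_class):
        txns = by_class[cls]
        k = max(1, round(len(txns) * rate))  # keep every class represented
        sample[cls] = rng.sample(txns, k)
    return sample
```

Passing a seeded random.Random makes a given sample reproducible for debugging without tying selection to any single field of the transaction.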

Lesson 2: Observed correctness does not guarantee eventual consistency

The dual-write reconciliation was catching discrepancies nightly, but the pipeline was still reading stale data from the old cache during the expansion phase. A small subset of merchants experienced duplicate pre-authorization holds for about two weeks — not because the settlement was wrong, but because the fraud engine was querying two slightly divergent states. The lesson: when decoupling systems, cache invalidation boundaries become as important as data boundaries.

A secondary insight from this mistake was the importance of monitoring not just the new system but the integration points between old and new. The team had assumed that because the dual-write reconciliation was clean, the entire system was consistent. In reality, the fraud engine was a third consumer that sat outside the canonical data flow and had its own caching layer. This pattern of hidden consumers shows up frequently in legacy systems and is worth auditing early in any migration.

Lesson 3: Rollback is not a strategy, it is a safety net

The automated rollback saved the team once, but they had originally designed it as the primary safety mechanism. After the first rollback (caused by a misconfigured pool size), the team added manual pre-flight checks and a 10-minute observation window before enabling any ramp beyond 5%. The lesson: rollback is there to catch the unexpected, not to substitute for careful staging.

Looking back, the team also wishes they had invested more in load testing before the first traffic exposure. Synthetic testing had shown the new pipeline handling double the expected peak load comfortably. However, synthetic workloads do not perfectly mimic production transaction shapes, and the team spent the first week of canary exposure tuning connection pools that had been sized for ideal distributions rather than chaotic real traffic. Future migrations now include a week of shadow traffic before any real traffic is introduced.
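
Shadow traffic of this kind can be sketched as a wrapper that always answers from the existing pipeline while exercising the new one on the side. The handler signatures and names are assumptions. A production version would run the shadow call asynchronously and would need to guarantee that the shadow path triggers no real authorizations or settlements.

```python
from typing import Any, Callable, List, Tuple


def handle_with_shadow(request: Any,
                       primary: Callable[[Any], Any],
                       shadow: Callable[[Any], Any],
                       mismatches: List[Tuple[Any, str]]) -> Any:
    """Serve from the primary pipeline and record shadow divergences."""
    response = primary(request)
    try:
        shadow_response = shadow(request)
    except Exception as exc:
        # A shadow failure must never affect the user-facing response.
        mismatches.append((request, f"shadow raised {exc!r}"))
    else:
        if shadow_response != response:
            mismatches.append(
                (request, f"expected {response!r}, got {shadow_response!r}")
            )
    return response
```

The caller always receives the primary response, so divergences and crashes in the new pipeline surface only in the mismatch log.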

Conclusion

The SkyPay migration shows that a well-managed strangler-fig approach can deliver dramatic performance improvements without sacrificing reliability. The path was not a rewrite; it was a disciplined, incremental transfer of responsibility from an aging monolith to a modern, decoupled pipeline.

The same pattern is now being applied to the platform's reconciliation engine and merchant reporting API. Each migration reuses the playbook: stratify canaries, verify correctness continuously, automate rollback, and measure every step against the baseline.

For teams facing similar legacy-performance debt, the message is simple. You do not need a greenfield project to achieve gains like the roughly threefold p95 improvement described here. You need a clear migration boundary, a reliable canary strategy, and the patience to let the traffic shift reveal the truth, rather than the hope that a big-bang cutover will somehow work.