Webskyne

20 May 2026 · 16 min read

From DB Lock-Contention to 11× Throughput: How Finstack Built a Zero-Downtime Payments Engine

In late 2024, Finstack, a digital payments provider processing 8 million transactions monthly for micro-merchants in Southeast Asia, sat one regulation away from a three-day platform outage. A queue deep-dive revealed the root cause: a single PostgreSQL write path in the core ledger, with no idle compute and more than 1,200 500ms retries per second leaking edge cases into downstream microservices. This case study traces every technical decision that followed, from the architectural diagnosis and 90-day refactor sprint to the code changes, the live-brownout migration, and the post-go-live lessons that reshaped how the entire billing and partnership team writes distributed systems. It is a story not just of performance, but of governance, team structure, and the discipline required to rewrite core software beneath a live production platform.

Case Study, Payments, PostgreSQL, Event Sourcing, Distributed Systems, Performance Engineering, CQRS, Microservices, Observability

When a single PostgreSQL row-lock became a platform-wide crisis, a 12-person engineering team had 90 days to rewrite the heart of a critical billing system — without dropping a single payment. Here is exactly how they did it.

[Image: Server room infrastructure and cloud servers]
Finstack operates across six cloud regions with a median end-to-end payment latency of 87ms.

Part 1 — Overview

Finstack is a Bangkok-based payments infrastructure company founded in 2021. By September 2024, it had reached approximately 760,000 active micro-merchants across Thailand, Vietnam, and the Philippines, processing a peak daily volume of roughly 320,000 transactions. The core product is a lightweight WordPress/WooCommerce plugin and a mobile SDK that embed card acceptance, QR payments, bank transfers, and e-wallet top-ups directly into merchant checkouts. Finstack's value proposition rests on three characteristics: the lowest possible settlement friction (typically T+2 to merchant banks), native multi-currency routing, and post-payment analytics dashboards that let merchants track churn and recovery patterns without a custom billing stack.

In January 2024 the Billing and Partnership team launched a subscription plan feature on the merchant dashboard. The feature allowed merchants to define price tiers and frequency rules, and Finstack to collect recurring payments automatically. The launch was quiet — roughly 18,000 merchants opted in during the first two months. By July, after accelerated go-to-market campaigns, that number had reached 210,000. The subscription plan engine was running on a single sharded PostgreSQL table with an intentional design choice: to avoid distributed transaction complexity early on, all denomination logic — currency conversion, tax, and settlement amounts — was co-located in the same schema and resolved in a single strongly consistent transaction.

Part 2 — The Challenge

The first signs of trouble arrived in August 2024 as a series of 500-interval spikes during peak processing hours (18:00–22:00 ICT), the 4-hour window when micro-merchants in Thailand typically see 62% of their daily volume. The engineering on-call rotation received alerts within minutes of each spike, but internal tools at the time — Grafana dashboards wired to Eclipse CXF-generated alert loops — flagged only volume dips and latency percentiles. They did not surface lock contention metrics.

The root cause was not immediately obvious because the observable symptom was always a downstream timeout: the settlement microservice, sitting 2 hops removed from the billing table inside a coloc region, would stop receiving responses from the subscription engine within 150ms of each lock acquisition attempt. The settlement service would mark the entire batch as failed, initiating an immediate retry loop across all 15 replicas with a 28x fan-out — effectively saturating the healthy parts of the queue with redundant dead-letter traffic.

A background queue deep-dive in late September confirmed it: the billing_subscription_plan table had a unique composite primary key across merchant_id, plan_token, and effective_date. Each subscription invocation — running roughly 38 times per merchant account per scheduled renewal window — acquired a row-level lock on the parent plan record, held it for the duration of the invoice generation step (including a synchronous call to a legacy currency conversion service averaging 820ms round-trip), and released it only after the entire billing transaction committed.
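A back-of-the-envelope model shows why holding a row lock across an 820ms call is so costly. The sketch below assumes the 38 invocations for one plan record serialize fully behind that lock and ignores everything else in the transaction. The 5ms figure for a purely local lookup is a hypothetical comparison value, not a Finstack measurement.

```python
# Rough model: requests that need the same plan row run one at a time,
# so wall-clock time is (number of invocations) x (lock hold time).

def serialized_seconds(invocations: int, hold_ms: float) -> float:
    """Seconds for `invocations` requests to pass through one row lock."""
    return invocations * hold_ms / 1000.0


def max_rate_per_row(hold_ms: float) -> float:
    """Upper bound on requests per second through a single locked row."""
    return 1000.0 / hold_ms


# Figures from the text: 38 invocations per renewal window, 820ms hold.
slow = serialized_seconds(38, 820)

# Hypothetical comparison: lock held only for ~5ms of local work.
fast = serialized_seconds(38, 5)

print(f"lock held across remote call: {slow:.2f}s per merchant window")
print(f"lock held across local work:  {fast:.2f}s per merchant window")
print(f"ceiling per row at 820ms hold: {max_rate_per_row(820):.2f} req/s")
```

Under these assumptions a single plan row cannot serve much more than one request per second, which is why moving the remote call out of the locked section matters more than tuning the database.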

The retention logic that protected against creating duplicate recurring invoices added a secondary overhead: it ran a FOR UPDATE SKIP LOCKED query on the same rows, re-acquiring the exact same write lock before any insert and doubling the contention on high-volume merchant accounts with more than 3 sitting plan definitions. With 204 merchant accounts hitting exactly that threshold simultaneously, thread-pool saturation within the billing service spiked from the planned 350 concurrent goroutines to more than 12,000 blocked transactions. The result was a backlog of approximately 4,200 payment authorization requests that simply stalled: no timeout, no error message, just a hanging transaction waiting on a row lock. The failure never surfaced because the caller path swallowed the error and returned a generic 200 OK with a zero-amount payload, which downstream consumers interpreted as a failed but recoverable operation.

"The chaos was beautiful in its simplicity. Once you traced the transaction left to right, the bottleneck was a single row in a single table, held open through a third-party API call no one had audited in 18 months. You cannot solve what you are not measuring."
— Chanya R., Head of Platform Engineering at Finstack

Part 3 — Goals and Non-Negotiable Constraints

The remediation plan was drafted and approved by the Finstack executive team on October 3, 2024 with three stated goals and four hard constraints that could not be negotiated away.

Goals

  • I1 — Elimination of row-level lock contention: The billing subscription engine must handle a sustained load of 1,000 concurrent invoice-generation requests with no queue depth exceeding 500 and no single request exceeding 200ms p99 downstream latency.
  • I2 — Zero payment loss during migration: Every payment webhook the billing service received must be processed exactly once with exactly the correct monetary outcome. Not "eventually" — exactly, within a single billing period, with reconciliation proofed by auditors at globex audit.
  • I3 — Sub-50ms settlement latency post-go-live: The settlement microservice must be able to consume and publish a settlement record end-to-end — from invoice record creation to merchant bank instruction — in under 50ms p99, not the 180ms baseline in September 2024.

Non-Negotiable Constraints

  • N1 — No downtime wipe: There was no data-maintenance window large enough to migrate the 210-million-row ledger_transaction table to a new partition in one shot.
  • N2 — No PCI scope expansion: The billing revision could not introduce a new data classification that would require PCI-DSS re-certification within the 90-day window.
  • N3 — No breaking API contract: Every public and internal API surface consumed by 12 external schema services — accounting, revenue, treasury, compliance — was locked in. Field names, response shapes, and error codes could not change.
  • N4 — No budget over 120 engineer-hours: The approved engineering resource allocation capped at 120 hours across the 12-person billing platform team over the 90-day window.

Part 4 — Approach

The Finstack team arrived at the solution through a structured seven-day architecture review, involving the billing team, the queue-ops team, the compliance team, and two external consultants specializing in payment-rail architecture from Singapore. The approach decomposed into four coordinated strategies.

Strategy 1 — Decomposition: Event-Sourced Invoices, Not Monolith Transactions

The core principle of the redesign was the introduction of an event-sourced invoice lifecycle. Rather than holding a write lock across the synchronous currency conversion call (averaging 820ms per call), the billing service was refactored to emit an InvoiceInitiated domain event into a Kafka stream partitioned by merchant account. A separate, stateless invoice-orchestrator consumed those events and performed the currency lookup against a locally cached rate table (refreshed from the central treasury service once per hour, not per invoice). The currency conversion was idempotent within a 24-hour window, so errors could be safely retried without duplicate billing, and it could therefore run asynchronously without violating goal I2.
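The flow described above can be sketched in a few lines. This is an illustration under stated assumptions, not Finstack's code: the event fields, the `event_id` idempotency key, the 8-partition count, the SHA-256 partitioner, and the sample rate are all hypothetical, and an in-memory dict stands in for both Kafka and the invoice store.

```python
from dataclasses import dataclass
from decimal import Decimal
import hashlib


@dataclass(frozen=True)
class InvoiceInitiated:
    event_id: str      # assumed idempotency key
    merchant_id: str
    amount: Decimal
    currency: str


def partition_for(merchant_id: str, partitions: int = 8) -> int:
    """Stable partition choice, so one merchant's events stay in order."""
    digest = hashlib.sha256(merchant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partitions


class InvoiceOrchestrator:
    """Stateless apart from the cached rate table and the processed-event log."""

    def __init__(self, rates: dict):
        self.rates = rates      # locally cached rate table (refreshed elsewhere)
        self.invoices = {}      # event_id -> converted amount

    def handle(self, event: InvoiceInitiated) -> Decimal:
        # A redelivered event returns the stored result instead of billing twice.
        if event.event_id in self.invoices:
            return self.invoices[event.event_id]
        rate = self.rates[event.currency]   # local lookup, no remote call
        converted = (event.amount * rate).quantize(Decimal("0.01"))
        self.invoices[event.event_id] = converted
        return converted


orchestrator = InvoiceOrchestrator({"VND": Decimal("0.0014")})
event = InvoiceInitiated("evt-1", "m-42", Decimal("100000"), "VND")
first = orchestrator.handle(event)
second = orchestrator.handle(event)   # simulated retry of the same event
print(first, second, len(orchestrator.invoices))
```

The point of the sketch is that no lock is held while the rate is resolved, and that a retry of the same event cannot create a second invoice.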

The billing database itself was refactored to a CQRS-style write model with a hydration fetch pattern. The billing_subscription_plan table was split: plan definitions (immutable, write-once fields) moved to an append-only event table, while the mutable state — current effective amount, next billing date, status — was stored in a compact hot cache (Redis Cluster) with a 72-hour expiry and a PostgreSQL read-replica as the source of truth hydration path. Row-level locks on the write path were eliminated almost entirely; the only remaining read/write contention was on the billing_ledger_entry table, which was addressed separately.
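A minimal sketch of the hot-cache-with-hydration idea follows. It uses an in-memory dict in place of Redis Cluster and a plain function in place of the PostgreSQL read-replica query; the class name, `load_plan_state`, and the injected clock are illustrative assumptions.

```python
import time


class HydratingCache:
    """Hot cache with expiry; misses fall back to a loader (the hydration path)."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}   # key -> (value, expires_at)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]              # cache hit, no database round trip
        value = self._loader(key)        # miss or expired: hydrate from the replica
        self._entries[key] = (value, now + self._ttl)
        return value

    def put(self, key, value):
        """Write-side update of the mutable plan state."""
        self._entries[key] = (value, self._clock() + self._ttl)


loads = []
fake_now = [0.0]   # controllable clock so the expiry can be shown without waiting


def load_plan_state(plan_token):
    # Stand-in for a read-replica query.
    loads.append(plan_token)
    return {"plan_token": plan_token, "status": "active"}


cache = HydratingCache(load_plan_state, ttl_seconds=72 * 3600,
                       clock=lambda: fake_now[0])
cache.get("plan-a")
cache.get("plan-a")             # served from the cache
fake_now[0] = 72 * 3600 + 1     # just past the 72-hour expiry
cache.get("plan-a")             # hydrated again
print(loads)
```

Injecting the clock is a design choice made for the example: it keeps the expiry behaviour testable without sleeping for 72 hours.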

Strategy 2 — Queue Management and the Dead-Letter Contamination Problem

The secondary issue, settlement queue saturation caused by retried dead-letter traffic, required a three-stage fix. First, a dead-letter poison pill filter was added as a prefetch hook: any message that had been retried more than twice without a successful acknowledgement was quarantined to a separate deep-queue and would not be forwarded to the settlement service until a human manually reviewed and re-enqueued it. Second, the settlement service was upgraded from a fan-out pattern (15 replicas calling individual downstream handlers) to a connection-bounded batch consumer with a ceiling of five concurrent settlement workers per partition and a strict per-connection backpressure limit. Third, a circuit-breaker pattern was applied at the queue admission layer: when the settlement downstream latency exceeded 150ms for more than 30 consecutive seconds, the queue gate automatically throttled new messages from settlement-driven billing events, triggering a 30-second cool-down that let the queue drain before resuming.
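The first and third stages can be sketched as follows. The thresholds (more than two retries, 150ms, 30 seconds, 30-second cool-down) come from the text; the class names, the `retries` message field, and the explicit `now` argument are assumptions made to keep the example deterministic.

```python
from collections import deque

MAX_RETRIES = 2   # from the text: quarantined after more than two retries


class PoisonPillFilter:
    """Prefetch hook: repeatedly retried messages go to a quarantine queue."""

    def __init__(self):
        self.quarantine = deque()

    def admit(self, message: dict) -> bool:
        if message.get("retries", 0) > MAX_RETRIES:
            self.quarantine.append(message)   # held for manual review
            return False
        return True


class LatencyGate:
    """Stops admitting when latency stays high, then reopens after a cool-down."""

    def __init__(self, threshold_ms=150, trip_after=30, cooldown=30):
        self.threshold_ms = threshold_ms
        self.trip_after = trip_after
        self.cooldown = cooldown
        self._slow_since = None      # start of the current slow streak
        self._closed_until = None    # end of the current cool-down

    def observe(self, latency_ms: float, now: float) -> None:
        if latency_ms <= self.threshold_ms:
            self._slow_since = None  # streak broken by a healthy sample
        elif self._slow_since is None:
            self._slow_since = now
        elif now - self._slow_since > self.trip_after:
            self._closed_until = now + self.cooldown
            self._slow_since = None

    def admits(self, now: float) -> bool:
        return self._closed_until is None or now >= self._closed_until


gate = LatencyGate()
gate.observe(200, now=0)
gate.observe(200, now=31)   # slow for more than 30 consecutive seconds
print(gate.admits(now=31), gate.admits(now=61))
```

A production version would need shared state across consumers and a real clock; the sketch only shows the two decision rules.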

Strategy 3 — Partitioning and Controlled Consistency

The billing_ledger_entry table (210 million rows at the start of the project) was partitioned by month, with a raw-table fallback for orphaned pre-migration rows. This allowed the team to migrate each monthly partition sequentially without the downtime ruled out by constraint N1. A hash-based shard key was added on the merchant_id column, with sixteen partitions spread across two physical database nodes, distributing the write load across nodes while maintaining a cold-read fallback against any frozen partition for audit traces covered by Treasury's 7-year retention requirement.
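As an illustration of how a ledger entry could be routed under this layout, the sketch below combines a month partition with a hash shard on merchant_id. The partition naming scheme, the SHA-256 hash, and the shard-to-node rule are assumptions; PostgreSQL's built-in hash partitioning uses its own hash functions and would not produce these exact values.

```python
import hashlib
from datetime import date

SHARDS = 16   # from the text: sixteen hash partitions
NODES = 2     # from the text: two physical database nodes


def shard_for(merchant_id: str) -> int:
    """Deterministic shard in [0, SHARDS) derived from the merchant id."""
    digest = hashlib.sha256(merchant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % SHARDS


def node_for(shard: int) -> int:
    """Hypothetical placement rule: alternate shards across the nodes."""
    return shard % NODES


def month_partition(entry_date: date) -> str:
    """Hypothetical naming scheme for the monthly partitions."""
    return f"billing_ledger_entry_{entry_date:%Y_%m}"


shard = shard_for("merchant-1001")
print(month_partition(date(2024, 10, 22)), shard, node_for(shard))
```

Because the month partition depends only on the entry date, a month can be migrated or frozen on its own while the hash shard keeps any single merchant's writes on one node.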

Strategy 4 — Observability Architecture as a First-Class Citizen

Perhaps the most critical engineering decision wasn't a code change at all — it was the decision to build the observability plane concurrently with the billing platform refactor. Before the refactor, metrics were collected ad hoc by engineers monitoring Grafana dashboards built on ad-hoc Prometheus push models; lock acquisition metrics were not instrumented, mutation rates were not tracked, and queue saturation had no correlated trace ID propagation across service boundaries.

The new observability layer integrated OpenTelemetry and Jaeger traces across every billing request, with spans instrumented at the database lock-acquisition level, at the Kafka emit stage, at the settlement handoff, and at the external currency conversion boundary. Custom Prometheus metrics were added for row-lock wait time (p50/p95/p99), dead-letter depth per merchant, and fan-out retry ratio. These indicators had been invisible in September 2024 but were monitored continuously after the new monitoring layer went live in early January. Every member of the billing platform team was required to participate in a weekly post-incident metrics review, and the process included a mandatory write-audit in which the engineers who had designed a given service also wrote a test case validating that the incident's root-cause metric was now instrumented and alarming within a five-minute detection window.
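To make the value of tail percentiles concrete, here is a small sketch that summarises lock-wait samples with a nearest-rank percentile. The sample values are invented for illustration, and a real deployment would more likely rely on histogram metrics in the monitoring system than on raw samples.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical lock-wait samples in milliseconds: mostly fast, one stalled.
waits_ms = [2, 3, 3, 4, 5, 6, 8, 12, 40, 1180]

summary = {p: percentile(waits_ms, p) for p in (50, 95, 99)}
print(summary)
```

With these samples the median is 5ms while p95 and p99 both land on the 1,180ms outlier, which is the kind of gap that a dashboard showing only averages or medians would hide.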

Part 5 — Implementation

The 90-day sprint was structured into three thirty-day phases — Canal, Sail, Harbor — a naming convention chosen by the team that carried through in the commit history and dashboard labels. Canal was the research and prototyping phase. Sail was the component build and integration testing phase. Harbor was the staged rollout.

During Canal, a group of four backend engineers worked as two competing pairs over two weeks: one pair built a proof-of-concept event-sourced invoice engine running on a volunteer cluster against synthetic load, and the second pair built a dead-letter filter prototype on a staging MQ cluster. Both proofs of concept were benchmarked against a synthetic load of 1,500 concurrent invoice generation requests before any production code was touched. The Canal phase ended with a green-light executive review in which both proofs of concept exceeded their load targets by at least 2.4× (invoice engine p99 at 47ms versus the 200ms goal; dead-letter filter consuming at 2,400 msg/s versus the 1,000 msg/s required).

Sail began in August. The team operated in a trunk-based development model with a mandatory +1 code-review gate and a CI pipeline that ran a full 30,000-load synthetic billing benchmark against every PR merge before the build was green-lit. The rate table caching layer was built and deployed as a hot-patch to the schema service without any database outage. The invoice orchestration service was built as a separate HTTP service with its own deployment pipeline. Throughout the Sail phase, the subteam ran daily load tests at 1.2× peak volume (approximately 380,000 transactions simulated across a 24-hour window), using production-journal replay built from anonymised live billing service logs.

Harbor — the staged rollout — began in early October. The rollout used a manual flag-based feature gate that allowed the billing team to gradually introduce the new invoice engine at 1%, 5%, 25%, 50%, and 100% of live traffic by merchant group, with a 48-hour observation window at each gate. Merchant groups at each gate were deliberately chosen across the three country verticals (Thailand, Vietnam, Philippines) and across the three merchant-size tiers (micro, SME, enterprise) to surface any hidden class of issue.
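The text describes a manual flag with hand-picked merchant groups. The sketch below swaps in a different, commonly used technique, deterministic hash bucketing, to show one property any staged gate needs: a group enabled at an early gate stays enabled at every later gate. The bucket function and the group names are hypothetical.

```python
import hashlib

GATES = (1, 5, 25, 50, 100)   # rollout percentages from the text


def bucket(merchant_group: str) -> int:
    """Stable bucket in [0, 100) for a merchant group."""
    digest = hashlib.sha256(merchant_group.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def uses_new_engine(merchant_group: str, rollout_percent: int) -> bool:
    """True if this group is routed to the new invoice engine at this gate."""
    return bucket(merchant_group) < rollout_percent


groups = [f"group-{i}" for i in range(200)]
for pct in GATES:
    enabled = sum(uses_new_engine(g, pct) for g in groups)
    print(f"gate {pct:>3}%: {enabled} of {len(groups)} groups on the new engine")
```

A hand-curated list, as Finstack used, gives more control over which countries and merchant tiers are exposed at each gate; hashing trades that control for zero bookkeeping.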

The 50% gate was reached on October 19, 2024 — four days ahead of schedule. The billing team elected to extend the 50–100% gate observation window from 48 to 68 hours to soak the full peak-volume window. At 100% go-live on October 22, the engineering team held a dedicated go-live watch — monitoring the new Tailwind observability dashboard in conjunction with the on-call rotation from 14:00 to 02:00 ICT.

Part 6 — Results and Metrics

The post-go-live results exceeded all three stated goals and were tracked over a full 90-day review period before this case study was written.

| Metric | September 2024 (Pre-Remediation) | October 2024 (Go-Live) | Target | Result |
| --- | --- | --- | --- | --- |
| Invoice gen. latency p99 | 1,240ms | 48ms | <200ms | ✅ 26× better |
| DB lock-wait time p99 | 1,180ms | 3ms | | ✅ 393× reduction |
| Throughput (invoices/s) | 210/s peak | 2,340/s sustained | 1,000/s | ✅ 11× improvement |
| Settlement latency p99 | 187ms | 31ms | <50ms | ✅ 6× better than baseline |
| Dead-letter queue depth | 4,200 (saturated) | average 47 | <500 | ✅ 98.9% reduction |
| Payment accuracy | 99.92% | 100.00% | 99.99% | ✅ Perfect run over 90 days |
| Monthly uptime SLA | 98.3% | 100.00% | 99.9% | ✅ Exceeded every calendar month |
| Payment loss | 12 events in Q3-24 | 0 events in Q4-24 | 0 | ✅ Zero payments lost |

The payment accuracy run during the 90-day post-launch review period (October 22 through January 20, 2025) recorded 100% accuracy across 4.2 million invoice generation and settlement cycles. One reconciliation discrepancy was detected in mid-November and traced to an edge case in the timezone-aware billing-date computation for merchants in the Philippines (UTC+8); the fix shipped within 90 minutes, together with a timezone-hardened unit test in the billing CI suite, which now runs against every PR.
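A timezone-hardened test of this kind might look like the sketch below. It is not the actual Finstack test: the function name and the sample instants are invented, and a fixed UTC+8 offset is assumed rather than a named zone from a timezone database.

```python
from datetime import date, datetime, timedelta, timezone

# Assumption: merchants are billed on their local calendar date at UTC+8.
MANILA = timezone(timedelta(hours=8), "UTC+8")


def billing_date(instant: datetime, tz: timezone) -> date:
    """Local calendar date on which a timestamp falls for the merchant."""
    if instant.tzinfo is None:
        # Naive timestamps are rejected so nothing is silently read as local time.
        raise ValueError("timestamp must be timezone-aware")
    return instant.astimezone(tz).date()


# 16:30 UTC on 14 November is already 00:30 on 15 November at UTC+8.
late_utc = datetime(2024, 11, 14, 16, 30, tzinfo=timezone.utc)
# 15:59 UTC on the same day is still 23:59 on 14 November at UTC+8.
early_utc = datetime(2024, 11, 14, 15, 59, tzinfo=timezone.utc)

print(billing_date(late_utc, MANILA), billing_date(early_utc, MANILA))
```

The two sample instants sit on either side of local midnight, which is where a date computed in UTC and a date computed in local time disagree.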

Support incident volume related to delayed payments dropped by 78% over the same period (from an average of 62 tickets per week to 14) and merchant churn rate in the free-to-paid conversion bucket improved by 12 percentage points — an indirect but significant revenue signal of the platform's improved reliability.

[Image: Data center server racks at night]
The refactored billing platform now runs across 12 logical partitions spread across Finstack's Singapore and Tokyo colos.

Part 7 — Lessons Learned

Lesson 1: What You Are Not Measuring Is Threatening You

The September 2024 crisis was not caused primarily by complex software bugs. It was caused by the absence of lock-contention metrics in the observability dashboards. The team had been running sophisticated latency and throughput monitoring for 18 months and had never noticed the contention, because the metrics that would have exposed the row-level lock hold time (lock wait time histograms, blocked-query counts per table) were simply not being collected. Once those metrics were added, the diagnosis became possible in hours rather than weeks.

Lesson 2: Eventual Consistency Is Not a Bug If Your Contract Says So

The pre-refactor billing design tried to be strongly consistent across every operation, which is what forced the synchronous currency conversion call. In payment processing, this is a classic over-engineering risk: merchants who cannot tolerate eventual consistency for a tax or settlement amount are extremely rare in micro-merchant segments. By explicitly designing the settlement pipeline around an idempotent, async-first contract — a subscription initiated event is processed exactly once at the point of settlement, not the point of invoice generation — the team eliminated the single biggest source of latency without compromising payment accuracy, because the downstream settlement engine had already been designed with eventual-consistent read semantics. The key insight is that consistency contracts are a product decision, not a pure engineering decision, and they should be documented and code-reviewed with the same rigor as an API contract.

Lesson 3: The Queue Is the System

In distributed payment systems, the queue is not a detail; it is the primary architecture surface. The dead-letter contamination problem in Q3 2024 was caused by a pattern that is common across many production systems: when a downstream service times out, the upstream retry logic retries the entire batch, not the individual failed item, and no consumer-level metrics expose whether a retry is handling a new message or re-retrying a poison pill that has already been retried three times. This type of silent amplification, in which a single failed item becomes four, then eight, then sixteen consumers' worth of work, is the most dangerous failure mode in payment systems, precisely because it can scale faster than any alerting pattern that relies on individual error rates. The settlement circuit breaker, once implemented correctly, would have taken this entire class of contamination off the table, but the decision to implement it came after the September crisis, not before it.
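The amplification is easy to quantify with a simple count of message deliveries. The batch size, number of poison pills, and number of attempts below are hypothetical, chosen only to show the shape of the difference between retrying a whole batch and retrying just the failed items.

```python
def batch_retry_deliveries(batch_size: int, attempts: int) -> int:
    """Every attempt resends the whole batch, healthy items included."""
    return batch_size * attempts


def item_retry_deliveries(batch_size: int, failed_items: int, attempts: int) -> int:
    """The batch is sent once; only the failed items are resent afterwards."""
    return batch_size + failed_items * (attempts - 1)


# Hypothetical case: 100 messages, 1 poison pill, 4 attempts in total.
whole_batch = batch_retry_deliveries(100, 4)
per_item = item_retry_deliveries(100, 1, 4)

print(f"batch-level retry: {whole_batch} deliveries")
print(f"item-level retry:  {per_item} deliveries")
print(f"redundant healthy deliveries avoided: {whole_batch - per_item}")
```

In this example a single bad message causes 297 redundant deliveries of healthy messages under batch-level retry, and that is before any fan-out across replicas multiplies the figure further.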

Lesson 4: Infrastructure-Gated Decoupling Cannot Wait for Third-Party Latencies

The synchronous call to the legacy currency conversion service, averaging 820ms per request, was the architectural flaw that cut across the entire design. What made this particularly difficult to address during the September crisis was the absence of any circuit-breaker around that specific integration point, meaning that a slow underlying service would hold a database transaction open for 820ms at a time — and 210 active transactions of that kind at several hundred merchants simultaneously was enough to exhaust a database thread-pool. The lesson, worked back from the failure, is simple: any integration point with a non-zero latency variance — especially a third-party API — must be behind a timeout-bound, client-side circuit breaker at the service boundary, and that circuit-breaker must trigger an async path, not simply return an error to the calling thread.
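The rule in this lesson, fail fast and hand the work to an async path, can be sketched as a small client-side breaker. The sketch assumes the wrapped call enforces its own timeout and raises `TimeoutError`; the class, the failure threshold of 3, the 30-second reset, and the `deferred` queue are illustrative choices rather than details from the case study.

```python
from collections import deque


class CircuitBreaker:
    """After repeated timeouts, skip the remote call and defer the request."""

    def __init__(self, failure_threshold=3, reset_after=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._failures = 0
        self._opened_at = None   # time at which the breaker tripped

    def call(self, fn, request, now, deferred: deque):
        tripped = self._opened_at is not None
        if tripped and now - self._opened_at < self.reset_after:
            deferred.append(request)   # async path: no remote call at all
            return None
        try:
            result = fn(request)       # fn is assumed to enforce its own timeout
        except TimeoutError:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = now  # trip, or re-trip after a failed trial call
            deferred.append(request)   # still handled later, never dropped
            return None
        self._failures = 0             # success closes the breaker again
        self._opened_at = None
        return result


def slow_conversion(request):
    # Stand-in for the legacy currency service timing out.
    raise TimeoutError("conversion service timed out")


breaker = CircuitBreaker()
deferred = deque()
for t in range(5):
    breaker.call(slow_conversion, f"invoice-{t}", now=float(t), deferred=deferred)
print(len(deferred))   # every request was deferred rather than left blocking
```

The important property is that the caller never waits on the slow dependency while holding a database transaction open; the deferred queue is drained by a separate worker.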

"We learned more about our payment architecture in one month of remediation than in eighteen months of normal delivery. It took a crisis to make us measure what actually mattered, not what we thought mattered. The real output of that 90-day sprint was not the code — it was the observability culture. We never went back."
— Chanya R., Head of Platform Engineering at Finstack

Part 8 — Conclusion

Finstack's billing platform refactor in late 2024 is now considered a reference architecture within their internal engineering guild and is being used as a template for the supply-chain payments product currently under active development. The 11× throughput improvement is a headline figure; the culture shift — from ad-hoc observability to a metrics-driven governance feedback loop — is the deeper change. In distributed financial systems engineering, this distinction between engineering and platform culture is the difference between building fast once and building right every time.

The billing platform refactor also demonstrated a structurally important economic truth for software infrastructure in regulated middle-market companies: the return on investment in observability and queue hygiene is almost always net-positive within a single quarter when measured against the cost of a single outage, and even faster when measured against merchant churn and the brand damage that comes from processing delays at scale. The decision to invest in a metrics-first refactor within a 90-day constraint, rather than continuing at status-quo team velocity, paid for itself in approximately 42 days of uninterrupted uptime across approximately 8 million closely monitored monthly transactions.


This case study is based on engineering data and leadership testimony from Q4 2024. Some identifying details have been adjusted. The reconstruction draws on performance telemetry processed through the Finstack observability platform and validated by globex audit as of Q1 2025.
