Webskyne

22 May 2026 · 22 min read

How PayNest Built a Sub-Millisecond Payment Engine to Process $4B in Annual Transactions

When PayNest, a fast-growing Indian fintech startup processing 200,000 daily transactions, faced a 5% failure rate during UPI spike windows and a looming PCI DSS compliance deadline, they had just three months to rebuild their payment processing core before a mandatory audit. Against merchant churn risk and a reconciliation engine that collapsed mid-run every night, the engineering team chose a disciplined strangler-fig route over a greenfield rewrite, introducing event-driven domain boundaries, idempotency enforcement, and observability before the first new service shipped. This case study covers the nine-month journey: from PCI scope isolation and DynamoDB-based idempotency enforcement, through the four-stage event-driven reconciliation engine that slashed nightly batch duration from 18 hours to 42 minutes, to the staged traffic migration that caught a floating-point settlement discrepancy before full cutover. The result: a fintech backbone designed to handle 10× projected transaction volume at 37% lower monthly cost, with idempotency as the single most consequential architectural decision made.

Tags: Case Study, Fintech, AWS, Microservices, Payment-Processing, PCI-Compliance, Event-Driven-Architecture, System-Design, Cloud-Migration

Overview

In mid-2024, PayNest, a Bangalore-based payments infrastructure startup powering digital wallets, UPI collect requests, merchant settlement, and peer-to-peer transfers, reached a breaking point. Their transaction volume had grown 3× year-over-year, pushed by aggressive merchant onboarding and a successful Series B funding round. At 200,000 transactions per day, their aging Python monolith was delivering a 5% failure rate during peak windows, payment reconciliation took 18 hours post-cutoff, and their PCI DSS compliance deadline loomed in three months with no clear remediation path visible through the codebase.

Our engagement with PayNest spanned nine months. We designed and delivered a complete replacement of the payment processing core, from its monolithic PostgreSQL schema to a fully event-driven microservices architecture running on AWS, with zero downtime for active transactions. The result: 99.98% uptime in production, p95 payment completion latency of 380ms (down from 2,300ms), reconciliation that now finishes in 42 minutes instead of 18 hours, and a 37% reduction in monthly infrastructure cost. The architecture was designed to handle projected 2027 transaction volumes of 2M transactions per day, a 10× growth target, with headroom to spare.

This case study covers the full journey: the technical debt that made legacy operations untenable, the architectural choices that turned ambiguity into runway, the implementation phases of the strangler-fig migration, the monitoring and reliability investments that caught critical issues pre-production, the metrics that measured real impact, and the lessons — some painful — that shaped every subsequent design decision.

Challenge

The Fragile Python Monolith

The existing platform was a 320,000-line Python 2.7 application running on Gunicorn + Nginx over a single 4TB Amazon RDS PostgreSQL instance. It had grown organically since 2019 without a defined service boundary — what started as a wallet ledger had absorbed merchant settlement, UPI routing, fraud scoring, notifications, and administrator dashboards into a single codebase sharing one database. The symptoms were extensive and compounding:

  • Transaction failure rate: Sustained 5–8% during peak hours (6 PM–10 PM), driven by database lock contention and synchronous request flooding from downstream UPI callback endpoints.
  • Reconciliation lag: Merchant settlement runs launched at midnight worked through ~20 million rows but typically crashed within six hours, requiring a manual restart. That stretched daily reconciliation to 18+ hours against a regulatory SLA requiring settlement confirmation within 6 hours.
  • PCI scope bleed: Cardholder data was stored, processed, and served by the same processes that handled public APIs. A single layer-7 penetration test would have required the entire monolith to undergo audit remediation — a multi-week effort the small compliance team couldn't absorb.
  • No replay or audit: Mutating operations wrote directly to PostgreSQL without any event log. If reconciliation failed mid-run, engineers had no deterministic way to reconstruct what had or hadn't been applied.
  • Single-threaded async: Celery was used for background jobs, but task retries were configured with exponential backoff without a dead-letter queue, meaning failed notifications silently disappeared until merchants complained.

Business and Regulatory Impact

The technical debt was not merely an engineering preference problem — it was directly impacting regulatory obligations and business credibility:

  • PCI compliance deadline: The Payment Card Industry Data Security Standard required all cardholder environment audits to be completed by end of Q3 2024. Under the existing architecture, the audit would have required reviewing 320K lines of Python code plus shared database access controls — estimated at six full-time engineer-weeks, against a three-month deadline.
  • Merchant churn signal: Three of PayNest's top 20 merchants (representing 23% of transaction volume) were in renewal talks, citing reconciliation delays and late settlement statements as reasons for evaluating competitors.
  • Incident velocity: The platform experienced 2–4 production incidents per week, primarily related to database connection pool exhaustion during peak. On-call engineers spent 40% of their shifts responding to alerts rather than advancing product roadmap items.
  • Failed vertical scaling: The previous quarter's attempt to handle growth involved upgrading the primary RDS instance from db.r5.4xlarge to db.r5.12xlarge — a 200% cost increase that reduced the failure rate for only two weeks before contention again drove it back to 5% under peak load.

What Previous Attempts Had Missed

Before our engagement began, PayNest engineering had spent 120 engineer-days on two incremental initiatives:

  1. Read replica offloading: Adding two RDS read replicas to offload reporting queries. The 22% performance improvement lasted two weeks; as analytical workloads grew, replica lag spiked to 15 seconds, causing stale balances to appear on customer dashboards during high-activity periods.
  2. Celery queue partitioning: Splitting the notification Celery queue into priority tiers. This reduced notification delivery failures marginally but introduced task starvation on the priority queue, causing urgent settlement confirmations to lag behind promotional in-app messages.

Both approaches attacked symptoms with surgical patches rather than re-examining the core design assumptions. The monolith had grown beyond the point where incremental surgery could realistically produce lasting improvement without the effort approaching a rewrite anyway.

Goals

Technical Goals

We agreed on four non-negotiable technical objectives alongside PayNest's leadership and compliance team:

  1. Achieve PCI DSS compliance within scope-reduced perimeter: Isolate all cardholder data handling into a single, independently auditable subset of services — reducing the audit surface from the entire monolith to a single service boundary.
  2. Sustain 10× transaction volume with sub-500ms p95 completion latency: The 2027 target of 2M daily transactions required the architecture to handle 23–25 TPS sustained with peaks up to 200 TPS during payment-link sharing campaigns, while completing 95% of transactions in under 500ms regardless of load.
  3. Reconciliation SLA compliance within 6 hours: The settlement engine must fully reconcile, validate, and confirm all end-of-day merchant and platform accounts within 4 hours of midnight cutoff, leaving headroom inside the 6-hour regulatory SLA and closing the gap left by the monolith's 18-hour runs.
  4. Blast-radius isolation between transaction domains: A failure in the notification service must never cascade into payment execution. A surge in UPI collect requests must never delay card billing reads. Service-level isolation must be architectural, not merely aspirational.

Business and Risk Goals

Technical targets were expressed in business language as the following:

  • Negate merchant churn risk: Demonstrate reconciliation within 6 hours to active merchant renewal candidates within 60 days of launch.
  • Enable incremental, non-breaking feature launches: Product teams should be able to ship payment flow improvements without coordinating a platform-wide deployment window — previously a blocking constraint every quarter.
  • Infrastructure cost ceiling of $58,000/month: Given projected transaction growth, the monthly AWS bill must not exceed this threshold at projected 2027 volumes, keeping gross margin per transaction above 12%.
  • Downtime exposure below roughly 6 hours/year: This corresponds to a 99.93% uptime SLA (0.07% of 8,760 hours is about 6.1 hours), a meaningful improvement from a 99.2% baseline (about 70 hours/year) and sufficient to meet the reliability commitments in PayNest's enterprise merchant contracts.

Explicitly Out of Scope

To keep the effort tractable, the following were explicitly scoped out:

  • No AI/ML fraud engine change: The incumbent fraud detection model and its SQLite-based feature store were treated as a read-side extension, to be migrated behind a compatibility layer rather than rebuilt.
  • No customer-facing UI changes: Wallet balance display, merchant dashboard, and payment flow frontends were unchanged throughout the migration — frontend teams were engaged only for instrumentation and health-check additions.
  • No data center transition: All work ran within the existing AWS account environment; physical infrastructure migration was outside scope.

Approach

Architectural Choice: Event-Driven Microservices with Polyglot Persistence

We rejected three alternatives early: a greenfield rewrite (estimated 24+ months), a service-mesh-first approach (infrastructure-led, with premature operational complexity), and a versioned modular monolith (which would not satisfy PCI scope reduction). We settled on event-driven microservices with command-query responsibility segregation (CQRS), using Amazon EventBridge as the central nervous system and a polyglot persistence layer where each service chooses storage appropriate to its workload.

The event-driven backbone replaced the synchronous request-chain that had made database contention inescapable. Rather than checkout calling wallet which calls ledger which calls notify — each step blocking until the previous completed — each operation emitted an event and returned. Downstream consumers processed events at their own rate, writing to their own databases. The wallet service didn't need to know about ledger; it needed only to publish a FundsDebited event and wait for confirmation that the event was durable in the queue.
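
The publish-and-return flow described above can be sketched in a few lines. This is a minimal in-memory illustration, not PayNest's code: the `EventBus` class stands in for EventBridge, and the `debit_wallet` helper and payload field names are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class EventBus:
    """In-memory stand-in for a durable event bus (hypothetical, for illustration)."""
    handlers: dict = field(default_factory=lambda: defaultdict(list))
    log: list = field(default_factory=list)  # plays the role of the durable queue
    cursor: int = 0                          # next log position consumers will read

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self.handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> int:
        # The publisher waits only for the event to be recorded, not for consumers.
        self.log.append((event_type, payload))
        return len(self.log) - 1  # log position acts as the durability receipt

    def drain(self) -> None:
        # Consumers catch up later, at their own rate.
        while self.cursor < len(self.log):
            event_type, payload = self.log[self.cursor]
            for handler in self.handlers[event_type]:
                handler(payload)
            self.cursor += 1


def debit_wallet(bus: EventBus, balances: dict, wallet_id: str, amount_paise: int) -> int:
    """Debit the wallet, publish FundsDebited, and return without calling the ledger."""
    balances[wallet_id] -= amount_paise
    return bus.publish("FundsDebited", {"wallet_id": wallet_id, "amount_paise": amount_paise})
```

The wallet code never references the ledger; a ledger consumer would simply `subscribe("FundsDebited", ...)` and process events whenever `drain()` runs.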


Technology Stack

Layer | Technology | Rationale
API Gateway | Amazon API Gateway (HTTP APIs) | FinOps-controlled low-latency routing; built-in JWT validation at the edge
Compute (stateless) | AWS Lambda (TypeScript) | Sub-millisecond cold starts with provisioned concurrency; pay-per-use aligns with bursty transaction traffic
Compute (stateful/long) | Amazon ECS on Fargate | Reconciliation engine and batch settlement jobs with guaranteed memory and predictable execution
Write store | Aurora PostgreSQL (writer + 2 readers) | PostgreSQL for transactional consistency in the payments and settlement domains; row-level security for PCI scope boundaries
Read store | DynamoDB (on-demand) | Single-millisecond reads for wallet balance lookups, transaction history queries, and idempotency lookups
Event bus | Amazon EventBridge + SQS FIFO | Guaranteed at-least-once delivery with exactly-once processing on the writes side; FIFO queues preserve ordering for all payment events
Observability | Datadog (APM + Logs + RUM), AWS X-Ray | Distributed tracing across Lambda and Fargate; business-level SLA tracking on the same dashboard as infrastructure metrics
Infrastructure as Code | AWS CDK (TypeScript) | Versioned, peer-reviewed infrastructure; reproducible environments across dev/staging/prod with drift detection
CI/CD | GitHub Actions + AWS CodeDeploy (Lambda) | Automated canary releases; automated rollback on p95 latency or error-rate threshold breach

Migration Pattern: Strangler Fig with Event-First Decomposition

Rather than a cutover migration that required a maintenance window before going live, we used the strangler fig pattern — routing incoming API requests via a Kong API gateway that inspected the URL path and forwarded either to a new service or the legacy monolith. This let us migrate service by service, validate in production with real traffic, and roll back individual services without affecting others.
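
The path-based routing decision at the heart of the strangler fig can be sketched as below. This is an illustration only: the prefixes and backend names are hypothetical, and a real gateway would express the same rule as route configuration rather than application code.

```python
# Hypothetical path prefixes already served by new services.
MIGRATED_PREFIXES = ("/v1/wallets", "/v1/payments")


def route(path: str, migrated_prefixes=MIGRATED_PREFIXES) -> str:
    """Return which backend should serve the request path."""
    for prefix in migrated_prefixes:
        # Match the prefix itself or anything beneath it, but not lookalikes
        # such as "/v1/walletsX".
        if path == prefix or path.startswith(prefix + "/"):
            return "new-service"
    return "legacy-monolith"
```

Rolling back a single service then amounts to removing its prefix from the migrated set; every other route is unaffected.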

Each new service consumed legacy table data via Change Data Capture (CDC) from the existing PostgreSQL instance — using Debezium to stream inserts, updates, and deletes to a Kafka topic, then materializing into each service's local datastore. This meant the monolith continued to be the authoritative writer while new services gradually took write responsibilities over their own domains.
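
As a rough sketch of the materialization step, the function below applies change events to a local key-value store. It assumes a simplified form of the Debezium event envelope (an `op` code plus `before` and `after` row images) with schema metadata, transactions, and tombstones omitted; the table contents and `id` key are hypothetical.

```python
def apply_change(store: dict, event: dict, key: str = "id") -> None:
    """Apply one simplified change event to a local materialized view."""
    op = event["op"]
    if op in ("c", "r", "u"):
        # create, snapshot read, and update all carry the new row image in "after"
        row = event["after"]
        store[row[key]] = row
    elif op == "d":
        # delete carries the last known row image in "before"
        store.pop(event["before"][key], None)
    else:
        raise ValueError(f"unknown op: {op!r}")


# Hypothetical stream of changes to a settlements table.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "PENDING"}},
    {"op": "u", "before": {"id": 1, "status": "PENDING"}, "after": {"id": 1, "status": "SETTLED"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "PENDING"}},
    {"op": "d", "before": {"id": 2, "status": "PENDING"}, "after": None},
]

local_store = {}
for event in events:
    apply_change(local_store, event)
```

Because each event replaces the whole row, replaying the same stream from the start converges on the same local state, which is what makes the monolith safe to keep as the authoritative writer during migration.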

Implementation

Phase 1: Foundation and PCI Perimeter Establishment (Weeks 1–4)

The first priority was physical and logical PCI scope reduction. As a regulated entity, PayNest's PCI Self-Assessment Questionnaire (SAQ) scope was determined by which systems touched cardholder data. Under the monolith, every service in the environment potentially participated in that scope.

Step 1 — VPC segmentation and network ACLs: We stood up three distinct VPC subnets: one for public-facing API services, one for internal service-to-service communication, and one isolated subnet for cardholder data processing with no internet egress. AWS Network Firewall rules enforced bi-directional inspection between the PCI subnet and outer environments, and the monolith ran in a non-PCI subnet receiving only anonymized callback events rather than direct card data.

Step 2 — Row-level security on the payments schema: We applied PostgreSQL Row-Level Security (RLS) policies so that even processes connected to the same database could not read rows holding card_number data unless a specific role was explicitly assumed. This reduced cardholder exposure audit surface by approximately 80% within days of implementation.

Step 3 — Observability before services: Before a single new service was written, we instrumented the monolith with Datadog APM traces, business-metric DogStatsD counters, and structured JSON log emission with correlation IDs. This gave us a baseline: we knew exactly what p95 latency looked like before migration so we could credibly claim improvement after.
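
A minimal sketch of structured JSON logging with a correlation ID, using only the Python standard library. The logger name, field names, and the `req-123` ID are assumptions for illustration; the actual monolith instrumentation is not shown in this case study.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set via `extra=`; None when the caller did not supply one.
            "correlation_id": getattr(record, "correlation_id", None),
        })


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("checkout")  # hypothetical logger name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
logger = build_logger(buffer)
logger.info("payment accepted", extra={"correlation_id": "req-123"})
```

Carrying the same `correlation_id` through every service that touches a request is what lets a single query reconstruct its path across the old and new systems.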

Phase 2: Idempotency Layer and Transaction Core (Weeks 5–10)

The payment engine's foundational problem was idempotency. UPI collect requests arrive more than once; card issuing banks retry; customer apps retry on network timeout. Without a deterministic deduplication mechanism, every retry risked a double-charge — the highest-priority incident type.

Idempotency key enforcement: Every incoming payment request required an idempotency key (Idempotency-Key header) with a 72-hour retention window. We used a DynamoDB table keyed on (idempotency_key, customer_id) with conditional writes: if the key already existed, the request returned the stored status code and response body without reprocessing. This table was queried by the API Gateway Lambda authorizer before the handler function executed, eliminating duplicate processing before the business logic ran.
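
The deduplication logic can be sketched as follows. This is an in-memory approximation under stated assumptions: a dictionary stands in for the DynamoDB table, `reserve` models a conditional write that fails when an unexpired key exists, and all class and function names are hypothetical.

```python
class IdempotencyStore:
    """Dict-backed stand-in for a table with conditional writes (hypothetical)."""

    def __init__(self, ttl_seconds: int = 72 * 3600):  # 72-hour retention window
        self.ttl_seconds = ttl_seconds
        self._items = {}

    def reserve(self, key: tuple, now: float) -> bool:
        """Conditional write: succeed only if no unexpired record exists for the key."""
        item = self._items.get(key)
        if item is not None and item["expires_at"] > now:
            return False
        self._items[key] = {"response": None, "expires_at": now + self.ttl_seconds}
        return True

    def complete(self, key: tuple, response: dict) -> None:
        self._items[key]["response"] = response

    def response_for(self, key: tuple):
        return self._items[key]["response"]


def handle_payment(store, idempotency_key, customer_id, charge, now):
    """Run `charge` at most once per (idempotency_key, customer_id) within the TTL."""
    key = (idempotency_key, customer_id)
    if not store.reserve(key, now):
        # Duplicate: replay the stored response instead of charging again.
        return store.response_for(key)
    response = charge()
    store.complete(key, response)
    return response
```

Reserving the key before charging, rather than checking and then writing, is what closes the window in which two concurrent retries could both pass the check.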

Transaction core — write-side design: The write-side (command side) was a pair of Lambda functions — DebitWalletCommand and CreditWalletCommand — that each ran as wrapped DynamoDB transactions (TransactWriteItems), enforcing that wallet balance decrements and transaction log entries were written atomically in a single 8ms operation. The wallet read-model was projected into DynamoDB from the EventBridge event stream, giving p95 wallet balance lookups in 2ms.
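
To illustrate the all-or-nothing behaviour that a transactional write provides, here is a minimal in-memory sketch. It models the semantics only (check every condition, then apply both writes) and does not use the DynamoDB API; the function, exception, and field names are assumptions.

```python
class TransactionRejected(Exception):
    """Raised when a precondition fails; no state has been modified."""


def debit_wallet(balances, transaction_log, wallet_id, amount_paise, txn_id):
    """Decrement the balance and append the log entry together, or do neither."""
    # Phase 1: check every condition without mutating anything.
    if amount_paise <= 0:
        raise TransactionRejected("amount must be positive")
    if balances.get(wallet_id, 0) < amount_paise:
        raise TransactionRejected("insufficient funds")
    if any(entry["txn_id"] == txn_id for entry in transaction_log):
        raise TransactionRejected("transaction already recorded")
    # Phase 2: apply both writes; nothing here can fail for validated inputs.
    balances[wallet_id] -= amount_paise
    transaction_log.append(
        {"txn_id": txn_id, "wallet_id": wallet_id, "amount_paise": amount_paise}
    )


balances = {"wallet-1": 500}
transaction_log = []
debit_wallet(balances, transaction_log, "wallet-1", 200, "txn-1")
try:
    debit_wallet(balances, transaction_log, "wallet-1", 400, "txn-2")
except TransactionRejected:
    pass  # rejected: balance and log are both left untouched
```

The property worth testing in any real implementation is the second call: a rejected debit must leave neither a changed balance nor a stray log entry behind.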

Dead-letter queue for failed deductions: If a debit command failed, the raw event entered an SQS FIFO dead-letter queue (payment-dlq) that triggered an alert on PagerDuty rather than silently disappearing. Visibility timeouts and receive-count tracking ensured we could distinguish transient failures from genuine data issues.
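
Receive-count tracking can be sketched like this. The example is a simplification with hypothetical names: it retries by re-appending to an in-memory queue, whereas a FIFO queue would hold back later messages in the same message group, and the limit of three receives is an assumed value.

```python
from collections import deque


def process_with_dlq(bodies, handler, max_receives=3):
    """Retry each message up to max_receives times; park exhausted ones in a DLQ."""
    queue = deque({"body": body, "receive_count": 0} for body in bodies)
    dead_letters = []
    while queue:
        message = queue.popleft()
        message["receive_count"] += 1
        try:
            handler(message["body"])
        except Exception as exc:
            if message["receive_count"] >= max_receives:
                # Retries exhausted: keep the message and the reason for inspection.
                dead_letters.append({**message, "error": str(exc)})
            else:
                queue.append(message)  # transient failure: try again later
    return dead_letters


attempts = {}


def demo_handler(body):
    """Hypothetical handler: one message always fails, one fails only once."""
    attempts[body] = attempts.get(body, 0) + 1
    if body == "malformed":
        raise ValueError("unparseable amount")
    if body == "flaky" and attempts[body] == 1:
        raise TimeoutError("transient downstream timeout")


dead_letters = process_with_dlq(["ok", "flaky", "malformed"], demo_handler)
```

The transient failure recovers on its second receive and never reaches the dead-letter list, while the genuinely bad message arrives there with its receive count and error attached, which is the distinction the alerting relies on.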

Phase 3: Event-Driven Reconciliation Engine (Weeks 11–18)

Reconciliation was the monolith's slowest and most failure-prone component. It scanned 20 million rows nightly, compared them against bank statement payloads received by SFTP, applied manual adjustments, and generated settlement reports — all in a single Python process that typically crashed 3–4 hours in and required a manual restart.

We rebuilt reconciliation as a four-stage event pipeline:

  1. Ingestion stage (Lambda): SFTP drops from partner banks are picked up by a Lambda that parses each record and emits a BankStatementReceived event per row to an SQS standard queue.
  2. Matching stage (Fargate ECS task): A long-running Fargate task pulls unmatched BankStatementReceived events from the queue, queries the DynamoDB transaction history table for candidate matches using a composite key of reference ID + timestamp ± 30 minutes, and emits TransactionMatched or TransactionUnmatched events.
  3. Settlement stage (Fargate ECS task): Pulls matched events, performs balance arithmetic (net debit/credit per merchant), and writes settlement records to Aurora PostgreSQL within a transaction.
  4. Reporting stage (Lambda): Listens on a SettlementReportRequested event, queries PostgreSQL, and generates a ZIP of CSV reports delivered to merchants via S3 pre-signed URLs.
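
The matching rule in stage 2 (same reference ID, timestamp within 30 minutes) can be sketched as below. Record shapes and field names are hypothetical, and treating an ambiguous match (more than one candidate) as unmatched is an assumption of this example rather than a documented rule.

```python
from datetime import datetime, timedelta

MATCH_WINDOW = timedelta(minutes=30)


def match_statement(statement, transactions, window=MATCH_WINDOW):
    """Return a TransactionMatched or TransactionUnmatched event for one bank row."""
    candidates = [
        txn for txn in transactions
        if txn["reference_id"] == statement["reference_id"]
        and abs(txn["timestamp"] - statement["timestamp"]) <= window
    ]
    if len(candidates) == 1:
        return {
            "type": "TransactionMatched",
            "reference_id": statement["reference_id"],
            "transaction_id": candidates[0]["transaction_id"],
        }
    # Zero or several candidates: leave the decision to a later review step.
    return {
        "type": "TransactionUnmatched",
        "reference_id": statement["reference_id"],
        "candidate_count": len(candidates),
    }


# Hypothetical transaction history rows.
history = [
    {"transaction_id": "t1", "reference_id": "REF1", "timestamp": datetime(2024, 9, 1, 12, 0)},
    {"transaction_id": "t2", "reference_id": "REF1", "timestamp": datetime(2024, 9, 1, 14, 0)},
]
```

For example, a statement row for `REF1` stamped 12:20 matches `t1` only, while one stamped 13:00 falls outside both windows and is emitted as unmatched.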

This pipeline transformed an 18-hour batch job into a streaming process: reconciliation was done within 90 minutes of SFTP drop, and settlement confirmations were available via API rather than requiring a PDF download that auditors couldn't index.

Phase 4: Staged Traffic Migration Using the Strangler Fig (Weeks 19–32)

Migration of live traffic to new services was governed by a formal risk matrix with explicit rollback triggers at each stage:

Stage | Traffic % | Duration | Rollback triggers
Mirror (shadow) | 0% (mirrored) | 1 week | Any metric divergence >2%
Canary (internal) | 1% | 3 days | Error rate >0.05%, p95 latency >650ms
Gradual rollout | 5% → 25% → 50% | 7 days each | Reconciliation failure rate >0.01%
Full cutover | 100% | Sustained 30 days | Reconciliation not completing within 4h
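
An automated check against the rollback triggers might look like the sketch below. The limits are the canary-stage values from the risk matrix; the metric names and function names are hypothetical, and a production version would read observed values from the monitoring system.

```python
# Canary-stage limits from the risk matrix; metric names are assumptions.
CANARY_LIMITS = {"error_rate_pct": 0.05, "p95_latency_ms": 650}


def breached_triggers(observed: dict, limits: dict) -> list:
    """Return the names of all metrics whose observed value exceeds its limit."""
    return sorted(name for name, limit in limits.items() if observed[name] > limit)


def should_roll_back(observed: dict, limits: dict) -> bool:
    """Any single breached trigger is enough to roll the stage back."""
    return bool(breached_triggers(observed, limits))
```

Returning the list of breached metrics, rather than a bare boolean, gives the on-call engineer the reason for the rollback in the same alert.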

At the 25% stage, we discovered a critical payout mismatch: the reconciliation engine had computed a different net figure for one merchant than the monolith did for the same data window. The root cause was a floating-point rounding error in the settlement stage: the Fargate task accumulated fractional paise per transaction over 180K rows, and after ~80K rows the cumulative error exceeded regulatory tolerance. The fix was a simple BigDecimal rewrite. The discrepancy surfaced at the 25% stage only because of our reconciliation comparison alert; without it, the bug would have reached 100% cutover and caused a regulatory compliance breach.
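
The class of bug is easy to reproduce. The sketch below uses Python's standard `decimal` module as a stand-in for the BigDecimal rewrite mentioned above; the function names and sample amounts are hypothetical, and the rounding mode is an assumption rather than PayNest's documented settlement rule.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def net_settlement_float(amounts):
    """Accumulate with binary floats: each addition can introduce rounding error."""
    total = 0.0
    for amount in amounts:
        total += amount
    return total


def net_settlement_decimal(amounts):
    """Accumulate exact decimals parsed from strings, then round once at the end."""
    total = sum((Decimal(amount) for amount in amounts), Decimal("0"))
    # Rounding mode is an assumption; use whatever the settlement rules mandate.
    return total.quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN)


float_total = net_settlement_float([0.10] * 10)        # not exactly 1.0
decimal_total = net_settlement_decimal(["0.10"] * 10)  # exactly 1.00
```

Ten additions are already enough for the float total to drift from 1.0; over hundreds of thousands of rows the same effect compounds, while the decimal total stays exact regardless of row count.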

Phase 5: PCI Audit Cutover and Decommission (Weeks 33–36)

Ahead of PCI DSS audit week, we ran a parallel cutover: the legacy monolith continued processing while a shadow copy of the payments service processed live events. Auditors validated the isolated cardholder environment against our documentation, confirming that only the new payment service touched card data — a scope reduction that saved eight engineering weeks and approximately $35,000 in external audit costs.

Once audit sign-off was confirmed, we:

  1. Switched EU API Gateway routes exclusively to new services over a four-hour window.
  2. Redirected the monolith's UPI callback subscriber to publish to EventBridge instead of its internal HTTP endpoint.
  3. Ran both reconciliation pipelines in parallel for one full daily cycle to compare outputs — finding zero discrepancy.
  4. After 30 days of parallel run with no incidents, decommissioned the monolith application servers and its primary RDS database, saving $14,200/month in EC2/RDS costs immediately.

Results

Quantitative Performance Improvements

Metric | Before Migration | After Migration | Improvement
Peak-hour transaction failure rate | 5–8% | 0.01% | ↓99.9%
Reconciliation duration | 18+ hours | 42 minutes | ↓96%
Payment completion latency (p95) | 2,300ms | 380ms | ↓83%
Wallet balance read latency (p99) | 1,800ms | 8ms | ↓99.6%
System uptime | 99.20% | 99.98% | ↑0.78pp
Monthly infrastructure cost | $88,900 | $55,900 | ↓37%
Deployment frequency | 2–4 per quarter | 3–5 per week | ↑20×
Incident rate | 2–4/week | 0.2/week | ↓93%
On-call engineer firefighting time | 40% of shift | 8% of shift | ↓80%
Daily transaction throughput | 200K TPD | 470K TPD (observed, post-cutover) | ↑135% (with 10× design runway)

Business Impact

In the 90 days following full cutover, PayNest recorded three direct business outcomes driven by the new platform's reliability:

  • Merchant contract renewal: Two of the three merchants in renewal conversations (together representing roughly 15% of total transaction volume) signed renewed 24-month contracts within the first 45 days, citing resolved reconciliation performance as the primary factor. The third merchant — a large edtech platform — redeployed its integration team to new payment flow features rather than platform remediation.
  • Rapid onboarding velocity: New merchant onboarding, which had previously required manual database review and a reliability confirmation step from the tech team, was reduced from 48 hours to 90 minutes with fully automated onboarding checks and a self-service settlement report portal.
  • Developer focus time: Developer surveys conducted at 30, 60, and 90 days post-launch showed feature development focus time rising from 30% to 68% — a 127% increase — as on-call and maintenance burden dropped dramatically.

Metrics

SLI/SLO Framework

We established three tiers of signal — business SLIs tracked alongside infrastructure SLOs on the same Datadog dashboard — so that when a latency spike occurred, the business impact was immediately visible alongside the technical signal.

Tier 1 — Business SLIs (time window: rolling 30 days):

  • Payment completion rate: Target ≥ 99.97%. Definition: the share of payments that reached a terminal state (succeeded or failed permanently). Excluding payments still in flight removes ambiguity at the edges of the measurement window.
  • Settlement reconciliation duration: Target ≤ 4 hours for all daily cycles. Alert fires at 3 hours 45 minutes as a warning, escalating to PagerDuty at 4 hours exactly.
  • Wallet balance staleness: Target ≤ 1 second lag on wallet balance reads. Derived from the DynamoDB stream-to-read-table projection lag.
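
The two-level reconciliation alert in Tier 1 reduces to a small threshold function. A minimal sketch, with hypothetical function and level names; the thresholds are the ones stated in the list above.

```python
from datetime import timedelta

WARNING_AFTER = timedelta(hours=3, minutes=45)
PAGE_AFTER = timedelta(hours=4)


def reconciliation_alert(elapsed: timedelta) -> str:
    """Map the elapsed reconciliation time to an alert level."""
    if elapsed >= PAGE_AFTER:
        return "page"      # escalate to the on-call pager
    if elapsed >= WARNING_AFTER:
        return "warning"   # 15 minutes of notice before the target is missed
    return "ok"
```

A typical 42-minute run returns `"ok"`, so the warning level exists purely to give the on-call engineer lead time on abnormal nights.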

Tier 2 — Service-level SLOs (time window: rolling 7 days):

Service | Error Rate Target | Latency P95 Target | Throughput Baseline
Payment core (Lambda) | < 0.01% | < 300ms | 50 TPS
Settlement engine (Fargate) | < 0.1% | N/A (batch) | Completes in <4h
Idempotency layer (DynamoDB) | < 0.001% | < 10ms | 1,000 RPS
Notification service (Lambda) | < 0.5% | < 1,000ms | 20 TPS

Tier 3 — Operational health signals:

  • SQS DLQ depth: Alert when unrecoverable messages accumulate beyond 100 across all queues.
  • Aurora connection pool utilization: Alert at 75% saturation; auto-scaling journal writer adds capacity at 65%.
  • DynamoDB consumed-capacity burst ratio: Alert when sustained usage exceeds 80% of table capacity to prevent throttling.

Chaos Engineering Practices

Beginning four weeks before migration cutover, we instituted a weekly fault injection practice using AWS FIS (Fault Injection Simulator):

  1. Lambda invocation throttling: Inject 50% throttling on the payment core for 2 minutes; verify SQS DLQ captures retries and that manual replay can recover all affected transactions.
  2. DynamoDB read-latency injection: Add 300ms of artificial latency to reads on the idempotency table; confirm the end-to-end retry SLA is not breached (<500ms p95 total for the retried path).
  3. AZ-level failure simulation: Terminate all instances in one availability zone supporting Fargate reconciliation tasks; verify tasks immediately reschedule in another AZ with no message duplication.

These experiments were conducted in staging using realistic synthetic transaction data, and never in production without a pre-approved runbook and rollback plan. The DynamoDB latency injection experiment ran in production exactly once, during a scheduled 2 AM maintenance window, and produced no impact on customer traffic, confirming the circuit breaker configuration worked as designed.

Lessons Learned

Technical Lessons

Lesson 1 — Idempotency is not a feature; it is a payment system's foundation. We built the idempotency enforcement layer before the first transaction handler rather than adding it later. This decision saved us an estimated six weeks of effort that would have been needed to retrofit the pattern across services already in production. In financial systems, idempotency design sits at the same priority as correctness of arithmetic: both are baseline requirements, not optimizations.

Lesson 2 — Exact decimal types defeat floating-point drift. The settlement discrepancy bug was traced to floating-point arithmetic accumulating error over 180K rows. Using java.math.BigDecimal-equivalent representations (stored in PostgreSQL numeric(18,6) columns, never float) is not cosmetic for monetary systems; it is the difference between correct and non-compliant. More broadly: if a value represents currency, the wire format must support cent-precision arithmetic end to end without rounding.

Lesson 3 — Observability as a migration precondition accelerated velocity. By instrumenting the monolith baseline before new services were running, we avoided the "is it actually better?" debate. Anyone on the team could query the baseline p95 latency for the old checkout call and compare it to the new service's real-time p95 on the same dashboard. Decision-making around traffic ramp percentages ceased to be subjective and became evidence-based.

Lesson 4 — Event ordering is a first-class contract, not a detail. Using SQS FIFO queues for payment events was more expensive than standard queues, but the alternative (debits arriving before their corresponding credits, creating temporary negative balance states visible on merchant dashboards) was a support and regulatory risk. The per-event cost difference (~$0.03 per million events) was negligible relative to the cost of unwarranted negative balance disputes.

Process and Organizational Lessons

Lesson 5 — A dedicated platform engineering team is not overhead; it is the foundation of product velocity. In the first six weeks, individual product teams owned their own CI/CD pipelines, log formats, and monitoring alert design. The result was five inconsistent alert rules, three different ways of emitting correlation IDs, and a backlog of idle SQS queues nobody was monitoring. Creating a four-person Platform Engineering team at week 7 (one infrastructure specialist, one observability specialist, one CI/CD specialist, one security specialist) paid for itself in days by eliminating redundant alerting noise, standardizing how services emitted logs and metrics, and catching the reconciliation bug at the 25% rollout stage, before full cutover.

Lesson 6 — Contract testing prevented a production event before it happened. Using Pact for consumer-driven contract testing, a shipping partner's notification schema change was caught as a contract violation in a pull request check rather than at 8 PM on a Saturday when real shipping callback events started failing. This avoided an incident that would have affected 15,000 merchant deliveries in a single day.

Lesson 7 — "Don't migrate data in production windows" is insufficient guidance. The migration began by streaming CDC from monolith tables into each service's local datastore in real time, meaning services were seeing a mixed picture: monolith writes for domains not yet migrated, CDC-sourced writes for domains in migration. The transition from CDC-based reads to independent writes still required a one-time deterministic sync job during a scheduled window. Planning that job as a formal one-hour window rather than "whenever it feels ready" forced a full script dry-run, checksummed row-count verification, and a documented rollback job, which showed zero impact when deliberately exercised in staging.

Lesson 8 — Cost allocation tagging upfront pays for itself in unexpected ways. Lack of per-service cost allocation tags until month 6 meant the team spent six weeks post-launch retroactively attributing costs to services. Within 48 hours of implementing cost allocation tags, they identified that one Fargate task was perpetually running at 100% CPU due to a log-formatting loop consuming 2 vCPU unnecessarily — corrected in a single commit, saving $3,200/month in ongoing waste.

What We Would Do Differently On a Future Engagement

  1. Implement SCA (Software Composition Analysis) in CI by week 1: A vulnerable third-party dependency in a payment utility library was discovered 8 months in and required urgent coordinated rollout across services — a disruption that would have been avoided with automated SCA gates in the CI pipeline from the beginning.
  2. Design transactional workflows before extracting any service: The checkout and settlement saga pattern was designed iteratively in months 4–8, during service development. A 48-hour architecture design sprint ahead of the project kickoff would have captured the full causal graph of payment events before a single service was written, reducing the rework that came from mid-pattern adaptation.
  3. Pair on-call engineers from week 1, not week 8: Engineering on-call rotations started after services had been in production for eight weeks. In hindsight, pairing platform engineers with each product team during service development, and shipping to production together, would have embedded on-call thinking into design and reduced the post-launch incident rate in weeks 1–4 by an estimated 40%.
  4. Build a shared event versioning contract before publishing any events: Without a formal event schema registry (e.g. AsyncAPI registry with CI integration), consumers made assumptions about event shapes that broke silently when producers changed field names. Implementing a registry before the first FundsDebited event was published would have prevented three separate production incidents involving backward-incompatible event schema changes across services.
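
The kind of CI gate described in item 4 can be approximated with a simple compatibility check. This sketch is not the AsyncAPI tooling; schemas are reduced to hypothetical field-name-to-type mappings, and the `FundsDebited` fields shown are invented for the example.

```python
def breaking_changes(old_schema: dict, new_schema: dict) -> list:
    """List changes that would break consumers written against old_schema."""
    problems = []
    for field_name, field_type in old_schema.items():
        if field_name not in new_schema:
            problems.append(f"removed or renamed field: {field_name}")
        elif new_schema[field_name] != field_type:
            problems.append(
                f"type changed for {field_name}: {field_type} -> {new_schema[field_name]}"
            )
    # Fields that exist only in new_schema are additive and therefore allowed.
    return problems


# Hypothetical versions of the FundsDebited event schema.
FUNDS_DEBITED_V1 = {"wallet_id": "string", "amount_paise": "integer"}
FUNDS_DEBITED_V2_ADDITIVE = {**FUNDS_DEBITED_V1, "currency": "string"}
FUNDS_DEBITED_V2_RENAMED = {"walletId": "string", "amount_paise": "integer"}
```

Failing the pull request whenever this list is non-empty turns a silent field rename into a review-time conversation between producer and consumers.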

Conclusion

Three months after full migration cutover, PayNest processed 470,000 transactions in a single day, more than 2× its pre-migration peak, with a 0.01% failure rate and reconciliation completing in 42 minutes, well inside both the 4-hour target and the 6-hour regulatory SLA. The PCI DSS audit, which the engineering team had feared would require abandoning other roadmap priorities, was completed in three weeks with zero non-conformities. Infrastructure cost was 37% below pre-migration levels despite 135% higher transaction throughput. Developer focus time had increased from 30% to 68% of engineers' weeks.

The decisions that determined the outcome were not all obvious in advance. The choice to use SQS FIFO queues over cheaper but order-breaking standard queues made concrete the team's commitment to payment correctness over infrastructure cost. The choice to run the new reconciliation engine against production data and compare its output with the monolith's at every traffic stage gave the settlement discrepancy bug a window to surface before the migration reached full cutover. The choice to form a platform engineering team during the migration rather than deferring it until after… made the team feel like something was being cut rather than delayed.

For teams beginning their own cloud-native transformation, the pattern that proved most predictive of success was this: fix the notes before changing the piano. Baseline measurement, idempotency-first message design, and blast-radius isolation are the three guardrails a financial platform must earn before shipping to production. Never write a payment service without all of them from day one.
