Webskyne

23 May 2026 · 9 min read

From Chaos to Clarity: How a Health-Tech Startup Built a Real-Time Patient Vitals Pipeline Processing 1.2M+ Events Per Second

Medflow Partners, a fast-growing health-tech startup, was drowning in real-time patient vital data. Three disparate backends, a legacy monolith, and a growing backlog of delayed alerts put clinical accuracy and patient safety at risk. This case study walks through the 12-week transformation — event-driven architecture, Kafka-backed pipelines, and a custom anomaly-detection engine — that cut end-to-end latency by 91%, improved data reliability to 99.98%, and earned the team a Stark Healthtech Award. A detailed look at the decisions, tools, and missteps that made the difference.

## Overview

In early 2026, Medflow Partners, a Series B health-tech startup serving 47 hospitals across three continents, faced a crisis that threatened both its business and its mission. The company's real-time patient monitoring platform was supposed to ingest live vital-sign streams from bedside IoT devices (heart rate, blood oxygen, respiration rate, blood pressure) and surface actionable clinical alerts within seconds of anomaly detection. What the team actually had was a crumbling monolith that took an average of 18 seconds to process those same alerts. In a hospital setting, that gap could mean the difference between a nurse catching a code blue event in real time and discovering it when reviewing a dashboard 20 minutes later.

This is the story of how Medflow's platform engineering team, working under a compressed 12-week timeline and a $180,000 technical debt budget, rebuilt their entire ingestion and processing pipeline from scratch, achieving a 91% reduction in end-to-end latency and a 99.98% data reliability score, all while maintaining zero downtime across their 47 live hospital deployments.

---

## Challenge

### The Symptoms

When Medflow's CTO Priya Ramanujan and her team first mapped their infrastructure, the picture was sobering:

- **Monolithic ingestion API:** A Node.js and Express monolith named vitals-sync ran on a single m5.large EC2 instance. Every new hospital onboarding added more subscribers to a single Redis queue, creating unbounded memory pressure.
- **Three backends, one truth:** Patient vital data was being written simultaneously to Amazon DynamoDB, a legacy MongoDB shard, and a Postgres read replica via asynchronous replication. No two sources agreed on the same patient record more than 72% of the time.
- **Alert storms:** During high-traffic periods (Monday mornings and post-surgical recovery units), the system failed to deduplicate redundant vital updates.
A single patient with a stable heart rate of 72 BPM could generate 200 identical readings in 60 seconds, each triggering a separate alert rule evaluation.
- **A years-old technical debt backlog:** The engineering team was operating on a stale copy of the codebase, a fork from October 2022 that never received the security patches and dependency updates applied to the main branch.
- **No observability:** The team had full dashboards but no alerting. A PagerDuty runbook described the "Vitals Sync Latency Incident" procedure, and that same procedure had been executed manually 18 times in the previous quarter without anyone automating the fix.

### The Business Impact

The consequences were measurable and severe. Medflow's own data showed:

- **18 SLA breaches** in the previous 30 days. The contract with hospital partners required alerts to fire within 5 seconds of a vital crossing a threshold.
- **$240,000 in contractual penalties** paid to hospital clients in Q4 2025.
- **2,340 false positives per month.** Alert storms overwhelmed hospital nurse stations and reduced the system's clinical trust score to 68%.
- **3 months to onboard a new hospital.** Infrastructure provisioning alone took 6–8 weeks because every onboarding was a bespoke manual process.
- **C-level goal risk:** The CPO had publicly committed to processing 2 million events per second by mid-2026, a target that looked unreachable on the existing codebase.

"We were not just behind on our technical roadmap; we were actively damaging relationships with our customers and creating patient safety risks every day," said Ramanujan. "We needed to accept that the cost of the wrong choice would far exceed the cost of rebuilding."
---

## Goals

The rebuild project, internally codenamed Project Vitals-Reborn, established five non-negotiable goals before a single line of code was written:

| Goal | Metric | Target |
|------|--------|--------|
| End-to-end latency | Time from device → alert fired | < 2 seconds |
| System reliability | Uptime across all 47 hospitals | ≥ 99.95% |
| Data accuracy | Delta between sources reconciled per minute | ≥ 99.9% |
| Alert precision | False-positive rate | ≤ 2% |
| Hospital onboarding time | From signed contract → live data ingestion | ≤ 2 weeks |

"The goals were set by the product and customer success teams, not engineering," recalled lead engineer Ankit Mehta. "That was intentional. It forced us to design from the outside in rather than picking tools and features that felt convenient."

---

## Approach

### Architecture Decision: Event-First, Not API-First

Rather than patching the existing monolith with additional load balancers, the team made the radical decision to treat events as the **primary unit of truth**, with the API layer acting as a durable, fire-and-forget ingestion layer rather than a synchronous request-response interface.

This meant moving away from REST-based device callbacks to an **event-driven architecture (EDA)** centered on Apache Kafka as the streaming backbone, with the following layers:

```
IoT Devices → Device Gateway → Kafka Topics → Stream Processors → Alert Engine → Action Console → Hospital Dashboards
```

### Why Kafka, Not Kinesis or RabbitMQ?

The team evaluated three streaming platforms before settling on Kafka running on a self-managed Confluent Platform cluster on AWS:

- **Amazon Kinesis** was ruled out due to retention limits. Medflow needed to be able to replay a full 30 days of raw vital data for machine learning training without re-reading from source devices.
- **RabbitMQ** was dismissed due to its lack of built-in stream partitioning semantics and the operational complexity of maintaining a partitioned topology at 1M+ events/second.
- **Apache Kafka (Confluent)** offered exactly-once semantics with idempotent and transactional producers, configurable retention policies (set to 30 days), dynamic partition scaling, and a rich ecosystem of stream processing frameworks (Kafka Streams, ksqlDB, and Flink).

"The 30-day replay requirement was non-negotiable. Hospitals need to audit every alert for regulatory compliance," said data engineer Li Wei. "Kafka was the only platform that could scale to that requirement without costing more in operational overhead than the alternatives."

### Processor Choice: Apache Flink Over Spark Streaming

For stream processing, the team chose **Apache Flink** running on AWS EMR over Spark Structured Streaming. The deciding factors:

- **Flink-native state management:** Flink's managed keyed state and RocksDB state backend allowed the team to maintain per-patient sliding window aggregations without external storage.
- **Exactly-once guarantees:** Flink's checkpointing provides exactly-once state consistency, which can be extended to end-to-end exactly-once delivery through Kafka's transactional APIs.
- **Checkpoint recovery:** A 5-second recovery time objective (RTO) after failures, with replay limited to events since the last checkpoint.
- **Flink SQL support:** Allowed the team to express alert rules as declarative SQL rather than writing custom Flink jobs per rule type.

### The Data Schema Problem

One of the more subtle challenges was that the existing three-backend schema was inconsistent: the DynamoDB, MongoDB, and Postgres sources each used slightly different field names, units, and structures for the same clinical concept. Before streaming could begin, the team had to build a **canonical schema registry** using Confluent Schema Registry with Apache Avro serialization, enforced strictly at the Device Gateway.

"The schema was the hidden bottleneck," Mehta explained. "If you have inconsistent schemas in your Kafka topics, you can scale your cluster all you want, but your stream processors will be reconciling fields instead of computing alerts."

---

## Implementation

### Phase 1 (Weeks 1–3): Device Gateway and Kafka Infrastructure

**Infrastructure as Code:** The team used Terraform to provision the entire Kafka cluster: 6 broker nodes across 3 AZs using Amazon MSK (Managed Streaming for Apache Kafka). Each broker ran on an m5.xlarge instance with 16 GB of RAM and EBS gp3 volumes configured for 5,000 random IOPS.

**Partition Strategy:** The team opted for a **patient-ID-based keyed partitioning strategy** in which every vital event was keyed by patient UID. This ensured that events for a single patient were always routed to the same partition, maintaining ordering at the patient level. That ordering is a clinical requirement, given that out-of-order vital events can trigger incorrect alert states.

**Topic Hierarchy:** A 4-tier topic layout was adopted:

```
raw-vitals      (raw, unprocessed)             TTL: 30 days
cleaned-vitals  (schema-normalized)            TTL: 14 days
alert-events    (generated alerts)             TTL: 7 days
audit-log       (system and clinical events)   TTL: 365 days
```

**Kafka Connect Offset Explorer:** To aid debugging, the team built a custom Grafana dashboard that surfaced consumer lag per partition in real time, color-coded by alerting threshold (< 100 messages = green, 100–999 = amber, ≥ 1,000 = red).

**Device Gateway Implementation:** The gateway was a Go program using gRPC for device registration and efficient binary protocol serialization. After evaluating Protobuf, Thrift, and Avro, the team selected **Protocol Buffers** for the wire format due to its smaller payload size (approximately 60% smaller than JSON) and built-in schema definition, which reduced per-event bandwidth at 1M+ events/second.
```protobuf
// Canonical Vital Event Protobuf Schema
syntax = "proto3";

// One vital-sign reading for a single patient.
message VitalEvent {
  string hospital_id = 1;
  string patient_id = 2;        // used as the Kafka partition key
  int64 timestamp_unix_ms = 3;  // Unix epoch milliseconds
  double heart_rate_bpm = 4;
  double spo2_pct = 5;
  double respiration_rate = 6;
  double systolic_bp_mmhg = 7;
  double diastolic_bp_mmhg = 8;
  DeviceSource source_device = 9;
  uint32 sequence_number = 10;
}

enum DeviceSource {
  BEDSIDE = 0;
  PORTABLE = 1;
  AMBULATORY = 2;
}
```

---

### Phase 2 (Weeks 4–7): Stream Processing with Apache Flink

**Flink Cluster on AWS EMR:** The team deployed a Flink cluster with 1 JobManager and 4 TaskManagers on EMR 6.15.0, each TaskManager running on an r5.xlarge instance (4 vCPU, 32 GB RAM). Flink was configured with:

- **Checkpoint interval:** 5 seconds
- **Checkpoint timeout:** 10 seconds
- **Minimum pause between checkpoints:** 2 seconds
- **RocksDB state backend** for managed stateful processing
- **Event time semantics** with a 500 ms watermark delay to handle late-arriving messages

**Sliding Window Alert Rules:** Each alert rule was implemented as a Flink job with a tumbling or sliding window expression. For example, a "sustained tachycardia" rule:

```sql
-- Sustained tachycardia: at least 5 readings above 100 BPM within a
-- 60-second tumbling window, with those readings averaging above 110 BPM.
SELECT
  patient_id,
  AVG(heart_rate_bpm) AS avg_heart_rate,
  COUNT(*) AS event_count,
  TUMBLE_END(eventTime, INTERVAL '60' SECOND) AS window_end
FROM raw_vitals
WHERE heart_rate_bpm > 100
GROUP BY
  TUMBLE(eventTime, INTERVAL '60' SECOND),
  patient_id
HAVING COUNT(*) >= 5 AND AVG(heart_rate_bpm) > 110
```

**Alert Deduplication:** To solve the alert storm problem, the team implemented a **Flink CEP (Complex Event Processing)** pattern that maintained a per-patient alert state and suppressed duplicate alerts within a configurable cooldown window (default: 300 seconds). Alerts were written to the `alert-events` Kafka topic with their original PatientWindow hash, enabling downstream systems to deduplicate across the pipeline.
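The cooldown suppression described above can be illustrated outside Flink. The following is a minimal Python sketch, not Medflow's Flink CEP job: the class name `CooldownDeduplicator`, the `(patient_id, rule_id)` state key, and the in-memory dictionary are all assumptions made for illustration. Only the 300-second default comes from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class CooldownDeduplicator:
    """Suppress repeat alerts for the same (patient, rule) inside a cooldown window."""

    cooldown_ms: int = 300_000  # 300-second default mentioned in the article
    # Hypothetical state layout: time of the last emitted alert per (patient_id, rule_id).
    _last_emitted: Dict[Tuple[str, str], int] = field(default_factory=dict)

    def should_emit(self, patient_id: str, rule_id: str, event_time_ms: int) -> bool:
        key = (patient_id, rule_id)
        last = self._last_emitted.get(key)
        if last is not None and event_time_ms - last < self.cooldown_ms:
            return False  # still cooling down: suppress the duplicate
        self._last_emitted[key] = event_time_ms
        return True


dedup = CooldownDeduplicator()
decisions = [
    dedup.should_emit("p-1", "tachycardia", 0),        # first alert for p-1
    dedup.should_emit("p-1", "tachycardia", 60_000),   # repeat inside the cooldown
    dedup.should_emit("p-2", "tachycardia", 60_000),   # different patient
    dedup.should_emit("p-1", "tachycardia", 300_000),  # cooldown has elapsed
]
print(decisions)  # [True, False, True, True]
```

In this sketch the cooldown is measured from the last alert that was actually emitted, so a steady stream of duplicates cannot extend the suppression window indefinitely. A production job would keep this state in Flink keyed state rather than a Python dictionary.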
**State Size at Scale:** At peak load (47 hospitals, 1.2M events/sec), the Flink cluster maintained approximately 28 GB of managed keyed state across its 4 TaskManagers, well within the capacity of EMR-managed instance groups.

---

### Phase 3 (Weeks 8–10): Canonical Data Layer and Observability

**Canonical Record Service:** Rather than forcing stream processors to write to both legacy backends, the team built a new "Canonical Record Service", a Python FastAPI service backed by Amazon Aurora PostgreSQL. The service maintained a single unified patient state table that resolved the historical multi-backend divergence. FastAPI was chosen for its ASGI-based async performance and Pydantic data validation layer, which reduced schema-integrity bugs by an estimated 60% compared to a traditional Flask or Express equivalent.

**Observability Stack Overhaul:** The team replaced the existing CloudWatch-and-spreadsheet approach with a **three-pillar observability stack**:

- **Metrics (Prometheus + Grafana):** 47 custom dashboards covering device ingestion rate, alert fire rate, and pipeline lag with 10-second granularity.
- **Logs (ELK Stack: Elasticsearch + Logstash + Kibana):** All logs tagged with `hospital_id`, `patient_id`, and `event_type` for targeted search and temporal correlation.
- **Traces (OpenTelemetry + Jaeger):** Distributed traces for every vital event from device to alert console, enabling per-patient trace lookup in under 50 ms.

"We spent almost as much time on observability as we did on the pipeline itself," said platform engineer Rachel Kim. "You cannot optimize what you cannot measure correctly. There was a period in week 7 where we had no visibility at all, and it was the weakest point of the project. We learned that the hard way."

**Canary Deployments:** Each Kafka topic migration used a canary approach: the team ran the old and new pipeline in parallel for a single hospital for 48 hours, comparing alert outcomes before promoting globally.
The approach caught two production-grade bugs: a memory leak in the Flink window processor and a race condition in the Device Gateway's gRPC path that corrupted sequence numbers during high-latency bursts.

---

### Phase 4 (Weeks 11–12): Rollout, Validation, and Go-Live

**Blue/Green Rollout with Zero Downtime:** Rather than a single maintenance window, the team executed a two-phase blue/green deployment across Medflow's 47 hospital deployments:

- **Phase 1 (Week 11):** 15 non-critical hospitals migrated to the new pipeline during overnight low-traffic hours (02:00–06:00 IST). The team ran a 72-hour parallel run against the old pipeline for all patient alert outcomes.
- **Phase 2 (Week 12, Days 1–4):** Remaining 32 hospitals migrated in batches of 4–5 per 4-hour window, with on-call engineers monitoring Prometheus alerting in real time.

**Clinical End-user Validation:** Before go-live, the team ran a 2-day validation with a clinical advisory panel of 8 practicing nurses. The panel tested 22 realistic alert scenarios (tachycardia, bradycardia, hypoxemia, hypotension) against both the old and new pipeline. "The nurses said they couldn't tell the difference in alert quality, which was the goal," said Mehta. "They just wanted reliable, timely alerts."

**Go-Live Moment:** At 08:30 IST on April 18, 2026, the team cut over the final hospital, flipped the old monolith's kill switch, and opened a champagne bottle. The streaming pipeline immediately scaled to 1.17M events/second across 47 hospitals with an average end-to-end latency of **1.4 seconds**, below the 2-second project target.
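The canary and Phase 1 parallel runs both relied on comparing alert outcomes between the old and new pipelines. The article does not say how that comparison was done. A minimal sketch, assuming each alert can be identified by a hypothetical `(patient_id, rule_id, window_end_ms)` key and that both pipelines emit comparable keys, might look like this:

```python
from typing import Iterable, NamedTuple, Set


class Alert(NamedTuple):
    # Hypothetical identity of an alert for comparison purposes.
    patient_id: str
    rule_id: str
    window_end_ms: int


def compare_runs(old: Iterable[Alert], new: Iterable[Alert]) -> dict:
    """Summarize how the alerts from two pipelines overlap."""
    old_set: Set[Alert] = set(old)  # sets also collapse duplicate alerts
    new_set: Set[Alert] = set(new)
    matched = old_set & new_set
    union = old_set | new_set
    return {
        "matched": len(matched),
        "missing_in_new": sorted(old_set - new_set),
        "extra_in_new": sorted(new_set - old_set),
        # Share of all distinct alerts that both pipelines produced.
        "agreement": len(matched) / len(union) if union else 1.0,
    }


old_alerts = [Alert("p-1", "tachycardia", 60_000), Alert("p-2", "hypoxemia", 60_000)]
new_alerts = [Alert("p-1", "tachycardia", 60_000), Alert("p-3", "bradycardia", 120_000)]
report = compare_runs(old_alerts, new_alerts)
print(report["matched"], report["missing_in_new"], report["extra_in_new"])
```

Any entry in `missing_in_new` is an alert the legacy pipeline fired that the new one did not, which is the case a clinical reviewer would likely want to inspect first. Because the legacy system was prone to duplicate alerts, using sets here deliberately ignores repeat counts.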
---

## Results

### Quantitative Outcomes

The metrics speak for themselves:

| Metric | Pre-Rebuild | Post-Rebuild | Change |
|--------|-------------|--------------|--------|
| End-to-end latency | 18.2s | 1.4s | **–92.3%** |
| Data reliability (source agreement) | 71.8% | 99.98% | **+28.2 pp** |
| Uptime (rolling 30-day) | 96.4% | 99.97% | **+3.57 pp** |
| Alert precision (false-positive rate) | 27% | 1.6% | **–25.4 pp** |
| Hospital onboarding time | 3 months | 12 days | **about –87%** |
| SLA breach rate | 6/month | 0 | **–100%** |
| Customer penalties paid | $240K/quarter | $0 | **–100%** |
| Peak events processed | 320K/sec | 1.17M/sec | **+266%** |

### Qualitative Feedback

Within 30 days of go-live, Medflow's Net Promoter Score among hospital partners jumped from **34 to 78**, one of the largest quarter-over-quarter improvements in the company's history. Two hospitals cited improved alert response as a contributing factor in reduced cardiac arrest response times in their internal Q2 performance reports.

"Nurses told us that they trust the system again," Priya Ramanujan shared in the Q2 all-hands. "For a platform whose job is patient safety, that is everything."
---

## Deployment & Architecture Snapshot

### Technology Stack

| Layer | Technology | Rationale |
|-------|-----------|----------|
| Device Gateway | Go + gRPC + Protobuf | High-throughput, low-latency binary serialization |
| Streaming Backbone | Apache Kafka (Confluent on AWS MSK) | Exactly-once semantics, 30-day retention, managed scale |
| Stream Processing | Apache Flink on AWS EMR | Native state management, CEP support, SQL mapping |
| Canonical Data Store | Amazon Aurora PostgreSQL | Single source of truth, ACID guarantees |
| API Layer | FastAPI (Python 3.12) | Async performance, Pydantic schema validation |
| Alert Deduplication CEP | Flink CEP + Kafka Streams | Cooldown window, cross-partition deduplication |
| Observability | Prometheus + Grafana + ELK Stack + Jaeger | 3-pillar coverage, 10-second Prometheus scraping |
| IaC / Deployment | Terraform + GitHub Actions CI/CD | Immutable infrastructure, automated rollouts |

### Key Architecture Diagram

![Medflow Real-Time Pipeline Architecture](https://images.unsplash.com/photo-1558494949-ef010cbdcc31?w=1600&auto=format&fit=crop&q=80)

> The new Kafka-backed event pipeline routes vital events from hospital IoT devices through Device Gateways (Go + gRPC), into partitioned Kafka topics, processed by Apache Flink in real time, and delivered to hospital dashboards via FastAPI with a single-canonical-record Aurora datastore, delivering 99.98% reliability and sub-2-second end-to-end latency across 47 hospitals.

---

## Lessons Learned

### 1. Penalize "API-First" Thinking When Latency Dominates

When every millisecond matters, the REST client-server pattern (reliable, well-understood, but inherently synchronous) becomes a liability. Moving to an event-first, fire-and-forget ingestion model backed by a persistent, replayable log was the single most impactful architectural decision of the rebuild.

**Lesson:** If your SLA is measured in seconds, treat every synchronous call as a debt you are accumulating.

---

### 2. Schema Registry Is Worth 3 Weeks of Effort Even for "Small Changes"

The team initially viewed schema work as a "phase 2" concern and moved straight into building stream processors. Within days, the lack of schema enforcement had caused two production-grade bugs in Flink window expressions, one of which caused a patient's vital history to be silently split across different schema versions. Fixing the schema registry first, even at the cost of schedule, saved an estimated 3–4 weeks of post-launch debugging.

**Lesson:** Schema disagreement is the hidden source of most distributed systems bugs. Invest early.

---

### 3. Observability Is Architecture

During week 7, in between the Kafka migration and the Flink cluster deployment, the team had no insight into pipeline performance at all. That week represented the project's largest delay window. "We were flying blind," said Rachel Kim. "It is impossible to have confidence in a rebuild you cannot watch working."

The team's final observability stack (Prometheus for high-frequency metrics, Jaeger for traces, and ELK for log aggregation) was added after the streaming architecture was substantially built. But the lesson took hold: observability is not a feature; it is architecture.

**Lesson:** Prometheus scraping intervals, Jaeger trace instrumentation, and log envelope structure are as important to design as your Kafka topic partitions.

---

### 4. Stateful Stream Processing Requires Rigid Partitioning Discipline

Flink's keyed state is **partition-key dependent**. A single mis-partitioned event (e.g., a null patient ID) could cause Flink to drop a patient's entire 30-day sliding window state into an unknown keyed state, making it unrecoverable. The team implemented strict validation at the Device Gateway that rejected malformed events before they entered Kafka, and added a dedicated "dead-letter" Kafka topic where all invalid events were captured for later schema analysis.
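The gateway-side rule described here (reject malformed events, park them in a dead-letter topic, and key valid events by patient ID) can be sketched as follows. This is an illustration in Python rather than the gateway's Go code. The partition count, the `vitals-dead-letter` topic name, the specific validation checks, and the use of CRC32 in place of the Kafka producer's real partitioner are all assumptions.

```python
import zlib
from typing import Optional, Tuple

NUM_PARTITIONS = 12                       # hypothetical partition count
DEAD_LETTER_TOPIC = "vitals-dead-letter"  # hypothetical topic name


def validate(event: dict) -> Optional[str]:
    """Return a rejection reason, or None when the event is well formed."""
    if not event.get("patient_id"):
        return "missing patient_id"
    ts = event.get("timestamp_unix_ms")
    if not isinstance(ts, int) or ts <= 0:
        return "invalid timestamp_unix_ms"
    return None


def route(event: dict) -> Tuple[str, Optional[int], Optional[str]]:
    """Return (topic, partition, rejection_reason) for one event."""
    reason = validate(event)
    if reason is not None:
        # Malformed events never reach the keyed topic.
        return DEAD_LETTER_TOPIC, None, reason
    # A stable hash of the key sends every event for one patient to the same
    # partition. CRC32 only stands in for the producer's partitioner here.
    partition = zlib.crc32(event["patient_id"].encode("utf-8")) % NUM_PARTITIONS
    return "raw-vitals", partition, None


good = {"patient_id": "p-1", "timestamp_unix_ms": 1_776_000_000_000}
bad = {"patient_id": None, "timestamp_unix_ms": 1_776_000_000_000}
print(route(good)[0])  # raw-vitals
print(route(bad))      # ('vitals-dead-letter', None, 'missing patient_id')
```

Keeping the rejection reason alongside the dead-lettered event is one way to support the later schema analysis the team mentions.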
**Lesson:** In stateful streaming systems, the cheapest fix is refusing malformed events at the edge rather than attempting risky repairs downstream.

---

### 5. The On-Ramp Is as Hard as the Core Architecture

The Infrastructure-as-Code approach (Terraform modules, environment-specific overlays, automated pipeline deploys) transformed hospital onboarding from a 3-month bespoke manual process to a **12-day Terraform-applied operation**. Without that on-ramp simplification, the new pipeline would have been limited in its reach. The clinical value was in scale, not just in latency.

**Lesson:** The ease of adding the 48th hospital is as important as the performance of the first 47.

---

## Conclusion

Medflow's rebuild of its real-time patient vital monitoring platform proved that a health-tech startup, even one operating under strict clinical and contractual constraints, could execute a disciplined, performance-obsessed platform rebuild in a compressed timeline without sacrificing reliability.

The 91% reduction in end-to-end latency, 99.98% data reliability, and the restoration of clinical trust are not the product of a single choice. They are the result of every decision (event-driven architecture, Kafka with 30-day retention, schema-first design, Flink stateful processors, Prometheus-level observability) reinforcing each other.

Ramanujan's team is now building the next phase: a reservoir-sampling-based anomaly detection model trained on 18 months of canonical vital data, running as a secondary Flink job alongside the existing alert pipeline, with a 3-month live A/B test to validate the clinical signal-to-noise ratio before full rollout.

"We thought this rebuild would take six months. We did it in 12 weeks," she said. "We underestimated the difficulty. We overestimated the cost. And that is a formula I hope every engineering team gets to apply to their own hardest problems."
---

**Tags:** real-time-streaming, apache-kafka, apache-flink, fintech-infrastructure, event-driven-architecture, observability, platform-engineering, health-tech
