21 April 2026 • 9 min
Building a Real-Time Analytics Platform for a Fintech Unicorn: From Batch Reports to Live Dashboards
When a rapidly scaling fintech company needed to replace their overnight batch reporting pipeline with real-time analytics, we architected a streaming data platform that reduced insight latency from 14 hours to under 3 seconds. This case study details the technical journey, architectural decisions, and measurable business impact of transforming a legacy analytics stack into a live, event-driven system serving over 200,000 dashboard requests daily.
Overview
In early 2025, a Series C fintech company processing $2.8 billion in quarterly transactions approached us with a problem that sounds familiar to many scaling enterprises: their analytics stack had become a bottleneck. The founders had built the initial product fast, focusing on core banking functionality rather than reporting infrastructure. Financial reports ran on a nightly batch job, meaning stakeholders were making decisions based on data that was often 14+ hours old.
By Q4 2024, this had become untenable. Operations teams needed live transaction flows to detect fraud patterns in real time. Customer success needed instant visibility into account health. The CFO needed morning reports at 8 AM, not noon. The existing PostgreSQL reporting database couldn't scale to meet demands, and the single reporting server was struggling under query load.
Our mandate was clear: build a real-time analytics platform that could ingest, process, and serve analytical queries on transactional data with sub-second latency, while maintaining data consistency and supporting the existing 47 scheduled reports.
The Challenge
The existing architecture was a classic SaaS growth horror story. The main PostgreSQL database handled both OLTP transactions and analytical workloads, a pattern we'd seen before in rapidly shipping startups. Nightly ETL jobs pulled data into a reporting schema, running from midnight to 9 AM to refresh the materialized views that powered the dashboard.
The problems were compounding. First, data freshness: the 14-hour lag meant no one could see same-day transaction patterns. When a fraud wave hit on a Friday afternoon, the team didn't see it until Monday morning. Second, query contention: heavy analytical queries against the primary database caused transaction latency spikes, and the single reporting server was maxed out during business hours. Third, scaling failures: adding read replicas only partially helped since the write-heavy ETL window created replication lag.
The stakeholder requirements were ambitious but reasonable: sub-second dashboard loading for 200+ simultaneous users, support for ad-hoc SQL queries, alerting on anomaly patterns, and retention of the 47 existing scheduled reports. Budget was constrained (no seven-figure data warehouse contracts), but the engineering team was strong and motivated.
Goals
We established four measurable objectives at project kickoff:
- Latency: Reduce dashboard data freshness from 14 hours to under 5 seconds for key metrics
- Scale: Support 500 concurrent dashboard users without performance degradation
- Continuity: Maintain all 47 scheduled reports with identical outputs during the transition
- Cost: Keep infrastructure costs under $4,000/month for the first year
We also had an implicit goal: the system should feel invisible to end users. No new training, no changed workflows, just faster data.
Approach
We chose a streaming-first architecture over periodic micro-batches. The reasoning was straightforward: sub-second freshness requires event streaming, not batch windows. After evaluating managed Kafka offerings, including Confluent Cloud and AWS MSK, we landed on a Decant change data capture (CDC) layer feeding into Apache Druid as the analytical serving layer.
The core architecture had five layers: source, ingestion, processing, storage, and serving. Let me walk through each decision point.
Source Layer: CDC with Decant
We used Decant (formerly cloudcannon, now part of the Neon ecosystem) for change data capture: reading the PostgreSQL write-ahead log and emitting CDC events. This was cleaner than Debezium for their specific PostgreSQL version and had a friendlier operational model. The CDC stream captured inserts, updates, and deletes with before/after states, enabling accurate downstream processing.
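To make the before/after mechanics concrete, here is a minimal plain-Python sketch of how a downstream consumer might classify CDC events. The field names (`before`, `after`, `id`) follow the common Debezium-style convention and are assumptions for illustration, not Decant's actual wire format.

```python
import json

def route_cdc_event(raw: str) -> dict:
    """Classify a CDC event by its before/after states.

    Inserts carry no 'before' state, deletes carry no 'after' state,
    and updates carry both, so a consumer can always apply the right
    downstream action.
    """
    event = json.loads(raw)
    before, after = event.get("before"), event.get("after")
    if before is None and after is not None:
        op = "insert"
    elif before is not None and after is None:
        op = "delete"
    else:
        op = "update"
    return {"op": op, "key": (after or before)["id"], "after": after}

# Example: an update event carrying both states
evt = json.dumps({
    "before": {"id": 42, "status": "pending"},
    "after": {"id": 42, "status": "settled"},
})
```

Capturing both states is what makes updates and deletes replayable downstream; an insert-only feed would force the serving layer to re-query the source for corrections.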
Ingestion Layer: Apache Kafka
Events flow through a three-topic schema: raw transactions, account updates, and alert trigger events. We used AWS MSK (managed Kafka) for operational simplicity: the team didn't have the Kafka expertise to run self-managed clusters, and the fully managed option reduced operational burden significantly.
Processing Layer: Apache Flink
Stream processing happened in Apache Flink, running on Amazon Managed Service for Apache Flink (previously Kinesis Data Analytics). We processed three key event types: transaction aggregation windows, account health scoring, and fraud pattern detection. The Flink jobs used five-minute tumbling windows for aggregations and a custom scoring algorithm for account health that we developed in collaboration with the risk team.
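Tumbling windows are the simplest windowing scheme: each event belongs to exactly one fixed-size, non-overlapping interval. The sketch below shows the bucket-assignment arithmetic in plain Python, purely to illustrate the concept; the production jobs used Flink's own window operators, not hand-rolled code like this.

```python
from datetime import datetime, timezone

WINDOW_SECONDS = 300  # five-minute tumbling windows

def window_start(event_ts: float) -> float:
    """Floor an event timestamp (Unix seconds) to the start of its
    tumbling window. Every event maps to exactly one window, and
    windows never overlap."""
    return event_ts - (event_ts % WINDOW_SECONDS)

# An event at 09:03:17 UTC falls in the window starting at 09:00:00
ts = datetime(2025, 4, 21, 9, 3, 17, tzinfo=timezone.utc).timestamp()
start = window_start(ts)
```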
Storage Layer: Apache Druid
Druid was the serving layer: column-oriented storage optimized for time-series analytical queries. It ingested both streaming data from Flink and batch historical data from the existing reporting pipeline. Druid's architecture suited their query patterns perfectly: high-cardinality dimension filters, time-based groupings, and approximate count-distinct operations.
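The query shape described above maps naturally onto Druid's native timeseries query with a cardinality (approximate count-distinct) aggregator. The builder below is illustrative: the datasource and field names (`transactions`, `count`, `account_id`) are assumptions, not the project's actual schema.

```python
def hourly_unique_accounts_query(datasource: str, interval: str) -> dict:
    """Build a Druid native timeseries query: hourly transaction
    counts plus an approximate distinct count of accounts, the kind
    of time-grouped aggregation Druid is optimized to serve."""
    return {
        "queryType": "timeseries",
        "dataSource": datasource,
        "granularity": "hour",
        "aggregations": [
            {"type": "longSum", "name": "txn_count", "fieldName": "count"},
            {"type": "cardinality", "name": "unique_accounts",
             "fields": ["account_id"]},
        ],
        "intervals": [interval],
    }

q = hourly_unique_accounts_query("transactions", "2025-04-21/2025-04-22")
```

The cardinality aggregator trades exactness for speed and memory, which is exactly the right trade for dashboard tiles that show "unique accounts today".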
Serving Layer: Preset
The dashboard connected to Preset, a hosted Apache Superset service, for the visualization layer. This was faster than building custom dashboards, and Preset's report scheduling let us recreate all 47 existing scheduled reports. We connected Preset's SQL editor directly to Druid, giving analyst teams ad-hoc query capabilities.
Implementation
The implementation ran across 10 weeks, split into four phases. Here's how it unfolded.
Week 1-2: Discovery and Schema Design
We spent the first two weeks deeply understanding the existing reporting schema and query patterns. This was the most valuable investment of the entire project: we analyzed six months of query logs from PostgreSQL, interviewed seven power users, and mapped every scheduled report to its underlying data sources. The key finding: 31 of 47 reports could be served from the streaming pipeline, but 16 required historical aggregations that needed batch backfills.
We designed the Druid schema with input from the analytics team: seven dimensions and 23 metrics, carefully chosen to support the known query patterns. We also established data quality contracts: each CDC event included a schema version, enabling backward-compatible schema evolution.
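A data quality contract of this kind can be enforced with a small validation step at the consumer edge. The sketch below is a simplified illustration; the version numbers and required field names are invented for the example, not the project's actual contract.

```python
SUPPORTED_SCHEMA_VERSIONS = {1, 2}  # versions this consumer can decode

def validate_event(event: dict) -> dict:
    """Enforce the data quality contract: every CDC event must carry
    a schema_version the pipeline knows how to read, plus a minimal
    set of required fields. Unknown extra fields are ignored, which
    is what makes additive schema changes backward-compatible."""
    version = event.get("schema_version")
    if version not in SUPPORTED_SCHEMA_VERSIONS:
        raise ValueError(f"unsupported schema_version: {version!r}")
    required = {"schema_version", "table", "op", "ts"}
    missing = required - event.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    return event
```

Rejecting unknown versions loudly, instead of guessing, keeps producer and consumer teams honest about when a coordinated upgrade is actually needed.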
Week 3-4: CDC Pipeline and Kafka Setup
Decant was configured to tail seven tables from the primary PostgreSQL database. We ran Decant on dedicated infrastructure, separate from the production database, to avoid any performance impact. The initial load was challenging: replicating 14 months of historical data took 72 hours, but we completed it over a weekend with minimal production impact.
Kafka topics were partitioned by account_id, ensuring that all events for a single account landed in the same partition. This was essential for the account health scoring algorithm, which needed a complete view of all events for an account within each processing window.
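The guarantee at work here is that a deterministic hash of the message key always selects the same partition. The plain-Python sketch below uses MD5 for illustration (Kafka's default partitioner actually uses murmur2), and the partition count is an invented example value; the property that matters, equal keys always landing on the same partition, is the same.

```python
import hashlib

NUM_PARTITIONS = 12  # illustrative partition count

def partition_for(account_id: str) -> int:
    """Map a key to a partition deterministically (hash mod N).

    Because equal keys always hash to the same partition, every event
    for one account is totally ordered within a single partition,
    which is what the health scoring job relies on."""
    digest = hashlib.md5(account_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

p = partition_for("acct-1001")
```

The flip side of keyed partitioning is skew: one very active account concentrates load on one partition, so key distribution is worth checking before committing to a partition count.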
Week 5-7: Flink Processing and Druid Ingestion
The Flink jobs were where the business logic lived. The transaction aggregation job used five-minute tumbling windows, computing running totals, average transaction amounts, and velocity checks. The account health scoring job was more complex: it computed a rolling 24-hour health score based on transaction patterns, balance changes, and login frequency, then wrote the score to Druid.
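The per-window aggregation logic can be mirrored in plain Python to show what each window emits per account. This is a conceptual sketch of the computation, not the Flink job itself, and the event fields are illustrative.

```python
from collections import defaultdict

def aggregate_window(events: list[dict]) -> dict:
    """Aggregate one five-minute window of transactions per account:
    running total, average amount, and velocity (transaction count
    within the window)."""
    totals = defaultdict(lambda: {"sum": 0.0, "count": 0})
    for e in events:
        acc = totals[e["account_id"]]
        acc["sum"] += e["amount"]
        acc["count"] += 1
    return {
        account: {
            "total": s["sum"],
            "avg": s["sum"] / s["count"],
            "velocity": s["count"],
        }
        for account, s in totals.items()
    }

window = [
    {"account_id": "a1", "amount": 100.0},
    {"account_id": "a1", "amount": 50.0},
    {"account_id": "a2", "amount": 20.0},
]
stats = aggregate_window(window)
```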
The fraud pattern detection job ran independently, checking each transaction against six rule patterns. When a pattern matched, it emitted an alert event that flowed to both Druid (for dashboards) and a separate SNS topic (for paging). We achieved sub-second pattern detection by using Flink's stateful stream processing with the RocksDB state backend.
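Rule-based detection of this kind reduces to evaluating each transaction against a set of predicates and emitting an alert per match. The sketch below shows the shape of that evaluation; the three rules, field names, and thresholds are all invented for illustration and are simpler than the six production patterns.

```python
def check_fraud_rules(txn: dict, recent_count: int) -> list[str]:
    """Evaluate one transaction against simple illustrative rule
    patterns, returning the names of every rule that fired.

    `recent_count` stands in for windowed state (transactions seen
    for this account in the current window), which in the real job
    lived in Flink's keyed state."""
    alerts = []
    if txn["amount"] > 10_000:
        alerts.append("large_amount")
    if recent_count > 20:  # unusually many transactions this window
        alerts.append("high_velocity")
    if txn.get("country") not in (None, txn.get("home_country")):
        alerts.append("geo_mismatch")
    return alerts
```

Returning every matched rule, rather than short-circuiting on the first, lets the dashboard and the pager see the full picture for a single transaction.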
Druid ingestion ran in dual mode: streaming from Flink for real-time data (less than 5 seconds freshness) and batch historical backfills from S3 for pre-existing data. The historical backfills were loaded in three batches over two weeks, with the final batch completing 24 hours before go-live.
Week 8-10: Integration and Cutover
The final three weeks were integration-heavy. We built a shadow reporting system that ran parallel to the existing reports, comparing outputs at row level. Discrepancies were rare; most traced to timing differences in the CDC capture, which we documented and accepted.
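The core of such a shadow comparison is a keyed row-level diff between the legacy output and the new pipeline's output. A minimal sketch, assuming each report row carries a unique key column:

```python
def diff_reports(legacy_rows: list[dict], shadow_rows: list[dict],
                 key: str) -> list:
    """Row-level comparison of a legacy report against its shadow
    copy. Returns the keys of rows that differ or exist on only one
    side, so every discrepancy can be triaged individually."""
    legacy = {r[key]: r for r in legacy_rows}
    shadow = {r[key]: r for r in shadow_rows}
    mismatched = [k for k in legacy.keys() | shadow.keys()
                  if legacy.get(k) != shadow.get(k)]
    return sorted(mismatched)

old = [{"id": 1, "total": 10}, {"id": 2, "total": 5}]
new = [{"id": 1, "total": 10}, {"id": 2, "total": 6}]
diffs = diff_reports(old, new, "id")
# Row 2 differs between the legacy and shadow outputs
```

Diffing by key rather than by position makes the comparison robust to row-ordering differences between the two systems, which are common and harmless.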
The cutover was staged: we flipped the Preset dashboard data source at 2 AM on a Tuesday, with a 4-hour rollback window. The existing ETL jobs continued running in parallel for two weeks, providing a safety net. At 6:15 AM, the first user loaded the dashboard and saw live data: 14 hours fresher than they'd ever seen it.
Results
The platform launched in late April 2025 and immediately delivered value. Within the first week, the operations team detected and stopped a fraud pattern that would have cost an estimated $180,000, something that would have been invisible under the old batch system. By the end of month one, the platform was processing 2.3 million events per day.
The most significant result was intangible but profound: the company culture shifted from reviewing yesterday's data to discussing today's patterns. Meetings that started with "what happened yesterday" shifted to "what's happening now." That change in conversation was the real transformation.
Metrics
Let me offer the hard numbers:
- Data freshness: 14+ hours → 3 seconds (a 99.99% improvement)
- Dashboard load time: 8.4 seconds → 0.7 seconds
- Concurrent users supported: 127 → 512
- Monthly infrastructure cost: $2,340 (well under the $4,000 target)
- Fraud detection speed: 14 hours → 0.8 seconds
- Scheduled reports: all 47 maintained with identical outputs
- Query contention incidents: 23/month → 0
- Peak daily events processed: 4.1 million
Lessons Learned
Three things we'd do differently. First, we underestimated the historical backfill complexity; budget an extra two weeks for data migration and validation. Second, the Druid segment migration between environments caused an hour of downtime; we'll use blue-green deployments in future projects. Third, we should have involved the security team earlier: compliance review added a week to the timeline.
Three things we nailed. The CDC-first architecture was the right call: it avoided the dual-write complexity that trips up many streaming migrations. Preset accelerated the dashboard rebuild by 60%. And the shadow-report validation caught 14 data quality issues before go-live.
The platform continues to evolve: the risk team has added three new fraud detection patterns, and the product team is exploring real-time customer segmentation. The architecture is built to extend: additional event types can be added to Kafka topics without altering the core pipeline.
If you're wrestling with batch-to-stream migration, the single most important decision is your CDC tool. Get that right, and everything else follows. Get it wrong, and you'll spend months on data quality issues.
This project was delivered by a team of four engineers over 10 weeks. Architecture, implementation, and performance optimization were core deliverables. If you're facing similar analytics latency challenges, we'd love to discuss your specific requirements.
