# Scaling for a Million Users: How Telora Finance Cut Latency by 62% and Doubled Daily Engagement

In early 2024, Telora Finance was growing faster than its infrastructure could keep up. A 12-person engineering team inherited a monolithic backend that served 180,000 monthly active users but struggled during market peaks. This case study walks through the architectural decisions, phased migration strategy, and monitoring overhaul that brought latency down from 420 ms to 160 ms, reduced error rates from 2.4% to 0.3%, and raised daily active users from 68,000 to 142,000 in three months.

## Overview

When Telora Finance’s CTO Priya Raman first reached out to Webskyne in January 2024, the numbers were alarming but not unusual for a well-funded fintech that had outgrown its first codebase. The flagship mobile trading platform was handling roughly 180,000 monthly active users, yet every peak-hour dashboard load averaged 420 milliseconds, an eternity in a market where latency directly affects trade execution and customer trust.

The backend, a Node.js monolith backed by a single sharded MongoDB cluster, had served the company faithfully since Series A. But Radio-frequency identification, A/B testing feature flags, and real-time quote streaming had turned a clean architecture into an entangled dependency graph.

Over eight weeks, a joint team of four Webskyne engineers and Telora’s lead backend squad executed a phased replatforming: introducing a read-optimized API layer, decoupling streaming from transactional writes, and replacing a brittle caching strategy with a hierarchical invalidation model. The result wasn’t just faster APIs: daily active users rose from 68,000 to 142,000, and revenue-impacting errors dropped by an order of magnitude.

## Challenge

The symptoms were obvious to anyone watching error dashboards: latency p99 climbed from 230 ms to 410 ms between September 2023 and January 2024. The alert that finally triggered executive action wasn’t latency, though; it was a 17-minute outage on the equities watchlist page during a quarterly earnings announcement.

The root-cause analysis revealed a cascading failure: the monolith’s synchronous quote-publishing path held a mutex across three unrelated modules, and under just six thousand concurrent connections, thread starvation spilled into the authentication service, making the entire platform read-only for a third of the team.

But the technical debt was only one layer of the problem.
Telora’s product team had grown to twenty-three people, shipping new features on a weekly cadence, yet every deployment required a full regression run and a 45-minute maintenance window. Rolling deployments were impossible because the monolith bundled risk-engine scoring, market-data subscriptions, and user-notification triage in the same process. A single bad query in the notification service could bring down trade execution.

There was also the data-model problem. The original MongoDB schema had been designed for a simple watchlist experience. By 2024, Telora had added portfolio-level risk scoring, AI-driven watchlist recommendations, regulatory audit logging, and a social feed that let users follow other traders. All of these lived in the same collections, with partial indexes and duplicated fields, making even simple queries unpredictably slow.

## Goals

The project brief laid out three business goals, each tied to measurable metrics:

1. **Latency:** Reduce p95 API latency from 420 ms to under 180 ms during peak load (8,000+ concurrent users). The target was chosen after reviewing the latency distribution of top-tier competitors and benchmarking what the frontend could tolerate before frame drops exceeded 2% on mid-range Android devices.
2. **Reliability:** Lower the error rate (defined as 5xx responses and failed WebSocket connections) from 2.4% to under 0.5%. This mattered because Telora’s compliance team had documented that every failed trade request required a 48-hour manual review, costing an estimated $180 in operational overhead per incident.
3. **Engagement:** Increase daily active users (DAU) from 68,000 to 100,000 within 90 days. Product research showed that users with session durations under three minutes dropped off after two weeks; faster responses were hypothesized to extend session length.
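
To make the first two targets concrete, here is a minimal sketch of how a monitoring window could be checked against them. It is an illustration only, not Telora’s or Webskyne’s tooling: the `WindowStats` type, the nearest-rank percentile method, and the choice of total requests as the error-rate denominator are all assumptions.

```python
import math
from dataclasses import dataclass
from typing import List

# Targets taken from the project brief.
P95_TARGET_MS = 180.0
ERROR_RATE_TARGET = 0.005  # 0.5%


def percentile_nearest_rank(samples: List[float], pct: int) -> float:
    """Return the pct-th percentile using the nearest-rank method."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


@dataclass
class WindowStats:
    """Hypothetical container for one peak-load measurement window."""
    latencies_ms: List[float]
    total_requests: int   # assumed denominator: HTTP requests plus WS attempts
    http_5xx: int
    failed_ws: int        # failed WebSocket connections

    def error_rate(self) -> float:
        return (self.http_5xx + self.failed_ws) / self.total_requests

    def meets_goals(self) -> bool:
        p95 = percentile_nearest_rank(self.latencies_ms, 95)
        return p95 < P95_TARGET_MS and self.error_rate() < ERROR_RATE_TARGET
```

A window with a p95 of 158 ms and 29 failures per 10,000 requests would pass both checks; the engagement goal is measured over 90 days and is deliberately left out of this per-window check.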

The engineering goals were equally clear but less codified in writing: reduce deployment risk to the point where feature flags and canary releases could happen without a maintenance window, and create monitoring that could answer “why is this query slow?” within five minutes rather than five days.

## Approach

Webskyne’s lead architect, Karthik Nair, recommended a strangler-fig pattern rather than a big-bang rewrite. The existing monolith would remain online while new endpoints were shadowed and gradually redirected. This minimized business risk and let the team validate changes against live traffic. The replatforming was broken into four phases:

**Phase 1: Read-API layer.** The heaviest loads came from watchlist, portfolio summary, and leaderboard queries. These were all read-heavy and idempotent, making them ideal candidates for a dedicated read API backed by materialized views in PostgreSQL. Webskyne built a sync job that streamed MongoDB change events into Postgres using Debezium, then exposed aggregated endpoints behind an API gateway.

**Phase 2: Decouple streaming.** Real-time trade notifications and quote updates had been mixed into the monolith’s HTTP request loop via long-polling fallbacks. This was replaced with a dedicated NATS JetStream microservice that published market data through WebSockets, letting the monolith focus purely on transactional writes.

**Phase 3: Cache invalidation overhaul.** The previous caching strategy used a naive 30-second TTL on all key endpoints, which meant stale prices lingered mid-negotiation and users mistook delayed data for platform bugs. Webskyne implemented an invalidation-on-write model: whenever a price update or portfolio change occurred, the relevant cache keys were evicted immediately rather than waiting for TTL expiry.

**Phase 4: Observability and deployment pipeline.** Datadog APM was configured with custom spans around the new read layer and JetStream consumers.
Deployment was moved to ArgoCD with progressive delivery, enabling canary releases that compared p95 latency and error rate between baseline and candidate pods.

## Implementation

The technical execution ran from late January through mid-April 2024. On the infrastructure side, Telora provisioned a new Amazon EKS cluster, while Webskyne managed the PostgreSQL schema design, Debezium connectors, and NATS JetStream consumer groups.

The schema migration was the most delicate step. Telora’s MongoDB held 1.2 billion documents with no formal migration history. Webskyne wrote an idempotent backfill script in Python that used change-data-capture to reconcile new documents with Postgres, then ran a two-week validation pass comparing record counts, checksums, and query result sets between old and new systems. The script was deliberately conservative: it shipped records in batches of 5,000 and logged any mismatch to a dead-letter collection for manual review.

On the application side, the read API layer was written in TypeScript using tRPC, chosen because Telora’s frontend team was already using React Native and shared type safety between mobile and API reduced integration bugs. API gateway routing rules directed 10% of read traffic to the new layer on day one, stepping up to 100% over four days once error rates remained below 0.2%.

The caching invalidation model required discipline: any service that mutated user data (trade execution, watchlist edits, risk-scoring recalculations) had to emit an `invalidate` event. Webskyne worked with Telora to annotate the domain events already in the monolith’s EventEmitter, then routed those events through a lightweight Redis pub/sub that the new cache layer subscribed to. The average invalidation lag dropped from 28 seconds (old TTL) to 340 milliseconds.

Deployment pipelines were the final piece. Telora’s previous CI job took 47 minutes because it integrated lint, unit tests, and a full staging environment spin-up.
Webskyne split the pipeline: lint and unit tests run in four minutes against a shared Postgres test container, while staging deployments happen only for release candidates. The monolith still receives daily cron-driven regression tests, but feature branches merge when they pass unit tests and a canary runs for two hours in production.

![Server room monitoring dashboard on multiple displays](https://images.unsplash.com/photo-1553877522-43269d4ea984?auto=format&fit=crop&w=1600&q=80)

The photo above shows the kind of real-time visibility the new observability stack enabled: watchlists of p95 latency, JetStream consumer lags, and database replication lag all updating every fifteen seconds.

## Results

By April 18, roughly twelve weeks after kickoff, the metrics told a clear story:

- **p95 API latency** dropped from 420 ms to 158 ms, a 62% improvement.
- **p99 latency**, which matters most for traders on slow networks, fell from 1,120 ms to 420 ms.
- **Error rate** fell from 2.4% to 0.29%, an 88% reduction in failed requests.
- **Daily active users** climbed from 68,000 to 142,000 within the first ten days of the new frontend launching, and stabilized near 138,000 by week five.
- **Mean session duration** increased from 4 minutes 12 seconds to 9 minutes 47 seconds, a 2.3× improvement.
- **Maintenance windows** dropped from a weekly four-hour outage to zero; deployments now occur ad hoc during business hours via canary promotions.
- **Trade execution failure rate**, the metric that had prompted the initial executive review, fell from 1.1% to 0.08%.

The business impact translated directly to revenue: the customer-success team reported a 31% reduction in trade-review tickets, saving roughly 480 engineering hours per month. Telecom analyst coverage of Telora’s Series C noted that the platform reliability upgrade was cited as a key factor in the round closing at a 1.4× premium over original valuation.
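
The headline percentages follow from the standard relative-change formula, (after − before) / before. The short sketch below recomputes them from the before and after values reported in this section; the function and dictionary names are illustrative, and the DAU row uses the stabilized 138,000 figure rather than the 142,000 peak.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


def to_seconds(minutes: int, seconds: int) -> int:
    return minutes * 60 + seconds


# (before, after) pairs as reported in the Results section.
REPORTED = {
    "p95 API latency (ms)": (420, 158),
    "p99 API latency (ms)": (1120, 420),
    "Error rate (%)": (2.4, 0.29),
    "Daily active users": (68_000, 138_000),
    "Mean session duration (s)": (to_seconds(4, 12), to_seconds(9, 47)),
    "Trade execution failures (%)": (1.1, 0.08),
}


def summarize() -> dict:
    """Map each metric name to its percent change, rounded to one decimal."""
    return {
        name: round(percent_change(before, after), 1)
        for name, (before, after) in REPORTED.items()
    }


if __name__ == "__main__":
    for name, change in summarize().items():
        print(f"{name}: {change:+.1f}%")
```

Rounded to whole numbers, these reproduce the 62%, 88%, 103%, and 93% figures quoted for latency, error rate, DAU, and trade failures, and the session-duration ratio of 587 s to 252 s gives the 2.3× multiple.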
## Metrics

| Metric | Before | After | Change |
|--------|--------|-------|--------|
| p95 API latency | 420 ms | 158 ms | −62% |
| p99 API latency | 1,120 ms | 420 ms | −62% |
| Error rate (5xx + failed WS) | 2.4% | 0.29% | −88% |
| Daily active users | 68,000 | 138,000 | +103% |
| Mean session duration | 4m 12s | 9m 47s | +133% |
| Trade execution failures | 1.1% | 0.08% | −93% |
| Weekly maintenance windows | 4 hrs | 0 hrs | −100% |
| Support ticket volume (trade review) | 1,240/mo | 860/mo | −31% |

These numbers were independently verified by Telora’s finance team using Salesforce CRM ticket tags matched against deployment timestamps, eliminating survivorship bias.

## Lessons Learned

The project produced seven durable maxims that Telora’s engineering team still cites in architecture reviews today:

1. **Incremental replatforming beats big-bang rewrites** whenever the existing system is still making money. The shadow-traffic phase caught schema mismatches that would have caused a catastrophic data inconsistency had Telora cut over cold.
2. **Move from TTL-based caching to event-driven invalidation** as soon as read traffic starts serving stale data. The three-week TTL experiment introduced real confusion during earnings season; users genuinely believed the platform was showing wrong prices.
3. **Observability is not a phase; it’s a prerequisite.** The first week after read-traffic cutover generated 200+ latency alerts because the synthesizer query for portfolio summaries ran unindexed table scans. Datadog dashboards caught this in minutes rather than days.
4. **Invest in type safety across service boundaries.** Telora’s frontend engineers found three backend bugs in the first two weeks of tRPC integration that had never been caught by Jasmine tests because the mocked responses had drifted from the real schema.
5. **Schema migrations need explicit rollback plans.** The backfill script’s dead-letter queue saved the team twice when unexpected null values in the legacy audit log caused Postgres upserts to fail silently.
6. **Cache invalidation is hard, but event ordering is harder.** NATS JetStream’s exactly-once semantics prevented two race conditions where a price update could arrive before its corresponding cache-invalidation event, a bug the previous polling-based architecture had masked with its predictable TTL window.
7. **Engineering metrics should ladder directly to business metrics.** Telora’s board cared about DAU and error rate, not p95 latency. Framing the replatforming narrative around revenue-impacting trade failures is what unlocked the engineering headcount and AWS budget that kept the project on schedule.

## Looking Ahead

Twelve months after launch, Telora’s engineering team has extended the pattern to risk scoring and regulatory reporting, previously the most contentious module in the monolith. The same read-API layer now powers the retail web portal, driving an additional 18,000 desktop-lite users from emerging markets. Webskyne remains engaged as an embedded advisor, running quarterly architecture reviews that catch design debt before it accumulates past the point of a one-sprint fix.

---

*Categories: Case Study | Tags: Fintech, Platform Engineering, Performance Optimization, Microservices, AWS, PostgreSQL, Real-Time Systems, Observability*
