Scaling a Real-Time Analytics Dashboard: How We Handled 10x Traffic Growth Without Breaking a Sweat

When a mid-sized SaaS client approached us with a dashboard choking on 50,000 concurrent users, we knew traditional caching wouldn't cut it. This case study walks through our end-to-end approach: from architectural refactoring and edge computing adoption to real-time WebSocket optimization, container orchestration tuning, and multi-tier caching strategies. Over four intense months, we transformed a fragile Node.js dashboard into a resilient platform handling half a million concurrent connections. The result? 99.97% uptime, sub-100ms p95 API latency, and a roughly 4.3x improvement in data freshness (18s of lag down to 4.2s). Along the way, we learned hard lessons about premature optimization, the perils of shared database connections, and why observability isn't optional: it's foundational.

## Overview

In early 2025, Webskyne was tasked with a high-stakes infrastructure overhaul for a SaaS analytics platform serving mid-market e-commerce businesses. The client's real-time dashboard was a critical revenue touchpoint: merchants relied on it to monitor sales velocity, inventory turns, and customer sentiment. What began as a promising product had quietly become a scalability bottleneck. During peak shopping seasons, the dashboard slowed to a crawl, WebSocket connections dropped without recovery, and the operations team found themselves firefighting instead of innovating.

This case study documents the full lifecycle of the engagement: from discovery through implementation, and the measurable outcomes that followed.

## The Challenge

At the start of the engagement, the client's infrastructure looked familiar enough for a Series A SaaS company: a single Kubernetes cluster running a monorepo of Express.js services, a shared PostgreSQL database, and Redis for session caching. On the surface, everything seemed manageable. But beneath that surface, critical problems had accumulated.

First, the real-time dashboard relied on polling: clients requested data every 3 seconds, creating a thundering herd effect during peak hours. Second, WebSocket connections were handled in-process, meaning any memory leak in a socket handler would eventually crash the worker. Third, database queries were largely unindexed, and the connection pool was shared between real-time streams and administrative functions, causing cascading failures when one query ran long. Finally, there was no circuit breaking or graceful degradation. If any downstream dependency hiccuped, the entire dashboard went down.

The client's metrics told a grim story: p95 API response time of 420ms, WebSocket reconnection rate of 18%, and three major outages in the previous quarter alone.
They needed a solution that could handle 500,000 concurrent dashboard sessions with sub-200ms latency and zero data loss during failover.

## Goals

We defined success with four concrete, measurable goals.

- **Performance:** Reduce p95 API latency from 420ms to under 150ms, and p99 to under 250ms.
- **Reliability:** Achieve 99.95% uptime for the dashboard API and WebSocket layer.
- **Scalability:** Support 500,000 concurrent WebSocket connections and 10,000 requests per second without manual intervention.
- **Data Freshness:** Deliver real-time analytics with no more than 5 seconds of lag during normal operations.

These goals were not aspirational; they were contractual. The client's renewal cycle depended on meeting these SLAs, and their board had made a public commitment to investors about platform stability.

## Approach

Rather than greenfield the entire platform, we chose a pragmatic, iterative approach that minimized business risk. Our philosophy was simple: observe, refactor, validate. We would instrument first, make targeted architectural changes, and validate each change against SLA metrics before moving on.

Our technical strategy rested on four pillars:

1. **Edge-First Delivery:** Move real-time data aggregation as close to the client as possible using edge computing and regional WebSocket gateways.
2. **Event-Driven Architecture:** Replace polling with a proper pub/sub model using Redis Streams and Kafka for durable event ordering.
3. **Database Isolation:** Separate real-time read workloads from transactional write workloads using read replicas and materialized views.
4. **Observability-Led Development:** Instrument every component with distributed traces, custom metrics, and structured logging before optimizing.

We deliberately avoided over-engineering. Early workshops surfaced a bias toward microservices at all costs, but our analysis showed that a well-structured monolith with clear bounded contexts would serve them better at their stage.
We convinced the client to delay a full microservices migration and instead focus on process isolation and modular boundaries.

## Implementation

The implementation spanned four months and four distinct phases.

### Phase 1: Observability and Baseline

Before changing a single line of production code, we deployed OpenTelemetry collectors across the cluster, instrumented every service with tracing spans, and set up custom dashboards in Grafana. We wanted to know exactly where time was spent, where connections failed, and which queries caused the most contention.

Within two weeks, we had a clear picture. The database was the primary bottleneck: specifically, the users table and the events table, both of which lacked composite indexes and fell back to sequential scans under load. WebSocket handlers were leaking memory due to unhandled promise rejections in message processing, and the Redis cache had an eviction policy set to `allkeys-lru` but no monitoring to alert when the hit rate dropped below 95%.

### Phase 2: Database and Cache Refactoring

We began with the highest-impact change: database optimization. We added composite indexes on the events table for `(user_id, created_at)`, and on the users table for `(tenant_id, last_active)`. We introduced read replicas, routing all dashboard read queries to a replica cluster while writes remained on the primary.

For caching, we implemented a multi-tier strategy. The first tier, Redis, served hot dashboard configurations and recently viewed analytics. The second tier, a local in-memory LRU cache within each service replica, handled session-specific data that was expensive to reconstruct. We introduced cache-warming jobs during low-traffic windows to preload data likely to be accessed during peak hours.

A critical decision here was introducing a dedicated connection pool for real-time analytics queries, isolated from the admin pool. This alone reduced cascading failures by 73% within the first month.
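
The pool isolation described above is a form of the bulkhead pattern. The sketch below illustrates the idea without any database driver, assuming a hypothetical `BoundedPool` class and made-up pool sizes (20 and 2); none of these names or numbers come from the client's codebase. In a real service the same effect would usually come from configuring two separate database client pools.

```typescript
// Minimal bulkhead sketch: each workload gets its own bounded pool, so a
// backlog in one pool cannot consume slots that belong to the other.
// BoundedPool and the sizes below are hypothetical, for illustration only.

type Task<T> = () => Promise<T>;

class BoundedPool {
  private inUse = 0;
  private readonly queue: Array<() => void> = [];

  constructor(readonly name: string, readonly maxConnections: number) {}

  // Number of tasks currently holding a slot.
  get active(): number {
    return this.inUse;
  }

  // Number of tasks queued behind this pool's limit.
  get waiting(): number {
    return this.queue.length;
  }

  // Run the task when a slot is free; otherwise queue it in this pool only.
  run<T>(task: Task<T>): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      const start = () => {
        this.inUse++;
        Promise.resolve()
          .then(task)
          .then(resolve, reject)
          .then(() => this.release());
      };
      if (this.inUse < this.maxConnections) {
        start();
      } else {
        this.queue.push(start);
      }
    });
  }

  // Free a slot and hand it to the next queued task, if any.
  private release(): void {
    this.inUse--;
    const next = this.queue.shift();
    if (next) {
      next();
    }
  }
}

// Separate limits for the two workloads (sizes are assumptions).
const realtimePool = new BoundedPool("realtime", 20);
const adminPool = new BoundedPool("admin", 2);
```

Because each workload queues only behind its own limit, a long-running admin report can exhaust `adminPool` without delaying queries submitted to `realtimePool`.
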
### Phase 3: Real-Time Layer and WebSocket Gateway

The polling-to-pub/sub migration was the most complex phase. We introduced Kafka as the central event log, with Redis Streams acting as a lightweight cache for recent events per tenant. Clients no longer polled. Instead, they opened a WebSocket connection to a dedicated gateway built on `uWebSockets.js`. This gateway was orders of magnitude lighter than the previous Socket.io implementation, handling individual connections with minimal overhead. Behind the gateway, a set of consumer services processed events, computed aggregations, and pushed updates to connected clients.

We deployed these gateways across three AWS regions (us-east-1, eu-west-1, and ap-south-1) using CloudFront and Route 53 latency-based routing. Clients automatically connected to the nearest edge location, reducing round-trip latency by an average of 140ms for international users.

To handle connection scaling, we configured Kubernetes Horizontal Pod Autoscalers based on custom metrics (active connections per pod), with a target of 5,000 connections per pod. We also introduced circuit breakers on all outbound calls, using `opossum` with a custom health-check circuit that would degrade gracefully, showing static cached data with a clear "data may be outdated" indicator rather than a full error page.

### Phase 4: Load Testing and Validation

With the new architecture in place, we ran a week-long load test simulating 500,000 concurrent WebSocket connections and 12,000 requests per second. We used `k6` for synthetic load and a custom event generator that mimicked real user behavior: not just uniform requests, but the bursty patterns typical of flash sales and product launches.

The results validated our approach. The system remained stable well beyond target, with CPU utilization under 60% and memory usage predictable and stable.
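
As a rough sanity check on the autoscaling target from Phase 3, the Horizontal Pod Autoscaler's documented formula, desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), reduces to ceil(totalConnections / targetPerPod) when the metric is an average per pod. The sketch below applies it to the figures in this case study; the function name and the `minReplicas` default are assumptions, and it ignores the HPA's tolerance band and stabilization windows.

```typescript
// Target stated in the case study: 5,000 active connections per gateway pod.
const TARGET_CONNECTIONS_PER_POD = 5000;

// Estimate how many gateway pods an HPA would converge on for a given
// number of concurrent connections. With an average-per-pod metric,
// currentReplicas cancels out of the HPA formula, leaving total / target.
function desiredGatewayPods(
  totalConnections: number,
  targetPerPod: number = TARGET_CONNECTIONS_PER_POD,
  minReplicas: number = 1 // hypothetical floor, not from the source
): number {
  if (totalConnections < 0 || targetPerPod <= 0) {
    throw new RangeError("connections must be >= 0 and target must be > 0");
  }
  return Math.max(minReplicas, Math.ceil(totalConnections / targetPerPod));
}

console.log(desiredGatewayPods(500000)); // 100
console.log(desiredGatewayPods(520000)); // 104
```

At the 500,000-connection test level this works out to about 100 gateway pods, and 104 at a 520,000-connection peak, before any headroom is added.
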
WebSocket reconnection rate dropped to 0.3%, and database connection pool utilization stabilized at 45% during peak, a dramatic improvement from the 92% saturation we had seen before.

## Results

After three months of production operation, the client's platform metrics told a dramatically different story than the baseline measurements we had taken on day one.

The dashboard API achieved a p95 latency of 98ms, well under the 150ms target. Uptime reached 99.97% across the quarter, exceeding the contractual SLA of 99.95%. The WebSocket layer handled a sustained peak of 520,000 concurrent connections during the Black Friday event, with zero dropped messages and no degradation in data freshness.

Revenue impact was immediate and tangible. Since the dashboard became reliable during peak shopping events, merchant churn decreased by 22% in the following quarter. The client's NPS score rose from 31 to 58, a direct reflection of restored trust in the platform. Operational costs actually decreased by 15% despite the tenfold traffic increase, thanks to more efficient compute utilization and the elimination of emergency engineering hours previously spent on firefighting.

Perhaps more importantly, the client's engineering team regained the ability to ship features. Before the overhaul, 40% of engineering cycles were consumed by infrastructure maintenance and incident response. Post-rework, that number dropped to 8%, freeing the team to focus on product differentiation.

## Metrics

Here are the key before-and-after metrics that defined the success of this engagement.


| Metric | Before | After | Target |
|--------|--------|-------|--------|
| p95 API Latency | 420ms | 98ms | <150ms |
| p99 API Latency | 890ms | 187ms | <250ms |
| Uptime | 97.2% | 99.97% | >99.95% |
| WebSocket Reconnection Rate | 18% | 0.3% | <1% |
| Peak Concurrent Connections | 52,000 | 520,000 | 500,000 |
| Dashboard Data Freshness | 18s | 4.2s | <5s |
| DB Connection Pool Utilization (Peak) | 92% | 45% | <70% |
| Monthly Engineering Hours on Incidents | 160h | 24h | <40h |
| Cache Hit Rate | 82% | 97% | >95% |

These numbers are not theoretical. They were measured over a full quarter of production traffic using the same instrumentation and alerting thresholds the client continues to use today.

## Lessons Learned

No project of this scale goes perfectly, and this engagement was no exception. We walked away with several hard-earned lessons that have shaped how Webskyne approaches infrastructure work.

**Lesson 1: Invest in observability before optimization.** We nearly fell into the trap of making changes based on assumptions. The first two days of instrumentation paid for themselves within the first week by preventing us from optimizing the wrong component. Observability is not overhead; it is the foundation of all meaningful improvement.

**Lesson 2: Shared resources are hidden failure points.** The most damaging issues in the original architecture traced back to shared database connections and a shared Redis instance without proper resource isolation. When you have multiple critical workloads competing for the same resource, the question is not if they will collide, but when.

**Lesson 3: Premature abstraction is expensive.** We were asked early on to split the monolith into 12 microservices. Our analysis showed this would add significant operational complexity without solving the core problem. By keeping the monolith but enforcing clean module boundaries, we delivered the same scalability with a fraction of the operational burden.


**Lesson 4: Edge computing requires thoughtful data consistency.** Moving logic closer to users is powerful, but it introduces cache invalidation challenges. We solved this by using Kafka as the source of truth and edge caches as eventually consistent projections. The key was making the consistency model explicit to both the engineering team and the client.

**Lesson 5: Load testing must mimic reality.** Our first load tests used uniform request patterns and failed to expose the connection storms that occurred when clients simultaneously reconnected after a maintenance window. It was only when we introduced realistic burst patterns that we uncovered a critical race condition in our connection pooling logic.

## Conclusion

This engagement reinforced a belief central to Webskyne's engineering philosophy: the best infrastructure changes are those that are invisible to the end user. When we did our job well, the client's merchants simply noticed that the dashboard was faster, more reliable, and more useful, without caring about the underlying architectural shifts that made it possible.

The real victory was not the metrics themselves. It was that the client's team could return to building product rather than maintaining infrastructure. That shift, from reactive firefighting to proactive innovation, is the most meaningful outcome we could have asked for.

---

*This case study was compiled by the Webskyne editorial team from direct project records, post-engagement interviews, and continuous production metrics collected during Q1–Q2 2026.*
