23 June 2026 • 6 min read

Scaling Real-Time Notifications: How We Built a Million-Operations-Per-Second Notification Engine

A deep dive into architecting and deploying a distributed notification system that handles over 1 million operations per second for a global e-commerce platform. Learn how we leveraged event-driven architecture, Redis Streams, and container orchestration to achieve 99.99% uptime while reducing infrastructure costs by 40%.

Case Study, Scalability, Microservices, Redis, Kubernetes, Real-time, E-commerce, Performance, Observability

Overview

In 2024, our client, a Fortune 500 e-commerce platform serving over 50 million daily active users, faced a critical scalability challenge. Their legacy notification system, built on a monolithic polling architecture, was buckling under peak loads of 200,000 messages per minute. Users experienced delayed order confirmations, missed promotional alerts, and app crashes during flash sale events. The business impact was severe: a 15% increase in cart abandonment during peak hours and plummeting customer satisfaction scores.

Our mandate was clear: redesign the notification infrastructure to handle at least 1 million operations per second with sub-500ms delivery times, while maintaining 99.99% uptime and reducing operational costs. This case study details our journey from diagnosis to deployment.

The Challenge

The existing system suffered from fundamental architectural flaws:

  • Single-point bottlenecks: All notifications routed through a single database cluster, creating cascading failures
  • Inefficient polling: Mobile clients polled every 30 seconds, draining battery and generating wasteful network traffic
  • Lack of prioritization: Critical alerts (security, orders) competed equally with promotional messages
  • Infrastructure sprawl: 47 separate microservices with inconsistent notification logic
  • No observability: Zero real-time metrics or delivery confirmation tracking

Adding to the complexity, the system needed to support multiple channels simultaneously: push notifications, SMS, email, in-app messages, and WebSockets, each with different delivery guarantees and formats.

Goals & Requirements

We established measurable objectives aligned with business KPIs:

  1. Scale: Process 1M+ ops/sec during peak loads with horizontal scalability
  2. Performance: P99 delivery latency under 500ms, P50 under 100ms
  3. Reliability: 99.99% uptime with automatic failover and graceful degradation
  4. Cost efficiency: Reduce infrastructure costs by 30-40% through optimization
  5. Developer experience: Unified SDK reducing integration time from 3 days to 2 hours

Technical requirements included multi-region deployment, GDPR-compliant data handling, support for 20+ message templates, and seamless integration with existing microservices.

Our Approach

We adopted a phased approach, beginning with a comprehensive audit of the existing system. Over two weeks, our team traced every notification path, identifying hotspots through distributed tracing and log analysis.

The solution architecture centered on event-driven microservices with these core principles:

  • Event sourcing: All notification events stored in immutable logs
  • CQRS pattern: Separate read/write models for optimized queries
  • Chaos engineering: Proactively test failure scenarios before production
  • Gradual rollout: Blue-green deployments with feature flags
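The first two principles can be illustrated with a minimal in-memory sketch. It assumes a hypothetical event schema (`NotificationEvent`) and hypothetical class names. The production system described here would back the log with a durable stream, not a Python list.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class NotificationEvent:
    """An immutable fact about one notification (hypothetical schema)."""
    notification_id: str
    user_id: str
    kind: str  # "queued", "sent", "delivered", or "failed"


class EventLog:
    """Write side: an append-only log; events are never updated or deleted."""

    def __init__(self) -> None:
        self._events: list[NotificationEvent] = []

    def append(self, event: NotificationEvent) -> None:
        self._events.append(event)

    def replay(self) -> tuple[NotificationEvent, ...]:
        # Return a tuple so callers cannot mutate the log through the result.
        return tuple(self._events)


class DeliveryStatusView:
    """Read side: a query-optimized projection built by replaying the log."""

    def __init__(self) -> None:
        self.latest_status: dict[str, str] = {}
        self.delivered_per_user: dict[str, int] = defaultdict(int)

    def apply(self, event: NotificationEvent) -> None:
        self.latest_status[event.notification_id] = event.kind
        if event.kind == "delivered":
            self.delivered_per_user[event.user_id] += 1


log = EventLog()
for kind in ("queued", "sent", "delivered"):
    log.append(NotificationEvent("n-1", "u-42", kind))
log.append(NotificationEvent("n-2", "u-42", "failed"))

view = DeliveryStatusView()
for event in log.replay():
    view.apply(event)
```

Because the view is derived entirely from the log, it can be discarded and rebuilt at any time, which is what makes separate read and write models practical.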

We selected a tech stack optimized for high-throughput messaging:

Component | Technology | Justification
Message Queue | Redis Streams + Apache Kafka | Durability with low-latency pub/sub
Storage | Cassandra + DynamoDB | Multi-region availability with tunable consistency
Processing | Node.js + Go services | Optimal for I/O bound workloads
Orchestration | Kubernetes + Nomad | Multi-cloud provider flexibility

Implementation Journey

Phase 1: Foundation (Weeks 1-4)

We began by building the core event pipeline. The architecture featured three distinct layers:

Ingestion Layer: REST and gRPC endpoints for receiving notification events from upstream services. Each event underwent schema validation and enrichment before entering the stream.
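As a rough sketch of the validate-then-enrich step, the following uses plain dictionaries and only the standard library. The field names, the default priority, and the channel identifiers are assumptions made for illustration. They are not the schema used in the project.

```python
import time
import uuid

# Hypothetical required fields and their expected types.
REQUIRED_FIELDS = {"user_id": str, "channel": str, "template_id": str}
ALLOWED_CHANNELS = {"push", "sms", "email", "in_app", "websocket"}


def validate(event: dict) -> list[str]:
    """Return a list of problems; an empty list means the event is valid."""
    errors = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in event:
            errors.append(f"missing field: {name}")
        elif not isinstance(event[name], expected_type):
            errors.append(f"wrong type for field: {name}")
    channel = event.get("channel")
    if isinstance(channel, str) and channel not in ALLOWED_CHANNELS:
        errors.append(f"unknown channel: {channel}")
    return errors


def enrich(event: dict) -> dict:
    """Return a copy with ingestion metadata added; the input is not mutated."""
    enriched = dict(event)
    enriched.setdefault("event_id", str(uuid.uuid4()))
    enriched.setdefault("priority", "normal")
    enriched["received_at"] = time.time()
    return enriched


def ingest(event: dict) -> dict:
    """Reject invalid events before they reach the stream, then enrich."""
    errors = validate(event)
    if errors:
        raise ValueError("; ".join(errors))
    return enrich(event)
```

Rejecting malformed events at the edge keeps every downstream worker free of defensive checks.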

Processing Layer: A fleet of worker pools consuming events from Redis Streams. Workers handled template rendering, personalization, and channel-specific formatting. Rate limiting and priority queuing were implemented at this layer.
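Priority queuing and rate limiting can be sketched with a binary heap and a token bucket. This is an illustrative in-process version. The priority names and the choice of a token bucket are assumptions, and the workers described above consumed from Redis Streams, not from a local heap.

```python
import heapq
import itertools

# Lower number means higher priority (hypothetical levels).
PRIORITY = {"critical": 0, "transactional": 1, "promotional": 2}


class PriorityNotificationQueue:
    """Pops the highest-priority item first; FIFO within one priority."""

    def __init__(self) -> None:
        self._heap = []
        self._counter = itertools.count()  # tie-breaker that preserves order

    def push(self, priority: str, payload: str) -> None:
        heapq.heappush(self._heap, (PRIORITY[priority], next(self._counter), payload))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated_at = now

    def allow(self, now: float) -> bool:
        # The clock is passed in so the behaviour is deterministic in tests;
        # a real worker would pass time.monotonic().
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = max(self.updated_at, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


queue = PriorityNotificationQueue()
queue.push("promotional", "flash sale")
queue.push("critical", "password changed")
queue.push("transactional", "order confirmed")
drained = [queue.pop() for _ in range(len(queue))]
```

Here `drained` lists the security alert first and the promotion last, regardless of arrival order.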

Delivery Layer: Specialized connectors for each channel—APNs for iOS, FCM for Android, SMTP for email, Twilio for SMS. Each connector maintained its own retry logic and delivery confirmation handling.

Phase 2: Scaling & Optimization (Weeks 5-8)

Performance testing revealed bottlenecks in our initial design. We implemented several key optimizations:

  • Connection pooling: Reduced TCP handshake overhead by 85%
  • Batched writes: Increased database throughput by 3x through bulk inserts
  • Adaptive fan-out: Dynamically scaled worker pools based on queue depth
  • Smart retries: Exponential backoff with jitter reduced retry storms by 73%
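The retry strategy in the last bullet can be sketched as follows. This is a minimal version using the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing ceiling. The base delay, the cap, and the `TransientError` type are illustrative assumptions, not the values used in production.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying, such as a timeout."""


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 60.0, rng=random) -> float:
    """Delay in seconds before retrying after 0-based attempt number `attempt`."""
    ceiling = min(cap, base * 2 ** attempt)
    # Randomizing the whole interval spreads retries out so that clients
    # which failed together do not all retry together.
    return rng.uniform(0.0, ceiling)


def send_with_retries(send, max_attempts: int = 5, sleep=time.sleep, rng=random):
    """Call `send` until it succeeds or the attempt budget is exhausted."""
    for attempt in range(max_attempts):
        try:
            return send()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # budget exhausted: surface the last error
            sleep(backoff_delay(attempt, rng=rng))
```

The `sleep` and `rng` parameters exist so the function can be exercised without real waiting or real randomness.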

The most significant breakthrough came from implementing a circuit breaker pattern for external provider connections. During a major FCM outage, our system gracefully degraded to SMS-only for non-critical alerts while maintaining order confirmation delivery.
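A circuit breaker with a fallback channel can be sketched in a few lines. The thresholds, the class names, and the `deliver` helper below are hypothetical. The sketch shows only the general pattern (closed, open, half-open), not the routing rules used in the project.

```python
class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation, now: float):
        if self.opened_at is not None and now - self.opened_at < self.reset_timeout:
            # Open: fail fast without touching the struggling provider.
            raise CircuitOpenError("provider temporarily disabled")
        # Closed, or half-open after the timeout: let the call through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = now  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None
        return result


def deliver(primary, fallback, breaker: CircuitBreaker, now: float):
    """Try the primary provider through the breaker; otherwise use the fallback."""
    try:
        return breaker.call(primary, now)
    except Exception:
        # Covers both provider errors and CircuitOpenError.
        return fallback()
```

While the breaker is open, the primary provider receives no traffic at all, which gives it room to recover and keeps worker threads from piling up on timeouts.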

Phase 3: Observability & Control (Weeks 9-12)

We built a comprehensive observability stack using OpenTelemetry, Prometheus, and Grafana. Key dashboards tracked:

  • Real-time throughput by channel and priority level
  • Delivery latency distributions and SLA compliance
  • Error rates and categorization for root cause analysis
  • Infrastructure costs and resource utilization efficiency
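To make the SLA compliance check concrete, here is a small sketch that computes nearest-rank percentiles over raw latency samples and compares them with the targets stated in the goals (P50 under 100ms, P99 under 500ms). The function names are illustrative. A Prometheus-based stack would normally estimate quantiles from histogram buckets instead of keeping raw samples.

```python
import math


def percentile(samples, pct: int):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def sla_report(latencies_ms, p50_target_ms: float = 100, p99_target_ms: float = 500) -> dict:
    """Summarize a batch of delivery latencies against the two targets."""
    p50 = percentile(latencies_ms, 50)
    p99 = percentile(latencies_ms, 99)
    return {
        "p50_ms": p50,
        "p99_ms": p99,
        "meets_sla": p50 < p50_target_ms and p99 < p99_target_ms,
    }
```

A batch with a fast median can still fail the check: if two deliveries in a hundred take 2.3 seconds, the P99 is 2.3 seconds no matter how quick the rest were.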

An admin console allowed product managers to create notification templates, configure routing rules, and monitor campaign performance without engineering involvement.

Results & Impact

The system went live in stages over six weeks, with full rollout completed in March 2025. The results exceeded our targets across all metrics:

Metric | Before | After | Improvement
Peak throughput | 200K ops/sec | 1.2M ops/sec | 6x increase
P99 latency | 2.3 seconds | 320ms | 86% reduction
Uptime | 99.2% | 99.99% | 0.79 percentage points
Infrastructure cost | $47K/month | $28K/month | 40% reduction
Integration time | 3 days | 2 hours | 92% faster
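The improvement figures can be re-derived from the before and after values. The short calculation below does so. It assumes that "3 days" of integration time means three 8-hour working days, and it expresses the uptime change both in percentage points and as minutes of downtime per 365-day year.

```python
def pct_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100


MINUTES_PER_YEAR = 365 * 24 * 60

throughput_ratio = 1_200_000 / 200_000           # ops/sec, after vs before
latency_reduction = pct_reduction(2300, 320)     # P99 latency in milliseconds
cost_reduction = pct_reduction(47_000, 28_000)   # dollars per month
uptime_gain_points = 99.99 - 99.2                # percentage points
downtime_before_min = (100 - 99.2) / 100 * MINUTES_PER_YEAR
downtime_after_min = (100 - 99.99) / 100 * MINUTES_PER_YEAR
# Assumption: three 8-hour working days, so 24 hours before and 2 hours after.
integration_reduction = pct_reduction(3 * 8, 2)

print(f"throughput:  {throughput_ratio:.1f}x")
print(f"P99 latency: {latency_reduction:.1f}% lower")
print(f"cost:        {cost_reduction:.1f}% lower")
print(f"uptime:      +{uptime_gain_points:.2f} percentage points")
print(f"downtime:    {downtime_before_min:.0f} -> {downtime_after_min:.0f} min/year")
print(f"integration: {integration_reduction:.1f}% faster")
```

The uptime change is under one percentage point, yet it takes expected downtime from roughly 70 hours a year to under one hour, which is why availability is usually discussed in terms of downtime budget.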

Business metrics showed immediate improvement: cart abandonment dropped to historical lows, customer satisfaction scores increased 18%, and the marketing team reported 32% higher open rates due to better timing precision.

Key Metrics Deep Dive

Our monitoring revealed fascinating patterns in user behavior and system performance:

User Engagement Windows: Notifications sent during optimal 15-minute windows based on user timezone and historical open rates showed 2.3x higher engagement. Our machine learning model trained on this data achieved 87% accuracy in predicting best delivery times.
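As a much simpler stand-in for the machine learning model, the sketch below picks a user's best 15-minute window by counting historical opens per window in the user's local time. It is a frequency heuristic with hypothetical function names, shown only to make the windowing idea concrete.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

WINDOW_MINUTES = 15


def window_index(opened_at_utc: datetime, utc_offset_hours: float) -> int:
    """Index of the 15-minute window of the local day (0 to 95)."""
    local = opened_at_utc.astimezone(timezone(timedelta(hours=utc_offset_hours)))
    return (local.hour * 60 + local.minute) // WINDOW_MINUTES


def best_window(open_times_utc, utc_offset_hours: float) -> str:
    """Local start time, as HH:MM, of the window with the most opens.

    Ties go to the earlier window. Raises ValueError on an empty history.
    Timestamps must be timezone-aware.
    """
    counts = Counter(window_index(t, utc_offset_hours) for t in open_times_utc)
    index, _ = max(counts.items(), key=lambda item: (item[1], -item[0]))
    start = index * WINDOW_MINUTES
    return f"{start // 60:02d}:{start % 60:02d}"


history = [
    datetime(2025, 3, 1, 12, 5, tzinfo=timezone.utc),
    datetime(2025, 3, 2, 12, 10, tzinfo=timezone.utc),
    datetime(2025, 3, 3, 18, 40, tzinfo=timezone.utc),
]
preferred = best_window(history, utc_offset_hours=8)
```

For this history and a UTC+8 user, two of the three opens fall in the local 20:00 window, so that is the one selected.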

Channel Performance: Push notifications had the highest immediate open rate (67%) but the shortest shelf life (4 hours). Email performed best for non-time-sensitive communications, with 41% engagement over the following week. SMS maintained a 91% delivery rate even during network outages.

Error Distribution: 78% of delivery failures were recoverable (device offline, temporary provider issues). Our retry system successfully delivered 94% of these within 15 minutes. The remaining 22% were invalid tokens or unsubscribed users—crucial data for maintaining list hygiene.
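The split between recoverable and permanent failures suggests a simple triage step, sketched below with hypothetical internal failure reasons (real provider error codes differ). The closing arithmetic combines the two percentages above into the share of all failures that retries recovered.

```python
# Hypothetical internal failure reasons that retrying cannot fix.
PERMANENT = {"invalid_token", "unsubscribed"}


def triage(failures):
    """Split failures into a retry queue and a list of users to clean up."""
    retry_queue, cleanup = [], []
    for failure in failures:
        if failure["reason"] in PERMANENT:
            cleanup.append(failure["user_id"])
        else:
            # Everything else, including unknown reasons, is retried;
            # the retry budget bounds the cost of a wrong guess.
            retry_queue.append(failure)
    return retry_queue, cleanup


failures = [
    {"user_id": "u-1", "reason": "device_offline"},
    {"user_id": "u-2", "reason": "invalid_token"},
    {"user_id": "u-3", "reason": "provider_timeout"},
    {"user_id": "u-4", "reason": "unsubscribed"},
]
retry_queue, cleanup = triage(failures)

recoverable_share = 0.78  # share of failures that were recoverable
retry_success = 0.94      # share of those delivered within 15 minutes
recovered_share = recoverable_share * retry_success
```

With the figures reported here, about 73% of all failed deliveries (0.78 × 0.94) were recovered by retries within 15 minutes.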

Lessons Learned

  1. Start with metrics: Before touching code, we instrumented everything. This saved months of debugging by revealing the true bottlenecks immediately.
  2. Design for failure: Building circuit breakers and graceful degradation paths from day one prevented customer-impacting outages.
  3. Team alignment matters: Weekly cross-functional reviews with product, marketing, and operations ensured we solved the right problems.
  4. Test at scale early: Load testing with production-like traffic in staging caught scaling issues before they affected customers.
  5. Incremental wins: Shipping working improvements every two weeks kept stakeholders engaged and provided opportunities for course correction.

The project taught us that infrastructure transformations succeed not through revolutionary changes, but through methodical execution and continuous feedback. By maintaining backward compatibility throughout the migration, we achieved zero downtime while completely replacing the notification stack.

Looking Forward

With the system stable, we're now exploring AI-driven personalization engines and predictive notification scheduling. The infrastructure we built provides a solid foundation for these advanced features, with message queues capable of handling 5M ops/sec and room for growth into 2027 and beyond.

This project reaffirmed our belief that the best engineering solutions balance technical excellence with business outcome focus. Every architectural decision tied back to measurable improvements in user experience and operational efficiency.
