Scaling Real-Time Notifications: How We Built a Million-Operations-Per-Second Notification Engine

A deep dive into architecting and deploying a distributed notification system that handles over 1 million operations per second for a global e-commerce platform. Learn how we leveraged event-driven architecture, Redis Streams, and container orchestration to achieve 99.99% uptime while reducing infrastructure costs by 40%.

Overview

In 2024, our client, a Fortune 500 e-commerce platform serving over 50 million daily active users, faced a critical scalability challenge. Their legacy notification system, built on a monolithic polling architecture, was buckling under peak loads of 200,000 messages per minute. Users experienced delayed order confirmations, missed promotional alerts, and app crashes during flash sale events. The business impact was severe: a 15% increase in cart abandonment during peak hours and plummeting customer satisfaction scores.

Our mandate was clear: redesign the notification infrastructure to handle at least 1 million operations per second with sub-500ms delivery times, while maintaining 99.99% uptime and reducing operational costs. This case study details our journey from diagnosis to deployment.

The Challenge

The existing system suffered from fundamental architectural flaws:

Single-point bottlenecks: All notifications routed through a single database cluster, creating cascading failures
Inefficient polling: Mobile clients polled every 30 seconds, draining battery and generating wasteful network traffic
Lack of prioritization: Critical alerts (security, orders) competed equally with promotional messages
Infrastructure sprawl: 47 separate microservices with inconsistent notification logic
No observability: Zero real-time metrics or delivery confirmation tracking

Adding to the complexity, the system needed to support multiple channels simultaneously (push notifications, SMS, email, in-app messages, and WebSockets), each with its own delivery guarantees and formats.

Goals & Requirements

We established measurable objectives aligned with business KPIs:

Scale: Process 1M+ ops/sec during peak loads with horizontal scalability
Performance: P99 delivery latency under 500ms, P50 under 100ms
Reliability: 99.99% uptime with automatic failover and graceful degradation
Cost efficiency: Reduce infrastructure costs by 30-40% through optimization
Developer experience: Unified SDK reducing integration time from 3 days to 2 hours

Technical requirements included multi-region deployment, GDPR-compliant data handling, support for 20+ message templates, and seamless integration with existing microservices.

Our Approach

We adopted a phased approach, beginning with a comprehensive audit of the existing system. Over two weeks, our team traced every notification path, identifying hotspots through distributed tracing and log analysis.

The solution architecture centered on event-driven microservices with these core principles:

Event sourcing: All notification events stored in immutable logs
CQRS pattern: Separate read/write models for optimized queries
Chaos engineering: Proactive testing of failure scenarios before production
Gradual rollout: Blue-green deployments with feature flags
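
To make the first two principles concrete, here is a minimal in-memory sketch of an append-only event log (the write side) and a status view projected from it (the read side). The names NotificationEvent, EventLog, and project_status, and the four event kinds, are assumptions made for illustration; they are not taken from the production system.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class NotificationEvent:
    """Immutable fact about a notification (hypothetical schema)."""
    notification_id: str
    kind: str  # e.g. "requested", "sent", "delivered", "failed"


@dataclass
class EventLog:
    """Write side: append-only; events are never updated or deleted."""
    _events: List[NotificationEvent] = field(default_factory=list)

    def append(self, event: NotificationEvent) -> None:
        self._events.append(event)

    def read_all(self) -> Tuple[NotificationEvent, ...]:
        # Return a tuple so callers cannot mutate the log.
        return tuple(self._events)


def project_status(events: Iterable[NotificationEvent]) -> Dict[str, str]:
    """Read side: fold the log into 'latest status per notification'."""
    status: Dict[str, str] = {}
    for event in events:
        status[event.notification_id] = event.kind
    return status


log = EventLog()
log.append(NotificationEvent("n-1", "requested"))
log.append(NotificationEvent("n-1", "sent"))
log.append(NotificationEvent("n-2", "requested"))
log.append(NotificationEvent("n-1", "delivered"))
view = project_status(log.read_all())
```

Because the view is derived entirely from the log, it can be rebuilt or reshaped for new queries without touching the write path.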

We selected a tech stack optimized for high-throughput messaging:

Component	Technology	Justification
Message Queue	Redis Streams + Apache Kafka	Durability with low-latency pub/sub
Storage	Cassandra + DynamoDB	Multi-region availability with tunable consistency
Processing	Node.js + Go services	Optimal for I/O-bound workloads
Orchestration	Kubernetes + Nomad	Multi-cloud provider flexibility

Implementation Journey

Phase 1: Foundation (Weeks 1-4)

We began by building the core event pipeline. The architecture featured three distinct layers:

Ingestion Layer: REST and gRPC endpoints for receiving notification events from upstream services. Each event underwent schema validation and enrichment before entering the stream.
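
As a rough illustration of the validate-then-enrich step, the sketch below checks a few required fields and attaches metadata before an event would enter the stream. The field names, the channel identifiers, and the default priority are all assumptions; the real schema is not described in this case study.

```python
import time
import uuid
from typing import Any, Dict

# Hypothetical schema: field name -> expected type.
REQUIRED_FIELDS = {"user_id": str, "channel": str, "template": str}
ALLOWED_CHANNELS = {"push", "sms", "email", "in_app", "websocket"}


class ValidationError(ValueError):
    """Raised when an incoming event does not match the schema."""


def validate(event: Dict[str, Any]) -> None:
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in event:
            raise ValidationError(f"missing field: {name}")
        if not isinstance(event[name], expected_type):
            raise ValidationError(f"wrong type for field: {name}")
    if event["channel"] not in ALLOWED_CHANNELS:
        raise ValidationError(f"unknown channel: {event['channel']}")


def enrich(event: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy with an ID, a receive timestamp, and a default priority."""
    enriched = dict(event)
    enriched.setdefault("priority", "normal")
    enriched["event_id"] = str(uuid.uuid4())
    enriched["received_at"] = time.time()
    return enriched


def ingest(event: Dict[str, Any]) -> Dict[str, Any]:
    validate(event)  # reject malformed events before they reach the stream
    return enrich(event)
```

Rejecting bad events at the edge keeps downstream workers from having to handle malformed payloads.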

Processing Layer: A fleet of worker pools consuming events from Redis Streams. Workers handled template rendering, personalization, and channel-specific formatting. Rate limiting and priority queuing were implemented at this layer.
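
The priority queuing and rate limiting mentioned above can be sketched in memory as follows. This is not the Redis Streams consumer itself: PriorityBuffer, TokenBucket, and the three priority levels are hypothetical names, and the token bucket is one common rate-limiting technique that we assume here for illustration.

```python
import heapq
import itertools
from typing import Any, Dict, List, Tuple

# Hypothetical priority levels; lower rank is served first.
PRIORITY = {"critical": 0, "normal": 1, "promotional": 2}


class PriorityBuffer:
    """Serves critical events before promotional ones, FIFO within a level."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, Dict[str, Any]]] = []
        self._counter = itertools.count()  # tie-breaker preserves arrival order

    def push(self, event: Dict[str, Any]) -> None:
        rank = PRIORITY[event.get("priority", "normal")]
        heapq.heappush(self._heap, (rank, next(self._counter), event))

    def pop(self) -> Dict[str, Any]:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, capacity: float) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill based on elapsed time, then try to spend one token.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Passing the current time into allow() keeps the limiter deterministic in tests; a worker would typically pass time.monotonic().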

Delivery Layer: Specialized connectors for each channel—APNs for iOS, FCM for Android, SMTP for email, Twilio for SMS. Each connector maintained its own retry logic and delivery confirmation handling.

Phase 2: Scaling & Optimization (Weeks 5-8)

Performance testing revealed bottlenecks in our initial design. We implemented several key optimizations:

Connection pooling: Reduced TCP handshake overhead by 85%
Batched writes: Increased database throughput by 3x through bulk inserts
Adaptive fan-out: Dynamically scaled worker pools based on queue depth
Smart retries: Exponential backoff with jitter reduced retry storms by 73%
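
For readers unfamiliar with the smart-retry technique, the sketch below shows exponential backoff with "full jitter", where each delay is drawn uniformly between zero and an exponentially growing ceiling. The base delay, cap, attempt limit, and the TransientError class are illustrative assumptions, not the values used in production.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a provider timeout)."""


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 60.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    source = rng or random
    ceiling = min(cap, base * (2 ** attempt))
    return source.uniform(0.0, ceiling)


def retry(operation: Callable[[], T], max_attempts: int = 5,
          sleep: Callable[[float], None] = time.sleep,
          rng: Optional[random.Random] = None) -> T:
    """Call `operation`, retrying transient failures with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(backoff_delay(attempt, rng=rng))
    raise RuntimeError("max_attempts must be at least 1")
```

The random component is what breaks up retry storms: clients that failed at the same moment no longer retry at the same moment.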

The most significant breakthrough came from implementing a circuit breaker pattern for external provider connections. During a major FCM outage, our system gracefully degraded to SMS-only for non-critical alerts while maintaining order confirmation delivery.
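
A circuit breaker of this kind can be sketched in a few lines. The version below is a simplified illustration under stated assumptions: the threshold, the reset timeout, and the idea of passing a fallback callable (for example, an SMS sender) are ours, and a production breaker would also need thread safety and per-provider state.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Stops calling a failing provider and routes to a fallback instead."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half_open"  # allow one trial call to probe recovery
        return "open"

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.state() == "open":
            return fallback()  # skip the provider entirely while open
        try:
            result = primary()
        except Exception:
            self.failures += 1
            # Open on reaching the threshold, or re-open after a failed trial.
            if self.failures >= self.failure_threshold or self.opened_at is not None:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        self.opened_at = None
        return result
```

While the breaker is open, the failing provider receives no traffic at all, which gives it room to recover and keeps worker threads from piling up on timeouts.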

Phase 3: Observability & Control (Weeks 9-12)

We built a comprehensive observability stack using OpenTelemetry, Prometheus, and Grafana. Key dashboards tracked:

Real-time throughput by channel and priority level
Delivery latency distributions and SLA compliance
Error rates and categorization for root cause analysis
Infrastructure costs and resource utilization efficiency
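
As an example of what the latency and SLA dashboard computes, the sketch below derives P50 and P99 from raw latency samples and checks them against the targets stated in our goals (P50 under 100ms, P99 under 500ms). It uses the nearest-rank percentile method on an in-memory list, which is an assumption for illustration; Prometheus typically estimates quantiles from histogram buckets instead.

```python
import math
from typing import Dict, Sequence, Union


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def sla_report(latencies_ms: Sequence[float], p50_target: float = 100.0,
               p99_target: float = 500.0) -> Dict[str, Union[float, bool]]:
    """Summarize delivery latency against the P50/P99 targets."""
    p50 = percentile(latencies_ms, 50)
    p99 = percentile(latencies_ms, 99)
    return {
        "p50_ms": p50,
        "p99_ms": p99,
        "meets_sla": p50 < p50_target and p99 < p99_target,
    }
```

Tracking P99 alongside P50 matters because a healthy median can hide a slow tail that still affects a meaningful share of deliveries.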

An admin console allowed product managers to create notification templates, configure routing rules, and monitor campaign performance without engineering involvement.

Results & Impact

The system went live in stages over six weeks, with full rollout completed in March 2025. The results met or exceeded our targets across all metrics:

Metric	Before	After	Improvement
Peak throughput	200K ops/sec	1.2M ops/sec	6x increase
P99 latency	2.3 seconds	320ms	86% reduction
Uptime	99.2%	99.99%	+0.79 percentage points (98.75% less downtime)
Infrastructure cost	$47K/month	$28K/month	40% reduction
Integration time	3 days	2 hours	92% faster
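
The improvement figures can be recomputed from the before and after values, as the short script below does. Two points are worth noting: the uptime change is easiest to read in percentage points or as yearly downtime, and the integration-time figure of roughly 92% only holds if a day is counted as 8 working hours, which is our assumption here.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def pct_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


def downtime_minutes_per_year(uptime_pct: float) -> float:
    return (100.0 - uptime_pct) / 100.0 * MINUTES_PER_YEAR


throughput_gain = 1_200_000 / 200_000                  # 6.0x
latency_cut = pct_reduction(2300, 320)                 # about 86%
cost_cut = pct_reduction(47_000, 28_000)               # about 40%
uptime_gain_points = 99.99 - 99.2                      # about 0.79 percentage points
downtime_before = downtime_minutes_per_year(99.2)      # about 4,205 minutes (~70 hours)
downtime_after = downtime_minutes_per_year(99.99)      # about 53 minutes
downtime_cut = pct_reduction(100 - 99.2, 100 - 99.99)  # about 98.75%
integration_cut = pct_reduction(3 * 8, 2)              # about 92%, ASSUMING 8-hour days
```

Expressed as downtime, the uptime change is a move from roughly 70 hours per year to under one hour.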

Business metrics showed immediate improvement: cart abandonment dropped to historical lows, customer satisfaction scores increased by 18%, and the marketing team reported 32% higher open rates, which it attributed to more precise send timing.

Key Metrics Deep Dive

Our monitoring revealed fascinating patterns in user behavior and system performance:

User Engagement Windows: Notifications sent within each user's optimal 15-minute window, chosen from the user's timezone and historical open rates, showed 2.3x higher engagement. A machine learning model trained on this data achieved 87% accuracy in predicting the best delivery times.
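
To illustrate the idea of a per-user delivery window, the sketch below buckets a user's historical open times into 15-minute local-time windows and picks the most frequent one. This is a simple frequency heuristic, not the machine learning model described above, and the function names and the fixed UTC offset input are assumptions made for the example.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone
from typing import Iterable

WINDOW_MINUTES = 15


def window_index(local_dt: datetime) -> int:
    """Index of the 15-minute window within the day (0..95)."""
    return (local_dt.hour * 60 + local_dt.minute) // WINDOW_MINUTES


def best_window(open_times_utc: Iterable[datetime], utc_offset_hours: float) -> str:
    """Return the start (HH:MM, user-local) of the most frequent open window."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    counts = Counter(window_index(t.astimezone(tz)) for t in open_times_utc)
    if not counts:
        raise ValueError("no open history for this user")
    index, _ = counts.most_common(1)[0]
    start = index * WINDOW_MINUTES
    return f"{start // 60:02d}:{start % 60:02d}"
```

A fixed offset ignores daylight saving time; a fuller version would use named timezones and weight recent opens more heavily.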

Channel Performance: Push notifications had the highest immediate open rate (67%) but the shortest shelf life (4 hours). Email performed best for non-time-sensitive communications, with 41% engagement over a week. SMS maintained a 91% delivery rate even during network outages.

Error Distribution: 78% of delivery failures were recoverable (device offline, temporary provider issues). Our retry system successfully delivered 94% of these within 15 minutes. The remaining 22% were invalid tokens or unsubscribed users—crucial data for maintaining list hygiene.
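
The split between recoverable and permanent failures suggests a triage step like the one sketched below, where recoverable failures go back to the retry path and permanent ones feed list hygiene. The reason codes are hypothetical placeholders, not actual APNs or FCM error names. The last line combines the two figures above: 78% of failures being recoverable and 94% of those being delivered means about 73% of all failed deliveries went through within 15 minutes.

```python
from typing import Dict, Iterable, List

# Hypothetical reason codes, for illustration only.
RECOVERABLE = {"device_offline", "provider_timeout", "provider_unavailable"}
PERMANENT = {"invalid_token", "unsubscribed"}


def triage(failures: Iterable[Dict[str, str]]) -> Dict[str, List[str]]:
    """Sort failed deliveries into retry, prune, and manual-review buckets."""
    plan: Dict[str, List[str]] = {"retry": [], "prune": [], "review": []}
    for failure in failures:
        reason = failure["reason"]
        if reason in RECOVERABLE:
            plan["retry"].append(failure["token"])
        elif reason in PERMANENT:
            plan["prune"].append(failure["token"])  # remove from the send list
        else:
            plan["review"].append(failure["token"])  # unknown reason: inspect
    return plan


# Share of ALL failures eventually delivered: 78% recoverable x 94% recovered.
recovered_share = 0.78 * 0.94  # about 0.733
```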

Lessons Learned

Start with metrics: Before touching code, we instrumented everything. This saved months of debugging by revealing the true bottlenecks immediately.
Design for failure: Building circuit breakers and graceful degradation paths from day one prevented customer-impacting outages.
Team alignment matters: Weekly cross-functional reviews with product, marketing, and operations ensured we solved the right problems.
Test at scale early: Load testing with production-like traffic in staging caught scaling issues before they affected customers.
Incremental wins: Shipping working improvements every two weeks kept stakeholders engaged and provided opportunities for course correction.

The project taught us that infrastructure transformations succeed not through revolutionary changes, but through methodical execution and continuous feedback. By maintaining backward compatibility throughout the migration, we achieved zero downtime while completely replacing the notification stack.

Looking Forward

With the system stable, we're now exploring AI-driven personalization engines and predictive notification scheduling. The infrastructure we built provides a solid foundation for these advanced features, with message queues capable of handling 5M ops/sec and room for growth into 2027 and beyond.

This project reaffirmed our belief that the best engineering solutions balance technical excellence with a focus on business outcomes. Every architectural decision tied back to measurable improvements in user experience and operational efficiency.