23 June 2026 • 6 min read
Scaling Real-Time Notifications: How We Built a Million-Operations-Per-Second Notification Engine
A deep dive into architecting and deploying a distributed notification system that handles over 1 million operations per second for a global e-commerce platform. Learn how we leveraged event-driven architecture, Redis Streams, and container orchestration to achieve 99.99% uptime while reducing infrastructure costs by 40%.
Overview
In 2024, our client, a Fortune 500 e-commerce platform serving over 50 million daily active users, faced a critical scalability challenge. Their legacy notification system, built on a monolithic polling architecture, was buckling under peak loads of 200,000 messages per minute. Users experienced delayed order confirmations, missed promotional alerts, and app crashes during flash sale events. The business impact was severe: a 15% increase in cart abandonment during peak hours and plummeting customer satisfaction scores.
Our mandate was clear: redesign the notification infrastructure to handle at least 1 million operations per second with sub-500ms delivery times, while maintaining 99.99% uptime and reducing operational costs. This case study details our journey from diagnosis to deployment.
The Challenge
The existing system suffered from fundamental architectural flaws:
- Single-point bottlenecks: All notifications routed through a single database cluster, creating cascading failures
- Inefficient polling: Mobile clients polled every 30 seconds, draining battery and generating wasteful network traffic
- Lack of prioritization: Critical alerts (security, orders) competed equally with promotional messages
- Infrastructure sprawl: 47 separate microservices with inconsistent notification logic
- No observability: Zero real-time metrics or delivery confirmation tracking
Adding to the complexity, the system needed to support multiple channels simultaneously: push notifications, SMS, email, in-app messages, and WebSockets, all with different delivery guarantees and formats.
Goals & Requirements
We established measurable objectives aligned with business KPIs:
- Scale: Process 1M+ ops/sec during peak loads with horizontal scalability
- Performance: P99 delivery latency under 500ms, P50 under 100ms
- Reliability: 99.99% uptime with automatic failover and graceful degradation
- Cost efficiency: Reduce infrastructure costs by 30-40% through optimization
- Developer experience: Unified SDK reducing integration time from 3 days to 2 hours
Technical requirements included multi-region deployment, GDPR-compliant data handling, support for 20+ message templates, and seamless integration with existing microservices.
Our Approach
We adopted a phased approach, beginning with a comprehensive audit of the existing system. Over two weeks, our team traced every notification path, identifying hotspots through distributed tracing and log analysis.
The solution architecture centered on event-driven microservices with these core principles:
- Event sourcing: All notification events stored in immutable logs
- CQRS pattern: Separate read/write models for optimized queries
- Chaos engineering: Proactively test failure scenarios before production
- Gradual rollout: Blue-green deployments with feature flags
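To make the first two principles concrete, here is a minimal in-memory sketch of event sourcing paired with a CQRS-style read model. The `NotificationEvent` schema, the `EventLog` class, and the status names are assumptions made for illustration; they are not taken from the production system, which would persist the log in a durable store rather than a Python list.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class NotificationEvent:
    """One immutable fact about a notification (hypothetical schema)."""
    notification_id: str
    kind: str  # e.g. "queued", "sent", "delivered", "failed"


@dataclass
class EventLog:
    """Write side: an append-only log. Events are never updated or deleted."""
    _events: List[NotificationEvent] = field(default_factory=list)

    def append(self, event: NotificationEvent) -> None:
        self._events.append(event)

    def replay(self) -> Tuple[NotificationEvent, ...]:
        # Return a tuple so callers cannot mutate the log through the result.
        return tuple(self._events)


def project_status(events: Iterable[NotificationEvent]) -> Dict[str, str]:
    """Read side: fold the log into a latest-status view per notification."""
    status: Dict[str, str] = {}
    for event in events:
        status[event.notification_id] = event.kind
    return status


log = EventLog()
log.append(NotificationEvent("n1", "queued"))
log.append(NotificationEvent("n1", "sent"))
log.append(NotificationEvent("n2", "queued"))
log.append(NotificationEvent("n1", "delivered"))
status_view = project_status(log.replay())
```

Because the read model is derived purely from the log, it can be rebuilt at any time by replaying events, which is what lets the read and write sides be optimized independently.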
We selected a tech stack optimized for high-throughput messaging:
| Component | Technology | Justification |
|---|---|---|
| Message Queue | Redis Streams + Apache Kafka | Durability with low-latency pub/sub |
| Storage | Cassandra + DynamoDB | Multi-region availability with tunable consistency |
| Processing | Node.js + Go services | Optimal for I/O bound workloads |
| Orchestration | Kubernetes + Nomad | Multi-cloud provider flexibility |
Implementation Journey
Phase 1: Foundation (Weeks 1-4)
We began by building the core event pipeline. The architecture featured three distinct layers:
Ingestion Layer: REST and gRPC endpoints for receiving notification events from upstream services. Each event underwent schema validation and enrichment before entering the stream.
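The validate-then-enrich step might look roughly like the following sketch. The field names, the allowed channel identifiers, and the default priority are assumptions chosen for illustration; the actual event schema is not described in this case study.

```python
import time
import uuid
from typing import Any, Dict, Optional

# Hypothetical schema: required field name -> expected Python type.
REQUIRED_FIELDS = {"user_id": str, "channel": str, "template": str}
# Hypothetical channel identifiers for the channels listed earlier.
ALLOWED_CHANNELS = {"push", "sms", "email", "in_app", "websocket"}


def validate(event: Dict[str, Any]) -> None:
    """Reject events that are missing fields or carry unexpected values."""
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in event:
            raise ValueError(f"missing field: {name}")
        if not isinstance(event[name], expected_type):
            raise ValueError(f"wrong type for field: {name}")
    if event["channel"] not in ALLOWED_CHANNELS:
        raise ValueError(f"unknown channel: {event['channel']}")


def enrich(event: Dict[str, Any], now: Optional[float] = None) -> Dict[str, Any]:
    """Return a copy with an ID, a receive timestamp, and a default priority."""
    enriched = dict(event)
    enriched.setdefault("priority", "normal")
    enriched["event_id"] = str(uuid.uuid4())
    enriched["received_at"] = time.time() if now is None else now
    return enriched


def ingest(event: Dict[str, Any], now: Optional[float] = None) -> Dict[str, Any]:
    """Validate first, so malformed events never reach the stream."""
    validate(event)
    return enrich(event, now)


accepted = ingest(
    {"user_id": "u1", "channel": "push", "template": "order_confirmed"}, now=1.0
)
```

Rejecting bad events at the edge keeps the downstream workers free of defensive checks.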
Processing Layer: A fleet of worker pools consuming events from Redis Streams. Workers handled template rendering, personalization, and channel-specific formatting. Rate limiting and priority queuing were implemented at this layer.
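As an illustration of priority queuing and rate limiting, the sketch below uses an in-memory heap and a token bucket. Both are stand-ins: the real workers consumed from Redis Streams, and the priority names, class names, and rate values here are assumptions.

```python
import heapq
import itertools
from typing import List, Optional, Tuple

# Hypothetical priority levels: a lower number is served first.
PRIORITY = {"critical": 0, "normal": 1, "promotional": 2}


class PriorityQueue:
    """Serves higher-priority messages first, FIFO within one priority."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._counter = itertools.count()  # tie-breaker that preserves arrival order

    def push(self, priority: str, message: str) -> None:
        heapq.heappush(self._heap, (PRIORITY[priority], next(self._counter), message))

    def pop(self) -> Optional[str]:
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec: float, capacity: float, now: float = 0.0) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.last = now

    def allow(self, now: float) -> bool:
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


queue = PriorityQueue()
queue.push("promotional", "flash sale")
queue.push("critical", "password changed")
queue.push("normal", "order shipped")
drained = [queue.pop() for _ in range(3)]
# drained == ["password changed", "order shipped", "flash sale"]
```

Passing `now` explicitly keeps the bucket deterministic in tests; a worker would pass a monotonic clock reading instead.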
Delivery Layer: Specialized connectors for each channel (APNs for iOS, FCM for Android, SMTP for email, Twilio for SMS). Each connector maintained its own retry logic and delivery confirmation handling.
Phase 2: Scaling & Optimization (Weeks 5-8)
Performance testing revealed bottlenecks in our initial design. We implemented several key optimizations:
- Connection pooling: Reduced TCP handshake overhead by 85%
- Batched writes: Increased database throughput by 3x through bulk inserts
- Adaptive fan-out: Dynamically scaled worker pools based on queue depth
- Smart retries: Exponential backoff with jitter reduced retry storms by 73%
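The retry strategy in the last bullet can be sketched as follows. This uses the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing ceiling; the base delay, cap, attempt limit, and function names are assumptions rather than the values used in production.

```python
import random
import time
from typing import Callable, Optional


def backoff_delay(
    attempt: int,
    base: float = 0.5,
    cap: float = 60.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Full jitter: a uniform delay in [0, min(cap, base * 2**attempt)] seconds."""
    source = rng if rng is not None else random
    ceiling = min(cap, base * (2 ** attempt))
    return source.uniform(0.0, ceiling)


def send_with_retries(
    send: Callable[[], bool],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> int:
    """Call send() until it returns True; return the number of attempts used."""
    for attempt in range(max_attempts):
        if send():
            return attempt + 1
        if attempt < max_attempts - 1:
            # Randomized waits keep many failing clients from retrying in lockstep.
            sleep(backoff_delay(attempt, rng=rng))
    raise RuntimeError("delivery failed after all retry attempts")
```

The jitter is what addresses retry storms: without it, every worker that failed at the same moment would retry at the same moment too.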
The most significant breakthrough came from implementing a circuit breaker pattern for external provider connections. During a major FCM outage, our system gracefully degraded to SMS-only for non-critical alerts while maintaining order confirmation delivery.
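A minimal sketch of the circuit breaker idea, assuming a simple consecutive-failure threshold and a fixed reset timeout. The class, the thresholds, and the `choose_channel` helper are hypothetical, and the fallback rule is much simpler than the per-priority degradation described above.

```python
from typing import Optional


class CircuitBreaker:
    """Opens after repeated failures, then allows a trial call after a timeout."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow_request(self, now: float) -> bool:
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        # Half-open: once the timeout has elapsed, let a trial request through.
        return now - self.opened_at >= self.reset_timeout

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self, now: float) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now  # open (or re-open) the circuit


def choose_channel(
    breaker: CircuitBreaker, now: float, primary: str = "push", fallback: str = "sms"
) -> str:
    """Route to the fallback channel while the primary provider's circuit is open."""
    return primary if breaker.allow_request(now) else fallback
```

While the circuit is open, workers stop spending time on calls that are likely to fail, which leaves capacity for the channels that are still healthy.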
Phase 3: Observability & Control (Weeks 9-12)
We built a comprehensive observability stack using OpenTelemetry, Prometheus, and Grafana. Key dashboards tracked:
- Real-time throughput by channel and priority level
- Delivery latency distributions and SLA compliance
- Error rates and categorization for root cause analysis
- Infrastructure costs and resource utilization efficiency
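The latency and SLA panels boil down to percentile checks against the targets stated earlier (P50 under 100ms, P99 under 500ms). The sketch below computes nearest-rank percentiles over raw samples; this is an assumption made to keep the example self-contained, since a Prometheus-based stack would normally estimate percentiles from histogram buckets instead.

```python
import math
from typing import Dict, Sequence, Union


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def sla_report(
    latencies_ms: Sequence[float],
    p50_target_ms: float = 100.0,
    p99_target_ms: float = 500.0,
) -> Dict[str, Union[float, bool]]:
    """Summarize latency samples against the P50 and P99 targets."""
    p50 = percentile(latencies_ms, 50)
    p99 = percentile(latencies_ms, 99)
    return {
        "p50_ms": p50,
        "p99_ms": p99,
        "sla_met": p50 < p50_target_ms and p99 < p99_target_ms,
    }


healthy = sla_report([float(ms) for ms in range(1, 101)])
degraded = sla_report([50.0] * 98 + [900.0] * 2)
```

The second call shows why tail percentiles matter: 98% of requests are fast, yet the P99 target is missed.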
An admin console allowed product managers to create notification templates, configure routing rules, and monitor campaign performance without engineering involvement.
Results & Impact
The system went live in stages over six weeks, with full rollout completed in March 2025. The results exceeded our targets across all metrics:
| Metric | Before | After | Improvement |
|---|---|---|---|
| Peak throughput | 200K ops/sec | 1.2M ops/sec | 6x increase |
| P99 latency | 2.3 seconds | 320ms | 86% reduction |
| Uptime | 99.2% | 99.99% | 0.79 percentage points |
| Infrastructure cost | $47K/month | $28K/month | 40% reduction |
| Integration time | 3 days | 2 hours | 92% faster |
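To put the uptime row in concrete terms, availability percentages can be converted into expected downtime per year. The snippet below is plain arithmetic on the two figures from the table; the helper name is ours, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(uptime_percent: float) -> float:
    """Expected minutes of downtime per year at a given availability."""
    return (100.0 - uptime_percent) / 100.0 * MINUTES_PER_YEAR


before = downtime_minutes_per_year(99.2)   # about 4,205 minutes, roughly 70 hours
after = downtime_minutes_per_year(99.99)   # about 53 minutes
```

Seen this way, the move from 99.2% to 99.99% cuts expected downtime by a factor of about 80.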
Business metrics showed immediate improvement: cart abandonment dropped to historical lows, customer satisfaction scores increased 18%, and the marketing team reported 32% higher open rates due to better timing precision.
Key Metrics Deep Dive
Our monitoring revealed fascinating patterns in user behavior and system performance:
User Engagement Windows: Notifications sent during optimal 15-minute windows based on user timezone and historical open rates showed 2.3x higher engagement. Our machine learning model trained on this data achieved 87% accuracy in predicting best delivery times.
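As a rough illustration of what a 15-minute delivery window means, the sketch below picks the window in which a user has most often opened past notifications. This is a plain frequency heuristic and not the machine learning model mentioned above; the function name and input format (minutes after local midnight) are assumptions.

```python
from collections import Counter
from typing import Optional, Sequence

WINDOW_MINUTES = 15


def best_window(open_minutes_local: Sequence[int]) -> Optional[str]:
    """Return the start (HH:MM, user-local time) of the most frequent open window.

    Each input value is the number of minutes after local midnight (0..1439)
    at which the user opened a past notification.
    """
    if not open_minutes_local:
        return None
    counts = Counter(minute // WINDOW_MINUTES for minute in open_minutes_local)
    window, _ = counts.most_common(1)[0]
    start = window * WINDOW_MINUTES
    return f"{start // 60:02d}:{start % 60:02d}"


# Three opens between 08:00 and 08:15 local time, one at 20:00.
suggested = best_window([485, 490, 492, 1200])
```

Working in the user's local time is what makes the windows comparable across time zones.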
Channel Performance: Push notifications had the highest immediate open rate (67%) but shortest shelf life (4 hours). Email performed best for non-time-sensitive communications with 41% week-long engagement. SMS maintained 91% delivery rate even during network outages.
Error Distribution: 78% of delivery failures were recoverable (device offline, temporary provider issues). Our retry system successfully delivered 94% of these within 15 minutes. The remaining 22% were invalid tokens or unsubscribed users, crucial data for maintaining list hygiene.
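The split between recoverable and permanent failures suggests a triage step like the sketch below. The error code strings and function names are hypothetical, and treating unknown codes as recoverable is an assumption that only makes sense when retries are bounded.

```python
from enum import Enum
from typing import Dict, Iterable, List, Tuple


class FailureKind(Enum):
    RECOVERABLE = "recoverable"  # retry later
    PERMANENT = "permanent"      # stop sending and clean up the recipient list


# Hypothetical error codes, mirroring the categories described above.
PERMANENT_CODES = {"invalid_token", "unsubscribed"}


def classify(code: str) -> FailureKind:
    """Unknown codes are treated as recoverable (assumes a bounded retry budget)."""
    if code in PERMANENT_CODES:
        return FailureKind.PERMANENT
    return FailureKind.RECOVERABLE


def triage(failures: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Split (recipient, error_code) pairs into a retry list and a prune list."""
    result: Dict[str, List[str]] = {"retry": [], "prune": []}
    for recipient, code in failures:
        if classify(code) is FailureKind.PERMANENT:
            result["prune"].append(recipient)
        else:
            result["retry"].append(recipient)
    return result


plan = triage([
    ("device-a", "device_offline"),
    ("device-b", "invalid_token"),
    ("device-c", "provider_timeout"),
    ("user-d", "unsubscribed"),
])
```

Here the prune list is the input to list hygiene, while the retry list goes back through the backoff path.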
Lessons Learned
- Start with metrics: Before touching code, we instrumented everything. This saved months of debugging by revealing the true bottlenecks immediately.
- Design for failure: Building circuit breakers and graceful degradation paths from day one prevented customer-impacting outages.
- Team alignment matters: Weekly cross-functional reviews with product, marketing, and operations ensured we solved the right problems.
- Test at scale early: Load testing with production-like traffic in staging caught scaling issues before they affected customers.
- Incremental wins: Shipping working improvements every two weeks kept stakeholders engaged and provided opportunities for course correction.
The project taught us that infrastructure transformations succeed not through revolutionary changes, but through methodical execution and continuous feedback. By maintaining backward compatibility throughout the migration, we achieved zero downtime while completely replacing the notification stack.
Looking Forward
With the system stable, we're now exploring AI-driven personalization engines and predictive notification scheduling. The infrastructure we built provides a solid foundation for these advanced features, with message queues capable of handling 5M ops/sec and room for growth into 2027 and beyond.
This project reaffirmed our belief that the best engineering solutions balance technical excellence with business outcome focus. Every architectural decision tied back to measurable improvements in user experience and operational efficiency.
