
8 May 2026 · 23 min read

Scaling to 2 Million Events/Minute: Building a Serverless Real-Time Analytics Engine

When our retail analytics customer's batch-powered dashboard began failing during peak hours, we faced a hard deadline: process live events at scale or lose their business. Their legacy system—a nightly Spark job on a 12-node EMR cluster—produced reports 14 hours after data collection, making it useless for real-time decision-making. Over 11 months, we architected and migrated them to a fully serverless real-time analytics pipeline on AWS, capable of ingesting, enriching, and serving 2 million events per minute with 99.99% uptime. This is the story of that transformation: the technology choices (Kinesis, DynamoDB Streams, Lambda, and ClickHouse), the data consistency patterns that kept 47 microservices coordinated, and the observability practices that turned 20-plus hours of weekly firefighting into a clean, automated operation. We'll examine why Lambda-powered stream processors won out over a dedicated Flink cluster, how we achieved exactly-once semantics across multiple async boundaries, and which operational guardrails made a 10x scale increase feasible without a proportional cost increase.

Case Study · serverless · streaming · Kinesis · Lambda · DynamoDB · ClickHouse · real-time analytics · AWS

Introduction: The Batch-to-Real-Time Tipping Point

In March 2025, RetailMetrics Inc., a B2B SaaS provider powering analytics for 180 mid-to-large retailers, confronted a critical inflection point. Their flagship product—an inventory optimization and demand forecasting platform—processed customer data through nightly batch jobs. A 12-node Apache Spark cluster on Amazon EMR ingested the day's sales transactions, ran predictive models, and generated dashboard reports by 10 AM the following morning. This cadence worked when retailers planned purchasing decisions days in advance. But as competitive pressures intensified and same-day delivery expectations became standard, "next-morning insights" turned from valuable to obsolete.

The breaking point arrived during the Q4 2024 holiday season. Three major clients—each representing $200K+ annual contracts—threatened cancellation unless the platform provided real-time inventory visibility. Their operations teams needed to see sell-through rates, stockout warnings, and demand surges as they happened, not 14 hours later. The business milestone was clear: deliver sub-minute data freshness for 90% of metrics, handle 5x holiday traffic spikes, and maintain 99.9% availability—all within 9 months, or risk a $1.8M revenue hit from cancellations.

The technical challenge was formidable: scale from tens of thousands of events per hour to millions per minute; transform from a once-a-day batch paradigm to continuous stream processing; maintain data accuracy across complex joins and aggregations; and do so without exploding costs or disrupting existing analytics workloads. Over the next 11 months, we led a complete platform redesign, moving from an EMR-based batch architecture to a fully serverless streaming architecture centered on Kinesis Data Streams, Lambda, DynamoDB, and ClickHouse. The project succeeded beyond expectations: by Black Friday 2025, the new system processed a peak of 2.1 million events per minute, delivered dashboards with <30-second latency, reduced infrastructure cost by 35%, and achieved 99.99% uptime. This case study dissects that journey—the architecture decisions that mattered, the data consistency patterns we used, the mistakes we encountered, and the operational practices that turned a high-risk transformation into a business-critical success.

The Legacy System: Why Batch Couldn't Scale

Before the migration, RetailMetrics' data pipeline had three primary stages:

  1. Ingestion: Retail partners uploaded CSV exports via SFTP or API POSTed daily transaction files (~5–50 GB per client).
  2. Batch processing: Every night at 2 AM, an AWS Step Functions workflow spun up a 12-node Spark EMR cluster, loaded the day's files into S3, ran Scala ETL jobs (joins with product master data, store hierarchy mapping, anomaly detection), then wrote Parquet outputs to S3.
  3. Serving: At 8 AM daily, an Athena query refreshed Redshift materialized views, and a Node.js API served dashboard requests from the aggregated dataset.

This architecture exhibited six fatal flaws for real-time needs:

1. Latency ceiling: Even in best-case scenarios, data was 12–14 hours stale. Store managers checking morning inventory at 8 AM were seeing yesterday's sell-through rates, not last hour's.

2. Spike fragility: Holiday seasons produced 5× daily transaction volume. The EMR cluster struggled with cluster provisioning overhead and Spark job GC pauses; four days in December 2024 saw incomplete batch jobs, requiring manual re-runs and causing delayed reports for key accounts.

3. Cost inefficiency: The EMR cluster ran 22 hours daily (2 AM provisioning → midnight tear-down, with idle time between jobs), costing $18K/month in EC2 spend despite low average utilization (~35%).

4. Data consistency gaps: Late-arriving files (common with retail partners' batch uploads) required manual reconciliation. If a retailer submitted a corrected CSV at 3 PM, that data wouldn't appear until the next day's batch—creating silent data gaps in daily reports.

5. Stunted innovation velocity: Adding a new metric required modifying Spark jobs, re-running historical backfills (often days-long), and manually testing against sample datasets. The team could ship at most two new features per quarter.

6. Amplified operational burden: The platform team spent 20+ hours weekly on EMR cluster tuning, job failure triage, and data quality fire-drills. The complexity of managing JVM-based Spark deployments on variable instance types created persistent toil.

One telling incident crystallized the urgency. A top client—a national apparel retailer with 220 stores—experienced a flash sale that sold 8,000 units of a single SKU in 45 minutes. Their inventory system showed the product as "in stock" across all stores for 3 hours because the batch hadn't processed the sales transactions yet. Customers placed orders for an item that was actually sold out, triggering a cascade of order cancellations and customer service volume that cost an estimated $50K in lost trust and operational overhead that day.

The business case was undeniable: real-time or near-real-time analytics was no longer a nice-to-have; it was essential for survival in a high-velocity retail environment. But the path from batch to streaming without disrupting ongoing operations demanded a careful, phased strategy.

Initial Requirements

We gathered requirements from three stakeholder groups:

  • Business users: "Show me sell-through by store and SKU within 5 minutes of the sale."
  • Data engineers: "Maintain SQL query compatibility with existing reports—we can't rewrite 500 saved queries."
  • Platform engineering: "Reduce operational toil, enable self-service data pipelines, and keep AWS spend under $22K/month."

From these, we derived technical non-negotiables:

  • Latency: 90% of event-to-dashboard time <30 seconds.
  • Throughput: Support 2M events/minute during peak holiday sales.
  • Availability: 99.99% uptime (≤52 minutes downtime/month).
  • Durability: Zero data loss, even during partial outages.
  • Query compatibility: Existing analytical SQL queries must work unchanged.
  • Cost cap: Infrastructure cost ≤ previous $22K/month baseline.

Architecture Deep Dive: The Serverless Streaming Stack

We evaluated several streaming architectures: managed Apache Flink (Kinesis Data Analytics), self-managed Spark Structured Streaming on EMR, Kafka with ksqlDB, and a pure serverless event-driven design using AWS-native services. After rigorous proof-of-concepts, we selected a serverless-first approach centered on six AWS services and two purpose-built data stores.

Core Ingestion Layer: Kinesis Data Streams

We chose Amazon Kinesis Data Streams over alternatives (SQS, Kafka MSK) because:

  • Native serverless scaling: Kinesis scales throughput by increasing shard count; no cluster management.
  • Ordering guarantees: Per-shard ordering preserved, essential for accurate time-series analytics.
  • Retention flexibility: Configurable 24-hour default retention, extendable to 365 days for replayability.
  • Integration ecosystem: Native Lambda event sources, Kinesis Client Library (KCL) compatibility, and Data Analytics for SQL.

We provisioned two stream types:

  1. Hot path stream: 256 shards initially, scaled to 1,024 during peak season. Carried raw point-of-sale events requiring <15-second processing (inventory deduction, fraud alerts).
  2. Cold path stream: 64 shards for detailed event archival and reconciliation. Lower priority, 60-second processing tolerance.

Each event carried a standard envelope:

{
  "eventId": "uuidv4",
  "retailerId": "acme-corp",
  "storeId": "NYC-042",
  "sku": "TSHIRT-001",
  "quantity": 2,
  "price": 24.99,
  "currency": "USD",
  "timestamp": "2025-11-28T14:23:45.000Z",
  "source": "pos-terminal-12",
  "schemaVersion": 3
}

Schema versioning was critical; we used a centralized schema registry built on AWS Glue to validate incoming payloads and enable safe schema evolution without breaking consumers.
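As a rough illustration, a consumer-side guard over that envelope might look like the following TypeScript sketch; the SaleEvent interface and the set of supported versions are assumptions, and the production check ran against Avro schemas in the Glue registry rather than a hand-rolled type guard:

// Hypothetical envelope type mirroring the JSON example above.
interface SaleEvent {
  eventId: string;
  retailerId: string;
  storeId: string;
  sku: string;
  quantity: number;
  price: number;
  currency: string;
  timestamp: string;
  source: string;
  schemaVersion: number;
}

// Illustrative guard: reject payloads missing required fields or carrying
// an unknown schema version. Production validation was registry-driven.
const SUPPORTED_SCHEMA_VERSIONS = new Set([2, 3]);

function isValidSaleEvent(payload: unknown): payload is SaleEvent {
  if (typeof payload !== "object" || payload === null) return false;
  const e = payload as Partial<SaleEvent>;
  return (
    typeof e.eventId === "string" &&
    typeof e.retailerId === "string" &&
    typeof e.sku === "string" &&
    typeof e.quantity === "number" && e.quantity > 0 &&
    typeof e.price === "number" &&
    typeof e.timestamp === "string" &&
    typeof e.schemaVersion === "number" &&
    SUPPORTED_SCHEMA_VERSIONS.has(e.schemaVersion)
  );
}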

Stream Processing with AWS Lambda

Our most contentious technical decision was using AWS Lambda for stream processing rather than a dedicated streaming engine like Flink or Spark. The platform team initially favored Flink for its stateful processing guarantees and exactly-once semantics. However, after evaluating operational complexity (managing Flink clusters, checkpointing, scaling), we opted for a Lambda-based approach with the following architecture:

We ran a per-shard Lambda function (Kinesis trigger with batch size 1,000 records, batch window 1 second). Each shard had a dedicated Lambda instance (warm container) processing records sequentially to preserve order. This provided:

  • Automatic scaling: Kinesis/Lambda integration automatically allocated one Lambda per shard during peak; scaled down to zero during quiet periods.
  • State management via DynamoDB: Lambda's ephemeral nature meant we stored intermediate state (running totals, session windows) in DynamoDB with conditional writes for atomicity.
  • Checkpointing via Kinesis: Kinesis checkpointing (sequence numbers) stored in DynamoDB ensured exactly-once processing semantics when combined with idempotent writes downstream.

Critics worried Lambda would struggle with 2M events/minute. Our math showed otherwise: with 1,024 shards and 1-second batch windows, each Lambda processed ~2K records/second, and the Kinesis integration allocates at most one concurrent execution per shard, well within the concurrency headroom we provisioned for the account. Actual production metrics confirmed this: average Lambda duration was 850ms per batch, with CPU utilization around 40%.

The Lambda function chain implemented four processing stages:

  1. Enrichment: Add store metadata, product category, and retailer-specific taxonomy IDs from DynamoDB reference tables.
  2. Validation: Schema checks, duplicate detection (eventId already processed?), and business rule validation (quantity > 0, price < 10,000).
  3. Deduplication: Maintain a 7-day rolling deduplication cache in DynamoDB with TTL; duplicates triggered compensating delete events.
  4. Routing: Classify events into hot (inventory-critical) vs. cold (historical) paths; publish to respective Kinesis streams or directly to ClickHouse.

Lambda's 15-minute timeout ceiling was never approached; the longest batch processing (including three DynamoDB lookups and two Kinesis puts) was 3.2 seconds.
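A condensed sketch of what such a per-shard handler might look like; the helper functions (isValidSaleEvent, isDuplicate, enrich, route) are hypothetical stand-ins for the four stages above, and the partial-batch failure response assumes the event source mapping is configured to report batch item failures:

import { KinesisStreamEvent, KinesisStreamBatchResponse } from "aws-lambda";

// Hypothetical helpers standing in for the DynamoDB lookups and downstream writes.
declare function isValidSaleEvent(payload: unknown): boolean;
declare function isDuplicate(eventId: string): Promise<boolean>;
declare function enrich(payload: unknown): Promise<Record<string, unknown>>;
declare function route(event: Record<string, unknown>): Promise<void>;

export const handler = async (
  event: KinesisStreamEvent
): Promise<KinesisStreamBatchResponse> => {
  const failures: { itemIdentifier: string }[] = [];

  for (const record of event.Records) {
    try {
      const payload = JSON.parse(
        Buffer.from(record.kinesis.data, "base64").toString("utf8")
      );

      if (!isValidSaleEvent(payload)) continue;          // validation
      if (await isDuplicate(payload.eventId)) continue;  // deduplication
      const enriched = await enrich(payload);            // enrichment
      await route(enriched);                             // hot/cold routing
    } catch (err) {
      // Report only the failed record so the rest of the batch is not replayed.
      failures.push({ itemIdentifier: record.kinesis.sequenceNumber });
    }
  }

  return { batchItemFailures: failures };
};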

State Management with DynamoDB and Streams

Stateful operations—inventory counts, daily sales aggregates, running anomaly scores—required durable storage. We adopted DynamoDB for its single-digit millisecond latency and native DynamoDB Streams integration, enabling event-driven updates across services.

Key data models:

Inventory items table (partition key: sku, sort key: storeId):

{
  "sku": "TSHIRT-001",
  "storeId": "NYC-042",
  "quantityOnHand": 45,
  "quantityReserved": 3,
  "lastUpdated": "2025-11-28T14:23:45.000Z",
  "version": 12
}

DynamoDB conditional writes with the version attribute (optimistic locking) ensured inventory updates never overwrote concurrent modifications. When two sales for the same SKU arrived simultaneously, Lambda's sequential per-shard processing combined with DynamoDB conditional writes prevented overselling.
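To make the pattern concrete, here is a minimal TypeScript sketch of such a guarded decrement using the AWS SDK v3 document client; the table name (InventoryItems) and the exact expressions are illustrative, not the production code:

import { DynamoDBClient, ConditionalCheckFailedException } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, UpdateCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Decrement on-hand stock only if the item still carries the version we read
// and has enough quantity; otherwise the write is rejected and can be retried.
async function decrementInventory(
  sku: string,
  storeId: string,
  quantity: number,
  expectedVersion: number
): Promise<boolean> {
  try {
    await ddb.send(new UpdateCommand({
      TableName: "InventoryItems", // assumed table name
      Key: { sku, storeId },
      UpdateExpression:
        "SET quantityOnHand = quantityOnHand - :q, #v = #v + :one, lastUpdated = :now",
      ConditionExpression: "#v = :expected AND quantityOnHand >= :q",
      ExpressionAttributeNames: { "#v": "version" }, // alias in case "version" collides with a reserved word
      ExpressionAttributeValues: {
        ":q": quantity,
        ":one": 1,
        ":expected": expectedVersion,
        ":now": new Date().toISOString(),
      },
    }));
    return true;
  } catch (err) {
    if (err instanceof ConditionalCheckFailedException) return false; // lost the race or would oversell
    throw err;
  }
}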

Aggregate tables for rollups:

  • DailySales aggregations: Composite key (retailerId + date + sku), updated by Lambda via DynamoDB UpdateExpression ADD operations (atomic increments).
  • StoreHourMetrics: Partitioned by storeId and hour bucket, supporting efficient hour-over-hour comparisons.
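A corresponding sketch of the atomic increment path, reusing the ddb document client and SaleEvent shape from the earlier sketches; the table, key, and attribute names are assumptions:

// Atomic rollup: ADD creates the counters on first write and increments them
// thereafter, so concurrent Lambdas never clobber each other's updates.
async function recordDailySale(e: SaleEvent): Promise<void> {
  const dateKey = e.timestamp.slice(0, 10); // YYYY-MM-DD bucket
  await ddb.send(new UpdateCommand({
    TableName: "DailySales", // assumed table name
    Key: { pk: `${e.retailerId}#${dateKey}#${e.sku}` },
    UpdateExpression: "ADD unitsSold :units, revenue :rev",
    ExpressionAttributeValues: {
      ":units": e.quantity,
      ":rev": e.quantity * e.price,
    },
  }));
}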

DynamoDB Streams captured every data change and fed three downstream consumers:

  1. Real-time API cache invalidator: Invalidated Redis cache entries when underlying inventory changed, ensuring dashboard queries never served stale data.
  2. Audit log generator: Wrote every change to an append-only S3 audit log for compliance and debugging.
  3. Anomaly detector: A Lambda function consumed streams, running statistical models (z-score on sales velocity) to detect promotion spikes or potential data errors.

Analytical Storage: ClickHouse for Sub-Second Queries

While DynamoDB served operational lookups and small aggregates, analytical queries—"show me last 30 days of daily sales for this SKU across all stores"—required columnar storage. We evaluated Amazon Redshift, Snowflake, and open-source ClickHouse. Redshift was rejected due to minimum 2-year commitments and costly concurrency scaling. Snowflake's pay-per-query model threatened to exceed our cost cap at projected query volumes. ClickHouse, deployed on Amazon EC2 spot instances behind a load balancer, offered the right blend: SQL-compatible (simplifying query migration), columnar compression, and sub-second performance on time-series aggregations.

ClickHouse cluster configuration:

  • 4 data nodes (r6g.2xlarge, 8 vCPU, 64 GiB) on spot instances with 3-day maximum duration, backed by Auto Scaling groups.
  • 2 coordinator nodes (r6g.xlarge) for query parsing and distribution.
  • Replication: ClickHouse's built-in replication across data nodes with quorum writes ensured durability even if one node failed.
  • Schema: Star schema on replicated tables, with an events fact table (partitioned by toYYYYMMDD(timestamp), ordered by (retailerId, sku)) and dimension tables for stores, products, and retailers.

Lambda stream processors wrote directly to ClickHouse via the native HTTP interface, using batch inserts of 10K rows for efficiency. We implemented exactly-once semantics by:

  1. Including a processingId (Kinesis sequence number + shard ID) in every row.
  2. Creating a deduplication_key materialized view that dropped rows with duplicate processing IDs during insertion.
  3. Setting SETTINGS insert_deduplicate=1 on the ClickHouse cluster, which used the deduplication key to silently ignore replayed events.
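A sketch of that batch-insert path over ClickHouse's HTTP interface; the host, table name, and row shape are assumptions, and the deduplication setting is passed as a query parameter:

// Minimal batch insert via ClickHouse's HTTP interface (port 8123),
// sending newline-delimited JSON rows that include the processingId column.
async function insertBatch(rows: Record<string, unknown>[]): Promise<void> {
  const body = rows.map((r) => JSON.stringify(r)).join("\n");
  const query = encodeURIComponent("INSERT INTO events FORMAT JSONEachRow");
  const res = await fetch(
    `http://clickhouse.internal:8123/?query=${query}&insert_deduplicate=1`,
    { method: "POST", body }
  );
  if (!res.ok) {
    throw new Error(`ClickHouse insert failed: ${res.status} ${await res.text()}`);
  }
}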

The results were striking: complex analytical queries that took 8–12 seconds on Redshift now executed in 400–800 milliseconds. The team could explore data interactively in their BI tool (Metabase) instead of queuing batch jobs.

API Layer: GraphQL Gateway with Caching

We replaced the legacy Node.js REST API with a GraphQL gateway built on Apollo Server deployed to AWS Fargate behind an Application Load Balancer. GraphQL was chosen because:

  • Retailers' BI tools generated varied queries; GraphQL allowed clients to request exactly the fields they needed, reducing over-fetching.
  • We could batch per-resolver data lookups with the DataLoader pattern, preventing N+1 query problems (see the sketch after this list).
  • Schema stitching enabled us to expose both real-time (from DynamoDB/ClickHouse) and batch (from S3/Athena) data in a unified interface.
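Here is a minimal sketch of that batching with DataLoader in front of DynamoDB; the Products table and key shape are assumptions:

import DataLoader from "dataloader";
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, BatchGetCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Per-request loader: individual product lookups issued by GraphQL resolvers
// are coalesced into a single DynamoDB BatchGetItem call.
const productLoader = new DataLoader<string, Record<string, unknown> | null>(
  async (skus) => {
    const result = await ddb.send(new BatchGetCommand({
      RequestItems: {
        Products: { Keys: skus.map((sku) => ({ sku })) }, // assumed table name
      },
    }));
    const bySku = new Map(
      (result.Responses?.Products ?? []).map((item) => [item.sku as string, item])
    );
    // DataLoader requires results in the same order as the requested keys.
    return skus.map((sku) => bySku.get(sku) ?? null);
  }
);

// In a resolver: const product = await productLoader.load(parent.sku);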

The gateway used a two-tier caching strategy:

  1. Hot cache: Redis (ElastiCache) cached frequently accessed reference data (product catalogs, store hierarchies) with 60-second TTL, reducing DynamoDB read costs by 65%.
  2. Query cache: Apollo's in-memory cache for resolved GraphQL objects (TTL 30 seconds for real-time inventory queries).

All API requests were authenticated via Amazon Cognito with JWT tokens, passed through a Lambda@Edge authorizer for regional latency minimization.

Event-Driven Coordination: EventBridge for Service Orchestration

While Kinesis handled high-throughput event streams, we used Amazon EventBridge for orchestration and workflow events—especially for tasks requiring reliable retries and dead-letter handling. Examples:

  • Inventory.Reconciliation.Daily: Scheduled nightly event triggering a Lambda that compared DynamoDB inventory counts with point-of-sale totals, flagging discrepancies >2% for investigation.
  • Anomaly.Review.Required: When stream processors detected sales velocity >5σ from baseline, they published to EventBridge, which routed to a Slack channel and created a Jira ticket via SaaS integration.
  • Client.DataExported: After a daily snapshot was exported to S3, EventBridge triggered an SNS notification telling clients the file was available on their SFTP endpoints.
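As an illustration, publishing the anomaly-review event from a stream processor might look like the following sketch; the bus name, source string, and detail fields are assumptions:

import { EventBridgeClient, PutEventsCommand } from "@aws-sdk/client-eventbridge";

const eventBridge = new EventBridgeClient({});

// Publish a single Anomaly.Review.Required event to the operations bus.
async function publishAnomalyReview(retailerId: string, sku: string, zScore: number): Promise<void> {
  await eventBridge.send(new PutEventsCommand({
    Entries: [{
      EventBusName: "retailmetrics-operations", // assumed bus name
      Source: "analytics.stream-processor",
      DetailType: "Anomaly.Review.Required",
      Detail: JSON.stringify({ retailerId, sku, zScore, detectedAt: new Date().toISOString() }),
    }],
  }));
}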

Migration Strategy: Strangler Fig with Dual Write

We couldn't afford a "big bang" cutover; retailers relied on the existing dashboards 24/7. Instead, we used the strangler fig pattern with a dual-write transition strategy over 9 months.

Phase 1: Parallel Ingestion (Months 1–3)

We deployed the new streaming pipeline alongside the old batch system. Retailers' POS systems continued uploading CSVs to the existing SFTP/S3 ingestion, which fed the nightly Spark job and a new Kinesis Producer Library (KPL) agent we installed on their ingestion servers. This agent read new files in near-real-time (as they completed uploading) and published events to Kinesis, creating a live parallel feed without disrupting existing partners' workflows.

Simultaneously, we built a shadow-mode validator: a Lambda function consumed both the new Kinesis feed and a replay of the historical batch outputs, compared results daily, and flagged discrepancies for investigation. This gave us confidence that the streaming system matched batch results before any customer saw live data.

Phase 2: Canary Dashboard (Months 4–6)

We introduced a new "real-time" dashboard view within the existing UI, initially available only to internal users and two beta clients. The old batch-powered dashboards remained the default. This allowed us to:

  • Validate data correctness under real-world load.
  • Monitor ClickHouse query performance and optimize indexes.
  • Gather user feedback on refresh cadence and data freshness expectations.

During this phase, we also migrated historical data (~2 years of daily aggregates) to ClickHouse using an ETL job that read from Redshift and wrote to ClickHouse, ensuring historical comparisons worked seamlessly in the new system.

Phase 3: Staged Rollout (Months 7–9)

We progressively rolled out real-time dashboards by retailer cohort, starting with smaller clients (low event volume) and advancing to enterprise accounts. Each rollout used a feature flag controlled via Amazon DynamoDB (per-retailer settings), allowing instant rollback without deployment.

The critical cutover moment came with our largest client—a 450-store national retailer representing 22% of revenue. We scheduled a 4-hour maintenance window during their slowest sales period (Tuesday 2–6 AM). Steps:

  1. The final batch job ran at 2 AM, ensuring no gaps.
  2. We switched their KPL agent to produce exclusively to Kinesis (no S3 upload).
  3. Updated their dashboard configuration to point to the new GraphQL endpoints.
  4. Monitored error rates, latency, and data freshness for 4 hours.

Success criteria met: latency <20 seconds, zero errors, inventory counts matched batch verification. After 48 hours of monitoring with no anomalies, we proceeded with remaining clients.

Phase 4: Decommissioning (Months 10–11)

Once all clients migrated, we deprecated the batch pipeline:

  1. Retained Redshift for 90 days as read-only backup.
  2. Shut down EMR cluster and associated Step Functions workflows.
  3. Archived raw CSV files from S3 to Glacier Deep Archive (cost $0.00099/GB-month).
  4. Removed legacy ingestion APIs after confirming KPL agent coverage reached 99.7% of retailers.

The final cost profile revealed $7,800/month savings vs. the previous architecture—a 35% reduction—despite higher event volumes.

Performance Results and Business Impact

The production metrics after full migration tell the story:

Metric: before (batch) → after (streaming)

  • Data freshness (p99): 14 hours → 28 seconds (99.9% faster)
  • Peak events processed/minute: ~200K → 2.1M (10.5× increase)
  • Average query latency: 8.4s → 620ms (93% faster)
  • Dashboard availability: 99.5% → 99.99% (+0.49 pp)
  • Infrastructure cost/month: $22K → $14.3K (35% reduction)
  • Operational toil (hours/week): 20+ → 3 (85% reduction)

Business outcomes were equally compelling:

  • Client retention: All at-risk clients renewed, with three expanding contract value by 40–60%.
  • Feature velocity: From 2 features/quarter to 8–10 features/month; the team shipped real-time stock alerts, promo-tracking dashboards, and anomaly detection within 4 months post-launch.
  • Sales cycle reduction: The "real-time" capability cut the enterprise sales cycle from 9 months to 4 months, becoming the primary value proposition.
  • ROI: The platform team estimated the migration paid for itself within 7 months through retained revenue and reduced operational costs.

During Black Friday/Cyber Monday 2025, the system handled a sustained 1.8M events/minute for 6 hours with zero errors. One retailer used the real-time dashboards to identify a stockout within 12 minutes and re-routed shipments from a nearby warehouse, preventing an estimated $120K in lost sales.

Challenges and Trade-offs

No migration proceeds without difficulties. We encountered several challenges that required mid-course corrections.

Lambda Cold Starts and Warm Pool Management

Initially, we configured each Kinesis shard with a Lambda trigger using batch size = 1,000 and starting position = LATEST. However, during early load tests, we observed throttling when shard count scaled rapidly from 256 to 1,024 and the account's Lambda concurrency limit (1,000 concurrent executions by default) became a bottleneck. Additionally, cold starts on newly allocated shards added 1.5–3 seconds of latency.

Solution:

  1. Increased account concurrency limit to 3,000 with AWS support.
  2. Implemented provisioned concurrency for Lambda functions, pre-warming 512 concurrent executions to cover rapid scale-out.
  3. Reduced batch size to 500 records, giving Lambda more frequent invocations but smaller, more predictable processing times.

The configuration after this fix: batch size 500, batch window 500ms, provisioned concurrency 512 (scaled via Application Auto Scaling based on Kinesis incoming record count). The batch size was later raised again during cost tuning, described below.

Data Consistency Across Services

We needed strong consistency for inventory counts but eventual consistency was acceptable for sales aggregates. To manage this, we implemented sagas with idempotent writes:

  1. Sale event arrives → Lambda decrements inventory in DynamoDB with a conditional write on the expected version attribute (the optimistic lock described earlier).
  2. On success, Lambda publishes Sale.Processed event to EventBridge.
  3. If inventory update fails (insufficient quantity), Lambda publishes Sale.Failed event, triggering a compensating action (restock reservation release).

Idempotency was enforced at the event level: each event carried a unique eventId. All state-changing operations checked a DynamoDB processedEvents table (with TTL 48 hours) before execution, ensuring accidental replays (due to Kinesis retries or Lambda timeouts) couldn't double-count sales.
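A minimal sketch of that guard, assuming a processedEvents table keyed on eventId with a TTL attribute named expiresAt:

import { DynamoDBClient, ConditionalCheckFailedException } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Claim-before-process: the first writer wins; replays hit the conditional
// failure and are skipped instead of double-counting the sale.
async function claimEvent(eventId: string): Promise<boolean> {
  const expiresAt = Math.floor(Date.now() / 1000) + 48 * 3600; // TTL of 48 hours
  try {
    await ddb.send(new PutCommand({
      TableName: "processedEvents",
      Item: { eventId, expiresAt },
      ConditionExpression: "attribute_not_exists(eventId)",
    }));
    return true; // first time we see this event, safe to process
  } catch (err) {
    if (err instanceof ConditionalCheckFailedException) return false; // replayed event
    throw err;
  }
}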

Cost Overruns and Optimization

Two months into production, we discovered Kinesis costs were 40% above projections (driven by higher-than-expected shard count during peak season). DynamoDB read/write capacity costs were also high due to Lambda queries during enrichment.

Optimizations implemented:

  • Aggressive batching: Increased Lambda batch size to 1,000 records, reducing per-record processing cost by 40%.
  • Adaptive shard scaling: Implemented custom CloudWatch metrics that scaled shards in steps (256 → 512 → 768 → 1,024) rather than tracking a linear forecast, avoiding over-provisioning by 15% (see the sketch after this list).
  • DynamoDB on-demand mode: Switched from provisioned capacity (even with auto-scaling) to on-demand mode, eliminating over-provisioned read capacity units (RCUs) during off-peak hours.
  • Data compression: Enabled GZIP compression on Kinesis producer using KPL, reducing per-put costs by 60%.
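A minimal sketch of that stepped scale-up using the Kinesis UpdateShardCount API; the stream name is an assumption, and the step values mirror the rungs above:

import { KinesisClient, UpdateShardCountCommand } from "@aws-sdk/client-kinesis";

const kinesis = new KinesisClient({});

// Jump to the next rung rather than chasing a continuous linear forecast.
const SHARD_STEPS = [256, 512, 768, 1024];

async function scaleUp(currentShards: number): Promise<void> {
  const target = SHARD_STEPS.find((step) => step > currentShards);
  if (!target) return; // already at the top rung
  await kinesis.send(new UpdateShardCountCommand({
    StreamName: "hot-path-events", // assumed stream name
    TargetShardCount: target,
    ScalingType: "UNIFORM_SCALING",
  }));
}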

Monthly infrastructure spend settled at $14.3K, 35% below the previous $22K despite 10× higher throughput.

Schema Evolution Without Breaking Consumers

Retailers occasionally added custom fields to their event payloads (promo codes, loyalty IDs). Without a schema registry, backward compatibility would have been fragile. We implemented a central schema registry using AWS Glue Schema Registry with Apache Avro schemas, versioned and centrally managed. Lambda producers serialized events to Avro; consumers deserialized with fallback defaults for missing fields. This allowed us to add new fields without requiring coordinated deployments across 47 microservices.

Observability: From Firefighting to Predictive Operations

Legacy batch operations meant we only monitored daily job success/failure. Streaming is continuous—you can't just look at "yesterday's job." We built an observability stack early in the migration so we wouldn't be overwhelmed by alerts.

Three Pillars: Metrics, Traces, Logs

We instrumented every Lambda function, DynamoDB table, and Kinesis stream with Amazon CloudWatch and Datadog.

Key metrics (SLO-driven):

  • End-to-end latency: Time from POS event generation to availability in GraphQL API (target: p95 <30s).
  • Pipeline health: Kinesis iterator age (milliseconds behind) < 10,000 for hot path; < 60,000 for cold path.
  • Data accuracy: Daily reconciliation metric counts matching source-of-truth batch system (target: 100%).
  • Service availability: Uptime for API gateway, GraphQL endpoint, and authentication (SLA 99.99%).
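As an illustration, the hot-path iterator-age SLO could be encoded as a CloudWatch alarm in CDK roughly like this; the helper function, construct IDs, and stream name are assumptions, and the threshold follows the list above:

import { Duration } from "aws-cdk-lib";
import * as cloudwatch from "aws-cdk-lib/aws-cloudwatch";
import { Construct } from "constructs";

// Alarm when the hot-path stream falls more than 10 seconds behind.
export function addIteratorAgeAlarm(scope: Construct, streamName: string): cloudwatch.Alarm {
  const iteratorAge = new cloudwatch.Metric({
    namespace: "AWS/Kinesis",
    metricName: "GetRecords.IteratorAgeMilliseconds",
    dimensionsMap: { StreamName: streamName },
    statistic: "Maximum",
    period: Duration.minutes(1),
  });

  return new cloudwatch.Alarm(scope, `${streamName}-IteratorAge`, {
    metric: iteratorAge,
    threshold: 10_000, // 10 seconds behind, per the hot-path SLO
    evaluationPeriods: 3,
    comparisonOperator: cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD,
    treatMissingData: cloudwatch.TreatMissingData.BREACHING,
  });
}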

Distributed tracing: Every event carried a trace ID through Lambda, DynamoDB, and ClickHouse writes using AWS X-Ray. This enabled querying "show me the full path of event ABC123" to debug latency spikes.

Structured logging: All services emitted JSON logs with requestId, retailerId, and processingTimeMs fields, enabling precise filtering during incident response.

Alerting with Runbooks

We established escalation policies with PagerDuty:

  • P0: Data staleness > 5 minutes OR >5% error rate for >2 minutes → page on-call engineer immediately.
  • P1: Iterator age >30 seconds → Slack alert with 15-minute response SLA.
  • P2: Any single service error rate >1% → daily digest, no immediate action.

Each alert linked to a Confluence runbook documenting diagnosis steps, mitigation actions, and post-mortem requirements. This reduced mean time to resolution (MTTR) from 45 minutes to 9 minutes over 6 months.

Chaos Engineering

Monthly "GameDay" exercises injected failures:

  • Terminated Lambda execution environments randomly to validate statelessness.
  • Disabled a DynamoDB table's replica to test failover and measure impact.
  • Added artificial latency to Kinesis PutRecord calls (2-second delay) to verify backpressure handling.

These exercises uncovered hidden dependencies (e.g., one Lambda had cached DynamoDB credentials in global scope, causing failures during IAM credential rotation) and improved system resilience proactively.

Operational Practices: The Platform Team's Playbook

The migration required more than technology—it demanded disciplined operational rhythms. Our platform team adopted practices from Site Reliability Engineering (SRE) to keep the system reliable while enabling product velocity.

Infrastructure as Code with AWS CDK

All infrastructure defined in TypeScript via AWS CDK. This provided:

  • Versioned deployments: Every change peer-reviewed through GitHub pull requests.
  • Environment consistency: dev, staging, and production were provisioned from the same parameterized constructs.
  • Drift detection: Daily CloudFormation drift detection alerted if manual changes occurred.

Key CDK constructs included:

  • KinesisStreamConstruct: Parameterized shard count, retention, encryption.
  • LambdaProcessorConstruct: Standardized Lambda function with X-Ray auto-instrumentation, structured logging, and DLQ configuration (sketched after this list).
  • ClickHouseClusterConstruct: Spot instance fleet with instance refresh and health checks.
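A simplified sketch of what one of these constructs might look like, pairing a parameterized stream with an instrumented processor; the property names, defaults, and runtime are illustrative rather than the production definitions:

import { Construct } from "constructs";
import { Duration } from "aws-cdk-lib";
import * as kinesis from "aws-cdk-lib/aws-kinesis";
import * as lambda from "aws-cdk-lib/aws-lambda";
import * as sqs from "aws-cdk-lib/aws-sqs";
import { KinesisEventSource, SqsDlq } from "aws-cdk-lib/aws-lambda-event-sources";

export interface StreamProcessorProps {
  shardCount: number;
  handlerAsset: string; // path to the bundled Lambda code
}

// Illustrative construct: encrypted Kinesis stream + traced Lambda processor with a DLQ.
export class StreamProcessorConstruct extends Construct {
  constructor(scope: Construct, id: string, props: StreamProcessorProps) {
    super(scope, id);

    const stream = new kinesis.Stream(this, "Stream", {
      shardCount: props.shardCount,
      retentionPeriod: Duration.hours(24),
      encryption: kinesis.StreamEncryption.MANAGED,
    });

    const dlq = new sqs.Queue(this, "ProcessorDlq");

    const fn = new lambda.Function(this, "Processor", {
      runtime: lambda.Runtime.NODEJS_20_X,
      handler: "index.handler",
      code: lambda.Code.fromAsset(props.handlerAsset),
      tracing: lambda.Tracing.ACTIVE, // X-Ray auto-instrumentation
      timeout: Duration.minutes(1),
    });

    fn.addEventSource(new KinesisEventSource(stream, {
      startingPosition: lambda.StartingPosition.LATEST,
      batchSize: 500,
      reportBatchItemFailures: true,
      onFailure: new SqsDlq(dlq),
      retryAttempts: 5,
    }));
  }
}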

CI/CD for Infrastructure and Application Code

We used GitHub Actions with the following workflow:

  1. PR validation: on every pull request, run unit tests, static analysis (ESLint, SonarCloud), and CDK synth to ensure the infrastructure compiles.
  2. Staging deployment: On merge to main, automatic deployment to staging environment with integration tests.
  3. Canary release: Deploy to 5% of production shards, monitor SLOs for 15 minutes.
  4. Progressive rollout: If SLOs green, increase to 25%, 50%, 100% over 30 minutes; automatic rollback on threshold breach.

All changes required passing pipeline and code review; no manual deployments permitted.

Slack Bot for Common Operations

We built a Slack bot (@metrics-bot) for day-to-day operations:

  • /alerts: Show current active alerts with acknowledgment button.
  • /inventory sku:TSHIRT-001 store:NYC-042: Query real-time inventory from DynamoDB.
  • /deploy service:cart-api version:1.2.3: Trigger canary release of specific service.
  • /scale shards +64: Increase Kinesis shard count (with approvalRequired workflow).

This reduced on-call toil and enabled self-service operations for product engineers.

Lessons Learned and Recommendations

Reflecting on the 11-month journey, several lessons stand out—some learned the hard way. If we were to do this again, here's what we'd emphasize:

1. Instrumentation is Not Optional—It's Day Zero Work

We began meaningful instrumentation only after Phase 1, which delayed our ability to measure streaming behavior accurately. The most critical lesson: deploy observability before any production traffic. That means:

  • Structured logging with correlation IDs from day one.
  • Distributed tracing across all service boundaries.
  • Business metrics (not just technical) tracked in dashboards from the first event.
  • SLOs defined and monitored before launch, not after.

2. Choose Idempotency Over Exactly-Once (Initially)

We initially chased exactly-once processing semantics across every service. However, the complexity cost was high. Instead, we embraced idempotent writes everywhere—allowing at-least-once delivery from Kinesis but ensuring duplicate events didn't corrupt state. This was simpler (just store processed event IDs for 24–48 hours) and sufficiently correct. Exactly-once became necessary only for financial calculations (tax, billing), which we isolated into a separate, carefully audited service.

3. Serverless Isn't Always Cheaper—It's About Right-Sizing

Lambda's per-execution pricing worked well for bursty, spiky workloads. However, for always-on services (GraphQL gateway, ClickHouse coordinators), Fargate/EC2 with reserved instances was more cost-effective. The hybrid approach—serverless for spiky event processors, containers for steady-state services—optimized both cost and performance.

4. Optimize for the 99th Percentile, Not Just Average

We initially celebrated a p50 latency of 200ms. But retailers checking dashboards on mobile networks cared about p95 and p99. Shifting optimization focus to tail latencies (by reducing DynamoDB hot partitions, adding read replicas) improved user experience more than further reducing median latency.

5. Build a Data Quality Dashboard Early

Monitoring data quality separately from system health was a game-changer. We built a dashboard showing:

  • Daily record counts by source (SFTP, API) vs. historical averages.
  • Schema validation failure rates.
  • Duplicate event percentages.
  • Reconciliation gaps vs. legacy batch.

This caught data feed issues (a retailer's API started sending malformed JSON) before clients noticed missing dashboard data.

6. Deprecation Is a Feature

We underestimated the effort of decommissioning the old system. Legacy cron jobs, monitoring alerts, and documentation remained active, creating cognitive overhead. We appointed a "deprecation lead" whose sole job was to identify and remove legacy components systematically. Clear ownership accelerated cleanup.

Conclusion: The Real-Time Future is Here

The migration from batch to real-time analytics transformed RetailMetrics from a reporting vendor into a strategic partner. Their customers no longer just look at past performance—they act on current reality. Inventory managers adjust orders mid-day based on sell-through rates, promotion effectiveness is measured hourly, and supply chain teams respond to demand surges within minutes rather than days.

The technical blueprint—serverless event ingestion, Lambda stream processors, DynamoDB for operational state, ClickHouse for analytics, and GraphQL for flexible APIs—proved both performant and cost-effective. More importantly, the platform gained the agility to ship features rapidly and the resilience to handle 10× traffic without breaking.

Looking ahead, we're already extending the architecture to support predictive inventory recommendations using lightweight ML models deployed as Lambda layers. The same streaming pipeline that processes sales events will soon feed demand forecasting models running on Amazon SageMaker, creating a closed-loop system where predictions continuously improve as more data flows through. That's the power of streaming: it's not just about faster data—it's about creating a live nervous system for the business.

For teams considering a similar transition: start with a well-defined strangler strategy, instrument everything before launch, and choose the simplest reliability mechanisms that truly meet your consistency needs. Streaming complexity is best managed incrementally, not all at once. The destination—a responsive, data-driven organization—is worth the journey.
