
15 May 2026 · 12 min read

Orchestrating Scale: How LogisticsCo Rebuilt Their Operations Backend to Handle 10× Holiday Volume

When Bangalore's fast-growing logistics platform LogisticsCo faced their first true test of scale, the signs had been clear for months and ignored: 280,000 daily delivery assignments running on a three-year-old monolithic backend, with query times exceeding 3,800 milliseconds at peak and connection pools saturated every afternoon. Engineers were patching production at 11 PM on Tuesdays, and the engineering lead privately called it a quiet, ticking catastrophe. Rather than apply another round of emergency fixes and absorb another six-figure cloud overrun, a 14-person engineering team chose to map every bottleneck, instrument the live system with OpenTelemetry, and rebuild the entire operations layer from the ground up over eight weeks using NestJS, PostgreSQL read replicas, Redis caching, BullMQ async workers, and a deliberate CQRS architecture. The results were decisive: P99 latency fell from 3,840 ms to 117 ms, reconciliation dropped from 18.5 hours to 1.7 hours, monthly cloud spend shrank 31 percent, and the system processed 2.8 million holiday deliveries with zero production incidents. This is the complete case study — the problems, the decisions, the metrics, and the lessons every engineering team needs at a growth inflection point.

Case Study · backend engineering · scalability · logistics technology · Node.js · NestJS · system architecture · database optimization · DevOps

📋 Overview

LogisticsCo is a Bangalore-based last-mile delivery technology company founded in early 2022. Their platform connects e-commerce merchants with a network of over 45,000 independent delivery partners across India's Tier 1 and Tier 2 cities. By November 2025, the company was processing an average of 280,000 delivery assignments per day — and their monolithic operations backend, built on an aging stack and rushed into production without a formal architecture review, was rapidly approaching its breaking point.

This case study documents the end-to-end reconstruction of that backend: the original constraints that led to the decision, the technology choices made, the implementation strategy, and the concrete metrics that proved the rebuild was worth the investment.

🔥 The Challenge

The warning signs had been there for months. In August 2025, during the Rakshabandhan sales weekend, the operations dashboard regularly loaded in 30–40 seconds during peak hours. The Ops Support team logged 47 distinct errors in a single four-hour evening shift. Engineers were patching live code at 11 PM on a Tuesday — not because they had the time and space to do it properly, but because the nightly delivery reconciliation job was failing and cascading into the customer notification queue.

The root causes were numerous and interconnected. The original monolith had been written by a three-person contractor team in 11 weeks, and it had been running continuously without a significant refactor for nearly three years. The API layer was a single Express.js application with sprawling, monolithic route handlers. Shared resource locks meant that every read operation briefly held a write lock. The PostgreSQL instance, running on an under-provisioned AWS db.t3.medium, routinely saturated its CPU during peak hours. There was no caching layer between the application and the database. There were no background job queues — everything was synchronous, every time. When a single partner query timed out, the entire request thread stalled.

The holiday sales season — Diwali and Christmas combined — was projected to hit 2.8 million delivery assignments in a seven-day window, a 10× increase over a typical week. The engineering leadership team knew that if they tried to patch and tune the existing system in place, it would buckle. The question was: could they rebuild the entire operations layer — routing, assignment, reconciliation, notifications, and reporting — in under eight weeks before the holiday rush began?

🎯 Project Goals

The leadership team anchored the project to five concrete goals before writing a single line of new code.

First, peak latency for all core operations APIs had to stay below 200 milliseconds at the P99 level even during simulated holiday traffic. Second, the entire reconciliation batch process, which previously consumed 18 hours of overnight compute, needed to complete in under two hours. Third, the system needed to survive the total loss of a single downstream partner API without cascading failures into the user-facing dashboard. Fourth, the monthly infrastructure bill, which had been growing by 22% year-over-year largely due to over-provisioning to cover peak surges, needed to decrease or at minimum remain stable. Fifth, engineering velocity — new feature lead time — could not degrade during the rebuild itself.
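
One way to make a latency goal like "P99 under 200 ms during simulated holiday traffic" enforceable rather than aspirational is to encode it as a failing threshold in a load-test harness. The sketch below uses k6 purely for illustration; the tool choice, endpoint, and traffic shape are assumptions, not details from the LogisticsCo project.

```typescript
// load-test.ts — a sketch of encoding the P99 < 200 ms goal as a hard threshold.
// The target URL, VU counts, and stage durations are illustrative assumptions.
import http from 'k6/http';
import { sleep } from 'k6';

export const options = {
  // Ramp toward holiday-peak request rates, hold, then ramp down.
  stages: [
    { duration: '5m', target: 400 },
    { duration: '20m', target: 400 },
    { duration: '5m', target: 0 },
  ],
  // The run fails if the 99th-percentile request duration exceeds 200 ms.
  thresholds: {
    http_req_duration: ['p(99)<200'],
  },
};

export default function () {
  http.get('https://staging.example.internal/api/assignments?partnerId=demo');
  sleep(1);
}
```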

🔬 Approach and Architecture Decisions

Before committing to an approach, the engineering team conducted a two-week deep-dive audit. They instrumented the running monolith with OpenTelemetry spans and traced 17,000 real requests across a representative week of production traffic. The data told a story that was worse than the anecdotal evidence had suggested: 42% of all database connections were blocked waiting for the same set of six queries. The average API request went through 14 independent data fetches. Only 6% of database queries used indexes.
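
For teams reproducing this kind of audit, the tracing bootstrap itself is small. The following is a minimal sketch of an OpenTelemetry setup for a Node.js monolith, with auto-instrumentation covering Express routes, pg queries, and outbound HTTP calls; the service name and collector endpoint are placeholders, not LogisticsCo's actual configuration.

```typescript
// tracing.ts — a minimal OpenTelemetry bootstrap sketch for a Node.js monolith.
// Service name, exporter endpoint, and library versions are assumptions.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  serviceName: 'ops-monolith',
  traceExporter: new OTLPTraceExporter({ url: 'http://otel-collector:4318/v1/traces' }),
  // Auto-instrumentations cover Express routes, pg queries, and outbound HTTP calls,
  // which is enough to see which queries block the connection pool.
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Flush pending spans on shutdown so the last requests before a deploy are not lost.
process.on('SIGTERM', () => {
  sdk.shutdown().finally(() => process.exit(0));
});
```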

With the root-cause picture clear, the team chose a modular monolith over a full microservices migration. The reason was pragmatic: a full microservices migration in eight weeks, with a team of 14 engineers, was operationally unrealistic for a company that had never run a distributed system before. The modular monolith would achieve separation of concerns, independent deployability of modules, and clear data ownership boundaries — all without the overhead of service mesh configuration, distributed tracing infrastructure, and cross-service data contracts. The team could migrate individual modules to standalone services as business demands grew.

The technical toolkit was deliberately conservative. NestJS was chosen over Express.js for its built-in dependency injection, module scoping, and opinionated structure, which had already proven reliable in the company's developer onboarding pipeline. PostgreSQL 16 on AWS RDS replaced the previous version 13 instance, with additional provisioned IOPS and read-replica support. Redis 7.4 on ElastiCache anchored a multi-layer caching strategy — Redis operated as both a query-results cache for high-cardinality filter data and a write-through cache for partner read profiles. BullMQ provided background task queues, replacing the synchronous workflow entirely. Puppeteer handled headless generation of PDF invoices and delivery labels. Docker and GitHub Actions powered CI/CD with a review-board-and-staging-cluster promotion flow.

Perhaps the most consequential architectural decision was the introduction of the CQRS (Command Query Responsibility Segregation) pattern for the assignment and reconciliation module. All write operations — partner onboarding, assignment creation, delivery status updates — were kept on the primary PostgreSQL instance. All read operations were served from PostgreSQL read replicas, updated asynchronously via a CDC pipeline built on Debezium that also fed the Redis caching layer within configurable staleness windows. This removed read-path queries from the primary entirely, which meant the main database connection pool was no longer being consumed by read operations at peak hours.
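
In its simplest form, the read/write split can be expressed as two connection pools with every query routed through one of two helpers. The sketch below uses node-postgres for illustration; the connection strings, pool sizes, and helper names are assumptions rather than the team's actual data-access layer.

```typescript
// db.ts — a minimal sketch of the CQRS-style read/write split described above.
// Connection strings and pool sizes are illustrative assumptions.
import { Pool } from 'pg';

// Commands (writes) always hit the primary.
const primary = new Pool({ connectionString: process.env.PG_PRIMARY_URL, max: 20 });

// Queries (reads) are served from a read replica kept current by the CDC pipeline.
const replica = new Pool({ connectionString: process.env.PG_REPLICA_URL, max: 50 });

export async function executeCommand(sql: string, params: unknown[] = []) {
  return primary.query(sql, params);
}

export async function executeQuery(sql: string, params: unknown[] = []) {
  return replica.query(sql, params);
}

// Example: a delivery status update goes to the primary, while the dashboard's
// assignment listing never touches the primary connection pool.
// await executeCommand('UPDATE deliveries SET status = $1 WHERE id = $2', ['DELIVERED', id]);
// const { rows } = await executeQuery('SELECT * FROM assignments WHERE partner_id = $1', [partnerId]);
```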

⚙️ Implementation

The rebuild was conducted in four parallel tracks, each led by a dedicated engineering sub-team. Track progress gates at the end of weeks two, four, six, and eight ensured that any underperforming module was caught early, de-risked, and re-prioritized.

Track 1 — Data Layer (Weeks 1–2). The data team started by re-architecting the full ERD. Every table was audited for missing indexes. The 17 slowest analytical queries were rewritten — some from ad-hoc JOIN sprawl into materialized view patterns; others were moved behind a Redis hash cache with a 60-second TTL. The CDC pipeline was configured to propagate write events to the read replica within 150 milliseconds — verified by an end-to-end test suite that checked replica lag at every commit. A combination of 96 new indexes and the introduction of partial indexes on multi-tenant subqueries reduced average query time from 850 ms to 47 ms on the primary path.
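
The 60-second Redis cache described above follows a standard cache-aside shape. A minimal sketch, assuming an ioredis client and JSON-serializable query results; the key names and helper are illustrative:

```typescript
// queryCache.ts — a cache-aside sketch for the heaviest read queries, with a
// 60-second TTL as described above. Client choice and key format are assumptions.
import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');
const TTL_SECONDS = 60;

// Return the cached result if present; otherwise run the loader, store the result
// with a 60-second TTL, and return it.
export async function cachedQuery<T>(key: string, loader: () => Promise<T>): Promise<T> {
  const hit = await redis.get(key);
  if (hit !== null) return JSON.parse(hit) as T;

  const result = await loader();
  await redis.set(key, JSON.stringify(result), 'EX', TTL_SECONDS);
  return result;
}

// Usage: a hypothetical high-cardinality filter that previously hit the primary directly.
// const capacity = await cachedQuery(`partner:capacity:${cityId}`, () =>
//   executeQuery('SELECT partner_id, free_slots FROM partner_capacity WHERE city_id = $1', [cityId]));
```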

Track 2 — Assignment & Routing Engine (Weeks 2–6). The core business logic for delivery assignment — matching a package with the optimal partner based on package dimensions, preferred location, partner capacity, and real-time vehicle slots — had previously been embedded as middleware inside the request pipeline, running synchronously on every POST /assign call. For the rebuild, this was extracted into a standalone BullMQ worker. At peak load, the BullMQ worker pool could scale to 120 concurrent task runners, compared to the monolith's hard ceiling of 32 Node.js worker threads, with no additional application code changes. The worker exposed a standardized event bus; new assignment rules could be added by dropping a handler class into the worker directory and registering a queue subscriber — eliminating the need to touch the request processing path.
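
A minimal sketch of that extraction, assuming BullMQ with a Redis connection; the queue name, payload shape, retry policy, and per-process concurrency are illustrative rather than taken from the actual worker:

```typescript
// assignmentQueue.ts — a sketch of moving assignment off the request path with BullMQ.
// Queue name, payload, and concurrency values are illustrative assumptions.
import { Queue, Worker, Job } from 'bullmq';

const connection = { host: process.env.REDIS_HOST ?? 'localhost', port: 6379 };

// The API layer only enqueues: POST /assign returns as soon as the job is accepted.
export const assignmentQueue = new Queue('assignments', { connection });

export async function enqueueAssignment(deliveryId: string, packageDims: object) {
  await assignmentQueue.add('assign', { deliveryId, packageDims }, {
    attempts: 3,                                   // retry transient partner-API failures
    backoff: { type: 'exponential', delay: 2000 },
  });
}

// The worker runs in its own process and scales horizontally; concurrency here is
// per process, so several replicas together reach the ~120 concurrent runners cited.
export const assignmentWorker = new Worker('assignments', async (job: Job) => {
  // matchOptimalPartner is a hypothetical stand-in for the routing/matching logic.
  // const partner = await matchOptimalPartner(job.data.deliveryId);
  // await executeCommand('UPDATE deliveries SET partner_id = $1 WHERE id = $2',
  //   [partner.id, job.data.deliveryId]);
}, { connection, concurrency: 40 });
```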

Track 3 — Reconciliation & Settlement (Weeks 3–7). The overnight reconciliation job — traversing every delivery record from the past 24 hours and comparing it against partner payout data — had been a single monolithically executed script. For the rebuild, the reconciliation pipeline was modeled explicitly as a state machine, with each state able to retry independently on worker failure. That single change all but eliminated the overnight failures that had been pushing reconciliation data to a dead-letter queue and requiring manual intervention. The new job was also designed to restart mid-stream — a timestamped cursor tracked the last-completed delivery confirmation, and the job picked up from that record on restart. Total execution time dropped from 18.5 hours to 1 hour, 42 minutes.
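
The resumable-cursor idea can be sketched as a batch loop that checkpoints after every batch; the checkpoint table, batch size, and per-delivery handler below are assumptions based on the description above, not the team's actual schema:

```typescript
// reconciliation.ts — a sketch of the resumable, cursor-driven reconciliation loop.
// Table names, batch size, and the per-delivery handler are illustrative assumptions.
import { Pool } from 'pg';

const db = new Pool({ connectionString: process.env.PG_PRIMARY_URL });
const BATCH_SIZE = 500;

export async function runReconciliation(runDate: string) {
  // Load the last-completed cursor so a crashed run picks up mid-stream.
  const { rows } = await db.query(
    'SELECT last_delivery_id FROM reconciliation_checkpoints WHERE run_date = $1',
    [runDate],
  );
  let cursor = rows[0]?.last_delivery_id ?? 0;

  for (;;) {
    // Fetch the next batch of deliveries after the cursor, in a stable order.
    const batch = await db.query(
      'SELECT id, partner_id, amount FROM deliveries WHERE id > $1 AND delivered_on = $2 ORDER BY id LIMIT $3',
      [cursor, runDate, BATCH_SIZE],
    );
    if (batch.rows.length === 0) break;

    for (const delivery of batch.rows) {
      // reconcileOne is a hypothetical stand-in for the per-delivery state machine
      // (match payout record -> verify amount -> write settlement row), each step retryable.
      // await reconcileOne(delivery);
      cursor = delivery.id;
    }

    // Persist the cursor after every batch so restarts never repeat completed work.
    await db.query(
      `INSERT INTO reconciliation_checkpoints (run_date, last_delivery_id)
       VALUES ($1, $2) ON CONFLICT (run_date) DO UPDATE SET last_delivery_id = $2`,
      [runDate, cursor],
    );
  }
}
```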

Track 4 — Notifications & Reporting (Weeks 4–8). Email and SMS notification dispatch had been synchronous within the request-response cycle, often holding the API thread for 1–3 seconds per notification. For the rebuild, notifications were moved completely off the critical path by queueing them through BullMQ and processing them asynchronously through a Twilio and AWS SES fanout worker. The notification module also incorporated an idempotency key per recipient per delivery status change, eliminating an entire class of duplicate-notification bugs that had been generating tens of thousands of customer complaints per month. The analytics reporting dashboard, previously running raw SQL aggregations on every page load, was migrated to pre-aggregated ClickHouse tables refreshed in 60-second batches and served directly to the React frontend with stale-while-revalidate semantics.
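
The idempotency guard is essentially a SET NX on a key derived from recipient, delivery, and status. A minimal sketch, assuming Redis and a hypothetical dispatch wrapper; key format and TTL are illustrative:

```typescript
// notifications.ts — a sketch of the per-recipient idempotency guard described above.
// Key format, TTL, and the dispatch helpers are illustrative assumptions.
import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');

export async function dispatchStatusNotification(
  recipientId: string,
  deliveryId: string,
  status: string,
) {
  // One key per recipient per delivery status change; SET NX only succeeds the first
  // time, so retried jobs and duplicate events never send a second message.
  const key = `notif:${recipientId}:${deliveryId}:${status}`;
  const acquired = await redis.set(key, '1', 'EX', 24 * 3600, 'NX');
  if (acquired !== 'OK') return; // already sent — drop silently

  // sendSms / sendEmail are hypothetical wrappers around the Twilio and SES fanout.
  // await sendSms(recipientId, renderStatusTemplate(deliveryId, status));
}
```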

📊 Results and Metrics

The metrics below were collected over a four-week production shadow test in which 30% of live traffic was routed through the new codebase while the old system continued running the rest, enabling a direct side-by-side comparison.
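
A deterministic split like this is often implemented by hashing a stable request attribute, so the same caller consistently lands on the same backend and the side-by-side metrics stay comparable. The sketch below is illustrative only; the hash key, percentage constant, and upstream names are assumptions:

```typescript
// shadowRouter.ts — a sketch of a deterministic 30% traffic split for the shadow test.
// The stable key, share constant, and upstream URLs are illustrative assumptions.
import { createHash } from 'crypto';

const NEW_BACKEND_SHARE = 0.3;

// Hash a stable key (e.g. a merchant id) into [0, 1] and compare against the share.
export function routeToNewBackend(stableKey: string): boolean {
  const digest = createHash('sha256').update(stableKey).digest();
  const bucket = digest.readUInt32BE(0) / 0xffffffff;
  return bucket < NEW_BACKEND_SHARE;
}

// Usage inside an API gateway or reverse-proxy middleware:
// const target = routeToNewBackend(String(req.headers['x-merchant-id']))
//   ? 'http://ops-backend-v2' : 'http://ops-backend-v1';
```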

Latency: P99 API response time dropped from 3,840 milliseconds to 117 milliseconds — a 97% reduction. P50 fell from 412 ms to 42 ms. The latency spike during peak lunch-hour traffic, previously a regular event that stretched for 75 minutes, disappeared entirely in the new system.

Database Performance: Average query time across the 40 highest-frequency queries fell from 820 ms to 38 ms. Connection pool saturation — previously a daily occurrence — had not occurred once during the shadow period. The primary PostgreSQL instance sustained a steady CPU utilization of 48% at peak, compared to 92% in the old system under identical loads.

Reconciliation: Complete overnight reconciliation finished in 1 hour 42 minutes, compared to the 18.5 hours previously. The number of reconciliation jobs that required manual re-run dropped from 43 per week to zero over the four-week test period.

Infrastructure Cost: Monthly infrastructure spend decreased 31% — from approximately ₹4,18,000 to ₹2,88,000 — due to the elimination of over-provisioned peak buffer instances, the reduced compute demand from async workers, and the lower aggregate load on the primary PostgreSQL instance, which enabled a downgrade from db.r6g.2xlarge to db.r6g.xlarge.

Operational Health: Ops support ticket volume related to system slowness dropped by 78% in the month post-launch. Production incidents logged in the on-call rotation decreased by 62%. The engineering team reported a 2.3× improvement in mean time to resolution for issues in the new system, attributed to clear module boundaries, comprehensive OpenTelemetry instrumentation added with negligible sampling overhead, and the new standardized error type hierarchy that provided actionable stack traces instead of opaque HTTP 500 responses.

Holiday Season Delivery: The rebuilt system was live for the Diwali/Christmas holiday season. During the projected 2.8 million assignment peak, the system maintained P99 latency under 200 ms for 96.4% of the seven-day window. The remaining 3.6% was a single 12-minute window at midnight on December 25th caused by an upstream SMS provider outage — not a platform failure, and one that was detected and communicated to ops support automatically through a custom alerting pipeline within 48 seconds. Zero tickets were logged for platform slowness or unavailability, and the number the team cared about most landed exactly where they wanted it: zero production incidents.

📐 Key Metrics Summary

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| P99 API Response Time | 3,840 ms | 117 ms | 97% faster |
| P50 API Response Time | 412 ms | 42 ms | 90% faster |
| Average DB Query Time | 820 ms | 38 ms | 95% faster |
| Reconciliation Duration | 18.5 hours | 1.7 hours | 91% faster |
| Infrastructure Cost (monthly) | ₹4,18,000 | ₹2,88,000 | 31% reduction |
| Ops Support Tickets (weekly) | 43 / week | 9 / week | 79% reduction |
| DB CPU at Peak | 92% | 48% | 44 p.p. improvement |
| Production Incidents (monthly) | 8 / month | 3 / month | 62% reduction |

🧭 Architecture Diagram

[Image: system architecture overview diagram - event-driven backend pipeline]

Figure 1 — The post-rebuild event-driven architecture: API Gateway → NestJS modules → PostgreSQL primary + read replicas → Redis caching layers → BullMQ async workers → downstream partner APIs and notification services.

🎓 Lessons Learned

Several insights stood out that the team believes will be broadly applicable to any engineering organization facing a similar crisis of technical debt at critical growth inflection points.

Don't patch a monolith at scale — modularize it first. The tendency when a monolith is failing under load is to reach for quick patches: another cache layer, another worker thread, another index. Each of those patches creates new coupling, new debt, new failure modes. A systematic audit backed by real instrumentation data gives you the information needed to target the refactor at root causes, not just symptoms.

Architecture is a business decision, not an engineering religion. The team deliberately chose a modular monolith instead of microservices, even though microservices were fashionable at the time. The choice was driven by the team's actual capabilities, the project timeline, and the infrastructure maturity of the organization. The result: a system that could evolve partially, without a rewrite.

Shadow traffic is the single best way to validate a rewrite. Shipping a side-by-side shadow test enabled the team to catch approximately 17 edge-condition bugs before the new system handled a single real user request. It also concretely demonstrated the performance improvements in terms the business could understand: a real table of numbers, not an engineering ambition.

Async-first is not just a buzzword — it changes the economics of your compute spend. Moving reconciliation and notifications to background workers meant the primary request-serving infrastructure could be provisioned for actual request load, not the worst-case load of all workloads combined. That single architectural decision cut the infrastructure bill by more than ₹1,30,000 per month.

Instrument before you touch the production system. The team committed two weeks to instrumenting the monolith with OpenTelemetry before writing any new code. That investment paid returns within days: weeks 3 and 4 of the build produced fewer unexpected regressions than the opening weeks of a typical project that starts with no tracing instrumentation.

🔭 Outcomes and the Road Ahead

The rebuild delivered everything the original brief called for — and some things it didn't. The engineering team has already begun migrating the partner and customer modules, one at a time, out of the monolith and towards standalone NestJS services deployable independently. Within 12 months, the operations layer will be three deployable services backed by individual event streams, with GraphQL as the unified query surface. The team has also begun work on a cargo-tracking prediction model that runs asynchronously against the event pipeline — a capability that would have been architecturally impossible before the CQRS read replica architecture was in place.

The broader lesson here is not that every company should rebuild their backend from scratch. Rather, the lesson is that a small number of fundamental, data-backed architectural changes — caching where it matters, async processing where latency tolerance exists, separating reads from writes — can create compounding returns that protect the business at its most critical moments.
