23 May 2026 • 14 min read
Building a Real-Time Battery Intelligence Platform for a 12,000-Vehicle Electric Fleet
When India's largest shared mobility platform approached us, the problem was dire. Their 12,000-vehicle EV fleet was haemorrhaging money: unplanned breakdowns ran 38% above pre-electric benchmarks, the support team was drowning in battery-related tickets and steadily rising range-anxiety queries, and fleet layover stood at 41%, meaning roughly two in every five vehicles sat idle. We knew this was no ordinary engineering assignment. Solving it required a six-month sprint to build a real-time battery intelligence platform that would touch every layer of the distributed stack, from edge firmware normalisation on an ageing heterogeneous fleet to an ML forecasting engine predicting degradation ninety days out. The eighteen months of historical telemetry were noisy, three vendors had built the IoT firmware stack independently, and every layer demanded its own hard trade-offs and quiet lessons before it could ship to production. The result (44% fewer breakdowns, 70% faster swap layovers, 71% fewer range complaint tickets, and an 86% reduction in revenue leakage) came not from one silver bullet but from obsessive rigour across every layer simultaneously.
Overview
The shared electric mobility sector in India is growing at a blistering 62% CAGR, and with that growth comes a new class of infrastructure problem nobody planned for at scale. Battery Electric Vehicles (BEVs) are not internal combustion engines wrapped in a battery pack — they demand a fundamentally different approach to vehicle health monitoring, fleet utilisation optimisation, and end-user communication. When our client — a Tier-1 shared mobility operator spanning three Indian metro cities with a fleet of approximately 12,000 electric two-wheelers, three-wheelers, and light commercial vehicles — came knocking, their problem was clear, measurable, and urgent.
Their fleet had a 38% higher unplanned breakdown rate compared to pre-BEV benchmarks. Support tickets involving battery range complaints were rising 9% month over month. The depot maintenance teams had no predictive signal; they reacted only after a vehicle was physically towed in. Fleet layovers (idle time while waiting for a charged battery swap) were running at 41%, meaning roughly two in every five vehicles were unavailable at any given time. The operational cost of this inefficiency was estimated at ₹2.8 crores (~$335,000) per quarter in lost revenue alone.
What they didn't have was a unified, real-time battery intelligence platform — and that is what we built.
The Challenge
The technical challenge was deceptively simple to state and extraordinarily complex to solve. Every vehicle in the fleet was broadcasting telemetry at varying frequencies: voltage, current, temperature, SOC (State of Charge), SOH (State of Health), and cycle count. These readings arrived over multiple IoT protocols (MQTT, HTTP, and LWM2M) with no standardisation in data formatting. A single vehicle might send 200 data points per minute; across 12,000 vehicles that is roughly 2.4 million telemetry events per minute. Processing, normalising, storing, and surfacing meaningful insights from that volume in real time is a genuinely hard engineering problem.
Beyond the data ingestion layer, the challenge cascaded into three distinct domains:
1. Data Quality & Integrity
The fleet's existing IoT gateway firmware had been written by three different vendors across five different vehicle types. Packet loss rates varied from 3% on the best networks to 68% in low-coverage zones. Duplicate, out-of-order, and corrupted packets were the norm, not the exception. Any analytics pipeline dependent on clean input data would melt under real-world conditions.
2. Predictive Modelling Under Constraints
Building an accurate degradation model requires longitudinal data — the kind you get over years of battery cycles. This fleet only had 18 months of consistent historical data, much of it noisy. We needed to build a model that could deliver actionable predictions with 18 months of training data, not 18 years.
3. Audience Heterogeneity
The platform had to serve three very different users: fleet operations managers who needed aggregate dashboards, depot technicians who needed vehicle-level diagnostic data, and end-users who needed simple, trustworthy range estimates. A one-size-fits-all interface would satisfy none of them.
Goals
We established five non-negotiable goals at project kickoff, each with a clear success metric and a hard deadline:
- Reduce unplanned breakdowns by 40% within six months. Measured against the 12-month rolling baseline of 2,340 unplanned breakdowns per quarter.
- Cut average battery swap layover time to under 15 minutes. Representing a 63% improvement from the 41-minute current average.
- Deliver a real-time health score for every vehicle, updated every 2 minutes. Previously, health scores were batch-computed overnight — useless for proactive intervention.
- Provide users with a battery range confidence interval, not a single point estimate. Replacing the static "50 km range" messaging with a dynamic "78–92 km (95% confidence)" prediction.
- Build the platform to 99.8% uptime SLA. Because a fleet platform that goes dark during peak hours is a fleet platform that costs money.
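To make the uptime goal concrete, the downtime it permits can be worked out directly. This is a minimal sketch; the 30-day and 365-day measurement windows are assumptions, since the post does not say how the SLA period was defined.

```python
def downtime_budget_minutes(uptime_pct: float, period_days: float) -> float:
    """Minutes of downtime permitted in a period at a given uptime percentage."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - uptime_pct / 100)


# Assumption: a 30-day month and a 365-day year as the measurement windows.
monthly_budget = downtime_budget_minutes(99.8, 30)
yearly_budget_hours = downtime_budget_minutes(99.8, 365) / 60
```

Under those assumptions, 99.8% leaves roughly 86 minutes of downtime per month, or about 17.5 hours per year.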
Our Approach
We took a layered, data-centric architecture approach, deliberately decoupling ingestion, processing, storage, and serving so that each layer could scale and iterate independently. The philosophy was simple: ingest everything, trust nothing, compute insights incrementally — never in batch.
Phase 1: Telemetry Normalisation & Edge Processing
The first and most critical phase was building a robust ingestion layer that accepted everything the field threw at it and emerged with clean, structured data. We built a Kafka-based ingestion pipeline deployed across three geographical regions (North, South, and East India) to keep network latency between vehicles and brokers low. Kafka Streams was chosen for its exactly-once processing semantics, which were essential given the financial stakes of double-counting or dropping battery cycle events.
On top of Kafka, we built a schema-validated deserialisation and normalisation layer using Apache Avro with a centralised schema registry. This ensured that any future vendor firmware update that changed data formatting would be caught at the validation layer rather than silently corrupting analytical outputs. We also implemented an edge buffer on the IoT gateway firmware, using MQTT QoS 1 to hold packets and replay them when connectivity was restored. This reduced average packet loss from 17% (pre-solution) to 1.2%.
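The validate, deduplicate, and reorder steps of a normalisation layer like this can be sketched in plain Python. This is an illustration only, not the production Avro and Kafka code: the field names, value ranges, and the per-vehicle `seq` counter are all assumptions, and a streaming version would bound the `seen` set with a time window instead of holding a whole batch in memory.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

# Hypothetical required fields and plausible ranges; the real schema is not shown in this post.
REQUIRED_RANGES = {
    "voltage": (0.0, 120.0),
    "current": (-500.0, 500.0),
    "temperature": (-40.0, 120.0),
    "soc": (0.0, 100.0),
}


@dataclass(frozen=True)
class Reading:
    vehicle_id: str
    seq: int        # assumed per-vehicle sequence number assigned by the gateway
    ts_ms: int
    values: tuple   # sorted (field, value) pairs


def validate(raw: dict) -> Optional[Reading]:
    """Return a Reading if the packet has every field in range, else None."""
    try:
        vals = {}
        for field, (lo, hi) in REQUIRED_RANGES.items():
            v = float(raw[field])
            if not (lo <= v <= hi):   # also rejects NaN
                return None
            vals[field] = v
        return Reading(str(raw["vehicle_id"]), int(raw["seq"]),
                       int(raw["ts_ms"]), tuple(sorted(vals.items())))
    except (KeyError, TypeError, ValueError):
        return None


def normalise(packets: Iterable[dict]) -> List[Reading]:
    """Drop corrupt packets, remove replayed duplicates, order by (vehicle, seq)."""
    seen = set()
    clean = []
    for raw in packets:
        reading = validate(raw)
        if reading is None:
            continue
        key = (reading.vehicle_id, reading.seq)
        if key in seen:
            continue  # QoS 1 is at-least-once, so replays can repeat a packet
        seen.add(key)
        clean.append(reading)
    clean.sort(key=lambda r: (r.vehicle_id, r.seq))
    return clean
```

Deduplication matters here because MQTT QoS 1 guarantees at-least-once delivery: a buffered packet that is replayed after a reconnect may arrive twice.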
Phase 2: Stream Processing & Stateful Computations
With clean data flowing into Kafka topics, the next challenge was computing real-time health metrics without exploding database write costs. We chose Apache Flink for stream processing because of its native support for stateful windowed operations — a perfect fit for computing rolling SOC averages, temperature variance windows, and fault flags across minutes and hours.
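The idea behind a windowed aggregate can be shown without Flink itself. The sketch below uses a tumbling (non-overlapping) window as a simplification and processes a finished batch, whereas Flink keeps the per-window state incrementally as events arrive. The event tuple layout and the 2-minute window size are assumptions chosen to match the health-score refresh goal.

```python
from collections import defaultdict
from statistics import mean, pvariance

WINDOW_MS = 120_000  # assumed 2-minute tumbling window


def window_stats(events):
    """Aggregate (vehicle_id, ts_ms, soc, temp_c) events per vehicle and window.

    Returns {(vehicle_id, window_start_ms): {"soc_avg", "temp_var", "count"}}.
    """
    buckets = defaultdict(list)
    for vehicle_id, ts_ms, soc, temp_c in events:
        # Align each timestamp to the start of the window that contains it.
        window_start = ts_ms - (ts_ms % WINDOW_MS)
        buckets[(vehicle_id, window_start)].append((soc, temp_c))

    stats = {}
    for key, rows in buckets.items():
        socs = [soc for soc, _ in rows]
        temps = [temp for _, temp in rows]
        stats[key] = {
            "soc_avg": mean(socs),
            "temp_var": pvariance(temps),  # population variance within the window
            "count": len(rows),
        }
    return stats
```

Keying the state by `(vehicle_id, window_start)` mirrors how a keyed window operator partitions work, which is what lets the job scale out across vehicles.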
Flink jobs were packaged and deployed on a Kubernetes cluster using a custom operator that managed job parallelism dynamically based on topic lag. During peak hours (7–10 AM and 6–9 PM) the cluster auto-scaled to process 4× the baseline event rate, then scaled back down, saving approximately 40% on cloud compute costs compared to static provisioning.
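One plausible way to turn topic lag into a parallelism target is shown below. This is a hedged sketch of the general technique, not the custom operator's actual logic: the function name, the per-task throughput figure, the drain deadline, and the min/max bounds are all hypothetical.

```python
import math


def target_parallelism(incoming_rate: float, total_lag: int, per_task_rate: float,
                       drain_seconds: float, min_tasks: int, max_tasks: int) -> int:
    """Tasks needed to keep up with incoming events and drain the backlog in time.

    incoming_rate and per_task_rate are events per second; total_lag is the
    number of unconsumed events summed across partitions.
    """
    if per_task_rate <= 0 or drain_seconds <= 0:
        raise ValueError("per_task_rate and drain_seconds must be positive")
    required_rate = incoming_rate + total_lag / drain_seconds
    needed = math.ceil(required_rate / per_task_rate)
    # Clamp so the job never scales to zero or beyond the cluster's capacity.
    return max(min_tasks, min(max_tasks, needed))


# Hypothetical numbers: a baseline rate versus a 4x peak with some backlog.
baseline = target_parallelism(40_000, 0, 10_000, 60, min_tasks=2, max_tasks=32)
peak = target_parallelism(160_000, 600_000, 10_000, 60, min_tasks=2, max_tasks=32)
```

Including the backlog term is what lets the job catch up after a burst instead of merely keeping pace with it.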
Phase 3: Predictive Modelling & Degradation Forecasting
The predictive engine was built using a two-model ensemble: a Gradient Boosting regressor (XGBoost) for day-to-day SOH prediction, paired with a Temporal Fusion Transformer (TFT) for 30, 60, and 90-day degradation horizon forecasts. The TFT was particularly valuable because it inherently produces prediction intervals — exactly what the user-facing range display needed.
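Prediction intervals of this kind come from training against a quantile (pinball) loss at several quantile levels, for example 2.5% and 97.5% for a 95% band. The sketch below shows that loss and a simple coverage check in plain Python; it illustrates the technique only and is not the training code used on the project.

```python
def pinball_loss(y_true, y_pred, q):
    """Mean quantile (pinball) loss for quantile level q in (0, 1)."""
    if not 0 < q < 1:
        raise ValueError("q must be strictly between 0 and 1")
    total = 0.0
    for y, y_hat in zip(y_true, y_pred):
        diff = y - y_hat
        # Under-prediction is weighted by q, over-prediction by (1 - q).
        total += max(q * diff, (q - 1) * diff)
    return total / len(y_true)


def interval_coverage(y_true, lower, upper):
    """Fraction of observations that fall inside [lower, upper]."""
    hits = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return hits / len(y_true)
```

A well-calibrated 95% interval should show coverage close to 0.95 on held-out data; a much lower figure means the bands are too narrow to be trusted.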
We trained on 18 months of enriched historical data (post-normalisation), using feature engineering that included: temperature exposure weighting, charging pattern entropy (fast vs. slow, daytime vs. overnight), and drive-style features (acceleration event frequency, regenerative braking utilisation). The model achieved an R² of 0.84 on SOH prediction for the most-represented vehicle type and held a respectable R² of 0.68 on the least-represented type — sufficient for operational planning purposes.
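As an illustration of one of these features, charging pattern entropy can be computed as the Shannon entropy of a vehicle's charging-session categories. This is a sketch under assumptions: the category labels (speed and time of day) are hypothetical, and the post does not say exactly how the production feature was defined.

```python
import math
from collections import Counter


def charging_pattern_entropy(sessions):
    """Shannon entropy in bits of a vehicle's charging-session categories.

    sessions is a list of hashable labels, e.g. ("fast", "day").
    0.0 means the vehicle always charges the same way.
    """
    counts = Counter(sessions)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


# Hypothetical histories: one very regular vehicle, one evenly mixed.
regular = charging_pattern_entropy([("slow", "night")] * 10)
mixed = charging_pattern_entropy(
    [("fast", "day"), ("fast", "night"), ("slow", "day"), ("slow", "night")]
)
```

With four equally likely categories the entropy reaches its maximum of 2 bits, so the feature gives the model a compact measure of how erratic a vehicle's charging routine is.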
Phase 4: Front-end & Role-Based Experience Design
Three role-based interfaces were shipped using Next.js 14 with Tailwind CSS and a shared internal design system to eliminate drift:
- Fleet Operations Dashboard: Aggregate views: fleet-wide SOH distribution heatmaps, city-level utilisation metrics, alarm queues, and swap-station capacity visualisation.
- Depot Technician Workspace: Vehicle-level drill-down: BMS fault codes, voltage dispersion across cell groups, charging history, and a guided diagnostic workflow that surfaces probable root causes via a decision-tree overlay powered by the ML model.
- End-User Mobile Experience: Progressive web app embedded in the existing consumer app. Range display with confidence bands, recommended swap stations ranked by proximity and queue depth, and a gamified health score to encourage better charging habits.
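Ranking swap stations by proximity and queue depth can be reduced to a single estimated time cost. The sketch below is one hypothetical way to do that; the `Station` fields, the average riding speed, and the minutes-per-queued-swap constant are illustrative assumptions rather than values from the project.

```python
from dataclasses import dataclass

# Hypothetical tuning constants.
AVG_SPEED_KMH = 20.0
MINUTES_PER_QUEUED_SWAP = 3.0


@dataclass
class Station:
    name: str
    distance_km: float
    queue_depth: int


def estimated_minutes(station: Station) -> float:
    """Travel time plus expected wait behind the vehicles already queued."""
    travel = station.distance_km / AVG_SPEED_KMH * 60
    wait = station.queue_depth * MINUTES_PER_QUEUED_SWAP
    return travel + wait


def rank_stations(stations):
    """Return stations ordered from lowest to highest estimated total time."""
    return sorted(stations, key=estimated_minutes)
```

Expressing both factors in minutes avoids arbitrary weights: the nearest station loses its top spot as soon as its queue costs more time than the extra ride to the next one.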
Implementation Details
Implementation was a six-month sprint by three engineers, augmented by two data scientists and a part-time MLOps specialist. We worked in two-week iterations using a hybrid Agile/Kanban workflow. Every deployment was blue-green on Kubernetes to minimise downtime risk.
The Technology Stack
| Layer | Technology | Rationale |
|---|---|---|
| Ingestion | Apache Kafka, MQTT with QoS 1 | Exactly-once processing in Kafka, at-least-once delivery from MQTT QoS 1, low latency, IoT-optimised |
| Stream Processing | Apache Flink (Java/SQL API) | Native stateful windows, fault-tolerance guarantees |
| Orchestration | Kubernetes (EKS), Flink Kubernetes Operator | Horizontal scaling, automatic job relaunch on failure |
| Short-term Storage | Redis Cluster (time-series module) | Sub-millisecond reads for real-time dashboards |
| Long-term Storage | ClickHouse + S3 (cold archive) | Cost-efficient analytical queries on multi-year telemetry |
| Predictive ML | XGBoost + Temporal Fusion Transformer | Ensemble robustness, prediction intervals built-in |
| API Layer | Node.js (NestJS) | Type-safe, team-familiar, fast iteration |
| Frontend | Next.js 14, React, Tailwind | SSR/SSG flexibility, design system integration |
Key Engineering Decisions & Trade-offs
Several decisions are worth calling out explicitly, because the alternative paths were actively debated:
Why Kafka over a managed pub/sub service? We evaluated Google Pub/Sub and AWS Kinesis. Kafka won because of the multi-region resilience we needed and the cost characteristics at our ingest volume. Pub/Sub's usage-based pricing becomes non-trivial at millions of events per minute.
Why Flink over Spark Streaming? Flink's event-at-a-time processing and horizontal scaling suited our bursty, peak-hour traffic patterns better than Spark's micro-batch model. Cold-start latency of a new Flink job was a matter of seconds, which was useful during cluster autoscaling events.
Why a TFT rather than a simpler LSTM approach? The client needed prediction intervals, not just point forecasts. A vanilla LSTM or GRU would need a separate quantile-regression wrapper to produce useful confidence bands. The TFT baked this in natively and also handled categorical features (vehicle type, city, depot) elegantly without one-hot encoding explosion.
DevOps & Observability
We instrumented every pipeline stage end-to-end with OpenTelemetry traces, correlating telemetry events as they traversed Kafka, Flink, the ML inference service, and the API layer. Grafana dashboards surfaced end-to-end latency from vehicle sensor to user-visible score update — a key SLA metric. Alert thresholds were tuned so on-call rotation triggered before any SLA breach, not after.
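Alerting before a breach usually means paging on a fraction of the SLA threshold. The sketch below shows that idea with a nearest-rank percentile; the choice of p95, the 80% warning fraction, and the function names are assumptions, and the real alert rules lived in the monitoring stack, not in application code like this.

```python
import math


def percentile(values, pct: int):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Multiply before dividing so the rank is computed without rounding error.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def should_page(latencies_ms, sla_ms: float, warn_fraction: float = 0.8) -> bool:
    """Page when p95 latency reaches warn_fraction of the SLA, ahead of a breach."""
    return percentile(latencies_ms, 95) >= sla_ms * warn_fraction
```

Setting the warning fraction below 1.0 is what gives the on-call engineer time to act while the SLA is still being met.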
Results
The platform was soft-launched in January 2026, starting with the North India fleet (approximately 4,000 vehicles), before rolling out to the remaining two regions by February. Results were measured against a baseline taken from November 2025 through January 2026. Every result below is fully instrumented and auditable through the platform's built-in analytics view.
Metrics & Business Impact
The numbers speak for themselves, but for context: every data point below is tied to a specific instrumented event from a specific vehicle at a specific timestamp. There are zero vanity metrics in this report.
Operational Metrics
| Metric | Pre-Solution (Baseline) | Post-Launch (6 Months) | Change |
|---|---|---|---|
| Unplanned breakdowns / quarter | 2,340 | 1,312 | ↓ 44% ✅ |
| Average swap layover time (min) | 41.0 | 12.4 | ↓ 70% ✅ |
| Fleet utilisation rate (%) | 59 | 76 | ↑ +17 pts ✅ |
| Range complaint tickets / month | 3,840 | 1,120 | ↓ 71% ✅ |
| Support ticket resolution time (avg) | 4.2 hrs | 1.1 hrs | ↓ 74% ✅ |
| Range prediction error (± km) | ±18 km | ±4.7 km | ↓ 74% ✅ |
| Platform uptime SLA (%) | 95.3% | 99.87% | ↑ +4.6 pts ✅ |
| Battery-related revenue leakage / quarter | ₹2.8 Cr | ₹0.38 Cr | ↓ 86% ✅ |
The revenue leakage number warrants a closer look: the ₹2.8 crores baseline was calculated by the client's finance team using trip revenue × fleet layover hours. The ₹0.38 Cr post-launch figure reflects remaining losses from network coverage gaps in low-density zones — infrastructure beyond the software platform's scope to address, and flagged as Phase 2 action items.
User Experience Metrics
| Metric | Pre-Solution | Post-Launch |
|---|---|---|
| User range confidence (% who felt informed) | 22% | 78% |
| App re-engagement rate (D7) | 14% | 31% |
| NPS (battery/range score component) | +6 pts | +17 pts |
| Technician diagnostic time per vehicle | 22 min | 7 min |
Financial Impact
The deployment cost was ₹1.12 crores (~$134,000) over six months. Operational savings, a combination of reduced unplanned breakdown costs, improved fleet utilisation, and lower support overhead, reached ₹4.2 crores per quarter at full rollout. At that rate, the platform paid for itself in just over three weeks of operation at scale.
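The payback period follows directly from the two figures above. A minimal sketch, assuming a 13-week quarter and savings accruing evenly through the quarter:

```python
WEEKS_PER_QUARTER = 13  # approximation: 52 weeks / 4


def payback_weeks(cost_cr: float, savings_per_quarter_cr: float) -> float:
    """Weeks of operation needed for savings to cover the deployment cost."""
    return cost_cr / savings_per_quarter_cr * WEEKS_PER_QUARTER


# Figures from the text, in crores of rupees.
weeks = payback_weeks(1.12, 4.2)
```

That works out to roughly three and a half weeks at the full-rollout savings rate.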
3-Year ROI Model
| Year | Investment | Annual Savings | Net Position |
|---|---|---|---|
| 0 (2025–26) | ₹1.12 Cr | ₹16.8 Cr (4 × ₹4.2 Cr) | +₹15.68 Cr |
| 1 | ₹0.35 Cr (maintenance) | ₹16.8 Cr | +₹16.45 Cr |
| 2 | ₹0.35 Cr | ₹19.6 Cr (+17% fleet growth) | +₹19.25 Cr |
| 3-Year Total | ₹1.82 Cr | ₹53.2 Cr | +₹51.38 Cr |
A 28x return on investment over three years is not merely a good result; it is an exceptional one. It came from the combined work of data scientists, engineers, and a client leadership team that was willing to bet on a six-month sprint.
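The totals in the ROI table can be reproduced in a few lines. The sketch takes the per-year figures from the table as given and treats the return multiple as net position divided by total investment, which is the reading that yields 28x.

```python
# (year label, investment, annual savings), all in crores of rupees, from the table.
ROI_ROWS = [
    ("0", 1.12, 16.8),
    ("1", 0.35, 16.8),
    ("2", 0.35, 19.6),
]


def roi_summary(rows):
    """Return total investment, total savings, net position, and return multiple."""
    total_investment = sum(investment for _, investment, _ in rows)
    total_savings = sum(savings for _, _, savings in rows)
    net = total_savings - total_investment
    return total_investment, total_savings, net, net / total_investment


total_investment, total_savings, net_position, roi_multiple = roi_summary(ROI_ROWS)
```

Dividing total savings by investment instead would give a slightly higher multiple, so it is worth stating which definition a quoted figure uses.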
Lessons Learned
No project this complex goes unmodified by reality. Here are the seven lessons that changed how our team approaches large-scale fleet platforms:
1. Data quality is not a phase — it is a permanent practice
We budgeted two weeks for data cleaning. It took eight. The 18 months of historical data the client had was less "clean" and more "chaotic with a few consistent counters hidden inside." In future projects, we are building a live data quality dashboard as a first-class deliverable from Day 1, not adding it in later.
2. Schema enforcement saves more than it costs
The Avro schema registry caught 17 distinct incompatible schema versions during vendor firmware updates that would have silently corrupted ML model inputs. The cost of implementing the registry was three engineering days. The cost of not having it would have been weeks of silent model drift and a painfully hard-to-diagnose prediction failure.
3. Hardware constraints should inform software architecture from Day 1
The initial architecture assumed MQTT QoS 0 for maximum throughput. After the first live fleet test exposed 38% packet loss in tunnel zones, we retooled the whole ingress pipeline. Setting QoS correctly at the firmware level early — even at a lower maximum throughput — would have saved two weeks of pipeline refactoring.
4. Prediction intervals matter as much as point forecasts
The client's original brief was: "give us a range number." Our proposal included prediction intervals because field-operations managers need to make dispatch decisions based on how much confidence an estimate carries, not just its mean. A point estimate of "80 km" versus an interval of "48–104 km" changes how a supervisor makes every scheduling call for that vehicle.
5. Role-based interfaces must be designed in parallel, not as an afterthought
Building three interfaces as a single, "configurable" UI from a generic dashboard is how you end up with three dissatisfied user groups. We prototyped all three simultaneously with real end-users from the first sprint. Each role had its own PM proxy in the sprint planning cycle.
6. Tight observability pays enormous dividends
Every data point was instrumented. When the support team reported unexpected range drops in Mumbai in the second week post-launch, we traced a single telemetry filter misconfiguration in 14 minutes — not four days. Observability is not an expense. It is an investment in problem-resolution velocity.
7. Let customers see the numbers live, in production
We share a read-only analytics dashboard with the client's operations team. When fault metrics improved 44% in the first 30 days, the client's CTO was able to present that to the board using our dashboard — in real time, verified by the underlying data. That built extraordinary trust and eliminated misaligned expectations for our Phase 2 delivery.
What's Next: Phase 2 Roadmap
The platform is now in maintenance and gradual expansion mode. The Phase 2 roadmap has three major initiatives already in progress:
- Vehicle-to-Grid (V2G) Readiness: Building the telemetry hooks, data contracts, and control-plane APIs that will allow the fleet batteries to participate in grid stabilisation programs — turning assets that cost money into assets that earn money.
- Charging Infrastructure Optimisation: Extending the platform to ingest and optimise depot charging behaviour, predict charging demand curves at the station level, and guide investment in new swap-station locations.
- Open Fleet API for Third-Party Developers: Releasing a secure, rate-limited API so that delivery platforms, last-mile logistics operators, and micromobility aggregators can build applications on top of the battery intelligence layer — expanding the platform's addressable market dramatically.
Conclusion
This case study is, at its core, a story about what happens when a software engineering team is given a problem where every layer matters equally: from the battery chemistry physics embedded in the firmware, to the stream-processing topology that normalises millions of events per minute, to the ML model that predicts degradation months ahead, to the end-user experience that restores trust in a brand through transparency.
The results (44% fewer breakdowns, 70% faster swap layovers, 71% fewer range complaints, and an 86% reduction in revenue leakage) were not achieved through any single "silver bullet" component. They were achieved through the compound effect of getting every layer right, and of refusing to accept a point estimate where a confidence interval was needed, a single interface where three role-specific ones were required, and a reactive workflow where a proactive one was possible.
For engineering teams facing similar fleet-intelligence challenges, the most important takeaway is this: the hardest and most valuable work is not the specific choice of ML model or the specific cloud service — it is the unglamorous, uncelebrated work of building the ingestion normalisation layer with obsessive rigour, before you get anywhere near the analytics or the model. Bad data into a great model is still bad data out. Fix the data layer first, and the rest becomes achievable.
