Webskyne

3 June 2026 · 11 min read

How Sabre Energy Cut Cloud Infrastructure Costs by 42% Through Serverless Transformation

When Sabre Energy’s monthly AWS bill crossed $78,000 and peak-load failures started interrupting production, the leadership team knew something had to change. Over three months, we redesigned their entire batch-processing and monitoring pipeline from a static EC2 fleet into an event-driven, serverless architecture on AWS Lambda and Step Functions. The result was a 42% reduction in monthly infrastructure spend, near-zero downtime during demand spikes, and a shift from reactive firefighting to proactive energy analytics. This case study traces every decision—from the initial cost audit through the migration rollout and the operational changes that made the savings stick.

Case Study, cloud, AWS, serverless, Lambda, cost optimization, energy, infrastructure, digital transformation
## Overview

Sabre Energy operates a real-time oilfield monitoring platform that ingests telemetry from more than 12,000 sensors spread across remote wellheads in Texas and Oklahoma. Every fifteen minutes, those sensors push pressure, temperature, and flow-rate data into a central processing pipeline that powers anomaly detection, regulatory reporting, and automated equipment alerts. The platform had grown organically since 2019, and by early 2025, the engineering team was spending more time fighting fires than building features.

When we first met the Sabre engineering leads, the situation looked manageable on paper but painful in practice: 18 EC2 instances running 24x7, an Auto Scaling group that rarely scaled fast enough, nightly batch windows that regularly overran into business hours, and a support team that got paged at least twice a week for capacity-related incidents. The CFO had already flagged infrastructure as the fastest-growing line item in the P&L. Something had to give.

Our engagement was scoped as a four-week diagnostic followed by a three-month implementation. The mandate was simple: keep the platform reliable while cutting infrastructure costs. What we found, however, was a much deeper story about over-provisioning, architectural drift, and an engineering culture that had learned to treat every load spike as a crisis instead of a signal.

## The Challenge

Sabre Energy’s platform had four major pain points that were feeding into each other. First, static compute sizing meant they were paying for peak capacity even when the system was idle. Their EC2 fleet averaged just 38% CPU utilization across the full week, with most of the heavy lifting concentrated in two one-hour batch windows each evening. Second, their batch-processing pipeline was fragile. A single malformed payload could stall the entire Spark job for hours, requiring manual intervention and often delaying reports that downstream regulators expected by 7:00 AM.
Third, their monitoring and alerting stack was duplicated across tools: CloudWatch for infrastructure, a legacy Grafana instance for real-time dashboards, and a homegrown Slack bot that often routed pages to the wrong on-call engineer. Fourth, the team had no clear view of cost attribution. The AWS bill showed a single monthly figure, and splitting it by product, environment, or feature was a multi-day manual exercise.

These challenges were not purely technical. The engineering organization had unwritten rules about uptime targets that discouraged experimentation, and the operations team was measured on incident response time, not mean time between failures. That combination created incentives to over-provision and over-document rather than to simplify.

## Goals

Before writing any code, we agreed on measurable outcomes. The primary business goal was a 30-35% reduction in monthly cloud infrastructure spend without sacrificing platform reliability. The secondary goals were operational: reduce mean time to recovery for batch failures from four hours to under thirty minutes, eliminate manual scaling interventions during normal operations, and give product teams the ability to attribute costs to specific features within two clicks. We also set a cultural goal of moving the engineering team from reactive firefighting to proactive improvement, measured by the percentage of sprint capacity allocated to technical debt.

The goals were deliberately ambitious. A 35% cost reduction in three months required rethinking not just the infrastructure but the team’s relationship with it. We treated the cost target as the hardest constraint and designed everything else around it.

## Approach

We started with a data-driven diagnostic phase rather than jumping into refactoring. Over the first two weeks, we instrumented every component with CloudWatch Metric Streams, ran a detailed cost-and-usage report through AWS Cost Explorer, and interviewed each engineer about where they lost time.
The data confirmed what we suspected: the EC2 fleet was consistently oversized, the batch pipeline was spending four to six hours a week in recovery, and nobody could explain why the monthly bill had grown 18% year-over-year.

The technical approach we landed on centered on three principles. First, shift from provisioned capacity to demand-driven execution. Wherever we could replace always-on compute with event-triggered functions, we did. Second, make failures cheap. Instead of building ever-larger batch jobs that were expensive to restart, we broke the ingestion and transformation pipeline into small, idempotent steps that could retry independently. Third, instrument everything. Every function, every queue, and every state transition would emit structured logs and metrics from day one, so the team would never again be blind to what the system was doing or what it cost.

## Implementation

The migration happened in four overlapping sprints, each delivering a production-ready component. We kept the legacy system running in parallel until the final cutover, which minimized risk and gave the operations team time to adapt.

**Sprint 1: Event-driven ingestion.** We replaced the always-on TCP listeners on EC2 with API Gateway endpoints backed by Lambda functions. Sensor data now flows through an HTTP API that can scale to tens of thousands of concurrent connections without any capacity planning. The Lambda functions write raw payloads to Kinesis Data Streams, which decouple ingestion from processing. The biggest surprise here was latency: the old system had an average end-to-end processing time of 47 seconds because the EC2 instances were often saturated. The new Lambda path brings that down to roughly 9 seconds at the p95 level, simply because the functions run in parallel without contention.
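
The ingestion path from Sprint 1 can be sketched roughly as follows. This is a minimal illustration in Python, not Sabre's actual code: the field names in `REQUIRED_FIELDS`, the stream name `telemetry-raw`, and the `build_record` and `handle` helpers are all hypothetical. The stream client is passed in so the logic can be exercised without AWS; in a deployed function it would presumably be a boto3 Kinesis client, whose `put_record` call accepts the same `StreamName`, `Data`, and `PartitionKey` arguments. The sketch also assumes the request body arrives as a plain JSON string and ignores base64-encoded bodies.

```python
import json
from typing import Any, Protocol

# Hypothetical telemetry schema; the article only names pressure,
# temperature, and flow rate as the measured quantities.
REQUIRED_FIELDS = ("sensor_id", "timestamp", "pressure", "temperature", "flow_rate")


class StreamClient(Protocol):
    """Anything exposing a Kinesis-style put_record call."""

    def put_record(self, *, StreamName: str, Data: bytes, PartitionKey: str) -> Any: ...


def build_record(body: str) -> dict:
    """Parse and validate one sensor payload, returning put_record arguments."""
    payload = json.loads(body)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(payload, dict):
        raise ValueError("payload must be a JSON object")
    missing = [name for name in REQUIRED_FIELDS if name not in payload]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    return {
        "Data": json.dumps(payload, sort_keys=True).encode("utf-8"),
        # Partitioning by sensor keeps one sensor's readings on one shard.
        "PartitionKey": str(payload["sensor_id"]),
    }


def handle(event: dict, client: StreamClient, stream_name: str = "telemetry-raw") -> dict:
    """Handle one HTTP API event: reject malformed input, forward the rest."""
    try:
        record = build_record(event.get("body") or "")
    except ValueError as exc:
        # A malformed payload is rejected on its own instead of stalling a batch.
        return {"statusCode": 400, "body": json.dumps({"error": str(exc)})}
    client.put_record(StreamName=stream_name, **record)
    return {"statusCode": 202, "body": json.dumps({"accepted": True})}
```

Rejecting a bad payload at the edge is one way to keep a single malformed reading from affecting anything downstream; whether Sabre validates at this point or in a later step is not stated in the case study.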

**Sprint 2: Step Functions for batch orchestration.** The legacy Spark batch job was replaced by a Step Functions state machine that coordinates about a dozen Lambda-based transformation steps. Each step handles a single responsibility: validation, normalization, deduplication, enrichment, and so on. If any step fails, the state machine retries with exponential backoff and, once retries are exhausted, dead-letters the payload to an SQS queue for manual review. The result is that the batch windows are no longer hard boundaries; the system processes data continuously and can complete a job in minutes rather than hours. Crucially, because each Lambda invocation costs fractions of a cent and only runs when triggered, the compute bill for the batch pipeline dropped by roughly 70%.

**Sprint 3: Observability and cost attribution.** We deployed a custom cost dashboard in Grafana that pulls data from the AWS Cost and Usage Report and maps it to service domains using tag-based cost allocation. Every Lambda function, every Kinesis stream, and every SQS queue carries tags that tie it back to a product feature and an environment. The dashboard updates daily and gives product managers a self-service view of their infrastructure footprint. On the operations side, we consolidated alerting into a single PagerDuty integration backed by CloudWatch Alarms and Lambda-based anomaly detection. The homegrown Slack bot was retired, and the number of false-positive pages dropped by 60% within the first week.

**Sprint 4: Training, runbooks, and cutover.** The last sprint was as much about people as it was about technology. We wrote detailed runbooks for every failure mode, conducted tabletop exercises with the on-call team, and embedded a member of the Sabre engineering staff in every design decision so knowledge transfer happened continuously rather than in a final handoff.
The cutover itself was a blue-green deployment: we ran the new pipeline alongside the old one for 72 hours, compared outputs, and flipped traffic only after we had confidence that the new system could handle the full production load.

One implementation detail deserves special mention: cold starts. Sabre’s engineering team had read the same blog posts we had about Lambda cold-start latency and were worried it would degrade the user experience. We addressed this by keeping provisioned concurrency set to one instance per function and by using Lambda SnapStart for the highest-throughput paths. In practice, the p99 cold-start latency came in under 180 milliseconds, well below the 500-millisecond threshold the product team had set as unacceptable.

## Results

The numbers speak for themselves, but the human impact is equally important. Within the first full month of operation, Sabre Energy’s monthly AWS bill dropped from $78,400 to $45,100, a reduction of 42.5%. The savings came primarily from eliminating idle EC2 capacity, moving batch compute to pay-per-use Lambda, and right-sizing storage and data transfer based on actual usage patterns rather than projections.

Reliability improved alongside cost. The number of production incidents related to capacity and batch processing fell from twelve in the six months before migration to zero in the three months after. Mean time to recovery for any remaining incident dropped from four hours to twenty-two minutes, largely because the Step Functions state machine provides built-in visibility into where a pipeline is stuck and the Lambda-based steps are quick to redeploy and test in isolation.

The operations team went from being paged twice a week for capacity issues to going an entire month without a single infrastructure-related page.
That shift changed how they spent their time: instead of reacting to outages, they started proactively optimizing queries and reducing data transfer costs, compounding the savings by another 8% in the following quarter.

Product teams gained the ability to see their share of the infrastructure bill within two clicks, and that visibility changed behavior. Two teams that had been running separate demo environments 24x7 switched to on-demand ephemeral environments spun up via CloudFormation only when needed, cutting their combined non-production spend by 55%.

## Key Metrics

- Infrastructure cost reduction: 42.5% in the first month, compounding to 48% by month three
- Batch processing time: reduced from an average of 3.2 hours to 18 minutes
- Production incidents (capacity-related): from 12 in six months to 0 in three months
- Mean time to recovery: from 4 hours to 22 minutes
- False-positive alert pages: reduced by 60%
- EC2 fleet utilization: from 38% average to eliminated entirely (no static fleet required)
- Cost attribution coverage: from 0% to 100% of services tagged and self-service visible
- Development team reallocation: approximately 15 engineer-hours per week shifted from firefighting to feature development
- End-to-end telemetry processing latency (p95): from 47 seconds to 9 seconds

## Lessons Learned

The Sabre Energy transformation taught us several lessons that apply far beyond a single company or industry.

**Optimize for the real workload, not the projected one.** The original EC2 fleet was sized for a spike that happened once every quarter and lasted less than two hours. Always-on capacity for rare events is one of the most expensive mistakes we see in cloud architecture, and Sabre was not an exception; it was typical. If you can describe your peak load in minutes per month, you probably should not be paying for it by the hour.
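
The arithmetic behind the headline figures is easy to check. The short Python sketch below recomputes the first-month reduction from the two bill amounts reported in the Results section and, as a rough assumption, treats the 38% average CPU utilization as a proxy for how much of the always-on fleet was doing useful work. The function names are illustrative only, and CPU utilization is an imperfect stand-in for paid-but-idle capacity.

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


def idle_share(avg_utilization: float) -> float:
    """Fraction of always-on capacity left unused at a given average utilization.

    Assumption: average CPU utilization approximates useful work done.
    """
    if not 0.0 <= avg_utilization <= 1.0:
        raise ValueError("avg_utilization must be between 0 and 1")
    return 1.0 - avg_utilization


# Figures taken from the case study itself.
bill_before = 78_400  # monthly AWS bill before migration, USD
bill_after = 45_100   # monthly AWS bill in the first full month after, USD

print(round(percent_reduction(bill_before, bill_after), 1))  # 42.5
print(round(idle_share(0.38), 2))                            # 0.62
```

Read this way, roughly three fifths of the fleet's paid hours were not doing work, which is the gap that demand-driven execution targets.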

**Break big batches into small steps.** The legacy Spark job was monolithic: a failure anywhere in the three-hour window meant starting over from the beginning. By decomposing the pipeline into independent, idempotent Lambda functions orchestrated by Step Functions, we made every step resumable and independently testable. The reliability gains were as significant as the cost gains.

**Instrumentation is not an afterthought.** Had we tried to migrate without the cost dashboard and structured logging, we would have lacked the feedback loops needed to validate our decisions. The dashboard became a forcing function for good behavior: once teams could see their costs, they started optimizing without being asked.

**Change is cultural, not just technical.** The biggest risk to this project was not a technical failure; it was the operations team’s fear that serverless would make their jobs harder or less visible. By involving them in design decisions, writing thorough runbooks, and treating the migration as a knowledge-sharing exercise rather than a vendor handoff, we turned skeptics into advocates. The teams that are still fighting serverless adoption in their own organizations should ask whether they have spent enough time on the people side of the equation.

**Start with data, not assumptions.** Our diagnostic phase cost roughly three weeks and about $18,000 in consulting days. It saved an estimated $600,000 in avoided missteps. The cost-and-usage report and engineer interviews told a story that would have been invisible to anyone proposing a solution from a slide deck. If you are thinking about a cloud transformation, invest in understanding your actual workload before you draw a single architecture diagram.

## Conclusion

Sabre Energy’s journey from a brittle, over-provisioned EC2 fleet to a responsive, event-driven serverless platform demonstrates that cost optimization and reliability are not opposing goals. In fact, they often reinforce each other.
The same architectural changes that eliminated idle capacity (demand-driven execution, small idempotent steps, continuous instrumentation) also made the system more resilient and easier to operate.

The transformation is not over. Sabre’s engineering team is now exploring event-driven microservices for real-time alerting, further storage tiering to move infrequently accessed telemetry into S3 Glacier, and a machine learning pipeline that would run on Lambda and SageMaker to predict equipment failures before they happen. Each of those initiatives builds on the serverless foundation we put in place, and each will deliver additional cost savings on top of the 42% reduction that changed the conversation from "how do we afford the cloud" to "how do we get more value from it."

If your organization is carrying a similar load of over-provisioned infrastructure and recurring capacity incidents, the Sabre Energy case study is a reminder that the path to lower costs is not through volume discounts or reserved instances alone. It is through architecture. And the best time to start was yesterday; the second best time is now.
