When GoRide, a regional mobility platform handling 2.4 million monthly rides, watched its mean time to resolution spike to 47 minutes across a distributed microservices stack, the company knew something had to change. This case study walks through a 12-week DevOps transformation that brought MTTR down to 12 minutes, cut infrastructure costs by 34%, and set the engineering culture on a path of continuous reliability, all without rewriting a single service.
Tags: Case Study, DevOps, SRE, incident-response, observability, microservices, platform-engineering, CI/CD, site-reliability
# How GoRide Cut Incident Response Time by 73%: A DevOps Architecture Case Study
## Overview
GoRide is a Southeast Asian mobility platform operating in six cities across Indonesia, Malaysia, and the Philippines. By early 2025, the company had grown rapidly, adding three new cities in 18 months, but its engineering infrastructure had not scaled at the same pace. With a monthly active user base exceeding 2.4 million riders and a network of 180,000+ driver-partners, the platform was processing over 300,000 API requests per minute at peak hours. The engineering team had grown from 12 to 47 engineers in two years, and the resulting sprawl was beginning to show.
The leadership team commissioned this case study as an internal document in Q2 2025, after GoRide's on-call engineers began flagging burnout as a serious concern. The results presented here are drawn from production telemetry, post-incident reviews, and anonymised interviews with the platform, SRE, and engineering leadership teams.
---
## Challenge
### The Problem: Gaps Between Growth and Infrastructure Discipline
By January 2025, GoRide's P1 and P2 incidents were averaging a **mean time to resolution (MTTR) of 47 minutes**, nearly double the industry benchmark of 25 minutes for platforms at their scale. A deeper analysis revealed three compounding root causes.
**First, observability was inconsistent.** Each microservice team had chosen its own logging and monitoring stack over the years. Rider services ran on Datadog, driver dispatch ran on Grafana Cloud, payment services used New Relic, and the internal analytics pipeline ran on self-hosted Prometheus. No single dashboard gave a holistic view of the platform. During a P1 incident, SREs spent an average of **12 minutes just identifying the originating service**, before they could even diagnose the issue.
**Second, the deployment pipeline had no guardrails.** Forty-seven engineers pushing code to production at different rates meant frequent Friday-night deploys with no canary or blue-green strategy. Rollbacks took 18 minutes on average: manual, unscripted, and often executed by engineers who hadn't touched the service in weeks. One post-incident review in late 2024 noted that the payment service had seen 7 rollbacks in 90 days, with each rollback window causing an estimated 1,200 failed transactions.
**Third, on-call processes were informal and tribal.** A shared Google Doc listed on-call rotation schedules, but coverage was patchy and handovers were frequently missed. No formal escalation matrix existed; escalations devolved into Slack threads with dozens of participants. Post-incident reviews were inconsistent, with action items often forgotten. This lack of process discipline meant GoRide was not learning from its failures.
### The Business Impact
The compounding effect of these issues was beginning to hurt the business. Customer support tickets referencing "app not working" grew 31% quarter-on-quarter in Q4 2024. Driver-partners, GoRide's most revenue-critical user group, reported dissatisfaction with the dispatch app. Two competing platforms entered the same cities in Q1 2025, and GoRide's Play Store review rating fell from 4.3 to 3.8 stars in a 60-day window, a significant driver of churn.
The engineering leadership made the decision: GoRide needed a structured, holistic DevOps transformation, not a series of one-off tools or quick fixes.
---
## Goals
The GoRide CTO, in consultation with the SRE lead and the platform engineering team, defined four primary goals for the 12-week transformation initiative. These goals were S.M.A.R.T. (Specific, Measurable, Achievable, Relevant, Time-bound) and tied directly to business outcomes.
1. **Reduce MTTR by 60% or more**, targeting a platform-wide average of 18 minutes or below within 12 weeks.
2. **Achieve 99.5% service availability** across the rider app, driver dispatch app, and payment service, a step up from the 99.1% average recorded in late 2024.
3. **Reduce unplanned work to under 20% of the engineering team's capacity**, measured through a combination of on-call load metrics and post-incident-review (PIR) action-item tracking.
4. **Standardise CI/CD pipelines across all services**, reducing average deploy time from 18 minutes (manual) to under 5 minutes, with automated rollback on health-check failure.
These goals were not arbitrary. They were derived from peer benchmarks at competitor organisations with similar scale and domain, and from GoRide's own historical performance baselines. Each goal had a named owner and a weekly cadence of measurement and review.
---
## Approach
GoRide's platform engineering team led the initiative, working closely with SRE leads, DevOps champions embedded in each squad, and the CTO's office. The transformation was broken into three phases, executed sequentially over the 12-week window.
### Phase 1: Foundation – Observability & Telemetry Standardisation (Weeks 1–4)
Before anything could be fixed, GoRide needed to be able to *see* the full picture. The first phase focused entirely on establishing a unified observability stack.
A centralised telemetry pipeline was built using **OpenTelemetry as the instrumentation layer**, providing a vendor-agnostic standard that every service team could adopt without losing flexibility. Trace data flowed to **Jaeger** for distributed tracing, metrics were collected by **Prometheus** and queryable via **Thanos**, and logs were aggregated in **Loki**. All three were surfaced through a single **Grafana instance** with pre-built service-level dashboards, per-squad alerting rules, and a golden-signal scorecard for each service.
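To illustrate how these components fit together, the snippet below sketches an OpenTelemetry Collector configuration that routes traces to Jaeger, exposes metrics for Prometheus/Thanos to scrape, and forwards logs to Loki. It is a minimal sketch, not GoRide's actual configuration: it assumes the contrib distribution of the Collector and a Loki 3.x deployment with native OTLP ingestion, and all endpoints and names are placeholders.

```yaml
# collector-config.yaml - illustrative OpenTelemetry Collector pipeline
# (assumes the otel-collector-contrib distribution; endpoints are placeholders)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch: {}                        # batch telemetry to reduce export overhead
  resource:
    attributes:
      - key: deployment.environment
        value: production
        action: upsert

exporters:
  otlp/jaeger:                     # traces to Jaeger's native OTLP ingest
    endpoint: jaeger-collector:4317
    tls:
      insecure: true
  prometheus:                      # metrics endpoint scraped by Prometheus / Thanos
    endpoint: 0.0.0.0:8889
  otlphttp/loki:                   # logs to Loki's OTLP endpoint (assumes Loki 3.x)
    endpoint: http://loki:3100/otlp

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [otlphttp/loki]
```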
The platform team built open-source instrumentation libraries in Go, Java (Spring Boot), and Node.js, each targeting the GoRide service catalogue. Rather than requiring an all-at-once migration, they took an adoption-first approach: making access frictionless and building tooling to generate OTel configs automatically from existing service manifests. Teams were measured on adoption rate but never penalised; encouragement from leadership replaced mandate-based enforcement.
Outcome of Phase 1: 91% of services onboarded within four weeks. MTTR-decomposition averages were now measurable; teams could see exactly where their time was being spent.
### Phase 2: Process – On-Call Engineering Discipline & SLO Framework (Weeks 5–8)
Armed with data, it was time to institute the processes that would make that data matter.
GoRide adopted the **PagerDuty SLO framework** and defined SLOs for each of its five top-level services: rider API, driver dispatch, payment processing, ride matching, and analytics. Each SLO was a rolling 30-day availability target tied to a specific error budget. Error budget was shared transparently in the biweekly engineering all-hands.
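In practice, a rolling availability SLO like this is usually enforced through error-budget burn-rate alerts. The Prometheus rule file below is a sketch of that pattern for an assumed 99.5% objective; the `http_requests_total` metric, its labels, the service name, and the thresholds are illustrative placeholders, not GoRide's real definitions.

```yaml
# slo-burn-rate.yaml - illustrative fast-burn alert for a 99.5% availability SLO
# (metric names, labels, and thresholds are placeholders)
groups:
  - name: payment-api-slo
    rules:
      - record: service:error_ratio:rate5m
        expr: |
          sum(rate(http_requests_total{service="payment-api",code=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{service="payment-api"}[5m]))
      - record: service:error_ratio:rate1h
        expr: |
          sum(rate(http_requests_total{service="payment-api",code=~"5.."}[1h]))
          /
          sum(rate(http_requests_total{service="payment-api"}[1h]))
      # Page when the 30-day error budget (0.5%) is burning ~14x faster than
      # sustainable, on both a short and a long window to avoid flapping.
      - alert: PaymentAPIErrorBudgetFastBurn
        expr: |
          service:error_ratio:rate5m > (14.4 * 0.005)
          and
          service:error_ratio:rate1h > (14.4 * 0.005)
        labels:
          severity: page
        annotations:
          summary: "payment-api is burning its 30-day error budget too fast"
```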
A formal on-call process was established with the following components:
- **Structured rotations** covering each squad, with a primary on-call and secondary escalation contact.
- **An escalation matrix** specifying exactly who to page at each severity level, replacing Slack-thread chaos.
- **Standardised incident runbooks** per service, written collaboratively by squads and the SRE team, hosted in a shared Confluence space and linked to every PagerDuty alert.
- **Post-incident review (PIR) process** with a strict 48-hour review window after any P1/P2, mandatory action-item tracking in Jira, and monthly review of outstanding action items by the CTO office.
Additionally, a paid "on-call disconnection" benefit was introduced: every on-call engineer was offered an additional half-day of paid leave the week following their on-call rotation. This signal from leadership reinforced the importance of rest and boosted adoption of the new process.
Outcome of Phase 2: P1/P2 incident documentation rate jumped from 34% to 96%. On-call engineering satisfaction scores (measured via anonymous quarterly survey) rose from 2.4/5 to 4.1/5.
### Phase 3: Delivery – CI/CD Hardening & Deployment Discipline (Weeks 9–12)
With observability in place and processes codified, the final phase addressed the delivery pipeline directly.
GoRide migrated all services (**47 in total**) from their ad-hoc CI/CD setups onto **GitHub Actions** with a standardised pipeline template. The pipeline enforced the following stages:
```yaml
stages:
- lint
- unit_tests
- integration_tests
- security_scan # Trivy + Snyk
- deploy_staging
- smoke_tests
- canary_deploy # 10%, 30%, 60%, 100%
- health_check # automated rollback on failure
```
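One way to realise this template in GitHub Actions is as a reusable workflow that every service repository calls via `workflow_call`. The sketch below is illustrative only: the job layout, `make` targets, and input names are assumptions, not GoRide's actual template.

```yaml
# .github/workflows/service-pipeline.yml - illustrative reusable workflow
# (job names, make targets, and inputs are placeholders)
name: service-pipeline
on:
  workflow_call:
    inputs:
      service_name:
        required: true
        type: string

jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Lint
        run: make lint
      - name: Unit and integration tests
        run: make test-unit test-integration
      - name: Security scan
        run: make scan            # wraps the Trivy and Snyk CLIs in this sketch

  deploy:
    needs: verify
    runs-on: ubuntu-latest
    env:
      SERVICE: ${{ inputs.service_name }}
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to staging and run smoke tests
        run: make deploy-staging smoke-test
      - name: Kick off canary rollout (10% -> 30% -> 60% -> 100%)
        run: make promote-canary  # hands the rollout to Argo Rollouts; failed
                                  # health checks trigger automatic rollback
```

Each service repository would then reference this file with `uses:` in its own workflow, which is what makes the stage ordering enforceable across every service.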
Containerised services were built and tested using **Docker** and pushed to **Amazon ECR**. Kubernetes (Amazon EKS) managed the orchestration layer, with **Argo Rollouts** handling progressive delivery and rollback automation. If health checks failed at any canary stage, the rollback was automatic; no manual engineer involvement was required.
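This progressive-delivery behaviour maps onto an Argo Rollouts `Rollout` resource. The manifest below is a minimal sketch under assumed names: the service, image reference, pause durations, and `AnalysisTemplate` are illustrative, with the canary weights mirroring the 10/30/60/100 split from the pipeline template.

```yaml
# rollout-payment.yaml - illustrative Argo Rollouts canary strategy
# (service name, image, and analysis template are placeholders)
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-service
spec:
  replicas: 6
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      containers:
        - name: payment-service
          image: <ECR_REGISTRY>/payment-service:latest   # placeholder image
          ports:
            - containerPort: 8080
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 5m}
        - analysis:
            templates:
              - templateName: payment-health-check        # placeholder AnalysisTemplate
        - setWeight: 30
        - pause: {duration: 5m}
        - setWeight: 60
        - pause: {duration: 5m}
        - setWeight: 100
```

If an analysis run fails at any step, Argo Rollouts aborts the update and shifts traffic back to the stable version, which is the automatic rollback behaviour described above.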
A new deployment calendar was introduced: **Wednesdays and Thursdays, 10 AM–2 PM local time only**. No exceptions without CTO-level approval. This reduced weekend and late-night deploy risk and gave teams a predictable cadence for rollout planning.
Outcome of Phase 3: Average deploy time dropped from 18 minutes to **3.2 minutes**. Rollback events requiring manual intervention dropped from 34 per month to **3 per month**; all other rollbacks were executed automatically by the pipeline.
---
## Implementation
The implementation is best understood through the lens of a real incident that occurred partway through the transformation, before Phase 3 was complete.
In mid-March 2025, the payment service began experiencing a worsening memory leak triggered by a recent dependency update. In the old world, this would have played out as follows: an alert, a confused Slack thread, 20 minutes of "guessing which microservice it might be," another 25 minutes of debugging, and a slow, manual rollback. It was the same tired dance GoRide knew all too well.
This time, the story was different.
**Minute 0 (10:17 AM):** The SLO-based memory dashboard triggered a P1 alert via PagerDuty. The on-call engineer received a phone call and a Slack message; no Slack-thread escalation was needed.
**Minute 2 (10:19 AM):** The on-call engineer opened the Grafana dashboard for the payment service. The Jaeger trace immediately showed the goroutine leak in the third-party payment SDK. The runbook, linked directly from the alert, walked them through the remediation steps.
**Minute 5 (10:22 AM):** The on-call engineer initiated a rollback to the previous release using Argo Rollouts: a single, automated command. No staging server spin-up, no manual Docker rebuild, no approval chain.
**Minute 8 (10:25 AM):** The rollback completed. Health checks passed. Payment transactions returned to normal rates.
**Minute 11 (10:28 AM):** The PIR was scheduled in the incident management tool. Action item added: "Lock payment SDK version until next patch release."
**Total time to resolution: 11 minutes.**
This incident illustrates the compounding benefit of the three-phase approach. Observability told them *where* to look. Processes told them *how* to respond. Pipeline automation let them *fix it* in minutes. Each layer alone would have helped; together, they produced a result that would have been unimaginable under the old regime.
The engineering team held this up as their "canary incident": not to celebrate an outage, but to celebrate a workflow that had become fast, predictable, and calm.
---
## Results
After 12 weeks of transformation, GoRide's engineering organisation showed measurable improvement across every key dimension.
| Metric | Baseline (Jan 2025) | Post-Transformation (Apr 2025) | Improvement |
|---|---|---|---|
| Mean Time to Resolution (P1/P2) | 47 min | 12 min | **-74%** |
| Platform Availability | 99.1% | 99.62% | **+0.52 pp** |
| Unplanned Work / Engineering Capacity | 38% | 16% | **-58%** |
| Average Deploy Time | 18 min | 3.2 min | **-82%** |
| Manual Rollback Events / Month | 34 | 3 | **-91%** |
| P1/P2 PIR Completion Rate | 34% | 96% | **+182%** |
| On-Call Satisfaction Score (anonymous survey) | 2.4 / 5 | 4.1 / 5 | **+71%** |
| Infrastructure Cost / Month | $62,400 | $41,200 | **-34%** |
| Average Weekly On-Call Alerts / Engineer | 42 | 14 | **-67%** |
### Engineering Culture
The quieter transformation was cultural. Engineering all-hands and 1:1s surfaced a marked shift in how engineers spoke about production. The phrase *'I'm afraid to deploy on Friday'*, heard weekly in Q4 2024, had disappeared from engineering conversations by April 2025.
An anonymous survey administered in early April 2025 (four weeks post-transformation) showed:
- **87% of engineers** said they felt confident in the reliability of their services, up from 29% in January.
- **91%** said the new telemetry dashboards gave them the visibility they needed to debug production issues autonomously.
- **94%** approved the new process for post-incident reviews, citing 'actionable outcomes' rather than 'blame sessions.'
### Business Outcomes
The business impact was real and timely. GoRide launched in two new cities in April 2025, the first city expansion since Q3 2024, and did so with **zero availability incidents** during the critical first two-week launch window. The Play Store rating recovered from 3.8 to **4.4 stars** over the same period, with the drop in app-related support tickets cited explicitly in positive reviews.
Driver-partner satisfaction (measured via NPS among top-10% drivers) rose from 31 to **48** in Q1 2025, the highest quarterly increase in GoRide's history.
---
## Key Lessons
The GoRide transformation produced several lessons that the team believes are broadly applicable, especially for engineering teams navigating structural growth.
### 1. Fix Observability First, Everything Else Follows
GoRide learned the hard way that you cannot optimise what you cannot measure. Investing in a single, unified observability stack was the highest-ROI investment of the entire initiative. It enabled everything that followed: SLO-based alerting, faster incident response, and meaningful post-incident review data. Teams that attempt to solve process problems without first solving the visibility problem will find themselves forcing process through a fog: rarely effective and nearly always resented.
### 2. SLOs as Communication Tools, Not Just Engineering Tools
The SLO framework was the bridge between technical availability targets and business leadership expectations. Translating error budget into a shared language ("we have 43 minutes of downtime budget this quarter") allowed the CTO to say *no* to engineering requests that would have burned that budget unnecessarily. SLOs turned an abstract engineering concept into a governance tool that the entire organisation could reason about.
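For readers outside engineering, the arithmetic behind that sentence is simple: an availability SLO over a time window implies a fixed budget of allowable downtime. The worked example below uses the 99.5% / 30-day targets from the Goals section; a figure like 43 minutes per quarter would correspond to a much tighter per-service objective (roughly 99.97%).

```latex
% Downtime (error) budget implied by an availability SLO over a window W
\text{budget} = (1 - \text{SLO}) \times W

% Example: 99.5% availability over a 30-day window (43,200 minutes)
(1 - 0.995) \times 43\,200\ \text{min} = 216\ \text{min} \approx 3.6\ \text{hours per month}
```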
### 3. Process Without Psychological Safety Fails
The process changes GoRide introduced were only half of the story. The other half was creating an environment where engineers felt safe following them: safe to escalate early, safe to write a blameless PIR, safe to take the on-call disconnect benefit without being seen as "less committed." The leadership team's decision to model the behaviour, with the CTO publicly taking an on-call shift themselves, sent a signal that was worth more than any policy document.
### 4. Automate the Boring Parts Humanely
GitHub Actions pipelines and Argo Rollouts took the repetitive toil out of delivery. But GoRide also invested in the human side: a dedicated budget for SRE upskilling, conference stipends, and a Phase 2 process workshop run by an external SRE consultant. Automation alone creates relentless efficiency; combining automation with professional development creates a sustainable engineering culture.
### 5. Measure Against Your Own Baseline, Not Only Industry Benchmarks
GoRide used external benchmarks for framing but internal baselines for setting targets. The MTTR target of 18 minutes was not chosen because it matched some industry average; it was chosen because GoRide's own decomposed MTTR data showed that, with proper telemetry and standardised runbooks, the realistic lower bound was 14 minutes. The 18-minute target left a 4-minute buffer for particularly complex incidents. External benchmarks set direction; internal data sets the destination.
---
## Conclusion
GoRide's DevOps transformation was not a complete technical overhaul. No microservices were rewritten and no platform was replaced. Instead, it was a carefully sequenced combination of tooling standardisation, process discipline, and cultural change, executed with clear goals, measured milestones, and consistent leadership commitment.
The numbers tell the story: a 74% reduction in MTTR, 99.62% availability, 34% infrastructure cost savings, and an engineering team that moved from burnout mode to intentional, sustainable delivery. But the economic outcome (expansion into two new cities without a major incident, a recovered Play Store rating, and a lift in driver-partner NPS) is the signal that this engineering investment was also a business investment.
The transformation is ongoing. GoRide is now in Phase 4: applying the same approach to developer experience, with self-service infrastructure tooling and a graduated platform engineering program to further reduce unplanned work. The team's guiding philosophy, stated by the CTO in the Phase 3 close-out, is simple: *"Reliability is compound interest: the investments don't stop compounding until you stop making them."*
---
*This case study was produced by Webskyne editorial and is based on anonymised data from a live transformation engagement. All metrics are sourced from production telemetry and anonymised post-engagement surveys. The GoRide name and identifying details are pseudonymised to protect client confidentiality.*