How CloudNest Scaled Its Multi-Tenant SaaS Platform to 2 Million Users Without a Single Hour of Downtime
CloudNest, a growing B2B SaaS company, faced a critical inflection point: their monolithic architecture was buckling under rapid user growth, customer churn was rising, and the engineering team was spending more time firefighting than building. In just six months, they migrated to a distributed multi-tenant microservices architecture on AWS, achieving 99.99% uptime, sub-200ms API response times, and the ability to onboard new enterprise customers in hours instead of weeks. This case study details the full journey—from diagnosis to delivery—covering the architectural decisions, team reorganization, toolchain overhaul, and the hard metrics that proved the transformation was worth it.
Tags: Case Study · AWS · Microservices · Multi-Tenant SaaS · Cloud Architecture · Kubernetes · Aurora PostgreSQL · Performance Optimization · DevOps
## Overview
CloudNest is a mid-sized B2B SaaS company providing project management and team collaboration tools to enterprise clients across North America and Europe. Founded in 2019, the platform grew organically from a handful of small business customers to a roster of mid-market and enterprise accounts by 2024. By the time leadership commissioned an architectural review in Q3 2024, the platform was serving approximately 340,000 active monthly users—but the engineering team knew the real number of end-users within those organizations was closer to 2 million when you counted seat-based employees who interacted with the system indirectly.
The growth was a good problem to have, but it had become an expensive one. Recurring outages during peak business hours, chronically inflated infrastructure costs, and an inability to deliver new features without breaking existing ones had created a crisis of confidence internally. Customer support tickets related to performance issues had increased 340% year-over-year. Three enterprise deals had been lost in a single quarter, with procurement teams citing reliability concerns in their exit interviews.
CloudNest engaged a senior technical advisory team to conduct a six-week architectural assessment, followed by a planned six-month transformation initiative. The goal was unambiguous: modernize the platform's infrastructure and application layer without sacrificing the product momentum that had gotten them this far.
## The Challenge
The symptoms CloudNest was experiencing were familiar to anyone who has watched a startup outgrow its original architecture. The platform ran on a single-region AWS setup using EC2 instances behind a load balancer, with a monolithic Rails application backed by a PostgreSQL database on a single primary-replica pair. It was simple to operate, and that simplicity had been a feature early on. But as the product expanded—adding real-time collaboration, file versioning, custom workflows, and third-party integrations—the application had grown into a tangled mass of tightly coupled code that made even small deployments high-risk events.
The database was the most critical bottleneck. With tables containing tens of millions of rows and no meaningful read/write splitting beyond a passive replica, every feature that generated analytics, notifications, or audit logs competed directly with transactional writes for database resources. During peak usage windows—Monday mornings and Thursday afternoons were consistently the worst—the application would begin falling over as database connections were exhausted. The team had addressed this in the short term by increasing the connection pool size and adding caching layers at the application level, but these were tactical patches on a strategic wound.
Equally problematic was the inability to isolate tenant data effectively. CloudNest's enterprise customers required data residency guarantees—some needed their data confined to specific AWS regions for compliance purposes. The existing architecture treated all tenants as equal consumers of shared infrastructure, with no mechanism for reserving capacity or isolating noisy neighbors. When one tenant's bulk import job consumed disproportionate resources, it degraded performance for every other customer on the platform.
The engineering team was also under pressure to ship new features. The product roadmap had been effectively frozen for two quarters as the team prioritized stability over innovation. Morale was suffering. Engineers who joined to build interesting products were spending their days managing database connection pools and writing post-mortems for outages.
## Goals
The advisory team and CloudNest's leadership established five primary objectives for the transformation:
1. **Achieve 99.99% uptime** (no more than 52 minutes of downtime per year), specifically during business hours in CloudNest's primary markets.
2. **Reduce p95 API response time** from roughly 1,800ms to below 200ms under normal operating conditions.
3. **Enable tenant-level resource isolation** so that no single customer could degrade platform performance for others.
4. **Support data residency requirements** for at least three geographic regions without duplicating the entire application stack.
5. **Restore feature delivery velocity** to a cadence of at least two significant product releases per month.
## The Approach
### Phase 1: Assessment and Architecture Design (Weeks 1–6)
The advisory team began with a comprehensive audit of the existing system. This included instrumenting the application with distributed tracing (AWS X-Ray), analyzing database query performance (using pg_stat_statements and Amazon RDS Performance Insights), and conducting stakeholder interviews with engineering, product, and customer success teams.
The findings confirmed the team's suspicions but also surfaced opportunities they hadn't anticipated. The monolithic application contained 14 distinct functional domains, but only four of them—auth, notifications, analytics, and file management—accounted for 78% of the database load and 65% of the inter-service communication complexity. Extracting these four domains as independent services would address the majority of the performance problems.
The architectural design adopted an event-driven microservices pattern using Amazon SQS and SNS for asynchronous communication between services, with Amazon Aurora PostgreSQL as the per-service database backend. Each microservice would own its data store, eliminating shared database contention. The team chose Kubernetes (EKS) as the container orchestration layer, with AWS Fargate for serverless compute to avoid the operational overhead of managing node pools.
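The SNS/SQS fan-out described above can be sketched in a few lines of Python with boto3. This is an illustrative sketch, not CloudNest's actual code: the topic ARN, event names, and message-attribute scheme are assumptions. The key idea is that publishers attach an `event_type` attribute so each subscribing service's SQS queue can filter for only the events it cares about.

```python
import json


def build_event_message(event_type: str, tenant_id: str, payload: dict) -> dict:
    """Build SNS publish kwargs for a domain event.

    Message attributes let SQS subscriptions filter by event type,
    so each service receives only the events it subscribes to.
    """
    return {
        "Message": json.dumps({"tenant_id": tenant_id, "payload": payload}),
        "MessageAttributes": {
            "event_type": {"DataType": "String", "StringValue": event_type},
            "tenant_id": {"DataType": "String", "StringValue": tenant_id},
        },
    }


def publish_event(topic_arn: str, event_type: str, tenant_id: str, payload: dict) -> None:
    import boto3  # lazy import keeps the pure helper testable offline

    sns = boto3.client("sns")
    sns.publish(TopicArn=topic_arn, **build_event_message(event_type, tenant_id, payload))


if __name__ == "__main__":
    # Hypothetical topic ARN; each subscribing service binds an SQS queue
    # to this topic with a filter policy on event_type.
    publish_event(
        "arn:aws:sns:us-east-1:123456789012:cloudnest-domain-events",
        "project.updated", "tenant-42", {"project_id": "p-17"},
    )
```

Each consumer owning its queue (rather than services calling each other directly) is what lets a slow consumer fall behind without back-pressuring the publisher.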
For data residency, the design used a regional hub-spoke model: a primary global control plane handling authentication, billing, and metadata, with regional data planes in us-east-1, eu-west-1, and ap-southeast-1 managing tenant data. Cross-region replication was handled through DynamoDB Global Tables for metadata and S3 cross-region replication for file storage.
### Phase 2: Incremental Migration (Weeks 7–18)
The team adopted a strangler fig migration pattern—never migrating everything at once. Instead, they followed a three-step approach for each domain:
1. **Extract and shield**: Extract the domain into a standalone service while maintaining the existing Rails monolith as the write path. The new service read from replicated data and handled read-only queries.
2. **Dual write**: Enable both the monolith and the new service to accept writes, with a reconciliation process to resolve conflicts.
3. **Cut over**: Shift all traffic to the new service, decommission the legacy path, and retire the corresponding code from the monolith.
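The reconciliation pass in the dual-write step can be sketched as follows. This is an illustrative Python sketch, not CloudNest's code; the `id` field and the one-directional comparison (monolith as the source of truth being checked against the new service) are assumptions.

```python
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]


def reconcile(monolith_rows: List[Record],
              service_rows: List[Record]) -> Tuple[List[Record], List[Any]]:
    """Compare the two write paths during the dual-write phase.

    Returns (rows_to_backfill, conflicting_ids): rows missing from the
    new service are queued for backfill, and ids whose values diverged
    between the two stores are reported for conflict resolution.
    """
    by_id_new = {r["id"]: r for r in service_rows}
    backfill: List[Record] = []
    conflicts: List[Any] = []
    for row in monolith_rows:
        other = by_id_new.get(row["id"])
        if other is None:
            backfill.append(row)          # never reached the new service
        elif other != row:
            conflicts.append(row["id"])   # both wrote, values diverged
    return backfill, conflicts
```

Run periodically, a pass like this gives the team a concrete divergence metric: cut-over is safe only once the conflict count stays at zero for a sustained window.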
This phased approach minimized risk and allowed the team to validate each migration independently before moving on. It also provided natural rollback points: if a migration step caused problems, the team could step back to the previous state without a full system rollback.
The most technically challenging migration was the notifications service. CloudNest sent millions of transactional emails, push notifications, and in-app alerts per day. The legacy system used a synchronous notification dispatcher embedded in the Rails request cycle, which meant that a slow email provider could block API responses. The new notifications service extracted this entirely, accepting notification requests via an SQS queue and processing them asynchronously. This single change reduced average API response times by 400ms.
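The decoupling described above can be sketched like this. The queue URL, message schema, and `dispatch` stub are all assumptions, not CloudNest's implementation; the point is the shape: the API handler serializes and enqueues, and a separate worker long-polls and dispatches, deleting messages only after a successful send so failures are retried.

```python
import json

# Hypothetical queue URL
NOTIFICATION_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/notifications"


def make_notification(channel: str, recipient: str, template: str, data: dict) -> str:
    """Serialize a notification request for the SQS queue.

    The API handler enqueues and returns immediately, so a slow email
    provider can no longer block the request cycle.
    """
    assert channel in {"email", "push", "in_app"}
    return json.dumps({"channel": channel, "recipient": recipient,
                       "template": template, "data": data})


def enqueue_notification(body: str) -> None:
    import boto3  # lazy import keeps the pure helper testable offline

    boto3.client("sqs").send_message(QueueUrl=NOTIFICATION_QUEUE_URL, MessageBody=body)


def dispatch(notification: dict) -> None:
    """Stub: provider integration (SES, APNs, in-app store) goes here."""


def run_worker() -> None:
    """Long-poll the queue; delete a message only after successful dispatch."""
    import boto3

    sqs = boto3.client("sqs")
    while True:
        resp = sqs.receive_message(QueueUrl=NOTIFICATION_QUEUE_URL,
                                   MaxNumberOfMessages=10, WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            dispatch(json.loads(msg["Body"]))
            sqs.delete_message(QueueUrl=NOTIFICATION_QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])
```

Because SQS redelivers any message that is not deleted before its visibility timeout expires, a crashed worker loses nothing; a dead-letter queue would typically catch messages that fail repeatedly.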
### Phase 3: Performance Optimization and Hardening (Weeks 19–24)
With the core services migrated, the team focused on performance tuning and resilience testing. They implemented circuit breakers (using the resilience4j library) to prevent cascading failures, added aggressive caching with Amazon ElastiCache (Redis) for session management and frequently accessed reference data, and introduced per-service database connection pooling with PgBouncer.
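resilience4j is a JVM library, but the circuit-breaker behavior it provides is easy to illustrate in a few lines of Python. This minimal sketch (thresholds and timeout values are arbitrary) shows the three states: closed (calls pass through), open (calls fail fast after repeated failures), and half-open (one trial call is allowed after a cool-down).

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: after `threshold` consecutive failures the
    circuit opens and calls fail fast; after `reset_timeout` seconds one
    trial call is allowed (half-open), and a success closes the circuit.
    """

    def __init__(self, threshold: int = 5, reset_timeout: float = 30.0,
                 clock=time.monotonic):
        self.threshold = threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable for testing
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None   # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0           # success closes the circuit
        return result
```

Failing fast while a downstream dependency is unhealthy is what stops one slow service from exhausting the threads and connections of every service that calls it.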
Load testing was conducted using k6, simulating realistic traffic patterns based on the platform's busiest historical periods. The team deliberately over-provisioned test loads to 2.5x their expected peak to establish safety margins. All services were required to maintain sub-200ms p95 response times under these conditions before being accepted as production-ready.
## Implementation Highlights
One of the most impactful technical decisions was the adoption of a dedicated analytics pipeline separate from the transactional database. The original monolith had been generating real-time analytics by querying the production PostgreSQL database, which created lock contention and degraded transactional performance during reporting-heavy periods. The new architecture routed analytics queries through Amazon Kinesis Data Firehose into Amazon Redshift, with a separate read replica handling ad-hoc analytical queries. This decoupled the analytics workload entirely from the transactional path.
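The hand-off into the pipeline can be sketched with boto3. The stream name and event shape are assumptions; the 500-record ceiling reflects the PutRecordBatch API limit, which is why events are chunked before shipping.

```python
import json


def chunk(records: list, size: int = 500):
    """Yield batches; Firehose PutRecordBatch accepts at most 500 records."""
    for i in range(0, len(records), size):
        yield records[i:i + size]


def ship_to_firehose(stream_name: str, events: list) -> None:
    """Send analytics events to a Firehose delivery stream, newline-delimited
    so Redshift's COPY can ingest them downstream."""
    import boto3  # lazy import keeps the pure helper testable offline

    firehose = boto3.client("firehose")
    for batch in chunk(events):
        firehose.put_record_batch(
            DeliveryStreamName=stream_name,
            Records=[{"Data": (json.dumps(e) + "\n").encode()} for e in batch],
        )
```

Because the transactional services only ever append to the stream, reporting load lands on Redshift rather than on the Aurora clusters serving user requests.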
Another significant implementation was the introduction of an API gateway layer using Amazon API Gateway with custom Lambda authorizers for per-tenant authentication and rate limiting. This enabled fine-grained per-tenant resource allocation—enterprise customers on premium plans received dedicated rate-limiting buckets that could not be exhausted by lower-tier customers. It was also the layer that enabled sub-200ms response times for authentication, as the Lambda authorizer result was cached aggressively at the edge.
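The shape of such an authorizer can be sketched as follows. The tenant lookup is a stub and the tier names are assumptions; what matters is the response shape: an IAM policy plus a context object, which API Gateway caches per token (the aggressive caching mentioned above), and a usage identifier that routes the caller to a tier-specific usage plan for rate limiting.

```python
def build_policy(principal_id: str, effect: str, method_arn: str,
                 tenant_tier: str) -> dict:
    """Build a Lambda authorizer response: IAM policy + caller context."""
    return {
        "principalId": principal_id,
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [{"Action": "execute-api:Invoke",
                           "Effect": effect, "Resource": method_arn}],
        },
        # Context is passed through to the backend integration.
        "context": {"tenantTier": tenant_tier},
        # Routes the caller to a tier-specific usage plan (rate-limit bucket).
        "usageIdentifierKey": f"tier-{tenant_tier}",
    }


def lookup_tenant(token: str):
    """Stub: validate the token and resolve the tenant record (not shown)."""
    return None


def handler(event, context):
    token = event.get("authorizationToken", "")
    tenant = lookup_tenant(token)
    if tenant is None:
        return build_policy("anonymous", "Deny", event["methodArn"], "none")
    return build_policy(tenant["id"], "Allow", event["methodArn"], tenant["tier"])
```

Keying the usage identifier on tier (or on tenant id, for stricter isolation) is what prevents a free-tier burst from consuming an enterprise customer's rate-limit budget.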
The team also overhauled their CI/CD pipeline, moving from a Jenkins-based setup to GitHub Actions with ArgoCD for GitOps-based deployments to Kubernetes. This reduced deployment time from an average of 47 minutes to 12 minutes, and enabled the team to ship to production multiple times per day without manual intervention.
## Results
Six months after the migration began, CloudNest's platform had been transformed. The metrics told a clear story:
- **Uptime** for the 90-day period following full migration was 99.994%, exceeding the 99.99% target and representing a full quarter with zero unplanned outages.
- **p95 API response time** fell from 1,800ms to 127ms, a 93% improvement. The p99 response time, which the team had not previously measured systematically, came in at 340ms.
- **Customer-reported performance issues** in support tickets decreased by 78% in the first quarter post-migration.
- **Feature releases** increased from an average of 1.2 per month (in the quarters preceding the migration) to 4.1 per month in the quarter following full cut-over.
- **Enterprise onboarding time** decreased from 18 business days to 3 business days, as provisioning new tenants no longer required infrastructure engineering involvement.
- **Infrastructure cost per active user** decreased by 41% due to right-sized compute via Fargate and optimized database sizing through per-service Aurora clusters.
Beyond the quantitative results, the qualitative shift was equally significant. Engineers stopped firefighting. The on-call rotation, which had been a source of significant stress for the team, saw a 90% reduction in pages during the first quarter post-migration. Three engineers who had considered leaving reported a renewed sense of ownership and pride in the system.
## Key Lessons
**Start with the biggest bottleneck, not the cleanest problem.** The team was tempted early on to extract a simpler domain—one with fewer dependencies and less business risk—as a first migration to build confidence. The advisory team pushed back. The analytics domain was the most painful problem and the most impactful to solve. Extracting it first built organizational confidence faster than a smaller win would have, because everyone felt the improvement immediately.
**Incremental migration requires a tolerance for imperfection.** The dual-write phase introduced a period of temporary complexity—a state where two systems were simultaneously responsible for the same data. Engineers found this uncomfortable. The project leadership had to actively manage the pressure to simplify prematurely. The temporary complexity was the price of a safe migration. In hindsight, the team agreed it was worth every awkward minute.
**Performance targets must be measured before you aim at them.** CloudNest had not previously measured p95 or p99 response times systematically. They had a vague sense that the system was slow, but no data to know how slow, or where. The first step of the engagement was instrumentation—adding distributed tracing, database query analysis, and API latency tracking. Without baseline measurements, the team would have had no way to know whether the migration actually worked.
**Cultural change follows structural change.** Many of the team's operational problems were downstream of an organizational structure where one team was responsible for everything. After the migration, the four extracted services each had a dedicated team responsible for their reliability, performance, and evolution. Accountability became concrete. The on-call engineer for a service was the engineer who had built it. This structural change drove a culture shift more effectively than any amount of process improvement or training would have.
**Data residency is not a feature—it is a constraint that should drive architecture from day one.** CloudNest had treated data residency as a future requirement. Building it in after the fact was significantly more expensive than it would have been if the original architecture had been designed with multi-region data isolation as a first-class concern. Teams building multi-tenant SaaS products today should design for geographic data isolation from the beginning, even if no customer has yet asked for it.
## Looking Forward
CloudNest is now positioned to pursue enterprise customers with the confidence that their infrastructure can support the requirements those deals demand. The platform is currently evaluating expansion into the Asia-Pacific region, which will require a fourth regional data plane. The architectural choices made during this transformation make that expansion a matter of configuration rather than reconstruction.
The transformation also opened the door to a new product tier—a dedicated single-tenant deployment option for large enterprise customers who cannot share infrastructure with multi-tenant workloads. This was a product strategy that had been shelved because the underlying architecture could not support it. The migration made it viable again.
The story of CloudNest's transformation is, at its core, a story about the cost of postponing architectural investment until a crisis forces the issue. The team paid a price in stability, morale, and lost revenue. But the outcome—six months of disciplined work, clear metrics, and a platform that the engineering team is genuinely proud of—demonstrates that the price was recoverable, and the lessons are durable.