7 May 2026 · 14 min read
Scaling Real-Time Collaboration: How Webskyne Engineered a High-Performance Live Editing Platform for 100K+ Concurrent Users
When a leading project management SaaS provider faced catastrophic performance failures during peak collaboration sessions, Webskyne was brought in to redesign their real-time architecture from the ground up. The challenge was daunting: support 100,000+ concurrent users editing simultaneously while maintaining sub-100ms latency and 99.99% uptime. Through innovative WebSocket optimization, strategic use of conflict-free replicated data types (CRDTs), and a hybrid cloud-native architecture, we not only solved the immediate crisis but built a system that now powers collaboration for millions of users worldwide. This case study reveals how we transformed a failing platform into a market differentiator through architectural excellence, operational rigor, and a methodical approach to distributed systems engineering.
Overview
In early 2025, a prominent project management SaaS platform serving enterprise clients worldwide was grappling with a growing crisis. Their real-time collaboration features, the cornerstone of their product's differentiation, were buckling under increasing load. During peak usage windows, users experienced delays of 5-10 seconds for document updates, cursor positions lagging behind actual user movements by several seconds, and, worst of all, frequent "collaboration conflicts" in which edits overwrote each other, causing data loss and user frustration. Customer churn was accelerating, with lost contracts totaling over $4.2 million annually directly attributable to these performance issues.
The client engaged Webskyne to completely overhaul their real-time collaboration infrastructure. Our mandate was clear: build a system capable of supporting 100,000+ concurrent active editors with sub-100ms latency for all user-visible updates, achieve 99.99% uptime (allowing no more than 52 minutes of downtime per year), and eliminate edit conflicts entirely. The existing system was built on a traditional request-response architecture with periodic polling, an approach fundamentally incompatible with the responsiveness expected of modern collaborative tools.
Over the course of 14 months, Webskyne designed, implemented, and migrated the client's collaboration stack to a cloud-native, event-driven architecture. The result was not merely a technical improvement but a business transformation: the platform now processes 2.3 billion real-time operations monthly with 99.996% uptime, edit conflicts have been reduced to effectively zero, and the client has gained a significant competitive advantage that has driven 28% year-over-year growth in enterprise contract value.
The Challenge: Why Real-Time Collaboration Is Deceptively Complex
Real-time collaboration systems present unique engineering challenges that distinguish them from conventional web applications. Before diving into our solution, it's important to understand the fundamental problems that made the client's original architecture untenable.
The State of Their Original System
The client's legacy system used short polling every 3 seconds to check for document updates, a pattern sketched after the list below. This approach created several fundamental problems:
- High latency: Users could wait up to 3 seconds to see others' changes, making the experience feel sluggish and disconnected.
- Server overload: With 50,000+ concurrent users, the polling mechanism generated 1,000+ requests per second per server regardless of actual activity, leading to unnecessary resource consumption.
- Race conditions: Multiple users could make concurrent edits that would overwrite each other because there was no mechanism to track or reconcile simultaneous changes.
- Inefficient bandwidth usage: Most polls returned no changes, yet full HTTP request/response overhead was incurred each time.
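To make the waste concrete, here is a minimal sketch of the kind of loop the legacy client was effectively running. The endpoint and response shape are hypothetical reconstructions, not the client's actual API:

```typescript
// Hypothetical reconstruction of the legacy client's 3-second short-polling loop.
type PollResponse = { version: number; changes: unknown[] };

let lastVersion = 0;

async function pollOnce(docId: string): Promise<void> {
  // Full HTTP request/response overhead is paid on every tick,
  // even though most polls return no changes at all.
  const res = await fetch(`/api/docs/${docId}/changes?since=${lastVersion}`);
  const body: PollResponse = await res.json();
  if (body.changes.length > 0) {
    lastVersion = body.version;
    applyChanges(body.changes); // last-write-wins: concurrent edits can clobber each other
  }
}

function applyChanges(changes: unknown[]): void {
  /* re-render the document */
}

// Worst case, a collaborator's edit stays invisible for a full 3-second tick.
setInterval(() => void pollOnce("doc-123"), 3000);
```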
The fundamental flaw wasn't just the polling interval but the architectural assumption that collaboration could be treated as occasional state synchronization rather than continuous interaction. Modern users expect Google Docs-level responsiveness, where cursor movements and character-level changes appear almost instantaneously for all participants.
The Three Pillars of Real-Time Collaboration
Building a scalable real-time collaboration system requires solving three interlocking problems:
- Real-time communication: Establishing and maintaining efficient, bidirectional connections between clients and servers with minimal overhead.
- Conflict resolution: Detecting and reconciling concurrent edits to the same document by multiple users without data loss or corruption.
- State synchronization: Ensuring every client has a consistent, current view of the document while supporting offline work and reconnection scenarios.
These challenges compound when scaling to hundreds of thousands of concurrent users. A WebSocket connection for each active editor quickly becomes expensive in terms of memory and file descriptors. Broadcast storms, where one edit triggers updates to thousands of clients, can cascade through the system. And the computational complexity of conflict resolution grows quadratically with the number of concurrent editors on the same document.
Our Strategic Goals
We established five concrete, measurable goals that would define project success:
1. Performance Excellence
- Sub-100ms round-trip latency for all user-initiated operations (character inserts, deletions, cursor movements)
- Sub-100ms propagation time for other users to see changes made by a collaborator
- Support 100,000+ concurrent active editors across 10,000+ simultaneous documents
2. Unwavering Reliability
- 99.99% uptime (52 minutes maximum downtime per year)
- Zero data loss scenarios
- Automatic recovery from network partitions and server failures
3. Perfect Consistency
- Zero edit conflicts regardless of concurrent user count
- Every client converges to identical document state
- Predictable, deterministic merge behavior
4. Cost Efficiency
- Reduce infrastructure costs by 40% compared to the legacy system's trajectory
- Linear scaling characteristics: adding users should not produce superlinear cost increases
- Optimize for cloud resource utilization (CPU, memory, network)
5. Developer Experience & Maintainability
- Clear separation of concerns between real-time and business logic
- Modular, testable architecture
- Observability and debugging capabilities for production issues
Our Solution Approach
Our solution combined three architectural innovations that together created a system far greater than the sum of its parts:
Architecture 1: Layered WebSocket Optimization with Pub/Sub
We replaced polling with a layered WebSocket architecture that separated concerns and enabled horizontal scaling. The system consists of four logical layers:
- Client layer: Maintains persistent WebSocket connection to edge gateway; handles local CRDT operations and conflict resolution; optimizes update rendering using virtual DOM diffing.
- Edge gateway layer: Routes WebSocket connections to appropriate collaboration region; terminates TLS; handles connection multiplexing; implements rate limiting and authentication.
- Collaboration region: Manages document state using Redis clusters; processes CRDT operations; coordinates pub/sub messaging using Redis Streams; runs conflict resolution algorithms.
- Persistent layer: Stores document snapshots and operation history in PostgreSQL with read replicas; maintains audit trails; runs periodic compaction.
The key insight was to make the WebSocket layer stateless at the edge and stateful only at the collaboration region, enabling dynamic routing and load balancing. This allowed us to scale the edge independently of the stateful collaboration nodes, which are the true bottleneck.
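To make "stateless at the edge" concrete, the routing decision can be a pure function of document ID and cluster membership, so any gateway instance computes the same answer without shared state. The sketch below uses rendezvous (HRW) hashing; the node names are illustrative, and the production router also weighed region affinity and load:

```typescript
import { createHash } from "node:crypto";

// Hypothetical membership list; in production this comes from service discovery.
const collabNodes = ["collab-1", "collab-2", "collab-3"];

// Weight of a (document, node) pair: first 64 bits of a SHA-256 digest.
function score(docId: string, node: string): bigint {
  const digest = createHash("sha256").update(`${docId}:${node}`).digest("hex");
  return BigInt(`0x${digest.slice(0, 16)}`);
}

// Rendezvous hashing: each document deterministically picks the node with the
// highest score, and only ~1/N of documents move when membership changes.
export function routeDocument(docId: string): string {
  return collabNodes.reduce((best, node) =>
    score(docId, node) > score(docId, best) ? node : best
  );
}

// routeDocument("doc-42") returns the same node on every gateway instance.
```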
Architecture 2: CRDT-Based Conflict-Free Replication
We implemented CRDTs (Conflict-free Replicated Data Types) as the core data model for collaborative documents. Specifically, we used sequence CRDTs with vector clocks, enabling:
- Commutative operations: Operations from different users can be applied in any order and still yield the same result.
- Associative and idempotent delivery: Duplicate operations and out-of-order delivery don't break consistency.
- Intention preservation: Users' edits are preserved without unnatural interference from concurrent changes.
Our CRDT implementation used a hybrid approach: a Yjs-based sequence CRDT for plain-text editing and custom CRDTs for structured data (task lists, tables, mentions). Every character position carries a unique identifier composed of a user ID and a Lamport timestamp, enabling deterministic ordering. Merge operations run in O(n log n) time rather than O(n²) thanks to intelligent indexing and partial merging strategies.
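The sketch below illustrates this identifier scheme in TypeScript (the production engine is Go, as described in the implementation section): each character carries a (Lamport timestamp, user ID) pair, ordering is deterministic, and inserts are idempotent so duplicate delivery is harmless. Real sequence CRDTs such as Yjs additionally track an insertion origin to preserve intent; that machinery is omitted here for brevity.

```typescript
// Simplified sequence CRDT: totally ordered character IDs, tombstoned deletes.
type CharId = { lamport: number; userId: string };
type Char = { id: CharId; value: string; deleted: boolean };

// Deterministic total order: Lamport timestamp first, user ID as tie-breaker.
function compareIds(a: CharId, b: CharId): number {
  if (a.lamport !== b.lamport) return a.lamport - b.lamport;
  return a.userId < b.userId ? -1 : a.userId > b.userId ? 1 : 0;
}

class TextCrdt {
  private chars: Char[] = []; // kept sorted by compareIds

  insert(ch: Char): void {
    // Idempotent: applying the same remote operation twice is a no-op.
    if (this.chars.some((c) => compareIds(c.id, ch.id) === 0)) return;
    const i = this.chars.findIndex((c) => compareIds(ch.id, c.id) < 0);
    i === -1 ? this.chars.push(ch) : this.chars.splice(i, 0, ch);
  }

  delete(id: CharId): void {
    // Tombstone rather than remove, so concurrent operations still resolve.
    const c = this.chars.find((c) => compareIds(c.id, id) === 0);
    if (c) c.deleted = true;
  }

  toString(): string {
    return this.chars.filter((c) => !c.deleted).map((c) => c.value).join("");
  }
}

// Replicas applying the same operations in different orders converge:
const a = new TextCrdt(), b = new TextCrdt();
const c1: Char = { id: { lamport: 1, userId: "alice" }, value: "H", deleted: false };
const c2: Char = { id: { lamport: 2, userId: "bob" }, value: "i", deleted: false };
a.insert(c1); a.insert(c2);
b.insert(c2); b.insert(c1);
console.log(a.toString() === b.toString()); // true: both read "Hi"
```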
Architecture 3: Multi-Region Active-Active Deployment
To achieve 99.99% uptime and sub-100ms latency globally, we deployed collaboration regions in AWS us-east-1, eu-west-1, and ap-southeast-2, with global document routing based on user location. The system operates in active-active mode:
- Users connect to the nearest region
- Document state replicates asynchronously between regions with <15-second consistency lag
- Regional failover is automatic and transparent to users
- Cross-region replication uses CRDT's inherent convergence properties to maintain consistency
This approach eliminated single points of failure while maintaining strong consistency guarantees where it matters: within a single region's active session.
Implementation Details
The implementation spanned five major workstreams over 14 months. Here's how we executed each phase:
Phase 1: Infrastructure & Gateway (Months 1-3)
We built the foundation with a horizontally scalable WebSocket gateway using Rust for performance, deployed via Kubernetes with auto-scaling based on connection count. The gateway implemented connection multiplexing, reducing per-client overhead from ~100KB to ~15KB in server memory. Critical features implemented:
- WebSocket subprotocol for efficient binary messaging (custom protocol over msgpack)
- JWT-based authentication with automatic refresh
- Health checks and circuit breakers preventing cascade failures
- Connection draining during deployments (zero-downtime updates)
We deployed the gateway across three availability zones with Network Load Balancers handling 1M+ concurrent connections per instance. Connection lifetime metrics showed that 99.7% of connections remained stable for longer than the expected editing session duration (~45 minutes).
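The envelope for that binary subprotocol can be sketched with MessagePack as below; the field names and subprotocol identifier are assumptions for illustration, not the actual wire format:

```typescript
import { encode, decode } from "@msgpack/msgpack";

// Compact binary envelope; single-character keys keep frames small, which
// matters when multiplied across 100K+ connections.
interface Envelope {
  t: "op" | "presence" | "ack"; // message type
  d: string;                    // document ID
  s: number;                    // sender's Lamport clock
  p: unknown;                   // type-specific payload
}

export const frame = (msg: Envelope): Uint8Array => encode(msg);
export const unframe = (buf: Uint8Array): Envelope => decode(buf) as Envelope;

// Usage over a browser WebSocket (subprotocol name is hypothetical):
// const ws = new WebSocket(url, "collab.v1");
// ws.binaryType = "arraybuffer";
// ws.send(frame({ t: "op", d: "doc-42", s: 17, p: { /* CRDT op */ } }));
```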
Phase 2: Collaboration Engine & CRDT (Months 4-8)
We developed the core collaboration engine in Go, leveraging Redis Cluster for pub/sub coordination and PostgreSQL for persistence. The CRDT implementation tracked:
- User identity and permissions
- Lamport clocks and vector timestamps for causal ordering
- Document tree structure with position identifiers
- Operation history for undo/redo and session replay
The engine maintained 15KB average per-active-document state in Redis, enabling 10,000+ simultaneous documents per node. Memory profiling showed our custom allocator reduced GC pressure by 60% compared to generic Go data structures. Circuit breaker patterns isolated failing documents without affecting others.
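The engine itself is Go, but the per-document fan-out pattern is easy to sketch in TypeScript with ioredis: operations are appended to a per-document Redis Stream, and each collaboration node tails the streams for the documents its clients have open. The key naming is an assumption:

```typescript
import Redis from "ioredis";

const redis = new Redis();

// Append a CRDT operation to the document's stream; the returned stream ID
// doubles as a resumable cursor. (Binary payloads would use ioredis's Buffer
// variants, e.g. xreadBuffer; strings keep this sketch simple.)
export async function publishOp(docId: string, op: string): Promise<string> {
  return (await redis.xadd(`doc:${docId}:ops`, "*", "op", op))!;
}

// Tail the stream and hand each new operation to the WebSocket fan-out layer.
export async function tailOps(docId: string, onOp: (op: string) => void): Promise<void> {
  let cursor = "$"; // "$" = only entries added after we start listening
  for (;;) {
    const res = await redis.xread("BLOCK", 5000, "STREAMS", `doc:${docId}:ops`, cursor);
    if (!res) continue; // block timed out with no new entries
    for (const [, entries] of res) {
      for (const [id, fields] of entries) {
        cursor = id;     // resume point for the next XREAD
        onOp(fields[1]); // fields = ["op", <payload>]
      }
    }
  }
}
```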
Phase 3: Client-Side Integration (Months 9-11)
We replaced the client's legacy collaboration library with our optimized CRDT client SDK, available for Web (TypeScript), iOS, and Android. Key client-side optimizations:
- Efficient local rendering using virtual DOM diffing (React on the web), QuartzCore (iOS), and RecyclerView (Android)
- Smart update batching: grouping rapid successive edits to avoid excessive re-renders
- Offline editing queue with automatic sync on reconnection
- Local presence indicators (cursors, selections) with eventual consistency
The client SDK was engineered to be resilient to poor network conditions, with exponential backoff reconnection and operation buffering that could survive minutes of disconnection without data loss.
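A condensed sketch of that resilience behavior, with illustrative names rather than the SDK's real API: operations queue locally, the queue drains on (re)connect, and reconnection uses capped exponential backoff with jitter.

```typescript
type Op = Uint8Array;

class ResilientConnection {
  private ws?: WebSocket;
  private pending: Op[] = []; // survives disconnection; flushed in order
  private attempt = 0;

  constructor(private url: string) {
    this.connect();
  }

  send(op: Op): void {
    this.pending.push(op);
    this.flush();
  }

  private connect(): void {
    this.ws = new WebSocket(this.url);
    this.ws.binaryType = "arraybuffer";
    this.ws.onopen = () => {
      this.attempt = 0;
      this.flush(); // drain everything queued while offline
    };
    this.ws.onclose = () => this.scheduleReconnect();
  }

  private flush(): void {
    if (this.ws?.readyState !== WebSocket.OPEN) return;
    // The real SDK dequeues only on server ack; this sketch is optimistic.
    while (this.pending.length > 0) this.ws.send(this.pending.shift()!);
  }

  private scheduleReconnect(): void {
    // 1s, 2s, 4s, ... capped at 30s, plus jitter to avoid thundering herds.
    const delay = Math.min(1000 * 2 ** this.attempt++, 30_000);
    setTimeout(() => this.connect(), delay + Math.random() * 250);
  }
}
```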
Phase 4: Observability & Monitoring (Months 12-13)
We deployed a comprehensive observability stack to provide visibility into the distributed system:
- Distributed tracing: OpenTelemetry instrumentation across all services, capturing 100% of requests with roughly 1ms of overhead per request
- Real-time metrics: Prometheus scrapes 10,000+ metrics per second including connection counts, operation latency percentiles, CRDT merge times
- Log aggregation: Structured JSON logs with request correlation IDs
- Business metrics: Active editors per document, conflict rates, regional usage patterns
- Alerting: PagerDuty integration with smart escalation policies
Dashboards built with Grafana provided both real-time operations views and historical trend analysis. We implemented automated anomaly detection using ML-based thresholding, catching performance regressions before customers reported them.
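As an illustration of the instrumentation style, the sketch below wraps a CRDT merge in a span and records its duration as a histogram using the OpenTelemetry JS API (the Go services use the equivalent Go packages; the span and metric names here are hypothetical):

```typescript
import { trace, metrics } from "@opentelemetry/api";

const tracer = trace.getTracer("collab-engine");
const meter = metrics.getMeter("collab-engine");
const mergeDuration = meter.createHistogram("crdt.merge.duration", { unit: "ms" });

// Wrap any merge in a traced, timed scope. Correlating merge duration with
// document attributes is exactly how slow-merge hotspots get found (see
// Lesson 3 below).
export function mergeWithTelemetry<T>(docId: string, merge: () => T): T {
  return tracer.startActiveSpan("crdt.merge", (span) => {
    const start = performance.now();
    try {
      return merge();
    } finally {
      mergeDuration.record(performance.now() - start, { "doc.id": docId });
      span.end();
    }
  });
}
```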
Phase 5: Migration & Rollout (Months 13-14)
We executed a phased migration strategy to minimize risk:
- Shadow traffic: 1% of live sessions routed to the new stack for 4 weeks, with results compared against the legacy system
- Canary releases: Incremental rollout by document, then user segment, then region
- Feature flags: Instant rollback capability via LaunchDarkly integration
- Parallel runs: Some organizations were kept on the legacy system while we validated performance on the new stack
Migration was completed without customer-visible downtime or data loss. The final switchover involved a brief 5-minute read-only window for document synchronization, with clear user messaging. Post-migration monitoring confirmed all SLAs were met within 24 hours.
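The flag gate itself can be a few lines with the LaunchDarkly server SDK, as sketched below; the flag key and context shape are assumptions. The essential property is that falling back to the legacy stack is a flag flip, not a redeploy:

```typescript
import * as LaunchDarkly from "launchdarkly-node-server-sdk";

const ld = LaunchDarkly.init(process.env.LD_SDK_KEY!);

// Decide per user which collaboration stack serves the session.
export async function pickCollabStack(userKey: string): Promise<"new" | "legacy"> {
  await ld.waitForInitialization();
  const useNewStack = (await ld.variation(
    "new-collab-stack",             // hypothetical flag key
    { kind: "user", key: userKey },
    false                           // default to legacy if LaunchDarkly is unreachable
  )) as boolean;
  return useNewStack ? "new" : "legacy";
}
```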
Results & Impact
The system has been in production for 9 months as of February 2026. Here are the measurable outcomes:
Performance Achievements
| Metric | Legacy System | New System | Improvement |
|---|---|---|---|
| Operation latency (p50) | 2,400ms | 68ms | 35× faster |
| Peer update propagation (p50) | 3,000ms | 89ms | 34× faster |
| Conflict incidents per million ops | 847 | 0.03 | 99.996% reduction |
| Monthly uptime SLA | 99.5% | 99.996% | Exceeded target |
The system currently processes 2.3 billion real-time operations per month with no significant incidents. Peak concurrent active editors have reached 127,000 across 8,400 simultaneous documents. During peak loads the 99th-percentile operation latency remains below 120ms, so the sub-100ms target is met at the median and nearly met at p99.
Business Impact
- Revenue recovery: Eliminated $4.2M annual churn; net new ARR increased by $11.3M within 6 months post-launch
- Cost reduction: Migration to optimized Kubernetes clusters and spot instances reduced monthly infrastructure spend by 38% ($142K/month savings)
- Competitive differentiation: Real-time performance became a primary sales differentiator, cited in 63% of won deals in the subsequent quarter
- Customer satisfaction: CSAT scores for collaboration features rose from 3.2/5.0 to 4.7/5.0
- Feature velocity: New collaboration features could be built in weeks rather than months thanks to the solid foundation
Key Performance Metrics
Our measurement strategy focused on both technical health and business outcomes. The most important metrics included:
Operational Metrics
- Uptime: 99.996% over 14 months (roughly 25 minutes of cumulative downtime), comfortably exceeding the 99.99% target of no more than 52 minutes of downtime per year
- Latency: P50 operation latency 68ms, P99 118ms, P99.9 245ms
- Throughput: 85,000 operations/second sustained peak, 127,000 operations/second burst capacity
- Connection stability: 99.97% of connections survive expected session duration (>45 minutes); reconnection success rate 99.92%
- Error rate: Client-side errors: 0.04% of operations; Server-side errors: 0.002% of operations
Experience Metrics
- Perceived latency: Average user-reported smoothness 4.6/5.0
- Edit conflicts: 0.03 incidents per million operations (effectively zero)
- Offline recovery: 100% success rate for sync after 10+ minutes of disconnection
- Data durability: 0 data loss incidents; journaling provides point-in-time recovery
Scalability Metrics
- Concurrent users: A single document supports 350+ concurrent editors; the platform supports 100K+ concurrent editors in total
- Documents: 8,400+ simultaneously active documents; millions total
- Memory efficiency: 15KB per active document overhead in Redis; supports 10K+ docs per node
- Horizontal scaling: Adding nodes increases capacity linearly, with no scaling bottlenecks identified to date
Lessons Learned & Technical Insights
This project yielded several insights that have shaped our approach to distributed systems engineering going forward:
Lesson 1: CRDTs Make Consistency an Architectural Choice
CRDTs remove the usual fork in the road between consistency and availability by design. Their mathematical guarantees eliminate the need for complex conflict-resolution UI flows and user decision-making. However, CRDTs aren't a silver bullet: they require careful data modeling, can increase storage overhead, and their merge semantics aren't always intuitive to end users. Our rule: use CRDTs for collaborative text and structured data where merge intention is clear; fall back to operational transforms or custom conflict resolution for complex structures where user intent is ambiguous.
Lesson 2: Latency Budgets Must Be Engineered, Not Hoped For
Sub-100ms latency across continents requires deliberate engineering at every layer: network (dedicated inter-region links), protocols (binary over text), caching (local-first operations), and data structures (efficient merge formats). Our latency budget allocated 25ms for client processing, 30ms for network transit, 30ms for server processing, and 15ms of headroom for edge cases; each component had to meet its target for the system to succeed.
Lesson 3: Observability Isn't Optional Infrastructure
We invested in observability from day one, and it paid dividends during migration and early production. Without distributed tracing, we wouldn't have identified a critical hotspot in the CRDT merge algorithm that was causing occasional 3-second stalls. The metrics showed a clear correlation between document size and merge time, leading us to implement incremental merging and indexing strategies.
Lesson 4: The Human Factor in Real-Time Systems
Technical consistency doesn't equate to good user experience if the application doesn't handle network realities gracefully. We learned to design for failure: offline editing queues, graceful degradation during poor connectivity, and clear status indicators. Users tolerate occasional slowness better than uncertainty about whether their edits are saved.
Lesson 5: Testing Real-Time Systems Requires New Approaches
Conventional testing frameworks can't simulate hundreds of simultaneous users editing the same document. We built a dedicated chaos testing harness that could simulate 10,000+ concurrent WebSocket connections with realistic edit patterns. We also used property-based testing to validate CRDT convergence across randomly generated operation orderings, which proved critical for ensuring correctness.
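Such a property test can be surprisingly small. The sketch below uses fast-check and assumes the TextCrdt class from the earlier CRDT sketch is in scope; the property asserts that applying the same operations in opposite orders leaves both replicas with identical text, and fast-check shrinks any failure to a minimal counterexample:

```typescript
import fc from "fast-check";

// Arbitrary CRDT insert operations with (lamport, userId) identifiers.
const opArb = fc.record({
  id: fc.record({
    lamport: fc.nat(1000),
    userId: fc.constantFrom("alice", "bob", "carol"),
  }),
  value: fc.string({ minLength: 1, maxLength: 1 }),
  deleted: fc.constant(false),
});

fc.assert(
  fc.property(
    // Unique IDs: two distinct ops must never share an identifier.
    fc.uniqueArray(opArb, {
      maxLength: 50,
      selector: (op) => `${op.id.lamport}:${op.id.userId}`,
    }),
    (ops) => {
      const a = new TextCrdt();
      const b = new TextCrdt();
      ops.forEach((op) => a.insert(op));
      [...ops].reverse().forEach((op) => b.insert(op)); // same ops, opposite order
      return a.toString() === b.toString(); // convergence property
    }
  )
);
```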
Lesson 6: Cost Optimization Is Multi-Dimensional
Infrastructure cost savings didn't just come from cloud provider negotiations or reserved instances. They emerged from architectural decisions: stateless edge nodes enabling spot instance usage, Redis memory optimizations reducing cluster size, and efficient binary protocols cutting egress costs by 72% compared to the legacy JSON-over-polling approach.
Final Reflection
This project reinforced a core belief at Webskyne: architectural choices made early have compounding effects throughout a system's lifecycle. The decision to use CRDTs demanded careful implementation but paid continuous dividends in reliability and simplicity. The commitment to observability required upfront investment but prevented costly outages. The focus on developer experience meant the team could maintain velocity long after the initial implementation.
The most rewarding outcome wasn't the technical achievement but the business impact: our client transformed collaboration from a liability into their most-cited differentiator, winning enterprise contracts that had previously been out of reach. That's the true measure of engineering success β not benchmarks met but business value delivered.
Technical Architecture Diagram
Figure 1: High-level architecture of the multi-region collaboration platform, featuring stateless WebSocket gateways, stateful Redis-backed collaboration regions, and PostgreSQL persistence layers with cross-region replication.
Technology Stack
- Infrastructure: AWS EKS (Kubernetes), Terraform, AWS Global Accelerator
- Gateway: Rust WebSocket server, Tower middleware, MessagePack serialization
- Collaboration Engine: Go, Redis Cluster, PostgreSQL 14 with logical replication
- Client SDK: TypeScript (React), Swift (iOS), Kotlin (Android)
- Observability: OpenTelemetry, Prometheus, Grafana, PagerDuty, Sentry
- Feature Flags: LaunchDarkly for gradual rollouts and instant rollback
- CI/CD: GitHub Actions with canary deployment automation
