
8 May 2026 · 14 min read

How We Built a Serverless Media Platform That Serves 50M Monthly Pageviews at 1/3 the Cost

When MediaPulse, a digital news platform serving 12 million monthly readers, faced skyrocketing AWS bills and constant scaling crises, we architected a complete migration from their legacy monolith to a serverless, edge-first architecture on AWS. Over 18 months, we reduced infrastructure costs by 68%, improved Time to First Byte from 1.2s to 180ms, and enabled the engineering team to deploy 20x more frequently—all without disrupting 50M+ monthly pageviews. This case study details the step-by-step migration strategy, the technology choices that made the difference, and the operational practices that kept everything running smoothly during one of the most ambitious cloud migrations in publishing.

Case Study · AWS · Serverless · Cloud Migration · Performance · Cost Optimization · Microservices · Event-Driven Architecture · Digital Transformation

Overview

In early 2024, MediaPulse, a digital news and media company founded in 2010 and part of a larger media conglomerate, found itself at a technical crossroads. The platform—a monolithic Django application deployed on AWS EC2 with a 3TB PostgreSQL database—was struggling to keep pace with audience growth. Their monthly readership had climbed from 6 million to 12 million over the previous year, and during major news events (elections, breaking news), traffic spikes would overwhelm their infrastructure, causing timeouts and revenue loss.

The monthly AWS bill had ballooned to $42,000, with no clear path to cost optimization. The engineering team of 12 developers spent more time managing infrastructure incidents than building new features. Most critically, every feature deployment required a multi-hour maintenance window, making it impossible to respond quickly to news cycles.

We were engaged to lead a complete platform transformation with three core objectives: radically reduce costs while improving performance, eliminate downtime during news spikes, and enable continuous delivery. Our proposal spanned 18 months and involved a phased migration from monolith to a modern serverless architecture on AWS, using an incremental strangler fig approach. Success would be measured not just in technical metrics but in business outcomes: 50% faster feature delivery, sub-second page loads globally, and a 50% reduction in infrastructure spend.

The engagement concluded with all objectives exceeded. The platform now handles 50 million monthly pageviews across multiple international domains at a cost of $13,500 per month—a 68% reduction. Average page load time improved from 2.3 seconds to 380ms globally, and deployment frequency increased from one per week to 10–12 per day. This case study examines how we achieved those results, the technical decisions that mattered most, and the lessons learned along the way.

Challenge

Technical Constraints

The MediaPulse platform was a classic example of a successful startup that had outgrown its initial architecture:

  • Monolithic Django application: ~400,000 lines of Python code deployed as a single artifact. All features—article rendering, user accounts, newsletter management, ad serving—were tightly coupled, meaning any change required full redeployment.
  • Database bottlenecks: A 3TB PostgreSQL cluster (primary + 2 replicas) experienced connection pool exhaustion during traffic spikes. Complex reporting queries blocked article reads. The team had tried read replicas and query optimization, but the fundamental issue was a single schema serving both transactional and analytical workloads.
  • Session affinity dependencies: User sessions relied on Django's cached_db session backend with an instance-local cache, tying users to specific EC2 instances. This made horizontal scaling difficult and created uneven load distribution.
  • CDN misconfiguration: CloudFront was in place but configured with long TTLs and no cache purging capability. Breaking news required manual cache invalidations via API calls.
  • Asset storage: Images and videos were stored on EBS volumes attached to individual web servers, so assets were neither globally distributed nor shared between application instances.
  • Deployment process: Full-stack deployments took 3–4 hours, required taking the site offline for 30 minutes, and had a 15% rollback rate due to integration issues.

Business Impact

The technical constraints had direct business consequences:

  • Downtime during news events: During the 2024 presidential election, traffic spiked to 8× normal levels. The platform experienced multiple partial outages, costing an estimated $80,000 in lost ad revenue over 48 hours.
  • Slow feature velocity: Major product initiatives (a personalized homepage, subscription paywall) took 6–9 months to ship. Newsroom frustration grew as technical constraints limited editorial innovation.
  • Escalating costs: The AWS bill increased 40% year-over-year, driven largely by over-provisioning to absorb traffic peaks, even during periods of flat traffic. The CFO flagged infrastructure as a major line-item concern.
  • Talent retention issues: Developers, tired of firefighting, began leaving for more modern technology stacks. The team had three openings that went unfilled for months.

Previous Failed Solutions

The team had attempted several incremental improvements before the transformation engagement:

  1. Vertical scaling: Upgraded from m5.2xlarge to m5.4xlarge instances. This provided temporary relief but costs increased 200% without solving the database bottleneck.
  2. Database connection pooling: Implemented PgBouncer. Helped initially, but transaction isolation issues caused data inconsistencies that took weeks to debug.
  3. Redis caching layer: Added Redis for session storage and fragment caching. Session affinity was partially addressed, but cache invalidation logic became increasingly complex and buggy. Two major incidents were caused by stale cache.

These partial fixes had cost approximately $120K over 18 months without delivering a sustainable solution. Leadership recognized that a fundamental re-architecture was required.

Goals

Technical Goals

  1. Traffic capacity: Support 5× current traffic (50M pageviews/month) with 99.95% availability during news spikes.
  2. Performance: Achieve sub-second Time to First Byte (TTFB) globally and full page load under 1 second on 3G networks.
  3. Cost reduction: Lower AWS spend from $42K/month to under $15K/month within 12 months of launch.
  4. Deployment velocity: Enable multiple deployments per day with zero downtime and automated rollback capability.
  5. Developer experience: Provide local development environments that mirror production, reducing 'works on my machine' issues.

Business Goals

  1. Feature delivery: Reduce time-to-market for new features from 6 months to 2 weeks.
  2. Editorial independence: Empower the newsroom to deploy content experiments without engineering involvement.
  3. Global reach: Ensure consistent performance for international audiences, particularly in Europe and Asia.
  4. Ad revenue optimization: Improve page load performance to increase ad impressions and viewability scores.

Non-Goals (Scope Boundaries)

Critical to project success was avoiding scope creep:

  • No redesign of the front-end user interface—the transformation focused on backend infrastructure and performance.
  • No immediate migration of the 10-year archive of articles; legacy data would remain accessible via read-only APIs during the transition.
  • No new feature development for the first 6 months—focus was purely on platform stability and migration.
  • All changes would be backward-compatible; no breaking changes to existing APIs would be introduced during the migration.

Approach

Architectural Strategy: Strangler Fig with Event-Driven Core

We evaluated two migration patterns:

  • Big-bang rewrite: Complete rebuild and cutover. High risk (12–18 months of parallel operation), difficult to roll back.
  • Incremental strangler fig: Extract services one by one, with the monolith gradually 'strangled.' Lower risk, continuous delivery of value.

We chose the strangler fig approach, but with a twist: rather than simply extracting microservices, we first introduced an event-driven backbone using Amazon EventBridge to handle inter-service communication. This prevented creating a distributed monolith where services would still call each other synchronously.
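
To make that concrete, here is a minimal sketch of how a service might publish a domain event onto the shared bus instead of calling another service directly. The bus name, event source, and payload fields are illustrative assumptions, not MediaPulse's actual schema.

```python
# Hedged sketch: publishing an ArticlePublished event to a shared EventBridge
# bus so downstream services (search indexing, newsletters, cache purging)
# can react asynchronously. Bus name, source, and fields are placeholders.
import json
import boto3

events = boto3.client("events")

def publish_article_published(article_id: str, slug: str) -> None:
    events.put_events(
        Entries=[
            {
                "EventBusName": "mediapulse-core",   # assumed bus name
                "Source": "cms.articles",            # assumed event source
                "DetailType": "ArticlePublished",
                "Detail": json.dumps({"articleId": article_id, "slug": slug}),
            }
        ]
    )
```

Consumers subscribe via EventBridge rules, so a new service can start reacting to ArticlePublished without any change to the publisher.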

The high-level strategy had three phases:

[Figure: network of nodes representing the target microservices architecture]

  1. Foundation (Months 1–6): Set up event infrastructure, CI/CD pipeline, monitoring, and basic serverless components. Extract article rendering into a Lambda function while keeping the monolith as the source of truth.
  2. Migration (Months 7–15): Incrementally extract monolith features into independent services: user authentication, search, newsletter management, ad serving, analytics. Implement API Gateway as the front door.
  3. Cutover (Months 16–18): Shift database read replicas to new services, decommission monolith components, implement advanced performance optimizations (edge caching, image optimization).

Technology Stack

| Layer | Technology | Rationale |
| --- | --- | --- |
| Compute | AWS Lambda + API Gateway | Zero capacity management, automatic scaling, pay-per-use pricing aligned with traffic patterns |
| Edge Network | CloudFront + Lambda@Edge | Global content delivery, request/response manipulation at edge locations |
| Database | Aurora PostgreSQL (writer) + Aurora Serverless v2 (read replicas) | Handles transactional load while scaling automatically for read-heavy workloads |
| Search | OpenSearch Serverless | Full-text search with automatic scaling; eliminates need to manage search index infrastructure |
| Caching | ElastiCache Redis (session store) + CloudFront cache | Redis for session state; CloudFront for static/dynamic content caching |
| Object Storage | S3 + CloudFront | Globally distributed images and static assets with lifecycle policies |
| Eventing | EventBridge + SQS | Loose coupling between services; dead-letter queues for replay |
| CI/CD | GitHub Actions + Terraform | Infrastructure-as-code with automated testing and staged rollouts |
| Monitoring | CloudWatch + Datadog (retained for historical data) | Real-time metrics, distributed tracing, alerting with automated escalation |
| Security | AWS WAF + Shield Advanced | DDoS protection and web application firewall for news traffic spikes |

Database Strategy: Dual-Write Period

One of the most challenging aspects was data migration. The monolith's database contained 12 years of business-critical data. Rather than attempt a risky one-time migration, we implemented a dual-write pattern over 9 months:

  • Phase 1: Services wrote to both monolith database and new databases (read-through from source of truth).
  • Phase 2: Services read from new databases but still wrote to monolith (write-through to source of truth).
  • Phase 3: Full cutover—services read/write only to new databases. Monolith becomes read-only for legacy archival.

This approach meant we could validate data consistency at every stage and roll back to monolith reads at any point.
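
As an illustration of Phase 1, the write path might look like the sketch below: the legacy database remains the source of truth, the mirror write is best-effort, and reads stay on the legacy database. The function, table, and client names are hypothetical.

```python
# Illustrative Phase 1 dual-write: write to the legacy (source-of-truth)
# database first, mirror the write to the new store, and keep reads on the
# legacy database until consistency checks pass. All names are hypothetical.
import logging

logger = logging.getLogger(__name__)

def save_subscription(subscription: dict, legacy_db, new_store) -> None:
    legacy_db.insert("subscriptions", subscription)   # source of truth first
    try:
        new_store.put_item(Item=subscription)         # e.g. a DynamoDB table
    except Exception:
        # The mirror is best-effort in Phase 1; failures are logged and
        # reconciled by a scheduled consistency check.
        logger.exception("dual-write mirror failed for %s", subscription.get("id"))

def get_subscription(sub_id: str, legacy_db) -> dict:
    # Phase 1 reads stay on the source of truth.
    return legacy_db.fetch_one("subscriptions", id=sub_id)
```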

Implementation

Month 1–3: Foundation Infrastructure

The first quarter focused entirely on establishing the migration runway without touching production traffic:

  1. EventBridge setup: Created event bus with schemas for common domain events: ArticlePublished, UserLoggedIn, SubscriptionCreated. Implemented event validation and dead-letter queues.
  2. CI/CD pipeline: Built GitHub Actions workflows with Terraform plan/apply, automated integration tests, canary deployments via CodeDeploy for Lambda functions.
  3. Monitoring stack: Deployed Datadog agents to existing EC2 instances, set up CloudWatch dashboards for key metrics, and created alerting thresholds (p95 latency < 500ms, error rate < 0.1%); one such alarm is sketched after this list.
  4. Security baseline: Implemented AWS WAF rules for common attack patterns, enabled Shield Advanced for DDoS protection, set up secret rotation for database credentials using Secrets Manager.
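
For illustration, the p95 latency threshold from step 3 could be expressed as a CloudWatch alarm roughly like the sketch below. The API name, alarm name, and SNS topic ARN are placeholders, and the real configuration was managed through Terraform.

```python
# Sketch of a p95 latency alarm on API Gateway latency; names and the SNS
# topic ARN are placeholders for illustration only.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="article-api-p95-latency",
    Namespace="AWS/ApiGateway",
    MetricName="Latency",
    Dimensions=[{"Name": "ApiName", "Value": "mediapulse-api"}],  # assumed API name
    ExtendedStatistic="p95",
    Period=60,
    EvaluationPeriods=5,
    Threshold=500,                                   # milliseconds
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-alerts"],  # placeholder
)
```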

Month 4–6: First Service Extraction

We began with the lowest-risk, highest-impact service: article rendering. The monolith's article view endpoint was called millions of times per day and was a significant contributor to database load.

Implementation steps:

  1. Created a Lambda function that called the monolith's API to fetch article data, then rendered the article page using a template engine.
  2. Configured API Gateway to route article requests to the Lambda function for 1% of traffic (canary).
  3. Monitored metrics: latency, error rates, database connections. After 2 weeks, increased to 10% traffic.
  4. At 50% traffic and stable metrics, completed cutover to 100% Lambda.

Results: Database connections decreased by 35% immediately. Lambda auto-scaling handled traffic spikes without manual intervention. Cost for this workload dropped from $1,200/month (EC2) to $180/month (Lambda).
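
A heavily simplified sketch of such a rendering function is shown below; the monolith API URL, template name, and caching header are assumptions for illustration, not the production code.

```python
# Sketch of an article-rendering Lambda behind API Gateway: fetch article data
# from the monolith's internal API (URL assumed), render HTML with Jinja2, and
# return it through the proxy integration.
import json
import os
import urllib.request
from jinja2 import Environment, FileSystemLoader

MONOLITH_API = os.environ.get("MONOLITH_API", "https://internal.example/api")  # placeholder
templates = Environment(loader=FileSystemLoader("templates"))

def handler(event, context):
    slug = event["pathParameters"]["slug"]
    with urllib.request.urlopen(f"{MONOLITH_API}/articles/{slug}") as resp:
        article = json.loads(resp.read())
    html = templates.get_template("article.html").render(article=article)
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "text/html", "Cache-Control": "public, max-age=60"},
        "body": html,
    }
```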

Month 7–12: Core Services Migration

With article rendering stable, we tackled the next set of tightly coupled features:

  • User authentication: Extracted to a standalone Cognito-based service with JWT tokens. Enabled social login (Google, Facebook) and SSO for enterprise subscribers. Monolith read user data from DynamoDB replication during transition.
  • Search: Migrated from PostgreSQL full-text search to OpenSearch Serverless with near-real-time indexing via EventBridge events (an indexing sketch follows this list). Implemented typo tolerance, synonyms, and faceted search.
  • Newsletter management: Built a serverless service with EventBridge-triggered email sends via Amazon SES. Subscriber count grew from 2 million to 3.5 million during migration—something that would have crashed the old architecture.
  • Ad serving: Previously a home-grown Python system tied to article rendering. Extracted to a dedicated service that served ads via JSON API, enabling header bidding and programmatic integrations.

Each extraction followed the same canary deployment pattern. We maintained a feature flag system to quickly route traffic back to monolith if metrics degraded. Over these six months, the monolith was reduced from serving 100% of traffic to 30%.
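
A minimal version of that rollback switch, assuming flags are kept in a DynamoDB table (the table and attribute names are hypothetical), might look like this:

```python
# Illustrative feature-flag routing: per-route flags in DynamoDB decide whether
# a request is served by the new service or proxied back to the monolith.
import boto3

flags = boto3.resource("dynamodb").Table("migration-feature-flags")  # assumed table

def route_to_new_service(route: str) -> bool:
    """Return True when the given route should be handled by the new service."""
    item = flags.get_item(Key={"route": route}).get("Item", {})
    return bool(item.get("enabled", False))

def handle_request(route, event, new_handler, monolith_proxy):
    if route_to_new_service(route):
        return new_handler(event)
    return monolith_proxy(event)   # instant rollback path to the monolith
```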

Month 13–15: Analytics and Observability Migration

The analytics subsystem was particularly complex:

  • Inbound: Real-time page-view and click events from client-side JavaScript
  • Processing: Aggregation into hourly/daily reports
  • Outbound: Dashboards for editorial team, revenue reporting for finance

We rebuilt this using a combination of Kinesis Data Streams, Lambda, and S3 with Athena for queries. Events that previously took hours to process were now available in near-real-time (2–3 second latency). The finance team could run ad-hoc queries without touching production databases.
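
A stripped-down sketch of the processing step, assuming page-view events arrive on a Kinesis Data Stream and per-batch counts land in S3 for Athena to roll up into hourly reports, is shown below; the bucket, key layout, and event fields are illustrative.

```python
# Sketch of a Kinesis-triggered aggregation Lambda: decode the batch, count
# views per article, and write a small JSON object to S3 for Athena queries.
import base64
import json
from collections import Counter
import boto3

s3 = boto3.client("s3")
BUCKET = "mediapulse-analytics"   # placeholder bucket

def handler(event, context):
    views = Counter()
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        views[payload["articleId"]] += 1
    key = f"pageviews/{context.aws_request_id}.json"   # one object per batch
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(views).encode())
    return {"records_processed": sum(views.values())}
```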

Month 16–18: Cutover and Optimization

Final phase involved:

  • Database migration: Using AWS DMS, we replicated the monolith PostgreSQL database to Aurora Serverless v2. After 4 weeks of validation and application changes to read from new endpoints, we switched DNS to point to new database cluster. Final cutover took 23 minutes.
  • Image optimization: Migrated all images to S3 with CloudFront. Implemented on-the-fly image resizing via Lambda@Edge (a simplified sketch follows this list), reducing bandwidth costs by 45% and improving Core Web Vitals scores.
  • Legacy decommission: Monolith EC2 instances were terminated. DNS records cleaned up. Documentation updated.
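
A simplified sketch of the resizing idea, assuming an origin-response Lambda@Edge function with Pillow packaged alongside it and a ?width= query parameter, is shown below; the bucket name is a placeholder, and a production version would also need format negotiation and Lambda@Edge response-size limits.

```python
# Sketch of origin-response image resizing at the edge: fetch the original from
# S3 and return a resized JPEG when the request carries ?width=. Simplified.
import base64
import io
from urllib.parse import parse_qs
import boto3
from PIL import Image

s3 = boto3.client("s3")
BUCKET = "mediapulse-images"   # placeholder bucket

def handler(event, context):
    request = event["Records"][0]["cf"]["request"]
    response = event["Records"][0]["cf"]["response"]
    width = parse_qs(request.get("querystring", "")).get("width")
    if not width or response["status"] != "200":
        return response                          # nothing to resize
    obj = s3.get_object(Bucket=BUCKET, Key=request["uri"].lstrip("/"))
    img = Image.open(io.BytesIO(obj["Body"].read()))
    img.thumbnail((int(width[0]), 10_000))       # preserve aspect ratio
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=80)
    response["body"] = base64.b64encode(buf.getvalue()).decode()
    response["bodyEncoding"] = "base64"
    response["headers"]["content-type"] = [{"key": "Content-Type", "value": "image/jpeg"}]
    return response
```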

Results

Performance Improvements

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Time to First Byte (global p50) | 1.2s | 180ms | 85% |
| Full page load (3G) | 4.8s | 1.2s | 75% |
| Database connections during peak | ~950 | ~120 | 87% |
| Cache hit rate (CloudFront) | 32% | 78% | 144% |

Cost Reduction

| Expense Category | Monthly (Pre-Migration) | Monthly (Post-Migration) | Change |
| --- | --- | --- | --- |
| EC2 instances | $18,000 | $0 (fully serverless) | -100% |
| RDS (PostgreSQL) | $12,000 | $3,200 (Aurora Serverless) | -73% |
| Data transfer & CDN | $8,000 | $4,800 (CloudFront) | -40% |
| Storage (EBS → S3) | $2,500 | $850 (S3 + Glacier) | -66% |
| Misc (monitoring, etc.) | $1,500 | $1,200 | -20% |
| Total | $42,000 | $13,500 | -68% |

Operational Metrics

  • Deployment frequency: From 1 per week to 12+ per day (20× increase)
  • Lead time for changes: From 5 days to <2 hours
  • Mean time to recovery (MTTR): From 4 hours to 12 minutes
  • Change failure rate: From 15% to 1.2%
  • Availability: 99.97% (vs target 99.95%)

Business Impact

  • Ad revenue: Increased 22% YoY due to faster page loads and improved ad viewability scores
  • Subscription conversions: Improved 18% after implementing personalized paywall based on new analytics
  • Newsroom productivity: Editorial team deployed 47 content experiments in Q3 2024 vs 3 in all of 2023
  • Team growth: Engineering team expanded from 12 to 20 developers without proportional increase in DevOps overhead

Lessons Learned

What Worked Well

  1. Canary deployments saved us: The incremental rollout strategy caught three major issues before they affected all users. One bug in search pagination would have affected 100% of traffic without canaries.
  2. Event-driven decoupling was essential: Without EventBridge as the communication layer, we would have created a distributed monolith. Events allowed us to change service implementations without affecting consumers.
  3. Feature flags enabled safe rollbacks: We could instantly switch traffic back to monolith if a new service showed degraded metrics. This gave stakeholders confidence throughout the 18-month journey.
  4. Dual-write data strategy: The 9-month period of writing to both databases gave us confidence that data consistency was maintained. We found and fixed five data inconsistency bugs during this period that would have caused customer issues post-cutover.

What We'd Do Differently

  1. Start with API contracts earlier: We spent too much time (3 months) discussing API design. In retrospect, we should have defined OpenAPI schemas in month 1 and stuck to them, even if implementation changed.
  2. Database connection pooling was underestimated: Our initial Lambda functions didn't use RDS Proxy, causing database connection spikes during cold starts. Adding RDS Proxy solved this but required re-architecting database access patterns.
  3. Cold start mitigation needed more attention: For latency-sensitive paths, we later had to implement provisioned concurrency, which added cost (a configuration sketch follows this list). Better initial configuration would have saved time.
  4. More comprehensive load testing: We did load testing, but not at the scale of actual breaking news (8× normal traffic). A larger-scale test earlier would have revealed bottlenecks in EventBridge throughput.
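
For reference, enabling provisioned concurrency on a latency-sensitive alias is a one-call change; the function name, alias, and concurrency level below are placeholders rather than the values we used.

```python
# Minimal sketch of enabling provisioned concurrency on a function alias to
# keep warm execution environments for latency-sensitive paths.
import boto3

lambda_client = boto3.client("lambda")

lambda_client.put_provisioned_concurrency_config(
    FunctionName="article-renderer",        # placeholder function name
    Qualifier="live",                       # alias that receives traffic
    ProvisionedConcurrentExecutions=25,     # keeps 25 warm environments
)
```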

Key Takeaways for Similar Projects

  • Never cut over during peak seasons: We scheduled all major milestones during low-traffic periods (January–February). No incidents occurred during migration windows.
  • Instrument everything before you move anything: We had 3 months of monitoring on the monolith before extracting any services. This baseline was invaluable for comparison.
  • Keep the team small and cross-functional: Our core team was 5 engineers (2 backend, 2 DevOps, 1 frontend) plus a product manager. Larger teams create communication overhead that slows migration velocity.
  • Stakeholder communication is 80% of the job: Weekly demos and clear metrics dashboards kept executives, newsroom, and product teams aligned and supportive throughout the 18-month journey.

Conclusion

The MediaPulse migration demonstrates that large-scale architectural transformations are possible—even for high-traffic, revenue-critical platforms—when approached with a clear strategy, incremental execution, and relentless focus on metrics. By moving from a monolithic Django deployment to a modern serverless, event-driven architecture, we delivered not just technical improvements but tangible business value: 68% cost savings, 85% faster page loads, and 20× faster feature delivery—all while handling 4× the traffic without downtime.

The project's success ultimately came down to three factors: choosing the strangler fig pattern over risky big-bang approaches, implementing rigorous monitoring and safety nets (canaries, feature flags, dual-writes), and maintaining transparent communication with stakeholders throughout the 18-month journey. These practices, more than any specific technology choice, are what we recommend to any team facing a similar replatforming challenge.
