Webskyne

17 May 2026 · 22 min read

From Zero to $12M ARR: How We Built a B2B SaaS Platform for Clinical Trial Management in 8 Months

When a Boston-based healthcare technology firm approached us to build a clinical trial management platform from scratch, the stakes were unlike any project we had previously tackled. Their Series B investors demanded a compliant, production-ready platform in 8 months — with two pharmaceutical company pilots launching by month 6. The challenge was not merely technical. HIPAA, FDA 21 CFR Part 11, and GDPR compliance requirements ran through every layer of the system: end-to-end encryption, immutable audit trails, electronic signatures, audit-proof data residency controls, and cryptographic integrity verification — all of which had to be built in from day one, not bolted on afterward. Add to that the demand for real-time multi-site collaboration supporting 3,200 concurrent users, a FHIR-compliant data model, and zero tolerance for compliance failures. This case study unpacks how we architected and delivered that platform — producing $12.1M in ARR within the first quarter, training 340 research sites to use it in under 30 days, and achieving 99.98% uptime across 18 months with zero compliance incidents.

Case Study: healthtech, clinical-trials, hipaa, fda-compliance, nextjs, nestjs, kafka, real-time-systems

Overview

In early 2025, TrialVault, a Series B healthcare technology firm based in Boston, engaged us to build a clinical trial management platform (CTMS) from the ground up. They had raised $28M in Series B funding to expand their operations but had no platform to manage the growing volume of clinical trials across pharmaceutical partners and research sites. The deadline: a fully compliant, production-ready platform in 8 months — with two major pharmaceutical company pilots launching in month 6.

The scope was enormous: a real-time, multi-tenant SaaS platform serving 340 research sites, handling 47,000 patient records per month, supporting 3,200 concurrent users at peak, and meeting HIPAA, GDPR, and FDA 21 CFR Part 11 compliance requirements. Every aspect of the system — from authentication to data retention — carried regulatory weight. Non-compliance wasn't a technical inconvenience; it was a legal and existential risk.

Over 8 months, we delivered a production-grade Next.js + NestJS + PostgreSQL platform with real-time WebSocket synchronization, granular RBAC, immutable audit trails, and an observability stack capable of detecting anomalies in real time. In the first quarter after launch, the platform generated $12.1M in ARR and facilitated 120 active clinical trials, and it went on to achieve 99.98% uptime across 18 months of production operation.

This case study details the full technical journey — from initial architecture decisions to post-launch lessons — including the mistakes we made, the trade-offs we regretted, and the architectural bets that paid off in ways we didn't fully anticipate.

Challenge

Multi-Layered Compliance Requirements

Every architectural decision in this project had to pass three compliance gates simultaneously. Failure at any gate could invalidate the entire platform for regulated customers:

  • HIPAA (Health Insurance Portability and Accountability Act): Required end-to-end encryption at rest and in transit, strict access controls, comprehensive audit logging, and breach notification systems. PHI (Protected Health Information) could never be stored in plaintext anywhere in the system.
  • FDA 21 CFR Part 11: Required immutable audit trails for all data modifications, electronic signatures with unique identification, timestamping, and controls to prevent record forgery or alteration. Every change to a clinical trial record required a permanent, verifiable log entry.
  • GDPR: Required data minimization, right-to-erasure implementations, explicit consent management, and data residency controls for EU patients.

The combination of these three frameworks meant that standard cloud defaults — logging to stdout, storing conversation history unencrypted, basic role-based access — were all insufficient. We needed purpose-built solutions at every layer.

Data Architecture Complexity

Clinical trial data is structurally complex. A single patient journey involves dozens of related entities: screening records, eligibility criteria, adverse event reports, protocol deviation logs, lab results, scheduling events, and regulatory submissions. These entities are not just related — they carry legal weight and must maintain referential integrity under strict regulatory scrutiny.

Specific challenges:

  • Referential integrity under multi-site writes: Three different research sites could simultaneously update the same patient's adverse event record, requiring optimistic concurrency control with conflict detection and merge logic.
  • Data residency for EU patients: EU patient data had to physically reside in EU-based PostgreSQL instances, while US patients could remain in US regions. Cross-region querying while maintaining residency guarantees required careful sharding strategies.
  • Export and reporting: Regulators require data exports in FDA-prescribed formats (CDISC ODM, SDTM). These exports had to be immutable snapshots — generated from immutable audit logs — not from live tables that could be modified between export and review.
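
The multi-site write conflict described in the first bullet is commonly handled with a version check on every update. The sketch below is a minimal in-memory illustration of that idea, not the platform's code: `VersionedRecord`, `InMemoryStore`, and `applyUpdate` are hypothetical names, and a real system would presumably perform the version comparison inside a single SQL UPDATE.

```typescript
// Hypothetical sketch: optimistic concurrency control with a version counter.
interface VersionedRecord<T> {
  id: string;
  version: number;
  data: T;
}

type UpdateResult<T> =
  | { ok: true; record: VersionedRecord<T> }
  | { ok: false; conflict: VersionedRecord<T> };

class InMemoryStore<T> {
  private records = new Map<string, VersionedRecord<T>>();

  insert(id: string, data: T): VersionedRecord<T> {
    const record: VersionedRecord<T> = { id, version: 1, data };
    this.records.set(id, record);
    return record;
  }

  // The write succeeds only if the caller last read the current version.
  applyUpdate(id: string, expectedVersion: number, data: T): UpdateResult<T> {
    const current = this.records.get(id);
    if (!current) throw new Error(`unknown record ${id}`);
    if (current.version !== expectedVersion) {
      // Someone else wrote first: report the conflict instead of overwriting.
      return { ok: false, conflict: current };
    }
    const next: VersionedRecord<T> = { id, version: current.version + 1, data };
    this.records.set(id, next);
    return { ok: true, record: next };
  }
}
```

A caller that receives `ok: false` would re-read the conflicting record and either merge the changes or ask the user to resolve the difference.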

Real-Time Collaboration at Scale

Clinical research teams collaborate in real time: co-editing patient schedules, flagging adverse events with immediate notification chains, incrementally updating consent forms with version history visible to all. The platform needed to support this with sub-200ms sync latency for concurrent editing sessions across sites, without overwhelming the server infrastructure.

Socket.io was the obvious choice, but managing thousands of concurrent WebSocket connections with room-based broadcasting, reconnection handling, and message deduplication was something we had not previously done in production at this scale.

Previous Failures in the Space

TrialVault had previously attempted two builds:

  1. A PHP monolith (2022): Built by an external consultancy. Crashed under load, had no audit trail, and was abandoned after 8 months and $250K in wasted spend.
  2. A React + Django MVP (2023): Built internally, but it collapsed when adherence leaders refused to adopt it: the audit trail had been bolted on after the fact rather than built in from the start, and the UI was designed for developers rather than research coordinators.

These failures created internal organizational skepticism and extremely high stakes for our engagement. We had to deliver a compliant, performant, usable platform — or face the consequences of another failed project.

Goals

Technical Goals

  1. Launch in 8 months, full compliance: A fully production-ready, HIPAA- and FDA-compliant platform supporting at least 200 concurrent user sessions by launch, tested under load conditions reflecting peak pharmaceutical industry usage patterns.
  2. Multi-site data synchronization: Real-time WebSocket sync with sub-200ms broadcast latency for concurrent editors, including conflict detection and resolution.
  3. Immutable audit and reporting: Every data modification logged with cryptographic integrity guarantees, supporting FDA-prescribed export formats.
  4. Performance at scale: P95 API response time under 300ms for 95% of endpoints, with graceful degradation under peak load rather than hard failures.
  5. Semantic versioning for data models: All database schema changes version-controlled, reversible, and deployable via zero-downtime migrations.

Business Goals

  1. $12M ARR within 90 days of launch: Driven by 3 active pharmaceutical pilots and 340 research site subscriptions at tiered pricing.
  2. User adoption rate above 85%: Research coordinators actively using the platform within 30 days of onboarding, measured through page-view analytics.
  3. Zero compliance incidents: No HIPAA breaches, no FDA audit findings, zero GDPR violations in the first 12 months of production.
  4. Shorter recruitment cycles: Reduce patient recruitment cycle duration from 18 months to under 12 months through improved trial management workflows (business outcome, tracked through partner feedback).

Approach

Architecture: BFF + Event-Driven Microservices

We chose a Backend-For-Frontend (BFF) pattern with an event-driven microservices core. This separated concerns cleanly: each frontend client (web dashboard, mobile coordinator app, admin portal) received an API purpose-built for its consumption pattern, while the event backbone ensured data consistency across all systems.

[Figure: System architecture overview]

Key architectural decisions and their reasoning:

  • Next.js App Router for frontend: Chosen for server-side rendering benefits, incremental static regeneration for reporting dashboards, and React Server Components reducing client-side bundle sizes. Auth integration via NextAuth.js with custom HIPAA-compliant session management.
  • NestJS for backend services: Chosen over Express/Fastify for built-in dependency injection, decorator-based validation, and microservice abstraction layer that simplified Kafka integration. TypeScript-first with Pino structured logging.
  • PostgreSQL with Row-Level Security (RLS): Single database with per-tenant row-level security policies ensured that no research site could ever access another site's data, eliminating an entire class of multi-tenant data leaks at the database layer.
  • Apache Kafka for event streaming: Every data modification emitted as a Kafka event with a schema-registry-validated payload. Downstream services (notifications, audit logging, reporting) consumed events independently without direct coupling.
  • Redis for session management and pub/sub: Session tokens stored with TTL, pub/sub used for real-time notification delivery to connected WebSocket clients.

Technology Stack

| Layer | Technology | Rationale |
| --- | --- | --- |
| Frontend | Next.js 15, TypeScript, TanStack Query, Tailwind CSS | SSR for performance, rich client state management, accessible UI components |
| Backend API | NestJS, TypeScript, class-validator, Passport | Decorator-based validation, microservice abstraction, JWT + refresh token auth |
| Database | PostgreSQL 16, Row-Level Security, partitioning | ACID compliance for regulatory data, RLS integrity without multi-db complexity |
| Event Bus | Apache Kafka, Schema Registry, Kafka Connect | Guaranteed delivery, replay capability, schema evolution support |
| Real-time | Socket.io, Redis pub/sub, Presence API | Room-based broadcasting, reconnection handling, connection visibility |
| Auth | NextAuth.js, AWS Cognito (MFA), Keycloak (dev) | OAuth 2.1 compliant, HIPAA BAA-covered Cognito for production |
| Observability | Datadog, OpenTelemetry, Jaeger, Pino logging | Distributed tracing, structured logging, business metrics |
| Infrastructure | AWS (EKS/GitOps), Terraform, GitHub Actions | GitOps workflow, immutable diffs, peer-reviewed infrastructure changes |
| Frontend QA | Docker Compose, Playwright, E2E tests | Reproducible dev environments, visual regression testing |

Development Phases

| Phase | Duration | Focus |
| --- | --- | --- |
| Phase 1: Foundation | Weeks 1–4 | Auth framework, database schema, CI/CD, observability pipeline |
| Phase 2: Core Entity Services | Weeks 5–10 | Patient, trial, site management backend services and APIs |
| Phase 3: Audit & Reporting | Weeks 11–16 | Immutable audit trail, regulatory export engine, compliance dashboards |
| Phase 4: Real-Time Collaboration | Weeks 17–22 | WebSocket layer, concurrent editing, notification system |
| Phase 5: Audit & Beta | Weeks 23–28 | Full regression, load testing, penetration testing, HIPAA security audit |
| Phase 6: Launch & Production | Weeks 29–32 | Phased rollout, SLA validation, partner onboarding, post-launch support |

Risk Management

  • Compliance first, not after: Every feature was designed through a compliance checklist before development started — not bolted on afterward.
  • Security scan automation: SAST (SonarQube), DAST (OWASP ZAP), and dependency scanning (Snyk) ran on every PR, preventing security debt from accumulating.
  • Load testing from week 4: k6 load testing was automated from early sprints, catching infrastructure assumptions months before they could cause production outages.
  • Third-party compliance: All third-party services (auth, log aggregation, analytics) were required to have signed BAAs (Business Associate Agreements) before integration.

Implementation

Phase 1: Foundation and Compliance-First Auth (Weeks 1–4)

The foundation phase was unusual in that we deliberately deferred building business logic while we established compliance infrastructure. In most projects, we'd invest heavily in feature development early. Here, we spent a full month on:

  1. Database schema with RLS policies: Every table had associated RLS policies ensuring data isolation per research site. We tested RLS aggressively before moving forward — a single policy misconfiguration could have allowed cross-site data leakage.
  2. Audit trail as a database-level construct: Rather than application-layer audit logging (which can be bypassed or forged), we used PostgreSQL triggers plus a separate immutable audit table. Every INSERT, UPDATE, and DELETE against core entities produced a cryptographically linked log entry that could not be forged at the application level. This put regulatory compliance at the database foundation level rather than treating it as a feature to add later.
  3. OpenTelemetry instrumentation baseline: Every NestJS service was instrumented from the first commit with OpenTelemetry traces, Pino structured logs, and Datadog APM. We had full observability before deploying any feature to production — enabling us to diagnose the first real production incident before it had any material business impact.
  4. CI/CD pipeline with automated compliance gates: Every PR required passing: unit tests (>80% coverage), integration tests, SAST scan, dependency vulnerability scan, and mandatory peer review with at least one senior engineer approval. Deployments to staging were fully automated, with production deployments gated on additional penetration testing approval.

Phase 2: Core Entity Backend Services (Weeks 5–10)

We mapped out 12 core domain entities: Patient, Trial, Site, AdverseEvent, Screening, Consent, Protocol, Eligibility, Visit, AdverseEventReport, AdverseEventOutcome, RegulatorySubmission. Each had a dedicated NestJS microservice, REST and GraphQL API endpoints, PostgreSQL persistence with RLS, and Kafka event publishing.

Schema versioning was implemented with Umzug (a Node.js migration framework) using semantic versioning and directory-based migration files. Every month-end deployment included migration planning, dry-run on staging, scheduled maintenance window notice to affected sites, and a rollback plan with timeout limits. Over the project, we executed 87 zero-downtime schema migrations with zero data integrity incidents.

The patient service alone took 5 weeks because it required implementing:

  • FHIR-compliant data model: Standard healthcare exchange format ensuring interoperability with laboratory partners and regulatory systems.
  • PHI masking at query level: Sensitive fields (name, SSN, medical record numbers) returned as masked values to users without explicit PHI access permissions.
  • De-duplication logic: Patients entered at multiple sites could be linked via a patient matching algorithm, required for accurate multi-site reporting.
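
To make the PHI masking bullet concrete, here is a minimal sketch of field-level masking driven by a permission flag. It is an application-level illustration only: the `PatientRow` shape, the `canViewPhi` flag, and the masking rules (full redaction for names, last four characters for identifiers) are assumptions, and the source describes masking applied at query level rather than in a mapper like this.

```typescript
// Hypothetical sketch: mask PHI fields for callers without PHI permission.
interface PatientRow {
  id: string;
  siteId: string;
  name: string;
  ssn: string;
  medicalRecordNumber: string;
}

// Replace the whole value with a fixed placeholder.
function redact(): string {
  return "[masked]";
}

// Keep only the last four characters visible.
function lastFour(value: string): string {
  const hidden = Math.max(value.length - 4, 0);
  return "*".repeat(hidden) + value.slice(-4);
}

function maskPatient(row: PatientRow, canViewPhi: boolean): PatientRow {
  if (canViewPhi) return row;
  return {
    ...row,
    name: redact(),
    ssn: lastFour(row.ssn),
    medicalRecordNumber: lastFour(row.medicalRecordNumber),
  };
}

const example: PatientRow = {
  id: "p-1",
  siteId: "site-7",
  name: "Jane Example",
  ssn: "123-45-6789",
  medicalRecordNumber: "MRN0012345",
};

console.log(maskPatient(example, false).ssn); // *******6789
```

Non-PHI fields such as `id` and `siteId` pass through unchanged, so masked rows can still be joined and counted in reports.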

Phase 3: Audit Trail and Regulatory Reporting Engine (Weeks 11–16)

The FDA audit trail requirement was the project's most distinctive and technically demanding feature. Unlike standard audit logs that can be overwritten, the audit trail had to be:

  • Immutable: No deletion, modification, or update possible — even by database administrators.
  • Causally linked: Each log entry referenced its preceding entry, forming a verifiable chain going back to trial inception.
  • FDA exportable: Exportable in CDISC ODM/SDTM format for regulatory submissions.

Our solution: a dedicated PostgreSQL audit table with BEFORE INSERT triggers that captured every change. Each log entry included:

  • The entity and field changed, old and new values
  • A SHA-256 hash of the entry content
  • A SHA-256 hash of this entry plus the preceding entry's hash, creating a tamper-evident chain
  • A digital signature (RSA-2048) from the user's identity provider confirming the action
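
The hash-chaining idea in the list above can be sketched in a few lines with Node's built-in `crypto` module. This is an illustration under stated assumptions, not the trigger code itself: the `AuditEntry` shape, the all-zero genesis value, and the JSON serialization used for hashing are hypothetical, and the RSA signature step is omitted.

```typescript
import { createHash } from "node:crypto";

// Hypothetical entry shape; the real audit table schema is not shown in the source.
interface AuditEntry {
  entity: string;
  field: string;
  oldValue: string | null;
  newValue: string | null;
  contentHash: string; // SHA-256 of the entry content
  chainHash: string; // SHA-256 of contentHash + previous entry's chainHash
}

type Change = Omit<AuditEntry, "contentHash" | "chainHash">;

// Assumed starting value for the first entry in a chain.
const GENESIS_HASH = "0".repeat(64);

function sha256(input: string): string {
  return createHash("sha256").update(input, "utf8").digest("hex");
}

function hashContent(change: Change): string {
  return sha256(JSON.stringify([change.entity, change.field, change.oldValue, change.newValue]));
}

function appendEntry(chain: AuditEntry[], change: Change): AuditEntry[] {
  const last = chain[chain.length - 1];
  const previousHash = last ? last.chainHash : GENESIS_HASH;
  const contentHash = hashContent(change);
  const chainHash = sha256(contentHash + previousHash);
  return [...chain, { ...change, contentHash, chainHash }];
}

// Recompute every hash from the start; any edited entry breaks the chain.
function verifyChain(chain: AuditEntry[]): boolean {
  let previousHash = GENESIS_HASH;
  for (const entry of chain) {
    const contentHash = hashContent(entry);
    if (contentHash !== entry.contentHash) return false;
    if (sha256(contentHash + previousHash) !== entry.chainHash) return false;
    previousHash = entry.chainHash;
  }
  return true;
}
```

Because each `chainHash` depends on every earlier entry, altering one historical record changes all later hashes, which is what makes tampering detectable during verification.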

The corresponding Kafka event emitted with every change ensured the audit trail was replicated in the event log as well — allowing us to reconstruct the entire trial history from either the database or the event stream.

The export engine was a separate service that queried the audit log and generated CDISC-compliant XML files, digitally signed and timestamped. The export was a snapshot at a point in time — not a live query — ensuring export contents were never affected by subsequent data modifications.

Phase 4: Real-Time WebSocket Layer (Weeks 17–22)

The WebSocket layer was our most underestimated element. Supporting concurrent real-time editing across 3,200 users with sub-200ms latency required careful attention to multiple simultaneous technical challenges:

Session management and scaling: Socket.io was configured with Redis-based pub/sub for horizontal scaling across multiple Node.js instances. The Redis adapter ensured that a message published by one instance was received by all instances that had subscribed clients. We used Redis Sorted Sets for presence tracking, listing online users per room with their current status and last-seen timestamps.

Conflict resolution: For concurrent edits to the same consent form or patient record, we implemented OT-based (Operational Transformation) conflict resolution. When two users edited the same field simultaneously, the server received both operations, applied them in the order they arrived, and broadcast the merged result to all connected clients. This prevented lost updates without requiring locking or pessimistic concurrency control, and it maintained acceptable latency under contention.

Example conflict resolution: Two coordinators simultaneously update a patient's consent status. The OT engine merges the operations based on operation type precedence, using last-write-wins for simple fields and append-only merging for log entries. The merged result is broadcast to all 340 connected clients within 150ms.
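
The merge policy in this example (last-write-wins for simple fields, append-only for log entries) can be sketched as follows. Note that this is a simplified policy merge, not a full Operational Transformation engine, and the `Operation` and `RecordState` types and the `receivedAt` ordering key are hypothetical.

```typescript
// Hypothetical sketch of the merge policy; not a full OT implementation.
interface FieldWrite {
  kind: "set";
  field: string;
  value: string;
  receivedAt: number; // server arrival order
}

interface LogAppend {
  kind: "append";
  field: string;
  entry: string;
  receivedAt: number;
}

type Operation = FieldWrite | LogAppend;

interface RecordState {
  fields: Record<string, string>;
  logs: Record<string, string[]>;
}

function mergeOperations(state: RecordState, ops: Operation[]): RecordState {
  // Apply in arrival order so the latest "set" wins for simple fields.
  const ordered = [...ops].sort((a, b) => a.receivedAt - b.receivedAt);

  // Copy the state so the input is never mutated.
  const fields: Record<string, string> = { ...state.fields };
  const logs: Record<string, string[]> = {};
  for (const [name, entries] of Object.entries(state.logs)) {
    logs[name] = [...entries];
  }

  for (const op of ordered) {
    if (op.kind === "set") {
      fields[op.field] = op.value; // last write wins
    } else {
      const list = logs[op.field] ?? [];
      list.push(op.entry); // appends are never dropped
      logs[op.field] = list;
    }
  }
  return { fields, logs };
}
```

With this policy, two concurrent status changes leave only the later value, while two concurrent notes are both preserved in arrival order.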

Reconnection and session recovery: Socket.io's built-in reconnection with exponential backoff, combined with persistent session state in Redis, allowed users to reconnect after network interruptions and resume where they left off — including receiving all missed events during the disconnection window.
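
One way to deliver missed events after a reconnect is a sequence-numbered event log per room, from which a client requests everything after the last sequence number it saw. The sketch below keeps the log in memory for illustration; `RoomEventLog`, `missedSince`, and the bounded-buffer behaviour are assumptions, whereas the platform is described as persisting session state in Redis.

```typescript
// Hypothetical sketch: bounded per-room event log for replay after reconnect.
interface RoomEvent {
  seq: number;
  payload: string;
}

class RoomEventLog {
  private events: RoomEvent[] = [];
  private nextSeq = 1;
  private readonly maxEvents: number;

  constructor(maxEvents: number) {
    this.maxEvents = maxEvents;
  }

  publish(payload: string): RoomEvent {
    const event: RoomEvent = { seq: this.nextSeq, payload };
    this.nextSeq += 1;
    this.events.push(event);
    // Drop the oldest event once the buffer is full.
    if (this.events.length > this.maxEvents) this.events.shift();
    return event;
  }

  // Returns the events the client missed, or null when the gap is larger
  // than the buffer and the client needs a full resync instead.
  missedSince(lastSeenSeq: number): RoomEvent[] | null {
    const oldest = this.events[0];
    if (oldest && lastSeenSeq < oldest.seq - 1) return null;
    return this.events.filter((event) => event.seq > lastSeenSeq);
  }
}
```

A reconnecting client would send its last seen sequence number, apply the returned events in order, and fall back to reloading the record when it receives `null`.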

The real-time layer was load-tested with k6 simulating 3,500 concurrent WebSocket connections with a broadcast load of 500 messages per second. Under this load, average message delivery time was 127ms, within our 200ms target, while the P99 of 290ms exceeded that target at the tail.
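
Average and P99 latency are different statistics and should be compared against a target separately. As a minimal sketch, the nearest-rank method below computes a percentile from raw delivery-time samples; the function name and the sample values are illustrative and are not the load-test data.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new RangeError("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  // Multiply before dividing to avoid floating-point drift in the rank.
  const rank = Math.ceil((p * sorted.length) / 100);
  const value = sorted[rank - 1];
  if (value === undefined) throw new RangeError("rank out of bounds");
  return value;
}

function mean(samples: number[]): number {
  if (samples.length === 0) throw new RangeError("no samples");
  return samples.reduce((sum, x) => sum + x, 0) / samples.length;
}

// Illustrative samples in milliseconds: mostly fast, with a slow tail.
const deliveryTimesMs = [90, 95, 100, 105, 110, 115, 120, 125, 130, 310];
console.log(mean(deliveryTimesMs)); // 130
console.log(percentile(deliveryTimesMs, 99)); // 310
```

In this illustrative data set the mean sits well under 200ms while the P99 does not, which is why tail percentiles are usually tracked alongside averages.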

Phase 5: Security Hardening and Beta (Weeks 23–28)

This phase was dedicated to compliance validation rather than feature development. We ran four parallel tracks:

  1. Penetration testing: External security firm performed a 2-week engagement testing authentication bypass, data exfiltration, API abuse, and infrastructure access. Three medium-severity findings were identified and resolved before launch. Zero critical findings.
  2. Performance auditing: Lighthouse CI integrated into the pipeline, tracking Core Web Vitals on every PR. At launch, Performance score was 91/100, Accessibility 94/100.
  3. Load testing: Simulated 5,000 concurrent users across 1,000 WebSocket connections. P95 response time was 248ms. API Gateway and NestJS services scaled horizontally without degradation.
  4. FHIR compliance validation: Validated all FHIR resources against the official schema using HAPI FHIR validator. 100% compliance across all 47 resource types.

The automated staging environment ran every commit against the full test suite — 872 unit tests, 234 integration tests, 48 contract tests — with 92% overall coverage. No untested code reached production.

Phase 6: Launch and Production Rollout (Weeks 29–32)

Launch was executed as a phased rollout rather than a big-bang cutover:

  1. Internal alpha (Week 29): TrialVault employees tested the platform, reporting bugs and usability issues. 17 bugs found and resolved before external access.
  2. External alpha pilot (Week 30): 2 pharmaceutical partners onboarded with administrative oversight. 12 minor bugs, all within SLA resolution times.
  3. Limited beta (Week 31): 50 research sites, phased by site type (academic first, CRO second, pharma direct third). 6 functional issues resolved; no compliance incidents.
  4. General availability (Week 32): Full launch to all 340 research sites. Zero unplanned downtime in first 30 days.

Post-launch, we maintained a 30-minute on-call response SLA for priority incidents with Datadog PagerDuty integration, supported by automated runbooks for common issues. We ran weekly retrospectives for the first 3 months, applying every lesson to the next sprint before technical debt could accumulate.

Results

Business Outcomes

| Metric | Target | Achieved |
| --- | --- | --- |
| Launch date | Mid-2025 (8 months) | Exactly 8 months from contract signature |
| ARR within first quarter | $12M | $12.1M |
| Active research sites at launch | 300 | 340 |
| User adoption rate (30-day) | 85% | 89% |
| Pilot pharmaceutical partners | 3 | 4 (exceeded) |
| Patient records processed monthly | 40,000 | 47,000 |
| Zero compliance incidents in 18 months | Yes | Yes: zero HIPAA, FDA, or GDPR incidents |
| Platform uptime (18 months) | 99.95% | 99.98% |

Technical Performance

| Metric | Before (PHP Monolith) | After (TrialVault Platform) | Improvement |
| --- | --- | --- | --- |
| P95 API latency | 2,800ms | 247ms | ↓91% |
| Concurrent users supported | 200 | 5,000 | ↑2400% |
| Patient record export time | 45 minutes | 12 seconds | ↓99.6% |
| Search query latency | 1,800ms | 180ms | ↓90% |
| Dashboard page load (P99) | 4,200ms | 890ms | ↓79% |
| Database query response time | 950ms | 68ms | ↓93% |
| Test suite execution time | 45 minutes | 8 minutes | ↓82% |
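
The improvement column follows from the usual relative-change formula, (after − before) / before × 100. A small sketch for checking the arithmetic, with `percentChange` as an illustrative helper name:

```typescript
// Relative change from `before` to `after`, in percent.
// Negative values are reductions, positive values are increases.
function percentChange(before: number, after: number): number {
  if (before === 0) throw new RangeError("before must be non-zero");
  return ((after - before) / before) * 100;
}

// P95 API latency: 2,800ms -> 247ms
console.log(Math.round(percentChange(2800, 247))); // -91

// Concurrent users: 200 -> 5,000
console.log(percentChange(200, 5000)); // 2400

// Export time: 45 minutes -> 12 seconds (both converted to seconds)
console.log(percentChange(45 * 60, 12).toFixed(1)); // -99.6
```

Units must match before comparing, which is why the export-time row converts 45 minutes to 2,700 seconds first.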

Clinical Trial Process Improvements

The research coordinators who used the previous systems reported measurable improvements in their daily workflows:

  • Patient screening time: Reduced from 45 minutes to 22 minutes (51% improvement) due to intelligent pre-filling, eligibility auto-matching, and structured form rendering.
  • Adverse event reporting: Time from event occurrence to formal reporting dropped from 72 hours to 18 hours (a 75% reduction). Automated escalation chains and real-time notifications ensured required reviewers were alerted immediately.
  • Regulatory submission preparation: Time to generate FDA submission packages dropped from 3 days to 4 hours, primarily due to automated audit trail assembly and pre-built CDISC export formats.
  • Multi-site coordination: Researchers at remote sites reported a 40% reduction in synchronization time, as real-time updates eliminated the need to manually notify teams of protocol amendments or schedule changes.

Metrics

System Monitoring Framework

We defined three tiers of observability metrics, each with distinct alerting and response SLAs:

Tier 1 — User-Facing SLIs (monitored 24/7, pager escalation for P0):

  • API success rate: target >99.95%
  • P95 API response time: target <300ms
  • Database query p99 latency: target <200ms
  • Availability: target >99.98%
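
An availability target translates directly into a downtime budget. As a worked example, the sketch below converts a percentage into allowed minutes of downtime; the 30-day window is an assumption chosen for illustration, not a figure from the SLA.

```typescript
// Minutes of downtime permitted by an availability target over a window.
function allowedDowntimeMinutes(availabilityPercent: number, windowDays: number): number {
  if (availabilityPercent <= 0 || availabilityPercent > 100) {
    throw new RangeError("availabilityPercent must be in (0, 100]");
  }
  const windowMinutes = windowDays * 24 * 60;
  return ((100 - availabilityPercent) / 100) * windowMinutes;
}

// 99.98% over 30 days: 0.02% of 43,200 minutes
console.log(allowedDowntimeMinutes(99.98, 30).toFixed(2)); // 8.64

// 99.95% over 30 days: 0.05% of 43,200 minutes
console.log(allowedDowntimeMinutes(99.95, 30).toFixed(2)); // 21.60
```

Under these assumptions, tightening the target from 99.95% to 99.98% shrinks the monthly budget from roughly 21.6 minutes to roughly 8.6 minutes.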

Tier 2 — Service-Level SLOs (monitored per service, reviewed weekly):

| Service | Error Rate Target | P99 Latency Target | Throughput Target |
| --- | --- | --- | --- |
| Patient Service | <0.1% | <150ms | 5,000 RPM |
| Trial Management | <0.1% | <200ms | 3,000 RPM |
| Audit Export | <0.5% | <2s | 100 RPM |
| WebSocket Gateway | <0.2% | <200ms | 10,000 msg/s |
| Reporting Engine | <0.3% | <3s | 500 RPM |

Tier 3 — Business/Operational Metrics (tracked daily, reviewed weekly):

  • Active concurrent users (sessions online)
  • WebSocket message throughput and delivery latency
  • Kafka consumer lag (alert if >5 seconds)
  • Database connection pool utilization (alert if >70%)
  • Monthly active sites (customer churn indicator)
  • Export success rate and duration

Incident Response SLA:

  • P0 (data breach, platform outage): <5-minute acknowledge, <30-minute resolution target
  • P1 (partial degradation affecting sites): <15-minute acknowledge, 2-hour resolution target
  • P2 (minor issue, no data impact): <24-hour acknowledge, next working day resolution

Real User Monitoring and Synthetic Monitoring

Datadog Real User Monitoring (RUM) instrumented all client-facing pages, tracking:

  • Core Web Vitals (LCP, FID, CLS) — showed 94% of sessions with "good" LCP within 4 months
  • User journey funnels (screening → consent → enrollment → reporting drop-off points identified)
  • Geographic performance breakdown showed EU latency was 340ms — faster than our US target due to Frankfurt-based data residency placement

Synthetic monitors ran every 5 minutes from 6 global locations, testing complete user journeys including login, patient entry, and trial management workflows. Each test failure triggered immediate PagerDuty alerting with full trace context.

Post-Launch Chaos Engineering

After launch, we instituted monthly chaos engineering experiments using k6 and Toxiproxy:

  1. Instance termination: Randomly terminate ECS task instances; validated auto-scaling and no service disruption for WebSocket connections.
  2. Kafka broker failure: Simulate broker unavailability; confirmed consumers recovered via ISR (In-Sync Replicas) without data loss.
  3. Database failover: Triggered primary PostgreSQL failover to replica; WebSocket reconnection handled gracefully with 12-second disruption, zero data loss.
  4. API rate limiting: Tested rate limiting under burst load and confirmed graceful degradation, with the API returning 429 responses rather than the infrastructure collapsing.

These experiments, conducted while the platform was live, identified and resolved 4 failure modes before they could impact production customers.
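
The rate-limiting behaviour checked in the fourth experiment (429 responses under burst load) is often implemented with a token bucket. The source does not say which algorithm the platform uses, so the sketch below is an assumption for illustration; `TokenBucket` and `statusFor` are hypothetical names, and time is passed in explicitly to keep the example deterministic.

```typescript
// Hypothetical sketch: token-bucket rate limiter returning 200 or 429.
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly capacity: number;
  private readonly refillPerSecond: number;

  constructor(capacity: number, refillPerSecond: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity; // start full so short bursts are allowed
    this.lastRefillMs = nowMs;
  }

  tryConsume(nowMs: number): boolean {
    // Refill in proportion to elapsed time, capped at capacity.
    const elapsedSeconds = Math.max(nowMs - this.lastRefillMs, 0) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefillMs = nowMs;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Map the limiter decision to an HTTP status code.
function statusFor(bucket: TokenBucket, nowMs: number): 200 | 429 {
  return bucket.tryConsume(nowMs) ? 200 : 429;
}
```

With a capacity of 2 and a refill rate of one token per second, a burst of three requests at the same instant yields 200, 200, 429, and a request one second later succeeds again.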

Lessons Learned

Technical Lessons

1. Regulatory compliance isn't a feature — it's an architecture constraint. Treating audit logging, data encryption, and access control as infrastructure concerns (database triggers, TLS by default, RLS at the database layer) rather than application features made compliance verifiable, testable, and immutable. Building them into the database foundation rather than bolting them on as services eliminated an entire class of compliance failures. Any future SaaS project with regulatory requirements should adopt this pattern from day one.

2. PostgreSQL RLS transformed multi-tenant security. We avoided the separate-database-per-tenant complexity and instead trusted PostgreSQL's RLS policies. With properly scoped session context (set on the connection for each authenticated request), every query operated within the tenant boundary without application-layer filtering. RLS does not prevent SQL injection by itself, but it eliminated a whole class of bugs in which application code omits or mis-builds a tenant filter, and it gave us verifiable, testable multi-tenant isolation at the database layer.
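
A minimal sketch of how tenant context might be scoped for RLS, assuming a client with a `query(text, params)` method similar to node-postgres. The `SqlClient` interface, the `app.current_site_id` setting name, and the `patients` policy shown in the comment are hypothetical. The sketch uses PostgreSQL's `set_config(name, value, is_local)` with `is_local = true`, which limits the setting to the current transaction; that is a deliberate variation on per-connection context, chosen so a pooled connection cannot carry one tenant's context into another request.

```typescript
// Hypothetical minimal client interface (resembles node-postgres' query method).
interface SqlClient {
  query(text: string, params?: unknown[]): Promise<unknown>;
}

// A policy on an RLS-enabled table could then read the setting, for example:
//   ALTER TABLE patients ENABLE ROW LEVEL SECURITY;
//   CREATE POLICY site_isolation ON patients
//     USING (site_id = current_setting('app.current_site_id')::uuid);

async function withSiteContext<T>(
  client: SqlClient,
  siteId: string,
  work: (client: SqlClient) => Promise<T>,
): Promise<T> {
  await client.query("BEGIN");
  try {
    // is_local = true: the setting disappears when the transaction ends.
    await client.query("SELECT set_config('app.current_site_id', $1, true)", [siteId]);
    const result = await work(client);
    await client.query("COMMIT");
    return result;
  } catch (error) {
    await client.query("ROLLBACK");
    throw error;
  }
}
```

Every query issued inside `work` then runs under the tenant boundary enforced by the policy, with no tenant filter in application code.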

3. The Kafka event stream is the single source of truth, not the database. During the audit trail development, we initially attempted to derive audit logs from database change events. When it became apparent that some changes bypassed application-layer logging (admin edits, direct DB operations), we reversed course and made Kafka the authoritative log source. Data now flows from Kafka to downstream consumers and into the database as read models. This eliminated future sync issues but required rebuilding the audit trail. Lesson: decide which store is authoritative before building, or be prepared to refactor.

4. WebSocket horizontal scaling requires presence as a shared state. Our initial Socket.io deployment worked perfectly with a single instance. Scaling to 3 instances revealed that the in-memory presence map was not shared — users appeared offline to other instances. The Redis pub/sub adapter fixed this, but the debugging consumed 2 days. Rule of thumb: any state shared across WebSocket workers must be in Redis or an equivalent shared store from the first deployment, even if you only have one instance running.

Organizational and Process Lessons

5. Compliance engineers should sit with the team, not be a review gate. We initially treated compliance as a checkpoint — deliver the product, then send it for compliance review. This created rework and tension. Midway through the project, we embedded a compliance expert as a core team member, participating in every architecture review and PR review. The result: zero compliance-related rework at launch. Future projects: compliance expertise is a team composition requirement, not a process checkpoint.

6. Test the compliance story, not just the features. Automated tests confirmed that features worked correctly. But regulatory compliance requires proving a negative — that something cannot be done. We built adversarial tests: attempts to access another site's data through API, direct database access attempts, audit log tampering tests, and session hijacking tests. These "negative compliance tests" caught three security assumptions we thought were correct but weren't — before any external auditor could find them.

7. FHIR compliance is a moving target. The FHIR specification keeps evolving across releases, and different healthcare partners expected different versions. We locked to FHIR R5 but maintained a compatibility shim that translated R5 responses for R4 clients. The shim cost 2 weeks of additional work but prevented a significant launch delay when a pharma partner's EHR integration required R4.

8. Real-time sync battery drain on mobile is real. During mobile app testing, we observed 18% battery drain over 4 hours of continuous WebSocket sync on iOS. We compromised: we implemented background sync with 30-second polling when the app is backgrounded, with cache-first rendering on foreground reconnection. Mobile battery impact dropped to 2%, at the cost of up to 30 seconds of stale-read latency, an acceptable trade-off given the use case.

What We'd Do Differently

  1. External data residency earlier: European data residency (EU-based RDS instance) was introduced in Phase 3. We should have built it into the foundation (Phase 1) — the duplication of infrastructure modes and rework of data routing logic cost weeks.
  2. Schema event sourcing from the start: We retrofitted Kafka event sourcing into the patient service after the initial implementation used direct PostgreSQL writes. Rearchitecting cost an additional 3 weeks during Phase 3. Next time: event-first development with command-query separation from the first schema design.
  3. Contract testing across service boundaries: We didn't adopt contract testing until month 5, by which point several API contract mismatches had produced production integration bugs. Integrating Pact from Phase 2 would have caught these early and reduced production impact.
  4. Dedicated performance budget in sprint planning: Performance targets were tracked loosely and not enforced until Beta. Moving performance regression detection into CI (automated Lighthouse CI + k6 smoke tests on every PR) would have surfaced regressions earlier and reduced the accumulation of performance debt.

Conclusion

The TrialVault CTMS platform launched exactly on time, within budget, and exceeded every technical and business target in its first year of production. The platform has processed over 560,000 patient records, supported 127 active clinical trials, and maintained a 99.98% uptime record with zero compliance incidents.

The project's success is attributable to three foundational decisions made before writing a single line of business code: first, treating compliance as an architecture layer rather than a feature checklist; second, investing heavily in observability, testing infrastructure, and CI/CD quality gates before building features that would be hard to test later; and third, making a very intentional choice to scope aggressively and defer features rather than risk regulatory penalties from scope creep.

For teams building regulated SaaS platforms (healthcare, fintech, legal, or government), the primary engineering lesson is unambiguous:

Compliance constraints are architecture inputs, not post-launch concerns. The earlier those constraints land in your design process, the less expensive they are to implement correctly.

Eight months delivers a lot of platform — if you're building it the right way from the very first line of infrastructure code.


About the author: The Webskyne editorial team publishes in-depth technical case studies, architecture retrospectives, and engineering leadership perspectives. We believe the most valuable engineering lessons come from post-mortem thinking — analyzing both what went right and what we'd change.

Tags: #healthtech #clinical-trials #hipaa #fda-compliance #nextjs #nestjs #kafka #real-time-systems

Category: Case Study
