Work Order QMS Module — Operational Readiness

Classification: Internal — Operations & Engineering
Date: 2026-02-13
Artifact: 66 of WO System Series
Status: Proposed
Source Artifacts: 12-sdd.md, 13-tdd.md, 14-c4-architecture.md, 25-agent-orchestration-spec.md, 58-gap-closure-prompts.md, 64-security-architecture.md


1. Team Topology

1.1 Conway's Law Alignment

The WO system architecture implies a team structure. Misalignment between the architecture and the org chart creates friction, slower delivery, and chronic cross-team dependencies. The mapping below reflects the container decomposition from the C4 model (14-c4-architecture.md).

| Architecture Component | Owning Team | Team Type | Headcount | Key Dependency |
|------------------------|-------------|-----------|-----------|----------------|
| Agent Orchestrator | Platform Core | Stream-aligned | 3–4 | AI model providers |
| Compliance Engine | Compliance Eng | Enabling | 2–3 | Regulatory SMEs |
| WO Lifecycle Engine | WO Product | Stream-aligned | 3–4 | Platform Core (state machine) |
| State Store (PostgreSQL) | Data Platform | Platform | 2 | Infrastructure |
| API Gateway | Platform Core | Platform | 1–2 | Identity provider |
| Event Bus (NATS) | Data Platform | Platform | 1 | Infrastructure |
| IDE Shell (Theia) | DX Team | Stream-aligned | 2–3 | Platform Core (APIs) |
| Observability Stack | SRE | Enabling | 2 | All teams |
| Vendor Portal | WO Product | Stream-aligned | 1–2 | API Gateway |

1.2 Team Interaction Modes

  • Platform Core ←→ WO Product (Collaboration: shared state machine, agent dispatch)
  • Platform Core → Compliance Eng (X-as-a-Service: compliance rules consumed via API)
  • Data Platform → All teams (X-as-a-Service: PostgreSQL, NATS, infra primitives)
  • SRE ←→ All teams (Collaboration: incident response, runbook co-authoring)
  • Compliance Eng → WO Product (Facilitating: compliance requirements, audit prep)
  • DX Team ←→ Platform Core (Collaboration: IDE integrates with orchestrator)

1.3 Minimum Viable Team (MVP — Launch)

For initial launch with 5–10 customers, compress to:

| Role | FTE | Covers |
|------|-----|--------|
| Full-stack engineer (TS/Python) | 2 | WO lifecycle, API, agent orchestration |
| Platform/infra engineer | 1 | PostgreSQL, NATS, deployment, observability |
| Compliance engineer | 1 | Regulatory validation, audit trail, e-signatures |
| SRE / DevOps (fractional) | 0.5 | On-call, monitoring, incident response |
| Product / CTO | 0.5 | Architecture decisions, customer-facing |
| Total | 5 | |

Scale to full topology (15–20 headcount) at 25+ customers / $4M+ ARR.


2. Skills Gap Analysis

2.1 Required Capabilities vs. Availability

| Capability | Criticality | Availability | Gap | Mitigation |
|------------|-------------|--------------|-----|------------|
| TypeScript + React | High | ✅ Common | None | |
| Python (AsyncIO, FastAPI) | High | ✅ Common | None | |
| PostgreSQL (RLS, partitioning, triggers) | High | ⚠️ Moderate | Advanced RLS + partitioning | Hire or upskill; DB consultant for initial design |
| FDA 21 CFR Part 11 expertise | Critical | ❌ Rare | Regulatory knowledge | Hire compliance engineer with pharma QMS background |
| HIPAA technical safeguards | High | ⚠️ Moderate | PHI handling specifics | Training + consulting engagement |
| AI/LLM agent orchestration | High | ❌ Rare | Emerging skill | Internal training; leverage Anthropic documentation |
| Eclipse Theia / Monaco | Medium | ❌ Rare | IDE platform expertise | 1 specialist hire or contract |
| Event-driven architecture (NATS) | Medium | ⚠️ Moderate | Event sourcing patterns | Training + reference implementation |
| Kubernetes / GKE operations | High | ⚠️ Moderate | Production operations | SRE hire with GKE experience |
| Cryptographic engineering | Medium | ❌ Rare | Vault integration, hash chains, signing | Security consultant for initial implementation |

2.2 Training Plan

| Phase | Duration | Focus | Delivery | Priority Roles |
|-------|----------|-------|----------|----------------|
| Onboarding (Week 1–2) | 2 weeks | WO system architecture, data model, state machine, compliance framework | Self-paced docs + pairing | All engineering |
| Regulatory Foundations (Month 1) | 4 sessions × 2 hr | FDA 21 CFR Part 11 deep-dive, HIPAA technical safeguards, SOC 2 controls | Workshop (compliance eng leads) | All engineering |
| Agent Orchestration (Month 1–2) | 3 sessions × 2 hr | Multi-agent patterns, model routing, checkpoint management, token economics | Workshop + hands-on lab | Platform + WO Product |
| PostgreSQL Advanced (Month 2) | 2 sessions × 3 hr | RLS policies, partitioning strategies, trigger-based audit trails, performance tuning | External training + lab | Data Platform + all backend |
| Incident Response (Month 2) | 1 day | Gameday exercise: simulated outage, compliance violation, security incident | Tabletop + live drill | All engineering + compliance |

2.3 Critical Hire Priorities

  1. Compliance Engineer (P0) — Pharma/biotech QMS background, FDA audit experience. This role cannot be replaced by training existing engineers. Estimated time-to-fill: 6–8 weeks.
  2. SRE with Healthcare SaaS experience (P1) — HIPAA-compliant infrastructure operations. Estimated time-to-fill: 4–6 weeks.
  3. Senior Python Engineer (AI/agent) (P1) — LLM orchestration, async systems. Estimated time-to-fill: 4–8 weeks.

3. Operational Runbooks

3.1 Runbook Template

Every component in the WO system has a runbook following this structure:

# Runbook: [Component Name]
Last Updated: [Date] | Owner: [Team] | Review Cadence: Quarterly

## Service Overview
- Purpose and business criticality
- Upstream/downstream dependencies
- Data classification level (per 63-data-architecture.md)
- SLA tier (per §8 below)

## Health Checks
| Check | Endpoint | Expected Response | Interval |
|-------|----------|-------------------|----------|

## Common Issues & Resolution
| Symptom | Likely Cause | Diagnostic Steps | Resolution | Escalation |
|---------|-------------|-----------------|-----------|-----------|

## Scaling Procedures
- Current capacity (users, WOs/sec, transitions/sec)
- Scale-up trigger thresholds
- Scale-up procedure (manual + auto-scaling config)
- Scale-down cooldown rules

## Emergency Procedures
- Circuit breaker manual override
- Break-glass access activation (per §5.2 of 64-security-architecture.md)
- Compliance engine bypass (NEVER in production — audit-only mode only)
- Rollback procedure

## Disaster Recovery
- RPO/RTO for this component (per §6.1 below)
- Backup verification procedure
- Restore procedure (step-by-step with expected duration)
- Failover trigger and procedure

3.2 WO Lifecycle Engine Runbook

Business Criticality: P1 — Core revenue-generating service
Data Classification: L4 (Regulated — WO records, approvals, audit trail)
SLA: 99.9% availability, P95 < 500ms

| Check | Endpoint | Expected Response | Interval |
|-------|----------|-------------------|----------|
| API health | GET /api/health | {"status":"healthy","db":"connected","nats":"connected"} | 30 s |
| State machine | GET /api/health/state-machine | All guard functions loaded, transition map valid | 5 min |
| Audit trail write | POST /api/health/audit-write-test | Write + verify in < 50 ms | 1 min |
| Agent connectivity | GET /api/health/agents | Orchestrator reachable, ≥ 1 worker healthy | 30 s |

| Symptom | Likely Cause | Diagnostic Steps | Resolution | Escalation |
|---------|-------------|-----------------|-----------|-----------|
| WO transitions hanging | PostgreSQL connection pool exhausted | Check pg_stat_activity; look for long-running queries | Kill idle connections, scale pool size | Data Platform if recurring |
| Approval signatures failing | Vault connectivity or key rotation in progress | Check Vault health endpoint; verify key version | Wait for rotation to complete, or use previous key version | Security team if key is revoked |
| Audit trail writes failing | Disk full on PostgreSQL partition | Check partition size; verify auto-partition creation | Run emergency partition creation; archive old partitions | SRE immediately (compliance violation if audit lost) |
| Agent dispatches rejected | Token budget exhausted for tenant | Check token_usage table for tenant; verify budget config | Increase tenant budget or wait for reset; notify customer | Product team if systemic |
| Circuit breaker OPEN | Agent worker failure rate > threshold | Check worker logs; verify model API health | Wait for half-open probe, or manually close if false positive | Platform Core + SRE |
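
The circuit-breaker row above can be made concrete with a minimal state machine. The following Python sketch is illustrative only (class and parameter names are assumptions, not the production implementation): after a failure threshold the breaker opens, a cooldown later it admits a single half-open probe, and a successful probe closes it again.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: OPEN after repeated failures, HALF_OPEN
    after a cooldown, CLOSED again once a probe request succeeds."""

    def __init__(self, failure_threshold=5, cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock          # injectable for testing
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = None

    def allow_request(self) -> bool:
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.cooldown_s:
                self.state = "HALF_OPEN"   # admit a single probe
                return True
            return False                   # fast-fail while OPEN
        return True

    def record_success(self):
        self.state = "CLOSED"
        self.failures = 0

    def record_failure(self):
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()
```

"Manually close if false positive" in the table corresponds to calling record_success out-of-band after a human has verified model API health.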

3.3 Compliance Engine Runbook

Business Criticality: P0 — Compliance failure = regulatory risk
Data Classification: L4 (Regulated)
SLA: 99.95% availability, zero data loss

| Symptom | Likely Cause | Resolution | Escalation |
|---------|-------------|-----------|-----------|
| Compliance guard rejecting valid transitions | Policy rule misconfigured or stale | Review guard function against regulatory matrix; update rule | Compliance Eng immediately |
| E-signature verification failing | Hash mismatch (data integrity issue) | DO NOT bypass. Investigate hash chain integrity; check for concurrent modification. | Security team + Compliance Eng (potential tampering event) |
| PHI scanner false positives | Pattern match too aggressive | Adjust confidence thresholds; add to allowlist if appropriate | Compliance Eng (never auto-suppress) |
| Audit trail hash chain break | Missing entry or out-of-order write | Run chain verification job; identify gap; reconstruct if possible | P0 incident (regulatory reporting may be required) |
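
The chain verification job referenced above can be sketched in a few lines. This is a hedged illustration of a SHA-256 hash chain (the function names and entry shape are assumptions; the real audit schema is defined elsewhere): each entry hashes the previous entry's hash plus a canonical encoding of its payload, so verification walks the chain and reports the first broken link.

```python
import hashlib
import json

def entry_hash(prev_hash: str, payload: dict) -> str:
    # Hash covers the previous link plus a canonical JSON encoding,
    # so any payload mutation or reordering breaks the chain.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode()).hexdigest()

def verify_chain(entries):
    """Return the index of the first broken link, or None if intact.
    Each entry: {"prev_hash": ..., "hash": ..., "payload": {...}}."""
    prev = "0" * 64  # genesis sentinel
    for i, entry in enumerate(entries):
        if entry["prev_hash"] != prev or entry["hash"] != entry_hash(prev, entry["payload"]):
            return i
        prev = entry["hash"]
    return None
```

A returned index localizes the gap for the reconstruction step; an intact chain returns None.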

3.4 PostgreSQL State Store Runbook

Business Criticality: P0 — All state depends on database
Data Classification: L4 (stores all regulated data)
SLA: 99.99% availability, RPO = 0 (synchronous replication)

| Symptom | Likely Cause | Resolution | Escalation |
|---------|-------------|-----------|-----------|
| Replication lag > 1 s | Network issue or standby overloaded | Check network; reduce standby query load; verify WAL shipping | SRE if lag > 10 s |
| Connection pool saturated | Query spike or leak | Identify long queries (pg_stat_activity); kill if safe; increase pool | Data Platform if recurring |
| Table bloat (audit_trail) | Append-only table growing without partitioning | Verify partition pruning active; create new partition; archive old | Data Platform for capacity planning |
| RLS policy not enforcing | app.tenant_id not set on connection | CRITICAL: tenant data leak. Kill affected connections immediately; audit affected queries. | P0 incident (Security team + Compliance Eng) |
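
The fail-closed invariant behind the last row (if app.tenant_id is unset, a query must return nothing or error, never all rows) can be modeled in a few lines. This pure-Python sketch is illustrative only; the real enforcement is a PostgreSQL RLS policy, typically of the shape `USING (tenant_id = current_setting('app.tenant_id')::uuid)`.

```python
def tenant_scoped_rows(rows, session_tenant_id):
    """Fail-closed tenant filter: mirrors the RLS invariant that an
    unset app.tenant_id must never fall through to returning all rows."""
    if session_tenant_id is None:
        raise PermissionError("app.tenant_id not set on connection")
    return [r for r in rows if r["tenant_id"] == session_tenant_id]
```

The operational takeaway is the same as the runbook row: a connection without the tenant setting must be killed, not allowed to degrade into an unscoped query.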

4. Cost of Ownership Model

4.1 Infrastructure Cost Projection

| Resource | Y1 (monthly) | Y2 (monthly) | Y3 (monthly) | Notes |
|----------|--------------|--------------|--------------|-------|
| GKE cluster (3 nodes, e2-standard-4) | $800 | $1,200 | $2,400 | Scales with customer count |
| Cloud SQL (PostgreSQL, HA) | $500 | $1,000 | $2,500 | Scales with data volume |
| NATS cluster (3 nodes) | $200 | $300 | $600 | Scales with event volume |
| Cloud KMS + Secret Manager | $50 | $100 | $200 | Scales with key count |
| Cloud Storage (backups, evidence) | $100 | $300 | $800 | Scales with audit trail size |
| Networking (egress, LB) | $150 | $300 | $600 | Scales with traffic |
| Observability (OTEL, Prometheus, Grafana Cloud) | $200 | $400 | $800 | Scales with metrics volume |
| AI model tokens | $500 | $2,000 | $8,000 | Scales with WO volume; routing reduces cost |
| Total infrastructure | $2,500 | $5,600 | $15,900 | |
| Annual | $30K | $67K | $191K | |
| Per customer (Y2, 25 customers) | | $224/mo | | Infrastructure marginal cost |
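
The per-customer figure is simple division over the Y2 total; a one-line helper makes the arithmetic explicit (the function name is illustrative):

```python
def per_customer_monthly_cost(total_monthly_usd: float, customers: int) -> int:
    """Marginal infrastructure cost per customer, rounded to whole dollars."""
    return round(total_monthly_usd / customers)

# Y2 figures from the table above: $5,600/month spread across 25 customers.
```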

4.2 Staffing Cost Projection

| Role | Y1 (FTE × cost) | Y2 | Y3 |
|------|-----------------|----|----|
| Engineering (4 FTE) | $600K | $900K (6 FTE) | $1.4M (9 FTE) |
| Compliance Eng (1 FTE) | $150K | $200K | $300K (2 FTE) |
| SRE (0.5 FTE) | $75K | $150K (1 FTE) | $225K (1.5 FTE) |
| Product (0.5 FTE) | $75K | $150K (1 FTE) | $200K (1 FTE) |
| Total staffing | $900K | $1.4M | $2.125M |

4.3 Total Cost of Ownership

| Category | Y1 | Y2 | Y3 | 3-Year Total |
|----------|----|----|----|--------------|
| Infrastructure | $30K | $67K | $191K | $288K |
| Staffing | $900K | $1.4M | $2.125M | $4.425M |
| Licensing (third-party) | $15K | $30K | $50K | $95K |
| Compliance consulting | $50K | $25K | $25K | $100K |
| Total | $995K | $1.522M | $2.391M | $4.908M |
| Revenue (from 60-business-model.md) | $960K | $4.6M | $16.3M | $21.86M |
| Gross profit | ($35K) | $3.08M | $13.9M | $16.95M |

Y1 is near break-even (pre-revenue buildup). Y2 crosses into strong profitability as multi-tenant leverage kicks in. Y3 infrastructure costs grow sublinearly vs. revenue because shared platform scales efficiently.

4.4 Opportunity Cost Assessment

| If We Do This | We Don't Do This | Revenue Impact | Risk |
|---------------|------------------|----------------|------|
| Gap closure series (G01–G28) | New feature development (2–3 months) | $0 direct; prevents $200K+ per compliance finding | Compliance risk elimination; non-negotiable |
| Vendor portal (V1) | Advanced agent capabilities | +$200/vendor/mo × vendor seats | Delayed competitive moat |
| Multi-region deployment | Feature parity with MasterControl | Enables enterprise segment ($216K+ ACV) | Slower mid-market growth |
| ISO 13485 compliance framework | Fintech SOC 2 expansion | Opens medical device segment | Delays fintech revenue |

5. Vendor Risk Assessment

5.1 Critical Vendor Dependencies

| Vendor | Dependency Level | What They Provide | Exit Difficulty | Alternative |
|--------|------------------|-------------------|-----------------|-------------|
| Anthropic (Claude) | Tier 1: Core | Primary AI models (Opus/Sonnet/Haiku) | Medium (model routing abstraction enables swap) | OpenAI, Google, open-source models |
| Google Cloud (GKE, Cloud SQL) | Tier 1: Core | Compute, database, networking | High (deep GCP integration) | AWS (EKS, RDS), Azure (AKS, Flexible Server) |
| HashiCorp Vault | Tier 2: Strategic | Secrets management, credential vault | Medium (Vault API is an industry standard) | GCP Secret Manager, AWS Secrets Manager |
| NATS | Tier 2: Strategic | Event bus, async messaging | Low (NATS is open-source, self-hosted) | Redis Streams, Apache Kafka, RabbitMQ |
| Grafana Labs | Tier 3: Standard | Observability dashboards | Low (OTEL is vendor-neutral) | Datadog, New Relic, self-hosted Grafana |
| Auth0 / Okta | Tier 2: Strategic | Identity provider, SSO | Medium (OIDC is a standard protocol) | Keycloak (self-hosted), Azure AD, Google Identity |

5.2 Exit Strategy per Vendor

| Vendor | Exit Trigger | Migration Path | Estimated Duration | Data Portability |
|--------|--------------|----------------|--------------------|------------------|
| Anthropic | Model quality degradation, pricing increase > 50%, BAA revocation | Switch model router to OpenAI or open-source; retrain agent prompts | 4–6 weeks | N/A (stateless API calls) |
| GCP | Pricing, region availability, compliance certification loss | Terraform-based infra; containerized workloads; standard PostgreSQL | 8–12 weeks | pg_dump + storage export |
| Vault | Licensing, security vulnerability | Migrate secrets to cloud-native (GCP Secret Manager); update vault:// references | 2–4 weeks | Secret export + re-encrypt |
| NATS | Performance, maintainability | Replace with Redis Streams (similar pub/sub semantics) | 4–6 weeks | Event replay from audit trail |

5.3 Vendor Health Monitoring

| Signal | Frequency | Red-Flag Threshold |
|--------|-----------|--------------------|
| API availability | Real-time | < 99.5% over 30 days |
| Pricing changes | Per announcement | > 25% increase without proportional value |
| Security incidents | Per disclosure | Any breach affecting the customer-data category |
| BAA/DPA compliance | Annually | Refusal to renew or terms degradation |
| Financial health | Quarterly | Negative press, layoffs > 20%, funding concerns |
| Support responsiveness | Monthly | P1 response > 4 hours consistently |

6. Disaster Recovery

6.1 RPO/RTO by Data Tier

| Tier | Data Types | RPO | RTO | Method | Cost |
|------|-----------|-----|-----|--------|------|
| Tier 1: Zero-loss | Audit trail, e-signatures, compliance records | 0 | < 15 min | Synchronous replication to standby | High (dedicated standby) |
| Tier 2: Near-zero | Work orders, approvals, state machine state | < 1 min | < 30 min | Streaming replication (WAL) | Medium |
| Tier 3: Hourly | Agent execution logs, token usage, metrics | < 1 hour | < 2 hours | Point-in-time recovery (PITR) | Low |
| Tier 4: Daily | Temporary data, cache, session state | < 24 hours | < 4 hours | Daily snapshots | Minimal |

6.2 Backup Strategy

| Component | Method | Frequency | Retention | Verification |
|-----------|--------|-----------|-----------|--------------|
| PostgreSQL (full) | pg_basebackup to Cloud Storage | Daily (02:00 UTC) | 30 days | Weekly restore test to ephemeral instance |
| PostgreSQL (WAL) | Continuous WAL archiving | Continuous | 7 days | Verified via replication lag monitoring |
| NATS (event replay) | Events reconstructable from audit trail | N/A (derived) | N/A | Monthly replay test |
| Vault (secrets) | Vault snapshot to encrypted Cloud Storage | Daily | 90 days | Quarterly restore test |
| Application config | Git (infrastructure-as-code, Terraform) | Per commit | Indefinite | Every deployment is a restore test |
| Container images | Container registry (GCR) with signing | Per build | 90 days (tagged releases indefinitely) | Every deployment |
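
The restore tests above hinge on comparing a checksum recorded at backup time against the restored artifact. A streaming SHA-256 helper is enough to sketch the idea (illustrative; the production pipeline may use Cloud Storage's built-in hashes instead):

```python
import hashlib

def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a backup artifact through SHA-256 in 1 MiB chunks so large
    base backups can be verified without loading them into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A restore test passes when the digest of the restored file matches the digest captured when the backup was written.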

6.3 Recovery Procedures

Database Failover (Automated)

Trigger: Primary PostgreSQL unresponsive for > 30 seconds
OR replication lag > 60 seconds with primary unhealthy

Procedure:
1. Cloud SQL automatic failover promotes standby to primary
2. Application connection pool detects new primary (< 30s)
3. NATS publishes FAILOVER_COMPLETE event
4. Observability: verify query latency, replication re-established
5. Post-failover: create new standby from promoted primary

Verification:
- All RLS policies active on new primary
- Audit trail writes resuming without gaps
- E-signature verification passing
- No tenant data exposure during failover window

Duration: 60–120 seconds total
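
Step 2 (the connection pool rediscovering the promoted primary) typically relies on retry with exponential backoff. A minimal sketch, with injectable connect and sleep functions for testability (all names are assumptions, not the production client):

```python
import time

def connect_with_backoff(connect, max_attempts=6, base_delay_s=0.5, sleep=time.sleep):
    """Retry a connection attempt with exponential backoff, e.g. while
    DNS/pool state converges on the newly promoted primary."""
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure
            sleep(base_delay_s * (2 ** attempt))  # 0.5 s, 1 s, 2 s, ...
```

With the defaults above, total retry time stays under the 60–120 second failover window quoted above.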

Full Region Failure (Manual)

Trigger: GCP region outage > 15 minutes with no ETA
OR regulatory requirement to relocate data

Procedure:
1. Activate DR region infrastructure (pre-provisioned via Terraform)
2. Restore PostgreSQL from latest WAL archive (RPO: < 1 min for Tier 1/2)
3. Update DNS to DR region load balancer
4. Verify RLS policies, audit trail integrity, e-signature chain
5. Notify customers of region failover via status page
6. Run compliance verification suite (subset of OQ)

Duration: 2–4 hours (dominated by data restore + verification)

6.4 DR Testing Schedule

| Test Type | Frequency | Scope | Duration | Participants |
|-----------|-----------|-------|----------|--------------|
| Automated failover | Monthly | Database failover to standby | 15 min | SRE (automated, reviewed) |
| Backup restore | Weekly | Restore latest backup to ephemeral instance and verify | 30 min | SRE (automated) |
| Full DR drill | Quarterly | Complete region-failover simulation (non-production) | 4 hours | SRE + Engineering + Compliance |
| Tabletop exercise | Bi-annually | Scenario walkthrough (no live systems) | 2 hours | All teams + leadership |
| Compliance DR review | Annually | Auditor reviews DR evidence, procedures, test results | 1 day | Compliance Eng + external auditor |

7. Business Continuity

7.1 Degraded Mode Operations

When full functionality is unavailable, the system can operate in degraded modes:

| Scenario | Degraded Capability | What Still Works | Recovery |
|----------|--------------------|------------------|----------|
| AI model provider down | No agent-driven transitions | Manual WO transitions, approvals, audit trail, e-signatures | Model router failover to secondary provider |
| NATS down | No async events, no real-time notifications | Synchronous WO operations, direct database writes | Events queued in PostgreSQL; replay on NATS recovery |
| Vault down | No new credential issuance | Cached credentials (TTL-based); no new agent executions | Vault recovery; credential refresh cycle |
| Observability down | No dashboards, no alerting | All business operations continue normally | Grafana/Prometheus recovery; no data loss (buffered) |
| Primary region degraded | Increased latency | Full functionality at higher latency | Auto-scaling in region, or regional failover |
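
The "events queued in PostgreSQL; replay on NATS recovery" row is essentially a transactional outbox. A minimal in-memory sketch (a Python list stands in for the durable outbox table; class and function names are illustrative):

```python
class DurableEventPublisher:
    """Outbox sketch: when the bus is down, events are appended to a
    durable store instead of being dropped, then replayed in order."""

    def __init__(self, bus_publish):
        self.bus_publish = bus_publish   # e.g. a NATS client's publish
        self.outbox = []                 # stand-in for a PostgreSQL table

    def publish(self, event):
        try:
            self.bus_publish(event)
        except ConnectionError:
            self.outbox.append(event)    # durable write instead of loss

    def replay(self):
        """Drain the outbox once the bus is healthy again."""
        pending, self.outbox = self.outbox, []
        for event in pending:
            self.publish(event)          # re-queues again if still down
```

In production the outbox write would share a transaction with the business write, preserving the zero-loss guarantee for Tier 1/2 data.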

7.2 Communication Plan

| Severity | Communication Channel | Timeline | Audience |
|----------|----------------------|----------|----------|
| P0 (data loss risk) | Status page + email + phone to affected customers | < 15 min initial; 30-min updates | All affected customers + internal leadership |
| P1 (service degraded) | Status page + email | < 30 min initial; hourly updates | All customers |
| P2 (non-critical issue) | Status page | < 1 hour; updates as available | Subscribers |
| P3 (cosmetic/minor) | Internal tracking only | Next business day | Internal only |

7.3 Regulatory Continuity

For FDA-regulated customers, business continuity has specific compliance requirements:

| Requirement | Implementation | Evidence |
|-------------|----------------|----------|
| Audit trail continuity during DR | Write-ahead log ensures no gaps; chain verification post-failover | Chain verification report |
| E-signature validity after failover | Signatures tied to record content, not infrastructure; re-verifiable | Signature verification test results |
| Data integrity post-restore | Checksum verification of all L4 data after restore | Integrity verification report |
| Regulatory notification | If a DR event affects validated system records, notify the customer's QA | Notification log, customer acknowledgment |

8. SLA Framework

8.1 External SLAs (Customer-Facing)

| Tier | Availability | API P95 Latency | Audit Trail Write | Support Response |
|------|--------------|-----------------|-------------------|------------------|
| Starter | 99.5% | < 1 s | < 200 ms | < 24 hours (business) |
| Professional | 99.9% | < 500 ms | < 100 ms | < 4 hours (business) |
| Enterprise | 99.95% | < 300 ms | < 50 ms | < 1 hour (24/7) |
| Enterprise Plus | 99.99% | < 200 ms | < 50 ms | < 15 min (24/7, named) |
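
Each availability figure implies a monthly downtime budget. A small helper makes the conversion explicit (assuming a 30-day month; the function name is illustrative):

```python
def monthly_downtime_minutes(availability_pct: float, days: int = 30) -> float:
    """Allowed downtime per month implied by an availability SLA,
    rounded to one decimal place."""
    return round((1 - availability_pct / 100) * days * 24 * 60, 1)
```

So 99.9% allows roughly 43.2 minutes of downtime per 30-day month, while 99.99% allows only about 4.3.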

8.2 Internal SLOs (Engineering Targets)

Internal targets are stricter than external SLAs to provide margin:

| Component | SLO Target | Measurement | Alert Threshold |
|-----------|-----------|-------------|-----------------|
| WO API availability | 99.95% | Synthetic monitoring (every 30 s) | < 99.9% over 1 hour |
| State transition latency | P50 < 50 ms, P95 < 200 ms, P99 < 1 s | Request tracing | P95 > 300 ms |
| Audit trail write latency | P50 < 10 ms, P95 < 50 ms | Write-acknowledgment timing | P95 > 75 ms |
| Agent dispatch latency | < 1 s from request to first agent action | Orchestrator instrumentation | > 2 s |
| E-signature verification | < 100 ms per signature | Verification request timing | > 200 ms |
| Database query latency | P50 < 5 ms, P95 < 50 ms | pg_stat_statements | P95 > 75 ms |
| Event delivery latency | < 500 ms from publish to consumer | NATS metrics | > 1 s |
| Error budget | 0.05% of requests (99.95% target) | Error-rate tracking | > 0.03% (warning), > 0.04% (critical) |
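
The P50/P95/P99 targets above presuppose a percentile definition. A nearest-rank sketch over raw latency samples (illustrative only; in production these come from histogram queries in the tracing backend, not raw sample sorts):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p% of all samples are less than or equal to it."""
    ranked = sorted(samples)
    k = max(math.ceil(len(ranked) * p / 100) - 1, 0)
    return ranked[k]
```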

8.3 Error Budget Policy

Monthly error budget = 100% - SLO = 0.05% of requests

Budget tracking:
Green: < 50% budget consumed → Normal development velocity
Yellow: 50–80% budget consumed → Reduce risky deployments, increase testing
Orange: 80–95% budget consumed → Freeze non-critical changes, focus on reliability
Red: > 95% budget consumed → All-hands reliability focus, no new features

Budget resets monthly. Unused budget does not carry over.
Compliance-critical paths (audit trail, e-signatures) have separate, stricter budgets.
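
The policy above reduces to one function mapping consumed budget to the color bands. A sketch (the SLO and thresholds come from this section; the function name and boundary handling at exactly 50/80/95% are assumptions):

```python
def budget_status(error_requests: int, total_requests: int, slo_pct: float = 99.95) -> str:
    """Map the fraction of the monthly error budget consumed to the
    Green/Yellow/Orange/Red policy bands defined above."""
    allowed_errors = total_requests * (100 - slo_pct) / 100
    consumed = error_requests / allowed_errors if allowed_errors else float("inf")
    if consumed < 0.50:
        return "Green"
    if consumed <= 0.80:
        return "Yellow"
    if consumed <= 0.95:
        return "Orange"
    return "Red"
```

At the 99.95% target, a month with 1,000,000 requests carries a budget of 500 errors; 300 errors (60% consumed) lands in Yellow.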

9. On-Call Design

9.1 Rotation Structure

| Tier | Who | Scope | Hours | Response Time |
|------|-----|-------|-------|---------------|
| Primary on-call | SRE + 1 backend engineer (rotating) | All P0/P1 alerts | 24/7 | < 15 min acknowledge |
| Secondary on-call | Platform Core engineer (rotating) | Escalation from primary | 24/7 | < 30 min acknowledge |
| Compliance on-call | Compliance engineer | Compliance-specific alerts only | Business hours + P0 24/7 | < 1 hour (business), < 30 min (P0) |
| Leadership escalation | CTO / VP Eng | P0 unresolved > 1 hour | 24/7 | < 15 min |

9.2 Rotation Schedule

  • Weekly rotation (Monday 10 AM UTC to Monday 10 AM UTC)
  • Minimum 3 people in primary rotation to avoid burnout (max 1 week in 3)
  • Handoff meeting at rotation change: open issues, upcoming maintenance, known risks
  • Post-on-call debrief for any P0/P1 incidents during the week

9.3 Alert Routing

| Alert Category | Examples | Primary Route | Escalation |
|----------------|----------|---------------|------------|
| Infrastructure | Node unhealthy, disk > 85%, memory pressure | SRE (PagerDuty) | Platform Core |
| Application | Error rate spike, latency degradation, circuit breaker OPEN | Primary on-call (PagerDuty) | Secondary on-call |
| Compliance | Audit trail write failure, hash chain break, PHI detection | Compliance on-call (PagerDuty + Slack) | Leadership |
| Security | Failed auth spike, privilege escalation, break-glass activation | Security + primary on-call (PagerDuty) | Leadership |
| Business | Tenant approaching WO limit, token budget exhaustion | Slack notification (non-urgent) | Product team |

9.4 Incident Response Playbook

P0 Incident Flow:

0 min: Alert fires → PagerDuty pages primary on-call
5 min: Primary acknowledges → assess severity → confirm P0
10 min: Open incident channel (Slack #incident-YYYYMMDD-NNN)
Assign roles: Incident Commander (primary), Scribe (secondary), Comms (product)
15 min: First customer communication (status page update)
30 min: If unresolved → escalate to secondary on-call + relevant team lead
60 min: If unresolved → escalate to leadership
Compliance on-call verifies: audit trail integrity, no data exposure
Ongoing: 30-minute status updates until resolved

Resolution:
- Verify: service restored, data integrity confirmed, compliance checks pass
- Customer notification: "Resolved" status page update
- Within 24 hours: Preliminary incident report
- Within 5 business days: Full post-incident review (blameless)
- PIR outputs: action items with owners and deadlines, runbook updates

9.5 On-Call Wellness

| Practice | Implementation |
|----------|----------------|
| Alert hygiene | Monthly alert review; snooze or delete alerts with < 10% actionable rate |
| Toil budget | Max 30% of on-call time on toil; remaining time for reliability improvements |
| Compensation | On-call stipend + incident response premium; time-off-in-lieu for overnight pages |
| Fatigue prevention | Max 2 P0/P1 incidents per rotation before mandatory rest day; no back-to-back weeks |
| Alert-free goals | Team OKR: reduce alert volume 10% per quarter through reliability investment |

10. Localization & Multi-Market Readiness

10.1 Current Scope

The WO system launches for US-based FDA-regulated organizations. Multi-market readiness is planned for Y2/Y3 expansion.

10.2 Internationalization Architecture (i18n)

| Component | Approach | Status |
|-----------|----------|--------|
| UI strings | Externalized to JSON resource bundles (ICU MessageFormat) | Design; implement before first non-US customer |
| Date/time | ISO 8601 storage; locale-aware display (Intl API) | Partially implemented (all timestamps UTC in DB) |
| Number/currency | Locale-aware formatting (Intl.NumberFormat) | Not started |
| Regulatory framework | Configurable compliance rules per jurisdiction | Architecture supports (compliance engine is rule-based) |
| Audit trail language | Audit events in English (system language); UI translatable | Design decision: audit in English for FDA consistency |

10.3 Regulatory Jurisdiction Mapping

| Market | Regulations | Configuration Required | Priority |
|--------|-------------|------------------------|----------|
| US (primary) | FDA 21 CFR Part 11, HIPAA, SOC 2 | Implemented | Y1 |
| EU | EMA Annex 11, GDPR, EU MDR 2017/745 | Compliance engine rules + data residency | Y2 |
| UK | MHRA GxP, UK GDPR | Similar to EU with UK-specific rules | Y2 |
| Brazil | ANVISA, LGPD | Compliance rules + Brazil data residency | Y3 |
| Japan | PMDA, APPI | Compliance rules + Japan data residency | Y3 |

10.4 Data Residency for International

| Region | Deployment | Data Classification Impacted | Implementation |
|--------|------------|------------------------------|----------------|
| US (us-central1) | Primary | All tiers | Deployed |
| EU (europe-west1) | Required for EU customers | L3, L4 (PII, regulated) | Terraform-ready; deploy when first EU customer lands |
| Brazil (southamerica-east1) | Required for LGPD | L3, L4 | Planned Y3 |

Tenant-level configuration: tenant.data_residency_region controls where all L3/L4 data is stored and processed. L0/L1/L2 can be served from any region (CDN, global read replicas).
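
The residency rule reduces to a small routing decision. A sketch under the stated policy (function and constant names are assumptions, not production identifiers):

```python
# Classification tiers that may be served from any region per the policy above.
SERVABLE_ANYWHERE = {"L0", "L1", "L2"}

def region_for(tenant_residency_region: str, classification: str,
               nearest_region: str) -> str:
    """L3/L4 data is pinned to the tenant's configured residency region;
    lower tiers may be served from the nearest region (CDN, read replicas)."""
    if classification in SERVABLE_ANYWHERE:
        return nearest_region
    return tenant_residency_region
```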


Operational readiness is not a launch checklist — it's a continuous capability. Every new feature, every new customer, every new region requires updating the runbooks, DR procedures, on-call scope, and cost model. This document is alive; it degrades the moment it stops being updated.


Appendix: Cross-Reference to Gap Closure Series

| Gap ID | Operational Readiness Impact | This Document Section |
|--------|------------------------------|-----------------------|
| G01 (Credential Vault) | Vault runbook required; vendor risk for HashiCorp | §3.3, §5.1 |
| G08 (FDA Completeness) | Compliance validation in DR procedures | §6.3, §7.3 |
| G13 (SIEM Integration) | Alert routing for security events | §9.3 |
| G14 (Incident Response WO) | P0 incident auto-creates WO | §9.4 |
| G17 (Optimistic Locking) | Conflict resolution runbook | §3.2 |
| G27 (Artifact Updates) | Runbook review cadence | §3.1 (template) |
| G28 (Dashboard Sync) | Monitoring dashboard accuracy | §8.2 |