Work Order QMS Module — Operational Readiness
Classification: Internal — Operations & Engineering
Date: 2026-02-13
Artifact: 66 of WO System Series
Status: Proposed
Source Artifacts: 12-sdd.md, 13-tdd.md, 14-c4-architecture.md, 25-agent-orchestration-spec.md, 58-gap-closure-prompts.md, 64-security-architecture.md
1. Team Topology
1.1 Conway's Law Alignment
The WO system architecture implies a team structure. Misalignment between architecture and org chart creates friction, slower delivery, and cross-team dependency hell. The mapping below reflects the container decomposition from the C4 model (14-c4-architecture.md).
| Architecture Component | Owning Team | Team Type | Headcount | Key Dependency |
|---|---|---|---|---|
| Agent Orchestrator | Platform Core | Stream-aligned | 3–4 | AI model providers |
| Compliance Engine | Compliance Eng | Enabling | 2–3 | Regulatory SMEs |
| WO Lifecycle Engine | WO Product | Stream-aligned | 3–4 | Platform Core (state machine) |
| State Store (PostgreSQL) | Data Platform | Platform | 2 | Infrastructure |
| API Gateway | Platform Core | Platform | 1–2 | Identity provider |
| Event Bus (NATS) | Data Platform | Platform | 1 | Infrastructure |
| IDE Shell (Theia) | DX Team | Stream-aligned | 2–3 | Platform Core (APIs) |
| Observability Stack | SRE | Enabling | 2 | All teams |
| Vendor Portal | WO Product | Stream-aligned | 1–2 | API Gateway |
1.2 Team Interaction Modes
Platform Core ←→ WO Product (Collaboration: shared state machine, agent dispatch)
Platform Core → Compliance Eng (X-as-a-Service: compliance rules consumed via API)
Data Platform → All teams (X-as-a-Service: PostgreSQL, NATS, infra primitives)
SRE ←→ All teams (Collaboration: incident response, runbook co-authoring)
Compliance Eng → WO Product (Facilitating: compliance requirements, audit prep)
DX Team ←→ Platform Core (Collaboration: IDE integrates with orchestrator)
1.3 Minimum Viable Team (MVP — Launch)
For initial launch with 5–10 customers, compress to:
| Role | FTE | Covers |
|---|---|---|
| Full-stack engineer (TS/Python) | 2 | WO lifecycle, API, agent orchestration |
| Platform/infra engineer | 1 | PostgreSQL, NATS, deployment, observability |
| Compliance engineer | 1 | Regulatory validation, audit trail, e-signatures |
| SRE / DevOps (fractional) | 0.5 | On-call, monitoring, incident response |
| Product / CTO | 0.5 | Architecture decisions, customer-facing |
| Total | 5 | |
Scale to full topology (15–20 headcount) at 25+ customers / $4M+ ARR.
2. Skills Gap Analysis
2.1 Required Capabilities vs. Availability
| Capability | Criticality | Availability | Gap | Mitigation |
|---|---|---|---|---|
| TypeScript + React | High | ✅ Common | None | — |
| Python (AsyncIO, FastAPI) | High | ✅ Common | None | — |
| PostgreSQL (RLS, partitioning, triggers) | High | ⚠️ Moderate | Advanced RLS + partitioning | Hire or upskill; DB consultant for initial design |
| FDA 21 CFR Part 11 expertise | Critical | ❌ Rare | Regulatory knowledge | Hire compliance engineer with pharma QMS background |
| HIPAA technical safeguards | High | ⚠️ Moderate | PHI handling specifics | Training + consulting engagement |
| AI/LLM agent orchestration | High | ❌ Rare | Emerging skill | Internal training; leverage Anthropic documentation |
| Eclipse Theia / Monaco | Medium | ❌ Rare | IDE platform expertise | 1 specialist hire or contract |
| Event-driven architecture (NATS) | Medium | ⚠️ Moderate | Event sourcing patterns | Training + reference implementation |
| Kubernetes / GKE operations | High | ⚠️ Moderate | Production operations | SRE hire with GKE experience |
| Cryptographic engineering | Medium | ❌ Rare | Vault integration, hash chains, signing | Security consultant for initial implementation |
2.2 Training Plan
| Phase | Duration | Focus | Delivery | Priority Roles |
|---|---|---|---|---|
| Onboarding (Week 1–2) | 2 weeks | WO system architecture, data model, state machine, compliance framework | Self-paced docs + pairing | All engineering |
| Regulatory Foundations (Month 1) | 4 sessions × 2hr | FDA 21 CFR Part 11 deep-dive, HIPAA technical safeguards, SOC 2 controls | Workshop (compliance eng leads) | All engineering |
| Agent Orchestration (Month 1–2) | 3 sessions × 2hr | Multi-agent patterns, model routing, checkpoint management, token economics | Workshop + hands-on lab | Platform + WO product |
| PostgreSQL Advanced (Month 2) | 2 sessions × 3hr | RLS policies, partitioning strategies, trigger-based audit trails, performance tuning | External training + lab | Data platform + all backend |
| Incident Response (Month 2) | 1 day | Gameday exercise: simulated outage, compliance violation, security incident | Tabletop + live drill | All engineering + compliance |
2.3 Critical Hire Priorities
- Compliance Engineer (P0) — Pharma/biotech QMS background, FDA audit experience. This role cannot be replaced by training existing engineers. Estimated time-to-fill: 6–8 weeks.
- SRE with Healthcare SaaS experience (P1) — HIPAA-compliant infrastructure operations. Estimated time-to-fill: 4–6 weeks.
- Senior Python Engineer (AI/agent) (P1) — LLM orchestration, async systems. Estimated time-to-fill: 4–8 weeks.
3. Operational Runbooks
3.1 Runbook Template
Every component in the WO system has a runbook following this structure:
# Runbook: [Component Name]
Last Updated: [Date] | Owner: [Team] | Review Cadence: Quarterly
## Service Overview
- Purpose and business criticality
- Upstream/downstream dependencies
- Data classification level (per 63-data-architecture.md)
- SLA tier (per §8 below)
## Health Checks
| Health Check | Endpoint | Expected | Interval |
|---|---|---|---|
## Common Issues & Resolution
| Symptom | Cause | Diagnostic | Resolution | Escalate |
|---|---|---|---|---|
## Scaling Procedures
- Current capacity (users, WOs/sec, transitions/sec)
- Scale-up trigger thresholds
- Scale-up procedure (manual + auto-scaling config)
- Scale-down cooldown rules
## Emergency Procedures
- Circuit breaker manual override
- Break-glass access activation (per §5.2 of 64-security-architecture.md)
- Compliance engine bypass (NEVER in production; permitted solely in audit-only mode)
- Rollback procedure
## Disaster Recovery
- RPO/RTO for this component (per §7 below)
- Backup verification procedure
- Restore procedure (step-by-step with expected duration)
- Failover trigger and procedure
3.2 WO Lifecycle Engine Runbook
Business Criticality: P1 — Core revenue-generating service
Data Classification: L4 (Regulated — WO records, approvals, audit trail)
SLA: 99.9% availability, P95 < 500ms
| Health Check | Endpoint | Expected | Interval |
|---|---|---|---|
| API health | GET /api/health | {"status":"healthy","db":"connected","nats":"connected"} | 30s |
| State machine | GET /api/health/state-machine | All guard functions loaded, transition map valid | 5 min |
| Audit trail write | POST /api/health/audit-write-test | Write + verify in < 50ms | 1 min |
| Agent connectivity | GET /api/health/agents | Orchestrator reachable, ≥ 1 worker healthy | 30s |
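The aggregated health response above can be sketched as a small probe runner. This is a minimal sketch, not the actual implementation: the probe names and callables are illustrative (real probes would ping PostgreSQL and NATS), but the response shape mirrors the `GET /api/health` expected payload.

```python
from typing import Callable, Dict


def aggregate_health(probes: Dict[str, Callable[[], bool]]) -> dict:
    """Run each dependency probe and summarize overall status.

    Overall status is "healthy" only when every probe passes;
    each dependency reports "connected" or "down" individually,
    matching the GET /api/health response shape above.
    """
    results = {name: ("connected" if probe() else "down")
               for name, probe in probes.items()}
    status = ("healthy"
              if all(v == "connected" for v in results.values())
              else "degraded")
    return {"status": status, **results}


# Illustrative probes standing in for real PostgreSQL / NATS pings.
health = aggregate_health({"db": lambda: True, "nats": lambda: True})
```

A single failing probe flips the top-level status to "degraded" while still naming the broken dependency, which is what the 30-second interval check alerts on.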
| Symptom | Cause | Diagnostic | Resolution | Escalate |
|---|---|---|---|---|
| WO transitions hanging | PostgreSQL connection pool exhausted | Check pg_stat_activity; look for long-running queries | Kill idle connections, scale pool size | Data Platform if recurring |
| Approval signatures failing | Vault connectivity or key rotation in progress | Check vault health endpoint; verify key version | Wait for rotation to complete; or use previous key version | Security team if key is revoked |
| Audit trail writes failing | Disk full on PostgreSQL partition | Check partition size; verify auto-partition creation | Run emergency partition creation; archive old partitions | SRE immediately (compliance violation if audit lost) |
| Agent dispatches rejected | Token budget exhausted for tenant | Check token_usage table for tenant; verify budget config | Increase tenant budget or wait for reset; notify customer | Product team if systemic |
| Circuit breaker OPEN | Agent worker failure rate > threshold | Check worker logs; verify model API health | Wait for half-open probe; or manually close if false positive | Platform Core + SRE |
3.3 Compliance Engine Runbook
Business Criticality: P0 — Compliance failure = regulatory risk
Data Classification: L4 (Regulated)
SLA: 99.95% availability, zero data loss
| Symptom | Cause | Resolution | Escalate |
|---|---|---|---|
| Compliance guard rejecting valid transitions | Policy rule misconfigured or stale | Review guard function against regulatory matrix; update rule | Compliance Eng immediately |
| E-signature verification failing | Hash mismatch (data integrity issue) | DO NOT bypass. Investigate hash chain integrity. Check for concurrent modification. | Security team + Compliance Eng — potential tampering event |
| PHI scanner false positives | Pattern match too aggressive | Adjust confidence thresholds; add to allowlist if appropriate | Compliance Eng (never auto-suppress) |
| Audit trail hash chain break | Missing entry or out-of-order write | Run chain verification job; identify gap; reconstruct if possible | P0 incident — regulatory reporting may be required |
3.4 PostgreSQL State Store Runbook
Business Criticality: P0 — All state depends on database
Data Classification: L4 (stores all regulated data)
SLA: 99.99% availability, RPO = 0 (synchronous replication)
| Symptom | Cause | Resolution | Escalate |
|---|---|---|---|
| Replication lag > 1s | Network issue or standby overloaded | Check network; reduce standby query load; verify WAL shipping | SRE if lag > 10s |
| Connection pool saturated | Query spike or leak | Identify long queries (pg_stat_activity); kill if safe; increase pool | Data Platform if recurring |
| Table bloat (audit_trail) | Append-only table growing without partitioning | Verify partition pruning active; create new partition; archive old | Data Platform for capacity planning |
| RLS policy not enforcing | app.tenant_id not set on connection | CRITICAL — tenant data leak. Kill affected connections immediately. Audit affected queries. | P0 incident — Security team + Compliance Eng |
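The `app.tenant_id` failure mode above is best prevented by failing closed at connection checkout. A minimal sketch, assuming the RLS policies read `current_setting('app.tenant_id')` (the helper name is hypothetical):

```python
def tenant_session_query(tenant_id: str):
    """Parameterized query that scopes a connection to one tenant.

    RLS policies filter on current_setting('app.tenant_id'); a
    connection that never runs this sees rows unscoped -- the P0
    leak scenario above. Raising on a missing tenant_id means no
    query can execute without the session variable set.
    """
    if not tenant_id or not tenant_id.strip():
        raise ValueError("tenant_id is required before any query")
    # set_config(name, value, is_local=false): session-scoped setting.
    return ("SELECT set_config('app.tenant_id', %s, false)", (tenant_id,))
```

The application pool would execute this immediately after every checkout; parameterization avoids interpolating the tenant id into SQL text.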
4. Cost of Ownership Model
4.1 Infrastructure Cost Projection
| Resource | Y1 (Monthly) | Y2 (Monthly) | Y3 (Monthly) | Notes |
|---|---|---|---|---|
| GKE cluster (3 nodes, e2-standard-4) | $800 | $1,200 | $2,400 | Scale with customer count |
| Cloud SQL (PostgreSQL, HA) | $500 | $1,000 | $2,500 | Scale with data volume |
| NATS cluster (3 nodes) | $200 | $300 | $600 | Scale with event volume |
| Cloud KMS + Secret Manager | $50 | $100 | $200 | Scale with key count |
| Cloud Storage (backups, evidence) | $100 | $300 | $800 | Scale with audit trail size |
| Networking (egress, LB) | $150 | $300 | $600 | Scale with traffic |
| Observability (OTEL, Prometheus, Grafana Cloud) | $200 | $400 | $800 | Scale with metrics volume |
| AI model tokens | $500 | $2,000 | $8,000 | Scale with WO volume; routing reduces |
| Total Infrastructure | $2,500 | $5,600 | $15,900 | |
| Annual | $30K | $67K | $191K | |
| Per Customer (Y2, 25 customers) | — | $224/mo | — | Infrastructure marginal cost |
4.2 Staffing Cost Projection
| Role | Y1 (FTE × Cost) | Y2 | Y3 |
|---|---|---|---|
| Engineering (4 FTE) | $600K | $900K (6 FTE) | $1.4M (9 FTE) |
| Compliance Eng (1 FTE) | $150K | $200K | $300K (2 FTE) |
| SRE (0.5 FTE) | $75K | $150K (1 FTE) | $225K (1.5 FTE) |
| Product (0.5 FTE) | $75K | $150K (1 FTE) | $200K (1 FTE) |
| Total Staffing | $900K | $1.4M | $2.125M |
4.3 Total Cost of Ownership
| Category | Y1 | Y2 | Y3 | 3-Year Total |
|---|---|---|---|---|
| Infrastructure | $30K | $67K | $191K | $288K |
| Staffing | $900K | $1.4M | $2.125M | $4.425M |
| Licensing (third-party) | $15K | $30K | $50K | $95K |
| Compliance consulting | $50K | $25K | $25K | $100K |
| Total | $995K | $1.522M | $2.391M | $4.908M |
| Revenue (from 60-business-model.md) | $960K | $4.6M | $16.3M | $21.86M |
| Gross Profit | ($35K) | $3.08M | $13.9M | $16.95M |
Y1 is near break-even (pre-revenue buildup). Y2 crosses into strong profitability as multi-tenant leverage kicks in. Y3 infrastructure costs grow sublinearly vs. revenue because shared platform scales efficiently.
4.4 Opportunity Cost Assessment
| If We Do This | We Don't Do This | Revenue Impact | Risk |
|---|---|---|---|
| Gap closure series (G01–G28) | New feature development (2–3 months) | $0 direct; prevents $200K+ per compliance finding | Compliance risk elimination — non-negotiable |
| Vendor portal (V1) | Advanced agent capabilities | +$200/vendor/mo × vendor seats | Delayed competitive moat |
| Multi-region deployment | Feature parity with MasterControl | Enables enterprise segment ($216K+ ACV) | Slower mid-market growth |
| ISO 13485 compliance framework | Fintech SOC2 expansion | Opens medical device segment | Delays fintech revenue |
5. Vendor Risk Assessment
5.1 Critical Vendor Dependencies
| Vendor | Dependency Level | What They Provide | Exit Difficulty | Alternative |
|---|---|---|---|---|
| Anthropic (Claude) | Tier 1: Core | Primary AI model (Opus/Sonnet/Haiku) | Medium — model routing abstraction enables swap | OpenAI, Google, open-source models |
| Google Cloud (GKE, Cloud SQL) | Tier 1: Core | Compute, database, networking | High — deep GCP integration | AWS (EKS, RDS), Azure (AKS, Flex Server) |
| HashiCorp Vault | Tier 2: Strategic | Secrets management, credential vault | Medium — Vault API is industry standard | GCP Secret Manager, AWS Secrets Manager |
| NATS | Tier 2: Strategic | Event bus, async messaging | Low — NATS is open-source, self-hosted | Redis Streams, Apache Kafka, RabbitMQ |
| Grafana Labs | Tier 3: Standard | Observability dashboards | Low — OTEL is vendor-neutral | Datadog, New Relic, self-hosted Grafana |
| Auth0 / Okta | Tier 2: Strategic | Identity provider, SSO | Medium — OIDC is standard protocol | Keycloak (self-hosted), Azure AD, Google Identity |
5.2 Exit Strategy per Vendor
| Vendor | Exit Trigger | Migration Path | Estimated Duration | Data Portability |
|---|---|---|---|---|
| Anthropic | Model quality degradation, pricing increase > 50%, BAA revocation | Switch model router to OpenAI or open-source; retrain agent prompts | 4–6 weeks | N/A (stateless API calls) |
| GCP | Pricing, region availability, compliance certification loss | Terraform-based infra; containerized workloads; PostgreSQL standard | 8–12 weeks | pg_dump + storage export |
| Vault | Licensing, security vulnerability | Migrate secrets to cloud-native (GCP SM); update vault:// references | 2–4 weeks | Secret export + re-encrypt |
| NATS | Performance, maintainability | Replace with Redis Streams (similar pub/sub semantics) | 4–6 weeks | Event replay from audit trail |
5.3 Vendor Health Monitoring
| Signal | Frequency | Red Flag Threshold |
|---|---|---|
| API availability | Real-time | < 99.5% over 30 days |
| Pricing changes | Per announcement | > 25% increase without proportional value |
| Security incidents | Per disclosure | Any breach affecting customer data category |
| BAA/DPA compliance | Annually | Refusal to renew or terms degradation |
| Financial health | Quarterly | Negative press, layoffs > 20%, funding concerns |
| Support responsiveness | Monthly | P1 response > 4 hours consistently |
6. Disaster Recovery
6.1 RPO/RTO by Data Tier
| Tier | Data Types | RPO | RTO | Method | Cost |
|---|---|---|---|---|---|
| Tier 1: Zero-loss | Audit trail, e-signatures, compliance records | 0 | < 15 min | Synchronous replication to standby | High (dedicated standby) |
| Tier 2: Near-zero | Work orders, approvals, state machine state | < 1 min | < 30 min | Streaming replication (WAL) | Medium |
| Tier 3: Hourly | Agent execution logs, token usage, metrics | < 1 hour | < 2 hours | Point-in-time recovery (PITR) | Low |
| Tier 4: Daily | Temporary data, cache, session state | < 24 hours | < 4 hours | Daily snapshots | Minimal |
6.2 Backup Strategy
| Component | Method | Frequency | Retention | Verification |
|---|---|---|---|---|
| PostgreSQL (full) | pg_basebackup to Cloud Storage | Daily (2 AM UTC) | 30 days | Weekly restore test to ephemeral instance |
| PostgreSQL (WAL) | Continuous WAL archiving | Continuous | 7 days | Verified via replication lag monitoring |
| NATS (event replay) | Events reconstructable from audit trail | N/A (derived) | N/A | Replay test monthly |
| Vault (secrets) | Vault snapshot to encrypted Cloud Storage | Daily | 90 days | Quarterly restore test |
| Application config | Git (infrastructure-as-code, Terraform) | Per commit | Indefinite | Every deployment is a restore test |
| Container images | Container registry (GCR) with signing | Per build | 90 days (tagged releases indefinite) | Every deployment |
6.3 Recovery Procedures
Database Failover (Automated)
Trigger: Primary PostgreSQL unresponsive for > 30 seconds
OR replication lag > 60 seconds with primary unhealthy
Procedure:
1. Cloud SQL automatic failover promotes standby to primary
2. Application connection pool detects new primary (< 30s)
3. NATS publishes FAILOVER_COMPLETE event
4. Observability: verify query latency, replication re-established
5. Post-failover: create new standby from promoted primary
Verification:
- All RLS policies active on new primary
- Audit trail writes resuming without gaps
- E-signature verification passing
- No tenant data exposure during failover window
Duration: 60–120 seconds total
Full Region Failure (Manual)
Trigger: GCP region outage > 15 minutes with no ETA
OR regulatory requirement to relocate data
Procedure:
1. Activate DR region infrastructure (pre-provisioned via Terraform)
2. Restore PostgreSQL from latest WAL archive (RPO: < 1 min for Tier 1/2)
3. Update DNS to DR region load balancer
4. Verify RLS policies, audit trail integrity, e-signature chain
5. Notify customers of region failover via status page
6. Run compliance verification suite (subset of OQ)
Duration: 2–4 hours (dominated by data restore + verification)
6.4 DR Testing Schedule
| Test Type | Frequency | Scope | Duration | Participants |
|---|---|---|---|---|
| Automated failover | Monthly | Database failover to standby | 15 min | SRE (automated, reviewed) |
| Backup restore | Weekly | Restore latest backup to ephemeral instance, verify | 30 min | SRE (automated) |
| Full DR drill | Quarterly | Complete region failover simulation (non-production) | 4 hours | SRE + Engineering + Compliance |
| Tabletop exercise | Semi-annually (twice per year) | Scenario walkthrough (no live systems) | 2 hours | All teams + leadership |
| Compliance DR review | Annually | Auditor reviews DR evidence, procedures, test results | 1 day | Compliance Eng + External auditor |
7. Business Continuity
7.1 Degraded Mode Operations
When full functionality is unavailable, the system can operate in degraded modes:
| Scenario | Degraded Capability | What Still Works | Recovery |
|---|---|---|---|
| AI model provider down | No agent-driven transitions | Manual WO transitions, approvals, audit trail, e-signatures | Model router failover to secondary provider |
| NATS down | No async events, no real-time notifications | Synchronous WO operations, direct database writes | Events queued in PostgreSQL; replay on NATS recovery |
| Vault down | No new credential issuance | Cached credentials (TTL-based); no new agent executions | Vault recovery; credential refresh cycle |
| Observability down | No dashboards, no alerting | All business operations continue normally | Grafana/Prometheus recovery; no data loss (buffered) |
| Primary region degraded | Increased latency | Full functionality at higher latency | Auto-scaling in region; or regional failover |
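The NATS-down row above (queue events, replay on recovery) can be sketched as an outbox buffer. In production the buffer would be a PostgreSQL outbox table written in the same transaction as the WO change; the in-memory deque here is a stand-in, and all names are illustrative:

```python
from collections import deque


class OutboxBuffer:
    """Queue events locally while NATS is down; replay in order on recovery."""

    def __init__(self, publish):
        self.publish = publish   # callable: the real NATS publish
        self.pending = deque()
        self.nats_up = True

    def emit(self, event: dict):
        """Publish if NATS is up; otherwise buffer (never drop)."""
        if self.nats_up:
            try:
                self.publish(event)
                return
            except ConnectionError:
                self.nats_up = False  # enter degraded mode
        self.pending.append(event)

    def replay(self):
        """Call when NATS recovers; drains buffered events in original order."""
        self.nats_up = True
        while self.pending:
            self.publish(self.pending.popleft())
```

Ordering is preserved because the deque drains FIFO, which matters for consumers that assume state-transition events arrive in sequence.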
7.2 Communication Plan
| Severity | Communication Channel | Timeline | Audience |
|---|---|---|---|
| P0 (data loss risk) | Status page + email + phone to affected customers | < 15 min initial; 30-min updates | All affected customers + internal leadership |
| P1 (service degraded) | Status page + email | < 30 min initial; hourly updates | All customers |
| P2 (non-critical issue) | Status page | < 1 hour; updates as available | Subscribers |
| P3 (cosmetic/minor) | Internal tracking only | Next business day | Internal only |
7.3 Regulatory Continuity
For FDA-regulated customers, business continuity has specific compliance requirements:
| Requirement | Implementation | Evidence |
|---|---|---|
| Audit trail continuity during DR | Write-ahead log ensures no gaps; chain verification post-failover | Chain verification report |
| E-signature validity after failover | Signatures tied to record content, not infrastructure; re-verifiable | Signature verification test results |
| Data integrity post-restore | Checksum verification of all L4 data after restore | Integrity verification report |
| Regulatory notification | If DR event affects validated system records, notify customer's QA | Notification log, customer acknowledgment |
8. SLA Framework
8.1 External SLAs (Customer-Facing)
| Tier | Availability | API P95 Latency | Audit Trail Write | Support Response |
|---|---|---|---|---|
| Starter | 99.5% | < 1s | < 200ms | < 24 hours (business) |
| Professional | 99.9% | < 500ms | < 100ms | < 4 hours (business) |
| Enterprise | 99.95% | < 300ms | < 50ms | < 1 hour (24/7) |
| Enterprise Plus | 99.99% | < 200ms | < 50ms | < 15 min (24/7, named) |
8.2 Internal SLOs (Engineering Targets)
Internal targets are stricter than external SLAs to provide margin:
| Component | SLO Target | Measurement | Alert Threshold |
|---|---|---|---|
| WO API availability | 99.95% | Synthetic monitoring (every 30s) | < 99.9% over 1 hour |
| State transition latency | P50 < 50ms, P95 < 200ms, P99 < 1s | Request tracing | P95 > 300ms |
| Audit trail write latency | P50 < 10ms, P95 < 50ms | Write acknowledgment timing | P95 > 75ms |
| Agent dispatch latency | < 1s from request to first agent action | Orchestrator instrumentation | > 2s |
| E-signature verification | < 100ms per signature | Verification request timing | > 200ms |
| Database query latency | P50 < 5ms, P95 < 50ms | pg_stat_statements | P95 > 75ms |
| Event delivery latency | < 500ms from publish to consumer | NATS metrics | > 1s |
| Error budget | 0.05% of requests (99.95% target) | Error rate tracking | > 0.03% (warning), > 0.04% (critical) |
8.3 Error Budget Policy
Monthly error budget = 100% - SLO = 0.05% of requests
Budget tracking:
Green: < 50% budget consumed → Normal development velocity
Yellow: 50–80% budget consumed → Reduce risky deployments, increase testing
Orange: 80–95% budget consumed → Freeze non-critical changes, focus on reliability
Red: > 95% budget consumed → All-hands reliability focus, no new features
Budget resets monthly. Unused budget does not carry over.
Compliance-critical paths (audit trail, e-signatures) have separate, stricter budgets.
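The budget bands above reduce to a small calculation. A sketch, assuming consumption is measured as observed error rate divided by the budget (boundary values at exactly 50%/80%/95% are a judgment call, not specified above):

```python
def budget_status(error_rate: float, slo: float = 0.9995) -> str:
    """Map consumed error budget to the policy color.

    budget = 1 - SLO (0.05% for the 99.95% target);
    consumed = observed error rate / budget.
    """
    budget = 1.0 - slo
    consumed = error_rate / budget
    if consumed < 0.50:
        return "green"
    if consumed < 0.80:
        return "yellow"
    if consumed <= 0.95:
        return "orange"
    return "red"
```

For example, a 0.03% observed error rate against the 0.05% budget is 60% consumed, landing in yellow (reduce risky deployments).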
9. On-Call Design
9.1 Rotation Structure
| Tier | Who | Scope | Hours | Response Time |
|---|---|---|---|---|
| Primary on-call | SRE + 1 backend engineer (rotating) | All P0/P1 alerts | 24/7 | < 15 min acknowledge |
| Secondary on-call | Platform Core engineer (rotating) | Escalation from primary | 24/7 | < 30 min acknowledge |
| Compliance on-call | Compliance engineer | Compliance-specific alerts only | Business hours + P0 24/7 | < 1 hour (business), < 30 min (P0) |
| Leadership escalation | CTO / VP Eng | P0 unresolved > 1 hour | 24/7 | < 15 min |
9.2 Rotation Schedule
- Weekly rotation (Monday 10 AM UTC to Monday 10 AM UTC)
- Minimum 3 people in primary rotation to avoid burnout (max 1 week in 3)
- Handoff meeting at rotation change: open issues, upcoming maintenance, known risks
- Post-on-call debrief for any P0/P1 incidents during the week
9.3 Alert Routing
| Alert Category | Examples | Primary Route | Escalation |
|---|---|---|---|
| Infrastructure | Node unhealthy, disk > 85%, memory pressure | SRE (PagerDuty) | Platform Core |
| Application | Error rate spike, latency degradation, circuit breaker OPEN | Primary on-call (PagerDuty) | Secondary on-call |
| Compliance | Audit trail write failure, hash chain break, PHI detection | Compliance on-call (PagerDuty + Slack) | Leadership |
| Security | Failed auth spike, privilege escalation, break-glass activation | Security + Primary on-call (PagerDuty) | Leadership |
| Business | Tenant approaching WO limit, token budget exhaustion | Slack notification (non-urgent) | Product team |
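The routing table above can be encoded so that unknown alert categories fail loudly rather than being dropped. A sketch only: the route target strings are invented placeholders, not real PagerDuty or Slack identifiers.

```python
# category: (primary route, escalation target) -- mirrors the table above.
ROUTES = {
    "infrastructure": ("pagerduty:sre", "platform-core"),
    "application":    ("pagerduty:primary-oncall", "secondary-oncall"),
    "compliance":     ("pagerduty:compliance+slack", "leadership"),
    "security":       ("pagerduty:security+primary", "leadership"),
    "business":       ("slack:notify", "product"),
}


def route_alert(category: str, escalated: bool = False) -> str:
    """Resolve the delivery target for an alert category.

    Unrecognized categories default to primary on-call with
    leadership escalation -- a loud fallback, never a silent drop.
    """
    primary, escalation = ROUTES.get(
        category, ("pagerduty:primary-oncall", "leadership"))
    return escalation if escalated else primary
```

Keeping the mapping in one table makes the monthly alert-hygiene review (§9.5) a diff of this structure rather than a hunt through scattered config.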
9.4 Incident Response Playbook
P0 Incident Flow:
0 min: Alert fires → PagerDuty pages primary on-call
5 min: Primary acknowledges → assess severity → confirm P0
10 min: Open incident channel (Slack #incident-YYYYMMDD-NNN)
Assign roles: Incident Commander (primary), Scribe (secondary), Comms (product)
15 min: First customer communication (status page update)
30 min: If unresolved → escalate to secondary on-call + relevant team lead
60 min: If unresolved → escalate to leadership
Compliance on-call verifies: audit trail integrity, no data exposure
Ongoing: 30-minute status updates until resolved
Resolution:
- Verify: service restored, data integrity confirmed, compliance checks pass
- Customer notification: "Resolved" status page update
- Within 24 hours: Preliminary incident report
- Within 5 business days: Full post-incident review (blameless)
- PIR outputs: action items with owners and deadlines, runbook updates
9.5 On-Call Wellness
| Practice | Implementation |
|---|---|
| Alert hygiene | Monthly alert review; snooze or delete alerts with < 10% actionable rate |
| Toil budget | Max 30% of on-call time on toil; remaining time for reliability improvements |
| Compensation | On-call stipend + incident response premium; time-off-in-lieu for overnight pages |
| Fatigue prevention | Max 2 P0/P1 incidents per rotation before mandatory rest day; no back-to-back weeks |
| Alert-free goals | Team OKR: reduce alert volume 10% per quarter through reliability investment |
10. Localization & Multi-Market Readiness
10.1 Current Scope
The WO system launches for US-based FDA-regulated organizations. Multi-market readiness is planned for Y2/Y3 expansion.
10.2 Internationalization Architecture (i18n)
| Component | Approach | Status |
|---|---|---|
| UI strings | Externalized to JSON resource bundles (ICU MessageFormat) | Design — implement before first non-US customer |
| Date/time | ISO 8601 storage; locale-aware display (Intl API) | Partially implemented (all timestamps UTC in DB) |
| Number/currency | Locale-aware formatting (Intl.NumberFormat) | Not started |
| Regulatory framework | Configurable compliance rules per jurisdiction | Architecture supports (compliance engine is rule-based) |
| Audit trail language | Audit events in English (system language); UI translatable | Design decision: audit in English for FDA consistency |
10.3 Regulatory Jurisdiction Mapping
| Market | Regulations | Configuration Required | Priority |
|---|---|---|---|
| US (primary) | FDA 21 CFR Part 11, HIPAA, SOC 2 | Implemented | Y1 |
| EU | EMA Annex 11, GDPR, EU MDR 2017/745 | Compliance engine rules + data residency | Y2 |
| UK | MHRA GxP, UK GDPR | Similar to EU with UK-specific rules | Y2 |
| Brazil | ANVISA, LGPD | Compliance rules + Brazil data residency | Y3 |
| Japan | PMDA, APPI | Compliance rules + Japan data residency | Y3 |
10.4 Data Residency for International
| Region | Deployment | Data Classification Impacted | Implementation |
|---|---|---|---|
| US (us-central1) | Primary | All tiers | Deployed |
| EU (europe-west1) | Required for EU customers | L3, L4 (PII, regulated) | Terraform-ready, deploy when first EU customer |
| Brazil (southamerica-east1) | Required for LGPD | L3, L4 | Planned Y3 |
Tenant-level configuration: tenant.data_residency_region controls where all L3/L4 data is stored and processed. L0/L1/L2 can be served from any region (CDN, global read replicas).
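The residency rule above can be sketched as a single routing decision. Names are illustrative (`storage_region` is a hypothetical helper, not the actual service):

```python
# Classification tiers that may be served from any region
# (CDN, global read replicas), per the policy above.
REGION_OK_ANYWHERE = {"L0", "L1", "L2"}


def storage_region(tenant_region: str, classification: str,
                   default_region: str = "us-central1") -> str:
    """Pick the region where a record may be stored or processed.

    L3/L4 (PII, regulated) must stay in the tenant's configured
    data_residency_region; failing loudly when it is unset
    prevents regulated data from silently landing in the default.
    """
    if classification in REGION_OK_ANYWHERE:
        return default_region
    if not tenant_region:
        raise ValueError("L3/L4 data requires tenant.data_residency_region")
    return tenant_region
```

Enforcing the check at write time, rather than trusting callers, is what makes the EU deployment a Terraform switch instead of an application rewrite.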
Operational readiness is not a launch checklist — it's a continuous capability. Every new feature, every new customer, every new region requires updating the runbooks, DR procedures, on-call scope, and cost model. This document is alive; it degrades the moment it stops being updated.
Appendix: Cross-Reference to Gap Closure Series
| Gap ID | Operational Readiness Impact | This Document Section |
|---|---|---|
| G01 (Credential Vault) | Vault runbook required; vendor risk for HashiCorp | §3.3, §5.1 |
| G08 (FDA Completeness) | Compliance validation in DR procedures | §6.3, §7.3 |
| G13 (SIEM Integration) | Alert routing for security events | §9.3 |
| G14 (Incident Response WO) | P0 incident auto-creates WO | §9.4 |
| G17 (Optimistic Locking) | Conflict resolution runbook | §3.2 |
| G27 (Artifact Updates) | Runbook review cadence | §3.1 (template) |
| G28 (Dashboard Sync) | Monitoring dashboard accuracy | §8.2 |
Copyright 2026 AZ1.AI Inc. All rights reserved.
Developer: Hal Casteel, CEO/CTO
Product: CODITECT-BIO-QMS | Part of the CODITECT Product Suite
Classification: Internal - Confidential