Work Order QMS Module — Operational Readiness
Classification: Internal — Operations & Engineering
Date: 2026-02-13
Artifact: 66 of WO System Series
Status: Proposed
Source Artifacts: 12-sdd.md, 13-tdd.md, 14-c4-architecture.md, 25-agent-orchestration-spec.md, 58-gap-closure-prompts.md, 64-security-architecture.md
1. Team Topology
1.1 Conway's Law Alignment
The WO system architecture implies a team structure. Misalignment between architecture and org chart creates friction, slower delivery, and cross-team dependency hell. The mapping below reflects the container decomposition from the C4 model (14-c4-architecture.md).
| Architecture Component | Owning Team | Team Type | Headcount | Key Dependency |
|---|---|---|---|---|
| Agent Orchestrator | Platform Core | Stream-aligned | 3–4 | AI model providers |
| Compliance Engine | Compliance Eng | Enabling | 2–3 | Regulatory SMEs |
| WO Lifecycle Engine | WO Product | Stream-aligned | 3–4 | Platform Core (state machine) |
| State Store (PostgreSQL) | Data Platform | Platform | 2 | Infrastructure |
| API Gateway | Platform Core | Platform | 1–2 | Identity provider |
| Event Bus (NATS) | Data Platform | Platform | 1 | Infrastructure |
| IDE Shell (Theia) | DX Team | Stream-aligned | 2–3 | Platform Core (APIs) |
| Observability Stack | SRE | Enabling | 2 | All teams |
| Vendor Portal | WO Product | Stream-aligned | 1–2 | API Gateway |
1.2 Team Interaction Modes
Platform Core ←→ WO Product (Collaboration: shared state machine, agent dispatch)
Platform Core → Compliance Eng (X-as-a-Service: compliance rules consumed via API)
Data Platform → All teams (X-as-a-Service: PostgreSQL, NATS, infra primitives)
SRE ←→ All teams (Collaboration: incident response, runbook co-authoring)
Compliance Eng → WO Product (Facilitating: compliance requirements, audit prep)
DX Team ←→ Platform Core (Collaboration: IDE integrates with orchestrator)
1.3 Minimum Viable Team (MVP — Launch)
For initial launch with 5–10 customers, compress to:
| Role | FTE | Covers |
|---|---|---|
| Full-stack engineer (TS/Python) | 2 | WO lifecycle, API, agent orchestration |
| Platform/infra engineer | 1 | PostgreSQL, NATS, deployment, observability |
| Compliance engineer | 1 | Regulatory validation, audit trail, e-signatures |
| SRE / DevOps (fractional) | 0.5 | On-call, monitoring, incident response |
| Product / CTO | 0.5 | Architecture decisions, customer-facing |
| Total | 5 | |
Scale to full topology (15–20 headcount) at 25+ customers / $4M+ ARR.
2. Skills Gap Analysis
2.1 Required Capabilities vs. Availability
| Capability | Criticality | Availability | Gap | Mitigation |
|---|---|---|---|---|
| TypeScript + React | High | ✅ Common | None | — |
| Python (AsyncIO, FastAPI) | High | ✅ Common | None | — |
| PostgreSQL (RLS, partitioning, triggers) | High | ⚠️ Moderate | Advanced RLS + partitioning | Hire or upskill; DB consultant for initial design |
| FDA 21 CFR Part 11 expertise | Critical | ❌ Rare | Regulatory knowledge | Hire compliance engineer with pharma QMS background |
| HIPAA technical safeguards | High | ⚠️ Moderate | PHI handling specifics | Training + consulting engagement |
| AI/LLM agent orchestration | High | ❌ Rare | Emerging skill | Internal training; leverage Anthropic documentation |
| Eclipse Theia / Monaco | Medium | ❌ Rare | IDE platform expertise | 1 specialist hire or contract |
| Event-driven architecture (NATS) | Medium | ⚠️ Moderate | Event sourcing patterns | Training + reference implementation |
| Kubernetes / GKE operations | High | ⚠️ Moderate | Production operations | SRE hire with GKE experience |
| Cryptographic engineering | Medium | ❌ Rare | Vault integration, hash chains, signing | Security consultant for initial implementation |
2.2 Training Plan
| Phase | Duration | Focus | Delivery | Priority Roles |
|---|---|---|---|---|
| Onboarding (Week 1–2) | 2 weeks | WO system architecture, data model, state machine, compliance framework | Self-paced docs + pairing | All engineering |
| Regulatory Foundations (Month 1) | 4 sessions × 2hr | FDA 21 CFR Part 11 deep-dive, HIPAA technical safeguards, SOC 2 controls | Workshop (compliance eng leads) | All engineering |
| Agent Orchestration (Month 1–2) | 3 sessions × 2hr | Multi-agent patterns, model routing, checkpoint management, token economics | Workshop + hands-on lab | Platform + WO product |
| PostgreSQL Advanced (Month 2) | 2 sessions × 3hr | RLS policies, partitioning strategies, trigger-based audit trails, performance tuning | External training + lab | Data platform + all backend |
| Incident Response (Month 2) | 1 day | Gameday exercise: simulated outage, compliance violation, security incident | Tabletop + live drill | All engineering + compliance |
2.3 Critical Hire Priorities
- Compliance Engineer (P0) — Pharma/biotech QMS background, FDA audit experience. This role cannot be replaced by training existing engineers. Estimated time-to-fill: 6–8 weeks.
- SRE with Healthcare SaaS experience (P1) — HIPAA-compliant infrastructure operations. Estimated time-to-fill: 4–6 weeks.
- Senior Python Engineer (AI/agent) (P1) — LLM orchestration, async systems. Estimated time-to-fill: 4–8 weeks.
3. Operational Runbooks
3.1 Runbook Template
Every component in the WO system has a runbook following this structure:
# Runbook: [Component Name]
Last Updated: [Date] | Owner: [Team] | Review Cadence: Quarterly
## Service Overview
- Purpose and business criticality
- Upstream/downstream dependencies
- Data classification level (per 63-data-architecture.md)
- SLA tier (per §8 below)
## Health Checks
| Health Check | Endpoint | Expected | Interval |
|---|---|---|---|
## Common Issues & Resolution
| Symptom | Cause | Diagnostic | Resolution | Escalate |
|---|---|---|---|---|
## Scaling Procedures
- Current capacity (users, WOs/sec, transitions/sec)
- Scale-up trigger thresholds
- Scale-up procedure (manual + auto-scaling config)
- Scale-down cooldown rules
## Emergency Procedures
- Circuit breaker manual override
- Break-glass access activation (per §5.2 of 64-security-architecture.md)
- Compliance engine bypass (NEVER in production; permitted solely in audit-only mode)
- Rollback procedure
## Disaster Recovery
- RPO/RTO for this component (per §7 below)
- Backup verification procedure
- Restore procedure (step-by-step with expected duration)
- Failover trigger and procedure
3.2 WO Lifecycle Engine Runbook
Business Criticality: P1 — Core revenue-generating service
Data Classification: L4 (Regulated — WO records, approvals, audit trail)
SLA: 99.9% availability, P95 < 500ms
| Health Check | Endpoint | Expected | Interval |
|---|---|---|---|
| API health | GET /api/health | {"status":"healthy","db":"connected","nats":"connected"} | 30s |
| State machine | GET /api/health/state-machine | All guard functions loaded, transition map valid | 5 min |
| Audit trail write | POST /api/health/audit-write-test | Write + verify in < 50ms | 1 min |
| Agent connectivity | GET /api/health/agents | Orchestrator reachable, ≥ 1 worker healthy | 30s |
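The aggregated health response above can be sketched as a small probe runner. This is a minimal sketch, not the actual implementation: the probe names and callables are illustrative (real probes would ping PostgreSQL and NATS), but the response shape mirrors the `GET /api/health` expected payload.

```python
from typing import Callable, Dict


def aggregate_health(probes: Dict[str, Callable[[], bool]]) -> dict:
    """Run each dependency probe and summarize overall status.

    Overall status is "healthy" only when every probe passes;
    each dependency reports "connected" or "down" individually,
    matching the GET /api/health response shape above.
    """
    results = {name: ("connected" if probe() else "down")
               for name, probe in probes.items()}
    status = ("healthy"
              if all(v == "connected" for v in results.values())
              else "degraded")
    return {"status": status, **results}


# Illustrative probes standing in for real PostgreSQL / NATS pings.
health = aggregate_health({"db": lambda: True, "nats": lambda: True})
```

A single failing probe flips the top-level status to "degraded" while still naming the broken dependency, which is what the 30-second interval check alerts on.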
| Symptom | Cause | Diagnostic | Resolution | Escalate |
|---|---|---|---|---|
| WO transitions hanging | PostgreSQL connection pool exhausted | Check pg_stat_activity; look for long-running queries | Kill idle connections, scale pool size | Data Platform if recurring |
| Approval signatures failing | Vault connectivity or key rotation in progress | Check vault health endpoint; verify key version | Wait for rotation to complete; or use previous key version | Security team if key is revoked |
| Audit trail writes failing | Disk full on PostgreSQL partition | Check partition size; verify auto-partition creation | Run emergency partition creation; archive old partitions | SRE immediately (compliance violation if audit lost) |
| Agent dispatches rejected | Token budget exhausted for tenant | Check token_usage table for tenant; verify budget config | Increase tenant budget or wait for reset; notify customer | Product team if systemic |
| Circuit breaker OPEN | Agent worker failure rate > threshold | Check worker logs; verify model API health | Wait for half-open probe; or manually close if false positive | Platform Core + SRE |
3.3 Compliance Engine Runbook
Business Criticality: P0 — Compliance failure = regulatory risk
Data Classification: L4 (Regulated)
SLA: 99.95% availability, zero data loss
| Symptom | Cause | Resolution | Escalate |
|---|---|---|---|
| Compliance guard rejecting valid transitions | Policy rule misconfigured or stale | Review guard function against regulatory matrix; update rule | Compliance Eng immediately |
| E-signature verification failing | Hash mismatch (data integrity issue) | DO NOT bypass. Investigate hash chain integrity. Check for concurrent modification. | Security team + Compliance Eng — potential tampering event |
| PHI scanner false positives | Pattern match too aggressive | Adjust confidence thresholds; add to allowlist if appropriate | Compliance Eng (never auto-suppress) |
| Audit trail hash chain break | Missing entry or out-of-order write | Run chain verification job; identify gap; reconstruct if possible | P0 incident — regulatory reporting may be required |
3.4 PostgreSQL State Store Runbook
Business Criticality: P0 — All state depends on database
Data Classification: L4 (stores all regulated data)
SLA: 99.99% availability, RPO = 0 (synchronous replication)
| Symptom | Cause | Resolution | Escalate |
|---|---|---|---|
| Replication lag > 1s | Network issue or standby overloaded | Check network; reduce standby query load; verify WAL shipping | SRE if lag > 10s |
| Connection pool saturated | Query spike or leak | Identify long queries (pg_stat_activity); kill if safe; increase pool | Data Platform if recurring |
| Table bloat (audit_trail) | Append-only table growing without partitioning | Verify partition pruning active; create new partition; archive old | Data Platform for capacity planning |
| RLS policy not enforcing | app.tenant_id not set on connection | CRITICAL — tenant data leak. Kill affected connections immediately. Audit affected queries. | P0 incident — Security team + Compliance Eng |
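The `app.tenant_id` failure mode above is best prevented by failing closed at connection checkout. A minimal sketch, assuming the RLS policies read `current_setting('app.tenant_id')` (the helper name is hypothetical):

```python
def tenant_session_query(tenant_id: str):
    """Parameterized query that scopes a connection to one tenant.

    RLS policies filter on current_setting('app.tenant_id'); a
    connection that never runs this sees rows unscoped -- the P0
    leak scenario above. Raising on a missing tenant_id means no
    query can execute without the session variable set.
    """
    if not tenant_id or not tenant_id.strip():
        raise ValueError("tenant_id is required before any query")
    # set_config(name, value, is_local=false): session-scoped setting.
    return ("SELECT set_config('app.tenant_id', %s, false)", (tenant_id,))
```

The application pool would execute this immediately after every checkout; parameterization avoids interpolating the tenant id into SQL text.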
4. Cost of Ownership Model
4.1 Infrastructure Cost Projection
| Resource | Y1 (Monthly) | Y2 (Monthly) | Y3 (Monthly) | Notes |
|---|---|---|---|---|
| GKE cluster (3 nodes, e2-standard-4) | $800 | $1,200 | $2,400 | Scale with customer count |
| Cloud SQL (PostgreSQL, HA) | $500 | $1,000 | $2,500 | Scale with data volume |
| NATS cluster (3 nodes) | $200 | $300 | $600 | Scale with event volume |
| Cloud KMS + Secret Manager | $50 | $100 | $200 | Scale with key count |
| Cloud Storage (backups, evidence) | $100 | $300 | $800 | Scale with audit trail size |
| Networking (egress, LB) | $150 | $300 | $600 | Scale with traffic |
| Observability (OTEL, Prometheus, Grafana Cloud) | $200 | $400 | $800 | Scale with metrics volume |
| AI model tokens | $500 | $2,000 | $8,000 | Scale with WO volume; routing reduces |
| Total Infrastructure | $2,500 | $5,600 | $15,900 | |
| Annual | $30K | $67K | $191K | |
| Per Customer (Y2, 25 customers) | — | $224/mo | — | Infrastructure marginal cost |
4.2 Staffing Cost Projection
| Role | Y1 (FTE × Cost) | Y2 | Y3 |
|---|---|---|---|
| Engineering (4 FTE) | $600K | $900K (6 FTE) | $1.4M (9 FTE) |
| Compliance Eng (1 FTE) | $150K | $200K | $300K (2 FTE) |
| SRE (0.5 FTE) | $75K | $150K (1 FTE) | $225K (1.5 FTE) |
| Product (0.5 FTE) | $75K | $150K (1 FTE) | $200K (1 FTE) |
| Total Staffing | $900K | $1.4M | $2.125M |
4.3 Total Cost of Ownership
| Category | Y1 | Y2 | Y3 | 3-Year Total |
|---|---|---|---|---|
| Infrastructure | $30K | $67K | $191K | $288K |
| Staffing | $900K | $1.4M | $2.125M | $4.425M |
| Licensing (third-party) | $15K | $30K | $50K | $95K |
| Compliance consulting | $50K | $25K | $25K | $100K |
| Total | $995K | $1.522M | $2.391M | $4.908M |
| Revenue (from 60-business-model.md) | $960K | $4.6M | $16.3M | $21.86M |
| Gross Profit | ($35K) | $3.08M | $13.9M | $16.95M |
Y1 is near break-even (pre-revenue buildup). Y2 crosses into strong profitability as multi-tenant leverage kicks in. Y3 infrastructure costs grow sublinearly vs. revenue because shared platform scales efficiently.
4.4 Opportunity Cost Assessment
| If We Do This | We Don't Do This | Revenue Impact | Risk |
|---|---|---|---|
| Gap closure series (G01–G28) | New feature development (2–3 months) | $0 direct; prevents $200K+ per compliance finding | Compliance risk elimination — non-negotiable |
| Vendor portal (V1) | Advanced agent capabilities | +$200/vendor/mo × vendor seats | Delayed competitive moat |
| Multi-region deployment | Feature parity with MasterControl | Enables enterprise segment ($216K+ ACV) | Slower mid-market growth |
| ISO 13485 compliance framework | Fintech SOC2 expansion | Opens medical device segment | Delays fintech revenue |
5. Vendor Risk Assessment
5.1 Critical Vendor Dependencies
| Vendor | Dependency Level | What They Provide | Exit Difficulty | Alternative |
|---|---|---|---|---|
| Anthropic (Claude) | Tier 1: Core | Primary AI model (Opus/Sonnet/Haiku) | Medium — model routing abstraction enables swap | OpenAI, Google, open-source models |
| Google Cloud (GKE, Cloud SQL) | Tier 1: Core | Compute, database, networking | High — deep GCP integration | AWS (EKS, RDS), Azure (AKS, Flex Server) |
| HashiCorp Vault | Tier 2: Strategic | Secrets management, credential vault | Medium — Vault API is industry standard | GCP Secret Manager, AWS Secrets Manager |
| NATS | Tier 2: Strategic | Event bus, async messaging | Low — NATS is open-source, self-hosted | Redis Streams, Apache Kafka, RabbitMQ |
| Grafana Labs | Tier 3: Standard | Observability dashboards | Low — OTEL is vendor-neutral | Datadog, New Relic, self-hosted Grafana |
| Auth0 / Okta | Tier 2: Strategic | Identity provider, SSO | Medium — OIDC is standard protocol | Keycloak (self-hosted), Azure AD, Google Identity |
5.2 Exit Strategy per Vendor
| Vendor | Exit Trigger | Migration Path | Estimated Duration | Data Portability |
|---|---|---|---|---|
| Anthropic | Model quality degradation, pricing increase > 50%, BAA revocation | Switch model router to OpenAI or open-source; retrain agent prompts | 4–6 weeks | N/A (stateless API calls) |
| GCP | Pricing, region availability, compliance certification loss | Terraform-based infra; containerized workloads; PostgreSQL standard | 8–12 weeks | pg_dump + storage export |
| Vault | Licensing, security vulnerability | Migrate secrets to cloud-native (GCP SM); update vault:// references | 2–4 weeks | Secret export + re-encrypt |
| NATS | Performance, maintainability | Replace with Redis Streams (similar pub/sub semantics) | 4–6 weeks | Event replay from audit trail |
5.3 Vendor Health Monitoring
| Signal | Frequency | Red Flag Threshold |
|---|---|---|
| API availability | Real-time | < 99.5% over 30 days |
| Pricing changes | Per announcement | > 25% increase without proportional value |
| Security incidents | Per disclosure | Any breach affecting customer data category |
| BAA/DPA compliance | Annually | Refusal to renew or terms degradation |
| Financial health | Quarterly | Negative press, layoffs > 20%, funding concerns |
| Support responsiveness | Monthly | P1 response > 4 hours consistently |
6. Disaster Recovery
6.1 RPO/RTO by Data Tier
| Tier | Data Types | RPO | RTO | Method | Cost |
|---|---|---|---|---|---|
| Tier 1: Zero-loss | Audit trail, e-signatures, compliance records | 0 | < 15 min | Synchronous replication to standby | High (dedicated standby) |
| Tier 2: Near-zero | Work orders, approvals, state machine state | < 1 min | < 30 min | Streaming replication (WAL) | Medium |
| Tier 3: Hourly | Agent execution logs, token usage, metrics | < 1 hour | < 2 hours | Point-in-time recovery (PITR) | Low |
| Tier 4: Daily | Temporary data, cache, session state | < 24 hours | < 4 hours | Daily snapshots | Minimal |
6.2 Backup Strategy
| Component | Method | Frequency | Retention | Verification |
|---|---|---|---|---|
| PostgreSQL (full) | pg_basebackup to Cloud Storage | Daily (2 AM UTC) | 30 days | Weekly restore test to ephemeral instance |
| PostgreSQL (WAL) | Continuous WAL archiving | Continuous | 7 days | Verified via replication lag monitoring |
| NATS (event replay) | Events reconstructable from audit trail | N/A (derived) | N/A | Replay test monthly |
| Vault (secrets) | Vault snapshot to encrypted Cloud Storage | Daily | 90 days | Quarterly restore test |
| Application config | Git (infrastructure-as-code, Terraform) | Per commit | Indefinite | Every deployment is a restore test |
| Container images | Container registry (GCR) with signing | Per build | 90 days (tagged releases indefinite) | Every deployment |
6.3 Recovery Procedures
Database Failover (Automated)
Trigger: Primary PostgreSQL unresponsive for > 30 seconds
OR replication lag > 60 seconds with primary unhealthy
Procedure:
1. Cloud SQL automatic failover promotes standby to primary
2. Application connection pool detects new primary (< 30s)
3. NATS publishes FAILOVER_COMPLETE event
4. Observability: verify query latency, replication re-established
5. Post-failover: create new standby from promoted primary
Verification:
- All RLS policies active on new primary
- Audit trail writes resuming without gaps
- E-signature verification passing
- No tenant data exposure during failover window
Duration: 60–120 seconds total
Full Region Failure (Manual)
Trigger: GCP region outage > 15 minutes with no ETA
OR regulatory requirement to relocate data
Procedure:
1. Activate DR region infrastructure (pre-provisioned via Terraform)
2. Restore PostgreSQL from latest WAL archive (RPO: < 1 min for Tier 1/2)
3. Update DNS to DR region load balancer
4. Verify RLS policies, audit trail integrity, e-signature chain
5. Notify customers of region failover via status page
6. Run compliance verification suite (subset of OQ)
Duration: 2–4 hours (dominated by data restore + verification)
6.4 DR Testing Schedule
| Test Type | Frequency | Scope | Duration | Participants |
|---|---|---|---|---|
| Automated failover | Monthly | Database failover to standby | 15 min | SRE (automated, reviewed) |
| Backup restore | Weekly | Restore latest backup to ephemeral instance, verify | 30 min | SRE (automated) |
| Full DR drill | Quarterly | Complete region failover simulation (non-production) | 4 hours | SRE + Engineering + Compliance |
| Tabletop exercise | Semi-annually (twice per year) | Scenario walkthrough (no live systems) | 2 hours | All teams + leadership |
| Compliance DR review | Annually | Auditor reviews DR evidence, procedures, test results | 1 day | Compliance Eng + External auditor |
7. Business Continuity
7.1 Degraded Mode Operations
When full functionality is unavailable, the system can operate in degraded modes:
| Scenario | Degraded Capability | What Still Works | Recovery |
|---|---|---|---|
| AI model provider down | No agent-driven transitions | Manual WO transitions, approvals, audit trail, e-signatures | Model router failover to secondary provider |
| NATS down | No async events, no real-time notifications | Synchronous WO operations, direct database writes | Events queued in PostgreSQL; replay on NATS recovery |
| Vault down | No new credential issuance | Cached credentials (TTL-based); no new agent executions | Vault recovery; credential refresh cycle |
| Observability down | No dashboards, no alerting | All business operations continue normally | Grafana/Prometheus recovery; no data loss (buffered) |
| Primary region degraded | Increased latency | Full functionality at higher latency | Auto-scaling in region; or regional failover |
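The NATS-down row above (queue events, replay on recovery) can be sketched as an outbox buffer. In production the buffer would be a PostgreSQL outbox table written in the same transaction as the WO change; the in-memory deque here is a stand-in, and all names are illustrative:

```python
from collections import deque


class OutboxBuffer:
    """Queue events locally while NATS is down; replay in order on recovery."""

    def __init__(self, publish):
        self.publish = publish   # callable: the real NATS publish
        self.pending = deque()
        self.nats_up = True

    def emit(self, event: dict):
        """Publish if NATS is up; otherwise buffer (never drop)."""
        if self.nats_up:
            try:
                self.publish(event)
                return
            except ConnectionError:
                self.nats_up = False  # enter degraded mode
        self.pending.append(event)

    def replay(self):
        """Call when NATS recovers; drains buffered events in original order."""
        self.nats_up = True
        while self.pending:
            self.publish(self.pending.popleft())
```

Ordering is preserved because the deque drains FIFO, which matters for consumers that assume state-transition events arrive in sequence.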
7.2 Communication Plan
| Severity | Communication Channel | Timeline | Audience |
|---|---|---|---|
| P0 (data loss risk) | Status page + email + phone to affected customers | < 15 min initial; 30-min updates | All affected customers + internal leadership |
| P1 (service degraded) | Status page + email | < 30 min initial; hourly updates | All customers |
| P2 (non-critical issue) | Status page | < 1 hour; updates as available | Subscribers |
| P3 (cosmetic/minor) | Internal tracking only | Next business day | Internal only |
7.3 Regulatory Continuity
For FDA-regulated customers, business continuity has specific compliance requirements:
| Requirement | Implementation | Evidence |
|---|---|---|
| Audit trail continuity during DR | Write-ahead log ensures no gaps; chain verification post-failover | Chain verification report |
| E-signature validity after failover | Signatures tied to record content, not infrastructure; re-verifiable | Signature verification test results |
| Data integrity post-restore | Checksum verification of all L4 data after restore | Integrity verification report |
| Regulatory notification | If DR event affects validated system records, notify customer's QA | Notification log, customer acknowledgment |
8. SLA Framework
8.1 External SLAs (Customer-Facing)
| Tier | Availability | API P95 Latency | Audit Trail Write | Support Response |
|---|---|---|---|---|
| Starter | 99.5% | < 1s | < 200ms | < 24 hours (business) |
| Professional | 99.9% | < 500ms | < 100ms | < 4 hours (business) |
| Enterprise | 99.95% | < 300ms | < 50ms | < 1 hour (24/7) |
| Enterprise Plus | 99.99% | < 200ms | < 50ms | < 15 min (24/7, named) |
8.2 Internal SLOs (Engineering Targets)
Internal targets are stricter than external SLAs to provide margin:
| Component | SLO Target | Measurement | Alert Threshold |
|---|---|---|---|
| WO API availability | 99.95% | Synthetic monitoring (every 30s) | < 99.9% over 1 hour |
| State transition latency | P50 < 50ms, P95 < 200ms, P99 < 1s | Request tracing | P95 > 300ms |
| Audit trail write latency | P50 < 10ms, P95 < 50ms | Write acknowledgment timing | P95 > 75ms |
| Agent dispatch latency | < 1s from request to first agent action | Orchestrator instrumentation | > 2s |
| E-signature verification | < 100ms per signature | Verification request timing | > 200ms |
| Database query latency | P50 < 5ms, P95 < 50ms | pg_stat_statements | P95 > 75ms |
| Event delivery latency | < 500ms from publish to consumer | NATS metrics | > 1s |
| Error budget | 0.05% of requests (99.95% target) | Error rate tracking | > 0.03% (warning), > 0.04% (critical) |
8.3 Error Budget Policy
Monthly error budget = 100% - SLO = 0.05% of requests
Budget tracking:
Green: < 50% budget consumed → Normal development velocity
Yellow: 50–80% budget consumed → Reduce risky deployments, increase testing
Orange: 80–95% budget consumed → Freeze non-critical changes, focus on reliability
Red: > 95% budget consumed → All-hands reliability focus, no new features
Budget resets monthly. Unused budget does not carry over.
Compliance-critical paths (audit trail, e-signatures) have separate, stricter budgets.
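The budget bands above reduce to a small calculation. A sketch, assuming consumption is measured as observed error rate divided by the budget (boundary values at exactly 50%/80%/95% are a judgment call, not specified above):

```python
def budget_status(error_rate: float, slo: float = 0.9995) -> str:
    """Map consumed error budget to the policy color.

    budget = 1 - SLO (0.05% for the 99.95% target);
    consumed = observed error rate / budget.
    """
    budget = 1.0 - slo
    consumed = error_rate / budget
    if consumed < 0.50:
        return "green"
    if consumed < 0.80:
        return "yellow"
    if consumed <= 0.95:
        return "orange"
    return "red"
```

For example, a 0.03% observed error rate against the 0.05% budget is 60% consumed, landing in yellow (reduce risky deployments).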
9. On-Call Design
9.1 Rotation Structure
| Tier | Who | Scope | Hours | Response Time |
|---|---|---|---|---|
| Primary on-call | SRE + 1 backend engineer (rotating) | All P0/P1 alerts | 24/7 | < 15 min acknowledge |
| Secondary on-call | Platform Core engineer (rotating) | Escalation from primary | 24/7 | < 30 min acknowledge |
| Compliance on-call | Compliance engineer | Compliance-specific alerts only | Business hours + P0 24/7 | < 1 hour (business), < 30 min (P0) |
| Leadership escalation | CTO / VP Eng | P0 unresolved > 1 hour | 24/7 | < 15 min |
9.2 Rotation Schedule
- Weekly rotation (Monday 10 AM UTC to Monday 10 AM UTC)
- Minimum 3 people in primary rotation to avoid burnout (max 1 week in 3)
- Handoff meeting at rotation change: open issues, upcoming maintenance, known risks
- Post-on-call debrief for any P0/P1 incidents during the week
9.3 Alert Routing
| Alert Category | Examples | Primary Route | Escalation |
|---|---|---|---|
| Infrastructure | Node unhealthy, disk > 85%, memory pressure | SRE (PagerDuty) | Platform Core |
| Application | Error rate spike, latency degradation, circuit breaker OPEN | Primary on-call (PagerDuty) | Secondary on-call |
| Compliance | Audit trail write failure, hash chain break, PHI detection | Compliance on-call (PagerDuty + Slack) | Leadership |
| Security | Failed auth spike, privilege escalation, break-glass activation | Security + Primary on-call (PagerDuty) | Leadership |
| Business | Tenant approaching WO limit, token budget exhaustion | Slack notification (non-urgent) | Product team |
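The routing table above can be encoded so that unknown alert categories fail loudly rather than being dropped. A sketch only: the route target strings are invented placeholders, not real PagerDuty or Slack identifiers.

```python
# category: (primary route, escalation target) -- mirrors the table above.
ROUTES = {
    "infrastructure": ("pagerduty:sre", "platform-core"),
    "application":    ("pagerduty:primary-oncall", "secondary-oncall"),
    "compliance":     ("pagerduty:compliance+slack", "leadership"),
    "security":       ("pagerduty:security+primary", "leadership"),
    "business":       ("slack:notify", "product"),
}


def route_alert(category: str, escalated: bool = False) -> str:
    """Resolve the delivery target for an alert category.

    Unrecognized categories default to primary on-call with
    leadership escalation -- a loud fallback, never a silent drop.
    """
    primary, escalation = ROUTES.get(
        category, ("pagerduty:primary-oncall", "leadership"))
    return escalation if escalated else primary
```

Keeping the mapping in one table makes the monthly alert-hygiene review (§9.5) a diff of this structure rather than a hunt through scattered config.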
9.4 Incident Response Playbook
P0 Incident Flow:
0 min: Alert fires → PagerDuty pages primary on-call
5 min: Primary acknowledges → assess severity → confirm P0
10 min: Open incident channel (Slack #incident-YYYYMMDD-NNN)
Assign roles: Incident Commander (primary), Scribe (secondary), Comms (product)
15 min: First customer communication (status page update)
30 min: If unresolved → escalate to secondary on-call + relevant team lead
60 min: If unresolved → escalate to leadership
Compliance on-call verifies: audit trail integrity, no data exposure
Ongoing: 30-minute status updates until resolved
Resolution:
- Verify: service restored, data integrity confirmed, compliance checks pass
- Customer notification: "Resolved" status page update
- Within 24 hours: Preliminary incident report
- Within 5 business days: Full post-incident review (blameless)
- PIR outputs: action items with owners and deadlines, runbook updates
9.5 On-Call Wellness
| Practice | Implementation |
|---|---|
| Alert hygiene | Monthly alert review; snooze or delete alerts with < 10% actionable rate |
| Toil budget | Max 30% of on-call time on toil; remaining time for reliability improvements |
| Compensation | On-call stipend + incident response premium; time-off-in-lieu for overnight pages |
| Fatigue prevention | Max 2 P0/P1 incidents per rotation before mandatory rest day; no back-to-back weeks |
| Alert-free goals | Team OKR: reduce alert volume 10% per quarter through reliability investment |
10. Localization & Multi-Market Readiness
10.1 Current Scope
The WO system launches for US-based FDA-regulated organizations. Multi-market readiness is planned for Y2/Y3 expansion.
10.2 Internationalization Architecture (i18n)
| Component | Approach | Status |
|---|---|---|
| UI strings | Externalized to JSON resource bundles (ICU MessageFormat) | Design — implement before first non-US customer |
| Date/time | ISO 8601 storage; locale-aware display (Intl API) | Partially implemented (all timestamps UTC in DB) |
| Number/currency | Locale-aware formatting (Intl.NumberFormat) | Not started |
| Regulatory framework | Configurable compliance rules per jurisdiction | Architecture supports (compliance engine is rule-based) |
| Audit trail language | Audit events in English (system language); UI translatable | Design decision: audit in English for FDA consistency |
10.3 Regulatory Jurisdiction Mapping
| Market | Regulations | Configuration Required | Priority |
|---|---|---|---|
| US (primary) | FDA 21 CFR Part 11, HIPAA, SOC 2 | Implemented | Y1 |
| EU | EMA Annex 11, GDPR, EU MDR 2017/745 | Compliance engine rules + data residency | Y2 |
| UK | MHRA GxP, UK GDPR | Similar to EU with UK-specific rules | Y2 |
| Brazil | ANVISA, LGPD | Compliance rules + Brazil data residency | Y3 |
| Japan | PMDA, APPI | Compliance rules + Japan data residency | Y3 |
10.4 Data Residency for International
| Region | Deployment | Data Classification Impacted | Implementation |
|---|---|---|---|
| US (us-central1) | Primary | All tiers | Deployed |
| EU (europe-west1) | Required for EU customers | L3, L4 (PII, regulated) | Terraform-ready, deploy when first EU customer |
| Brazil (southamerica-east1) | Required for LGPD | L3, L4 | Planned Y3 |
Tenant-level configuration: tenant.data_residency_region controls where all L3/L4 data is stored and processed. L0/L1/L2 can be served from any region (CDN, global read replicas).
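The residency rule above can be sketched as a single routing decision. Names are illustrative (`storage_region` is a hypothetical helper, not the actual service):

```python
# Classification tiers that may be served from any region
# (CDN, global read replicas), per the policy above.
REGION_OK_ANYWHERE = {"L0", "L1", "L2"}


def storage_region(tenant_region: str, classification: str,
                   default_region: str = "us-central1") -> str:
    """Pick the region where a record may be stored or processed.

    L3/L4 (PII, regulated) must stay in the tenant's configured
    data_residency_region; failing loudly when it is unset
    prevents regulated data from silently landing in the default.
    """
    if classification in REGION_OK_ANYWHERE:
        return default_region
    if not tenant_region:
        raise ValueError("L3/L4 data requires tenant.data_residency_region")
    return tenant_region
```

Enforcing the check at write time, rather than trusting callers, is what makes the EU deployment a Terraform switch instead of an application rewrite.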
Operational readiness is not a launch checklist — it's a continuous capability. Every new feature, every new customer, every new region requires updating the runbooks, DR procedures, on-call scope, and cost model. This document is alive; it degrades the moment it stops being updated.
Appendix: Cross-Reference to Gap Closure Series
| Gap ID | Operational Readiness Impact | This Document Section |
|---|---|---|
| G01 (Credential Vault) | Vault runbook required; vendor risk for HashiCorp | §3.3, §5.1 |
| G08 (FDA Completeness) | Compliance validation in DR procedures | §6.3, §7.3 |
| G13 (SIEM Integration) | Alert routing for security events | §9.3 |
| G14 (Incident Response WO) | P0 incident auto-creates WO | §9.4 |
| G17 (Optimistic Locking) | Conflict resolution runbook | §3.2 |
| G27 (Artifact Updates) | Runbook review cadence | §3.1 (template) |
| G28 (Dashboard Sync) | Monitoring dashboard accuracy | §8.2 |
Copyright 2026 AZ1.AI Inc. All rights reserved.
Developer: Hal Casteel, CEO/CTO
Product: CODITECT-BIO-QMS | Part of the CODITECT Product Suite
Classification: Internal - Confidential