Work Order QMS Module — Deployment Architecture
Classification: Internal — Engineering / DevOps
Date: 2026-02-13
Artifact: 72 of WO System Series
Status: Proposed
Source Artifacts: 13-tdd.md §3, 14-c4-architecture.md, 65-testing-strategy.md §6, 66-operational-readiness.md, 69-versioning-evolution-strategy.md §6
1. Environment Strategy
1.1 Environment Tiers
| Environment | Purpose | Data | Refresh Cadence | Access |
|---|---|---|---|---|
| local | Developer workstation | Synthetic seed | On demand | Developer only |
| dev | Integration testing, feature branches | Synthetic seed | Daily reset | Engineering team |
| staging | Pre-release validation, OQ execution | Synthetic + golden dataset | Per release candidate | Engineering + QA + Compliance |
| production | Customer-facing | Real customer data (L3/L4) | N/A | SRE + on-call (limited) |
| dr | Disaster recovery standby | Replicated from production | Continuous (WAL) | SRE (activation only) |
1.2 Environment Parity Rules
- local/dev/staging use identical container images; only configuration differs.
- staging runs the same PostgreSQL version, RLS policies, and NATS configuration as production.
- staging audit trail verification and e-signature validation are fully active (not mocked).
- Feature flags (69-versioning-evolution-strategy.md §2), not code branches, control feature availability per environment.
- No production data ever flows to non-production environments (63-data-architecture.md §2.1).
1.3 Configuration Hierarchy
Base config (defaults)
└── Environment overlay (dev/staging/prod)
    └── Tenant overlay (per-customer config)
        └── Feature flags (dynamic, runtime)
Configuration sources:
| Source | Contents | Format | Secrets |
|---|---|---|---|
| Git repo (config/) | Base + environment overlays | YAML | Never |
| Terraform state | Infrastructure resource IDs, endpoints | HCL → state | Outputs only |
| GCP Secret Manager / Vault | API keys, DB credentials, signing keys | Key-value | All secrets |
| PostgreSQL tenant_config | Per-tenant settings, compliance framework selection | JSON | Never (references to secrets only) |
| Runtime feature flags | Feature toggles per tenant/role | PostgreSQL feature_flags table | Never |
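The layering above can be illustrated with a small deep-merge sketch in which each later layer overrides the one before it. This is only an illustration of the precedence order: the function names, keys, and values are hypothetical and are not taken from the WO codebase.

```python
from copy import deepcopy
from typing import Any, Dict, Mapping


def merge_layers(*layers: Mapping[str, Any]) -> Dict[str, Any]:
    """Deep-merge config layers; later layers override earlier ones."""
    merged: Dict[str, Any] = {}
    for layer in layers:
        _merge_into(merged, layer)
    return merged


def _merge_into(target: Dict[str, Any], source: Mapping[str, Any]) -> None:
    for key, value in source.items():
        if isinstance(value, Mapping) and isinstance(target.get(key), dict):
            _merge_into(target[key], value)   # merge nested sections key by key
        else:
            target[key] = deepcopy(value)     # copy so input layers are never mutated


# Hypothetical layer contents, for illustration only.
base = {"api": {"timeout_ms": 5000, "page_size": 50}, "features": {"ai_routing": False}}
env_overlay = {"api": {"timeout_ms": 2000}}
tenant_overlay = {"api": {"page_size": 100}, "compliance_framework": "example-framework"}
feature_flags = {"features": {"ai_routing": True}}

effective = merge_layers(base, env_overlay, tenant_overlay, feature_flags)
print(effective)
```

Passing the layers in hierarchy order keeps precedence explicit at the call site, and copying values means the base config can be reused across tenants without side effects.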
2. Infrastructure as Code
2.1 Terraform Structure
infrastructure/
├── modules/
│   ├── gke-cluster/          # GKE cluster + node pools
│   ├── cloud-sql/            # PostgreSQL HA instance + read replicas
│   ├── nats/                 # NATS JetStream cluster (Helm)
│   ├── vault/                # HashiCorp Vault (Helm + config)
│   ├── networking/           # VPC, subnets, firewall rules, Cloud NAT
│   ├── observability/        # Prometheus, Grafana, OTEL collector
│   ├── dns-ssl/              # Cloud DNS + managed SSL certificates
│   └── iam/                  # Service accounts, IAM bindings
├── environments/
│   ├── dev/
│   │   ├── main.tf           # Module instantiation with dev params
│   │   ├── variables.tf      # Dev-specific variable values
│   │   └── backend.tf        # GCS state bucket (dev)
│   ├── staging/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── backend.tf
│   └── production/
│       ├── main.tf
│       ├── variables.tf
│       └── backend.tf
├── dr/                       # DR region infrastructure (mirrored)
│   └── main.tf
└── shared/
    ├── container-registry/   # Artifact Registry
    ├── kms/                  # Cloud KMS keys for encryption
    └── state-buckets/        # GCS buckets for Terraform state
2.2 Key Infrastructure Resources
| Resource | Service | Spec (Production) | HA Strategy |
|---|---|---|---|
| Kubernetes cluster | GKE Autopilot | 3 zones, auto-scaling | Multi-zone |
| PostgreSQL | Cloud SQL Enterprise | db-custom-4-16384, HA | Synchronous replica + PITR |
| Event bus | NATS JetStream | 3-node cluster on GKE | Built-in Raft consensus |
| Secrets | Vault on GKE | 3-node HA, GCS backend | Raft storage backend |
| Container registry | Artifact Registry | Standard | Multi-region replication |
| CDN / LB | Cloud Load Balancing | Global HTTP(S) LB | Anycast, multi-region |
| DNS | Cloud DNS | Public zone | 100% SLA |
| KMS | Cloud KMS | Automatic key rotation | Regional + multi-region |
| Object storage | Cloud Storage | Standard (backups, evidence) | Multi-region bucket |
2.3 Network Architecture
VPC: coditect-vpc (10.0.0.0/16)
├── Subnet: gke-nodes (10.0.0.0/20) — GKE node IPs
├── Subnet: gke-pods (10.4.0.0/14) — GKE pod IPs (secondary)
├── Subnet: gke-services (10.8.0.0/20) — GKE service IPs (secondary)
├── Subnet: cloud-sql (10.0.16.0/24) — Private Service Connect
└── Subnet: vault (10.0.17.0/24) — Vault cluster
Firewall rules:
- Ingress: HTTPS (443) from Cloud LB only
- Internal: All pods can reach NATS, PostgreSQL, Vault
- Egress: AI model APIs (Anthropic, OpenAI), SMTP, webhook delivery
- Deny: All other ingress/egress (default deny)
Private Service Connect: Cloud SQL, Secret Manager (no public IPs)
Cloud NAT: Outbound internet for AI API calls and webhook delivery
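A quick sanity check of the address plan can be sketched with Python's standard `ipaddress` module. The CIDR values are copied from the layout above; the helper names are hypothetical. The check confirms that no two ranges overlap and lists which ranges fall outside the 10.0.0.0/16 block shown for the VPC (the two GKE secondary ranges do).

```python
import ipaddress
from itertools import combinations
from typing import Dict, List, Tuple

VPC_BLOCK = ipaddress.ip_network("10.0.0.0/16")

# CIDRs as listed in the network layout above.
RANGES: Dict[str, str] = {
    "gke-nodes": "10.0.0.0/20",
    "gke-pods": "10.4.0.0/14",
    "gke-services": "10.8.0.0/20",
    "cloud-sql": "10.0.16.0/24",
    "vault": "10.0.17.0/24",
}


def find_overlaps(ranges: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return every pair of named ranges whose address space overlaps."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in ranges.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(nets.items(), 2)
        if net_a.overlaps(net_b)
    ]


def outside_block(ranges: Dict[str, str], block: ipaddress.IPv4Network) -> List[str]:
    """Return the names of ranges that are not contained in the given block."""
    return sorted(
        name for name, cidr in ranges.items()
        if not ipaddress.ip_network(cidr).subnet_of(block)
    )


print("overlaps:", find_overlaps(RANGES))
print("outside 10.0.0.0/16:", outside_block(RANGES, VPC_BLOCK))
```

A check like this could run in CI against the Terraform variables so that a new subnet cannot silently collide with the pod or service ranges.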
3. Container Strategy
3.1 Container Images
| Image | Base | Size Target | Contents |
|---|---|---|---|
| wo-api | gcr.io/distroless/nodejs20-debian12 | < 150MB | Express API server, state machine, RBAC |
| wo-compliance | gcr.io/distroless/python3-debian12 | < 200MB | Compliance engine, policy rules, PHI scanner |
| wo-agents | gcr.io/distroless/python3-debian12 | < 200MB | Agent workers, model routing, orchestration |
| wo-migrations | node:20-alpine | < 100MB | Prisma migrations + seed data |
| wo-scheduler | gcr.io/distroless/nodejs20-debian12 | < 100MB | PM automation, schedule-based WO generation |
3.2 Image Security
| Control | Implementation | Reference |
|---|---|---|
| Base images | Distroless for all service images (no shell, no package manager); the wo-migrations Job image is the exception (node:20-alpine, §3.1) | 64-security-architecture.md §6 |
| Vulnerability scanning | Trivy on every build; block on Critical, warn on High (§4.2) | CI pipeline stage |
| Image signing | Cosign with Cloud KMS key | Supply chain security |
| SBOM generation | Syft → SPDX format, attached to image | 64-security-architecture.md §6 |
| No root | All containers run as non-root (UID 1000) | Kubernetes SecurityContext |
| Read-only filesystem | readOnlyRootFilesystem: true | Pod spec |
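The last two controls map onto standard Kubernetes `securityContext` fields (`runAsNonRoot`, `runAsUser`, `readOnlyRootFilesystem`). A minimal sketch of a pre-deploy check over a container spec, already parsed into a dict, might look like the following; the function name and the sample specs are hypothetical.

```python
from typing import Any, Dict, List


def security_violations(container: Dict[str, Any]) -> List[str]:
    """List the image-security controls that a container spec fails to meet."""
    ctx = container.get("securityContext", {})
    problems: List[str] = []
    if ctx.get("runAsNonRoot") is not True:
        problems.append("runAsNonRoot must be true")
    if ctx.get("runAsUser") != 1000:
        problems.append("runAsUser must be 1000")
    if ctx.get("readOnlyRootFilesystem") is not True:
        problems.append("readOnlyRootFilesystem must be true")
    return problems


# Hypothetical container specs, for illustration only.
compliant = {
    "name": "wo-api",
    "securityContext": {
        "runAsNonRoot": True,
        "runAsUser": 1000,
        "readOnlyRootFilesystem": True,
    },
}
non_compliant = {"name": "wo-api", "securityContext": {"runAsUser": 0}}

print(security_violations(compliant))
print(security_violations(non_compliant))
```

In practice the same rules are often enforced in-cluster by an admission policy; a script like this only gives earlier feedback in CI.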
3.3 Kubernetes Manifests
k8s/
├── base/                         # Kustomize base
│   ├── kustomization.yaml
│   ├── namespace.yaml            # wo-system namespace
│   ├── wo-api/
│   │   ├── deployment.yaml       # 3 replicas, rolling update
│   │   ├── service.yaml          # ClusterIP
│   │   ├── hpa.yaml              # CPU 70% → scale 3-10
│   │   └── pdb.yaml              # minAvailable: 2
│   ├── wo-compliance/
│   │   ├── deployment.yaml       # 2 replicas
│   │   └── ...
│   ├── wo-agents/
│   │   ├── deployment.yaml       # 2-8 replicas (auto-scale on queue depth)
│   │   └── ...
│   ├── wo-scheduler/
│   │   ├── cronjob.yaml          # Schedule-based PM generation
│   │   └── ...
│   ├── networkpolicy.yaml        # Pod-to-pod communication rules
│   └── serviceaccount.yaml       # Workload Identity for GCP access
├── overlays/
│   ├── dev/                      # 1 replica, lower resources
│   ├── staging/                  # Production-like replicas
│   └── production/               # Full replicas, resource limits
└── jobs/
    ├── migration.yaml            # Database migration Job
    ├── seed.yaml                 # Seed data Job (non-prod only)
    └── audit-verify.yaml         # Periodic audit chain verification
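To make the `hpa.yaml` comment for wo-api concrete (CPU 70%, scale 3-10), the sketch below applies the core formula from the Kubernetes Horizontal Pod Autoscaler documentation, `desired = ceil(current * currentMetric / targetMetric)`, clamped to the replica bounds. It is a simplification: the real controller also applies a tolerance band and stabilization windows, which are omitted here.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_cpu_pct: float,
    target_cpu_pct: float = 70.0,
    min_replicas: int = 3,
    max_replicas: int = 10,
) -> int:
    """Core HPA scaling formula, clamped to the configured replica range."""
    raw = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, raw))


print(desired_replicas(3, 95))    # load above target: scale out
print(desired_replicas(3, 35))    # load below target: held at the minimum
print(desired_replicas(8, 140))   # heavy load: capped at the maximum
```

With the PodDisruptionBudget at `minAvailable: 2` and a floor of 3 replicas, one pod can be drained at a time without violating the budget.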
4. CI/CD Pipeline
4.1 Pipeline Architecture
      ┌─────────────┐
      │ Developer   │
      │ pushes code │
      └──────┬──────┘
             │
      ┌──────▼──────┐
      │ GitHub PR   │
      │ created     │
      └──────┬──────┘
             │
┌────────────▼────────────┐
│ CI: Validate            │
│  ┌───────────────────┐  │
│  │ Lint + Type Check │  │
│  │ Unit Tests        │  │
│  │ Contract Tests    │  │
│  │ Security Scan     │  │
│  │ SBOM Generate     │  │
│  └───────────────────┘  │
└────────────┬────────────┘
             │ All pass
┌────────────▼────────────┐
│ CI: Build & Publish     │
│  ┌───────────────────┐  │
│  │ Docker Build      │  │
│  │ Trivy Scan        │  │
│  │ Cosign Sign       │  │
│  │ Push to Registry  │  │
│  └───────────────────┘  │
└────────────┬────────────┘
             │ Merge to main
┌────────────▼────────────┐
│ CD: Deploy to Dev       │
│ (auto on merge)         │
└────────────┬────────────┘
             │ Integration tests pass
┌────────────▼────────────┐
│ CD: Deploy to Staging   │
│ (auto after dev green)  │
└────────────┬────────────┘
             │
┌────────────▼────────────┐
│ Validation Gates        │
│  ┌───────────────────┐  │
│  │ OQ Test Suite     │  │  ← Automated (65-testing-strategy.md)
│  │ PQ Subset         │  │
│  │ Compliance Check  │  │  ← Compliance engine validation
│  │ Performance Test  │  │  ← P95 < 500ms verified
│  └───────────────────┘  │
└────────────┬────────────┘
             │ All pass
┌────────────▼────────────┐
│ Manual Gate: Release    │  ← Human approval required
│ Approval (QA + Eng)     │    for production deployment
└────────────┬────────────┘
             │ Approved
┌────────────▼────────────┐
│ CD: Deploy to Prod      │
│ (canary → rolling)      │
└────────────┬────────────┘
             │
┌────────────▼────────────┐
│ Post-Deploy Validation  │
│  ┌───────────────────┐  │
│  │ Smoke Tests       │  │
│  │ Audit Integrity   │  │
│  │ Canary Metrics    │  │
│  └───────────────────┘  │
└─────────────────────────┘
4.2 CI Stage Details
| Stage | Tool | Duration Target | Failure Action |
|---|---|---|---|
| Lint + format | ESLint, Prettier, Ruff | < 30s | Block PR |
| Type check | tsc --noEmit, mypy | < 60s | Block PR |
| Unit tests | Vitest (TS), pytest (Python) | < 2 min | Block PR |
| Contract tests | Pact | < 1 min | Block PR |
| Security scan | Snyk, semgrep | < 2 min | Block PR (Critical/High) |
| SBOM generation | Syft | < 30s | Informational |
| Docker build | Docker + Buildx | < 3 min | Block PR |
| Image vulnerability scan | Trivy | < 1 min | Block (Critical), warn (High) |
| Image signing | Cosign + Cloud KMS | < 15s | Block PR |
| Total CI time | | < 10 min | |
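The per-stage targets can be checked against the 10-minute total with simple arithmetic. Run strictly one after another, the targets add up to more than the budget, so the total presumably relies on some parallelism. The sketch below assumes, as one possibility, that the stages in the "CI: Validate" box of §4.1 run in parallel while the build, scan, and sign stages run in sequence; that grouping is an assumption, not something the table states.

```python
# Upper-bound duration targets from the table above, in seconds.
STAGE_TARGETS_S = {
    "lint_format": 30,
    "type_check": 60,
    "unit_tests": 120,
    "contract_tests": 60,
    "security_scan": 120,
    "sbom_generation": 30,
    "docker_build": 180,
    "image_scan": 60,
    "image_signing": 15,
}
BUDGET_S = 10 * 60

# Assumed grouping: validation stages fan out, build stages are sequential.
VALIDATE = ["lint_format", "type_check", "unit_tests",
            "contract_tests", "security_scan", "sbom_generation"]
BUILD = ["docker_build", "image_scan", "image_signing"]

serial_total_s = sum(STAGE_TARGETS_S.values())
validate_wall_s = max(STAGE_TARGETS_S[s] for s in VALIDATE)   # slowest parallel stage
build_wall_s = sum(STAGE_TARGETS_S[s] for s in BUILD)
parallel_total_s = validate_wall_s + build_wall_s

print(f"serial:   {serial_total_s}s (within budget: {serial_total_s < BUDGET_S})")
print(f"parallel: {parallel_total_s}s (within budget: {parallel_total_s < BUDGET_S})")
```

Under these assumptions the serial sum is 675 s (11 min 15 s) and the parallel wall-clock bound is 375 s (6 min 15 s), which leaves headroom for checkout, caching, and the registry push that the table does not time.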
4.3 CD Stage Details
| Stage | Trigger | Strategy | Rollback |
|---|---|---|---|
| Deploy to dev | Merge to main | Replace (fast) | Re-deploy previous image |
| Deploy to staging | Dev integration tests pass | Rolling update | Automatic on test failure |
| Validation gates | Staging deploy complete | Automated test suites | Block promotion |
| Release approval | Validation gates pass | Manual (QA + Eng sign-off) | N/A |
| Deploy to production | Release approved | Canary (10% → 50% → 100%) | Automatic on error rate spike |
| Post-deploy validation | Production deploy complete | Smoke tests + canary metrics | Automatic rollback if fail |
4.4 Production Deployment Strategy
# Canary deployment progression
canary:
  steps:
    - setWeight: 10              # 10% of traffic to new version
      pause: { duration: 5m }
      analysis: &canary_checks
        - metric: error_rate
          threshold: 0.01        # < 1% error rate
        - metric: p95_latency
          threshold: 500         # < 500ms
    - setWeight: 50              # 50% if canary healthy
      pause: { duration: 10m }
      analysis: *canary_checks   # same metrics as the 10% step
    - setWeight: 100             # Full rollout if still healthy
  rollback:
    trigger: "error_rate > 0.02 OR p95_latency > 1000"
    action: automatic_rollback
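The thresholds above leave a band between "healthy" (error rate below 1%, P95 below 500 ms) and "roll back" (error rate above 2% or P95 above 1000 ms). A minimal sketch of the decision logic follows; the `hold` outcome for that middle band is an assumption, since the configuration does not say what happens there, and the type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanaryMetrics:
    error_rate: float        # fraction of failed requests, e.g. 0.004 = 0.4%
    p95_latency_ms: float


def canary_decision(m: CanaryMetrics) -> str:
    """Map observed canary metrics to promote / hold / rollback."""
    if m.error_rate > 0.02 or m.p95_latency_ms > 1000:
        return "rollback"                       # rollback trigger from the config
    if m.error_rate < 0.01 and m.p95_latency_ms < 500:
        return "promote"                        # both analysis thresholds met
    return "hold"                               # assumed: neither promote nor roll back


print(canary_decision(CanaryMetrics(error_rate=0.002, p95_latency_ms=310)))
print(canary_decision(CanaryMetrics(error_rate=0.015, p95_latency_ms=400)))
print(canary_decision(CanaryMetrics(error_rate=0.030, p95_latency_ms=400)))
```

Checking the rollback condition first means a canary that breaches either hard limit is never promoted, regardless of the other metric.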
4.5 Database Migration Strategy
Migrations run as Kubernetes Jobs before application deployment:
1. Pre-migration snapshot (automated Cloud SQL backup)
2. Run migration Job (Prisma migrate deploy)
3. Verify migration success (check migration table)
4. If failed → automatic restore from snapshot
5. If succeeded → proceed with application deployment
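The five steps can be sketched as a small orchestration function. Each step is passed in as a callable so the control flow can be exercised without a real database; in a real pipeline those callables would wrap the Cloud SQL backup, the migration Job, and the deploy step. All names here are hypothetical.

```python
from typing import Callable, List


def run_migration(
    take_snapshot: Callable[[], str],
    migrate: Callable[[], None],
    verify: Callable[[], bool],
    restore: Callable[[str], None],
    deploy_app: Callable[[], None],
) -> str:
    """Run steps 1-5 and return "deployed" or "restored"."""
    snapshot_id = take_snapshot()          # step 1: pre-migration snapshot
    try:
        migrate()                          # step 2: run the migration Job
        succeeded = verify()               # step 3: check the migration table
    except Exception:
        succeeded = False                  # a crashed Job counts as a failed migration
    if not succeeded:
        restore(snapshot_id)               # step 4: restore from the snapshot
        return "restored"
    deploy_app()                           # step 5: continue with the app deployment
    return "deployed"


# Dry run with in-memory stand-ins that only record what was called.
calls: List[str] = []


def fake_snapshot() -> str:
    calls.append("snapshot")
    return "snap-001"


def failing_migrate() -> None:
    calls.append("migrate")
    raise RuntimeError("simulated migration failure")


outcome = run_migration(
    take_snapshot=fake_snapshot,
    migrate=failing_migrate,
    verify=lambda: True,
    restore=lambda snap: calls.append(f"restore:{snap}"),
    deploy_app=lambda: calls.append("deploy"),
)
print(outcome, calls)
```

Injecting the steps keeps the failure path testable: the dry run shows that a failed migration triggers a restore and never reaches the application deployment.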
Rules:
- All migrations are forward-only in production
- Expand-contract pattern for schema changes (69-versioning-evolution-strategy.md §3)
- L4 tables (audit_trail, electronic_signature): additive-only, never ALTER/DROP
- RLS policies verified post-migration
- Migration timing: off-peak hours (2–4 AM UTC) for schema changes
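The L4 rule lends itself to an automated pre-merge check. The sketch below scans migration SQL for `ALTER TABLE` or `DROP TABLE` statements that target the two L4 tables. It is plain text matching rather than SQL parsing, so it would also flag matches inside comments; the function name and the sample SQL (including the `work_order` table) are hypothetical.

```python
import re
from typing import List

L4_TABLES = ("audit_trail", "electronic_signature")

# Matches ALTER/DROP TABLE [IF EXISTS] [ONLY] [schema.]["]<l4 table>
_FORBIDDEN = re.compile(
    r"\b(ALTER|DROP)\s+TABLE\s+(?:IF\s+EXISTS\s+)?(?:ONLY\s+)?(?:\w+\.)?\"?("
    + "|".join(L4_TABLES)
    + r")\b",
    re.IGNORECASE,
)


def l4_violations(sql: str) -> List[str]:
    """Return one entry per ALTER/DROP statement that touches an L4 table."""
    return [
        f"{m.group(1).upper()} TABLE {m.group(2).lower()}"
        for m in _FORBIDDEN.finditer(sql)
    ]


# Hypothetical migration content, for illustration only.
migration_sql = """
ALTER TABLE "work_order" ADD COLUMN priority INT;
DROP TABLE audit_trail;
alter table electronic_signature drop column signer;
CREATE TABLE audit_trail_archive (id BIGINT PRIMARY KEY);
"""

print(l4_violations(migration_sql))
```

Running a check like this as a CI step would block a violating migration before it reaches staging, which is cheaper than catching it in post-migration verification.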
5. Release Process
5.1 Release Types
| Type | Cadence | Approval | DB Migration | Downtime |
|---|---|---|---|---|
| Patch (x.y.Z) | As needed | Eng lead | Rarely | Zero |
| Minor (x.Y.0) | Bi-weekly | Eng lead + QA | Sometimes | Zero |
| Major (X.0.0) | Quarterly | Eng lead + QA + Compliance + Product | Usually | Planned (< 15 min) |
| Hotfix | Emergency | CTO + on-call | If needed | Zero (canary) |
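For the three version-driven release types, the required approvals can be derived from the version bump itself. The sketch below assumes plain `MAJOR.MINOR.PATCH` version strings and copies the approval lists from the table; hotfixes are left out because they are defined by process (emergency) rather than by version number. Function names are hypothetical.

```python
from typing import Dict, List, Tuple

# Approval lists copied from the release types table.
APPROVALS: Dict[str, List[str]] = {
    "patch": ["Eng lead"],
    "minor": ["Eng lead", "QA"],
    "major": ["Eng lead", "QA", "Compliance", "Product"],
}


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse 'X.Y.Z' or 'vX.Y.Z' into a tuple of ints."""
    major, minor, patch = (int(part) for part in version.lstrip("v").split("."))
    return major, minor, patch


def release_type(previous: str, new: str) -> str:
    """Classify a release by which version component increased."""
    old, cur = parse_version(previous), parse_version(new)
    if cur <= old:
        raise ValueError(f"{new} is not newer than {previous}")
    if cur[0] > old[0]:
        return "major"
    if cur[1] > old[1]:
        return "minor"
    return "patch"


for prev, new in [("2.3.1", "2.3.2"), ("2.3.1", "2.4.0"), ("2.3.1", "3.0.0")]:
    kind = release_type(prev, new)
    print(f"{prev} -> {new}: {kind}, approvals: {APPROVALS[kind]}")
```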
5.2 Release Checklist
## Release v[X.Y.Z] Checklist
Pre-Release:
☐ All CI checks passing on release branch
☐ Integration tests passing in staging
☐ OQ test suite passing (automated)
☐ Performance tests meeting SLA targets
☐ Security scan: zero Critical, zero High unmitigated
☐ SBOM generated and attached
☐ Changelog updated
☐ Feature flags configured for target audience
Regulatory (if applicable):
☐ Compliance engine validation passing
☐ Audit trail integrity verified in staging
☐ E-signature flow tested end-to-end
☐ IQ evidence package generated (for new infrastructure)
☐ OQ evidence package generated (for functional changes)
☐ PQ evidence package generated (for performance changes)
☐ Traceability matrix updated
Approval:
☐ Engineering lead sign-off
☐ QA sign-off (for minor+)
☐ Compliance sign-off (for major or regulatory changes)
☐ Product sign-off (for major)
Deploy:
☐ Database migration tested in staging
☐ Pre-migration backup verified
☐ Canary deployment initiated
☐ Canary metrics monitored (10% → 50% → 100%)
☐ Post-deploy smoke tests passing
Post-Release:
☐ Customer release notes published
☐ Status page updated
☐ Monitoring dashboards verified
☐ On-call team briefed on changes
6. Secrets Management in Deployment
| Secret | Storage | Injection Method | Rotation |
|---|---|---|---|
| PostgreSQL credentials | Vault / GCP Secret Manager | Kubernetes External Secrets Operator | 90 days |
| AI model API keys | Vault | Sidecar injection | 90 days |
| E-signature signing key | Cloud KMS | IAM-based access (Workload Identity) | Annual |
| NATS credentials | Vault | Kubernetes secret | 90 days |
| Webhook HMAC keys | Vault | Application bootstrap | Per customer |
| Container signing key | Cloud KMS | CI/CD service account | Annual |
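No secrets are stored in Git, environment variables, or container images. Vault-held secrets are injected at runtime via the Vault Agent sidecar or the External Secrets Operator; keys held in Cloud KMS are accessed through IAM (Workload Identity or the CI/CD service account) rather than injected.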
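The rotation column can be turned into a simple due-date check. The sketch below treats "Annual" as 365 days, which is an assumption, and omits webhook HMAC keys because their rotation is set per customer. The secret identifiers and function name are hypothetical.

```python
from datetime import date, timedelta

# Rotation intervals from the table; "Annual" is approximated as 365 days.
ROTATION_DAYS = {
    "postgresql_credentials": 90,
    "ai_model_api_keys": 90,
    "nats_credentials": 90,
    "esignature_signing_key": 365,
    "container_signing_key": 365,
}


def rotation_due(secret: str, last_rotated: date, today: date) -> bool:
    """True once the secret has reached or passed its rotation interval."""
    return today - last_rotated >= timedelta(days=ROTATION_DAYS[secret])


today = date(2026, 2, 13)
print(rotation_due("nats_credentials", date(2026, 1, 1), today))        # 43 days old
print(rotation_due("nats_credentials", date(2025, 11, 1), today))       # 104 days old
print(rotation_due("container_signing_key", date(2025, 11, 1), today))  # 104 days old
```

A scheduled job could run this against rotation timestamps and raise an alert a few days before each deadline instead of at it.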
7. Observability in Deployment
| Signal | Collection | Storage | Retention |
|---|---|---|---|
| Metrics | Prometheus (scrape) | Grafana Cloud | 13 months |
| Traces | OTEL SDK → OTEL Collector | Grafana Tempo | 30 days |
| Logs | Structured JSON → Fluent Bit | Grafana Loki | 90 days (L0–L2), 7 years (L3–L4 audit) |
| Alerts | Grafana Alerting → PagerDuty | PagerDuty | Per 66-operational-readiness.md §9 |
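The log retention rule depends on the data classification level of the log stream. A minimal sketch of that mapping follows; it approximates seven years as 7 × 365 days and ignores leap days, and the function name is hypothetical.

```python
def log_retention_days(classification_level: int) -> int:
    """Retention for a log stream, keyed by data classification level (0-4)."""
    if not 0 <= classification_level <= 4:
        raise ValueError(f"unknown classification level: L{classification_level}")
    if classification_level <= 2:
        return 90            # L0-L2: operational logs
    return 7 * 365           # L3-L4: audit logs, approximately 7 years


for level in range(5):
    print(f"L{level}: {log_retention_days(level)} days")
```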
8. Cross-Reference
| Concern | Specification Source |
|---|---|
| Infrastructure cost model | 66-operational-readiness.md §4 |
| DR failover procedure | 66-operational-readiness.md §6 |
| Test pyramid in CI | 65-testing-strategy.md §1 |
| IQ/OQ/PQ in CD | 70-validation-protocol-templates.md |
| Schema evolution | 69-versioning-evolution-strategy.md §3 |
| Feature flags | 69-versioning-evolution-strategy.md §2 |
| Container security | 64-security-architecture.md §6 |
| API versioning | 69-versioning-evolution-strategy.md §1 |
| Environment config | 13-tdd.md §2 |
The deployment architecture is the bridge between design and reality. Every artifact in the WO system corpus is dead weight until this pipeline delivers it to production safely and repeatably. This document is the operational counterpart to the SDD — it answers "how does it get there" rather than "what does it look like."