Track E: Operations & Deployment
Priority: LOW | Begins: Sprint 5
Agents: devops-engineer, cloud-architect
Sprint Range: S5-S7
Status Summary
Progress: 100% (20/20 tasks)
| Section | Title | Status | Tasks |
|---|---|---|---|
| E.1 | CI/CD Pipeline | Complete | 4/4 |
| E.2 | Infrastructure (GKE/Cloud Run) | Complete | 5/5 |
| E.3 | Monitoring & Observability | Complete | 4/4 |
| E.4 | Backup & Disaster Recovery | Complete | 3/3 |
| E.5 | DR Validation & Automation | Complete | 4/4 |
E.1: CI/CD Pipeline
Sprint: S5 | Priority: P1 | Depends On: C.1
Goal: Automated build, test, and deploy pipeline
- E.1.1: Set up GitHub Actions CI pipeline (evidence: docs/operations/cicd-pipeline.md §E.1.1)
- Triggers: Push to main, PR opened/updated, manual dispatch
- Steps: Lint → type check → unit tests → integration tests → build
- Matrix: Node 20 LTS, PostgreSQL 15
- E.1.2: Configure automated deployment pipeline (evidence: docs/operations/cicd-pipeline.md §E.1.2)
- Staging: Auto-deploy on merge to main
- Production: Manual approval gate, deploy on tag
- Rollback: One-click rollback to previous version
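A one-click rollback like E.1.2's can be a single traffic shift against the Cloud Run revisions API. The sketch below is illustrative only: the service name, region, and argv handling are placeholders, not project config; the flags follow the public `gcloud run services update-traffic` interface.

```typescript
// rollback.ts: route 100% of traffic back to a previous Cloud Run revision (E.1.2).
import { execFileSync } from "node:child_process";

const SERVICE = "bio-qms-api"; // hypothetical Cloud Run service name
const REGION = "us-central1";  // assumption: deployment region

// The target revision comes from the deploy history (argv for this sketch).
const revision = process.argv[2];
if (!revision) {
  console.error("usage: rollback.ts <revision-name>");
  process.exit(1);
}

// update-traffic shifts traffic atomically, so the rollback is a single
// step with no intermediate mixed-traffic state.
execFileSync("gcloud", [
  "run", "services", "update-traffic", SERVICE,
  `--region=${REGION}`,
  `--to-revisions=${revision}=100`,
], { stdio: "inherit" });

console.log(`Rolled back ${SERVICE} to ${revision}`);
```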
- E.1.3: Set up container image build and registry (evidence: docs/operations/cicd-pipeline.md §E.1.3)
- Registry: Google Artifact Registry
- Images: Backend API, frontend static assets, database migration runner
- Tagging: Semantic versioning + commit SHA
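The E.1.3 tagging scheme (semver plus commit SHA) reduces to a few lines in CI. A minimal sketch, assuming the Artifact Registry path shown is a placeholder and the semver comes from `package.json`:

```typescript
// tags.ts: derive image tags per E.1.3 (semantic version + commit SHA).
import { execFileSync } from "node:child_process";
import { readFileSync } from "node:fs";

// Hypothetical Artifact Registry path for this project.
const IMAGE = "us-central1-docker.pkg.dev/bio-qms/backend/api";

const { version } = JSON.parse(readFileSync("package.json", "utf8"));
const sha = execFileSync("git", ["rev-parse", "--short", "HEAD"], {
  encoding: "utf8",
}).trim();

// Every build gets an immutable SHA tag; releases also get the semver tag.
console.log(`${IMAGE}:${sha}`);
console.log(`${IMAGE}:v${version}`);
```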
- E.1.4: Implement database migration CI step (evidence: docs/operations/cicd-pipeline.md §E.1.4)
- Pre-deploy: Run Prisma migrations in CI before app deploy
- Validation: Schema diff check, backward compatibility verification
- Rollback: Migration rollback script for each migration
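A minimal sketch of the E.1.4 pre-deploy gate, assuming `DATABASE_URL` and a `SHADOW_DATABASE_URL` (needed by `prisma migrate diff --from-migrations`) are set in CI; the flags follow the Prisma CLI's `migrate diff` and `migrate deploy` commands:

```typescript
// ci-migrate.ts: pre-deploy migration gate per E.1.4.
import { spawnSync } from "node:child_process";

function run(args: string[]): number {
  const res = spawnSync("npx", ["prisma", ...args], { stdio: "inherit" });
  return res.status ?? 1;
}

// 1. Schema drift check: `migrate diff --exit-code` exits non-zero when the
//    committed migrations do not reproduce schema.prisma.
const drift = run([
  "migrate", "diff",
  "--from-migrations", "./prisma/migrations",
  "--to-schema-datamodel", "./prisma/schema.prisma",
  "--shadow-database-url", process.env.SHADOW_DATABASE_URL ?? "",
  "--exit-code",
]);
if (drift !== 0) {
  console.error("Schema drift: commit a migration before deploying.");
  process.exit(1);
}

// 2. Apply pending migrations before the app itself is deployed.
process.exit(run(["migrate", "deploy"]));
```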
E.2: Infrastructure (GKE/Cloud Run)
Sprint: S5-S6 | Priority: P1 | Depends On: E.1
Goal: Production-ready GCP infrastructure with multi-environment setup
- E.2.1: Configure Terraform for GCP infrastructure (evidence: docs/operations/gcp-infrastructure.md §Terraform)
- Resources: Cloud Run (API), Cloud SQL (PostgreSQL), Memorystore (Redis), Cloud CDN
- Environments: dev, staging, production with shared VPC
- State: Terraform state in GCS bucket with locking
- E.2.2: Set up Cloud Run services for API and frontend (evidence: docs/operations/gcp-infrastructure.md §Cloud-Run)
- API: Cloud Run with min 2 / max 10 instances, 2GB RAM, 2 vCPU
- Frontend: Cloud Run serving static files with CDN
- Custom domain: bio-qms.coditect.ai with managed SSL
- E.2.3: Configure Cloud SQL PostgreSQL instance (evidence: docs/operations/gcp-infrastructure.md §Cloud-SQL)
- Tier: db-custom-2-4096 (production), db-f1-micro (dev)
- HA: Multi-zone in production, automated backups every 4 hours
- Connection: Cloud SQL Auth Proxy, private IP
- E.2.4: Set up Redis (Memorystore) for caching and sessions (evidence: docs/operations/gcp-infrastructure.md §Memorystore)
- Size: 1GB Standard tier (production), shared Basic tier instance (dev)
- Use: Session cache, rate limiting, DocumentViewToken cache (ADR-196)
- Failover: Automatic failover via the Standard tier replica in production (Basic tier provides no failover)
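Of the E.2.4 uses, rate limiting is the simplest to illustrate. A fixed-window sketch with ioredis; the limit and window values are illustrative, not project policy, and `REDIS_URL` would point at the Memorystore private IP in production:

```typescript
// rateLimit.ts: fixed-window rate limiter on Redis (one of the E.2.4 uses).
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

export async function allowRequest(
  userId: string,
  limit = 100,     // illustrative: max requests per window
  windowSec = 60,  // illustrative: window length in seconds
): Promise<boolean> {
  // One counter key per user per window; the key name encodes the window.
  const key = `rl:${userId}:${Math.floor(Date.now() / 1000 / windowSec)}`;
  const count = await redis.incr(key);
  if (count === 1) await redis.expire(key, windowSec); // first hit starts the window
  return count <= limit;
}
```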
- E.2.5: Configure networking and security groups (evidence: docs/operations/gcp-infrastructure.md §Networking)
- VPC: Shared VPC with private subnets
- Firewall: Allow only HTTPS ingress, restrict database to API subnet
- NAT: Cloud NAT for outbound traffic
- WAF: Cloud Armor with OWASP rules
E.3: Monitoring & Observability
Sprint: S6 | Priority: P1 | Depends On: E.2
Goal: Full observability stack for production operations
- E.3.1: Configure Cloud Monitoring dashboards (evidence: docs/operations/monitoring-observability.md §E.3.1)
- Dashboards: API latency/error rate, database performance, cache hit rate
- Custom metrics: QMS-specific (documents signed/hour, CAPA resolution time)
- SLOs: 99.9% API availability, < 200ms p95 latency
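The custom QMS metrics in E.3.1 can be emitted through the OpenTelemetry metrics API. A sketch, assuming a MeterProvider exporting to Cloud Monitoring is registered at startup (not shown) and that the metric and attribute names are placeholders:

```typescript
// qmsMetrics.ts: custom QMS metric per E.3.1, via the OpenTelemetry API.
import { metrics } from "@opentelemetry/api";

const meter = metrics.getMeter("bio-qms");

// A counter; dashboards derive documents-signed-per-hour from its rate.
const documentsSigned = meter.createCounter("qms.documents_signed", {
  description: "Electronic signatures applied to controlled documents",
});

export function recordSignature(orgId: string): void {
  documentsSigned.add(1, { org_id: orgId });
}
```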
- E.3.2: Set up alerting policies (evidence: docs/operations/monitoring-observability.md §E.3.2)
- Critical: API down, database unreachable, certificate expiry < 7 days
- Warning: Error rate > 1%, latency p95 > 500ms, disk > 80%
- Channels: PagerDuty (critical), Slack (warning), email (informational)
- E.3.3: Implement structured logging with Cloud Logging (evidence: docs/operations/monitoring-observability.md §E.3.3)
- Format: JSON structured logs with correlation IDs
- Fields: timestamp, level, service, request_id, user_id, org_id, action
- Retention: 30 days hot, 1 year cold (BigQuery export)
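A minimal sketch of the E.3.3 log shape using pino, with the correlation fields from the list above carried on a per-request child logger; the service name is a placeholder:

```typescript
// logger.ts: structured JSON logging per E.3.3, sketched with pino.
import pino from "pino";

const logger = pino({
  // Cloud Logging ingests JSON payloads; ISO timestamps keep logs sortable.
  timestamp: pino.stdTimeFunctions.isoTime,
  base: { service: "bio-qms-api" },
  formatters: { level: (label) => ({ level: label }) },
});

// One child logger per request carries the correlation ID on every line.
export function requestLogger(requestId: string, userId: string, orgId: string) {
  return logger.child({ request_id: requestId, user_id: userId, org_id: orgId });
}

// Usage: requestLogger(id, user, org).info({ action: "document.sign" }, "signed");
```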
- E.3.4: Set up distributed tracing (evidence: docs/operations/monitoring-observability.md §E.3.4)
- Tool: OpenTelemetry with Cloud Trace
- Coverage: API requests, database queries, external API calls
- Sampling: 100% for errors, 10% for normal traffic
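A sketch of the E.3.4 setup with the OpenTelemetry Node SDK. Note the 10% rate is a head-based decision; keeping 100% of errored traces requires tail-based sampling (for example in an OpenTelemetry Collector), which is not shown here. The Cloud Trace exporter package is named under the assumption it is the one in use:

```typescript
// tracing.ts: OpenTelemetry tracing per E.3.4 (head-based 10% sampling).
import { NodeSDK } from "@opentelemetry/sdk-node";
import {
  ParentBasedSampler,
  TraceIdRatioBasedSampler,
} from "@opentelemetry/sdk-trace-base";
import { TraceExporter } from "@google-cloud/opentelemetry-cloud-trace-exporter";

const sdk = new NodeSDK({
  traceExporter: new TraceExporter(),
  // Respect the parent's sampling decision; sample 10% of new root traces.
  sampler: new ParentBasedSampler({ root: new TraceIdRatioBasedSampler(0.1) }),
});

sdk.start();
```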
E.4: Backup & Disaster Recovery
Sprint: S6 | Priority: P1 | Depends On: E.2
Goal: Validated backup and recovery procedures meeting compliance requirements
- E.4.1: Configure automated database backups (evidence: docs/operations/dr-validation-automation.md §Backup)
- Frequency: Every 4 hours (production), daily (staging)
- Retention: 30 days point-in-time recovery, 90 days daily snapshots
- Cross-region: Backup replica in different GCP region
- E.4.2: Create disaster recovery runbook (evidence: docs/operations/dr-validation-automation.md §DR-Runbook)
- RPO: 4 hours (maximum data loss)
- RTO: 2 hours (maximum downtime)
- Procedures: Database failover, service redeployment, DNS switching
- Documentation: Step-by-step recovery for each failure scenario
- E.4.3: Conduct DR testing and validation (evidence: docs/operations/dr-validation-automation.md §DR-Testing)
- Frequency: Quarterly DR drills
- Scenarios: Database failure, region outage, complete rebuild
- Evidence: DR test results for compliance auditors
- Report: DR test report with actual RPO/RTO measurements
E.5: DR Validation & Automation
Sprint: S6-S7 | Priority: P1 | Depends On: E.4
Goal: Automated DR validation with backup integrity verification and per-failure-mode runbooks
- E.5.1: Implement automated backup integrity verification (evidence: docs/operations/dr-validation-automation.md §Backup-Integrity)
- Daily check: automated restore test of latest backup to staging
- Validation: row count verification, checksum comparison, data integrity
- Alerting: immediate alert on backup integrity failure
- Reporting: weekly backup health report for compliance
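A sketch of the E.5.1 row-count and checksum comparison using the `pg` driver. The table names, env var names, and checksum idiom (a sum over per-row md5 fingerprints, order-independent) are placeholders and assumptions, not project config:

```typescript
// verifyRestore.ts: compare a restored staging DB against production (E.5.1).
import { Pool } from "pg";

const TABLES = ["documents", "signatures", "capas"]; // hypothetical table names

async function fingerprint(pool: Pool, table: string): Promise<string> {
  // Row count plus an order-independent checksum over row contents.
  // Table names come only from the hardcoded allowlist above, so direct
  // interpolation is acceptable in this sketch.
  const { rows } = await pool.query(
    `SELECT count(*) AS n,
            coalesce(sum(('x' || substr(md5(t::text), 1, 16))::bit(64)::bigint), 0) AS checksum
     FROM ${table} t`
  );
  return `${rows[0].n}:${rows[0].checksum}`;
}

async function main() {
  const prod = new Pool({ connectionString: process.env.PROD_REPLICA_URL });
  const restored = new Pool({ connectionString: process.env.RESTORED_URL });
  for (const table of TABLES) {
    const [a, b] = await Promise.all([
      fingerprint(prod, table),
      fingerprint(restored, table),
    ]);
    if (a !== b) throw new Error(`Integrity mismatch in ${table}: ${a} != ${b}`);
  }
  console.log("Backup integrity verified");
  await Promise.all([prod.end(), restored.end()]);
}

main().catch((err) => { console.error(err); process.exit(1); });
```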
- E.5.2: Build automated monthly failover testing (evidence: docs/operations/dr-validation-automation.md §Failover-Testing)
- Database failover: automated Cloud SQL replica promotion test
- Service failover: Cloud Run traffic switching to secondary region
- DNS failover: automated DNS switching to backup endpoints
- Metrics: measure actual RPO/RTO vs. target in each test
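The E.5.2 RPO/RTO measurement reduces to arithmetic over four timestamps. A simplified sketch; in a real test these values would be read from test logs, and the field names here are invented for illustration:

```typescript
// drMetrics.ts: compute observed RPO/RTO from a failover test (E.5.2).
interface FailoverTest {
  lastCommittedWrite: Date;      // newest write confirmed before the failure
  failureInjectedAt: Date;       // when the fault was introduced
  firstWriteAfterRecovery: Date; // first successful write on the new primary
  oldestSurvivingWrite: Date;    // newest pre-failure write found after recovery
}

export function evaluate(t: FailoverTest) {
  // RPO: data committed in this window did not survive the failover.
  const rpoHours =
    Math.max(0, t.lastCommittedWrite.getTime() - t.oldestSurvivingWrite.getTime()) /
    3_600_000;
  // RTO: elapsed downtime until writes succeed again.
  const rtoHours =
    (t.firstWriteAfterRecovery.getTime() - t.failureInjectedAt.getTime()) / 3_600_000;
  // Targets from the E.4.2 runbook: RPO 4 hours, RTO 2 hours.
  return { rpoHours, rtoHours, meetsTargets: rpoHours <= 4 && rtoHours <= 2 };
}
```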
- E.5.3: Create per-failure-mode runbooks (evidence: docs/operations/dr-validation-automation.md §Per-Failure-Runbooks)
- Catalog: runbook for each identified failure mode (database, network, storage, compute, third-party)
- Automation: scripted recovery steps with decision trees
- Validation: runbook testing as part of quarterly DR drills
- Versioning: runbook version control with change approval
- E.5.4: Implement chaos engineering for DR validation (evidence: docs/operations/dr-validation-automation.md §Chaos-Engineering)
- Fault injection: controlled failure injection in staging environment
- Scenarios: network partition, disk full, high latency, service crash
- Compliance safety: chaos experiments NEVER touch audit trail or signature chains
- Evidence: chaos test results for compliance auditors
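The compliance-safety rule in E.5.4 is enforceable as a guard that every experiment must pass before any fault is injected. A sketch; the protected component names and the `DEPLOY_ENV` variable are assumptions for illustration:

```typescript
// chaosGuard.ts: enforce the E.5.4 rule before injecting any fault.
// The denylist mirrors the rule that chaos experiments never touch the
// audit trail or signature chains; names are illustrative.
const PROTECTED = new Set(["audit-trail", "signature-chain", "audit-log-store"]);

export interface Experiment {
  name: string;
  targets: string[]; // service/component identifiers in staging
  fault: "network-partition" | "disk-full" | "high-latency" | "service-crash";
}

export function assertSafe(exp: Experiment): void {
  const hits = exp.targets.filter((t) => PROTECTED.has(t));
  if (hits.length > 0) {
    throw new Error(
      `Chaos experiment "${exp.name}" targets protected components: ${hits.join(", ")}`
    );
  }
  // Experiments run only in the staging environment, never production.
  if (process.env.DEPLOY_ENV !== "staging") {
    throw new Error("Chaos experiments may only run in staging");
  }
}
```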
Updated: 2026-02-14 | Compliance: CODITECT Track Nomenclature Standard (ADR-054)