
Track E: Operations & Deployment

Priority: LOW (begins Sprint 5) | Agents: devops-engineer, cloud-architect | Sprint Range: S5-S6


Status Summary

Progress: 100% (20/20 tasks)

| Section | Title | Status | Tasks |
| --- | --- | --- | --- |
| E.1 | CI/CD Pipeline | Complete | 4/4 |
| E.2 | Infrastructure (GKE/Cloud Run) | Complete | 5/5 |
| E.3 | Monitoring & Observability | Complete | 4/4 |
| E.4 | Backup & Disaster Recovery | Complete | 3/3 |
| E.5 | DR Validation & Automation | Complete | 4/4 |

E.1: CI/CD Pipeline

Sprint: S5 | Priority: P1 | Depends On: C.1 | Goal: Automated build, test, and deploy pipeline

  • E.1.1: Set up GitHub Actions CI pipeline (evidence: docs/operations/cicd-pipeline.md §E.1.1)
    • Triggers: Push to main, PR opened/updated, manual dispatch
    • Steps: Lint → type check → unit tests → integration tests → build
    • Matrix: Node 20 LTS, PostgreSQL 15
  • E.1.2: Configure automated deployment pipeline (evidence: docs/operations/cicd-pipeline.md §E.1.2)
    • Staging: Auto-deploy on merge to main
    • Production: Manual approval gate, deploy on tag
    • Rollback: One-click rollback to previous version
  • E.1.3: Set up container image build and registry (evidence: docs/operations/cicd-pipeline.md §E.1.3)
    • Registry: Google Artifact Registry
    • Images: Backend API, frontend static, database migration runner
    • Tagging: Semantic versioning + commit SHA
  • E.1.4: Implement database migration CI step (evidence: docs/operations/cicd-pipeline.md §E.1.4)
    • Pre-deploy: Run Prisma migrations in CI before app deploy
    • Validation: Schema diff check, backward compatibility verification
    • Rollback: Migration rollback script for each migration

E.2: Infrastructure (GKE/Cloud Run)

Sprint: S5-S6 | Priority: P1 | Depends On: E.1 | Goal: Production-ready GCP infrastructure with multi-environment setup

  • E.2.1: Configure Terraform for GCP infrastructure (evidence: docs/operations/gcp-infrastructure.md §Terraform)
    • Resources: Cloud Run (API), Cloud SQL (PostgreSQL), Memorystore (Redis), Cloud CDN
    • Environments: dev, staging, production with shared VPC
    • State: Terraform state in GCS bucket with locking
  • E.2.2: Set up Cloud Run services for API and frontend (evidence: docs/operations/gcp-infrastructure.md §Cloud-Run)
    • API: Cloud Run with min 2 / max 10 instances, 2GB RAM, 2 vCPU
    • Frontend: Cloud Run serving static files with CDN
    • Custom domain: bio-qms.coditect.ai with managed SSL
  • E.2.3: Configure Cloud SQL PostgreSQL instance (evidence: docs/operations/gcp-infrastructure.md §Cloud-SQL)
    • Tier: db-custom-2-4096 (production), db-f1-micro (dev)
    • HA: Multi-zone in production, automated backups every 4 hours
    • Connection: Cloud SQL Auth Proxy, private IP
  • E.2.4: Set up Redis (Memorystore) for caching and sessions (evidence: docs/operations/gcp-infrastructure.md §Memorystore)
    • Size: 1GB basic tier (production), shared (dev)
    • Use: Session cache, rate limiting, DocumentViewToken cache (ADR-196)
    • Failover: Automatic failover in production tier
  • E.2.5: Configure networking and security groups (evidence: docs/operations/gcp-infrastructure.md §Networking)
    • VPC: Shared VPC with private subnets
    • Firewall: Allow only HTTPS ingress, restrict database to API subnet
    • NAT: Cloud NAT for outbound traffic
    • WAF: Cloud Armor with OWASP rules
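The firewall intent in E.2.5 (HTTPS-only public ingress, database reachable only from the API subnet) can be expressed as a small policy check. The CIDR below is an assumed placeholder, not the real VPC layout:

```python
from ipaddress import ip_address, ip_network

API_SUBNET = ip_network("10.10.1.0/24")  # illustrative CIDR, not the actual subnet

def ingress_allowed(port: int, source_ip: str, target: str) -> bool:
    """Sketch of the E.2.5 rules: default-deny, with two narrow allowances."""
    if target == "api":
        # Public traffic reaches the API only over HTTPS.
        return port == 443
    if target == "database":
        # PostgreSQL is reachable only from the API subnet's private IPs.
        return port == 5432 and ip_address(source_ip) in API_SUBNET
    return False  # anything else is denied
```

Encoding the intent this way also gives the DR drills in E.4.3 a quick way to assert that a rebuilt environment reproduces the same rules.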

E.3: Monitoring & Observability

Sprint: S6 | Priority: P1 | Depends On: E.2 | Goal: Full observability stack for production operations

  • E.3.1: Configure Cloud Monitoring dashboards (evidence: docs/operations/monitoring-observability.md §E.3.1)
    • Dashboards: API latency/error rate, database performance, cache hit rate
    • Custom metrics: QMS-specific (documents signed/hour, CAPA resolution time)
    • SLOs: 99.9% API availability, < 200ms p95 latency
  • E.3.2: Set up alerting policies (evidence: docs/operations/monitoring-observability.md §E.3.2)
    • Critical: API down, database unreachable, certificate expiry < 7 days
    • Warning: Error rate > 1%, latency p95 > 500ms, disk > 80%
    • Channels: PagerDuty (critical), Slack (warning), email (informational)
  • E.3.3: Implement structured logging with Cloud Logging (evidence: docs/operations/monitoring-observability.md §E.3.3)
    • Format: JSON structured logs with correlation IDs
    • Fields: timestamp, level, service, request_id, user_id, org_id, action
    • Retention: 30 days hot, 1 year cold (BigQuery export)
  • E.3.4: Set up distributed tracing (evidence: docs/operations/monitoring-observability.md §E.3.4)
    • Tool: OpenTelemetry with Cloud Trace
    • Coverage: API requests, database queries, external API calls
    • Sampling: 100% for errors, 10% for normal traffic
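The E.3.4 sampling policy (100% of errors, 10% of normal traffic) can be sketched as a head-based sampler. Deciding by trace ID rather than a random draw is an assumption made here so that every span of a given trace shares the same decision:

```python
def should_sample(is_error: bool, trace_id: int, rate_pct: int = 10) -> bool:
    """E.3.4 policy sketch: keep every error trace; otherwise sample
    deterministically by trace ID so a trace is kept or dropped whole."""
    if is_error:
        return True
    return trace_id % 100 < rate_pct
```

With uniformly distributed trace IDs this keeps roughly 10% of non-error traffic while never losing an error trace.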

E.4: Backup & Disaster Recovery

Sprint: S6 | Priority: P1 | Depends On: E.2 | Goal: Validated backup and recovery procedures meeting compliance requirements

  • E.4.1: Configure automated database backups (evidence: docs/operations/dr-validation-automation.md §Backup)
    • Frequency: Every 4 hours (production), daily (staging)
    • Retention: 30 days point-in-time recovery, 90 days daily snapshots
    • Cross-region: Backup replica in different GCP region
  • E.4.2: Create disaster recovery runbook (evidence: docs/operations/dr-validation-automation.md §DR-Runbook)
    • RPO: 4 hours (maximum data loss)
    • RTO: 2 hours (maximum downtime)
    • Procedures: Database failover, service redeployment, DNS switching
    • Documentation: Step-by-step recovery for each failure scenario
  • E.4.3: Conduct DR testing and validation (evidence: docs/operations/dr-validation-automation.md §DR-Testing)
    • Frequency: Quarterly DR drills
    • Scenarios: Database failure, region outage, complete rebuild
    • Evidence: DR test results for compliance auditors
    • Report: DR test report with actual RPO/RTO measurements
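The E.4.3 report compares measured RPO/RTO against the E.4.2 targets; that comparison is simple arithmetic over three timestamps. The function and field names below are illustrative, not the actual report schema:

```python
from datetime import datetime, timedelta

RPO_TARGET = timedelta(hours=4)  # maximum tolerated data loss (E.4.2)
RTO_TARGET = timedelta(hours=2)  # maximum tolerated downtime (E.4.2)

def drill_result(failure_at: datetime, last_backup_at: datetime,
                 service_restored_at: datetime) -> dict:
    """Compute measured RPO/RTO for one DR drill and compare to targets."""
    rpo = failure_at - last_backup_at          # data written after this is lost
    rto = service_restored_at - failure_at     # elapsed downtime
    return {
        "rpo_minutes": rpo.total_seconds() / 60,
        "rto_minutes": rto.total_seconds() / 60,
        "rpo_met": rpo <= RPO_TARGET,
        "rto_met": rto <= RTO_TARGET,
    }
```

Note that a 4-hour backup interval makes the 4-hour RPO target the worst case: a failure just before the next backup loses almost the full interval.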

E.5: DR Validation & Automation

Sprint: S6-S7 | Priority: P1 | Depends On: E.4 | Goal: Automated DR validation with backup integrity verification and per-failure-mode runbooks

  • E.5.1: Implement automated backup integrity verification (evidence: docs/operations/dr-validation-automation.md §Backup-Integrity)
    • Daily check: automated restore test of latest backup to staging
    • Validation: row count verification, checksum comparison, data integrity
    • Alerting: immediate alert on backup integrity failure
    • Reporting: weekly backup health report for compliance
  • E.5.2: Build automated monthly failover testing (evidence: docs/operations/dr-validation-automation.md §Failover-Testing)
    • Database failover: automated Cloud SQL replica promotion test
    • Service failover: Cloud Run traffic switching to secondary region
    • DNS failover: automated DNS switching to backup endpoints
    • Metrics: measure actual RPO/RTO vs. target in each test
  • E.5.3: Create per-failure-mode runbooks (evidence: docs/operations/dr-validation-automation.md §Per-Failure-Runbooks)
    • Catalog: runbook for each identified failure mode (database, network, storage, compute, third-party)
    • Automation: scripted recovery steps with decision trees
    • Validation: runbook testing as part of quarterly DR drills
    • Versioning: runbook version control with change approval
  • E.5.4: Implement chaos engineering for DR validation (evidence: docs/operations/dr-validation-automation.md §Chaos-Engineering)
    • Fault injection: controlled failure injection in staging environment
    • Scenarios: network partition, disk full, high latency, service crash
    • Compliance safety: chaos experiments NEVER touch audit trail or signature chains
    • Evidence: chaos test results for compliance auditors
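The E.5.1 validation (row-count verification plus checksum comparison) can be sketched as below. The XOR-of-row-hashes checksum is one assumed technique, chosen here because it is order-independent, so a restore that returns rows in a different physical order still verifies:

```python
import hashlib

def table_checksum(rows) -> str:
    """Order-independent checksum: hash each row, XOR the digests.

    Row order does not affect the result, which suits restored tables
    whose physical ordering may differ from the source.
    """
    acc = bytearray(32)
    for row in rows:
        digest = hashlib.sha256(repr(row).encode()).digest()
        acc = bytearray(a ^ b for a, b in zip(acc, digest))
    return acc.hex()

def verify_restore(source_rows, restored_rows) -> dict:
    """E.5.1 sketch: counts must match and checksums must agree."""
    result = {
        "row_count_ok": len(source_rows) == len(restored_rows),
        "checksum_ok": table_checksum(source_rows) == table_checksum(restored_rows),
    }
    result["ok"] = all(result.values())
    return result
```

In the daily check described above, `source_rows` would come from the production instance and `restored_rows` from the staging restore of the latest backup; any `ok: False` result triggers the E.5.1 integrity alert.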

Updated: 2026-02-14 | Compliance: CODITECT Track Nomenclature Standard (ADR-054)