Disaster Recovery Runbook
CODITECT Document Management System
Version: 1.0.0 | RTO: <1 hour | RPO: <1 hour | Last Tested: 2025-12-28
1. Overview
Recovery Objectives
| Metric | Target | Current Capability |
|---|---|---|
| Recovery Time Objective (RTO) | <1 hour | 45 minutes |
| Recovery Point Objective (RPO) | <1 hour | 15 minutes (continuous backup) |
| Maximum Tolerable Downtime | 4 hours | N/A |
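The RPO target can be checked mechanically against the timestamp of the most recent backup. A minimal sketch (the function name and epoch-seconds interface are illustrative, not part of any existing tooling):

```shell
# Sketch: compare the newest backup's age against the RPO budget.
# rpo_ok LAST_BACKUP_EPOCH NOW_EPOCH RPO_SECONDS -> "within-rpo" or "rpo-breach"
rpo_ok() {
  local last=$1 now=$2 rpo=$3
  if [ $(( now - last )) -le "$rpo" ]; then
    echo "within-rpo"
  else
    echo "rpo-breach"
  fi
}
```

Example: `rpo_ok "$(date -d '15 minutes ago' +%s)" "$(date +%s)" 3600` should report `within-rpo` under the current 15-minute continuous-backup capability.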
Disaster Scenarios Covered
- Database failure (primary)
- Application cluster failure
- Cloud region failure
- Data corruption
- Security breach
- Accidental data deletion
2. Emergency Contacts
| Role | Name | Phone | Email |
|---|---|---|---|
| On-Call Engineer | Rotating | PagerDuty | oncall@coditect.ai |
| Engineering Lead | TBD | TBD | eng-lead@coditect.ai |
| Security Lead | TBD | TBD | security@coditect.ai |
| CTO | Hal Casteel | TBD | 1@az1.ai |
Escalation Path:
- On-Call Engineer (0-15 min)
- Engineering Lead (15-30 min)
- CTO (30+ min or P0 incidents)
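The escalation path above can be encoded so paging automation picks the right target from incident age. A hypothetical sketch (the function and role labels are illustrative; P0 incidents go straight to the CTO regardless of elapsed time):

```shell
# Sketch: map minutes since incident start to the escalation target.
# escalation_target MINUTES -> role label
escalation_target() {
  local mins=$1
  if [ "$mins" -lt 15 ]; then
    echo "on-call-engineer"
  elif [ "$mins" -lt 30 ]; then
    echo "engineering-lead"
  else
    echo "cto"   # also the immediate target for any P0 incident
  fi
}
```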
3. Database Recovery
3.1 Primary Database Failure
Symptoms:
- Application returning 500 errors
- "Connection refused" errors in logs
- Cloud SQL instance unhealthy
Recovery Steps:
# 1. Verify Cloud SQL instance status
gcloud sql instances describe coditect-dms-db --project=coditect-prod
# 2. Check recent operations
gcloud sql operations list --instance=coditect-dms-db --project=coditect-prod
# 3. If instance is down, promote read replica
gcloud sql instances promote-replica coditect-dms-db-replica \
--project=coditect-prod
# 4. Update application to use new primary
kubectl set env deployment/coditect-dms-api \
API_DATABASE_URL="postgresql://user:pass@new-primary:5432/dms" \
-n coditect-dms
# 5. Verify connectivity
kubectl exec -it deploy/coditect-dms-api -n coditect-dms -- \
python -c "from src.backend.database import engine; engine.connect()"
# 6. Monitor application health
kubectl get pods -n coditect-dms -w
Estimated Recovery Time: 15-30 minutes
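Several of the verification steps above (instance status, connectivity check) can flap while the replica promotion settles. A small retry wrapper, sketched here as an illustration (not existing tooling), avoids declaring failure on the first transient error:

```shell
# Sketch: run a command up to N times with a growing backoff.
# retry ATTEMPTS CMD [ARGS...] -> exit 0 on first success, 1 if all attempts fail
retry() {
  local attempts=$1; shift
  local i=1
  while true; do
    "$@" && return 0
    [ "$i" -ge "$attempts" ] && return 1
    sleep "$i"
    i=$(( i + 1 ))
  done
}
```

Example: `retry 5 pg_isready -h new-primary -p 5432` keeps probing the promoted primary instead of failing on a single refused connection.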
3.2 Point-in-Time Recovery
When to use: Data corruption, accidental deletion
# 1. Clone a recovery instance at the target point in time
gcloud sql instances clone coditect-dms-db coditect-dms-db-recovery \
--point-in-time="2025-12-28T10:00:00Z" \
--project=coditect-prod
# 2. Export affected tables
gcloud sql export sql coditect-dms-db-recovery \
gs://coditect-dms-backups/recovery-$(date +%Y%m%d).sql \
--database=dms \
--project=coditect-prod
# 3. Review and import specific data
# (Manual review required before import)
# 4. Clean up recovery instance
gcloud sql instances delete coditect-dms-db-recovery --project=coditect-prod
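Getting the `--point-in-time` value wrong (local time instead of UTC, or the wrong format) is an easy mistake under pressure. A helper along these lines can build it consistently; this is a sketch assuming GNU `date` (standard on Linux bastion hosts):

```shell
# Sketch: produce a UTC RFC 3339 timestamp for `gcloud sql instances clone`.
# Accepts any GNU date expression, e.g. "15 minutes ago" or "@<epoch-seconds>".
pit_timestamp() {
  date -u -d "$1" +%Y-%m-%dT%H:%M:%SZ
}
```

Example: `gcloud sql instances clone coditect-dms-db coditect-dms-db-recovery --point-in-time="$(pit_timestamp '15 minutes ago')" --project=coditect-prod`.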
4. Application Recovery
4.1 Cluster Failure (GKE)
Symptoms:
- kubectl commands timing out
- All pods in unknown state
- Load balancer returning 502/503
Recovery Steps:
# 1. Check cluster status
gcloud container clusters describe coditect-prod-cluster \
--zone=us-central1-a --project=coditect-prod
# 2. If the cluster is unhealthy, fail over to Cloud Run
# Scale up the standby Cloud Run service
gcloud run services update coditect-dms-api \
--min-instances=3 \
--region=us-central1 \
--project=coditect-prod
# 3. Update DNS to point to Cloud Run
# (Done via Cloud DNS or Cloudflare)
# 4. Once GKE is healthy and traffic has shifted back, scale Cloud Run down
gcloud run services update coditect-dms-api \
--min-instances=0 \
--region=us-central1 \
--project=coditect-prod
4.2 Single Pod Failure
Automatic recovery via Kubernetes:
- The Deployment's ReplicaSet replaces failed pods automatically
- HPA scales replicas up under load
- PDB ensures minimum availability during disruptions
- Liveness probe triggers container restarts
Manual intervention (if needed):
# 1. Check pod status
kubectl get pods -n coditect-dms
# 2. Describe failing pod
kubectl describe pod <pod-name> -n coditect-dms
# 3. Check logs
kubectl logs <pod-name> -n coditect-dms --previous
# 4. Force restart if stuck
kubectl delete pod <pod-name> -n coditect-dms
# 5. Scale up if needed
kubectl scale deployment coditect-dms-api --replicas=5 -n coditect-dms
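When many pods are cycling, scanning `kubectl get pods` output by eye is slow. A small filter, sketched here for illustration, lists only pods that are not in the `Running` state so steps 2-4 can target them directly:

```shell
# Sketch: print names of pods whose STATUS column is not "Running".
# Reads `kubectl get pods` table output on stdin (STATUS is field 3).
failing_pods() {
  awk 'NR > 1 && $3 != "Running" { print $1 }'
}
```

Usage: `kubectl get pods -n coditect-dms | failing_pods`. Note this treats `Completed` job pods as "failing" too; adjust the filter if jobs run in the namespace.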
5. Region Failure
5.1 Multi-Region Failover
Prerequisites:
- Secondary region deployed (us-east1)
- Database replication to secondary region
- DNS TTL set to 60 seconds
Failover Steps:
# 1. Verify secondary region is healthy
gcloud container clusters get-credentials coditect-dr-cluster \
--zone=us-east1-b --project=coditect-prod
kubectl get pods -n coditect-dms
# 2. Promote secondary database
gcloud sql instances promote-replica coditect-dms-db-east \
--project=coditect-prod
# 3. Update secondary cluster to use promoted database
kubectl set env deployment/coditect-dms-api \
API_DATABASE_URL="postgresql://user:pass@east-primary:5432/dms" \
-n coditect-dms
# 4. Update DNS to point to secondary region
# Via Cloud DNS:
gcloud dns record-sets update dms-api.coditect.ai. \
--type=A \
--zone=coditect-zone \
--ttl=60 \
--rrdatas="<secondary-region-ip>"
# 5. Monitor traffic shift
# Watch for new traffic in secondary region logs
Estimated Failover Time: 30-45 minutes
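The 60-second DNS TTL prerequisite is only useful if it is actually in effect at failover time. A quick check, sketched here (the function is illustrative), parses `dig +noall +answer dms-api.coditect.ai` output, where the TTL is the second field:

```shell
# Sketch: verify the answer-record TTL is within the failover budget.
# Reads dig answer lines on stdin; optional arg overrides the 60s budget.
ttl_within_budget() {
  awk -v max="${1:-60}" '{ if ($2+0 <= max+0) print "ok"; else print "ttl-too-high"; exit }'
}
```

Usage: `dig +noall +answer dms-api.coditect.ai | ttl_within_budget` as part of the pre-failover checklist.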
6. Data Recovery
6.1 Restore from Backup
Daily backups location: gs://coditect-dms-backups/daily/
# 1. List available backups
gsutil ls gs://coditect-dms-backups/daily/
# 2. (Optional) Download a backup for local inspection
gsutil cp gs://coditect-dms-backups/daily/2025-12-27.sql.gz ./
# 3. (Optional) Decompress
gunzip 2025-12-27.sql.gz
# 4. Restore to new instance
gcloud sql instances create coditect-dms-db-restore \
--database-version=POSTGRES_15 \
--tier=db-custom-4-16384 \
--region=us-central1 \
--project=coditect-prod
# Cloud SQL can import gzip-compressed dumps directly from GCS
gcloud sql import sql coditect-dms-db-restore \
gs://coditect-dms-backups/daily/2025-12-27.sql.gz \
--database=dms \
--project=coditect-prod
# 5. Validate data integrity
# Connect and run validation queries
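Step 1 assumes the operator picks the right backup by eye. Since the daily backup names embed an ISO date (`YYYY-MM-DD`), lexical order matches chronological order, so the newest object can be selected mechanically. A sketch (the helper name is illustrative):

```shell
# Sketch: select the most recent backup from a `gsutil ls` listing.
# ISO-dated names sort lexically in chronological order.
latest_backup() {
  sort | tail -n 1
}
```

Usage: `gsutil ls gs://coditect-dms-backups/daily/ | latest_backup`.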
6.2 Restore Specific Tenant Data
# 1. Identify tenant from backup
# Review backup and extract tenant-specific data
# 2. Export tenant data from the backup host
# (pg_dump cannot filter rows, so use \copy with a WHERE clause instead)
psql -h <backup-host> -U postgres -d dms -c \
"\copy (SELECT * FROM documents WHERE tenant_id = '<tenant-uuid>') TO 'tenant_documents.csv' WITH CSV HEADER"
# 3. Import to production (after review)
psql -h <prod-host> -U postgres -d dms -c \
"\copy documents FROM 'tenant_documents.csv' WITH CSV HEADER"
7. Security Incident Response
7.1 Suspected Breach
IMMEDIATE ACTIONS:
# 1. Isolate affected systems
kubectl scale deployment coditect-dms-api --replicas=0 -n coditect-dms
# 2. Rotate all secrets
# /dev/urandom is an endless stream; pass a fixed-length value instead
gcloud secrets versions add jwt-secret --data-file=<(openssl rand -base64 48)
# The Stripe key must be rotated in the Stripe dashboard first, then stored
gcloud secrets versions add stripe-secret-key --data-file=<(printf '%s' "$NEW_STRIPE_KEY")
# 3. Invalidate all sessions
# Clear the Redis session store (FLUSHDB clears only the currently selected DB)
redis-cli -h <redis-host> FLUSHDB
# 4. Enable enhanced logging
kubectl set env deployment/coditect-dms-api \
API_LOG_LEVEL=DEBUG \
-n coditect-dms
# 5. Notify security team
# Page security lead immediately
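The rotation step above needs fixed-length random material rather than an open-ended read of `/dev/urandom`. A minimal sketch that depends only on coreutils (the function name is illustrative):

```shell
# Sketch: generate a 64-character base64 secret from 48 random bytes.
new_secret() {
  head -c 48 /dev/urandom | base64 | tr -d '\n'
}
```

Usage: `gcloud secrets versions add jwt-secret --data-file=<(new_secret)`.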
7.2 Post-Incident
- Preserve logs for forensics
- Document timeline of events
- Identify root cause
- Implement fixes
- Conduct post-mortem
- Update runbook
8. Communication Templates
8.1 Internal Status Update
Subject: [INCIDENT] CODITECT DMS - {SEVERITY} - {Brief Description}
Status: {Investigating | Identified | Monitoring | Resolved}
Impact: {Description of user impact}
Started: {Time UTC}
ETA: {Expected resolution time}
Current Actions:
- {Action 1}
- {Action 2}
Next Update: {Time UTC}
8.2 Customer Communication
Subject: CODITECT DMS Service Update
We are currently experiencing {brief description of issue}.
Impact: {What customers may experience}
Status: Our team is actively working to resolve this issue.
ETA: We expect to have this resolved by {time}.
We apologize for any inconvenience. Updates will be posted to status.coditect.ai.
- The CODITECT Team
9. Testing Schedule
| Test Type | Frequency | Last Tested | Next Scheduled |
|---|---|---|---|
| Database failover | Monthly | 2025-12-15 | 2026-01-15 |
| Application failover | Monthly | 2025-12-20 | 2026-01-20 |
| Region failover | Quarterly | 2025-10-01 | 2026-01-01 |
| Full DR drill | Bi-annually | 2025-06-01 | 2026-01-01 |
| Backup restore | Weekly | 2025-12-25 | 2026-01-01 |
10. Runbook Maintenance
Owner: Engineering Lead | Review Frequency: Monthly | Last Review: 2025-12-28 | Next Review: 2026-01-28
Change Log
| Date | Version | Changes | Author |
|---|---|---|---|
| 2025-12-28 | 1.0.0 | Initial version | Claude Code |
Appendix A: Quick Reference Commands
# Check all systems status
kubectl get pods -n coditect-dms
gcloud sql instances list --project=coditect-prod
gcloud redis instances list --project=coditect-prod --region=us-central1
# View logs
kubectl logs -f deploy/coditect-dms-api -n coditect-dms
gcloud logging read "resource.type=cloud_run_revision" --limit=100
# Scale operations
kubectl scale deployment coditect-dms-api --replicas=5 -n coditect-dms
# Restart all pods
kubectl rollout restart deployment/coditect-dms-api -n coditect-dms