
Disaster Recovery Runbook

CODITECT Document Management System

Version: 1.0.0 | RTO: <1 hour | RPO: <1 hour | Last Tested: 2025-12-28


1. Overview

Recovery Objectives

| Metric | Target | Current Capability |
|---|---|---|
| Recovery Time Objective (RTO) | <1 hour | 45 minutes |
| Recovery Point Objective (RPO) | <1 hour | 15 minutes (continuous backup) |
| Maximum Tolerable Downtime | 4 hours | N/A |

Disaster Scenarios Covered

  1. Database failure (primary)
  2. Application cluster failure
  3. Cloud region failure
  4. Data corruption
  5. Security breach
  6. Accidental data deletion

2. Emergency Contacts

| Role | Name | Phone | Email |
|---|---|---|---|
| On-Call Engineer | Rotating | PagerDuty | oncall@coditect.ai |
| Engineering Lead | TBD | TBD | eng-lead@coditect.ai |
| Security Lead | TBD | TBD | security@coditect.ai |
| CTO | Hal Casteel | TBD | 1@az1.ai |

Escalation Path:

  1. On-Call Engineer (0-15 min)
  2. Engineering Lead (15-30 min)
  3. CTO (30+ min or P0 incidents)
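The escalation path above can be encoded in a small helper so paging tooling and this runbook cannot drift apart. A sketch; the thresholds mirror the list, and the function name is illustrative:

```shell
#!/bin/sh
# Map minutes since incident start to the escalation target.
escalation_target() {
  minutes=$1
  if [ "$minutes" -lt 15 ]; then
    echo "on-call engineer (oncall@coditect.ai)"
  elif [ "$minutes" -lt 30 ]; then
    echo "engineering lead (eng-lead@coditect.ai)"
  else
    echo "CTO"
  fi
}

escalation_target 5    # on-call engineer (oncall@coditect.ai)
escalation_target 20   # engineering lead (eng-lead@coditect.ai)
escalation_target 45   # CTO
```

Severity is not modeled here; per the list above, P0 incidents page the CTO immediately regardless of elapsed time.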

3. Database Recovery

3.1 Primary Database Failure

Symptoms:

  • Application returning 500 errors
  • "Connection refused" errors in logs
  • Cloud SQL instance unhealthy

Recovery Steps:

# 1. Verify Cloud SQL instance status
gcloud sql instances describe coditect-dms-db --project=coditect-prod

# 2. Check recent operations
gcloud sql operations list --instance=coditect-dms-db --project=coditect-prod

# 3. If instance is down, promote read replica
gcloud sql instances promote-replica coditect-dms-db-replica \
--project=coditect-prod

# 4. Update application to use new primary
kubectl set env deployment/coditect-dms-api \
API_DATABASE_URL="postgresql://user:pass@new-primary:5432/dms" \
-n coditect-dms

# 5. Verify connectivity
kubectl exec -it deploy/coditect-dms-api -n coditect-dms -- \
python -c "from src.backend.database import engine; engine.connect()"

# 6. Monitor application health
kubectl get pods -n coditect-dms -w

Estimated Recovery Time: 15-30 minutes
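Step 6's monitoring can be wrapped in a bounded retry loop so recovery verification does not block indefinitely. A sketch; the real check (step 5's `kubectl exec` connectivity test) is replaced by a stand-in function, since the loop structure is the point:

```shell
#!/bin/sh
# Retry a health check until it passes or the retry budget is exhausted.
wait_for_healthy() {
  retries=$1
  interval=$2
  shift 2
  i=0
  while [ "$i" -lt "$retries" ]; do
    if "$@"; then
      echo "healthy after $i retries"
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "still unhealthy after $retries retries" >&2
  return 1
}

# Stand-in check: succeeds on the third attempt.
attempts=0
check_api_health() {
  attempts=$((attempts + 1))
  [ "$attempts" -ge 3 ]
}

wait_for_healthy 10 0 check_api_health   # healthy after 2 retries
```

In production, `check_api_health` would run the connectivity test from step 5 (or curl the API's health endpoint), with a non-zero `interval`.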

3.2 Point-in-Time Recovery

When to use: Data corruption, accidental deletion

# 1. Create recovery instance from backup
gcloud sql instances clone coditect-dms-db coditect-dms-db-recovery \
--point-in-time="2025-12-28T10:00:00Z" \
--project=coditect-prod

# 2. Export affected tables
gcloud sql export sql coditect-dms-db-recovery \
gs://coditect-dms-backups/recovery-$(date +%Y%m%d).sql \
--database=dms \
--project=coditect-prod

# 3. Review and import specific data
# (Manual review required before import)

# 4. Clean up recovery instance
gcloud sql instances delete coditect-dms-db-recovery --project=coditect-prod

4. Application Recovery

4.1 Cluster Failure (GKE)

Symptoms:

  • kubectl commands timing out
  • All pods in unknown state
  • Load balancer returning 502/503

Recovery Steps:

# 1. Check cluster status
gcloud container clusters describe coditect-prod-cluster \
--zone=us-central1-a --project=coditect-prod

# 2. If cluster is unhealthy, failover to Cloud Run
# Enable Cloud Run deployment
gcloud run services update coditect-dms-api \
--min-instances=3 \
--region=us-central1 \
--project=coditect-prod

# 3. Update DNS to point to Cloud Run
# (Done via Cloud DNS or Cloudflare)

# 4. Once GKE is healthy, migrate back by scaling Cloud Run down to standby
gcloud run services update coditect-dms-api \
--min-instances=0 \
--region=us-central1 \
--project=coditect-prod

4.2 Single Pod Failure

Automatic recovery via Kubernetes:

  • HPA will scale up replacement pods
  • PDB ensures minimum availability
  • Liveness probe triggers restart

Manual intervention (if needed):

# 1. Check pod status
kubectl get pods -n coditect-dms

# 2. Describe failing pod
kubectl describe pod <pod-name> -n coditect-dms

# 3. Check logs
kubectl logs <pod-name> -n coditect-dms --previous

# 4. Force restart if stuck
kubectl delete pod <pod-name> -n coditect-dms

# 5. Scale up if needed
kubectl scale deployment coditect-dms-api --replicas=5 -n coditect-dms
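Step 1's output can be filtered so only the pods that need attention surface. A sketch that parses `kubectl get pods` output; a captured sample stands in for a live cluster:

```shell
#!/bin/sh
# Print names of pods that are not in a healthy terminal state.
# Pipe `kubectl get pods -n coditect-dms` into this in production.
stuck_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1 }'
}

printf '%s\n' \
  "NAME                        READY   STATUS             RESTARTS   AGE" \
  "coditect-dms-api-abc        1/1     Running            0          2d" \
  "coditect-dms-api-def        0/1     CrashLoopBackOff   12         3h" \
  | stuck_pods
# coditect-dms-api-def
```

The names it prints feed directly into steps 2-4 (`kubectl describe`, `kubectl logs`, `kubectl delete`).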

5. Region Failure

5.1 Multi-Region Failover

Prerequisites:

  • Secondary region deployed (us-east1)
  • Database replication to secondary region
  • DNS TTL set to 60 seconds

Failover Steps:

# 1. Verify secondary region is healthy
gcloud container clusters get-credentials coditect-dr-cluster \
--zone=us-east1-b --project=coditect-prod

kubectl get pods -n coditect-dms

# 2. Promote secondary database
gcloud sql instances promote-replica coditect-dms-db-east \
--project=coditect-prod

# 3. Update secondary cluster to use promoted database
kubectl set env deployment/coditect-dms-api \
API_DATABASE_URL="postgresql://user:pass@east-primary:5432/dms" \
-n coditect-dms

# 4. Update DNS to point to secondary region
# Via Cloud DNS:
gcloud dns record-sets update dms-api.coditect.ai. \
--type=A \
--zone=coditect-zone \
--ttl=60 \
--rrdatas="<secondary-region-ip>"

# 5. Monitor traffic shift
# Watch for new traffic in secondary region logs

Estimated Failover Time: 30-45 minutes


6. Data Recovery

6.1 Restore from Backup

Daily backups location: gs://coditect-dms-backups/daily/

# 1. List available backups
gsutil ls gs://coditect-dms-backups/daily/

# 2. (Optional) Download a copy for local inspection
gsutil cp gs://coditect-dms-backups/daily/2025-12-27.sql.gz ./

# 3. Decompress the local copy
gunzip 2025-12-27.sql.gz

# 4. Restore to new instance
gcloud sql instances create coditect-dms-db-restore \
--database-version=POSTGRES_15 \
--tier=db-custom-4-16384 \
--region=us-central1 \
--project=coditect-prod

# (gcloud sql import reads gzipped dumps directly from GCS)
gcloud sql import sql coditect-dms-db-restore \
gs://coditect-dms-backups/daily/2025-12-27.sql.gz \
--database=dms \
--project=coditect-prod

# 5. Validate data integrity
# Connect and run validation queries
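Rather than hard-coding a date, step 1's listing can be piped into a helper that selects the newest daily backup. A sketch with the `gsutil ls` output simulated by a fixed list; the ISO-dated file names sort chronologically, which is what makes this work:

```shell
#!/bin/sh
# Select the most recent daily backup from a gsutil-style listing.
# In production: gsutil ls gs://coditect-dms-backups/daily/ | latest_backup
latest_backup() {
  sort | tail -n 1
}

printf '%s\n' \
  "gs://coditect-dms-backups/daily/2025-12-25.sql.gz" \
  "gs://coditect-dms-backups/daily/2025-12-27.sql.gz" \
  "gs://coditect-dms-backups/daily/2025-12-26.sql.gz" \
  | latest_backup
# gs://coditect-dms-backups/daily/2025-12-27.sql.gz
```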

6.2 Restore Specific Tenant Data

# 1. Identify tenant from backup
# Review backup and extract tenant-specific data

# 2. Export tenant data from the recovery instance
# (pg_dump has no row-level filter, so use COPY with a WHERE clause)
psql -h <backup-host> -U postgres -d dms \
-c "\copy (SELECT * FROM documents WHERE tenant_id = '<tenant-uuid>') TO 'tenant_documents.csv' WITH (FORMAT csv, HEADER)"

# 3. Import to production (after review)
psql -h <prod-host> -U postgres -d dms \
-c "\copy documents FROM 'tenant_documents.csv' WITH (FORMAT csv, HEADER)"
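Because the tenant ID is interpolated into a query, it is worth validating its shape before running the export. A sketch assuming tenant IDs are lowercase UUIDs (which the `<tenant-uuid>` placeholder suggests but the runbook does not state):

```shell
#!/bin/sh
# Sanity-check a tenant ID before interpolating it into a WHERE clause,
# so a typo cannot export the wrong rows or inject into the query.
is_uuid() {
  echo "$1" | grep -Eq \
    '^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$'
}

if is_uuid "123e4567-e89b-12d3-a456-426614174000"; then
  echo "valid"
fi
is_uuid "not-a-uuid" || echo "invalid"
```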

7. Security Incident Response

7.1 Suspected Breach

IMMEDIATE ACTIONS:

# 1. Isolate affected systems
kubectl scale deployment coditect-dms-api --replicas=0 -n coditect-dms

# 2. Rotate all secrets
# Generate a fresh JWT signing secret (do not point --data-file at
# /dev/urandom -- it is an endless stream, so the upload would never finish)
openssl rand -base64 32 | gcloud secrets versions add jwt-secret --data-file=-
# Re-issue the Stripe key from the Stripe dashboard, then store the new value:
gcloud secrets versions add stripe-secret-key --data-file=-

# 3. Invalidate all sessions
# Clear Redis session store
redis-cli -h <redis-host> FLUSHDB

# 4. Enable enhanced logging
kubectl set env deployment/coditect-dms-api \
API_LOG_LEVEL=DEBUG \
-n coditect-dms

# 5. Notify security team
# Page security lead immediately

7.2 Post-Incident

  1. Preserve logs for forensics
  2. Document timeline of events
  3. Identify root cause
  4. Implement fixes
  5. Conduct post-mortem
  6. Update runbook

8. Communication Templates

8.1 Internal Status Update

Subject: [INCIDENT] CODITECT DMS - {SEVERITY} - {Brief Description}

Status: {Investigating | Identified | Monitoring | Resolved}
Impact: {Description of user impact}
Started: {Time UTC}
ETA: {Expected resolution time}

Current Actions:
- {Action 1}
- {Action 2}

Next Update: {Time UTC}
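The template can be rendered by a small helper so every update has the same shape. A sketch covering the subject line and first two fields; extending it to the remaining fields is mechanical, and the function name is illustrative:

```shell
#!/bin/sh
# Render the internal status-update header from incident fields.
# Parameter order matches the template placeholders above.
status_update() {
  severity=$1
  desc=$2
  status=$3
  impact=$4
  printf '[INCIDENT] CODITECT DMS - %s - %s\n' "$severity" "$desc"
  printf 'Status: %s\n' "$status"
  printf 'Impact: %s\n' "$impact"
}

status_update "P1" "Primary database failover" "Identified" \
  "Document uploads delayed"
```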

8.2 Customer Communication

Subject: CODITECT DMS Service Update

We are currently experiencing {brief description of issue}.

Impact: {What customers may experience}
Status: Our team is actively working to resolve this issue.
ETA: We expect to have this resolved by {time}.

We apologize for any inconvenience. Updates will be posted to status.coditect.ai.

- The CODITECT Team

9. Testing Schedule

| Test Type | Frequency | Last Tested | Next Scheduled |
|---|---|---|---|
| Database failover | Monthly | 2025-12-15 | 2026-01-15 |
| Application failover | Monthly | 2025-12-20 | 2026-01-20 |
| Region failover | Quarterly | 2025-10-01 | 2026-01-01 |
| Full DR drill | Bi-annually | 2025-06-01 | 2026-01-01 |
| Backup restore | Weekly | 2025-12-25 | 2026-01-01 |

10. Runbook Maintenance

Owner: Engineering Lead | Review Frequency: Monthly | Last Review: 2025-12-28 | Next Review: 2026-01-28

Change Log

| Date | Version | Changes | Author |
|---|---|---|---|
| 2025-12-28 | 1.0.0 | Initial version | Claude Code |

Appendix A: Quick Reference Commands

# Check all systems status
kubectl get pods -n coditect-dms
gcloud sql instances list --project=coditect-prod
gcloud redis instances list --project=coditect-prod --region=us-central1

# View logs
kubectl logs -f deploy/coditect-dms-api -n coditect-dms
gcloud logging read "resource.type=cloud_run_revision" --limit=100

# Scale operations
kubectl scale deployment coditect-dms-api --replicas=5 -n coditect-dms

# Restart all pods
kubectl rollout restart deployment/coditect-dms-api -n coditect-dms