
FP&A Platform — Disaster Recovery Plan

Version: 1.0
Last Updated: 2026-02-03
Document ID: OPS-001
Classification: Confidential
Review Frequency: Quarterly


1. Executive Summary

This Disaster Recovery Plan (DRP) defines procedures for recovering the FP&A Platform from various failure scenarios. The plan ensures business continuity for financial operations while meeting regulatory requirements for data protection and availability.

Recovery Objectives

| Metric | Target | Rationale |
|---|---|---|
| RTO (Recovery Time Objective) | 4 hours | Month-end close cannot be delayed >4 hrs |
| RPO (Recovery Point Objective) | 15 minutes | Maximum acceptable data loss |
| MTTR (Mean Time to Recovery) | 2 hours | Target for common failures |
| Availability | 99.9% | 8.76 hours downtime/year maximum |
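The availability target maps directly to a downtime budget; the arithmetic behind the 8.76-hour figure can be sanity-checked with a one-liner (the monthly figure assumes a 30-day month):

```shell
# Convert an availability target into a downtime budget:
# (100 - availability)% of the period is the maximum allowed downtime.
out=$(awk -v a=99.9 'BEGIN {
  frac = (100 - a) / 100                        # fraction of time allowed down
  printf "yearly: %.2f hours, monthly: %.1f minutes\n",
         frac * 365 * 24, frac * 30 * 24 * 60
}')
echo "$out"   # → yearly: 8.76 hours, monthly: 43.2 minutes
```

The monthly budget (~43 minutes) is the more useful number during an incident: it is roughly one automatic Cloud SQL failover plus margin.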

Service Tiers

| Tier | Services | RTO | RPO |
|---|---|---|---|
| Tier 1 - Critical | GL, Auth, API Gateway | 1 hour | 5 min |
| Tier 2 - Essential | Reconciliation, Reporting | 2 hours | 15 min |
| Tier 3 - Important | AI Agents, Forecasting | 4 hours | 1 hour |
| Tier 4 - Deferrable | Analytics, Training | 24 hours | 24 hours |

2. Infrastructure Overview

Production Architecture

┌─────────────────────────────────────────────────────────────────────┐
│ GCP PRODUCTION │
├─────────────────────────────────────────────────────────────────────┤
│ Region: us-central1 (Primary) | us-east1 (DR) │
│ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ GKE CLUSTER (Autopilot) │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │API GW │ │GL Svc │ │Recon Svc│ │AI Agents│ │ │
│ │ │(3 pods) │ │(3 pods) │ │(2 pods) │ │(2 pods) │ │ │
│ │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ DATA LAYER │ │
│ │ ┌────────────────────┐ ┌──────────────┐ ┌────────────┐ │ │
│ │ │ Cloud SQL (HA) │ │ Redis │ │ Cloud │ │ │
│ │ │ Primary + Replica │ │ (HA) │ │ Storage │ │ │
│ │ │ us-central1-a/b │ │ 4GB │ │ (Multi-reg)│ │ │
│ │ └────────────────────┘ └──────────────┘ └────────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘

Backup Infrastructure

| Component | Backup Method | Frequency | Retention | Location |
|---|---|---|---|---|
| Cloud SQL | Automated + On-demand | Continuous WAL + Daily snapshot | 30 days | Cross-region (us-east1) |
| Redis | RDB Snapshots | Hourly | 7 days | Same region |
| Cloud Storage | Versioning + Replication | Real-time | 365 days | Dual-region |
| Secrets | Secret Manager versioning | On change | 90 days | Global |
| Configs | Git + Terraform state | On change | Unlimited | GitHub + GCS |
| Audit Logs | immudb replication | Real-time | 7 years | Cross-region |

3. Failure Scenarios and Recovery Procedures

3.1 Database Failures

Scenario A: Primary Database Failure

Trigger: Cloud SQL primary instance becomes unavailable

Detection:

  • Cloud SQL health check fails
  • Application connection errors spike
  • PagerDuty alert: cloudsql-primary-down

Recovery Procedure:

# Automatic failover (typically 60-120 seconds)
# Cloud SQL HA handles this automatically

# If automatic failover fails, manual steps:

# 1. Check instance status
gcloud sql instances describe fpa-platform-db --format='get(state)'

# 2. If SUSPENDED, attempt restart
gcloud sql instances restart fpa-platform-db

# 3. If restart fails, promote replica
gcloud sql instances promote-replica fpa-platform-db-replica

# 4. Update connection strings (via Secret Manager)
gcloud secrets versions add db-connection-string \
--data-file=./new-connection-string.txt

# 5. Rolling restart of application pods
kubectl rollout restart deployment -n fpa-platform

# 6. Verify connectivity
kubectl exec -it $(kubectl get pod -l app=gl-service -o name | head -1) \
-- psql $DATABASE_URL -c "SELECT 1"

Recovery Time: 2-5 minutes (auto), 15-30 minutes (manual)

Post-Recovery:

  1. Verify all services healthy
  2. Check for data loss (compare max timestamps)
  3. Recreate replica if promoted
  4. Update incident log
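Post-recovery step 2 (comparing max timestamps) can be sketched as a small helper. The `journal_entries` table and `created_at` column in the comment are illustrative names, not fixed parts of the schema:

```shell
# Data-loss check after failover (sketch): compare the newest row timestamp on
# the promoted instance against the failover time. ISO-8601 UTC strings compare
# correctly as plain strings, so bash's [[ < ]] is sufficient.
check_data_loss() {
  local last_row="$1" failover_at="$2"
  if [[ "$last_row" < "$failover_at" ]]; then
    echo "POSSIBLE DATA LOSS: newest row $last_row predates failover $failover_at"
  else
    echo "OK: rows present up to $last_row"
  fi
}

# In practice the first argument comes from the promoted primary, e.g.:
#   last_row=$(psql "$DATABASE_URL" -At -c "SELECT max(created_at) FROM journal_entries")
check_data_loss "2026-02-03T09:58:12Z" "2026-02-03T10:05:00Z"
```

A "POSSIBLE DATA LOSS" result feeds directly into the incident log and, if the gap exceeds the 5-minute Tier 1 RPO, triggers the escalation matrix in Section 4.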

Scenario B: Database Corruption

Trigger: Data integrity issues, failed migrations, application bugs

Detection:

  • Data validation errors
  • Reconciliation failures
  • User-reported incorrect data

Recovery Procedure:

# 1. Assess scope of corruption
# Run data integrity checks
kubectl exec -it $(kubectl get pod -l app=gl-service -o name | head -1) -- \
python -m fpa.scripts.data_integrity_check

# 2. Identify corruption time window
# Check audit logs in immudb
immudb-cli scan --prefix "table:journal_entries" --since "2026-02-01T00:00:00Z"

# 3. For limited corruption: Point-in-time recovery to new instance
gcloud sql instances clone fpa-platform-db fpa-platform-db-recovery \
--point-in-time "2026-02-03T10:00:00Z"

# 4. Export clean data from recovery instance
pg_dump -h recovery-instance-ip -d fpa_platform -t affected_tables > clean_data.sql

# 5. Apply clean data to production (with downtime window)
# Announce maintenance window
# Stop affected services
kubectl scale deployment gl-service -n fpa-platform --replicas=0

# Import clean data
psql -h production-ip -d fpa_platform < clean_data.sql

# Restart services
kubectl scale deployment gl-service -n fpa-platform --replicas=3

# 6. Verify data integrity
python -m fpa.scripts.data_integrity_check --full

# 7. Delete recovery instance
gcloud sql instances delete fpa-platform-db-recovery

Recovery Time: 1-4 hours depending on scope


3.2 Application Failures

Scenario C: Service Crash/Memory Leak

Trigger: OOMKilled, unhandled exceptions, deadlocks

Detection:

  • Pod restart count increases
  • Error rate spikes in Prometheus
  • PagerDuty alert: high-pod-restarts

Recovery Procedure:

# 1. Check pod status
kubectl get pods -n fpa-platform -l app=affected-service

# 2. Check recent events
kubectl describe pod affected-pod-name

# 3. Check logs
kubectl logs affected-pod-name --previous

# 4. If widespread, rollback deployment
kubectl rollout undo deployment/affected-service

# 5. If memory issue, increase limits temporarily
kubectl patch deployment affected-service -p \
'{"spec":{"template":{"spec":{"containers":[{"name":"app","resources":{"limits":{"memory":"2Gi"}}}]}}}}'

# 6. Monitor recovery
kubectl rollout status deployment/affected-service

Recovery Time: 5-15 minutes
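When several services show elevated restart counts at once, a small triage helper (not part of the runbook above; the threshold of 5 is arbitrary) narrows the list before step 1:

```shell
# Print pods whose restart count exceeds a threshold. Input is the default
# `kubectl get pods` output: NAME READY STATUS RESTARTS AGE.
flag_high_restarts() {
  local threshold="$1"
  awk -v t="$threshold" 'NR > 1 && $4+0 > t { print $1, $4 }'
}

# Typical use against the cluster:
#   kubectl get pods -n fpa-platform | flag_high_restarts 5

# Demo with captured output (illustrative pod names):
printf 'NAME READY STATUS RESTARTS AGE\ngl-service-7d9f 1/1 Running 14 2d\napi-gateway-5c2b 1/1 Running 0 2d\n' \
  | flag_high_restarts 5   # → gl-service-7d9f 14
```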


Scenario D: Complete Application Failure

Trigger: All services down, GKE cluster failure

Detection:

  • All health checks failing
  • No pods running
  • PagerDuty alert: platform-complete-outage

Recovery Procedure:

# 1. Check GKE cluster status (Autopilot clusters are regional, so use --region)
gcloud container clusters describe fpa-cluster --region us-central1

# 2. If cluster unresponsive, check control plane
gcloud container operations list --filter="TARGET_LINK:fpa-cluster"

# 3. For control plane issues (rare with Autopilot)
# Contact GCP support immediately
# GCP Support: 1-866-777-0375

# 4. For node pool issues (Standard clusters only; Autopilot manages nodes itself)
gcloud container node-pools update default-pool \
--cluster=fpa-cluster \
--region=us-central1 \
--enable-autorepair

# 5. If cluster unrecoverable, recreate from Terraform
cd terraform/gke
terraform apply -var="cluster_suffix=recovery"

# 6. Restore workloads
kubectl apply -k kubernetes/overlays/production

# 7. Point DNS to new cluster
gcloud dns record-sets update api.fpa-platform.com \
--rrdatas="NEW_INGRESS_IP" \
--type=A \
--ttl=60 \
--zone=fpa-platform-zone

Recovery Time: 30-90 minutes
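Because the 2025-12-01 test flagged DNS TTL problems, step 7's cutover is worth verifying explicitly. A small check, with the live `dig` lookup shown as a comment since it needs real DNS (the IPs below are documentation-range placeholders):

```shell
# Verify DNS has cut over to the new ingress IP. The comparison is factored
# out so the same helper serves the primary and DR cutovers.
verify_dns_cutover() {
  local expected="$1" actual="$2"
  if [ "$actual" = "$expected" ]; then
    echo "DNS OK: $actual"
  else
    echo "DNS STALE: resolver returned $actual, expected $expected"
  fi
}

# Live check (recursive resolvers may lag by up to the record's previous TTL):
#   verify_dns_cutover "$NEW_INGRESS_IP" "$(dig +short api.fpa-platform.com A | head -1)"
verify_dns_cutover "203.0.113.10" "203.0.113.10"   # → DNS OK: 203.0.113.10
```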


3.3 Infrastructure Failures

Scenario E: Zone Outage

Trigger: GCP zone (us-central1-a) becomes unavailable

Detection:

  • GCP status page shows zone issue
  • Pods in failed zone not responding
  • Automatic zone failover may resolve the outage without intervention

Recovery Procedure:

# Autopilot should automatically reschedule pods to healthy zones

# 1. Verify pods rescheduling
kubectl get pods -o wide -n fpa-platform

# 2. If Cloud SQL affected, check HA status
gcloud sql instances describe fpa-platform-db

# 3. Monitor recovery
watch kubectl get pods -n fpa-platform

# 4. Once zone recovers, rebalance
kubectl rollout restart deployment -n fpa-platform

Recovery Time: 5-15 minutes (automatic)


Scenario F: Region Outage

Trigger: Entire us-central1 region unavailable (rare)

Detection:

  • All services unreachable
  • GCP status shows region outage
  • PagerDuty escalation triggered

Recovery Procedure:

# CRITICAL: Full DR region activation

# 1. Confirm region outage (not network issue)
# Check: https://status.cloud.google.com/

# 2. Activate DR region (us-east1)
cd terraform/dr
terraform workspace select dr
terraform apply -var="activate_dr=true"

# 3. Promote DR database (if not using global)
gcloud sql instances promote-replica fpa-platform-db-dr-replica

# 4. Deploy workloads to DR cluster
kubectl config use-context gke_project_us-east1_fpa-cluster-dr
kubectl apply -k kubernetes/overlays/dr

# 5. Update DNS to DR region
gcloud dns record-sets update api.fpa-platform.com \
--rrdatas="DR_REGION_IP" \
--type=A \
--ttl=60 \
--zone=fpa-platform-zone

# 6. Notify customers
python scripts/send_incident_notification.py --severity=critical

# 7. Monitor DR environment
kubectl get pods -n fpa-platform
curl -f https://api.fpa-platform.com/health

Recovery Time: 30-60 minutes

Data Loss: Up to 15 minutes (cross-region replication lag)
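The 15-minute figure is bounded by cross-region replication lag, which should be measured at promotion time rather than assumed. A sketch, with the live query shown as a comment (`pg_last_xact_replay_timestamp()` is standard PostgreSQL; the epoch values below are illustrative):

```shell
# Estimate worst-case data loss at promotion: the gap between the primary's
# last commit and the replica's last replayed transaction, in seconds.
replication_lag_seconds() {
  local primary_epoch="$1" replica_epoch="$2"
  echo $(( primary_epoch - replica_epoch ))
}

# On the replica, the replay position is available directly:
#   psql -At -c "SELECT extract(epoch FROM now() - pg_last_xact_replay_timestamp())"
replication_lag_seconds 1770000900 1770000000   # → 900 (i.e. 15 minutes)
```

If the measured lag exceeds the 15-minute RPO, record the overshoot in the incident log before announcing the DR environment live.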


3.4 Security Incidents

Scenario G: Credential Compromise

Trigger: Suspected or confirmed credential leak

Detection:

  • Unusual API activity patterns
  • Alerts from Secret Manager audit logs
  • External notification (bug bounty, etc.)

Recovery Procedure:

# IMMEDIATE ACTIONS (within 15 minutes)

# 1. Revoke compromised credentials
# Database credentials
gcloud sql users set-password app-user --instance=fpa-platform-db \
--password=$(openssl rand -base64 32)

# Service account keys
gcloud iam service-accounts keys delete KEY_ID \
--iam-account=fpa-app@project.iam.gserviceaccount.com

# API keys (delete to revoke; clearing restrictions would widen access, not revoke it)
gcloud services api-keys delete KEY_ID

# 2. Update secrets
gcloud secrets versions add db-password --data-file=./new-password.txt

# 3. Force credential refresh
kubectl rollout restart deployment -n fpa-platform

# 4. Enable additional logging
gcloud logging sinks create security-incident-sink \
storage.googleapis.com/security-logs \
--log-filter='protoPayload.authenticationInfo.principalEmail="compromised-account"'

# 5. Initiate forensic analysis
# Preserve logs
gcloud logging read 'timestamp>="2026-02-01T00:00:00Z"' \
--format=json > incident-logs.json

# 6. Notify security team and legal
python scripts/send_security_alert.py --severity=critical

Recovery Time: 15-30 minutes for immediate containment


Scenario H: Ransomware/Data Breach

Trigger: Encryption of data, unauthorized data access

Detection:

  • Unable to read/write data
  • Ransom demand received
  • Unusual data exfiltration patterns

Recovery Procedure:

# CRITICAL: DO NOT PAY RANSOM

# 1. IMMEDIATE ISOLATION
# Disable external access
gcloud compute firewall-rules update allow-external --disabled

# Block compromised service accounts
gcloud iam service-accounts disable fpa-app@project.iam.gserviceaccount.com

# 2. PRESERVE EVIDENCE
# Snapshot all disks (each snapshot must name the disk's zone)
gcloud compute disks list --format='value(name,zone.basename())' | \
while read -r disk zone; do
  gcloud compute disks snapshot "$disk" --zone="$zone"
done

# Export all logs
gcloud logging read 'timestamp>="INCIDENT_START_TIME"' --format=json > forensic-logs.json

# 3. NOTIFY
# Legal and compliance
# Law enforcement (FBI IC3 if US)
# Customers (per breach notification requirements)

# 4. RECOVERY FROM CLEAN BACKUPS
# Identify last known good backup
gcloud sql backups list --instance=fpa-platform-db

# Create new environment
cd terraform/clean-recovery
terraform apply

# Restore from backup
gcloud sql backups restore LAST_GOOD_BACKUP_ID \
--restore-instance=fpa-platform-db-clean \
--backup-instance=fpa-platform-db

# 5. VALIDATE CLEAN STATE
# Security scan of restored environment
# Data integrity verification
# Penetration test before going live

Recovery Time: 4-24 hours
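Step 4's "identify last known good backup" can be scripted against the backup list. The `gcloud` invocation in the comment is the likely input source, with an assumed `--format` projection; the piped demo uses illustrative IDs:

```shell
# Select the newest backup taken strictly before the incident began.
# Input: one backup per line as "<id> <RFC3339 start time>"; ISO-8601 UTC
# timestamps sort correctly as plain strings.
last_good_backup() {
  local incident_start="$1"
  awk -v cutoff="$incident_start" '$2 < cutoff' | sort -k2 | tail -1
}

# Input would typically come from something like:
#   gcloud sql backups list --instance=fpa-platform-db --format='value(id,windowStartTime)'
printf '101 2026-02-01T02:00:00Z\n102 2026-02-02T02:00:00Z\n103 2026-02-03T02:00:00Z\n' \
  | last_good_backup "2026-02-02T12:00:00Z"   # → 102 2026-02-02T02:00:00Z
```

Note that "before the incident began" is necessary but not sufficient: for ransomware, the chosen backup must also pass the validation scans in step 5, since attackers often dwell before encrypting.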


4. Communication Plan

Escalation Matrix

| Severity | Response Time | Notified | Approval Needed |
|---|---|---|---|
| P1 - Critical | 15 min | On-call + Engineering Lead + CTO | None for containment |
| P2 - Major | 30 min | On-call + Engineering Lead | None |
| P3 - Minor | 2 hours | On-call | None |
| P4 - Low | 24 hours | Ticket queue | None |

Notification Templates

Internal Slack Alert:

🚨 INCIDENT: [SEVERITY] - [Brief Description]
Status: [Investigating/Identified/Monitoring/Resolved]
Impact: [Affected services and users]
ETA: [Estimated resolution time]
Lead: [@oncall-engineer]
Bridge: [Zoom link]

Customer Status Page:

[TIMESTAMP] - Investigating Issue
We are currently investigating reports of [issue description].
Affected: [Services]
We will provide updates every 30 minutes.

[TIMESTAMP] - Issue Identified
We have identified the root cause as [brief explanation].
Our team is working on a fix.
ETA: [Time estimate]

[TIMESTAMP] - Issue Resolved
The issue has been resolved. All services are operating normally.
Root Cause: [Brief explanation]
Duration: [X hours Y minutes]

Communication Channels

| Channel | Purpose | Owner |
|---|---|---|
| Status Page | Customer-facing updates | On-call |
| Slack #incidents | Internal coordination | Engineering |
| PagerDuty | Alerting and escalation | Automated |
| Email | Customer notification (major) | Customer Success |
| Phone | Critical customer notification | Account Manager |

5. Testing and Maintenance

DR Test Schedule

| Test Type | Frequency | Duration | Participants |
|---|---|---|---|
| Backup Restore | Monthly | 2 hours | DBA |
| Failover Test | Quarterly | 4 hours | Engineering + Ops |
| Full DR Exercise | Annually | 8 hours | All teams |
| Tabletop Exercise | Bi-annually | 2 hours | Leadership |

DR Test Checklist

## Pre-Test
- [ ] Announce maintenance window
- [ ] Verify backup freshness
- [ ] Confirm all participants available
- [ ] Prepare rollback procedures

## Test Execution
- [ ] Simulate failure scenario
- [ ] Execute recovery procedure
- [ ] Verify service restoration
- [ ] Validate data integrity
- [ ] Measure RTO/RPO achieved

## Post-Test
- [ ] Document lessons learned
- [ ] Update procedures if needed
- [ ] File test report
- [ ] Schedule next test
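The "Measure RTO/RPO achieved" step can be automated with a stopwatch around the health check; the endpoint and polling interval below are placeholders:

```shell
# Measure achieved RTO during a test: poll until the supplied health check
# passes, then report elapsed wall-clock seconds.
measure_rto() {
  local check_cmd="$1" start
  start=$(date +%s)
  until eval "$check_cmd"; do sleep 5; done
  echo "RTO achieved: $(( $(date +%s) - start ))s"
}

# During a failover exercise:
#   measure_rto 'curl -fsS https://api.fpa-platform.com/health > /dev/null'
```

Start the clock at the moment of simulated failure, not at the start of the recovery procedure, so the measured value is comparable to the RTO targets in Section 1.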

Last Test Results

| Test Date | Scenario | RTO Achieved | RPO Achieved | Issues Found |
|---|---|---|---|---|
| 2026-01-15 | DB Failover | 3 min | 0 | None |
| 2025-12-01 | Zone Outage | 8 min | 0 | DNS TTL too high |
| 2025-09-15 | Full DR | 45 min | 12 min | Slow secret rotation |

6. Contacts and Resources

Emergency Contacts

| Role | Name | Phone | Email |
|---|---|---|---|
| On-Call Primary | PagerDuty | - | oncall@fpa-platform.com |
| Engineering Lead | TBD | TBD | eng-lead@fpa-platform.com |
| CTO | TBD | TBD | cto@fpa-platform.com |
| GCP Support | - | 1-866-777-0375 | - |

Runbooks

| Runbook | Location |
|---|---|
| Database Recovery | /runbooks/database-recovery.md |
| Service Restoration | /runbooks/service-restoration.md |
| Security Incident | /runbooks/security-incident.md |
| Customer Communication | /runbooks/customer-comms.md |

External Resources


7. Document Control

| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2026-02-03 | Platform Team | Initial version |

Next Review: 2026-05-03

Approval:

  • Engineering Lead
  • Security Lead
  • CTO
