FP&A Platform — Disaster Recovery Plan
Version: 1.0
Last Updated: 2026-02-03
Document ID: OPS-001
Classification: Confidential
Review Frequency: Quarterly
1. Executive Summary
This Disaster Recovery Plan (DRP) defines procedures for recovering the FP&A Platform from various failure scenarios. The plan ensures business continuity for financial operations while meeting regulatory requirements for data protection and availability.
Recovery Objectives
| Metric | Target | Rationale |
|---|---|---|
| RTO (Recovery Time Objective) | 4 hours | Month-end close cannot be delayed >4hrs |
| RPO (Recovery Point Objective) | 15 minutes | Maximum acceptable data loss |
| MTTR (Mean Time to Recovery) | 2 hours | Target for common failures |
| Availability | 99.9% | 8.76 hours downtime/year maximum |
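The availability budget in the table can be sanity-checked with a one-line helper (a sketch; the function name is illustrative):

```shell
# Downtime budget in hours per year for a given availability percentage.
# 99.9% of 8760 hours/year leaves 8.76 hours of allowable downtime.
downtime_hours_per_year() {
  awk -v a="$1" 'BEGIN { printf "%.2f\n", (100 - a) / 100 * 8760 }'
}

downtime_hours_per_year 99.9   # prints 8.76
```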
Service Tiers
| Tier | Services | RTO | RPO |
|---|---|---|---|
| Tier 1 - Critical | GL, Auth, API Gateway | 1 hour | 5 min |
| Tier 2 - Essential | Reconciliation, Reporting | 2 hours | 15 min |
| Tier 3 - Important | AI Agents, Forecasting | 4 hours | 1 hour |
| Tier 4 - Deferrable | Analytics, Training | 24 hours | 24 hours |
2. Infrastructure Overview
Production Architecture
┌─────────────────────────────────────────────────────────────────────┐
│ GCP PRODUCTION │
├─────────────────────────────────────────────────────────────────────┤
│ Region: us-central1 (Primary) | us-east1 (DR) │
│ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ GKE CLUSTER (Autopilot) │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │API GW │ │GL Svc │ │Recon Svc│ │AI Agents│ │ │
│ │ │(3 pods) │ │(3 pods) │ │(2 pods) │ │(2 pods) │ │ │
│ │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ DATA LAYER │ │
│ │ ┌────────────────────┐ ┌──────────────┐ ┌────────────┐ │ │
│ │ │ Cloud SQL (HA) │ │ Redis │ │ Cloud │ │ │
│ │ │ Primary + Replica │ │ (HA) │ │ Storage │ │ │
│ │ │ us-central1-a/b │ │ 4GB │ │ (Multi-reg)│ │ │
│ │ └────────────────────┘ └──────────────┘ └────────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
Backup Infrastructure
| Component | Backup Method | Frequency | Retention | Location |
|---|---|---|---|---|
| Cloud SQL | Automated + On-demand | Continuous WAL + Daily snapshot | 30 days | Cross-region (us-east1) |
| Redis | RDB Snapshots | Hourly | 7 days | Same region |
| Cloud Storage | Versioning + Replication | Real-time | 365 days | Dual-region |
| Secrets | Secret Manager versioning | On change | 90 days | Global |
| Configs | Git + Terraform state | On change | Unlimited | GitHub + GCS |
| Audit Logs | immudb replication | Real-time | 7 years | Cross-region |
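A freshness check against the table above can be scripted. This is a sketch (GNU `date` assumed); in production the timestamp would come from something like `gcloud sql backups list --instance=fpa-platform-db --limit=1 --format='value(windowStartTime)'`:

```shell
# Flag a backup whose start time is older than a threshold (default 24h).
# $1 = backup timestamp (ISO 8601), $2 = max age in hours (optional).
check_backup_fresh() {
  local ts=$1 max_hours=${2:-24} age
  age=$(( ( $(date -u +%s) - $(date -d "$ts" +%s) ) / 3600 ))
  if [ "$age" -le "$max_hours" ]; then echo FRESH; else echo STALE; fi
}
```

Running this as part of the monthly backup-restore test catches a silently stalled backup schedule before it matters.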
3. Failure Scenarios and Recovery Procedures
3.1 Database Failures
Scenario A: Primary Database Failure
Trigger: Cloud SQL primary instance becomes unavailable
Detection:
- Cloud SQL health check fails
- Application connection errors spike
- PagerDuty alert: cloudsql-primary-down
Recovery Procedure:
# Automatic failover (typically 60-120 seconds)
# Cloud SQL HA handles this automatically
# If automatic failover fails, manual steps:
# 1. Check instance status
gcloud sql instances describe fpa-platform-db --format='get(state)'
# 2. If SUSPENDED, attempt restart
gcloud sql instances restart fpa-platform-db
# 3. If restart fails, promote replica
gcloud sql instances promote-replica fpa-platform-db-replica
# 4. Update connection strings (via Secret Manager)
gcloud secrets versions add db-connection-string \
--data-file=./new-connection-string.txt
# 5. Rolling restart of application pods
kubectl rollout restart deployment -n fpa-platform
# 6. Verify connectivity
kubectl exec -it -n fpa-platform $(kubectl get pod -n fpa-platform -l app=gl-service -o name | head -1) \
  -- psql $DATABASE_URL -c "SELECT 1"
Recovery Time: 2-5 minutes (auto), 15-30 minutes (manual)
Post-Recovery:
- Verify all services healthy
- Check for data loss (compare max timestamps)
- Recreate replica if promoted
- Update incident log
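The "check for data loss" step can be made concrete by comparing the newest row timestamp on the promoted instance against the last timestamp the application recorded before failover. A sketch (GNU `date` assumed; `journal_entries`/`created_at` are illustrative names — the real query would resemble `psql "$DATABASE_URL" -tAc "SELECT max(created_at) FROM journal_entries"`):

```shell
# Estimated data loss in seconds between two ISO 8601 timestamps:
# $1 = newest timestamp found in the promoted database,
# $2 = last timestamp the application layer recorded before failover.
data_loss_seconds() {
  echo $(( $(date -d "$2" +%s) - $(date -d "$1" +%s) ))
}
```

A result above 300 seconds (the 5-minute Tier 1 RPO) should be escalated rather than closed out.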
Scenario B: Database Corruption
Trigger: Data integrity issues, failed migrations, application bugs
Detection:
- Data validation errors
- Reconciliation failures
- User-reported incorrect data
Recovery Procedure:
# 1. Assess scope of corruption
# Run data integrity checks
kubectl exec -it -n fpa-platform $(kubectl get pod -n fpa-platform -l app=gl-service -o name | head -1) -- \
python -m fpa.scripts.data_integrity_check
# 2. Identify corruption time window
# Check audit logs in immudb
immudb-cli scan --prefix "table:journal_entries" --since "2026-02-01T00:00:00Z"
# 3. For limited corruption: Point-in-time recovery to new instance
gcloud sql instances clone fpa-platform-db fpa-platform-db-recovery \
--point-in-time "2026-02-03T10:00:00Z"
# 4. Export clean data from recovery instance
pg_dump -h recovery-instance-ip -d fpa_platform -t affected_tables > clean_data.sql
# 5. Apply clean data to production (with downtime window)
# Announce maintenance window
# Stop affected services
kubectl scale deployment gl-service -n fpa-platform --replicas=0
# Import clean data
psql -h production-ip -d fpa_platform < clean_data.sql
# Restart services
kubectl scale deployment gl-service -n fpa-platform --replicas=3
# 6. Verify data integrity
python -m fpa.scripts.data_integrity_check --full
# 7. Delete recovery instance
gcloud sql instances delete fpa-platform-db-recovery
Recovery Time: 1-4 hours depending on scope
3.2 Application Failures
Scenario C: Service Crash/Memory Leak
Trigger: OOMKilled, unhandled exceptions, deadlocks
Detection:
- Pod restart count increases
- Error rate spikes in Prometheus
- PagerDuty alert: high-pod-restarts
Recovery Procedure:
# 1. Check pod status
kubectl get pods -n fpa-platform -l app=affected-service
# 2. Check recent events
kubectl describe pod affected-pod-name
# 3. Check logs
kubectl logs affected-pod-name --previous
# 4. If widespread, rollback deployment
kubectl rollout undo deployment/affected-service
# 5. If memory issue, increase limits temporarily
kubectl patch deployment affected-service -p \
'{"spec":{"template":{"spec":{"containers":[{"name":"app","resources":{"limits":{"memory":"2Gi"}}}]}}}}'
# 6. Monitor recovery
kubectl rollout status deployment/affected-service
Recovery Time: 5-15 minutes
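The restart-count check in step 1 can be automated. A sketch that filters pods above a restart threshold, fed by name/restart pairs such as `kubectl get pods -n fpa-platform --no-headers -o custom-columns=NAME:.metadata.name,R:.status.containerStatuses[0].restartCount` (the column spec is an assumption about the pod layout, not part of this plan):

```shell
# Print the names of pods whose restart count exceeds $1,
# reading "name restarts" pairs from stdin.
flag_restart_loops() {
  local threshold=$1 name restarts
  while read -r name restarts; do
    if [ "$restarts" -gt "$threshold" ]; then echo "$name"; fi
  done
}
```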
Scenario D: Complete Application Failure
Trigger: All services down, GKE cluster failure
Detection:
- All health checks failing
- No pods running
- PagerDuty alert: platform-complete-outage
Recovery Procedure:
# 1. Check GKE cluster status
gcloud container clusters describe fpa-cluster --region us-central1
# 2. If cluster unresponsive, check control plane
gcloud container operations list --filter="TARGET_LINK:fpa-cluster"
# 3. For control plane issues (rare with Autopilot)
# Contact GCP support immediately
# GCP Support: 1-866-777-0375
# 4. For node pool issues (Standard clusters only; Autopilot manages and repairs nodes automatically)
gcloud container node-pools update default-pool \
--cluster=fpa-cluster \
--region=us-central1 \
--enable-autorepair
# 5. If cluster unrecoverable, recreate from Terraform
cd terraform/gke
terraform apply -var="cluster_suffix=recovery"
# 6. Restore workloads
kubectl apply -k kubernetes/overlays/production
# 7. Point DNS to new cluster
gcloud dns record-sets update api.fpa-platform.com \
--rrdatas="NEW_INGRESS_IP" \
--type=A \
--ttl=60 \
--zone=fpa-platform-zone
Recovery Time: 30-90 minutes
3.3 Infrastructure Failures
Scenario E: Zone Outage
Trigger: GCP zone (us-central1-a) becomes unavailable
Detection:
- GCP status page shows zone issue
- Pods in failed zone not responding
- Automatic zone failover may resolve the issue without intervention
Recovery Procedure:
# Autopilot should automatically reschedule pods to healthy zones
# 1. Verify pods rescheduling
kubectl get pods -o wide -n fpa-platform
# 2. If Cloud SQL affected, check HA status
gcloud sql instances describe fpa-platform-db
# 3. Monitor recovery
watch kubectl get pods -n fpa-platform
# 4. Once zone recovers, rebalance
kubectl rollout restart deployment -n fpa-platform
Recovery Time: 5-15 minutes (automatic)
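To confirm pods actually rescheduled off the failed zone, a per-node tally is a quick check. A sketch, fed by the NAME and NODE columns of `kubectl get pods -o wide` (e.g. `kubectl get pods -n fpa-platform -o wide --no-headers | awk '{print $1, $7}'` — the column index is an assumption about the default output layout):

```shell
# Count pods per node from "pod node" pairs on stdin; an empty or missing
# entry for a zone's nodes after rescheduling indicates pods moved away.
pods_per_node() {
  awk '{count[$2]++} END {for (n in count) print n, count[n]}' | sort
}
```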
Scenario F: Region Outage
Trigger: Entire us-central1 region unavailable (rare)
Detection:
- All services unreachable
- GCP status shows region outage
- PagerDuty escalation triggered
Recovery Procedure:
# CRITICAL: Full DR region activation
# 1. Confirm region outage (not network issue)
# Check: https://status.cloud.google.com/
# 2. Activate DR region (us-east1)
cd terraform/dr
terraform workspace select dr
terraform apply -var="activate_dr=true"
# 3. Promote DR database (if not using global)
gcloud sql instances promote-replica fpa-platform-db-dr-replica
# 4. Deploy workloads to DR cluster
kubectl config use-context gke_project_us-east1_fpa-cluster-dr
kubectl apply -k kubernetes/overlays/dr
# 5. Update DNS to DR region
gcloud dns record-sets update api.fpa-platform.com \
--rrdatas="DR_REGION_IP" \
--type=A \
--ttl=60 \
--zone=fpa-platform-zone
# 6. Notify customers
python scripts/send_incident_notification.py --severity=critical
# 7. Monitor DR environment
kubectl get pods -n fpa-platform
curl -f https://api.fpa-platform.com/health
Recovery Time: 30-60 minutes
Data Loss: Up to 15 minutes (cross-region replication lag)
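Before declaring the DR cutover complete, the replication lag should be checked against the 15-minute RPO. Lag in seconds could be read from the replica with standard PostgreSQL functions, e.g. `psql "$DR_REPLICA_URL" -tAc "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())::int"` (connection variable is illustrative). A minimal check:

```shell
# Decide whether observed replica lag ($1, in seconds) is within the RPO.
rpo_check() {
  local lag_seconds=$1 rpo_seconds=900   # 15-minute RPO
  if [ "$lag_seconds" -le "$rpo_seconds" ]; then
    echo WITHIN_RPO
  else
    echo RPO_EXCEEDED
  fi
}
```

An RPO_EXCEEDED result means the incident record must document the data-loss window before customer notification goes out.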
3.4 Security Incidents
Scenario G: Credential Compromise
Trigger: Suspected or confirmed credential leak
Detection:
- Unusual API activity patterns
- Alerts from Secret Manager audit logs
- External notification (bug bounty, etc.)
Recovery Procedure:
# IMMEDIATE ACTIONS (within 15 minutes)
# 1. Revoke compromised credentials
# Database credentials (capture the new value for the Secret Manager update in step 2)
NEW_DB_PASSWORD=$(openssl rand -base64 32)
gcloud sql users set-password app-user --instance=fpa-platform-db \
--password="$NEW_DB_PASSWORD"
# Service account keys
gcloud iam service-accounts keys delete KEY_ID \
--iam-account=fpa-app@project.iam.gserviceaccount.com
# API keys (delete compromised keys and issue replacements; loosening
# restrictions on a compromised key would widen exposure)
gcloud services api-keys delete KEY_ID
# 2. Update secrets
printf '%s' "$NEW_DB_PASSWORD" | gcloud secrets versions add db-password --data-file=-
# 3. Force credential refresh
kubectl rollout restart deployment -n fpa-platform
# 4. Enable additional logging
gcloud logging sinks create security-incident-sink \
storage.googleapis.com/security-logs \
--log-filter='protoPayload.authenticationInfo.principalEmail="compromised-account"'
# 5. Initiate forensic analysis
# Preserve logs
gcloud logging read 'timestamp>="2026-02-01T00:00:00Z"' \
--format=json > incident-logs.json
# 6. Notify security team and legal
python scripts/send_security_alert.py --severity=critical
Recovery Time: 15-30 minutes for immediate containment
Scenario H: Ransomware/Data Breach
Trigger: Encryption of data, unauthorized data access
Detection:
- Unable to read/write data
- Ransom demand received
- Unusual data exfiltration patterns
Recovery Procedure:
# CRITICAL: DO NOT PAY RANSOM
# 1. IMMEDIATE ISOLATION
# Disable external access
gcloud compute firewall-rules update allow-external --disabled
# Block compromised service accounts
gcloud iam service-accounts disable fpa-app@project.iam.gserviceaccount.com
# 2. PRESERVE EVIDENCE
# Snapshot all disks (run once per zone; disk snapshots require the disk's zone)
gcloud compute disks snapshot $(gcloud compute disks list \
--filter="zone:us-central1-a" --format='value(name)') --zone=us-central1-a
# Export all logs
gcloud logging read 'timestamp>="INCIDENT_START_TIME"' --format=json > forensic-logs.json
# 3. NOTIFY
# Legal and compliance
# Law enforcement (FBI IC3 if US)
# Customers (per breach notification requirements)
# 4. RECOVERY FROM CLEAN BACKUPS
# Identify last known good backup
gcloud sql backups list --instance=fpa-platform-db
# Create new environment
cd terraform/clean-recovery
terraform apply
# Restore from backup
gcloud sql backups restore LAST_GOOD_BACKUP_ID \
--restore-instance=fpa-platform-db-clean
# 5. VALIDATE CLEAN STATE
# Security scan of restored environment
# Data integrity verification
# Penetration test before going live
Recovery Time: 4-24 hours
4. Communication Plan
Escalation Matrix
| Severity | Response Time | Notified | Approval Needed |
|---|---|---|---|
| P1 - Critical | 15 min | On-call + Engineering Lead + CTO | None for containment |
| P2 - Major | 30 min | On-call + Engineering Lead | None |
| P3 - Minor | 2 hours | On-call | None |
| P4 - Low | 24 hours | Ticket queue | None |
Notification Templates
Internal Slack Alert:
🚨 INCIDENT: [SEVERITY] - [Brief Description]
Status: [Investigating/Identified/Monitoring/Resolved]
Impact: [Affected services and users]
ETA: [Estimated resolution time]
Lead: [@oncall-engineer]
Bridge: [Zoom link]
Customer Status Page:
[TIMESTAMP] - Investigating Issue
We are currently investigating reports of [issue description].
Affected: [Services]
We will provide updates every 30 minutes.
[TIMESTAMP] - Issue Identified
We have identified the root cause as [brief explanation].
Our team is working on a fix.
ETA: [Time estimate]
[TIMESTAMP] - Issue Resolved
The issue has been resolved. All services are operating normally.
Root Cause: [Brief explanation]
Duration: [X hours Y minutes]
Communication Channels
| Channel | Purpose | Owner |
|---|---|---|
| Status Page | Customer-facing updates | On-call |
| Slack #incidents | Internal coordination | Engineering |
| PagerDuty | Alerting and escalation | Automated |
| Email | Customer notification (major incidents) | Customer Success |
| Phone | Critical customer notification | Account Manager |
5. Testing and Maintenance
DR Test Schedule
| Test Type | Frequency | Duration | Participants |
|---|---|---|---|
| Backup Restore | Monthly | 2 hours | DBA |
| Failover Test | Quarterly | 4 hours | Engineering + Ops |
| Full DR Exercise | Annually | 8 hours | All teams |
| Tabletop Exercise | Twice yearly | 2 hours | Leadership |
DR Test Checklist
## Pre-Test
- [ ] Announce maintenance window
- [ ] Verify backup freshness
- [ ] Confirm all participants available
- [ ] Prepare rollback procedures
## Test Execution
- [ ] Simulate failure scenario
- [ ] Execute recovery procedure
- [ ] Verify service restoration
- [ ] Validate data integrity
- [ ] Measure RTO/RPO achieved
## Post-Test
- [ ] Document lessons learned
- [ ] Update procedures if needed
- [ ] File test report
- [ ] Schedule next test
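The "Measure RTO/RPO achieved" item can be instrumented rather than eyeballed. A sketch that times from failure injection until a supplied readiness command returns; in a real test the command would be a polling loop such as `until curl -sf https://api.fpa-platform.com/health; do sleep 5; done`:

```shell
# Seconds elapsed while running the wait/poll command passed as arguments.
measure_rto_seconds() {
  local start end
  start=$(date +%s)
  "$@"
  end=$(date +%s)
  echo $(( end - start ))
}
```

The printed value goes directly into the "RTO Achieved" column of the test-results table.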
Last Test Results
| Test Date | Scenario | RTO Achieved | RPO Achieved | Issues Found |
|---|---|---|---|---|
| 2026-01-15 | DB Failover | 3 min | 0 | None |
| 2025-12-01 | Zone Outage | 8 min | 0 | DNS TTL too high |
| 2025-09-15 | Full DR | 45 min | 12 min | Slow secret rotation |
6. Contacts and Resources
Emergency Contacts
| Role | Name | Phone | Email |
|---|---|---|---|
| On-Call Primary | PagerDuty | - | oncall@fpa-platform.com |
| Engineering Lead | TBD | TBD | eng-lead@fpa-platform.com |
| CTO | TBD | TBD | cto@fpa-platform.com |
| GCP Support | - | 1-866-777-0375 | - |
Runbook Links
| Runbook | Location |
|---|---|
| Database Recovery | /runbooks/database-recovery.md |
| Service Restoration | /runbooks/service-restoration.md |
| Security Incident | /runbooks/security-incident.md |
| Customer Communication | /runbooks/customer-comms.md |
External Resources
- GCP Status: https://status.cloud.google.com/
- GCP Support: https://console.cloud.google.com/support
- PagerDuty: https://fpa-platform.pagerduty.com/
7. Document Control
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2026-02-03 | Platform Team | Initial version |
Next Review: 2026-05-03
Approval:
- Engineering Lead
- Security Lead
- CTO