
FP&A Platform — Disaster Recovery Plan

Version: 1.0
Last Updated: 2026-02-03
Document ID: OPS-001
Classification: Confidential
Review Frequency: Quarterly


1. Executive Summary

This Disaster Recovery Plan (DRP) defines procedures for recovering the FP&A Platform from various failure scenarios. The plan ensures business continuity for financial operations while meeting regulatory requirements for data protection and availability.

Recovery Objectives

| Metric | Target | Rationale |
|---|---|---|
| RTO (Recovery Time Objective) | 4 hours | Month-end close cannot be delayed >4 hrs |
| RPO (Recovery Point Objective) | 15 minutes | Maximum acceptable data loss |
| MTTR (Mean Time to Recovery) | 2 hours | Target for common failures |
| Availability | 99.9% | 8.76 hours downtime/year maximum |
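The availability target maps directly to a downtime budget; the arithmetic behind the 8.76-hour figure can be sanity-checked with a one-liner (the monthly figure assumes a 30-day month):

```shell
# Convert an availability target into a downtime budget:
# (100 - availability)% of the period is the maximum allowed downtime.
out=$(awk -v a=99.9 'BEGIN {
  frac = (100 - a) / 100                        # fraction of time allowed down
  printf "yearly: %.2f hours, monthly: %.1f minutes\n",
         frac * 365 * 24, frac * 30 * 24 * 60
}')
echo "$out"   # → yearly: 8.76 hours, monthly: 43.2 minutes
```

The monthly budget (~43 minutes) is the more useful number during an incident: it is roughly one automatic Cloud SQL failover plus margin.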

Service Tiers

| Tier | Services | RTO | RPO |
|---|---|---|---|
| Tier 1 - Critical | GL, Auth, API Gateway | 1 hour | 5 min |
| Tier 2 - Essential | Reconciliation, Reporting | 2 hours | 15 min |
| Tier 3 - Important | AI Agents, Forecasting | 4 hours | 1 hour |
| Tier 4 - Deferrable | Analytics, Training | 24 hours | 24 hours |

2. Infrastructure Overview

Production Architecture

┌─────────────────────────────────────────────────────────────────────┐
│ GCP PRODUCTION │
├─────────────────────────────────────────────────────────────────────┤
│ Region: us-central1 (Primary) | us-east1 (DR) │
│ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ GKE CLUSTER (Autopilot) │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │API GW │ │GL Svc │ │Recon Svc│ │AI Agents│ │ │
│ │ │(3 pods) │ │(3 pods) │ │(2 pods) │ │(2 pods) │ │ │
│ │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ DATA LAYER │ │
│ │ ┌────────────────────┐ ┌──────────────┐ ┌────────────┐ │ │
│ │ │ Cloud SQL (HA) │ │ Redis │ │ Cloud │ │ │
│ │ │ Primary + Replica │ │ (HA) │ │ Storage │ │ │
│ │ │ us-central1-a/b │ │ 4GB │ │ (Multi-reg)│ │ │
│ │ └────────────────────┘ └──────────────┘ └────────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘

Backup Infrastructure

| Component | Backup Method | Frequency | Retention | Location |
|---|---|---|---|---|
| Cloud SQL | Automated + On-demand | Continuous WAL + Daily snapshot | 30 days | Cross-region (us-east1) |
| Redis | RDB Snapshots | Hourly | 7 days | Same region |
| Cloud Storage | Versioning + Replication | Real-time | 365 days | Dual-region |
| Secrets | Secret Manager versioning | On change | 90 days | Global |
| Configs | Git + Terraform state | On change | Unlimited | GitHub + GCS |
| Audit Logs | immudb replication | Real-time | 7 years | Cross-region |

3. Failure Scenarios and Recovery Procedures

3.1 Database Failures

Scenario A: Primary Database Failure

Trigger: Cloud SQL primary instance becomes unavailable

Detection:

  • Cloud SQL health check fails
  • Application connection errors spike
  • PagerDuty alert: cloudsql-primary-down

Recovery Procedure:

# Automatic failover (typically 60-120 seconds)
# Cloud SQL HA handles this automatically

# If automatic failover fails, manual steps:

# 1. Check instance status
gcloud sql instances describe fpa-platform-db --format='get(state)'

# 2. If SUSPENDED, attempt restart
gcloud sql instances restart fpa-platform-db

# 3. If restart fails, promote replica
gcloud sql instances promote-replica fpa-platform-db-replica

# 4. Update connection strings (via Secret Manager)
gcloud secrets versions add db-connection-string \
--data-file=./new-connection-string.txt

# 5. Rolling restart of application pods
kubectl rollout restart deployment -n fpa-platform

# 6. Verify connectivity
kubectl exec -it $(kubectl get pod -l app=gl-service -o name | head -1) \
-- psql $DATABASE_URL -c "SELECT 1"

Recovery Time: 2-5 minutes (auto), 15-30 minutes (manual)

Post-Recovery:

  1. Verify all services healthy
  2. Check for data loss (compare max timestamps)
  3. Recreate replica if promoted
  4. Update incident log
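Post-recovery step 2 (comparing max timestamps) can be sketched as a small helper. The `journal_entries` table and `created_at` column in the comment are illustrative names, not fixed parts of the schema:

```shell
# Data-loss check after failover (sketch): compare the newest row timestamp on
# the promoted instance against the failover time. ISO-8601 UTC strings compare
# correctly as plain strings, so bash's [[ < ]] is sufficient.
check_data_loss() {
  local last_row="$1" failover_at="$2"
  if [[ "$last_row" < "$failover_at" ]]; then
    echo "POSSIBLE DATA LOSS: newest row $last_row predates failover $failover_at"
  else
    echo "OK: rows present up to $last_row"
  fi
}

# In practice the first argument comes from the promoted primary, e.g.:
#   last_row=$(psql "$DATABASE_URL" -At -c "SELECT max(created_at) FROM journal_entries")
check_data_loss "2026-02-03T09:58:12Z" "2026-02-03T10:05:00Z"
```

A "POSSIBLE DATA LOSS" result feeds directly into the incident log and, if the gap exceeds the 5-minute Tier 1 RPO, triggers the escalation matrix in Section 4.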

Scenario B: Database Corruption

Trigger: Data integrity issues, failed migrations, application bugs

Detection:

  • Data validation errors
  • Reconciliation failures
  • User-reported incorrect data

Recovery Procedure:

# 1. Assess scope of corruption
# Run data integrity checks
kubectl exec -it $(kubectl get pod -l app=gl-service -o name | head -1) -- \
python -m fpa.scripts.data_integrity_check

# 2. Identify corruption time window
# Check audit logs in immudb
immudb-cli scan --prefix "table:journal_entries" --since "2026-02-01T00:00:00Z"

# 3. For limited corruption: Point-in-time recovery to new instance
gcloud sql instances clone fpa-platform-db fpa-platform-db-recovery \
--point-in-time "2026-02-03T10:00:00Z"

# 4. Export clean data from recovery instance
pg_dump -h recovery-instance-ip -d fpa_platform -t affected_tables > clean_data.sql

# 5. Apply clean data to production (with downtime window)
# Announce maintenance window
# Stop affected services
kubectl scale deployment gl-service -n fpa-platform --replicas=0

# Import clean data
psql -h production-ip -d fpa_platform < clean_data.sql

# Restart services
kubectl scale deployment gl-service -n fpa-platform --replicas=3

# 6. Verify data integrity
python -m fpa.scripts.data_integrity_check --full

# 7. Delete recovery instance
gcloud sql instances delete fpa-platform-db-recovery

Recovery Time: 1-4 hours depending on scope


3.2 Application Failures

Scenario C: Service Crash/Memory Leak

Trigger: OOMKilled, unhandled exceptions, deadlocks

Detection:

  • Pod restart count increases
  • Error rate spikes in Prometheus
  • PagerDuty alert: high-pod-restarts

Recovery Procedure:

# 1. Check pod status
kubectl get pods -n fpa-platform -l app=affected-service

# 2. Check recent events
kubectl describe pod affected-pod-name

# 3. Check logs
kubectl logs affected-pod-name --previous

# 4. If widespread, rollback deployment
kubectl rollout undo deployment/affected-service

# 5. If memory issue, increase limits temporarily
kubectl patch deployment affected-service -p \
'{"spec":{"template":{"spec":{"containers":[{"name":"app","resources":{"limits":{"memory":"2Gi"}}}]}}}}'

# 6. Monitor recovery
kubectl rollout status deployment/affected-service

Recovery Time: 5-15 minutes
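When several services show elevated restart counts at once, a small triage helper (not part of the runbook above; the threshold of 5 is arbitrary) narrows the list before step 1:

```shell
# Print pods whose restart count exceeds a threshold. Input is the default
# `kubectl get pods` output: NAME READY STATUS RESTARTS AGE.
flag_high_restarts() {
  local threshold="$1"
  awk -v t="$threshold" 'NR > 1 && $4+0 > t { print $1, $4 }'
}

# Typical use against the cluster:
#   kubectl get pods -n fpa-platform | flag_high_restarts 5

# Demo with captured output (illustrative pod names):
printf 'NAME READY STATUS RESTARTS AGE\ngl-service-7d9f 1/1 Running 14 2d\napi-gateway-5c2b 1/1 Running 0 2d\n' \
  | flag_high_restarts 5   # → gl-service-7d9f 14
```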


Scenario D: Complete Application Failure

Trigger: All services down, GKE cluster failure

Detection:

  • All health checks failing
  • No pods running
  • PagerDuty alert: platform-complete-outage

Recovery Procedure:

# 1. Check GKE cluster status (Autopilot clusters are regional, so use --region)
gcloud container clusters describe fpa-cluster --region us-central1

# 2. If cluster unresponsive, check control plane
gcloud container operations list --filter="TARGET_LINK:fpa-cluster"

# 3. For control plane issues (rare with Autopilot)
# Contact GCP support immediately
# GCP Support: 1-866-777-0375

# 4. For node pool issues (Standard clusters only; Autopilot manages nodes itself)
gcloud container node-pools update default-pool \
--cluster=fpa-cluster \
--region=us-central1 \
--enable-autorepair

# 5. If cluster unrecoverable, recreate from Terraform
cd terraform/gke
terraform apply -var="cluster_suffix=recovery"

# 6. Restore workloads
kubectl apply -k kubernetes/overlays/production

# 7. Point DNS to new cluster
gcloud dns record-sets update api.fpa-platform.com \
--rrdatas="NEW_INGRESS_IP" \
--type=A \
--ttl=60 \
--zone=fpa-platform-zone

Recovery Time: 30-90 minutes
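Because the 2025-12-01 test flagged DNS TTL problems, step 7's cutover is worth verifying explicitly. A small check, with the live `dig` lookup shown as a comment since it needs real DNS (the IPs below are documentation-range placeholders):

```shell
# Verify DNS has cut over to the new ingress IP. The comparison is factored
# out so the same helper serves the primary and DR cutovers.
verify_dns_cutover() {
  local expected="$1" actual="$2"
  if [ "$actual" = "$expected" ]; then
    echo "DNS OK: $actual"
  else
    echo "DNS STALE: resolver returned $actual, expected $expected"
  fi
}

# Live check (recursive resolvers may lag by up to the record's previous TTL):
#   verify_dns_cutover "$NEW_INGRESS_IP" "$(dig +short api.fpa-platform.com A | head -1)"
verify_dns_cutover "203.0.113.10" "203.0.113.10"   # → DNS OK: 203.0.113.10
```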


3.3 Infrastructure Failures

Scenario E: Zone Outage

Trigger: GCP zone (us-central1-a) becomes unavailable

Detection:

  • GCP status page shows zone issue
  • Pods in failed zone not responding
  • Automatic zone failover may resolve the outage without intervention

Recovery Procedure:

# Autopilot should automatically reschedule pods to healthy zones

# 1. Verify pods rescheduling
kubectl get pods -o wide -n fpa-platform

# 2. If Cloud SQL affected, check HA status
gcloud sql instances describe fpa-platform-db

# 3. Monitor recovery
watch kubectl get pods -n fpa-platform

# 4. Once zone recovers, rebalance
kubectl rollout restart deployment -n fpa-platform

Recovery Time: 5-15 minutes (automatic)


Scenario F: Region Outage

Trigger: Entire us-central1 region unavailable (rare)

Detection:

  • All services unreachable
  • GCP status shows region outage
  • PagerDuty escalation triggered

Recovery Procedure:

# CRITICAL: Full DR region activation

# 1. Confirm region outage (not network issue)
# Check: https://status.cloud.google.com/

# 2. Activate DR region (us-east1)
cd terraform/dr
terraform workspace select dr
terraform apply -var="activate_dr=true"

# 3. Promote DR database (if not using global)
gcloud sql instances promote-replica fpa-platform-db-dr-replica

# 4. Deploy workloads to DR cluster
kubectl config use-context gke_project_us-east1_fpa-cluster-dr
kubectl apply -k kubernetes/overlays/dr

# 5. Update DNS to DR region
gcloud dns record-sets update api.fpa-platform.com \
--rrdatas="DR_REGION_IP" \
--type=A \
--ttl=60 \
--zone=fpa-platform-zone

# 6. Notify customers
python scripts/send_incident_notification.py --severity=critical

# 7. Monitor DR environment
kubectl get pods -n fpa-platform
curl -f https://api.fpa-platform.com/health

Recovery Time: 30-60 minutes

Data Loss: Up to 15 minutes (cross-region replication lag)
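The 15-minute figure is bounded by cross-region replication lag, which should be measured at promotion time rather than assumed. A sketch, with the live query shown as a comment (`pg_last_xact_replay_timestamp()` is standard PostgreSQL; the epoch values below are illustrative):

```shell
# Estimate worst-case data loss at promotion: the gap between the primary's
# last commit and the replica's last replayed transaction, in seconds.
replication_lag_seconds() {
  local primary_epoch="$1" replica_epoch="$2"
  echo $(( primary_epoch - replica_epoch ))
}

# On the replica, the replay position is available directly:
#   psql -At -c "SELECT extract(epoch FROM now() - pg_last_xact_replay_timestamp())"
replication_lag_seconds 1770000900 1770000000   # → 900 (i.e. 15 minutes)
```

If the measured lag exceeds the 15-minute RPO, record the overshoot in the incident log before announcing the DR environment live.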


3.4 Security Incidents

Scenario G: Credential Compromise

Trigger: Suspected or confirmed credential leak

Detection:

  • Unusual API activity patterns
  • Alerts from Secret Manager audit logs
  • External notification (bug bounty, etc.)

Recovery Procedure:

# IMMEDIATE ACTIONS (within 15 minutes)

# 1. Revoke compromised credentials
# Database credentials
gcloud sql users set-password app-user --instance=fpa-platform-db \
--password=$(openssl rand -base64 32)

# Service account keys
gcloud iam service-accounts keys delete KEY_ID \
--iam-account=fpa-app@project.iam.gserviceaccount.com

# API keys (delete to revoke; clearing restrictions would widen access, not revoke it)
gcloud services api-keys delete KEY_ID

# 2. Update secrets
gcloud secrets versions add db-password --data-file=./new-password.txt

# 3. Force credential refresh
kubectl rollout restart deployment -n fpa-platform

# 4. Enable additional logging
gcloud logging sinks create security-incident-sink \
storage.googleapis.com/security-logs \
--log-filter='protoPayload.authenticationInfo.principalEmail="compromised-account"'

# 5. Initiate forensic analysis
# Preserve logs
gcloud logging read 'timestamp>="2026-02-01T00:00:00Z"' \
--format=json > incident-logs.json

# 6. Notify security team and legal
python scripts/send_security_alert.py --severity=critical

Recovery Time: 15-30 minutes for immediate containment


Scenario H: Ransomware/Data Breach

Trigger: Encryption of data, unauthorized data access

Detection:

  • Unable to read/write data
  • Ransom demand received
  • Unusual data exfiltration patterns

Recovery Procedure:

# CRITICAL: DO NOT PAY RANSOM

# 1. IMMEDIATE ISOLATION
# Disable external access
gcloud compute firewall-rules update allow-external --disabled

# Block compromised service accounts
gcloud iam service-accounts disable fpa-app@project.iam.gserviceaccount.com

# 2. PRESERVE EVIDENCE
# Snapshot all disks (each snapshot must name the disk's zone)
gcloud compute disks list --format='value(name,zone.basename())' | \
while read -r disk zone; do
  gcloud compute disks snapshot "$disk" --zone="$zone"
done

# Export all logs
gcloud logging read 'timestamp>="INCIDENT_START_TIME"' --format=json > forensic-logs.json

# 3. NOTIFY
# Legal and compliance
# Law enforcement (FBI IC3 if US)
# Customers (per breach notification requirements)

# 4. RECOVERY FROM CLEAN BACKUPS
# Identify last known good backup
gcloud sql backups list --instance=fpa-platform-db

# Create new environment
cd terraform/clean-recovery
terraform apply

# Restore from backup
gcloud sql backups restore LAST_GOOD_BACKUP_ID \
--restore-instance=fpa-platform-db-clean \
--backup-instance=fpa-platform-db

# 5. VALIDATE CLEAN STATE
# Security scan of restored environment
# Data integrity verification
# Penetration test before going live

Recovery Time: 4-24 hours
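Step 4's "identify last known good backup" can be scripted against the backup list. The `gcloud` invocation in the comment is the likely input source, with an assumed `--format` projection; the piped demo uses illustrative IDs:

```shell
# Select the newest backup taken strictly before the incident began.
# Input: one backup per line as "<id> <RFC3339 start time>"; ISO-8601 UTC
# timestamps sort correctly as plain strings.
last_good_backup() {
  local incident_start="$1"
  awk -v cutoff="$incident_start" '$2 < cutoff' | sort -k2 | tail -1
}

# Input would typically come from something like:
#   gcloud sql backups list --instance=fpa-platform-db --format='value(id,windowStartTime)'
printf '101 2026-02-01T02:00:00Z\n102 2026-02-02T02:00:00Z\n103 2026-02-03T02:00:00Z\n' \
  | last_good_backup "2026-02-02T12:00:00Z"   # → 102 2026-02-02T02:00:00Z
```

Note that "before the incident began" is necessary but not sufficient: for ransomware, the chosen backup must also pass the validation scans in step 5, since attackers often dwell before encrypting.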


4. Communication Plan

Escalation Matrix

| Severity | Response Time | Notified | Approval Needed |
|---|---|---|---|
| P1 - Critical | 15 min | On-call + Engineering Lead + CTO | None for containment |
| P2 - Major | 30 min | On-call + Engineering Lead | None |
| P3 - Minor | 2 hours | On-call | None |
| P4 - Low | 24 hours | Ticket queue | None |

Notification Templates

Internal Slack Alert:

🚨 INCIDENT: [SEVERITY] - [Brief Description]
Status: [Investigating/Identified/Monitoring/Resolved]
Impact: [Affected services and users]
ETA: [Estimated resolution time]
Lead: [@oncall-engineer]
Bridge: [Zoom link]

Customer Status Page:

[TIMESTAMP] - Investigating Issue
We are currently investigating reports of [issue description].
Affected: [Services]
We will provide updates every 30 minutes.

[TIMESTAMP] - Issue Identified
We have identified the root cause as [brief explanation].
Our team is working on a fix.
ETA: [Time estimate]

[TIMESTAMP] - Issue Resolved
The issue has been resolved. All services are operating normally.
Root Cause: [Brief explanation]
Duration: [X hours Y minutes]

Communication Channels

| Channel | Purpose | Owner |
|---|---|---|
| Status Page | Customer-facing updates | On-call |
| Slack #incidents | Internal coordination | Engineering |
| PagerDuty | Alerting and escalation | Automated |
| Email | Customer notification (major) | Customer Success |
| Phone | Critical customer notification | Account Manager |

5. Testing and Maintenance

DR Test Schedule

| Test Type | Frequency | Duration | Participants |
|---|---|---|---|
| Backup Restore | Monthly | 2 hours | DBA |
| Failover Test | Quarterly | 4 hours | Engineering + Ops |
| Full DR Exercise | Annually | 8 hours | All teams |
| Tabletop Exercise | Bi-annually | 2 hours | Leadership |

DR Test Checklist

## Pre-Test
- [ ] Announce maintenance window
- [ ] Verify backup freshness
- [ ] Confirm all participants available
- [ ] Prepare rollback procedures

## Test Execution
- [ ] Simulate failure scenario
- [ ] Execute recovery procedure
- [ ] Verify service restoration
- [ ] Validate data integrity
- [ ] Measure RTO/RPO achieved

## Post-Test
- [ ] Document lessons learned
- [ ] Update procedures if needed
- [ ] File test report
- [ ] Schedule next test
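The "Measure RTO/RPO achieved" step can be automated with a stopwatch around the health check; the endpoint and polling interval below are placeholders:

```shell
# Measure achieved RTO during a test: poll until the supplied health check
# passes, then report elapsed wall-clock seconds.
measure_rto() {
  local check_cmd="$1" start
  start=$(date +%s)
  until eval "$check_cmd"; do sleep 5; done
  echo "RTO achieved: $(( $(date +%s) - start ))s"
}

# During a failover exercise:
#   measure_rto 'curl -fsS https://api.fpa-platform.com/health > /dev/null'
```

Start the clock at the moment of simulated failure, not at the start of the recovery procedure, so the measured value is comparable to the RTO targets in Section 1.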

Last Test Results

| Test Date | Scenario | RTO Achieved | RPO Achieved | Issues Found |
|---|---|---|---|---|
| 2026-01-15 | DB Failover | 3 min | 0 | None |
| 2025-12-01 | Zone Outage | 8 min | 0 | DNS TTL too high |
| 2025-09-15 | Full DR | 45 min | 12 min | Slow secret rotation |

6. Contacts and Resources

Emergency Contacts

| Role | Name | Phone | Email |
|---|---|---|---|
| On-Call Primary | PagerDuty | - | oncall@fpa-platform.com |
| Engineering Lead | TBD | TBD | eng-lead@fpa-platform.com |
| CTO | TBD | TBD | cto@fpa-platform.com |
| GCP Support | - | 1-866-777-0375 | - |

Runbooks

| Runbook | Location |
|---|---|
| Database Recovery | /runbooks/database-recovery.md |
| Service Restoration | /runbooks/service-restoration.md |
| Security Incident | /runbooks/security-incident.md |
| Customer Communication | /runbooks/customer-comms.md |

External Resources


7. Document Control

| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2026-02-03 | Platform Team | Initial version |

Next Review: 2026-05-03

Approval:

  • Engineering Lead
  • Security Lead
  • CTO
