WF-072: Database Backup Flow

Priority: P0 (Critical) | Phase: Phase 1D - Security & Operations | Effort: 12 hours

Overview

This workflow takes an automated daily PostgreSQL backup at 2am UTC, tests a restore once a week (Mondays), enforces 30-day retention with automatic cleanup, and reports the outcome to Slack.

Trigger: Scheduled (daily 2am UTC) | Duration: ~10-15 minutes

Flow

  1. Create a GCP Cloud SQL backup (triggered automatically by this workflow)
  2. Poll backup status until it completes (10-minute timeout)
  3. If successful:
    • Weekly (Monday): Test restore to test instance
    • List all backups
    • Identify backups > 30 days old
    • Delete old backups
    • Log backup success
    • Report to Slack
  4. If failed or timed out: Alert ops team via Slack
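
Steps 2 to 4 can be sketched as a polling loop with a deadline. This is a minimal sketch, not the workflow's actual implementation: the names wait_for_backup and get_status, the 15-second interval, and the status strings "SUCCESSFUL" and "FAILED" are assumptions made for illustration. A timeout is treated the same as a failure so that the ops alert fires in both cases.

```python
import time
from typing import Callable


def wait_for_backup(
    get_status: Callable[[], str],
    timeout_s: float = 600.0,   # 10-minute limit from step 2
    interval_s: float = 15.0,   # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True if the backup succeeds, False on failure or timeout."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()  # hypothetical callable wrapping the status lookup
        if status == "SUCCESSFUL":
            return True
        if status == "FAILED":
            return False
        if clock() >= deadline:
            return False  # timed out: caller should alert ops
        sleep(interval_s)
```

The sleep and clock parameters are injectable so the loop can be tested without waiting in real time.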

Backup Strategy

  • Frequency: Daily at 2am UTC
  • Retention: 30 days
  • Restore Testing: Weekly (every Monday)
  • Verification: Automated restore test to separate instance
  • Storage: GCP Cloud SQL managed backups (replicated cross-region)
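
The retention and restore-test rules above reduce to two date checks. The following is a rough sketch under stated assumptions: the backup records are plain dicts with hypothetical "id" and "created_at" keys, and "older than 30 days" is read as strictly older, so a backup that is exactly 30 days old is kept.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=30)


def is_restore_test_day(now: datetime) -> bool:
    """True on Mondays, evaluated in UTC to match the 2am UTC schedule."""
    return now.astimezone(timezone.utc).weekday() == 0


def expired_backups(backups: list[dict], now: datetime) -> list[str]:
    """Return the ids of backups created more than 30 days before `now`."""
    cutoff = now - RETENTION
    return [b["id"] for b in backups if b["created_at"] < cutoff]
```

Both functions take the current time as an argument rather than reading the clock, which keeps the Monday and 30-day boundaries easy to test.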

Business Impact

  • RTO (Recovery Time Objective): < 2 hours
  • RPO (Recovery Point Objective): < 24 hours
  • Restore Success Rate: 100% target (verified by the weekly restore test)
  • Storage Cost: ~$50/month (30 days × 50GB avg)
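
The storage estimate can be unpacked with simple arithmetic. This sketch assumes every retained backup occupies the full 50 GB average, which may overstate usage if the managed backups are stored incrementally. The per-GB rate it prints is only the rate implied by the figures above, not a quoted price.

```python
RETAINED_BACKUPS = 30      # one backup per day, 30-day retention
AVG_BACKUP_GB = 50         # average backup size from the estimate
MONTHLY_COST_USD = 50      # stated monthly storage cost

# Total storage held at steady state, assuming full-size backups.
retained_gb = RETAINED_BACKUPS * AVG_BACKUP_GB

# Effective rate implied by the stated cost (an inference, not a price list).
implied_rate = MONTHLY_COST_USD / retained_gb

print(retained_gb)             # 1500
print(round(implied_rate, 3))  # 0.033
```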

Testing

  • Daily backup executes at 2am UTC
  • Backup completes successfully
  • Weekly restore test works (Monday)
  • Old backups deleted (> 30 days)
  • Success reported to Slack
  • Failure alerts sent to ops team
  • Backup log updated
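
The two Slack checks above are easier to verify if the message text is built by a small pure function. This is an illustrative sketch only: build_slack_payload, its parameters, and the message wording are hypothetical. The one format it relies on is a JSON body with a "text" field, as accepted by Slack incoming webhooks.

```python
import json


def build_slack_payload(success: bool, backup_id: str,
                        deleted_count: int = 0, detail: str = "") -> str:
    """Build the JSON body for a success report or a failure alert."""
    if success:
        text = (f":white_check_mark: WF-072 backup {backup_id} succeeded; "
                f"{deleted_count} expired backup(s) deleted.")
    else:
        # detail is optional free text, e.g. an error message or "timed out"
        text = f":rotating_light: WF-072 backup {backup_id} FAILED. {detail}".rstrip()
    return json.dumps({"text": text})
```

Sending the payload is left out on purpose; a test can assert on the returned string without touching the network.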

Status: ✅ Ready for Implementation