08 — Infrastructure: Docker, Kubernetes, Helm, Terraform & DR
Domain: Container orchestration, IaC, CI/CD, monitoring, disaster recovery
Dependencies: 01–04 (all services defined); needs full service inventory
Outputs: Docker Compose, K8s manifests, Helm charts, Terraform modules, DR playbook
ROLE
You are a Senior Platform Engineer specializing in Kubernetes-based deployments for regulated financial platforms. You design infrastructure that passes SOC 2 audits, supports multi-region deployment, achieves 99.9% uptime, and enables a senior engineer to deploy the full stack in under 4 hours.
OBJECTIVE
Design the complete infrastructure layer for the Avivatec FP&A platform: local development environment (Docker Compose), production deployment (Kubernetes + Helm), cloud provisioning (Terraform), CI/CD pipelines, monitoring stack, and disaster recovery.
DELIVERABLES
D1. Docker Compose (Local Development)
16-Service Stack:
services:
# Data Layer
postgres: # PostgreSQL 16 with pgaudit, pg_partman
immudb: # Immutable audit trail
redis: # Cache + session store + agent memory
duckdb-api: # DuckDB analytics (sidecar REST API)
# Application Layer
api-python: # FastAPI (AI/ML, NLQ, agents)
api-go: # Go (high-throughput: reconciliation, real-time)
frontend: # React (Refine) development server
# AI Layer
vllm: # DeepSeek-R1 model serving (GPU optional)
# OR ollama: # CPU fallback for local dev without GPU
# Integration Layer
airbyte: # ELT connector management
dagster: # Pipeline orchestration (webserver + daemon)
dbt: # Transformation runner (on-demand)
# Security Layer
openfga: # Authorization engine
zitadel: # Identity provider
# Monitoring
prometheus: # Metrics collection
grafana: # Dashboards
vector: # Log routing
Requirements:
- Single `docker compose up` brings up the entire stack
- Health checks on every service
- Volume mounts for data persistence across restarts
- GPU passthrough config for vLLM (optional — falls back to Ollama)
- Seed data scripts for demo tenant
- Environment variable configuration via `.env` file
- Network isolation: services communicate via internal Docker network
- Port mapping: only frontend (3000), API (8000/8001), Grafana (3001) exposed
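A minimal compose sketch showing how the requirements above combine for two services (the healthcheck command, volume name, and build path are illustrative assumptions, not the full 16-service file):

```yaml
services:
  postgres:
    image: postgres:16
    env_file: .env                      # credentials from .env, never hardcoded
    volumes:
      - pgdata:/var/lib/postgresql/data # persists across restarts
    networks: [internal]                # no host port mapping: internal-only
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U $${POSTGRES_USER}"]
      interval: 10s
      timeout: 5s
      retries: 5

  frontend:
    build: ./frontend                   # placeholder build context
    ports:
      - "3000:3000"                     # one of the few exposed ports
    networks: [internal]
    depends_on:
      postgres:
        condition: service_healthy      # start ordering via health checks

networks:
  internal:
    driver: bridge

volumes:
  pgdata:
```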
D2. Kubernetes Production Manifests
Namespace Strategy:
avivatec-data # PostgreSQL, immudb, Redis, DuckDB
avivatec-app # FastAPI, Go API, Frontend
avivatec-ai # vLLM, NeuralProphet training jobs
avivatec-integration # Airbyte, Dagster, dbt
avivatec-security # OpenFGA, Zitadel
avivatec-monitoring # Prometheus, Grafana, Vector
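One way to render the namespace strategy as manifests; the tier label is an assumption, added so NetworkPolicies can later select traffic by tier:

```yaml
# Sketch: one Namespace manifest per isolation domain
apiVersion: v1
kind: Namespace
metadata:
  name: avivatec-data
  labels:
    avivatec.io/tier: data    # hypothetical label for policy selectors
---
apiVersion: v1
kind: Namespace
metadata:
  name: avivatec-app
  labels:
    avivatec.io/tier: app
# ...repeat for avivatec-ai, avivatec-integration,
#    avivatec-security, avivatec-monitoring
```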
Workload Types:
| Service | K8s Type | Replicas | Scaling |
|---|---|---|---|
| PostgreSQL | StatefulSet | 1 primary + 2 read replicas | Manual |
| immudb | StatefulSet | 1 (single writer) | Manual |
| Redis | StatefulSet | 3 (Sentinel) | Manual |
| FastAPI | Deployment | 3–10 | HPA (CPU 70%, Memory 80%) |
| Go API | Deployment | 3–10 | HPA (CPU 70%, RPS-based) |
| Frontend | Deployment | 2–5 | HPA (CPU 60%) |
| vLLM | Deployment | 1–3 | GPU-aware HPA |
| Airbyte | StatefulSet | 1 | Manual |
| Dagster | Deployment | 1 webserver + N workers | Manual |
| OpenFGA | Deployment | 2–3 | HPA |
| Zitadel | Deployment | 2–3 | HPA |
| Prometheus | StatefulSet | 1 | Manual |
| Grafana | Deployment | 1 | Manual |
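The FastAPI row above (3–10 replicas, CPU 70% / memory 80%) maps directly onto an autoscaling/v2 HPA; the Deployment name is an assumption:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-python
  namespace: avivatec-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-python          # assumed Deployment name
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```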
CronJobs:
| Job | Schedule | Description |
|---|---|---|
| forecast-nightly | 0 2 * * * | NeuralProphet retraining + forecast generation |
| anomaly-scan | 0 3 * * * | Full GL anomaly detection scan |
| dbt-run | 0 1 * * * | Nightly full dbt transformation |
| backup-postgres | 0 0 * * * | Database backup to object storage |
| backup-immudb | 0 0 * * * | Audit trail backup |
| soc2-evidence | 0 6 1 * * | Monthly SOC 2 evidence collection |
| cert-rotate | 0 0 1 * * | TLS certificate rotation check |
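As a sketch, the backup-postgres row becomes a CronJob like the following; the image, secret name, and staging PVC are placeholders:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: backup-postgres
  namespace: avivatec-data
spec:
  schedule: "0 0 * * *"            # nightly, matching the table above
  concurrencyPolicy: Forbid        # never run two backups concurrently
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      backoffLimit: 2
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: pg-dump
              image: postgres:16   # assumes pg_dump client in this image
              command: ["/bin/sh", "-c"]
              args:
                - pg_dump "$DATABASE_URL" | gzip > /backup/$(date +%F).sql.gz
              envFrom:
                - secretRef:
                    name: postgres-backup-credentials  # hypothetical secret
              volumeMounts:
                - name: backup
                  mountPath: /backup
          volumes:
            - name: backup
              persistentVolumeClaim:
                claimName: backup-staging  # staged before upload to object storage
```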
D3. Helm Charts
Chart Structure:
charts/
├── avivatec/ # Umbrella chart
│ ├── Chart.yaml
│ ├── values.yaml # Default values
│ ├── values-dev.yaml # Development overrides
│ ├── values-staging.yaml # Staging overrides
│ ├── values-prod.yaml # Production overrides
│ └── charts/
│ ├── data/ # PostgreSQL, immudb, Redis
│ ├── app/ # FastAPI, Go, Frontend
│ ├── ai/ # vLLM, training jobs
│ ├── integration/ # Airbyte, Dagster, dbt
│ ├── security/ # OpenFGA, Zitadel
│ └── monitoring/ # Prometheus, Grafana, Vector
Configurable Values:
- Tenant isolation mode (shared DB vs. dedicated DB)
- GPU allocation (vLLM replicas, GPU type)
- Storage class (SSD vs. HDD, retention)
- Ingress configuration (domain, TLS, WAF)
- Resource limits per service
- Feature flags (enable/disable AI agents, specific integrations)
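A values-prod.yaml sketch covering the configurable values listed above; every key here is an assumption about the umbrella chart's schema, not a defined interface:

```yaml
# values-prod.yaml — hypothetical key layout
tenancy:
  isolationMode: dedicated-db     # shared-db | dedicated-db
ai:
  vllm:
    replicas: 2
    gpu:
      type: nvidia-l4             # placeholder GPU type
      countPerReplica: 1
storage:
  className: premium-ssd
  retentionDays: 365
ingress:
  domain: app.example.com         # placeholder domain
  tls:
    enabled: true
  waf:
    enabled: true
features:
  agents: true                    # enable/disable AI agents
resources:
  api-python:
    requests: { cpu: "500m", memory: 1Gi }
    limits: { cpu: "2", memory: 4Gi }
```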
D4. Terraform Modules (GCP Primary, AWS Secondary)
GCP Modules:
terraform/
├── modules/
│ ├── gke/ # GKE cluster with GPU node pools
│ ├── cloudsql/ # Cloud SQL for PostgreSQL (HA)
│ ├── gcs/ # Cloud Storage (backups, artifacts)
│ ├── secret-manager/ # Secrets management
│ ├── cloud-build/ # CI/CD pipeline
│ ├── artifact-reg/ # Container registry
│ ├── vpc/ # Network with private subnets
│ ├── cloud-armor/ # WAF and DDoS protection
│ ├── memorystore/ # Managed Redis
│ └── monitoring/ # Cloud Monitoring + alerting
├── environments/
│ ├── dev/
│ ├── staging/
│ └── prod/
└── backend.tf # State storage (GCS)
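An environments/prod/main.tf sketch showing how the modules might compose; module arguments and machine types are assumptions about each module's interface:

```hcl
# environments/prod/main.tf — illustrative module wiring
module "vpc" {
  source     = "../../modules/vpc"
  project_id = var.project_id
  region     = var.region
}

module "gke" {
  source     = "../../modules/gke"
  project_id = var.project_id
  network    = module.vpc.network_name
  subnetwork = module.vpc.private_subnet_name

  # Separate pools so GPU workloads schedule only where intended
  node_pools = {
    general = { machine_type = "n2-standard-8", min = 3, max = 12 }
    gpu     = { machine_type = "g2-standard-8", min = 0, max = 3, accelerator = "nvidia-l4" }
  }
}

module "cloudsql" {
  source            = "../../modules/cloudsql"
  project_id        = var.project_id
  availability_type = "REGIONAL"   # HA primary with automatic failover
  network           = module.vpc.network_id
}
```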
D5. CI/CD Pipeline
Pipeline Stages:
[Push to main] → Lint → Unit Tests → Build Containers
→ Security Scan (Trivy) → Integration Tests
→ Deploy to Staging → E2E Tests → Manual Approval
→ Deploy to Production → Smoke Tests → Monitor
Requirements:
- GitOps: Argo CD or Flux for K8s deployment
- Container scanning: Trivy for CVE detection
- SAST: Semgrep or CodeQL for code security
- DAST: OWASP ZAP for runtime vulnerability testing
- dbt CI: run dbt test on every PR that touches models
- Immutable artifacts: every build produces versioned container image
- Rollback: one-command rollback to any previous version
- Audit: every deployment logged with approver, timestamp, change scope
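Taking the Argo CD option, one Application per environment can point at the umbrella chart with its value overrides; the repo URL is a placeholder:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: avivatec-prod
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/avivatec/platform.git  # placeholder repo
    targetRevision: main
    path: charts/avivatec
    helm:
      valueFiles:
        - values.yaml
        - values-prod.yaml      # environment override from D3
  destination:
    server: https://kubernetes.default.svc
    namespace: avivatec-app
  syncPolicy:
    automated:
      prune: true               # remove resources deleted from Git
      selfHeal: true            # revert manual drift
    syncOptions:
      - CreateNamespace=true
```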
D6. Monitoring & Observability
Metrics (Prometheus):
- API latency (P50, P95, P99) per endpoint
- Error rate (4xx, 5xx) per service
- Database connection pool utilization
- AI inference latency and throughput
- Agent task completion rate and duration
- Queue depths (Dagster, agent tasks)
- Tenant-level resource consumption
Dashboards (Grafana):
- Platform Overview (uptime, latency, errors, active users)
- AI Performance (inference latency, model accuracy, token usage)
- Data Pipeline Health (Airbyte sync status, dbt run times, freshness)
- Security (auth failures, RBAC denials, anomaly alerts)
- Tenant Usage (per-tenant resource consumption for billing)
Alerting:
| Alert | Condition | Severity | Channel |
|---|---|---|---|
| API P95 > 500ms | 5-min window | Warning | Slack |
| Error rate > 1% | 5-min window | Critical | PagerDuty |
| DB connections > 80% | Instant | Warning | Slack |
| AI inference timeout | 3 consecutive | Critical | PagerDuty |
| Disk usage > 85% | Instant | Warning | Slack |
| Backup failure | Any | Critical | PagerDuty + Email |
| SOC 2 evidence gap | Daily check | High | Email + Slack |
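The first two alert rows translate into Prometheus rules roughly as follows, assuming standard HTTP histogram/counter metric names (these names depend on how the APIs are instrumented):

```yaml
groups:
  - name: api-slos
    rules:
      - alert: ApiLatencyP95High
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
        for: 5m
        labels:
          severity: warning      # routed to Slack by Alertmanager
        annotations:
          summary: "API P95 latency above 500ms for 5 minutes"
      - alert: ApiErrorRateHigh
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical     # routed to PagerDuty
        annotations:
          summary: "API 5xx error rate above 1%"
```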
D7. Disaster Recovery
Targets:
- RTO (Recovery Time Objective): < 4 hours
- RPO (Recovery Point Objective): < 15 minutes
Strategy:
- PostgreSQL: Streaming replication to standby region + WAL archival to object storage
- immudb: Cross-region backup (daily full + hourly incremental)
- Application: Multi-region K8s with failover DNS (Cloud DNS / Route 53)
- Data: Airbyte sync replays from last checkpoint
- Secrets: Vault replication to DR region
DR Playbook (runbook):
- Detect failure (automated health check or manual escalation)
- Activate DR region DNS failover
- Promote PostgreSQL standby to primary
- Verify immudb integrity (Merkle tree verification)
- Scale up application pods in DR region
- Run smoke tests against DR deployment
- Notify stakeholders
- Post-incident: root cause analysis + playbook update
CONSTRAINTS
- All infrastructure as code — no manual configuration in production
- Every service must have health checks (liveness + readiness probes)
- GPU workloads must be schedulable to specific node pools
- Storage must support encryption at rest
- Network policies: deny-all default, explicit allow rules
- All secrets via HashiCorp Vault or cloud-native secret manager — never in code/config
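The deny-all-by-default network constraint can be enforced with a baseline NetworkPolicy per namespace plus explicit allows; the allow rule below is an illustrative example, and the pod label is hypothetical:

```yaml
# Baseline: deny all ingress and egress in the namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: avivatec-app
spec:
  podSelector: {}               # applies to every pod in the namespace
  policyTypes: [Ingress, Egress]
---
# Explicit allow: app tier may reach PostgreSQL in the data namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-app-to-postgres
  namespace: avivatec-data
spec:
  podSelector:
    matchLabels:
      app: postgres             # hypothetical pod label
  policyTypes: [Ingress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: avivatec-app
      ports:
        - protocol: TCP
          port: 5432
```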
RESEARCH QUESTIONS
- What is the optimal GKE node pool configuration for mixed CPU/GPU workloads (API + vLLM)?
- How should PostgreSQL HA be configured for a multi-tenant FP&A platform (Patroni vs. Cloud SQL)?
- What is the best Argo CD strategy for managing multiple environments with Helm value overrides?
- How to implement zero-downtime deployments for stateful services (PostgreSQL, immudb)?
- What is the recommended approach for GPU auto-scaling (vLLM) based on inference queue depth?