08 — Infrastructure: Docker, Kubernetes, Helm, Terraform & DR
Domain: Container orchestration, IaC, CI/CD, monitoring, disaster recovery
Dependencies: 01–04 (all services defined); needs full service inventory
Outputs: Docker Compose, K8s manifests, Helm charts, Terraform modules, DR playbook
ROLE
You are a Senior Platform Engineer specializing in Kubernetes-based deployments for regulated financial platforms. You design infrastructure that passes SOC 2 audits, supports multi-region deployment, achieves 99.9% uptime, and enables a senior engineer to deploy the full stack in under 4 hours.
OBJECTIVE
Design the complete infrastructure layer for the Avivatec FP&A platform: local development environment (Docker Compose), production deployment (Kubernetes + Helm), cloud provisioning (Terraform), CI/CD pipelines, monitoring stack, and disaster recovery.
DELIVERABLES
D1. Docker Compose (Local Development)
16-Service Stack:
services:
# Data Layer
postgres: # PostgreSQL 16 with pgaudit, pg_partman
immudb: # Immutable audit trail
redis: # Cache + session store + agent memory
duckdb-api: # DuckDB analytics (sidecar REST API)
# Application Layer
api-python: # FastAPI (AI/ML, NLQ, agents)
api-go: # Go (high-throughput: reconciliation, real-time)
frontend: # React (Refine) development server
# AI Layer
vllm: # DeepSeek-R1 model serving (GPU optional)
# OR ollama: # CPU fallback for local dev without GPU
# Integration Layer
airbyte: # ELT connector management
dagster: # Pipeline orchestration (webserver + daemon)
dbt: # Transformation runner (on-demand)
# Security Layer
openfga: # Authorization engine
zitadel: # Identity provider
# Monitoring
prometheus: # Metrics collection
grafana: # Dashboards
vector: # Log routing
Requirements:
- Single `docker compose up` brings up the entire stack
- Health checks on every service
- Volume mounts for data persistence across restarts
- GPU passthrough config for vLLM (optional — falls back to Ollama)
- Seed data scripts for demo tenant
- Environment variable configuration via `.env` file
- Network isolation: services communicate via internal Docker network
- Port mapping: only frontend (3000), API (8000/8001), Grafana (3001) exposed
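A minimal compose sketch showing how the requirements above combine for two services (the healthcheck command, volume name, and build path are illustrative assumptions, not the full 16-service file):

```yaml
services:
  postgres:
    image: postgres:16
    env_file: .env                      # credentials from .env, never hardcoded
    volumes:
      - pgdata:/var/lib/postgresql/data # persists across restarts
    networks: [internal]                # no host port mapping: internal-only
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U $${POSTGRES_USER}"]
      interval: 10s
      timeout: 5s
      retries: 5

  frontend:
    build: ./frontend                   # placeholder build context
    ports:
      - "3000:3000"                     # one of the few exposed ports
    networks: [internal]
    depends_on:
      postgres:
        condition: service_healthy      # start ordering via health checks

networks:
  internal:
    driver: bridge

volumes:
  pgdata:
```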
D2. Kubernetes Production Manifests
Namespace Strategy:
avivatec-data # PostgreSQL, immudb, Redis, DuckDB
avivatec-app # FastAPI, Go API, Frontend
avivatec-ai # vLLM, NeuralProphet training jobs
avivatec-integration # Airbyte, Dagster, dbt
avivatec-security # OpenFGA, Zitadel
avivatec-monitoring # Prometheus, Grafana, Vector
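One way to render the namespace strategy as manifests; the tier label is an assumption, added so NetworkPolicies can later select traffic by tier:

```yaml
# Sketch: one Namespace manifest per isolation domain
apiVersion: v1
kind: Namespace
metadata:
  name: avivatec-data
  labels:
    avivatec.io/tier: data    # hypothetical label for policy selectors
---
apiVersion: v1
kind: Namespace
metadata:
  name: avivatec-app
  labels:
    avivatec.io/tier: app
# ...repeat for avivatec-ai, avivatec-integration,
#    avivatec-security, avivatec-monitoring
```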
Workload Types:
| Service | K8s Type | Replicas | Scaling |
|---|---|---|---|
| PostgreSQL | StatefulSet | 1 primary + 2 read replicas | Manual |
| immudb | StatefulSet | 1 (single writer) | Manual |
| Redis | StatefulSet | 3 (Sentinel) | Manual |
| FastAPI | Deployment | 3–10 | HPA (CPU 70%, Memory 80%) |
| Go API | Deployment | 3–10 | HPA (CPU 70%, RPS-based) |
| Frontend | Deployment | 2–5 | HPA (CPU 60%) |
| vLLM | Deployment | 1–3 | GPU-aware HPA |
| Airbyte | StatefulSet | 1 | Manual |
| Dagster | Deployment | 1 webserver + N workers | Manual |
| OpenFGA | Deployment | 2–3 | HPA |
| Zitadel | Deployment | 2–3 | HPA |
| Prometheus | StatefulSet | 1 | Manual |
| Grafana | Deployment | 1 | Manual |
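The FastAPI row above (3–10 replicas, CPU 70% / memory 80%) maps directly onto an autoscaling/v2 HPA; the Deployment name is an assumption:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-python
  namespace: avivatec-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-python          # assumed Deployment name
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```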
CronJobs:
| Job | Schedule | Description |
|---|---|---|
| forecast-nightly | 0 2 * * * | NeuralProphet retraining + forecast generation |
| anomaly-scan | 0 3 * * * | Full GL anomaly detection scan |
| dbt-run | 0 1 * * * | Nightly full dbt transformation |
| backup-postgres | 0 0 * * * | Database backup to object storage |
| backup-immudb | 0 0 * * * | Audit trail backup |
| soc2-evidence | 0 6 1 * * | Monthly SOC 2 evidence collection |
| cert-rotate | 0 0 1 * * | TLS certificate rotation check |
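As a sketch, the backup-postgres row becomes a CronJob like the following; the image, secret name, and staging PVC are placeholders:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: backup-postgres
  namespace: avivatec-data
spec:
  schedule: "0 0 * * *"            # nightly, matching the table above
  concurrencyPolicy: Forbid        # never run two backups concurrently
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      backoffLimit: 2
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: pg-dump
              image: postgres:16   # assumes pg_dump client in this image
              command: ["/bin/sh", "-c"]
              args:
                - pg_dump "$DATABASE_URL" | gzip > /backup/$(date +%F).sql.gz
              envFrom:
                - secretRef:
                    name: postgres-backup-credentials  # hypothetical secret
              volumeMounts:
                - name: backup
                  mountPath: /backup
          volumes:
            - name: backup
              persistentVolumeClaim:
                claimName: backup-staging  # staged before upload to object storage
```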
D3. Helm Charts
Chart Structure:
charts/
├── avivatec/ # Umbrella chart
│ ├── Chart.yaml
│ ├── values.yaml # Default values
│ ├── values-dev.yaml # Development overrides
│ ├── values-staging.yaml # Staging overrides
│ ├── values-prod.yaml # Production overrides
│ └── charts/
│ ├── data/ # PostgreSQL, immudb, Redis
│ ├── app/ # FastAPI, Go, Frontend
│ ├── ai/ # vLLM, training jobs
│ ├── integration/ # Airbyte, Dagster, dbt
│ ├── security/ # OpenFGA, Zitadel
│ └── monitoring/ # Prometheus, Grafana, Vector
Configurable Values:
- Tenant isolation mode (shared DB vs. dedicated DB)
- GPU allocation (vLLM replicas, GPU type)
- Storage class (SSD vs. HDD, retention)
- Ingress configuration (domain, TLS, WAF)
- Resource limits per service
- Feature flags (enable/disable AI agents, specific integrations)
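A values-prod.yaml sketch covering the configurable values listed above; every key here is an assumption about the umbrella chart's schema, not a defined interface:

```yaml
# values-prod.yaml — hypothetical key layout
tenancy:
  isolationMode: dedicated-db     # shared-db | dedicated-db
ai:
  vllm:
    replicas: 2
    gpu:
      type: nvidia-l4             # placeholder GPU type
      countPerReplica: 1
storage:
  className: premium-ssd
  retentionDays: 365
ingress:
  domain: app.example.com         # placeholder domain
  tls:
    enabled: true
  waf:
    enabled: true
features:
  agents: true                    # enable/disable AI agents
resources:
  api-python:
    requests: { cpu: "500m", memory: 1Gi }
    limits: { cpu: "2", memory: 4Gi }
```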
D4. Terraform Modules (GCP Primary, AWS Secondary)
GCP Modules:
terraform/
├── modules/
│ ├── gke/ # GKE cluster with GPU node pools
│ ├── cloudsql/ # Cloud SQL for PostgreSQL (HA)
│ ├── gcs/ # Cloud Storage (backups, artifacts)
│ ├── secret-manager/ # Secrets management
│ ├── cloud-build/ # CI/CD pipeline
│ ├── artifact-reg/ # Container registry
│ ├── vpc/ # Network with private subnets
│ ├── cloud-armor/ # WAF and DDoS protection
│ ├── memorystore/ # Managed Redis
│ └── monitoring/ # Cloud Monitoring + alerting
├── environments/
│ ├── dev/
│ ├── staging/
│ └── prod/
└── backend.tf # State storage (GCS)
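An environments/prod/main.tf sketch showing how the modules might compose; module arguments and machine types are assumptions about each module's interface:

```hcl
# environments/prod/main.tf — illustrative module wiring
module "vpc" {
  source     = "../../modules/vpc"
  project_id = var.project_id
  region     = var.region
}

module "gke" {
  source     = "../../modules/gke"
  project_id = var.project_id
  network    = module.vpc.network_name
  subnetwork = module.vpc.private_subnet_name

  # Separate pools so GPU workloads schedule only where intended
  node_pools = {
    general = { machine_type = "n2-standard-8", min = 3, max = 12 }
    gpu     = { machine_type = "g2-standard-8", min = 0, max = 3, accelerator = "nvidia-l4" }
  }
}

module "cloudsql" {
  source            = "../../modules/cloudsql"
  project_id        = var.project_id
  availability_type = "REGIONAL"   # HA primary with automatic failover
  network           = module.vpc.network_id
}
```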
D5. CI/CD Pipeline
Pipeline Stages:
[Push to main] → Lint → Unit Tests → Build Containers
→ Security Scan (Trivy) → Integration Tests
→ Deploy to Staging → E2E Tests → Manual Approval
→ Deploy to Production → Smoke Tests → Monitor
Requirements:
- GitOps: Argo CD or Flux for K8s deployment
- Container scanning: Trivy for CVE detection
- SAST: Semgrep or CodeQL for code security
- DAST: OWASP ZAP for runtime vulnerability testing
- dbt CI: run dbt test on every PR that touches models
- Immutable artifacts: every build produces versioned container image
- Rollback: one-command rollback to any previous version
- Audit: every deployment logged with approver, timestamp, change scope
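Taking the Argo CD option, one Application per environment can point at the umbrella chart with its value overrides; the repo URL is a placeholder:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: avivatec-prod
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/avivatec/platform.git  # placeholder repo
    targetRevision: main
    path: charts/avivatec
    helm:
      valueFiles:
        - values.yaml
        - values-prod.yaml      # environment override from D3
  destination:
    server: https://kubernetes.default.svc
    namespace: avivatec-app
  syncPolicy:
    automated:
      prune: true               # remove resources deleted from Git
      selfHeal: true            # revert manual drift
    syncOptions:
      - CreateNamespace=true
```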
D6. Monitoring & Observability
Metrics (Prometheus):
- API latency (P50, P95, P99) per endpoint
- Error rate (4xx, 5xx) per service
- Database connection pool utilization
- AI inference latency and throughput
- Agent task completion rate and duration
- Queue depths (Dagster, agent tasks)
- Tenant-level resource consumption
Dashboards (Grafana):
- Platform Overview (uptime, latency, errors, active users)
- AI Performance (inference latency, model accuracy, token usage)
- Data Pipeline Health (Airbyte sync status, dbt run times, freshness)
- Security (auth failures, RBAC denials, anomaly alerts)
- Tenant Usage (per-tenant resource consumption for billing)
Alerting:
| Alert | Condition | Severity | Channel |
|---|---|---|---|
| API P95 > 500ms | 5-min window | Warning | Slack |
| Error rate > 1% | 5-min window | Critical | PagerDuty |
| DB connections > 80% | Instant | Warning | Slack |
| AI inference timeout | 3 consecutive | Critical | PagerDuty |
| Disk usage > 85% | Instant | Warning | Slack |
| Backup failure | Any | Critical | PagerDuty + Email |
| SOC 2 evidence gap | Daily check | High | Email + Slack |
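The first two alert rows translate into Prometheus rules roughly as follows, assuming standard HTTP histogram/counter metric names (these names depend on how the APIs are instrumented):

```yaml
groups:
  - name: api-slos
    rules:
      - alert: ApiLatencyP95High
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
        for: 5m
        labels:
          severity: warning      # routed to Slack by Alertmanager
        annotations:
          summary: "API P95 latency above 500ms for 5 minutes"
      - alert: ApiErrorRateHigh
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical     # routed to PagerDuty
        annotations:
          summary: "API 5xx error rate above 1%"
```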
D7. Disaster Recovery
Targets:
- RTO (Recovery Time Objective): < 4 hours
- RPO (Recovery Point Objective): < 15 minutes
Strategy:
- PostgreSQL: Streaming replication to standby region + WAL archival to object storage
- immudb: Cross-region backup (daily full + hourly incremental)
- Application: Multi-region K8s with failover DNS (Cloud DNS / Route 53)
- Data: Airbyte sync replays from last checkpoint
- Secrets: Vault replication to DR region
DR Playbook (runbook):
- Detect failure (automated health check or manual escalation)
- Activate DR region DNS failover
- Promote PostgreSQL standby to primary
- Verify immudb integrity (Merkle tree verification)
- Scale up application pods in DR region
- Run smoke tests against DR deployment
- Notify stakeholders
- Post-incident: root cause analysis + playbook update
CONSTRAINTS
- All infrastructure as code — no manual configuration in production
- Every service must have health checks (liveness + readiness probes)
- GPU workloads must be schedulable to specific node pools
- Storage must support encryption at rest
- Network policies: deny-all default, explicit allow rules
- All secrets via HashiCorp Vault or cloud-native secret manager — never in code/config
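The deny-all-by-default network constraint can be enforced with a baseline NetworkPolicy per namespace plus explicit allows; the allow rule below is an illustrative example, and the pod label is hypothetical:

```yaml
# Baseline: deny all ingress and egress in the namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: avivatec-app
spec:
  podSelector: {}               # applies to every pod in the namespace
  policyTypes: [Ingress, Egress]
---
# Explicit allow: app tier may reach PostgreSQL in the data namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-app-to-postgres
  namespace: avivatec-data
spec:
  podSelector:
    matchLabels:
      app: postgres             # hypothetical pod label
  policyTypes: [Ingress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: avivatec-app
      ports:
        - protocol: TCP
          port: 5432
```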
RESEARCH QUESTIONS
- What is the optimal GKE node pool configuration for mixed CPU/GPU workloads (API + vLLM)?
- How should PostgreSQL HA be configured for a multi-tenant FP&A platform (Patroni vs. Cloud SQL)?
- What is the best Argo CD strategy for managing multiple environments with Helm value overrides?
- How to implement zero-downtime deployments for stateful services (PostgreSQL, immudb)?
- What is the recommended approach for GPU auto-scaling (vLLM) based on inference queue depth?