08 — Infrastructure: Docker, Kubernetes, Helm, Terraform & DR

Domain: Container orchestration, IaC, CI/CD, monitoring, disaster recovery
Dependencies: 01–04 (all services defined), needs full service inventory
Outputs: Docker Compose, K8s manifests, Helm charts, Terraform modules, DR playbook


ROLE

You are a Senior Platform Engineer specializing in Kubernetes-based deployments for regulated financial platforms. You design infrastructure that passes SOC 2 audits, supports multi-region deployment, achieves 99.9% uptime, and enables a senior engineer to deploy the full stack in under 4 hours.


OBJECTIVE

Design the complete infrastructure layer for the Avivatec FP&A platform: local development environment (Docker Compose), production deployment (Kubernetes + Helm), cloud provisioning (Terraform), CI/CD pipelines, monitoring stack, and disaster recovery.


DELIVERABLES

D1. Docker Compose (Local Development)

16-Service Stack:

services:
  # Data Layer
  postgres:        # PostgreSQL 16 with pgaudit, pg_partman
  immudb:          # Immutable audit trail
  redis:           # Cache + session store + agent memory
  duckdb-api:      # DuckDB analytics (sidecar REST API)

  # Application Layer
  api-python:      # FastAPI (AI/ML, NLQ, agents)
  api-go:          # Go (high-throughput: reconciliation, real-time)
  frontend:        # React (Refine) development server

  # AI Layer
  vllm:            # DeepSeek-R1 model serving (GPU optional)
  # OR ollama:     # CPU fallback for local dev without GPU

  # Integration Layer
  airbyte:         # ELT connector management
  dagster:         # Pipeline orchestration (webserver + daemon)
  dbt:             # Transformation runner (on-demand)

  # Security Layer
  openfga:         # Authorization engine
  zitadel:         # Identity provider

  # Monitoring
  prometheus:      # Metrics collection
  grafana:         # Dashboards
  vector:          # Log routing

Requirements:

  • A single `docker compose up` brings up the entire stack
  • Health checks on every service
  • Volume mounts for data persistence across restarts
  • GPU passthrough config for vLLM (optional — falls back to Ollama)
  • Seed data scripts for demo tenant
  • Environment variable configuration via .env file
  • Network isolation: services communicate via internal Docker network
  • Port mapping: only frontend (3000), API (8000/8001), Grafana (3001) exposed
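The requirements above can be sketched for a single service entry; a minimal fragment combining a health check, `.env` configuration, and optional GPU passthrough (the build path, image tag, and health endpoint are assumptions, not fixed choices):

```yaml
# docker-compose.yml (fragment) — illustrative, not the full 16-service file
services:
  api-python:
    build: ./services/api-python        # path is an assumption
    env_file: .env                      # shared environment configuration
    ports:
      - "8000:8000"                     # one of the few exposed ports
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 10s
      timeout: 3s
      retries: 5
    networks: [internal]

  vllm:
    image: vllm/vllm-openai:latest      # image tag is an assumption
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]       # GPU passthrough; omit to fall back to Ollama
    networks: [internal]

networks:
  internal:                             # services talk only on this internal network
```

Services without a published `ports:` entry stay reachable only on the internal network, which is how the "deny by default, expose three ports" requirement falls out of the Compose file itself.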

D2. Kubernetes Production Manifests

Namespace Strategy:

avivatec-data         # PostgreSQL, immudb, Redis, DuckDB
avivatec-app          # FastAPI, Go API, Frontend
avivatec-ai           # vLLM, NeuralProphet training jobs
avivatec-integration  # Airbyte, Dagster, dbt
avivatec-security     # OpenFGA, Zitadel
avivatec-monitoring   # Prometheus, Grafana, Vector

Workload Types:

| Service    | K8s Type    | Replicas                    | Scaling                   |
|------------|-------------|-----------------------------|---------------------------|
| PostgreSQL | StatefulSet | 1 primary + 2 read replicas | Manual                    |
| immudb     | StatefulSet | 1 (single writer)           | Manual                    |
| Redis      | StatefulSet | 3 (Sentinel)                | Manual                    |
| FastAPI    | Deployment  | 3–10                        | HPA (CPU 70%, Memory 80%) |
| Go API     | Deployment  | 3–10                        | HPA (CPU 70%, RPS-based)  |
| Frontend   | Deployment  | 2–5                         | HPA (CPU 60%)             |
| vLLM       | Deployment  | 1–3                         | GPU-aware HPA             |
| Airbyte    | StatefulSet | 1                           | Manual                    |
| Dagster    | Deployment  | 1 webserver + N workers     | Manual                    |
| OpenFGA    | Deployment  | 2–3                         | HPA                       |
| Zitadel    | Deployment  | 2–3                         | HPA                       |
| Prometheus | StatefulSet | 1                           | Manual                    |
| Grafana    | Deployment  | 1                           | Manual                    |
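The HPA entries in the table map directly onto `autoscaling/v2` resources; a sketch for the FastAPI service (workload name and namespace follow the conventions above and are otherwise assumptions):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-python
  namespace: avivatec-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-python
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70      # scale out above 70% CPU
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80      # or above 80% memory
```

The Go API's RPS-based target and the GPU-aware vLLM scaler would use `type: Pods` or `type: External` metrics instead, which requires a metrics adapter (e.g. the Prometheus adapter) alongside the stock metrics server.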

CronJobs:

| Job             | Schedule    | Description                                    |
|-----------------|-------------|------------------------------------------------|
| forecast-nightly| `0 2 * * *` | NeuralProphet retraining + forecast generation |
| anomaly-scan    | `0 3 * * *` | Full GL anomaly detection scan                 |
| dbt-run         | `0 1 * * *` | Nightly full dbt transformation                |
| backup-postgres | `0 0 * * *` | Database backup to object storage              |
| backup-immudb   | `0 0 * * *` | Audit trail backup                             |
| soc2-evidence   | `0 6 1 * *` | Monthly SOC 2 evidence collection              |
| cert-rotate     | `0 0 1 * *` | TLS certificate rotation check                 |
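Each row becomes a `batch/v1` CronJob; a sketch for the nightly PostgreSQL backup (the backup image and secret name are assumptions):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: backup-postgres
  namespace: avivatec-data
spec:
  schedule: "0 0 * * *"
  concurrencyPolicy: Forbid            # never run two backups at once
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3            # keep evidence for the backup-failure alert
  jobTemplate:
    spec:
      backoffLimit: 2
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: avivatec/pg-backup:latest     # image name is an assumption
              envFrom:
                - secretRef:
                    name: pg-backup-credentials    # secret name is an assumption
```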

D3. Helm Charts

Chart Structure:

charts/
├── avivatec/                    # Umbrella chart
│   ├── Chart.yaml
│   ├── values.yaml              # Default values
│   ├── values-dev.yaml          # Development overrides
│   ├── values-staging.yaml      # Staging overrides
│   ├── values-prod.yaml         # Production overrides
│   └── charts/
│       ├── data/                # PostgreSQL, immudb, Redis
│       ├── app/                 # FastAPI, Go, Frontend
│       ├── ai/                  # vLLM, training jobs
│       ├── integration/         # Airbyte, Dagster, dbt
│       ├── security/            # OpenFGA, Zitadel
│       └── monitoring/          # Prometheus, Grafana, Vector

Configurable Values:

  • Tenant isolation mode (shared DB vs. dedicated DB)
  • GPU allocation (vLLM replicas, GPU type)
  • Storage class (SSD vs. HDD, retention)
  • Ingress configuration (domain, TLS, WAF)
  • Resource limits per service
  • Feature flags (enable/disable AI agents, specific integrations)
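The configurable values above could surface in `values-prod.yaml` roughly like this; the key names are illustrative, not a fixed schema:

```yaml
# values-prod.yaml (fragment) — keys are illustrative assumptions
tenancy:
  isolationMode: dedicated-db      # shared-db | dedicated-db
ai:
  enabled: true
  vllm:
    replicas: 2
    gpuType: nvidia-l4             # GPU type is an assumption
storage:
  class: premium-rwo               # SSD-backed storage class
  retentionDays: 90
ingress:
  domain: app.avivatec.example     # placeholder domain
  tls: true
  waf: true
featureFlags:
  agents: true
  integrations:
    quickbooks: false              # example flag; integration name is an assumption
```

Environment overlays (`values-dev.yaml`, `values-staging.yaml`) then override only the keys that differ, which keeps drift between environments visible in a single diff.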

D4. Terraform Modules (GCP Primary, AWS Secondary)

GCP Modules:

terraform/
├── modules/
│   ├── gke/                     # GKE cluster with GPU node pools
│   ├── cloudsql/                # Cloud SQL for PostgreSQL (HA)
│   ├── gcs/                     # Cloud Storage (backups, artifacts)
│   ├── secret-manager/          # Secrets management
│   ├── cloud-build/             # CI/CD pipeline
│   ├── artifact-reg/            # Container registry
│   ├── vpc/                     # Network with private subnets
│   ├── cloud-armor/             # WAF and DDoS protection
│   ├── memorystore/             # Managed Redis
│   └── monitoring/              # Cloud Monitoring + alerting
├── environments/
│   ├── dev/
│   ├── staging/
│   └── prod/
└── backend.tf                   # State storage (GCS)
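A sketch of how an environment might consume the `gke` module with a mixed CPU/GPU node-pool layout; the module's input variable names, machine types, and bucket name are all assumptions:

```hcl
# environments/prod/main.tf (sketch) — inputs are illustrative, not the
# modules' actual variable contract
module "gke" {
  source     = "../../modules/gke"
  project_id = var.project_id
  region     = "us-east1"

  node_pools = {
    general = { machine_type = "n2-standard-8", min = 3, max = 10 }
    gpu = {
      machine_type = "g2-standard-8"          # L4 GPU family; an assumption
      accelerator  = "nvidia-l4"
      min          = 0                        # scale GPU pool to zero when idle
      max          = 3
      taints       = ["gpu=true:NoSchedule"]  # keep CPU workloads off GPU nodes
    }
  }
}

terraform {
  backend "gcs" {
    bucket = "avivatec-tf-state"              # bucket name is a placeholder
    prefix = "prod"
  }
}
```

Tainting the GPU pool and tolerating the taint only on vLLM pods satisfies the "GPU workloads schedulable to specific node pools" constraint without pinning other services.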

D5. CI/CD Pipeline

Pipeline Stages:

[Push to main] → Lint → Unit Tests → Build Containers
→ Security Scan (Trivy) → Integration Tests
→ Deploy to Staging → E2E Tests → Manual Approval
→ Deploy to Production → Smoke Tests → Monitor

Requirements:

  • GitOps: Argo CD or Flux for K8s deployment
  • Container scanning: Trivy for CVE detection
  • SAST: Semgrep or CodeQL for code security
  • DAST: OWASP ZAP for runtime vulnerability testing
  • dbt CI: run dbt test on every PR that touches models
  • Immutable artifacts: every build produces versioned container image
  • Rollback: one-command rollback to any previous version
  • Audit: every deployment logged with approver, timestamp, change scope
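Since a `cloud-build` module is already provisioned, the build-and-scan stages might look like this Cloud Build fragment; the step images, substitutions, and repo layout are assumptions:

```yaml
# cloudbuild.yaml (fragment) — images, args, and substitutions are illustrative
steps:
  - id: build
    name: gcr.io/cloud-builders/docker
    args:
      - build
      - "-t"
      - "$_REGION-docker.pkg.dev/$PROJECT_ID/avivatec/api-python:$SHORT_SHA"
      - "."
  - id: scan
    name: aquasec/trivy                  # fail the build on high/critical CVEs
    args:
      - image
      - "--exit-code=1"
      - "--severity=HIGH,CRITICAL"
      - "$_REGION-docker.pkg.dev/$PROJECT_ID/avivatec/api-python:$SHORT_SHA"
  - id: push
    name: gcr.io/cloud-builders/docker
    args:
      - push
      - "$_REGION-docker.pkg.dev/$PROJECT_ID/avivatec/api-python:$SHORT_SHA"
  # Deployment is delegated to GitOps: the pipeline bumps the image tag in the
  # environment repo and Argo CD / Flux reconciles the cluster from there.
images:
  - "$_REGION-docker.pkg.dev/$PROJECT_ID/avivatec/api-python:$SHORT_SHA"
```

Keeping deployment out of the build pipeline (image tag bump + GitOps reconcile) is what makes one-command rollback possible: reverting the tag commit reverts the cluster.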

D6. Monitoring & Observability

Metrics (Prometheus):

  • API latency (P50, P95, P99) per endpoint
  • Error rate (4xx, 5xx) per service
  • Database connection pool utilization
  • AI inference latency and throughput
  • Agent task completion rate and duration
  • Queue depths (Dagster, agent tasks)
  • Tenant-level resource consumption

Dashboards (Grafana):

  • Platform Overview (uptime, latency, errors, active users)
  • AI Performance (inference latency, model accuracy, token usage)
  • Data Pipeline Health (Airbyte sync status, dbt run times, freshness)
  • Security (auth failures, RBAC denials, anomaly alerts)
  • Tenant Usage (per-tenant resource consumption for billing)

Alerting:

| Alert                 | Condition     | Severity | Channel           |
|-----------------------|---------------|----------|-------------------|
| API P95 > 500ms       | 5-min window  | Warning  | Slack             |
| Error rate > 1%       | 5-min window  | Critical | PagerDuty         |
| DB connections > 80%  | Instant       | Warning  | Slack             |
| AI inference timeout  | 3 consecutive | Critical | PagerDuty         |
| Disk usage > 85%      | Instant       | Warning  | Email             |
| Backup failure        | Any           | Critical | PagerDuty + Email |
| SOC 2 evidence gap    | Daily check   | High     | Email + Slack     |
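The first two rows translate into Prometheus rules along these lines; the histogram and counter metric names are assumptions that depend on how the APIs export metrics:

```yaml
# prometheus-rules.yaml (fragment) — metric names are illustrative
groups:
  - name: avivatec-api
    rules:
      - alert: ApiP95LatencyHigh
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
          ) > 0.5
        for: 5m                          # sustained for the 5-min window
        labels:
          severity: warning              # routed to Slack by Alertmanager
        annotations:
          summary: "API P95 latency above 500ms on {{ $labels.endpoint }}"
      - alert: ErrorRateHigh
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical             # routed to PagerDuty by Alertmanager
```

Severity-to-channel mapping (Slack vs. PagerDuty vs. Email) lives in the Alertmanager routing tree, not in the rules themselves.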

D7. Disaster Recovery

Targets:

  • RTO (Recovery Time Objective): < 4 hours
  • RPO (Recovery Point Objective): < 15 minutes

Strategy:

  • PostgreSQL: Streaming replication to standby region + WAL archival to object storage
  • immudb: Cross-region backup (daily full + hourly incremental)
  • Application: Multi-region K8s with failover DNS (Cloud DNS / Route 53)
  • Data: Airbyte sync replays from last checkpoint
  • Secrets: Vault replication to DR region
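The PostgreSQL leg of this strategy (streaming replication plus WAL archival) reduces to a handful of primary-side settings; a sketch assuming a tool like WAL-G handles object-storage uploads (the archive command is an assumption):

```ini
# postgresql.conf (primary) — values are starting points, tune per workload
wal_level = replica                    # required for streaming replication
max_wal_senders = 5                    # standby + backup connections
archive_mode = on
archive_command = 'wal-g wal-push %p'  # assumes WAL-G configured for the backup bucket
archive_timeout = 300                  # force a WAL switch every 5 min,
                                       # keeping archival lag well under the 15-min RPO
hot_standby = on                       # takes effect on the standby: serve reads until promoted
```

The `archive_timeout` of 300 seconds bounds how stale the archived WAL can be, which is what makes the < 15-minute RPO defensible even during low write traffic.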

DR Playbook (runbook):

  1. Detect failure (automated health check or manual escalation)
  2. Activate DR region DNS failover
  3. Promote PostgreSQL standby to primary
  4. Verify immudb integrity (Merkle tree verification)
  5. Scale up application pods in DR region
  6. Run smoke tests against DR deployment
  7. Notify stakeholders
  8. Post-incident: root cause analysis + playbook update

CONSTRAINTS

  • All infrastructure as code — no manual configuration in production
  • Every service must have health checks (liveness + readiness probes)
  • GPU workloads must be schedulable to specific node pools
  • Storage must support encryption at rest
  • Network policies: deny-all default, explicit allow rules
  • All secrets via HashiCorp Vault or cloud-native secret manager — never in code/config
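The deny-all-plus-explicit-allow constraint is expressed per namespace with two kinds of NetworkPolicy; a sketch (pod labels are assumptions):

```yaml
# default deny for a namespace; one explicit allow per required flow
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: avivatec-app
spec:
  podSelector: {}                  # matches every pod in the namespace
  policyTypes: [Ingress, Egress]
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-to-postgres      # example allow rule; labels are assumptions
  namespace: avivatec-data
spec:
  podSelector:
    matchLabels: { app: postgres }
  policyTypes: [Ingress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels: { kubernetes.io/metadata.name: avivatec-app }
      ports:
        - port: 5432
```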

RESEARCH QUESTIONS

  1. What is the optimal GKE node pool configuration for mixed CPU/GPU workloads (API + vLLM)?
  2. How should PostgreSQL HA be configured for a multi-tenant FP&A platform (Patroni vs. Cloud SQL)?
  3. What is the best Argo CD strategy for managing multiple environments with Helm value overrides?
  4. How can zero-downtime deployments be implemented for stateful services (PostgreSQL, immudb)?
  5. What is the recommended approach for GPU auto-scaling (vLLM) based on inference queue depth?