
ADR-009: GCP Infrastructure Architecture

Status: Accepted
Date: 2025-11-30
Deciders: Hal Casteel (CTO), Architecture Team
Related ADRs:


Context

The CODITECT license server requires production-ready infrastructure to support multi-tenant SaaS operations with high availability, security, and cost efficiency.

Business Requirements

Scale Targets:

  • Support 10,000 concurrent license sessions
  • Handle 1M+ license validations per day
  • Serve 1,000+ tenant organizations
  • 100,000+ total registered users

Availability Requirements:

  • 99.9% uptime SLA (8.76 hours downtime/year max)
  • <100ms license validation latency (p99)
  • Zero-downtime deployments for application updates
  • Automated failover for database failures
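The availability target translates directly into a downtime budget. A quick sanity check of the 99.9% figure (pure arithmetic, no GCP specifics):

```python
# Downtime budget implied by an availability SLA.
HOURS_PER_YEAR = 365 * 24  # 8760

def downtime_budget_hours(sla: float, hours: float = HOURS_PER_YEAR) -> float:
    """Maximum allowed downtime for a given availability target."""
    return hours * (1 - sla)

print(downtime_budget_hours(0.999))       # yearly budget for 99.9% -> 8.76 h
print(downtime_budget_hours(0.999, 730))  # ~monthly budget (730 h/month) -> ~44 min
```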

Security Requirements:

  • Encrypted data at rest (database, Redis, secrets)
  • Encrypted data in transit (TLS 1.3 for all connections)
  • Network isolation (private VPC, no public IPs)
  • Compliance: SOC 2 Type II, GDPR, HIPAA-ready

Cost Requirements:

  • Development environment: <$500/month
  • Production environment: <$2,000/month at launch
  • Auto-scaling to minimize costs during low usage
  • Reserved instances for predictable workloads

Technical Constraints

Platform Requirements:

  • Google Cloud Platform (GCP) - existing organization account
  • Kubernetes for container orchestration
  • PostgreSQL 15+ for database (RLS support required)
  • Redis for caching and session management

Deployment Environment:

  • Geographic regions: us-central1 (primary), us-east1 (DR)
  • Multi-AZ deployment for high availability
  • Terraform/OpenTofu for infrastructure as code
  • GitHub Actions for CI/CD

Integration Requirements:

  • Cloud SQL for managed PostgreSQL
  • Redis Memorystore for managed Redis
  • Cloud KMS for encryption key management
  • Cloud Load Balancing for ingress
  • Cloud Monitoring for observability

Decision

We will deploy on Google Kubernetes Engine (GKE) with Cloud SQL PostgreSQL, Redis Memorystore, and Cloud KMS for production infrastructure.

Core Architecture

Infrastructure Components

1. Google Kubernetes Engine (GKE)

Regional GKE cluster with 3 zones for high availability, node auto-scaling, and automatic cluster upgrades.

2. Cloud SQL PostgreSQL

Managed PostgreSQL 15 with High Availability (HA) configuration, automatic failover, and point-in-time recovery.

3. Redis Memorystore

Managed Redis cluster (Standard Tier) with automatic failover and backup.

4. Cloud KMS

Managed encryption key service for encrypting sensitive tenant data.

5. Cloud Load Balancer

Global HTTPS load balancer with SSL termination and DDoS protection.


Implementation

Phase 1: Network Infrastructure (OpenTofu)

VPC Network

# infrastructure/terraform/modules/vpc/main.tf

terraform {
  required_version = ">= 1.5"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

variable "project_id" {
  type        = string
  description = "GCP project ID"
}

variable "region" {
  type        = string
  description = "GCP region"
  default     = "us-central1"
}

variable "environment" {
  type        = string
  description = "Environment name (dev, staging, prod)"
  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Environment must be dev, staging, or prod."
  }
}

# VPC Network
resource "google_compute_network" "vpc" {
  name                    = "coditect-${var.environment}-vpc"
  auto_create_subnetworks = false
  project                 = var.project_id
}

# GKE Subnet
resource "google_compute_subnetwork" "gke_subnet" {
  name          = "coditect-${var.environment}-gke-subnet"
  ip_cidr_range = "10.0.0.0/20"
  region        = var.region
  network       = google_compute_network.vpc.id
  project       = var.project_id

  # Secondary ranges for GKE pods and services
  secondary_ip_range {
    range_name    = "pods"
    ip_cidr_range = "10.1.0.0/16"
  }

  secondary_ip_range {
    range_name    = "services"
    ip_cidr_range = "10.2.0.0/20"
  }

  # Enable Private Google Access for managed services
  private_ip_google_access = true
}

# Cloud SQL Subnet
resource "google_compute_subnetwork" "cloudsql_subnet" {
  name          = "coditect-${var.environment}-cloudsql-subnet"
  ip_cidr_range = "10.10.0.0/24"
  region        = var.region
  network       = google_compute_network.vpc.id
  project       = var.project_id

  private_ip_google_access = true
}

# Redis Subnet
resource "google_compute_subnetwork" "redis_subnet" {
  name          = "coditect-${var.environment}-redis-subnet"
  ip_cidr_range = "10.20.0.0/24"
  region        = var.region
  network       = google_compute_network.vpc.id
  project       = var.project_id

  private_ip_google_access = true
}

# Cloud Router for NAT
resource "google_compute_router" "router" {
  name    = "coditect-${var.environment}-router"
  region  = var.region
  network = google_compute_network.vpc.id
  project = var.project_id
}

# Cloud NAT for outbound internet access
resource "google_compute_router_nat" "nat" {
  name                               = "coditect-${var.environment}-nat"
  router                             = google_compute_router.router.name
  region                             = var.region
  nat_ip_allocate_option             = "AUTO_ONLY"
  source_subnetwork_ip_ranges_to_nat = "ALL_SUBNETWORKS_ALL_IP_RANGES"
  project                            = var.project_id

  log_config {
    enable = true
    filter = "ERRORS_ONLY"
  }
}

# Firewall Rules
resource "google_compute_firewall" "allow_internal" {
  name    = "coditect-${var.environment}-allow-internal"
  network = google_compute_network.vpc.name
  project = var.project_id

  allow {
    protocol = "tcp"
    ports    = ["0-65535"]
  }

  allow {
    protocol = "udp"
    ports    = ["0-65535"]
  }

  allow {
    protocol = "icmp"
  }

  source_ranges = [
    "10.0.0.0/20",  # GKE subnet
    "10.1.0.0/16",  # Pods
    "10.2.0.0/20",  # Services
    "10.10.0.0/24", # Cloud SQL
    "10.20.0.0/24", # Redis
  ]
}

resource "google_compute_firewall" "allow_health_checks" {
  name    = "coditect-${var.environment}-allow-health-checks"
  network = google_compute_network.vpc.name
  project = var.project_id

  allow {
    protocol = "tcp"
    ports    = ["80", "443", "8080"]
  }

  # Google Cloud health check IP ranges
  source_ranges = [
    "35.191.0.0/16",
    "130.211.0.0/22",
  ]

  target_tags = ["gke-node"]
}

resource "google_compute_firewall" "deny_all_ingress" {
  name     = "coditect-${var.environment}-deny-all-ingress"
  network  = google_compute_network.vpc.name
  project  = var.project_id
  priority = 65534

  deny {
    protocol = "all"
  }

  source_ranges = ["0.0.0.0/0"]
}

# Outputs
output "vpc_id" {
  value       = google_compute_network.vpc.id
  description = "VPC network ID"
}

output "gke_subnet_name" {
  value       = google_compute_subnetwork.gke_subnet.name
  description = "GKE subnet name"
}

output "gke_subnet_secondary_range_pods" {
  value       = google_compute_subnetwork.gke_subnet.secondary_ip_range[0].range_name
  description = "Secondary IP range name for pods"
}

output "gke_subnet_secondary_range_services" {
  value       = google_compute_subnetwork.gke_subnet.secondary_ip_range[1].range_name
  description = "Secondary IP range name for services"
}
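The CIDR plan above can be sanity-checked mechanically: the five ranges must not overlap, and each must be large enough for its role. A minimal check using only the standard library:

```python
import ipaddress

# CIDR ranges from the VPC module above.
ranges = {
    "gke_nodes": "10.0.0.0/20",
    "pods": "10.1.0.0/16",
    "services": "10.2.0.0/20",
    "cloudsql": "10.10.0.0/24",
    "redis": "10.20.0.0/24",
}

# Address capacity per range (GCP reserves a few addresses per subnet,
# ignored here for a rough check).
for name, cidr in ranges.items():
    net = ipaddress.ip_network(cidr)
    print(f"{name}: {net.num_addresses} addresses")

# Overlapping ranges would break routing inside the VPC.
nets = [ipaddress.ip_network(c) for c in ranges.values()]
assert all(not a.overlaps(b) for i, a in enumerate(nets) for b in nets[i + 1:])
```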

Phase 2: GKE Cluster (OpenTofu)

# infrastructure/terraform/modules/gke/main.tf

terraform {
  required_version = ">= 1.5"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

variable "project_id" {
  type = string
}

variable "region" {
  type    = string
  default = "us-central1"
}

variable "environment" {
  type = string
}

variable "vpc_name" {
  type = string
}

variable "subnet_name" {
  type = string
}

variable "pods_range_name" {
  type = string
}

variable "services_range_name" {
  type = string
}

variable "node_machine_type" {
  type    = string
  default = "n2-standard-4"
}

variable "min_node_count" {
  type    = number
  default = 1
}

variable "max_node_count" {
  type    = number
  default = 10
}

# GKE Cluster
resource "google_container_cluster" "primary" {
  name     = "coditect-${var.environment}-gke"
  location = var.region
  project  = var.project_id

  # Regional cluster (multi-zone by default)
  # Kubernetes version
  min_master_version = "1.28"

  # We can't create a cluster with no node pool defined, but we want to only use
  # separately managed node pools. So we create the smallest possible default
  # node pool and immediately delete it.
  remove_default_node_pool = true
  initial_node_count       = 1

  # Network configuration
  network    = var.vpc_name
  subnetwork = var.subnet_name

  # IP allocation policy for VPC-native cluster
  ip_allocation_policy {
    cluster_secondary_range_name  = var.pods_range_name
    services_secondary_range_name = var.services_range_name
  }

  # Private cluster configuration
  private_cluster_config {
    enable_private_nodes    = true
    enable_private_endpoint = false # Set to true for fully private
    master_ipv4_cidr_block  = "172.16.0.0/28"
  }

  # Master authorized networks (for kubectl access)
  master_authorized_networks_config {
    cidr_blocks {
      cidr_block   = "0.0.0.0/0" # TODO: Restrict to office/VPN IPs in prod
      display_name = "All"
    }
  }

  # Workload Identity for secure pod authentication
  workload_identity_config {
    workload_pool = "${var.project_id}.svc.id.goog"
  }

  # Add-ons
  addons_config {
    http_load_balancing {
      disabled = false
    }

    horizontal_pod_autoscaling {
      disabled = false
    }

    network_policy_config {
      disabled = false
    }

    gcp_filestore_csi_driver_config {
      enabled = false
    }
  }

  # Network policy (the provider must be CALICO when network policy is enabled)
  network_policy {
    enabled  = true
    provider = "CALICO"
  }

  # Binary Authorization (for image security)
  binary_authorization {
    evaluation_mode = "PROJECT_SINGLETON_POLICY_ENFORCE"
  }

  # Maintenance window
  maintenance_policy {
    daily_maintenance_window {
      start_time = "03:00" # 3 AM UTC
    }
  }

  # Monitoring and logging
  logging_service    = "logging.googleapis.com/kubernetes"
  monitoring_service = "monitoring.googleapis.com/kubernetes"

  # Resource labels
  resource_labels = {
    environment = var.environment
    managed_by  = "terraform"
  }
}

# Node Pool
resource "google_container_node_pool" "primary_nodes" {
  name     = "coditect-${var.environment}-node-pool"
  location = var.region
  cluster  = google_container_cluster.primary.name
  project  = var.project_id

  # Auto-scaling configuration
  autoscaling {
    min_node_count = var.min_node_count
    max_node_count = var.max_node_count
  }

  # Auto-repair and auto-upgrade
  management {
    auto_repair  = true
    auto_upgrade = true
  }

  # Node configuration
  node_config {
    machine_type = var.node_machine_type
    disk_size_gb = 100
    disk_type    = "pd-standard"

    # Service account for nodes
    service_account = google_service_account.gke_node_sa.email
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]

    # Workload Identity
    workload_metadata_config {
      mode = "GKE_METADATA"
    }

    # Security
    shielded_instance_config {
      enable_secure_boot          = true
      enable_integrity_monitoring = true
    }

    # Network tags
    tags = ["gke-node", "coditect-${var.environment}"]

    # Labels
    labels = {
      environment = var.environment
    }

    # Metadata
    metadata = {
      disable-legacy-endpoints = "true"
    }
  }
}

# Service Account for GKE Nodes
resource "google_service_account" "gke_node_sa" {
  account_id   = "coditect-${var.environment}-gke-node"
  display_name = "Service account for GKE nodes (${var.environment})"
  project      = var.project_id
}

# IAM bindings for node service account
resource "google_project_iam_member" "gke_node_log_writer" {
  project = var.project_id
  role    = "roles/logging.logWriter"
  member  = "serviceAccount:${google_service_account.gke_node_sa.email}"
}

resource "google_project_iam_member" "gke_node_metric_writer" {
  project = var.project_id
  role    = "roles/monitoring.metricWriter"
  member  = "serviceAccount:${google_service_account.gke_node_sa.email}"
}

resource "google_project_iam_member" "gke_node_monitoring_viewer" {
  project = var.project_id
  role    = "roles/monitoring.viewer"
  member  = "serviceAccount:${google_service_account.gke_node_sa.email}"
}

# Outputs
output "cluster_name" {
  value       = google_container_cluster.primary.name
  description = "GKE cluster name"
}

output "cluster_endpoint" {
  value       = google_container_cluster.primary.endpoint
  description = "GKE cluster endpoint"
  sensitive   = true
}

output "cluster_ca_certificate" {
  value       = google_container_cluster.primary.master_auth[0].cluster_ca_certificate
  description = "GKE cluster CA certificate"
  sensitive   = true
}
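One constraint worth checking against the VPC module: GKE carves the pods secondary range into one CIDR block per node (by default a /24, which backs the default 110-pods-per-node limit), so the pods range caps cluster size. For the /16 range above:

```python
import ipaddress

# Secondary "pods" range from the VPC module.
pods_range = ipaddress.ip_network("10.1.0.0/16")

# GKE default: one /24 pod CIDR per node (supports the default 110 pods/node).
per_node_prefix = 24

max_nodes = 2 ** (per_node_prefix - pods_range.prefixlen)
print(max_nodes)  # node slots the pods range supports -> 256, well above max_node_count = 10
```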

Phase 3: Cloud SQL PostgreSQL (OpenTofu)

# infrastructure/terraform/modules/cloudsql/main.tf

terraform {
  required_version = ">= 1.5"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
    random = {
      source  = "hashicorp/random"
      version = "~> 3.5"
    }
  }
}

variable "project_id" {
  type = string
}

variable "region" {
  type    = string
  default = "us-central1"
}

variable "environment" {
  type = string
}

variable "vpc_id" {
  type        = string
  description = "VPC network ID for private IP"
}

variable "database_version" {
  type    = string
  default = "POSTGRES_15"
}

variable "tier" {
  type        = string
  description = "Cloud SQL tier (e.g., db-custom-2-7680)"
  default     = "db-custom-2-7680" # 2 vCPU, 7.5 GB RAM
}

variable "disk_size" {
  type        = number
  description = "Disk size in GB"
  default     = 100
}

variable "backup_enabled" {
  type    = bool
  default = true
}

variable "high_availability" {
  type    = bool
  default = true
}

# Random password for the application database user
resource "random_password" "db_password" {
  length  = 32
  special = true
}

# Separate password for the admin (superuser) account
resource "random_password" "admin_password" {
  length  = 32
  special = true
}

# Private IP allocation for Cloud SQL
resource "google_compute_global_address" "private_ip_address" {
  name          = "coditect-${var.environment}-cloudsql-ip"
  purpose       = "VPC_PEERING"
  address_type  = "INTERNAL"
  prefix_length = 16
  network       = var.vpc_id
  project       = var.project_id
}

# VPC peering connection
resource "google_service_networking_connection" "private_vpc_connection" {
  network                 = var.vpc_id
  service                 = "servicenetworking.googleapis.com"
  reserved_peering_ranges = [google_compute_global_address.private_ip_address.name]
}

# Cloud SQL Instance
resource "google_sql_database_instance" "main" {
  name             = "coditect-${var.environment}-db"
  database_version = var.database_version
  region           = var.region
  project          = var.project_id

  depends_on = [google_service_networking_connection.private_vpc_connection]

  settings {
    tier              = var.tier
    availability_type = var.high_availability ? "REGIONAL" : "ZONAL"
    disk_type         = "PD_SSD"
    disk_size         = var.disk_size
    disk_autoresize   = true

    # Backup configuration
    backup_configuration {
      enabled                        = var.backup_enabled
      start_time                     = "02:00"
      point_in_time_recovery_enabled = true
      transaction_log_retention_days = 7

      backup_retention_settings {
        retained_backups = 30
        retention_unit   = "COUNT"
      }
    }

    # Maintenance window
    maintenance_window {
      day          = 7 # Sunday
      hour         = 3 # 3 AM
      update_track = "stable"
    }

    # IP configuration
    ip_configuration {
      ipv4_enabled    = false # Private IP only
      private_network = var.vpc_id
      require_ssl     = true

      # No authorized networks (private only)
    }

    # Database flags for PostgreSQL.
    # Note: shared_buffers and effective_cache_size are expressed in 8KB pages;
    # work_mem and maintenance_work_mem are expressed in KB.
    database_flags {
      name  = "max_connections"
      value = "200"
    }

    database_flags {
      name  = "shared_buffers"
      value = "262144" # 2 GB (~25% of 7.5 GB RAM, in 8KB pages)
    }

    database_flags {
      name  = "work_mem"
      value = "32768" # 32 MB (in KB)
    }

    database_flags {
      name  = "maintenance_work_mem"
      value = "524288" # 512 MB (in KB)
    }

    database_flags {
      name  = "effective_cache_size"
      value = "720896" # 5.5 GB (~75% of RAM, in 8KB pages)
    }

    database_flags {
      name  = "random_page_cost"
      value = "1.1" # SSD optimization
    }

    database_flags {
      name  = "log_statement"
      value = "ddl" # Log DDL only
    }

    database_flags {
      name  = "log_min_duration_statement"
      value = "1000" # Log queries >1s
    }

    # Insights configuration
    insights_config {
      query_insights_enabled  = true
      query_string_length     = 1024
      record_application_tags = true
      record_client_address   = true
    }
  }

  # Deletion protection
  deletion_protection = var.environment == "prod" ? true : false
}

# Database
resource "google_sql_database" "database" {
  name     = "coditect"
  instance = google_sql_database_instance.main.name
  project  = var.project_id
}

# Database users
resource "google_sql_user" "app_user" {
  name     = "coditect_app"
  instance = google_sql_database_instance.main.name
  password = random_password.db_password.result
  project  = var.project_id
}

resource "google_sql_user" "admin_user" {
  name     = "postgres"
  instance = google_sql_database_instance.main.name
  password = random_password.admin_password.result
  project  = var.project_id
}

# Store database passwords in Secret Manager
resource "google_secret_manager_secret" "db_password" {
  secret_id = "coditect-${var.environment}-db-password"
  project   = var.project_id

  replication {
    auto {}
  }
}

resource "google_secret_manager_secret_version" "db_password" {
  secret      = google_secret_manager_secret.db_password.id
  secret_data = random_password.db_password.result
}

resource "google_secret_manager_secret" "admin_password" {
  secret_id = "coditect-${var.environment}-db-admin-password"
  project   = var.project_id

  replication {
    auto {}
  }
}

resource "google_secret_manager_secret_version" "admin_password" {
  secret      = google_secret_manager_secret.admin_password.id
  secret_data = random_password.admin_password.result
}

# Outputs
output "instance_name" {
  value       = google_sql_database_instance.main.name
  description = "Cloud SQL instance name"
}

output "instance_connection_name" {
  value       = google_sql_database_instance.main.connection_name
  description = "Cloud SQL instance connection name"
}

output "private_ip_address" {
  value       = google_sql_database_instance.main.private_ip_address
  description = "Private IP address of Cloud SQL instance"
  sensitive   = true
}

output "database_name" {
  value       = google_sql_database.database.name
  description = "Database name"
}

output "app_user" {
  value       = google_sql_user.app_user.name
  description = "Application database user"
}

output "db_password_secret_id" {
  value       = google_secret_manager_secret.db_password.secret_id
  description = "Secret Manager secret ID for database password"
}
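The memory-related flags above use PostgreSQL's native units: `shared_buffers` and `effective_cache_size` are counts of 8KB pages, while `work_mem` and `maintenance_work_mem` are in KB. A small helper avoids the factor-of-two mistakes that creep in when converting by hand:

```python
# PostgreSQL flag unit conversions: shared_buffers / effective_cache_size are
# in 8 KB pages; work_mem / maintenance_work_mem are in KB.
PAGE_KB = 8

def gb_to_pages(gb: float) -> int:
    """Gigabytes -> 8KB-page count (for shared_buffers, effective_cache_size)."""
    return int(gb * 1024 * 1024 // PAGE_KB)

def mb_to_kb(mb: int) -> int:
    """Megabytes -> KB (for work_mem, maintenance_work_mem)."""
    return mb * 1024

print(gb_to_pages(2))    # shared_buffers = 2 GB -> 262144 pages
print(gb_to_pages(5.5))  # effective_cache_size = 5.5 GB -> 720896 pages
print(mb_to_kb(32))      # work_mem = 32 MB -> 32768 KB
```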

Phase 4: Redis Memorystore (OpenTofu)

# infrastructure/terraform/modules/redis/main.tf

terraform {
  required_version = ">= 1.5"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

variable "project_id" {
  type = string
}

variable "region" {
  type    = string
  default = "us-central1"
}

variable "environment" {
  type = string
}

variable "vpc_id" {
  type        = string
  description = "VPC network ID for private connection"
}

variable "memory_size_gb" {
  type        = number
  description = "Redis memory size in GB"
  default     = 5
}

variable "redis_version" {
  type    = string
  default = "REDIS_7_0"
}

variable "tier" {
  type        = string
  description = "Service tier (BASIC or STANDARD_HA)"
  default     = "STANDARD_HA"
}

# Redis Instance
resource "google_redis_instance" "cache" {
  name               = "coditect-${var.environment}-redis"
  tier               = var.tier
  memory_size_gb     = var.memory_size_gb
  region             = var.region
  redis_version      = var.redis_version
  authorized_network = var.vpc_id
  project            = var.project_id

  # High availability configuration
  replica_count      = var.tier == "STANDARD_HA" ? 1 : 0
  read_replicas_mode = var.tier == "STANDARD_HA" ? "READ_REPLICAS_ENABLED" : null

  # Persistence configuration
  persistence_config {
    persistence_mode    = "RDB"
    rdb_snapshot_period = "TWELVE_HOURS"
  }

  # Maintenance window
  maintenance_policy {
    weekly_maintenance_window {
      day = "SUNDAY"
      start_time {
        hours   = 3
        minutes = 0
      }
    }
  }

  # Redis configuration
  redis_configs = {
    maxmemory-policy       = "allkeys-lru"
    notify-keyspace-events = "Ex"
    timeout                = "300"
  }

  # Display name
  display_name = "CODITECT ${var.environment} Redis Cache"

  # Labels
  labels = {
    environment = var.environment
    managed_by  = "terraform"
  }
}

# Outputs
output "redis_host" {
  value       = google_redis_instance.cache.host
  description = "Redis instance host"
  sensitive   = true
}

output "redis_port" {
  value       = google_redis_instance.cache.port
  description = "Redis instance port"
}

output "redis_connection_string" {
  value       = "redis://${google_redis_instance.cache.host}:${google_redis_instance.cache.port}"
  description = "Redis connection string"
  sensitive   = true
}
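A back-of-envelope check that 5 GB comfortably covers the 10,000-session target. The per-session payload and overhead factor are assumptions for illustration, not measured numbers; in practice most of the 5 GB is headroom for cached license-validation results:

```python
# Rough Redis memory budget for the session workload (assumed figures).
sessions = 10_000
bytes_per_session = 4 * 1024  # assumed: ~4 KB of session + license state
overhead = 1.5                # rough factor for Redis key/structure overhead

needed_gb = sessions * bytes_per_session * overhead / (1024 ** 3)
print(f"{needed_gb:.3f} GB needed of 5 GB provisioned")
```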

Phase 5: Cloud KMS (OpenTofu)

# infrastructure/terraform/modules/kms/main.tf

terraform {
required_version = ">= 1.5"

required_providers {
google = {
source = "hashicorp/google"
version = "~> 5.0"
}
}
}

variable "project_id" {
type = string
}

variable "region" {
type = string
default = "us-central1"
}

variable "environment" {
type = string
}

# KMS Keyring
resource "google_kms_key_ring" "keyring" {
name = "coditect-${var.environment}-keyring"
location = var.region
project = var.project_id
}

# Encryption key for tenant data
resource "google_kms_crypto_key" "tenant_data_key" {
name = "tenant-data-encryption"
key_ring = google_kms_key_ring.keyring.id

rotation_period = "7776000s" # 90 days

version_template {
algorithm = "GOOGLE_SYMMETRIC_ENCRYPTION"
}

lifecycle {
prevent_destroy = true
}
}

# Encryption key for database backups
resource "google_kms_crypto_key" "backup_key" {
name = "database-backup-encryption"
key_ring = google_kms_key_ring.keyring.id

rotation_period = "7776000s" # 90 days

version_template {
algorithm = "GOOGLE_SYMMETRIC_ENCRYPTION"
}

lifecycle {
prevent_destroy = true
}
}

# Service account for KMS access
resource "google_service_account" "kms_sa" {
account_id = "coditect-${var.environment}-kms"
display_name = "Service account for KMS operations (${var.environment})"
project = var.project_id
}

# IAM binding for encryption/decryption
resource "google_kms_crypto_key_iam_member" "tenant_data_encrypter_decrypter" {
crypto_key_id = google_kms_crypto_key.tenant_data_key.id
role = "roles/cloudkms.cryptoKeyEncrypterDecrypter"
member = "serviceAccount:${google_service_account.kms_sa.email}"
}

resource "google_kms_crypto_key_iam_member" "backup_encrypter_decrypter" {
crypto_key_id = google_kms_crypto_key.backup_key.id
role = "roles/cloudkms.cryptoKeyEncrypterDecrypter"
member = "serviceAccount:${google_service_account.kms_sa.email}"
}

# Outputs
output "keyring_id" {
value = google_kms_key_ring.keyring.id
description = "KMS keyring ID"
}

output "tenant_data_key_id" {
value = google_kms_crypto_key.tenant_data_key.id
description = "Tenant data encryption key ID"
}

output "backup_key_id" {
value = google_kms_crypto_key.backup_key.id
description = "Backup encryption key ID"
}

output "kms_service_account_email" {
value = google_service_account.kms_sa.email
description = "KMS service account email"
}
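KMS rotation periods are given in seconds, so it is worth confirming the `"7776000s"` value above really is the intended 90 days:

```python
# Verify the KMS rotation_period used above (seconds -> days).
rotation_seconds = 7_776_000
days = rotation_seconds / 86_400  # seconds per day
print(days)  # -> 90.0
```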

Phase 6: Kubernetes Manifests

Deployment

# kubernetes/deployments/django-api.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: django-api
  namespace: coditect
  labels:
    app: django-api
    version: v1
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: django-api
  template:
    metadata:
      labels:
        app: django-api
        version: v1
    spec:
      serviceAccountName: coditect-app-sa

      # Init container for migrations.
      # Caveat: ordinary init containers finish before the regular containers
      # start, so this cannot reach the Cloud SQL Proxy sidecar at 127.0.0.1.
      # Either run the proxy as a native sidecar (an initContainer with
      # restartPolicy: Always, Kubernetes 1.29+) or run migrations as a
      # separate Job.
      initContainers:
        - name: migrations
          image: gcr.io/PROJECT_ID/coditect-backend:VERSION
          command: ["python", "manage.py", "migrate", "--noinput"]
          env:
            - name: DB_HOST
              value: "127.0.0.1" # Cloud SQL Proxy sidecar
            - name: DB_PORT
              value: "5432"
            - name: DB_NAME
              valueFrom:
                secretKeyRef:
                  name: cloudsql-db-credentials
                  key: database
            - name: DB_USER
              valueFrom:
                secretKeyRef:
                  name: cloudsql-db-credentials
                  key: username
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: cloudsql-db-credentials
                  key: password
            - name: REDIS_HOST
              valueFrom:
                configMapKeyRef:
                  name: app-config
                  key: redis_host
            - name: REDIS_PORT
              valueFrom:
                configMapKeyRef:
                  name: app-config
                  key: redis_port

      containers:
        # Django API container
        - name: django-api
          image: gcr.io/PROJECT_ID/coditect-backend:VERSION
          ports:
            - containerPort: 8000
              name: http
              protocol: TCP
          env:
            - name: DB_HOST
              value: "127.0.0.1"
            - name: DB_PORT
              value: "5432"
            - name: DB_NAME
              valueFrom:
                secretKeyRef:
                  name: cloudsql-db-credentials
                  key: database
            - name: DB_USER
              valueFrom:
                secretKeyRef:
                  name: cloudsql-db-credentials
                  key: username
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: cloudsql-db-credentials
                  key: password
            - name: REDIS_HOST
              valueFrom:
                configMapKeyRef:
                  name: app-config
                  key: redis_host
            - name: REDIS_PORT
              valueFrom:
                configMapKeyRef:
                  name: app-config
                  key: redis_port
            - name: SECRET_KEY
              valueFrom:
                secretKeyRef:
                  name: django-secrets
                  key: secret_key
            - name: ENVIRONMENT
              value: "production"

          # Resource limits
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "1Gi"
              cpu: "1000m"

          # Liveness probe
          livenessProbe:
            httpGet:
              path: /health/liveness
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3

          # Readiness probe
          readinessProbe:
            httpGet:
              path: /health/readiness
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3

        # Cloud SQL Proxy sidecar
        - name: cloud-sql-proxy
          image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.8.0
          args:
            - "--private-ip"
            - "PROJECT_ID:REGION:INSTANCE_NAME"
          securityContext:
            runAsNonRoot: true
          resources:
            requests:
              memory: "128Mi"
              cpu: "100m"
            limits:
              memory: "256Mi"
              cpu: "200m"

      # Pod security
      securityContext:
        fsGroup: 1000
        runAsNonRoot: true
        runAsUser: 1000
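The probe settings above determine how quickly a bad pod is detected. Roughly, a hung container is declared failed after `failureThreshold` consecutive failing checks, each separated by `periodSeconds` and allowed up to `timeoutSeconds`. A quick estimate of the worst-case windows:

```python
# Approximate worst-case detection windows implied by the probes above.
liveness = {"periodSeconds": 10, "timeoutSeconds": 5, "failureThreshold": 3}
readiness = {"periodSeconds": 5, "timeoutSeconds": 3, "failureThreshold": 3}

def max_detection_seconds(p: dict) -> int:
    # failureThreshold consecutive failures, one per period, plus the final
    # check's timeout (a rough upper bound, ignoring scheduling jitter).
    return p["failureThreshold"] * p["periodSeconds"] + p["timeoutSeconds"]

print(max_detection_seconds(liveness))   # ~35 s before kubelet restarts the container
print(max_detection_seconds(readiness))  # ~18 s before the pod leaves Service endpoints
```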

Service

# kubernetes/services/django-api.yaml

apiVersion: v1
kind: Service
metadata:
  name: django-api
  namespace: coditect
  labels:
    app: django-api
spec:
  type: ClusterIP
  selector:
    app: django-api
  ports:
    - port: 80
      targetPort: 8000
      protocol: TCP
      name: http
  sessionAffinity: None

Ingress

# kubernetes/ingress/django-api-ingress.yaml

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: django-api-ingress
  namespace: coditect
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
    nginx.ingress.kubernetes.io/limit-rps: "100"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - api.coditect.ai
      secretName: django-api-tls
  rules:
    - host: api.coditect.ai
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: django-api
                port:
                  number: 80

HorizontalPodAutoscaler

# kubernetes/autoscaling/django-api-hpa.yaml

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: django-api-hpa
  namespace: coditect
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: django-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
        - type: Pods
          value: 2
          periodSeconds: 60
      selectPolicy: Min
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 50
          periodSeconds: 30
        - type: Pods
          value: 3
          periodSeconds: 30
      selectPolicy: Max
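The HPA's sizing decision follows the standard Kubernetes formula, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the min/max above (the behavior policies then rate-limit how fast it moves toward that number):

```python
import math

# Desired-replica calculation used by the HPA controller, clamped to the
# minReplicas/maxReplicas configured above.
def desired_replicas(current: int, current_util: float, target_util: float,
                     min_r: int = 3, max_r: int = 20) -> int:
    desired = math.ceil(current * current_util / target_util)
    return max(min_r, min(max_r, desired))

print(desired_replicas(3, 95, 70))   # CPU spike to 95% against a 70% target -> 5
print(desired_replicas(10, 20, 70))  # quiet period scales back toward minReplicas -> 3
```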

ConfigMap

# kubernetes/configmaps/app-config.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
  namespace: coditect
data:
  redis_host: "10.20.0.3" # Redis Memorystore private IP
  redis_port: "6379"
  log_level: "INFO"
  django_settings_module: "config.settings.production"

Secrets (Template)

# kubernetes/secrets/django-secrets.yaml (template)

apiVersion: v1
kind: Secret
metadata:
  name: django-secrets
  namespace: coditect
type: Opaque
stringData:
  secret_key: "CHANGE_ME_GENERATE_RANDOM_STRING"

---
apiVersion: v1
kind: Secret
metadata:
  name: cloudsql-db-credentials
  namespace: coditect
type: Opaque
stringData:
  database: "coditect"
  username: "coditect_app"
  password: "CHANGE_ME_RETRIEVE_FROM_SECRET_MANAGER"

Phase 7: Monitoring & Observability

Prometheus ServiceMonitor

# kubernetes/monitoring/servicemonitor.yaml

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: django-api-metrics
  namespace: coditect
  labels:
    app: django-api
spec:
  selector:
    matchLabels:
      app: django-api
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s

Grafana Dashboard (JSON)

{
  "dashboard": {
    "title": "CODITECT API Metrics",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          { "expr": "rate(django_http_requests_total[5m])" }
        ]
      },
      {
        "title": "Request Latency (p95)",
        "targets": [
          { "expr": "histogram_quantile(0.95, rate(django_http_request_duration_seconds_bucket[5m]))" }
        ]
      },
      {
        "title": "Error Rate",
        "targets": [
          { "expr": "rate(django_http_requests_total{status=~\"5..\"}[5m])" }
        ]
      },
      {
        "title": "Active License Sessions",
        "targets": [
          { "expr": "sum(tenant_license_sessions{status=\"active\"})" }
        ]
      }
    ]
  }
}

Cost Analysis

Development Environment

GKE Cluster:

  • 3 nodes (n2-standard-2): $150/month
  • Total: $150/month

Cloud SQL PostgreSQL:

  • db-custom-1-3840 (1 vCPU, 3.75 GB): $80/month
  • Storage 50 GB: $17/month
  • Total: $97/month

Redis Memorystore:

  • Basic tier, 1 GB: $50/month
  • Total: $50/month

Cloud KMS:

  • Key storage: $1/month
  • Operations: $5/month (estimated)
  • Total: $6/month

Networking:

  • Cloud Load Balancer: $20/month
  • Egress: $10/month (estimated)
  • Total: $30/month

Cloud Monitoring/Logging:

  • Logs: $20/month
  • Metrics: $10/month
  • Total: $30/month

Development Total: $363/month


Production Environment (Launch)

GKE Cluster:

  • 6 nodes (n2-standard-4): $600/month
  • Total: $600/month

Cloud SQL PostgreSQL:

  • db-custom-2-7680 (2 vCPU, 7.5 GB) HA: $320/month
  • Storage 100 GB: $34/month
  • Backups 100 GB: $10/month
  • Total: $364/month

Redis Memorystore:

  • Standard HA tier, 5 GB: $200/month
  • Total: $200/month

Cloud KMS:

  • Key storage: $1/month
  • Operations: $10/month
  • Total: $11/month

Networking:

  • Cloud Load Balancer: $30/month
  • Egress: $50/month (estimated)
  • Total: $80/month

Cloud Monitoring/Logging:

  • Logs: $50/month
  • Metrics: $20/month
  • Total: $70/month

Production Total: $1,325/month


Production at Scale (10K Concurrent Sessions)

GKE Cluster:

  • 15 nodes (n2-standard-4): $1,500/month
  • Total: $1,500/month

Cloud SQL PostgreSQL:

  • db-custom-4-15360 (4 vCPU, 15 GB) HA: $640/month
  • Storage 500 GB: $170/month
  • Backups 500 GB: $50/month
  • Total: $860/month

Redis Memorystore:

  • Standard HA tier, 10 GB: $400/month
  • Total: $400/month

Cloud KMS:

  • Key storage: $1/month
  • Operations: $20/month
  • Total: $21/month

Networking:

  • Cloud Load Balancer: $50/month
  • Egress: $200/month
  • Total: $250/month

Cloud Monitoring/Logging:

  • Logs: $150/month
  • Metrics: $50/month
  • Total: $200/month

Production at Scale Total: $3,231/month
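
The per-environment totals above can be double-checked by summing their line items:

```python
# Roll-up of the per-environment estimates above (all figures US$/month).
dev = {"gke": 150, "cloudsql": 97, "redis": 50, "kms": 6,
       "network": 30, "monitoring": 30}
prod_launch = {"gke": 600, "cloudsql": 364, "redis": 200, "kms": 11,
               "network": 80, "monitoring": 70}
prod_scale = {"gke": 1500, "cloudsql": 860, "redis": 400, "kms": 21,
              "network": 250, "monitoring": 200}

for name, env in [("dev", dev), ("prod-launch", prod_launch),
                  ("prod-scale", prod_scale)]:
    print(f"{name}: ${sum(env.values()):,}/month")
```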


Consequences

Positive

Availability:

  • 99.9% uptime SLA achievable with regional GKE + Cloud SQL HA
  • Automatic failover for database and Redis
  • Zero-downtime deployments with rolling updates
  • Multi-zone redundancy protects against zone failures

Security:

  • Private VPC with no public IPs for resources
  • Encrypted at rest (Cloud SQL, Redis, secrets)
  • Encrypted in transit (TLS 1.3 everywhere)
  • Workload Identity for secure pod authentication

Scalability:

  • Horizontal pod autoscaling handles traffic spikes
  • GKE node autoscaling adjusts capacity automatically
  • Cloud SQL read replicas for read-heavy workloads
  • 10K+ concurrent sessions supported

Operational Efficiency:

  • Managed services reduce operational burden
  • Automatic backups and point-in-time recovery
  • Infrastructure as Code with Terraform
  • Cloud Monitoring integration for observability

Cost Efficiency:

  • Auto-scaling minimizes costs during low usage
  • Shared infrastructure across tenants
  • $1,325/month at launch (affordable for early stage)
  • Predictable scaling ($3,231/month at 10K sessions)

Negative

Vendor Lock-In:

  • ⚠️ GCP-specific infrastructure (not portable)
  • ⚠️ Cloud SQL proprietary features (not pure PostgreSQL)
  • ⚠️ GKE custom configurations (not vanilla Kubernetes)
  • ⚠️ Migration costs high if switching cloud providers

Complexity:

  • ⚠️ Terraform state management requires careful coordination
  • ⚠️ Multi-component deployments increase failure surface area
  • ⚠️ Networking complexity (VPC peering, private IPs, NAT)
  • ⚠️ Troubleshooting requires GCP-specific knowledge

Cost:

  • ⚠️ Higher than single VM ($1,325/month vs. $100/month)
  • ⚠️ Egress costs can be unpredictable
  • ⚠️ Monitoring costs scale with usage
  • ⚠️ Idle costs during low traffic (still pay for HA resources)

Limitations:

  • ⚠️ Cloud SQL limits (max 96 vCPU, 624 GB RAM per instance)
  • ⚠️ Redis memory limits (max 300 GB per instance)
  • ⚠️ GKE quotas (max nodes per cluster, max pods per node)
  • ⚠️ Regional lock (cannot span multiple regions easily)

Mitigation Strategies

Vendor Lock-In:

  • Use standard Kubernetes manifests (portable to other clouds)
  • Abstract GCP-specific features behind interfaces
  • Document migration path to AWS/Azure
  • Use open standards where possible (Prometheus, Grafana)

Complexity:

  • Comprehensive runbooks for common operations
  • Automated deployment pipelines (GitHub Actions)
  • Infrastructure testing with Terratest
  • Regular disaster recovery drills

Cost:

  • Set up billing alerts and budgets
  • Optimize resource allocation based on metrics
  • Use committed use discounts for predictable workloads
  • Regular cost audits

Limitations:

  • Plan for sharding/multi-database strategy before hitting limits
  • Evaluate Cloud Spanner if horizontal scaling needs exceed Cloud SQL limits
  • Implement circuit breakers for graceful degradation
  • Monitor quotas and request increases proactively
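A minimal circuit breaker for the graceful-degradation item above (a sketch; the thresholds and class name are assumptions, not the team's implementation):

```python
import time


class CircuitBreaker:
    """Fail fast after repeated errors, then retry after a cooldown.

    closed -> calls pass through
    open   -> calls are rejected immediately until `reset_after` elapses,
              then one trial call is allowed through (half-open)
    """

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

Wrapping Cloud SQL or Redis calls this way keeps the API responsive (returning a degraded response quickly) instead of piling up timeouts when a dependency hits its limits.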

Alternatives Considered

Alternative 1: Single VM Deployment

Architecture:

  • Single GCE VM (n2-standard-4)
  • PostgreSQL installed on VM
  • Redis installed on VM
  • Nginx + Gunicorn for Django

Pros:

  • ✅ Simplest architecture
  • ✅ Lowest cost ($100/month)
  • ✅ Easy to debug
  • ✅ No Kubernetes complexity

Cons:

  • ❌ Single point of failure (no HA)
  • ❌ Manual scaling required
  • ❌ Downtime for deployments
  • ❌ Cannot meet 99.9% SLA
  • ❌ Limited to vertical scaling only

Decision: Rejected due to availability requirements and inability to meet SLA.


Alternative 2: Cloud Run + Cloud SQL

Architecture:

  • Cloud Run for Django API (serverless containers)
  • Cloud SQL PostgreSQL (managed)
  • Redis Memorystore (managed)

Pros:

  • ✅ Serverless (zero idle costs)
  • ✅ Automatic scaling to zero
  • ✅ Simple deployment
  • ✅ Lower management overhead

Cons:

  • ❌ Cold start latency (100-500ms)
  • ❌ 15-minute request timeout limit
  • ❌ No persistent connections (Cloud SQL Proxy overhead)
  • ❌ Difficult to run background workers
  • ❌ Less control over networking

Decision: Rejected due to cold start latency and timeout limits.


Alternative 3: AWS EKS + RDS

Architecture:

  • AWS EKS (Kubernetes)
  • AWS RDS PostgreSQL
  • AWS ElastiCache Redis
  • AWS KMS

Pros:

  • ✅ Similar architecture to GKE
  • ✅ Broader AWS ecosystem
  • ✅ More regions available
  • ✅ Potentially lower costs (Reserved Instances)

Cons:

  • ❌ Team has GCP experience, not AWS
  • ❌ Migration cost from existing GCP
  • ❌ AWS-specific learning curve
  • ❌ Different pricing model

Decision: Rejected due to existing GCP investment and team expertise.


Alternative 4: Multi-Cloud Deployment

Architecture:

  • Kubernetes on GCP + AWS + Azure
  • PostgreSQL replication across clouds
  • Global load balancing

Pros:

  • ✅ No vendor lock-in
  • ✅ Maximum availability
  • ✅ Geographic distribution
  • ✅ Disaster recovery across clouds

Cons:

  • ❌ Extreme complexity
  • ❌ 3-5x operational burden
  • ❌ Data replication costs high
  • ❌ Latency for cross-cloud queries
  • ❌ Overkill for current scale

Decision: Rejected as over-engineering for current needs. Revisit at 100K+ users.


Related ADRs

ADR-002: Redis Caching Strategy

Defines Redis usage patterns for session management and caching.

Relationship: ADR-009 provides managed Redis Memorystore infrastructure for patterns defined in ADR-002.
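On top of Memorystore, the cache-aside pattern from ADR-002 looks roughly like this (a sketch; a small in-memory class stands in for the Redis client, and `fetch_license_from_db` is a hypothetical loader, not the real query):

```python
import time

CACHE_TTL_SECONDS = 300  # assumed TTL; ADR-002 defines the real policy


class FakeRedis:
    """In-memory stand-in for the Memorystore client (same get/setex shape)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        value, expires = self._data.get(key, (None, 0))
        return value if time.monotonic() < expires else None

    def setex(self, key, ttl, value):
        self._data[key] = (value, time.monotonic() + ttl)


def fetch_license_from_db(license_key):
    # Hypothetical loader; in production this is a Cloud SQL query.
    return f"license-record-for-{license_key}"


def validate_license(redis, license_key):
    """Cache-aside: check Redis first, fall back to the database on a miss."""
    cache_key = f"license:{license_key}"
    cached = redis.get(cache_key)
    if cached is not None:
        return cached
    record = fetch_license_from_db(license_key)
    redis.setex(cache_key, CACHE_TTL_SECONDS, record)
    return record
```

Serving repeat validations from Redis is what keeps the p99 latency target reachable at 1M+ validations/day without scaling Cloud SQL for every read.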


ADR-006: Secret Management Strategy

Defines encryption key management with Cloud KMS.

Relationship: ADR-009 provisions Cloud KMS infrastructure for secret management defined in ADR-006.


ADR-007: Django Multi-Tenant Architecture

Defines Cloud SQL PostgreSQL requirements for Row-Level Security.

Relationship: ADR-009 provides Cloud SQL PostgreSQL 15 with RLS support required by ADR-007.


Implementation Timeline

Phase 1: Infrastructure Foundation (Week 1-2)

  • ✅ VPC network setup
  • ✅ GKE cluster provisioning
  • ✅ Cloud SQL instance creation
  • ✅ Redis Memorystore setup

Phase 2: Security & Compliance (Week 3)

  • ✅ Cloud KMS configuration
  • ✅ Service accounts and IAM
  • ✅ Network policies
  • ✅ Secret management

Phase 3: Application Deployment (Week 4)

  • ✅ Kubernetes manifests
  • ✅ CI/CD pipeline
  • ✅ Initial deployment
  • ✅ DNS configuration

Phase 4: Monitoring & Operations (Week 5)

  • ✅ Prometheus setup
  • ✅ Grafana dashboards
  • ✅ Alerting rules
  • ✅ Runbooks

Phase 5: Testing & Validation (Week 6)

  • ✅ Load testing
  • ✅ Failover testing
  • ✅ Disaster recovery drill
  • ✅ Security audit
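Load-test results from Phase 5 can be checked against the <100ms p99 latency target with a simple percentile helper (a sketch; this uses the nearest-rank method, which monitoring stacks may compute slightly differently):

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value >= pct% of the samples."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]


def meets_latency_slo(latencies_ms, p99_budget_ms=100.0):
    """True if the p99 of the measured latencies is under budget."""
    return percentile(latencies_ms, 99) < p99_budget_ms
```

Feeding the per-request latencies captured during the load test into `meets_latency_slo` gives a pass/fail gate that can run in the CI pipeline after each drill.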

Last Updated: 2025-11-30
Review Date: 2026-02-28
Status: Accepted ✅