
ADR-009: GCP Infrastructure Architecture

Status: Accepted
Date: 2025-11-30
Deciders: Hal Casteel (CTO), Architecture Team
Related ADRs:


Context

The CODITECT license server requires production-ready infrastructure to support multi-tenant SaaS operations with high availability, security, and cost efficiency.

Business Requirements

Scale Targets:

  • Support 10,000 concurrent license sessions
  • Handle 1M+ license validations per day
  • Serve 1,000+ tenant organizations
  • 100,000+ total registered users

Availability Requirements:

  • 99.9% uptime SLA (8.76 hours downtime/year max)
  • <100ms license validation latency (p99)
  • Zero-downtime deployments for application updates
  • Automated failover for database failures
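The availability target translates directly into a downtime budget. A quick sanity check of the 99.9% figure (pure arithmetic, no GCP specifics):

```python
# Downtime budget implied by an availability SLA.
HOURS_PER_YEAR = 365 * 24  # 8760

def downtime_budget_hours(sla: float, hours: float = HOURS_PER_YEAR) -> float:
    """Maximum allowed downtime for a given availability target."""
    return hours * (1 - sla)

print(downtime_budget_hours(0.999))       # yearly budget for 99.9% -> 8.76 h
print(downtime_budget_hours(0.999, 730))  # ~monthly budget (730 h/month) -> ~44 min
```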

Security Requirements:

  • Encrypted data at rest (database, Redis, secrets)
  • Encrypted data in transit (TLS 1.3 for all connections)
  • Network isolation (private VPC, no public IPs)
  • Compliance: SOC 2 Type II, GDPR, HIPAA-ready

Cost Requirements:

  • Development environment: <$500/month
  • Production environment: <$2,000/month at launch
  • Auto-scaling to minimize costs during low usage
  • Reserved instances for predictable workloads

Technical Constraints

Platform Requirements:

  • Google Cloud Platform (GCP) - existing organization account
  • Kubernetes for container orchestration
  • PostgreSQL 15+ for database (RLS support required)
  • Redis for caching and session management

Deployment Environment:

  • Geographic regions: us-central1 (primary), us-east1 (DR)
  • Multi-AZ deployment for high availability
  • Terraform/OpenTofu for infrastructure as code
  • GitHub Actions for CI/CD

Integration Requirements:

  • Cloud SQL for managed PostgreSQL
  • Redis Memorystore for managed Redis
  • Cloud KMS for encryption key management
  • Cloud Load Balancing for ingress
  • Cloud Monitoring for observability

Decision

We will deploy on Google Kubernetes Engine (GKE) with Cloud SQL PostgreSQL, Redis Memorystore, and Cloud KMS for production infrastructure.

Core Architecture

Infrastructure Components

1. Google Kubernetes Engine (GKE)

Regional GKE cluster with 3 zones for high availability, node auto-scaling, and automatic cluster upgrades.

2. Cloud SQL PostgreSQL

Managed PostgreSQL 15 with High Availability (HA) configuration, automatic failover, and point-in-time recovery.

3. Redis Memorystore

Managed Redis cluster (Standard Tier) with automatic failover and backup.

4. Cloud KMS

Managed encryption key service for encrypting sensitive tenant data.

5. Cloud Load Balancer

Global HTTPS load balancer with SSL termination and DDoS protection.


Implementation

Phase 1: Network Infrastructure (OpenTofu)

VPC Network

# infrastructure/terraform/modules/vpc/main.tf

terraform {
  required_version = ">= 1.5"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

variable "project_id" {
  type        = string
  description = "GCP project ID"
}

variable "region" {
  type        = string
  description = "GCP region"
  default     = "us-central1"
}

variable "environment" {
  type        = string
  description = "Environment name (dev, staging, prod)"
  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Environment must be dev, staging, or prod."
  }
}

# VPC Network
resource "google_compute_network" "vpc" {
  name                    = "coditect-${var.environment}-vpc"
  auto_create_subnetworks = false
  project                 = var.project_id
}

# GKE Subnet
resource "google_compute_subnetwork" "gke_subnet" {
  name          = "coditect-${var.environment}-gke-subnet"
  ip_cidr_range = "10.0.0.0/20"
  region        = var.region
  network       = google_compute_network.vpc.id
  project       = var.project_id

  # Secondary ranges for GKE pods and services
  secondary_ip_range {
    range_name    = "pods"
    ip_cidr_range = "10.1.0.0/16"
  }

  secondary_ip_range {
    range_name    = "services"
    ip_cidr_range = "10.2.0.0/20"
  }

  # Enable Private Google Access for managed services
  private_ip_google_access = true
}

# Cloud SQL Subnet
resource "google_compute_subnetwork" "cloudsql_subnet" {
  name          = "coditect-${var.environment}-cloudsql-subnet"
  ip_cidr_range = "10.10.0.0/24"
  region        = var.region
  network       = google_compute_network.vpc.id
  project       = var.project_id

  private_ip_google_access = true
}

# Redis Subnet
resource "google_compute_subnetwork" "redis_subnet" {
  name          = "coditect-${var.environment}-redis-subnet"
  ip_cidr_range = "10.20.0.0/24"
  region        = var.region
  network       = google_compute_network.vpc.id
  project       = var.project_id

  private_ip_google_access = true
}

# Cloud Router for NAT
resource "google_compute_router" "router" {
  name    = "coditect-${var.environment}-router"
  region  = var.region
  network = google_compute_network.vpc.id
  project = var.project_id
}

# Cloud NAT for outbound internet access
resource "google_compute_router_nat" "nat" {
  name                               = "coditect-${var.environment}-nat"
  router                             = google_compute_router.router.name
  region                             = var.region
  nat_ip_allocate_option             = "AUTO_ONLY"
  source_subnetwork_ip_ranges_to_nat = "ALL_SUBNETWORKS_ALL_IP_RANGES"
  project                            = var.project_id

  log_config {
    enable = true
    filter = "ERRORS_ONLY"
  }
}

# Firewall Rules
resource "google_compute_firewall" "allow_internal" {
  name    = "coditect-${var.environment}-allow-internal"
  network = google_compute_network.vpc.name
  project = var.project_id

  allow {
    protocol = "tcp"
    ports    = ["0-65535"]
  }

  allow {
    protocol = "udp"
    ports    = ["0-65535"]
  }

  allow {
    protocol = "icmp"
  }

  source_ranges = [
    "10.0.0.0/20",  # GKE subnet
    "10.1.0.0/16",  # Pods
    "10.2.0.0/20",  # Services
    "10.10.0.0/24", # Cloud SQL
    "10.20.0.0/24", # Redis
  ]
}

resource "google_compute_firewall" "allow_health_checks" {
  name    = "coditect-${var.environment}-allow-health-checks"
  network = google_compute_network.vpc.name
  project = var.project_id

  allow {
    protocol = "tcp"
    ports    = ["80", "443", "8080"]
  }

  # Google Cloud health check IP ranges
  source_ranges = [
    "35.191.0.0/16",
    "130.211.0.0/22",
  ]

  target_tags = ["gke-node"]
}

resource "google_compute_firewall" "deny_all_ingress" {
  name     = "coditect-${var.environment}-deny-all-ingress"
  network  = google_compute_network.vpc.name
  project  = var.project_id
  priority = 65534

  deny {
    protocol = "all"
  }

  source_ranges = ["0.0.0.0/0"]
}

# Outputs
output "vpc_id" {
  value       = google_compute_network.vpc.id
  description = "VPC network ID"
}

output "gke_subnet_name" {
  value       = google_compute_subnetwork.gke_subnet.name
  description = "GKE subnet name"
}

output "gke_subnet_secondary_range_pods" {
  value       = google_compute_subnetwork.gke_subnet.secondary_ip_range[0].range_name
  description = "Secondary IP range name for pods"
}

output "gke_subnet_secondary_range_services" {
  value       = google_compute_subnetwork.gke_subnet.secondary_ip_range[1].range_name
  description = "Secondary IP range name for services"
}
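The CIDR plan above can be sanity-checked mechanically: the five ranges must not overlap, and each must be large enough for its role. A minimal check using only the standard library:

```python
import ipaddress

# CIDR ranges from the VPC module above.
ranges = {
    "gke_nodes": "10.0.0.0/20",
    "pods": "10.1.0.0/16",
    "services": "10.2.0.0/20",
    "cloudsql": "10.10.0.0/24",
    "redis": "10.20.0.0/24",
}

# Address capacity per range (GCP reserves a few addresses per subnet,
# ignored here for a rough check).
for name, cidr in ranges.items():
    net = ipaddress.ip_network(cidr)
    print(f"{name}: {net.num_addresses} addresses")

# Overlapping ranges would break routing inside the VPC.
nets = [ipaddress.ip_network(c) for c in ranges.values()]
assert all(not a.overlaps(b) for i, a in enumerate(nets) for b in nets[i + 1:])
```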

Phase 2: GKE Cluster (OpenTofu)

# infrastructure/terraform/modules/gke/main.tf

terraform {
  required_version = ">= 1.5"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

variable "project_id" {
  type = string
}

variable "region" {
  type    = string
  default = "us-central1"
}

variable "environment" {
  type = string
}

variable "vpc_name" {
  type = string
}

variable "subnet_name" {
  type = string
}

variable "pods_range_name" {
  type = string
}

variable "services_range_name" {
  type = string
}

variable "node_machine_type" {
  type    = string
  default = "n2-standard-4"
}

variable "min_node_count" {
  type    = number
  default = 1
}

variable "max_node_count" {
  type    = number
  default = 10
}

# GKE Cluster
resource "google_container_cluster" "primary" {
  name     = "coditect-${var.environment}-gke"
  location = var.region
  project  = var.project_id

  # Regional cluster (multi-zone by default)
  # Kubernetes version
  min_master_version = "1.28"

  # We can't create a cluster with no node pool defined, but we want to only use
  # separately managed node pools. So we create the smallest possible default
  # node pool and immediately delete it.
  remove_default_node_pool = true
  initial_node_count       = 1

  # Network configuration
  network    = var.vpc_name
  subnetwork = var.subnet_name

  # IP allocation policy for VPC-native cluster
  ip_allocation_policy {
    cluster_secondary_range_name  = var.pods_range_name
    services_secondary_range_name = var.services_range_name
  }

  # Private cluster configuration
  private_cluster_config {
    enable_private_nodes    = true
    enable_private_endpoint = false # Set to true for fully private
    master_ipv4_cidr_block  = "172.16.0.0/28"
  }

  # Master authorized networks (for kubectl access)
  master_authorized_networks_config {
    cidr_blocks {
      cidr_block   = "0.0.0.0/0" # TODO: Restrict to office/VPN IPs in prod
      display_name = "All"
    }
  }

  # Workload Identity for secure pod authentication
  workload_identity_config {
    workload_pool = "${var.project_id}.svc.id.goog"
  }

  # Add-ons
  addons_config {
    http_load_balancing {
      disabled = false
    }

    horizontal_pod_autoscaling {
      disabled = false
    }

    network_policy_config {
      disabled = false
    }

    gcp_filestore_csi_driver_config {
      enabled = false
    }
  }

  # Network policy (the provider must be CALICO when network policy is enabled)
  network_policy {
    enabled  = true
    provider = "CALICO"
  }

  # Binary Authorization (for image security)
  binary_authorization {
    evaluation_mode = "PROJECT_SINGLETON_POLICY_ENFORCE"
  }

  # Maintenance window
  maintenance_policy {
    daily_maintenance_window {
      start_time = "03:00" # 3 AM UTC
    }
  }

  # Monitoring and logging
  logging_service    = "logging.googleapis.com/kubernetes"
  monitoring_service = "monitoring.googleapis.com/kubernetes"

  # Resource labels
  resource_labels = {
    environment = var.environment
    managed_by  = "terraform"
  }
}

# Node Pool
resource "google_container_node_pool" "primary_nodes" {
  name     = "coditect-${var.environment}-node-pool"
  location = var.region
  cluster  = google_container_cluster.primary.name
  project  = var.project_id

  # Auto-scaling configuration
  autoscaling {
    min_node_count = var.min_node_count
    max_node_count = var.max_node_count
  }

  # Auto-repair and auto-upgrade
  management {
    auto_repair  = true
    auto_upgrade = true
  }

  # Node configuration
  node_config {
    machine_type = var.node_machine_type
    disk_size_gb = 100
    disk_type    = "pd-standard"

    # Service account for nodes
    service_account = google_service_account.gke_node_sa.email
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]

    # Workload Identity
    workload_metadata_config {
      mode = "GKE_METADATA"
    }

    # Security
    shielded_instance_config {
      enable_secure_boot          = true
      enable_integrity_monitoring = true
    }

    # Network tags
    tags = ["gke-node", "coditect-${var.environment}"]

    # Labels
    labels = {
      environment = var.environment
    }

    # Metadata
    metadata = {
      disable-legacy-endpoints = "true"
    }
  }
}

# Service Account for GKE Nodes
resource "google_service_account" "gke_node_sa" {
  account_id   = "coditect-${var.environment}-gke-node"
  display_name = "Service account for GKE nodes (${var.environment})"
  project      = var.project_id
}

# IAM bindings for node service account
resource "google_project_iam_member" "gke_node_log_writer" {
  project = var.project_id
  role    = "roles/logging.logWriter"
  member  = "serviceAccount:${google_service_account.gke_node_sa.email}"
}

resource "google_project_iam_member" "gke_node_metric_writer" {
  project = var.project_id
  role    = "roles/monitoring.metricWriter"
  member  = "serviceAccount:${google_service_account.gke_node_sa.email}"
}

resource "google_project_iam_member" "gke_node_monitoring_viewer" {
  project = var.project_id
  role    = "roles/monitoring.viewer"
  member  = "serviceAccount:${google_service_account.gke_node_sa.email}"
}

# Outputs
output "cluster_name" {
  value       = google_container_cluster.primary.name
  description = "GKE cluster name"
}

output "cluster_endpoint" {
  value       = google_container_cluster.primary.endpoint
  description = "GKE cluster endpoint"
  sensitive   = true
}

output "cluster_ca_certificate" {
  value       = google_container_cluster.primary.master_auth[0].cluster_ca_certificate
  description = "GKE cluster CA certificate"
  sensitive   = true
}
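One constraint worth checking against the VPC module: GKE carves the pods secondary range into one CIDR block per node (by default a /24, which backs the default 110-pods-per-node limit), so the pods range caps cluster size. For the /16 range above:

```python
import ipaddress

# Secondary "pods" range from the VPC module.
pods_range = ipaddress.ip_network("10.1.0.0/16")

# GKE default: one /24 pod CIDR per node (supports the default 110 pods/node).
per_node_prefix = 24

max_nodes = 2 ** (per_node_prefix - pods_range.prefixlen)
print(max_nodes)  # node slots the pods range supports -> 256, well above max_node_count = 10
```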

Phase 3: Cloud SQL PostgreSQL (OpenTofu)

# infrastructure/terraform/modules/cloudsql/main.tf

terraform {
  required_version = ">= 1.5"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
    random = {
      source  = "hashicorp/random"
      version = "~> 3.5"
    }
  }
}

variable "project_id" {
  type = string
}

variable "region" {
  type    = string
  default = "us-central1"
}

variable "environment" {
  type = string
}

variable "vpc_id" {
  type        = string
  description = "VPC network ID for private IP"
}

variable "database_version" {
  type    = string
  default = "POSTGRES_15"
}

variable "tier" {
  type        = string
  description = "Cloud SQL tier (e.g., db-custom-2-7680)"
  default     = "db-custom-2-7680" # 2 vCPU, 7.5 GB RAM
}

variable "disk_size" {
  type        = number
  description = "Disk size in GB"
  default     = 100
}

variable "backup_enabled" {
  type    = bool
  default = true
}

variable "high_availability" {
  type    = bool
  default = true
}

# Random password for the application database user
resource "random_password" "db_password" {
  length  = 32
  special = true
}

# Separate password for the admin (superuser) account
resource "random_password" "admin_password" {
  length  = 32
  special = true
}

# Private IP allocation for Cloud SQL
resource "google_compute_global_address" "private_ip_address" {
  name          = "coditect-${var.environment}-cloudsql-ip"
  purpose       = "VPC_PEERING"
  address_type  = "INTERNAL"
  prefix_length = 16
  network       = var.vpc_id
  project       = var.project_id
}

# VPC peering connection
resource "google_service_networking_connection" "private_vpc_connection" {
  network                 = var.vpc_id
  service                 = "servicenetworking.googleapis.com"
  reserved_peering_ranges = [google_compute_global_address.private_ip_address.name]
}

# Cloud SQL Instance
resource "google_sql_database_instance" "main" {
  name             = "coditect-${var.environment}-db"
  database_version = var.database_version
  region           = var.region
  project          = var.project_id

  depends_on = [google_service_networking_connection.private_vpc_connection]

  settings {
    tier              = var.tier
    availability_type = var.high_availability ? "REGIONAL" : "ZONAL"
    disk_type         = "PD_SSD"
    disk_size         = var.disk_size
    disk_autoresize   = true

    # Backup configuration
    backup_configuration {
      enabled                        = var.backup_enabled
      start_time                     = "02:00"
      point_in_time_recovery_enabled = true
      transaction_log_retention_days = 7

      backup_retention_settings {
        retained_backups = 30
        retention_unit   = "COUNT"
      }
    }

    # Maintenance window
    maintenance_window {
      day          = 7 # Sunday
      hour         = 3 # 3 AM
      update_track = "stable"
    }

    # IP configuration
    ip_configuration {
      ipv4_enabled    = false # Private IP only
      private_network = var.vpc_id
      require_ssl     = true

      # No authorized networks (private only)
    }

    # Database flags for PostgreSQL.
    # Note: shared_buffers and effective_cache_size are expressed in 8KB pages;
    # work_mem and maintenance_work_mem are expressed in KB.
    database_flags {
      name  = "max_connections"
      value = "200"
    }

    database_flags {
      name  = "shared_buffers"
      value = "262144" # 2 GB (~25% of 7.5 GB RAM, in 8KB pages)
    }

    database_flags {
      name  = "work_mem"
      value = "32768" # 32 MB (in KB)
    }

    database_flags {
      name  = "maintenance_work_mem"
      value = "524288" # 512 MB (in KB)
    }

    database_flags {
      name  = "effective_cache_size"
      value = "720896" # 5.5 GB (~75% of RAM, in 8KB pages)
    }

    database_flags {
      name  = "random_page_cost"
      value = "1.1" # SSD optimization
    }

    database_flags {
      name  = "log_statement"
      value = "ddl" # Log DDL only
    }

    database_flags {
      name  = "log_min_duration_statement"
      value = "1000" # Log queries >1s
    }

    # Insights configuration
    insights_config {
      query_insights_enabled  = true
      query_string_length     = 1024
      record_application_tags = true
      record_client_address   = true
    }
  }

  # Deletion protection
  deletion_protection = var.environment == "prod" ? true : false
}

# Database
resource "google_sql_database" "database" {
  name     = "coditect"
  instance = google_sql_database_instance.main.name
  project  = var.project_id
}

# Database users
resource "google_sql_user" "app_user" {
  name     = "coditect_app"
  instance = google_sql_database_instance.main.name
  password = random_password.db_password.result
  project  = var.project_id
}

resource "google_sql_user" "admin_user" {
  name     = "postgres"
  instance = google_sql_database_instance.main.name
  password = random_password.admin_password.result
  project  = var.project_id
}

# Store database passwords in Secret Manager
resource "google_secret_manager_secret" "db_password" {
  secret_id = "coditect-${var.environment}-db-password"
  project   = var.project_id

  replication {
    auto {}
  }
}

resource "google_secret_manager_secret_version" "db_password" {
  secret      = google_secret_manager_secret.db_password.id
  secret_data = random_password.db_password.result
}

resource "google_secret_manager_secret" "admin_password" {
  secret_id = "coditect-${var.environment}-db-admin-password"
  project   = var.project_id

  replication {
    auto {}
  }
}

resource "google_secret_manager_secret_version" "admin_password" {
  secret      = google_secret_manager_secret.admin_password.id
  secret_data = random_password.admin_password.result
}

# Outputs
output "instance_name" {
  value       = google_sql_database_instance.main.name
  description = "Cloud SQL instance name"
}

output "instance_connection_name" {
  value       = google_sql_database_instance.main.connection_name
  description = "Cloud SQL instance connection name"
}

output "private_ip_address" {
  value       = google_sql_database_instance.main.private_ip_address
  description = "Private IP address of Cloud SQL instance"
  sensitive   = true
}

output "database_name" {
  value       = google_sql_database.database.name
  description = "Database name"
}

output "app_user" {
  value       = google_sql_user.app_user.name
  description = "Application database user"
}

output "db_password_secret_id" {
  value       = google_secret_manager_secret.db_password.secret_id
  description = "Secret Manager secret ID for database password"
}
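The memory-related flags above use PostgreSQL's native units: `shared_buffers` and `effective_cache_size` are counts of 8KB pages, while `work_mem` and `maintenance_work_mem` are in KB. A small helper avoids the factor-of-two mistakes that creep in when converting by hand:

```python
# PostgreSQL flag unit conversions: shared_buffers / effective_cache_size are
# in 8 KB pages; work_mem / maintenance_work_mem are in KB.
PAGE_KB = 8

def gb_to_pages(gb: float) -> int:
    """Gigabytes -> 8KB-page count (for shared_buffers, effective_cache_size)."""
    return int(gb * 1024 * 1024 // PAGE_KB)

def mb_to_kb(mb: int) -> int:
    """Megabytes -> KB (for work_mem, maintenance_work_mem)."""
    return mb * 1024

print(gb_to_pages(2))    # shared_buffers = 2 GB -> 262144 pages
print(gb_to_pages(5.5))  # effective_cache_size = 5.5 GB -> 720896 pages
print(mb_to_kb(32))      # work_mem = 32 MB -> 32768 KB
```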

Phase 4: Redis Memorystore (OpenTofu)

# infrastructure/terraform/modules/redis/main.tf

terraform {
  required_version = ">= 1.5"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

variable "project_id" {
  type = string
}

variable "region" {
  type    = string
  default = "us-central1"
}

variable "environment" {
  type = string
}

variable "vpc_id" {
  type        = string
  description = "VPC network ID for private connection"
}

variable "memory_size_gb" {
  type        = number
  description = "Redis memory size in GB"
  default     = 5
}

variable "redis_version" {
  type    = string
  default = "REDIS_7_0"
}

variable "tier" {
  type        = string
  description = "Service tier (BASIC or STANDARD_HA)"
  default     = "STANDARD_HA"
}

# Redis Instance
resource "google_redis_instance" "cache" {
  name               = "coditect-${var.environment}-redis"
  tier               = var.tier
  memory_size_gb     = var.memory_size_gb
  region             = var.region
  redis_version      = var.redis_version
  authorized_network = var.vpc_id
  project            = var.project_id

  # High availability configuration
  replica_count      = var.tier == "STANDARD_HA" ? 1 : 0
  read_replicas_mode = var.tier == "STANDARD_HA" ? "READ_REPLICAS_ENABLED" : null

  # Persistence configuration
  persistence_config {
    persistence_mode    = "RDB"
    rdb_snapshot_period = "TWELVE_HOURS"
  }

  # Maintenance window
  maintenance_policy {
    weekly_maintenance_window {
      day = "SUNDAY"
      start_time {
        hours   = 3
        minutes = 0
      }
    }
  }

  # Redis configuration
  redis_configs = {
    maxmemory-policy       = "allkeys-lru"
    notify-keyspace-events = "Ex"
    timeout                = "300"
  }

  # Display name
  display_name = "CODITECT ${var.environment} Redis Cache"

  # Labels
  labels = {
    environment = var.environment
    managed_by  = "terraform"
  }
}

# Outputs
output "redis_host" {
  value       = google_redis_instance.cache.host
  description = "Redis instance host"
  sensitive   = true
}

output "redis_port" {
  value       = google_redis_instance.cache.port
  description = "Redis instance port"
}

output "redis_connection_string" {
  value       = "redis://${google_redis_instance.cache.host}:${google_redis_instance.cache.port}"
  description = "Redis connection string"
  sensitive   = true
}
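A back-of-envelope check that 5 GB comfortably covers the 10,000-session target. The per-session payload and overhead factor are assumptions for illustration, not measured numbers; in practice most of the 5 GB is headroom for cached license-validation results:

```python
# Rough Redis memory budget for the session workload (assumed figures).
sessions = 10_000
bytes_per_session = 4 * 1024  # assumed: ~4 KB of session + license state
overhead = 1.5                # rough factor for Redis key/structure overhead

needed_gb = sessions * bytes_per_session * overhead / (1024 ** 3)
print(f"{needed_gb:.3f} GB needed of 5 GB provisioned")
```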

Phase 5: Cloud KMS (OpenTofu)

# infrastructure/terraform/modules/kms/main.tf

terraform {
required_version = ">= 1.5"

required_providers {
google = {
source = "hashicorp/google"
version = "~> 5.0"
}
}
}

variable "project_id" {
type = string
}

variable "region" {
type = string
default = "us-central1"
}

variable "environment" {
type = string
}

# KMS Keyring
resource "google_kms_key_ring" "keyring" {
name = "coditect-${var.environment}-keyring"
location = var.region
project = var.project_id
}

# Encryption key for tenant data
resource "google_kms_crypto_key" "tenant_data_key" {
name = "tenant-data-encryption"
key_ring = google_kms_key_ring.keyring.id

rotation_period = "7776000s" # 90 days

version_template {
algorithm = "GOOGLE_SYMMETRIC_ENCRYPTION"
}

lifecycle {
prevent_destroy = true
}
}

# Encryption key for database backups
resource "google_kms_crypto_key" "backup_key" {
name = "database-backup-encryption"
key_ring = google_kms_key_ring.keyring.id

rotation_period = "7776000s" # 90 days

version_template {
algorithm = "GOOGLE_SYMMETRIC_ENCRYPTION"
}

lifecycle {
prevent_destroy = true
}
}

# Service account for KMS access
resource "google_service_account" "kms_sa" {
account_id = "coditect-${var.environment}-kms"
display_name = "Service account for KMS operations (${var.environment})"
project = var.project_id
}

# IAM binding for encryption/decryption
resource "google_kms_crypto_key_iam_member" "tenant_data_encrypter_decrypter" {
crypto_key_id = google_kms_crypto_key.tenant_data_key.id
role = "roles/cloudkms.cryptoKeyEncrypterDecrypter"
member = "serviceAccount:${google_service_account.kms_sa.email}"
}

resource "google_kms_crypto_key_iam_member" "backup_encrypter_decrypter" {
crypto_key_id = google_kms_crypto_key.backup_key.id
role = "roles/cloudkms.cryptoKeyEncrypterDecrypter"
member = "serviceAccount:${google_service_account.kms_sa.email}"
}

# Outputs
output "keyring_id" {
value = google_kms_key_ring.keyring.id
description = "KMS keyring ID"
}

output "tenant_data_key_id" {
value = google_kms_crypto_key.tenant_data_key.id
description = "Tenant data encryption key ID"
}

output "backup_key_id" {
value = google_kms_crypto_key.backup_key.id
description = "Backup encryption key ID"
}

output "kms_service_account_email" {
value = google_service_account.kms_sa.email
description = "KMS service account email"
}
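KMS rotation periods are given in seconds, so it is worth confirming the `"7776000s"` value above really is the intended 90 days:

```python
# Verify the KMS rotation_period used above (seconds -> days).
rotation_seconds = 7_776_000
days = rotation_seconds / 86_400  # seconds per day
print(days)  # -> 90.0
```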

Phase 6: Kubernetes Manifests

Deployment

# kubernetes/deployments/django-api.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: django-api
  namespace: coditect
  labels:
    app: django-api
    version: v1
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: django-api
  template:
    metadata:
      labels:
        app: django-api
        version: v1
    spec:
      serviceAccountName: coditect-app-sa

      # Init container for migrations.
      # Caveat: ordinary init containers finish before the regular containers
      # start, so this cannot reach the Cloud SQL Proxy sidecar at 127.0.0.1.
      # Either run the proxy as a native sidecar (an initContainer with
      # restartPolicy: Always, Kubernetes 1.29+) or run migrations as a
      # separate Job.
      initContainers:
        - name: migrations
          image: gcr.io/PROJECT_ID/coditect-backend:VERSION
          command: ["python", "manage.py", "migrate", "--noinput"]
          env:
            - name: DB_HOST
              value: "127.0.0.1" # Cloud SQL Proxy sidecar
            - name: DB_PORT
              value: "5432"
            - name: DB_NAME
              valueFrom:
                secretKeyRef:
                  name: cloudsql-db-credentials
                  key: database
            - name: DB_USER
              valueFrom:
                secretKeyRef:
                  name: cloudsql-db-credentials
                  key: username
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: cloudsql-db-credentials
                  key: password
            - name: REDIS_HOST
              valueFrom:
                configMapKeyRef:
                  name: app-config
                  key: redis_host
            - name: REDIS_PORT
              valueFrom:
                configMapKeyRef:
                  name: app-config
                  key: redis_port

      containers:
        # Django API container
        - name: django-api
          image: gcr.io/PROJECT_ID/coditect-backend:VERSION
          ports:
            - containerPort: 8000
              name: http
              protocol: TCP
          env:
            - name: DB_HOST
              value: "127.0.0.1"
            - name: DB_PORT
              value: "5432"
            - name: DB_NAME
              valueFrom:
                secretKeyRef:
                  name: cloudsql-db-credentials
                  key: database
            - name: DB_USER
              valueFrom:
                secretKeyRef:
                  name: cloudsql-db-credentials
                  key: username
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: cloudsql-db-credentials
                  key: password
            - name: REDIS_HOST
              valueFrom:
                configMapKeyRef:
                  name: app-config
                  key: redis_host
            - name: REDIS_PORT
              valueFrom:
                configMapKeyRef:
                  name: app-config
                  key: redis_port
            - name: SECRET_KEY
              valueFrom:
                secretKeyRef:
                  name: django-secrets
                  key: secret_key
            - name: ENVIRONMENT
              value: "production"

          # Resource limits
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "1Gi"
              cpu: "1000m"

          # Liveness probe
          livenessProbe:
            httpGet:
              path: /health/liveness
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3

          # Readiness probe
          readinessProbe:
            httpGet:
              path: /health/readiness
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3

        # Cloud SQL Proxy sidecar
        - name: cloud-sql-proxy
          image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.8.0
          args:
            - "--private-ip"
            - "PROJECT_ID:REGION:INSTANCE_NAME"
          securityContext:
            runAsNonRoot: true
          resources:
            requests:
              memory: "128Mi"
              cpu: "100m"
            limits:
              memory: "256Mi"
              cpu: "200m"

      # Pod security
      securityContext:
        fsGroup: 1000
        runAsNonRoot: true
        runAsUser: 1000
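The probe settings above determine how quickly a bad pod is detected. Roughly, a hung container is declared failed after `failureThreshold` consecutive failing checks, each separated by `periodSeconds` and allowed up to `timeoutSeconds`. A quick estimate of the worst-case windows:

```python
# Approximate worst-case detection windows implied by the probes above.
liveness = {"periodSeconds": 10, "timeoutSeconds": 5, "failureThreshold": 3}
readiness = {"periodSeconds": 5, "timeoutSeconds": 3, "failureThreshold": 3}

def max_detection_seconds(p: dict) -> int:
    # failureThreshold consecutive failures, one per period, plus the final
    # check's timeout (a rough upper bound, ignoring scheduling jitter).
    return p["failureThreshold"] * p["periodSeconds"] + p["timeoutSeconds"]

print(max_detection_seconds(liveness))   # ~35 s before kubelet restarts the container
print(max_detection_seconds(readiness))  # ~18 s before the pod leaves Service endpoints
```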

Service

# kubernetes/services/django-api.yaml

apiVersion: v1
kind: Service
metadata:
  name: django-api
  namespace: coditect
  labels:
    app: django-api
spec:
  type: ClusterIP
  selector:
    app: django-api
  ports:
    - port: 80
      targetPort: 8000
      protocol: TCP
      name: http
  sessionAffinity: None

Ingress

# kubernetes/ingress/django-api-ingress.yaml

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: django-api-ingress
  namespace: coditect
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
    nginx.ingress.kubernetes.io/limit-rps: "100"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - api.coditect.ai
      secretName: django-api-tls
  rules:
    - host: api.coditect.ai
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: django-api
                port:
                  number: 80

HorizontalPodAutoscaler

# kubernetes/autoscaling/django-api-hpa.yaml

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: django-api-hpa
  namespace: coditect
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: django-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
        - type: Pods
          value: 2
          periodSeconds: 60
      selectPolicy: Min
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 50
          periodSeconds: 30
        - type: Pods
          value: 3
          periodSeconds: 30
      selectPolicy: Max
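The HPA's sizing decision follows the standard Kubernetes formula, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the min/max above (the behavior policies then rate-limit how fast it moves toward that number):

```python
import math

# Desired-replica calculation used by the HPA controller, clamped to the
# minReplicas/maxReplicas configured above.
def desired_replicas(current: int, current_util: float, target_util: float,
                     min_r: int = 3, max_r: int = 20) -> int:
    desired = math.ceil(current * current_util / target_util)
    return max(min_r, min(max_r, desired))

print(desired_replicas(3, 95, 70))   # CPU spike to 95% against a 70% target -> 5
print(desired_replicas(10, 20, 70))  # quiet period scales back toward minReplicas -> 3
```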

ConfigMap

# kubernetes/configmaps/app-config.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
  namespace: coditect
data:
  redis_host: "10.20.0.3" # Redis Memorystore private IP
  redis_port: "6379"
  log_level: "INFO"
  django_settings_module: "config.settings.production"

Secrets (Template)

# kubernetes/secrets/django-secrets.yaml (template)

apiVersion: v1
kind: Secret
metadata:
  name: django-secrets
  namespace: coditect
type: Opaque
stringData:
  secret_key: "CHANGE_ME_GENERATE_RANDOM_STRING"

---
apiVersion: v1
kind: Secret
metadata:
  name: cloudsql-db-credentials
  namespace: coditect
type: Opaque
stringData:
  database: "coditect"
  username: "coditect_app"
  password: "CHANGE_ME_RETRIEVE_FROM_SECRET_MANAGER"

Phase 7: Monitoring & Observability

Prometheus ServiceMonitor

# kubernetes/monitoring/servicemonitor.yaml

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: django-api-metrics
  namespace: coditect
  labels:
    app: django-api
spec:
  selector:
    matchLabels:
      app: django-api
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s

Grafana Dashboard (JSON)

{
  "dashboard": {
    "title": "CODITECT API Metrics",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          { "expr": "rate(django_http_requests_total[5m])" }
        ]
      },
      {
        "title": "Request Latency (p95)",
        "targets": [
          { "expr": "histogram_quantile(0.95, rate(django_http_request_duration_seconds_bucket[5m]))" }
        ]
      },
      {
        "title": "Error Rate",
        "targets": [
          { "expr": "rate(django_http_requests_total{status=~\"5..\"}[5m])" }
        ]
      },
      {
        "title": "Active License Sessions",
        "targets": [
          { "expr": "sum(tenant_license_sessions{status=\"active\"})" }
        ]
      }
    ]
  }
}

Cost Analysis

Development Environment

GKE Cluster:

  • 3 nodes (n2-standard-2): $150/month
  • Total: $150/month

Cloud SQL PostgreSQL:

  • db-custom-1-3840 (1 vCPU, 3.75 GB): $80/month
  • Storage 50 GB: $17/month
  • Total: $97/month

Redis Memorystore:

  • Basic tier, 1 GB: $50/month
  • Total: $50/month

Cloud KMS:

  • Key storage: $1/month
  • Operations: $5/month (estimated)
  • Total: $6/month

Networking:

  • Cloud Load Balancer: $20/month
  • Egress: $10/month (estimated)
  • Total: $30/month

Cloud Monitoring/Logging:

  • Logs: $20/month
  • Metrics: $10/month
  • Total: $30/month

Development Total: $363/month


Production Environment (Launch)

GKE Cluster:

  • 6 nodes (n2-standard-4): $600/month
  • Total: $600/month

Cloud SQL PostgreSQL:

  • db-custom-2-7680 (2 vCPU, 7.5 GB) HA: $320/month
  • Storage 100 GB: $34/month
  • Backups 100 GB: $10/month
  • Total: $364/month

Redis Memorystore:

  • Standard HA tier, 5 GB: $200/month
  • Total: $200/month

Cloud KMS:

  • Key storage: $1/month
  • Operations: $10/month
  • Total: $11/month

Networking:

  • Cloud Load Balancer: $30/month
  • Egress: $50/month (estimated)
  • Total: $80/month

Cloud Monitoring/Logging:

  • Logs: $50/month
  • Metrics: $20/month
  • Total: $70/month

Production Total: $1,325/month


Production at Scale (10K Concurrent Sessions)

GKE Cluster:

  • 15 nodes (n2-standard-4): $1,500/month
  • Total: $1,500/month

Cloud SQL PostgreSQL:

  • db-custom-4-15360 (4 vCPU, 15 GB) HA: $640/month
  • Storage 500 GB: $170/month
  • Backups 500 GB: $50/month
  • Total: $860/month

Redis Memorystore:

  • Standard HA tier, 10 GB: $400/month
  • Total: $400/month

Cloud KMS:

  • Key storage: $1/month
  • Operations: $20/month
  • Total: $21/month

Networking:

  • Cloud Load Balancer: $50/month
  • Egress: $200/month
  • Total: $250/month

Cloud Monitoring/Logging:

  • Logs: $150/month
  • Metrics: $50/month
  • Total: $200/month

Production at Scale Total: $3,231/month
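
The per-environment totals above can be double-checked by summing their line items:

```python
# Roll-up of the per-environment estimates above (all figures US$/month).
dev = {"gke": 150, "cloudsql": 97, "redis": 50, "kms": 6,
       "network": 30, "monitoring": 30}
prod_launch = {"gke": 600, "cloudsql": 364, "redis": 200, "kms": 11,
               "network": 80, "monitoring": 70}
prod_scale = {"gke": 1500, "cloudsql": 860, "redis": 400, "kms": 21,
              "network": 250, "monitoring": 200}

for name, env in [("dev", dev), ("prod-launch", prod_launch),
                  ("prod-scale", prod_scale)]:
    print(f"{name}: ${sum(env.values()):,}/month")
```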


Consequences

Positive

Availability:

  • 99.9% uptime SLA achievable with regional GKE + Cloud SQL HA
  • Automatic failover for database and Redis
  • Zero-downtime deployments with rolling updates
  • Multi-zone redundancy protects against zone failures

Security:

  • Private VPC with no public IPs for resources
  • Encrypted at rest (Cloud SQL, Redis, secrets)
  • Encrypted in transit (TLS 1.3 everywhere)
  • Workload Identity for secure pod authentication

Scalability:

  • Horizontal pod autoscaling handles traffic spikes
  • GKE node autoscaling adjusts capacity automatically
  • Cloud SQL read replicas for read-heavy workloads
  • 10K+ concurrent sessions supported

Operational Efficiency:

  • Managed services reduce operational burden
  • Automatic backups and point-in-time recovery
  • Infrastructure as Code with Terraform
  • Cloud Monitoring integration for observability

Cost Efficiency:

  • Auto-scaling minimizes costs during low usage
  • Shared infrastructure across tenants
  • $1,325/month at launch (affordable for early stage)
  • Predictable scaling ($3,231/month at 10K sessions)

Negative

Vendor Lock-In:

  • ⚠️ GCP-specific infrastructure (not portable)
  • ⚠️ Cloud SQL proprietary features (not pure PostgreSQL)
  • ⚠️ GKE custom configurations (not vanilla Kubernetes)
  • ⚠️ Migration costs high if switching cloud providers

Complexity:

  • ⚠️ Terraform state management requires careful coordination
  • ⚠️ Multi-component deployments increase failure surface area
  • ⚠️ Networking complexity (VPC peering, private IPs, NAT)
  • ⚠️ Troubleshooting requires GCP-specific knowledge

Cost:

  • ⚠️ Higher than single VM ($1,325/month vs. $100/month)
  • ⚠️ Egress costs can be unpredictable
  • ⚠️ Monitoring costs scale with usage
  • ⚠️ Idle costs during low traffic (still pay for HA resources)

Limitations:

  • ⚠️ Cloud SQL limits (max 96 vCPU, 624 GB RAM per instance)
  • ⚠️ Redis memory limits (max 300 GB per instance)
  • ⚠️ GKE quotas (max nodes per cluster, max pods per node)
  • ⚠️ Regional lock (cannot span multiple regions easily)

Mitigation Strategies

Vendor Lock-In:

  • Use standard Kubernetes manifests (portable to other clouds)
  • Abstract GCP-specific features behind interfaces
  • Document migration path to AWS/Azure
  • Use open standards where possible (Prometheus, Grafana)

Complexity:

  • Comprehensive runbooks for common operations
  • Automated deployment pipelines (GitHub Actions)
  • Infrastructure testing with Terratest
  • Regular disaster recovery drills

Cost:

  • Set up billing alerts and budgets
  • Optimize resource allocation based on metrics
  • Use committed use discounts for predictable workloads
  • Regular cost audits

Limitations:

  • Plan for sharding/multi-database strategy before hitting limits
  • Evaluate Cloud Spanner if horizontal scaling needs exceed Cloud SQL limits
  • Implement circuit breakers for graceful degradation
  • Monitor quotas and request increases proactively
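A minimal circuit breaker for the graceful-degradation item above (a sketch; the thresholds and class name are assumptions, not the team's implementation):

```python
import time


class CircuitBreaker:
    """Fail fast after repeated errors, then retry after a cooldown.

    closed -> calls pass through
    open   -> calls are rejected immediately until `reset_after` elapses,
              then one trial call is allowed through (half-open)
    """

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

Wrapping Cloud SQL or Redis calls this way keeps the API responsive (returning a degraded response quickly) instead of piling up timeouts when a dependency hits its limits.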

Alternatives Considered

Alternative 1: Single VM Deployment

Architecture:

  • Single GCE VM (n2-standard-4)
  • PostgreSQL installed on VM
  • Redis installed on VM
  • Nginx + Gunicorn for Django

Pros:

  • ✅ Simplest architecture
  • ✅ Lowest cost ($100/month)
  • ✅ Easy to debug
  • ✅ No Kubernetes complexity

Cons:

  • ❌ Single point of failure (no HA)
  • ❌ Manual scaling required
  • ❌ Downtime for deployments
  • ❌ Cannot meet 99.9% SLA
  • ❌ Limited to vertical scaling only

Decision: Rejected due to availability requirements and inability to meet SLA.


Alternative 2: Cloud Run + Cloud SQL

Architecture:

  • Cloud Run for Django API (serverless containers)
  • Cloud SQL PostgreSQL (managed)
  • Redis Memorystore (managed)

Pros:

  • ✅ Serverless (zero idle costs)
  • ✅ Automatic scaling to zero
  • ✅ Simple deployment
  • ✅ Lower management overhead

Cons:

  • ❌ Cold start latency (100-500ms)
  • ❌ 15-minute request timeout limit
  • ❌ No persistent connections (Cloud SQL Proxy overhead)
  • ❌ Difficult to run background workers
  • ❌ Less control over networking

Decision: Rejected due to cold start latency and timeout limits.


Alternative 3: AWS EKS + RDS

Architecture:

  • AWS EKS (Kubernetes)
  • AWS RDS PostgreSQL
  • AWS ElastiCache Redis
  • AWS KMS

Pros:

  • ✅ Similar architecture to GKE
  • ✅ Broader AWS ecosystem
  • ✅ More regions available
  • ✅ Potentially lower costs (Reserved Instances)

Cons:

  • ❌ Team has GCP experience, not AWS
  • ❌ Migration cost from existing GCP
  • ❌ AWS-specific learning curve
  • ❌ Different pricing model

Decision: Rejected due to existing GCP investment and team expertise.


Alternative 4: Multi-Cloud Deployment

Architecture:

  • Kubernetes on GCP + AWS + Azure
  • PostgreSQL replication across clouds
  • Global load balancing

Pros:

  • ✅ No vendor lock-in
  • ✅ Maximum availability
  • ✅ Geographic distribution
  • ✅ Disaster recovery across clouds

Cons:

  • ❌ Extreme complexity
  • ❌ 3-5x operational burden
  • ❌ Data replication costs high
  • ❌ Latency for cross-cloud queries
  • ❌ Overkill for current scale

Decision: Rejected as over-engineering for current needs. Revisit at 100K+ users.


Related ADRs

ADR-002: Redis Caching Strategy

Defines Redis usage patterns for session management and caching.

Relationship: ADR-009 provides managed Redis Memorystore infrastructure for patterns defined in ADR-002.
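On top of Memorystore, the cache-aside pattern from ADR-002 looks roughly like this (a sketch; a small in-memory class stands in for the Redis client, and `fetch_license_from_db` is a hypothetical loader, not the real query):

```python
import time

CACHE_TTL_SECONDS = 300  # assumed TTL; ADR-002 defines the real policy


class FakeRedis:
    """In-memory stand-in for the Memorystore client (same get/setex shape)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        value, expires = self._data.get(key, (None, 0))
        return value if time.monotonic() < expires else None

    def setex(self, key, ttl, value):
        self._data[key] = (value, time.monotonic() + ttl)


def fetch_license_from_db(license_key):
    # Hypothetical loader; in production this is a Cloud SQL query.
    return f"license-record-for-{license_key}"


def validate_license(redis, license_key):
    """Cache-aside: check Redis first, fall back to the database on a miss."""
    cache_key = f"license:{license_key}"
    cached = redis.get(cache_key)
    if cached is not None:
        return cached
    record = fetch_license_from_db(license_key)
    redis.setex(cache_key, CACHE_TTL_SECONDS, record)
    return record
```

Serving repeat validations from Redis is what keeps the p99 latency target reachable at 1M+ validations/day without scaling Cloud SQL for every read.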


ADR-006: Secret Management Strategy

Defines encryption key management with Cloud KMS.

Relationship: ADR-009 provisions Cloud KMS infrastructure for secret management defined in ADR-006.


ADR-007: Django Multi-Tenant Architecture

Defines Cloud SQL PostgreSQL requirements for Row-Level Security.

Relationship: ADR-009 provides Cloud SQL PostgreSQL 15 with RLS support required by ADR-007.


Implementation Timeline

Phase 1: Infrastructure Foundation (Week 1-2)

  • ✅ VPC network setup
  • ✅ GKE cluster provisioning
  • ✅ Cloud SQL instance creation
  • ✅ Redis Memorystore setup

Phase 2: Security & Compliance (Week 3)

  • ✅ Cloud KMS configuration
  • ✅ Service accounts and IAM
  • ✅ Network policies
  • ✅ Secret management

Phase 3: Application Deployment (Week 4)

  • ✅ Kubernetes manifests
  • ✅ CI/CD pipeline
  • ✅ Initial deployment
  • ✅ DNS configuration

Phase 4: Monitoring & Operations (Week 5)

  • ✅ Prometheus setup
  • ✅ Grafana dashboards
  • ✅ Alerting rules
  • ✅ Runbooks

Phase 5: Testing & Validation (Week 6)

  • ✅ Load testing
  • ✅ Failover testing
  • ✅ Disaster recovery drill
  • ✅ Security audit
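Load-test results from Phase 5 can be checked against the <100ms p99 latency target with a simple percentile helper (a sketch; this uses the nearest-rank method, which monitoring stacks may compute slightly differently):

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value >= pct% of the samples."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]


def meets_latency_slo(latencies_ms, p99_budget_ms=100.0):
    """True if the p99 of the measured latencies is under budget."""
    return percentile(latencies_ms, 99) < p99_budget_ms
```

Feeding the per-request latencies captured during the load test into `meets_latency_slo` gives a pass/fail gate that can run in the CI pipeline after each drill.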

Last Updated: 2025-11-30
Review Date: 2026-02-28
Status: Accepted ✅