CODITECT Flow Platform - Infrastructure Architecture

Technical architecture documentation for the GCP deployment.


Architecture Overview

┌─────────────────────────────────────────────────────────────────────┐
│ CODITECT Flow Platform │
│ Google Cloud Platform │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│ VPC Network │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ GKE Subnet (10.x.0.0/20) │ │
│ │ ┌────────────────────────────────────────────────────────┐ │ │
│ │ │ GKE Cluster (Regional/Zonal) │ │ │
│ │ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │ │
│ │ │ │ Node Pool │ │ Node Pool │ │ Node Pool │ │ │ │
│ │ │ │ n2-std-4 │ │ n2-std-4 │ │ n2-std-4 │ │ │ │
│ │ │ │ │ │ │ │ │ │ │ │
│ │ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │ │ │
│ │ │ │ │Flow API │ │ │ │NATS │ │ │ │Workers │ │ │ │ │
│ │ │ │ │Pods │ │ │ │JetStream │ │ │ │Pods │ │ │ │ │
│ │ │ │ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │ │ │ │
│ │ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │ │
│ │ │ │ │ │
│ │ │ Pods: 10.x+1.0.0/16 | Services: 10.x+2.0.0/20 │ │ │
│ │ └────────────────────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Data Subnet (10.x+3.0.0/20) - Private Service Access │ │
│ │ ┌────────────────┐ ┌────────────────┐ │ │
│ │ │ Cloud SQL │ │ Redis │ │ │
│ │ │ PostgreSQL 15 │ │ Memorystore │ │ │
│ │ │ Private IP │ │ Private IP │ │ │
│ │ └────────────────┘ └────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Cloud NAT → Internet │ │
│ └──────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│ Supporting Services │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ IAM/WI │ │ Secret │ │ Monitoring │ │ Logging │ │
│ │ Service │ │ Manager │ │ Prometheus │ │ Cloud │ │
│ │ Accounts │ │ │ │ Grafana │ │ Logging │ │
│ └────────────┘ └────────────┘ └────────────┘ └────────────┘ │
└─────────────────────────────────────────────────────────────────────┘

Network Architecture

VPC Design

Staging:

  • Network: coditect-step-staging-vpc
  • GKE Subnet: 10.10.0.0/20 (4,096 IPs)
  • GKE Pods: 10.11.0.0/16 (65,536 IPs)
  • GKE Services: 10.12.0.0/20 (4,096 IPs)
  • Data Subnet: 10.13.0.0/20 (4,096 IPs)

Production:

  • Network: coditect-step-production-vpc
  • GKE Subnet: 10.20.0.0/20 (4,096 IPs)
  • GKE Pods: 10.21.0.0/16 (65,536 IPs)
  • GKE Services: 10.22.0.0/20 (4,096 IPs)
  • Data Subnet: 10.23.0.0/20 (4,096 IPs)
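The address counts above follow directly from the prefix lengths (an IPv4 /N block holds 2^(32−N) addresses); a quick shell sanity check:

```shell
# Addresses in an IPv4 CIDR block: 2^(32 - prefix_length).
addrs() { echo $(( 1 << (32 - $1) )); }

addrs 20   # /20 (GKE, Services, Data subnets) -> 4096
addrs 16   # /16 (Pod range) -> 65536
```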

Firewall Rules

  1. Allow Internal - All traffic within VPC subnets
  2. Allow NATS Cluster - Ports 4222, 6222, 8222 for NATS
  3. Allow SSH IAP - SSH via Identity-Aware Proxy (35.235.240.0/20)
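A sketch of rule 2 with `gcloud`; the rule name and source range are illustrative (staging would use `10.10.0.0/20`), not copied from the actual Terraform:

```shell
# Allow NATS client (4222), cluster-route (6222), and monitoring (8222)
# traffic between nodes. Name and ranges are placeholders.
gcloud compute firewall-rules create allow-nats-cluster \
  --network=coditect-step-production-vpc \
  --direction=INGRESS \
  --allow=tcp:4222,tcp:6222,tcp:8222 \
  --source-ranges=10.20.0.0/20
```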

Private Service Access

  • Cloud SQL and Redis use private IP addresses only
  • No public endpoints exposed
  • VPC Peering with servicenetworking.googleapis.com
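Private Service Access is typically provisioned as a reserved internal range plus a peering connection to the service networking API; a hedged sketch (the range name is illustrative):

```shell
# Reserve an internal range for Google-managed services (Cloud SQL, Redis).
gcloud compute addresses create google-managed-services-range \
  --global \
  --purpose=VPC_PEERING \
  --prefix-length=16 \
  --network=coditect-step-production-vpc

# Peer the VPC with the service producer network over that range.
gcloud services vpc-peerings connect \
  --service=servicenetworking.googleapis.com \
  --ranges=google-managed-services-range \
  --network=coditect-step-production-vpc
```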

Egress

  • Cloud NAT for outbound internet access
  • All nodes use NAT gateway for updates and external API calls

Compute Resources

GKE Configuration

Staging (Zonal):

  • Location: us-central1-a
  • Nodes: 1-3 (autoscaling)
  • Machine Type: n2-standard-4 (4 vCPU, 16 GB RAM)
  • Disk: 100 GB SSD per node
  • Total: 4-12 vCPU, 16-48 GB RAM

Production (Regional):

  • Location: us-central1 (multi-zone: a, b, c)
  • Nodes: 3-10 (autoscaling, minimum 1 per zone)
  • Machine Type: n2-standard-4 (4 vCPU, 16 GB RAM)
  • Disk: 100 GB SSD per node
  • Total: 12-40 vCPU, 48-160 GB RAM

Workload Identity

All pods use Workload Identity instead of static credentials:

GKE Service Account     →     GCP Service Account
├── flow-api → {env}-flow-api@{project}.iam.gserviceaccount.com
├── nats → {env}-nats@{project}.iam.gserviceaccount.com
└── cloudsql-proxy → {env}-cloudsql-proxy@{project}.iam.gserviceaccount.com
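Each row in this mapping is a two-sided binding: the GCP service account must allow impersonation by the Kubernetes service account, and the KSA must carry the matching annotation. A sketch for `flow-api` in production (`PROJECT_ID` is a placeholder):

```shell
# Allow the KSA to impersonate the GSA via Workload Identity.
gcloud iam service-accounts add-iam-policy-binding \
  production-flow-api@PROJECT_ID.iam.gserviceaccount.com \
  --role=roles/iam.workloadIdentityUser \
  --member="serviceAccount:PROJECT_ID.svc.id.goog[coditect-flow/flow-api]"

# Point the KSA at the GSA.
kubectl annotate serviceaccount flow-api \
  --namespace coditect-flow \
  iam.gke.io/gcp-service-account=production-flow-api@PROJECT_ID.iam.gserviceaccount.com
```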

Data Layer

Cloud SQL PostgreSQL

Staging:

  • Version: PostgreSQL 15
  • Tier: db-custom-2-7680 (2 vCPU, 7.5 GB RAM)
  • Storage: 50 GB SSD
  • Availability: ZONAL
  • Backups: Daily, 7-day retention
  • Private IP only

Production:

  • Version: PostgreSQL 15
  • Tier: db-custom-4-15360 (4 vCPU, 15 GB RAM)
  • Storage: 200 GB SSD (auto-resize enabled)
  • Availability: REGIONAL (HA across zones)
  • Backups: Daily, 14-day retention, PITR enabled
  • Read Replica: Optional (configurable)
  • Private IP only

Database Flags:

max_connections            = 200
shared_buffers             = 262144   # 2 GB (8 KB pages)
work_mem                   = 16384    # 16 MB (KB)
maintenance_work_mem       = 65536    # 64 MB (KB)
effective_cache_size       = 524288   # 4 GB (8 KB pages)
log_min_duration_statement = 1000     # 1 second (ms)
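On Cloud SQL these flags are set through the API rather than `postgresql.conf`; a sketch (the instance name is a placeholder, and note that some flag changes restart the instance):

```shell
gcloud sql instances patch INSTANCE_NAME \
  --database-flags=max_connections=200,shared_buffers=262144,work_mem=16384,maintenance_work_mem=65536,effective_cache_size=524288,log_min_duration_statement=1000
```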

Redis Memorystore

Staging:

  • Tier: BASIC (single instance)
  • Memory: 5 GB
  • Version: Redis 7.0
  • Auth: Enabled
  • Transit Encryption: TLS
  • Private IP only

Production:

  • Tier: STANDARD_HA (primary + replica)
  • Memory: 10 GB
  • Version: Redis 7.0
  • Read Replicas: 1
  • Auth: Enabled
  • Transit Encryption: TLS
  • Persistence: RDB snapshots every 12 hours
  • Private IP only

Redis Configuration:

maxmemory-policy: allkeys-lru
notify-keyspace-events: Ex
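Memorystore accepts these settings at create/update time via `--redis-config`; a hedged sketch of the production instance (the instance name is illustrative, and read replicas are omitted for brevity):

```shell
gcloud redis instances create coditect-flow-redis \
  --region=us-central1 \
  --tier=standard_ha \
  --size=10 \
  --redis-version=redis_7_0 \
  --enable-auth \
  --transit-encryption-mode=SERVER_AUTHENTICATION \
  --redis-config=maxmemory-policy=allkeys-lru,notify-keyspace-events=Ex
```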

NATS JetStream

Staging:

  • Replicas: 1 (single instance)
  • Memory Storage: 1 GB
  • File Storage: 5 GB (Persistent Volume)
  • Resources:
    • Request: 250m CPU, 1 GB RAM
    • Limit: 1000m CPU, 4 GB RAM

Production:

  • Replicas: 3 (clustered)
  • Memory Storage: 2 GB
  • File Storage: 20 GB per replica (60 GB total)
  • Resources:
    • Request: 500m CPU, 2 GB RAM
    • Limit: 2000m CPU, 8 GB RAM

NATS Features:

  • JetStream enabled for persistent messaging
  • Cluster mode for HA
  • Prometheus metrics exporter
  • Workload Identity for GCS backups
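If NATS is deployed with the official Helm chart, the production topology above corresponds roughly to values like these (key names assume the `nats` chart and may differ between chart versions):

```yaml
# Sketch of Helm values for a 3-node JetStream cluster (chart-version dependent).
config:
  cluster:
    enabled: true
    replicas: 3
  jetstream:
    enabled: true
    memoryStore:
      enabled: true
      maxSize: 2Gi
    fileStore:
      enabled: true
      pvc:
        size: 20Gi
promExporter:
  enabled: true
```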

Security

IAM and Service Accounts

GKE Node Service Account:

  • roles/logging.logWriter
  • roles/monitoring.metricWriter
  • roles/monitoring.viewer
  • roles/artifactregistry.reader

Flow API Service Account:

  • roles/cloudsql.client
  • roles/secretmanager.secretAccessor
  • roles/storage.objectViewer

NATS Service Account:

  • roles/storage.admin (for JetStream backups)

Cloud SQL Proxy Service Account:

  • roles/cloudsql.client

Network Security

  • Private Cluster: GKE nodes use private IPs
  • Private Endpoints: Cloud SQL and Redis not publicly accessible
  • TLS Everywhere: All traffic encrypted in transit
  • mTLS: Service mesh for pod-to-pod encryption (optional)

Secrets Management

  • Database passwords stored in Kubernetes Secrets
  • Redis AUTH strings stored in Kubernetes Secrets
  • Secrets synced from GCP Secret Manager (recommended for production)
  • No hardcoded credentials in code or configs
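A sketch of the recommended Secret Manager flow (the secret name and service account are illustrative):

```shell
# Create the secret and add a version from an environment variable
# (never commit the value to a repo).
gcloud secrets create flow-db-password --replication-policy=automatic
printf '%s' "$DB_PASSWORD" | gcloud secrets versions add flow-db-password --data-file=-

# Grant read access to the workload's service account only.
gcloud secrets add-iam-policy-binding flow-db-password \
  --role=roles/secretmanager.secretAccessor \
  --member="serviceAccount:production-flow-api@PROJECT_ID.iam.gserviceaccount.com"
```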

High Availability

Staging (Single-Zone)

  • GKE: Single zone, node autoscaling
  • Cloud SQL: ZONAL (single instance)
  • Redis: BASIC (single instance)
  • NATS: 1 replica
  • RTO: ~15 minutes (manual intervention)
  • RPO: ~5 minutes (last transaction log)

Production (Multi-Zone)

  • GKE: Regional (3 zones), node autoscaling
  • Cloud SQL: REGIONAL (automatic failover)
  • Redis: STANDARD_HA (automatic failover)
  • NATS: 3-node cluster (Raft consensus)
  • RTO: ~2 minutes (automatic failover)
  • RPO: ~1 minute (synchronous replication)

Monitoring and Observability

Metrics Collection

  • GKE: Managed Prometheus for workloads
  • Cloud SQL: Query Insights, slow query logs
  • Redis: Memorystore metrics
  • NATS: Prometheus exporter on port 7777

Logging

  • GKE: Cloud Logging for all container logs
  • Cloud SQL: Slow query logs, error logs
  • Application: Structured JSON logs

Alerting

Recommended alerts:

  • GKE node CPU/memory > 80%
  • Cloud SQL connections > 180/200
  • Redis memory > 90%
  • NATS cluster degraded
  • Application error rate > 1%
  • Application p99 latency > 500ms
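As a Prometheus rule, the error-rate alert might look like the following; the metric name `http_requests_total` is an assumption about the application's instrumentation:

```yaml
groups:
  - name: coditect-flow-alerts
    rules:
      - alert: FlowApiHighErrorRate
        # Assumes the API exports http_requests_total with a status label.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Flow API 5xx error rate above 1% for 5 minutes"
```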

Cost Breakdown

Staging Environment

| Resource         | Configuration             | Monthly Cost       |
|------------------|---------------------------|--------------------|
| GKE Cluster      | Zonal, 1-3 nodes          | $75-225            |
| Cloud SQL        | ZONAL, 2 vCPU, 7.5 GB RAM | $120               |
| Redis            | BASIC, 5 GB               | $40                |
| Networking       | NAT, VPC peering          | $50                |
| Persistent Disks | NATS storage              | $5                 |
| **Total**        |                           | **$290-440/month** |

Production Environment

| Resource         | Configuration                  | Monthly Cost        |
|------------------|--------------------------------|---------------------|
| GKE Cluster      | Regional, 3-10 nodes           | $225-750            |
| Cloud SQL        | REGIONAL HA, 4 vCPU, 15 GB RAM | $400                |
| Redis            | STANDARD_HA, 10 GB             | $150                |
| Networking       | NAT, VPC peering               | $100                |
| Persistent Disks | NATS storage (60 GB)           | $12                 |
| **Total**        |                                | **$887-1412/month** |

Cost Optimization Strategies:

  • Use preemptible nodes for non-critical workloads
  • Enable cluster autoscaling
  • Use committed use discounts for predictable workloads
  • Implement pod autoscaling to reduce idle resources
  • Use Cloud SQL read replicas only when needed

Disaster Recovery

Backup Strategy

Cloud SQL:

  • Automated daily backups at 02:00 UTC
  • Point-in-time recovery (PITR) enabled (production)
  • Transaction logs retained for 7 days
  • Backups retained for 7-14 days
  • Backups stored in multi-region

Redis:

  • RDB snapshots every 12 hours (production)
  • No persistence in staging (cache only)

NATS JetStream:

  • Persistent volumes backed up via GKE snapshots
  • Optional GCS backup via custom script

Terraform State:

  • Stored in GCS with versioning enabled
  • 10 previous versions retained
  • Bucket in multi-region

Recovery Procedures

Cloud SQL Restore:

gcloud sql backups restore BACKUP_ID \
  --restore-instance=TARGET_INSTANCE \
  --backup-instance=SOURCE_INSTANCE \
  --backup-project=coditect-citus-prod

NATS Restore:

kubectl get pvc -n coditect-flow
# Restore from GKE volume snapshot
gcloud compute disks create DISK_NAME --source-snapshot=SNAPSHOT_NAME

Infrastructure Restore:

cd infra/environments/production
tofu init
tofu plan
tofu apply

Scaling Strategy

Horizontal Pod Autoscaler (HPA)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: flow-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: flow-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80

Cluster Autoscaler

Automatically adjusts node count based on pod resource requests:

  • Staging: 1-3 nodes
  • Production: 3-10 nodes
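Autoscaling bounds are set per node pool; a sketch for production (cluster and pool names are placeholders, and the `--total-*` flags, which bound the pool across all zones of a regional cluster, require a reasonably recent gcloud):

```shell
gcloud container clusters update CLUSTER_NAME \
  --region=us-central1 \
  --node-pool=default-pool \
  --enable-autoscaling \
  --total-min-nodes=3 \
  --total-max-nodes=10
```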

Database Scaling

Vertical Scaling (Cloud SQL):

gcloud sql instances patch INSTANCE_NAME \
  --tier=db-custom-8-30720

Note: changing the tier restarts the instance, causing brief downtime.

Horizontal Scaling:

  • Enable read replica for read-heavy workloads
  • Connection pooling (PgBouncer) for high concurrency

Maintenance Windows

GKE:

  • Daily maintenance window: 03:00-07:00 UTC
  • Release channel: REGULAR (stable, predictable updates)

Cloud SQL:

  • Weekly maintenance: Sunday, 03:00-04:00 UTC
  • Update track: stable

Redis:

  • Weekly maintenance: Sunday, 03:00-04:00 UTC

Recommended Deployment Windows:

  • Staging: Anytime (low traffic)
  • Production: Tuesday-Thursday, 02:00-04:00 UTC (lowest traffic)

Compliance and Governance

Data Residency

  • All data stored in us-central1 region
  • No cross-region replication (configurable)
  • VPC Service Controls for additional isolation (optional)

Audit Logging

  • Cloud Audit Logs enabled for all services
  • Admin activity logs (always on)
  • Data access logs (configurable)
  • System event logs

Encryption

  • At Rest: AES-256 encryption for all data (Cloud SQL, Redis, GKE disks)
  • In Transit: TLS 1.3 for all connections
  • Keys: Google-managed encryption keys (CMEK optional)

Next Steps

After infrastructure deployment:

  1. Deploy Flow Platform application containers
  2. Configure Ingress with SSL certificates (Let's Encrypt)
  3. Set up monitoring dashboards (Grafana)
  4. Configure alerting rules (Prometheus Alertmanager)
  5. Implement automated backups and disaster recovery testing
  6. Enable Binary Authorization for container security
  7. Implement CI/CD pipeline for deployments

Track: AO.19 | Owner: AZ1.AI INC | Lead: Hal Casteel | Version: 1.0.0 | Updated: February 7, 2026