CODITECT Flow Platform - Infrastructure Architecture

Technical architecture documentation for the GCP deployment.


Architecture Overview

┌─────────────────────────────────────────────────────────────────────┐
│ CODITECT Flow Platform │
│ Google Cloud Platform │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│ VPC Network │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ GKE Subnet (10.x.0.0/20) │ │
│ │ ┌────────────────────────────────────────────────────────┐ │ │
│ │ │ GKE Cluster (Regional/Zonal) │ │ │
│ │ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │ │
│ │ │ │ Node Pool │ │ Node Pool │ │ Node Pool │ │ │ │
│ │ │ │ n2-std-4 │ │ n2-std-4 │ │ n2-std-4 │ │ │ │
│ │ │ │ │ │ │ │ │ │ │ │
│ │ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │ │ │
│ │ │ │ │Flow API │ │ │ │NATS │ │ │ │Workers │ │ │ │ │
│ │ │ │ │Pods │ │ │ │JetStream │ │ │ │Pods │ │ │ │ │
│ │ │ │ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │ │ │ │
│ │ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │ │
│ │ │ │ │ │
│ │ │ Pods: 10.x+1.0.0/16 | Services: 10.x+2.0.0/20 │ │ │
│ │ └────────────────────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Data Subnet (10.x+3.0.0/20) - Private Service Access │ │
│ │ ┌────────────────┐ ┌────────────────┐ │ │
│ │ │ Cloud SQL │ │ Redis │ │ │
│ │ │ PostgreSQL 15 │ │ Memorystore │ │ │
│ │ │ Private IP │ │ Private IP │ │ │
│ │ └────────────────┘ └────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Cloud NAT → Internet │ │
│ └──────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│ Supporting Services │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ IAM/WI │ │ Secret │ │ Monitoring │ │ Logging │ │
│ │ Service │ │ Manager │ │ Prometheus │ │ Cloud │ │
│ │ Accounts │ │ │ │ Grafana │ │ Logging │ │
│ └────────────┘ └────────────┘ └────────────┘ └────────────┘ │
└─────────────────────────────────────────────────────────────────────┘

Network Architecture

VPC Design

Staging:

  • Network: coditect-step-staging-vpc
  • GKE Subnet: 10.10.0.0/20 (4,096 IPs)
  • GKE Pods: 10.11.0.0/16 (65,536 IPs)
  • GKE Services: 10.12.0.0/20 (4,096 IPs)
  • Data Subnet: 10.13.0.0/20 (4,096 IPs)

Production:

  • Network: coditect-step-production-vpc
  • GKE Subnet: 10.20.0.0/20 (4,096 IPs)
  • GKE Pods: 10.21.0.0/16 (65,536 IPs)
  • GKE Services: 10.22.0.0/20 (4,096 IPs)
  • Data Subnet: 10.23.0.0/20 (4,096 IPs)
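The address counts above follow directly from the prefix lengths (an IPv4 /N block holds 2^(32−N) addresses); a quick shell sanity check:

```shell
# Addresses in an IPv4 CIDR block: 2^(32 - prefix_length).
addrs() { echo $(( 1 << (32 - $1) )); }

addrs 20   # /20 (GKE, Services, Data subnets) -> 4096
addrs 16   # /16 (Pod range) -> 65536
```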

Firewall Rules

  1. Allow Internal - All traffic within VPC subnets
  2. Allow NATS Cluster - Ports 4222, 6222, 8222 for NATS
  3. Allow SSH IAP - SSH via Identity-Aware Proxy (35.235.240.0/20)
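A sketch of rule 2 with `gcloud`; the rule name and source range are illustrative (staging would use `10.10.0.0/20`), not copied from the actual Terraform:

```shell
# Allow NATS client (4222), cluster-route (6222), and monitoring (8222)
# traffic between nodes. Name and ranges are placeholders.
gcloud compute firewall-rules create allow-nats-cluster \
  --network=coditect-step-production-vpc \
  --direction=INGRESS \
  --allow=tcp:4222,tcp:6222,tcp:8222 \
  --source-ranges=10.20.0.0/20
```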

Private Service Access

  • Cloud SQL and Redis use private IP addresses only
  • No public endpoints exposed
  • VPC Peering with servicenetworking.googleapis.com
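Private Service Access is typically provisioned as a reserved internal range plus a peering connection to the service networking API; a hedged sketch (the range name is illustrative):

```shell
# Reserve an internal range for Google-managed services (Cloud SQL, Redis).
gcloud compute addresses create google-managed-services-range \
  --global \
  --purpose=VPC_PEERING \
  --prefix-length=16 \
  --network=coditect-step-production-vpc

# Peer the VPC with the service producer network over that range.
gcloud services vpc-peerings connect \
  --service=servicenetworking.googleapis.com \
  --ranges=google-managed-services-range \
  --network=coditect-step-production-vpc
```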

Egress

  • Cloud NAT for outbound internet access
  • All nodes use NAT gateway for updates and external API calls

Compute Resources

GKE Configuration

Staging (Zonal):

  • Location: us-central1-a
  • Nodes: 1-3 (autoscaling)
  • Machine Type: n2-standard-4 (4 vCPU, 16 GB RAM)
  • Disk: 100 GB SSD per node
  • Total: 4-12 vCPU, 16-48 GB RAM

Production (Regional):

  • Location: us-central1 (multi-zone: a, b, c)
  • Nodes: 3-10 (autoscaling, minimum 1 per zone)
  • Machine Type: n2-standard-4 (4 vCPU, 16 GB RAM)
  • Disk: 100 GB SSD per node
  • Total: 12-40 vCPU, 48-160 GB RAM

Workload Identity

All pods use Workload Identity instead of static credentials:

GKE Service Account     →     GCP Service Account
├── flow-api → {env}-flow-api@{project}.iam.gserviceaccount.com
├── nats → {env}-nats@{project}.iam.gserviceaccount.com
└── cloudsql-proxy → {env}-cloudsql-proxy@{project}.iam.gserviceaccount.com
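Each row in this mapping is a two-sided binding: the GCP service account must allow impersonation by the Kubernetes service account, and the KSA must carry the matching annotation. A sketch for `flow-api` in production (`PROJECT_ID` is a placeholder):

```shell
# Allow the KSA to impersonate the GSA via Workload Identity.
gcloud iam service-accounts add-iam-policy-binding \
  production-flow-api@PROJECT_ID.iam.gserviceaccount.com \
  --role=roles/iam.workloadIdentityUser \
  --member="serviceAccount:PROJECT_ID.svc.id.goog[coditect-flow/flow-api]"

# Point the KSA at the GSA.
kubectl annotate serviceaccount flow-api \
  --namespace coditect-flow \
  iam.gke.io/gcp-service-account=production-flow-api@PROJECT_ID.iam.gserviceaccount.com
```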

Data Layer

Cloud SQL PostgreSQL

Staging:

  • Version: PostgreSQL 15
  • Tier: db-custom-2-7680 (2 vCPU, 7.5 GB RAM)
  • Storage: 50 GB SSD
  • Availability: ZONAL
  • Backups: Daily, 7-day retention
  • Private IP only

Production:

  • Version: PostgreSQL 15
  • Tier: db-custom-4-15360 (4 vCPU, 15 GB RAM)
  • Storage: 200 GB SSD (auto-resize enabled)
  • Availability: REGIONAL (HA across zones)
  • Backups: Daily, 14-day retention, PITR enabled
  • Read Replica: Optional (configurable)
  • Private IP only

Database Flags:

max_connections            = 200
shared_buffers             = 262144   # 2 GB (8 KB pages)
work_mem                   = 16384    # 16 MB (KB)
maintenance_work_mem       = 65536    # 64 MB (KB)
effective_cache_size       = 524288   # 4 GB (8 KB pages)
log_min_duration_statement = 1000     # 1 second (ms)
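On Cloud SQL these flags are set through the API rather than `postgresql.conf`; a sketch (the instance name is a placeholder, and note that some flag changes restart the instance):

```shell
gcloud sql instances patch INSTANCE_NAME \
  --database-flags=max_connections=200,shared_buffers=262144,work_mem=16384,maintenance_work_mem=65536,effective_cache_size=524288,log_min_duration_statement=1000
```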

Redis Memorystore

Staging:

  • Tier: BASIC (single instance)
  • Memory: 5 GB
  • Version: Redis 7.0
  • Auth: Enabled
  • Transit Encryption: TLS
  • Private IP only

Production:

  • Tier: STANDARD_HA (primary + replica)
  • Memory: 10 GB
  • Version: Redis 7.0
  • Read Replicas: 1
  • Auth: Enabled
  • Transit Encryption: TLS
  • Persistence: RDB snapshots every 12 hours
  • Private IP only

Redis Configuration:

maxmemory-policy: allkeys-lru
notify-keyspace-events: Ex
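Memorystore accepts these settings at create/update time via `--redis-config`; a hedged sketch of the production instance (the instance name is illustrative, and read replicas are omitted for brevity):

```shell
gcloud redis instances create coditect-flow-redis \
  --region=us-central1 \
  --tier=standard_ha \
  --size=10 \
  --redis-version=redis_7_0 \
  --enable-auth \
  --transit-encryption-mode=SERVER_AUTHENTICATION \
  --redis-config=maxmemory-policy=allkeys-lru,notify-keyspace-events=Ex
```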

NATS JetStream

Staging:

  • Replicas: 1 (single instance)
  • Memory Storage: 1 GB
  • File Storage: 5 GB (Persistent Volume)
  • Resources:
    • Request: 250m CPU, 1 GB RAM
    • Limit: 1000m CPU, 4 GB RAM

Production:

  • Replicas: 3 (clustered)
  • Memory Storage: 2 GB
  • File Storage: 20 GB per replica (60 GB total)
  • Resources:
    • Request: 500m CPU, 2 GB RAM
    • Limit: 2000m CPU, 8 GB RAM

NATS Features:

  • JetStream enabled for persistent messaging
  • Cluster mode for HA
  • Prometheus metrics exporter
  • Workload Identity for GCS backups
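If NATS is deployed with the official Helm chart, the production topology above corresponds roughly to values like these (key names assume the `nats` chart and may differ between chart versions):

```yaml
# Sketch of Helm values for a 3-node JetStream cluster (chart-version dependent).
config:
  cluster:
    enabled: true
    replicas: 3
  jetstream:
    enabled: true
    memoryStore:
      enabled: true
      maxSize: 2Gi
    fileStore:
      enabled: true
      pvc:
        size: 20Gi
promExporter:
  enabled: true
```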

Security

IAM and Service Accounts

GKE Node Service Account:

  • roles/logging.logWriter
  • roles/monitoring.metricWriter
  • roles/monitoring.viewer
  • roles/artifactregistry.reader

Flow API Service Account:

  • roles/cloudsql.client
  • roles/secretmanager.secretAccessor
  • roles/storage.objectViewer

NATS Service Account:

  • roles/storage.admin (for JetStream backups)

Cloud SQL Proxy Service Account:

  • roles/cloudsql.client

Network Security

  • Private Cluster: GKE nodes use private IPs
  • Private Endpoints: Cloud SQL and Redis not publicly accessible
  • TLS Everywhere: All traffic encrypted in transit
  • mTLS: Service mesh for pod-to-pod encryption (optional)

Secrets Management

  • Database passwords stored in Kubernetes Secrets
  • Redis AUTH strings stored in Kubernetes Secrets
  • Secrets synced from GCP Secret Manager (recommended for production)
  • No hardcoded credentials in code or configs
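A sketch of the recommended Secret Manager flow (the secret name and service account are illustrative):

```shell
# Create the secret and add a version from an environment variable
# (never commit the value to a repo).
gcloud secrets create flow-db-password --replication-policy=automatic
printf '%s' "$DB_PASSWORD" | gcloud secrets versions add flow-db-password --data-file=-

# Grant read access to the workload's service account only.
gcloud secrets add-iam-policy-binding flow-db-password \
  --role=roles/secretmanager.secretAccessor \
  --member="serviceAccount:production-flow-api@PROJECT_ID.iam.gserviceaccount.com"
```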

High Availability

Staging (Single-Zone)

  • GKE: Single zone, node autoscaling
  • Cloud SQL: ZONAL (single instance)
  • Redis: BASIC (single instance)
  • NATS: 1 replica
  • RTO: ~15 minutes (manual intervention)
  • RPO: ~5 minutes (last transaction log)

Production (Multi-Zone)

  • GKE: Regional (3 zones), node autoscaling
  • Cloud SQL: REGIONAL (automatic failover)
  • Redis: STANDARD_HA (automatic failover)
  • NATS: 3-node cluster (Raft consensus)
  • RTO: ~2 minutes (automatic failover)
  • RPO: ~1 minute (synchronous replication)

Monitoring and Observability

Metrics Collection

  • GKE: Managed Prometheus for workloads
  • Cloud SQL: Query Insights, slow query logs
  • Redis: Memorystore metrics
  • NATS: Prometheus exporter on port 7777

Logging

  • GKE: Cloud Logging for all container logs
  • Cloud SQL: Slow query logs, error logs
  • Application: Structured JSON logs

Alerting

Recommended alerts:

  • GKE node CPU/memory > 80%
  • Cloud SQL connections > 180/200
  • Redis memory > 90%
  • NATS cluster degraded
  • Application error rate > 1%
  • Application p99 latency > 500ms
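As a Prometheus rule, the error-rate alert might look like the following; the metric name `http_requests_total` is an assumption about the application's instrumentation:

```yaml
groups:
  - name: coditect-flow-alerts
    rules:
      - alert: FlowApiHighErrorRate
        # Assumes the API exports http_requests_total with a status label.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Flow API 5xx error rate above 1% for 5 minutes"
```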

Cost Breakdown

Staging Environment

| Resource         | Configuration             | Monthly Cost       |
|------------------|---------------------------|--------------------|
| GKE Cluster      | Zonal, 1-3 nodes          | $75-225            |
| Cloud SQL        | ZONAL, 2 vCPU, 7.5 GB RAM | $120               |
| Redis            | BASIC, 5 GB               | $40                |
| Networking       | NAT, VPC peering          | $50                |
| Persistent Disks | NATS storage              | $5                 |
| **Total**        |                           | **$290-440/month** |

Production Environment

| Resource         | Configuration                  | Monthly Cost        |
|------------------|--------------------------------|---------------------|
| GKE Cluster      | Regional, 3-10 nodes           | $225-750            |
| Cloud SQL        | REGIONAL HA, 4 vCPU, 15 GB RAM | $400                |
| Redis            | STANDARD_HA, 10 GB             | $150                |
| Networking       | NAT, VPC peering               | $100                |
| Persistent Disks | NATS storage (60 GB)           | $12                 |
| **Total**        |                                | **$887-1412/month** |

Cost Optimization Strategies:

  • Use preemptible nodes for non-critical workloads
  • Enable cluster autoscaling
  • Use committed use discounts for predictable workloads
  • Implement pod autoscaling to reduce idle resources
  • Use Cloud SQL read replicas only when needed

Disaster Recovery

Backup Strategy

Cloud SQL:

  • Automated daily backups at 02:00 UTC
  • Point-in-time recovery (PITR) enabled (production)
  • Transaction logs retained for 7 days
  • Backups retained for 7-14 days
  • Backups stored in multi-region

Redis:

  • RDB snapshots every 12 hours (production)
  • No persistence in staging (cache only)

NATS JetStream:

  • Persistent volumes backed up via GKE snapshots
  • Optional GCS backup via custom script

Terraform State:

  • Stored in GCS with versioning enabled
  • 10 previous versions retained
  • Bucket in multi-region

Recovery Procedures

Cloud SQL Restore:

gcloud sql backups restore BACKUP_ID \
  --restore-instance=TARGET_INSTANCE \
  --backup-instance=SOURCE_INSTANCE \
  --backup-project=coditect-citus-prod

NATS Restore:

kubectl get pvc -n coditect-flow
# Restore from GKE volume snapshot
gcloud compute disks create DISK_NAME --source-snapshot=SNAPSHOT_NAME

Infrastructure Restore:

cd infra/environments/production
tofu init
tofu plan
tofu apply

Scaling Strategy

Horizontal Pod Autoscaler (HPA)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: flow-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: flow-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80

Cluster Autoscaler

Automatically adjusts node count based on pod resource requests:

  • Staging: 1-3 nodes
  • Production: 3-10 nodes
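Autoscaling bounds are set per node pool; a sketch for production (cluster and pool names are placeholders, and the `--total-*` flags, which bound the pool across all zones of a regional cluster, require a reasonably recent gcloud):

```shell
gcloud container clusters update CLUSTER_NAME \
  --region=us-central1 \
  --node-pool=default-pool \
  --enable-autoscaling \
  --total-min-nodes=3 \
  --total-max-nodes=10
```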

Database Scaling

Vertical Scaling (Cloud SQL):

gcloud sql instances patch INSTANCE_NAME \
  --tier=db-custom-8-30720

Note: changing the tier restarts the instance, causing brief downtime.

Horizontal Scaling:

  • Enable read replica for read-heavy workloads
  • Connection pooling (PgBouncer) for high concurrency

Maintenance Windows

GKE:

  • Daily maintenance window: 03:00-07:00 UTC
  • Release channel: REGULAR (stable, predictable updates)

Cloud SQL:

  • Weekly maintenance: Sunday, 03:00-04:00 UTC
  • Update track: stable

Redis:

  • Weekly maintenance: Sunday, 03:00-04:00 UTC

Recommended Deployment Windows:

  • Staging: Anytime (low traffic)
  • Production: Tuesday-Thursday, 02:00-04:00 UTC (lowest traffic)

Compliance and Governance

Data Residency

  • All data stored in us-central1 region
  • No cross-region replication (configurable)
  • VPC Service Controls for additional isolation (optional)

Audit Logging

  • Cloud Audit Logs enabled for all services
  • Admin activity logs (always on)
  • Data access logs (configurable)
  • System event logs

Encryption

  • At Rest: AES-256 encryption for all data (Cloud SQL, Redis, GKE disks)
  • In Transit: TLS 1.3 for all connections
  • Keys: Google-managed encryption keys (CMEK optional)

Next Steps

After infrastructure deployment:

  1. Deploy Flow Platform application containers
  2. Configure Ingress with SSL certificates (Let's Encrypt)
  3. Set up monitoring dashboards (Grafana)
  4. Configure alerting rules (Prometheus Alertmanager)
  5. Implement automated backups and disaster recovery testing
  6. Enable Binary Authorization for container security
  7. Implement CI/CD pipeline for deployments

Track: AO.19 | Owner: AZ1.AI INC | Lead: Hal Casteel | Version: 1.0.0 | Updated: February 7, 2026