CODITECT Flow Platform - Infrastructure Architecture
Technical architecture documentation for the GCP deployment.
Architecture Overview
┌─────────────────────────────────────────────────────────────────────┐
│ CODITECT Flow Platform │
│ Google Cloud Platform │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ VPC Network │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ GKE Subnet (10.x.0.0/20) │ │
│ │ ┌────────────────────────────────────────────────────────┐ │ │
│ │ │ GKE Cluster (Regional/Zonal) │ │ │
│ │ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │ │
│ │ │ │ Node Pool │ │ Node Pool │ │ Node Pool │ │ │ │
│ │ │ │ n2-std-4 │ │ n2-std-4 │ │ n2-std-4 │ │ │ │
│ │ │ │ │ │ │ │ │ │ │ │
│ │ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │ │ │
│ │ │ │ │Flow API │ │ │ │NATS │ │ │ │Workers │ │ │ │ │
│ │ │ │ │Pods │ │ │ │JetStream │ │ │ │Pods │ │ │ │ │
│ │ │ │ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │ │ │ │
│ │ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │ │
│ │ │ │ │ │
│ │ │ Pods: 10.x+1.0.0/16 | Services: 10.x+2.0.0/20 │ │ │
│ │ └────────────────────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Data Subnet (10.x+3.0.0/20) - Private Service Access │ │
│ │ ┌────────────────┐ ┌────────────────┐ │ │
│ │ │ Cloud SQL │ │ Redis │ │ │
│ │ │ PostgreSQL 15 │ │ Memorystore │ │ │
│ │ │ Private IP │ │ Private IP │ │ │
│ │ └────────────────┘ └────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Cloud NAT → Internet │ │
│ └──────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ Supporting Services │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ IAM/WI │ │ Secret │ │ Monitoring │ │ Logging │ │
│ │ Service │ │ Manager │ │ Prometheus │ │ Cloud │ │
│ │ Accounts │ │ │ │ Grafana │ │ Logging │ │
│ └────────────┘ └────────────┘ └────────────┘ └────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
Network Architecture
VPC Design
Staging:
- Network: coditect-step-staging-vpc
- GKE Subnet: 10.10.0.0/20 (4,096 IPs)
- GKE Pods: 10.11.0.0/16 (65,536 IPs)
- GKE Services: 10.12.0.0/20 (4,096 IPs)
- Data Subnet: 10.13.0.0/20 (4,096 IPs)
Production:
- Network: coditect-step-production-vpc
- GKE Subnet: 10.20.0.0/20 (4,096 IPs)
- GKE Pods: 10.21.0.0/16 (65,536 IPs)
- GKE Services: 10.22.0.0/20 (4,096 IPs)
- Data Subnet: 10.23.0.0/20 (4,096 IPs)
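The IP counts in parentheses follow directly from the CIDR prefix lengths (2^(32 - prefix)); a quick sanity check using only the standard library:

```python
import ipaddress

# Staging VPC ranges from the list above; production uses the same prefix
# lengths shifted to 10.2x.0.0.
subnets = {
    "gke-subnet": "10.10.0.0/20",
    "gke-pods": "10.11.0.0/16",
    "gke-services": "10.12.0.0/20",
    "data-subnet": "10.13.0.0/20",
}

for name, cidr in subnets.items():
    net = ipaddress.ip_network(cidr)
    # A /20 holds 4,096 addresses; a /16 holds 65,536.
    print(f"{name}: {net.num_addresses} addresses")
```

The non-overlapping second octets (10-13 staging, 20-23 production) also make cross-environment VPC peering possible later without renumbering.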
Firewall Rules
- Allow Internal - All traffic within VPC subnets
- Allow NATS Cluster - Ports 4222, 6222, 8222 for NATS
- Allow SSH IAP - SSH via Identity-Aware Proxy (35.235.240.0/20)
Private Service Access
- Cloud SQL and Redis use private IP addresses only
- No public endpoints exposed
- VPC Peering with servicenetworking.googleapis.com
Egress
- Cloud NAT for outbound internet access
- All nodes use NAT gateway for updates and external API calls
Compute Resources
GKE Configuration
Staging (Zonal):
- Location: us-central1-a
- Nodes: 1-3 (autoscaling)
- Machine Type: n2-standard-4 (4 vCPU, 16 GB RAM)
- Disk: 100 GB SSD per node
- Total: 4-12 vCPU, 16-48 GB RAM
Production (Regional):
- Location: us-central1 (multi-zone: a, b, c)
- Nodes: 3-10 (autoscaling, minimum 1 per zone)
- Machine Type: n2-standard-4 (4 vCPU, 16 GB RAM)
- Disk: 100 GB SSD per node
- Total: 12-40 vCPU, 48-160 GB RAM
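The capacity totals are simply the autoscaling node range multiplied by the per-node n2-standard-4 resources; the arithmetic as a sketch:

```python
# n2-standard-4 machine shape.
VCPU_PER_NODE = 4
RAM_GB_PER_NODE = 16

def capacity(min_nodes, max_nodes):
    """Return ((min_vcpu, max_vcpu), (min_ram_gb, max_ram_gb)) for a node range."""
    return (
        (min_nodes * VCPU_PER_NODE, max_nodes * VCPU_PER_NODE),
        (min_nodes * RAM_GB_PER_NODE, max_nodes * RAM_GB_PER_NODE),
    )

print(capacity(1, 3))   # staging:    ((4, 12), (16, 48))
print(capacity(3, 10))  # production: ((12, 40), (48, 160))
```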
Workload Identity
All pods use Workload Identity instead of static credentials:
GKE Service Account → GCP Service Account
├── flow-api → {env}-flow-api@{project}.iam.gserviceaccount.com
├── nats → {env}-nats@{project}.iam.gserviceaccount.com
└── cloudsql-proxy → {env}-cloudsql-proxy@{project}.iam.gserviceaccount.com
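The KSA-to-GSA mapping above is realized by an IAM binding whose member string has a fixed format (`serviceAccount:PROJECT_ID.svc.id.goog[NAMESPACE/KSA_NAME]`). A small sketch of how the names compose; the project id is an illustrative placeholder, and `coditect-flow` is the namespace used elsewhere in this document:

```python
# Workload Identity binds a Kubernetes service account (KSA) to a GCP service
# account (GSA) through an IAM member of this form.

def wi_member(project: str, namespace: str, ksa: str) -> str:
    """IAM member granted roles/iam.workloadIdentityUser on the GSA."""
    return f"serviceAccount:{project}.svc.id.goog[{namespace}/{ksa}]"

def gsa_email(env: str, name: str, project: str) -> str:
    """GSA email following this document's {env}-{name}@{project} convention."""
    return f"{env}-{name}@{project}.iam.gserviceaccount.com"

print(wi_member("my-project", "coditect-flow", "flow-api"))
print(gsa_email("production", "flow-api", "my-project"))
```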
Data Layer
Cloud SQL PostgreSQL
Staging:
- Version: PostgreSQL 15
- Tier: db-custom-2-7680 (2 vCPU, 7.5 GB RAM)
- Storage: 50 GB SSD
- Availability: ZONAL
- Backups: Daily, 7-day retention
- Private IP only
Production:
- Version: PostgreSQL 15
- Tier: db-custom-4-15360 (4 vCPU, 15 GB RAM)
- Storage: 200 GB SSD (auto-resize enabled)
- Availability: REGIONAL (HA across zones)
- Backups: Daily, 14-day retention, PITR enabled
- Read Replica: Optional (configurable)
- Private IP only
Database Flags:
max_connections = 200
shared_buffers = 262144              # 2 GB (8 kB pages)
work_mem = 16384                     # 16 MB (kB)
maintenance_work_mem = 65536         # 64 MB (kB)
effective_cache_size = 524288        # 4 GB (8 kB pages)
log_min_duration_statement = 1000    # 1 second (ms)
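These flags are easy to misread because PostgreSQL's memory parameters use different base units: `shared_buffers` and `effective_cache_size` are counted in 8 kB pages, while `work_mem` and `maintenance_work_mem` are in kB. A quick conversion sketch:

```python
PAGE_KB = 8  # shared_buffers / effective_cache_size unit (8 kB pages)

# (raw value, unit in kB) per PostgreSQL's documented units.
flags = {
    "shared_buffers": (262144, PAGE_KB),
    "work_mem": (16384, 1),
    "maintenance_work_mem": (65536, 1),
    "effective_cache_size": (524288, PAGE_KB),
}

for name, (value, unit_kb) in flags.items():
    mb = value * unit_kb // 1024
    print(f"{name} = {value} -> {mb} MB")
```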
Redis Memorystore
Staging:
- Tier: BASIC (single instance)
- Memory: 5 GB
- Version: Redis 7.0
- Auth: Enabled
- Transit Encryption: TLS
- Private IP only
Production:
- Tier: STANDARD_HA (primary + replica)
- Memory: 10 GB
- Version: Redis 7.0
- Read Replicas: 1
- Auth: Enabled
- Transit Encryption: TLS
- Persistence: RDB snapshots every 12 hours
- Private IP only
Redis Configuration:
maxmemory-policy: allkeys-lru
notify-keyspace-events: Ex
NATS JetStream
Staging:
- Replicas: 1 (single instance)
- Memory Storage: 1 GB
- File Storage: 5 GB (Persistent Volume)
- Resources:
- Request: 250m CPU, 1 GB RAM
- Limit: 1000m CPU, 4 GB RAM
Production:
- Replicas: 3 (clustered)
- Memory Storage: 2 GB
- File Storage: 20 GB per replica (60 GB total)
- Resources:
- Request: 500m CPU, 2 GB RAM
- Limit: 2000m CPU, 8 GB RAM
NATS Features:
- JetStream enabled for persistent messaging
- Cluster mode for HA
- Prometheus metrics exporter
- Workload Identity for GCS backups
Security
IAM and Service Accounts
GKE Node Service Account:
- roles/logging.logWriter
- roles/monitoring.metricWriter
- roles/monitoring.viewer
- roles/artifactregistry.reader
Flow API Service Account:
- roles/cloudsql.client
- roles/secretmanager.secretAccessor
- roles/storage.objectViewer
NATS Service Account:
- roles/storage.admin (for JetStream backups)
Cloud SQL Proxy Service Account:
- roles/cloudsql.client
Network Security
- Private Cluster: GKE nodes use private IPs
- Private Endpoints: Cloud SQL and Redis not publicly accessible
- TLS Everywhere: All traffic encrypted in transit
- mTLS: Service mesh for pod-to-pod encryption (optional)
Secrets Management
- Database passwords stored in Kubernetes Secrets
- Redis AUTH strings stored in Kubernetes Secrets
- Secrets synced from GCP Secret Manager (recommended for production)
- No hardcoded credentials in code or configs
High Availability
Staging (Single-Zone)
- GKE: Single zone, node autoscaling
- Cloud SQL: ZONAL (single instance)
- Redis: BASIC (single instance)
- NATS: 1 replica
- RTO: ~15 minutes (manual intervention)
- RPO: ~5 minutes (last transaction log)
Production (Multi-Zone)
- GKE: Regional (3 zones), node autoscaling
- Cloud SQL: REGIONAL (automatic failover)
- Redis: STANDARD_HA (automatic failover)
- NATS: 3-node cluster (Raft consensus)
- RTO: ~2 minutes (automatic failover)
- RPO: ~1 minute (synchronous replication)
Monitoring and Observability
Metrics Collection
- GKE: Managed Prometheus for workloads
- Cloud SQL: Query Insights, slow query logs
- Redis: Memorystore metrics
- NATS: Prometheus exporter on port 7777
Logging
- GKE: Cloud Logging for all container logs
- Cloud SQL: Slow query logs, error logs
- Application: Structured JSON logs
Alerting
Recommended alerts:
- GKE node CPU/memory > 80%
- Cloud SQL connections > 180/200
- Redis memory > 90%
- NATS cluster degraded
- Application error rate > 1%
- Application p99 latency > 500ms
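The thresholds above can be encoded as a simple evaluation table; this is an illustration of the alert conditions, not the Prometheus rule syntax you would actually deploy:

```python
# Alert thresholds from the list above (metric names are arbitrary).
THRESHOLDS = {
    "node_cpu_pct": 80,
    "sql_connections": 180,   # of max_connections = 200
    "redis_memory_pct": 90,
    "error_rate_pct": 1,
    "p99_latency_ms": 500,
}

def firing(metrics: dict) -> list:
    """Return the names of metrics currently exceeding their thresholds."""
    return [name for name, value in metrics.items() if value > THRESHOLDS[name]]

print(firing({"node_cpu_pct": 85, "sql_connections": 150,
              "redis_memory_pct": 95, "error_rate_pct": 0.2,
              "p99_latency_ms": 120}))
```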
Cost Breakdown
Staging Environment
| Resource | Configuration | Monthly Cost |
|---|---|---|
| GKE Cluster | Zonal, 1-3 nodes | $75-225 |
| Cloud SQL | ZONAL, 2 vCPU, 7.5 GB RAM | $120 |
| Redis | BASIC, 5 GB | $40 |
| Networking | NAT, VPC peering | $50 |
| Persistent Disks | NATS storage | $5 |
| Total | | $290-440/month |
Production Environment
| Resource | Configuration | Monthly Cost |
|---|---|---|
| GKE Cluster | Regional, 3-10 nodes | $225-750 |
| Cloud SQL | REGIONAL HA, 4 vCPU, 15 GB RAM | $400 |
| Redis | STANDARD_HA, 10 GB | $150 |
| Networking | NAT, VPC peering | $100 |
| Persistent Disks | NATS storage (60 GB) | $12 |
| Total | | $887-1412/month |
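The totals are the sums of the line items, with the GKE autoscaling range driving the low/high spread:

```python
# (low, high) monthly USD estimates from the tables above; fixed-cost items
# use the same value for both bounds.
staging = [(75, 225), (120, 120), (40, 40), (50, 50), (5, 5)]
production = [(225, 750), (400, 400), (150, 150), (100, 100), (12, 12)]

def total(items):
    return (sum(low for low, _ in items), sum(high for _, high in items))

print(total(staging))     # (290, 440)
print(total(production))  # (887, 1412)
```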
Cost Optimization Strategies:
- Use preemptible nodes for non-critical workloads
- Enable cluster autoscaling
- Use committed use discounts for predictable workloads
- Implement pod autoscaling to reduce idle resources
- Use Cloud SQL read replicas only when needed
Disaster Recovery
Backup Strategy
Cloud SQL:
- Automated daily backups at 02:00 UTC
- Point-in-time recovery (PITR) enabled (production)
- Transaction logs retained for 7 days
- Backups retained for 7-14 days
- Backups stored in multi-region
Redis:
- RDB snapshots every 12 hours (production)
- No persistence in staging (cache only)
NATS JetStream:
- Persistent volumes backed up via GKE snapshots
- Optional GCS backup via custom script
Terraform State:
- Stored in GCS with versioning enabled
- 10 previous versions retained
- Bucket in multi-region
Recovery Procedures
Cloud SQL Restore:
gcloud sql backups restore BACKUP_ID \
  --backup-instance=SOURCE_INSTANCE \
  --backup-project=coditect-citus-prod \
  --restore-instance=TARGET_INSTANCE
NATS Restore:
kubectl get pvc -n coditect-flow
# Restore from GKE volume snapshot
gcloud compute disks create DISK_NAME --source-snapshot=SNAPSHOT_NAME
Infrastructure Restore:
cd infra/environments/production
tofu init
tofu plan
tofu apply
Scaling Strategy
Horizontal Pod Autoscaler (HPA)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: flow-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: flow-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
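Kubernetes computes the desired replica count per metric as `ceil(currentReplicas * currentUtilization / targetUtilization)`, takes the largest result across metrics, and clamps it to the min/max bounds; a sketch of that rule with this HPA's settings:

```python
import math

def desired_replicas(current, utilizations, targets, min_r=3, max_r=20):
    """HPA scaling rule: largest per-metric desired count, clamped to bounds."""
    desired = max(math.ceil(current * u / t) for u, t in zip(utilizations, targets))
    return max(min_r, min(max_r, desired))

# 5 replicas, CPU at 90% vs a 70% target, memory at 60% vs an 80% target:
print(desired_replicas(5, [90, 60], [70, 80]))  # 7 (CPU metric dominates)
```

(The real controller also applies a tolerance band and stabilization windows before acting, so small deviations do not trigger scaling.)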
Cluster Autoscaler
Automatically adjusts node count based on pod resource requests:
- Staging: 1-3 nodes
- Production: 3-10 nodes
Database Scaling
Vertical Scaling (Cloud SQL):
gcloud sql instances patch INSTANCE_NAME \
  --tier=db-custom-8-30720
Horizontal Scaling:
- Enable read replica for read-heavy workloads
- Connection pooling (PgBouncer) for high concurrency
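With `max_connections = 200`, an alert threshold of 180, and the HPA allowing up to 20 API pods, a per-pod pool must be sized so the fleet stays under the threshold at full scale. The sizing rule below is an illustration, not a prescribed formula:

```python
# Budget client-side pools against the Cloud SQL connection limit.
MAX_CONNECTIONS = 200   # Cloud SQL max_connections flag
ALERT_AT = 180          # alert threshold from the monitoring section
MAX_PODS = 20           # HPA maxReplicas for flow-api

pool_size = ALERT_AT // MAX_PODS  # connections each pod may hold
print(f"per-pod pool: {pool_size}, worst case total: {pool_size * MAX_PODS}")
```

A PgBouncer tier in transaction-pooling mode relaxes this budget further, since many client connections share few server connections.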
Maintenance Windows
GKE:
- Daily maintenance window: 03:00-07:00 UTC
- Release channel: REGULAR (stable, predictable updates)
Cloud SQL:
- Weekly maintenance: Sunday, 03:00-04:00 UTC
- Update track: stable
Redis:
- Weekly maintenance: Sunday, 03:00-04:00 UTC
Recommended Deployment Windows:
- Staging: Anytime (low traffic)
- Production: Tuesday-Thursday, 02:00-04:00 UTC (lowest traffic)
Compliance and Governance
Data Residency
- All data stored in us-central1 region
- No cross-region replication (configurable)
- VPC Service Controls for additional isolation (optional)
Audit Logging
- Cloud Audit Logs enabled for all services
- Admin activity logs (always on)
- Data access logs (configurable)
- System event logs
Encryption
- At Rest: AES-256 encryption for all data (Cloud SQL, Redis, GKE disks)
- In Transit: TLS 1.3 for all connections
- Keys: Google-managed encryption keys (CMEK optional)
Next Steps
After infrastructure deployment:
- Deploy Flow Platform application containers
- Configure Ingress with SSL certificates (Let's Encrypt)
- Set up monitoring dashboards (Grafana)
- Configure alerting rules (Prometheus Alertmanager)
- Implement automated backups and disaster recovery testing
- Enable Binary Authorization for container security
- Implement CI/CD pipeline for deployments
Track: AO.19 | Owner: AZ1.AI INC | Lead: Hal Casteel | Version: 1.0.0 | Updated: February 7, 2026