# GKE Capacity Planning - Coditect IDE

Date: 2025-10-26
Purpose: Calculate user capacity and provide scaling recommendations

## Current Configuration Analysis

### Deployed Resources (from `terraform/environments/prod/values.yaml`)
Per Pod:

```yaml
resources:
  requests:
    memory: "512Mi"   # 0.5 GB
    cpu: "500m"       # 0.5 vCPU
  limits:
    memory: "2Gi"     # 2 GB
    cpu: "2000m"      # 2 vCPU
```
Cluster Configuration:
- Replicas: 3 StatefulSet pods
- Session Affinity: Enabled (users stick to one pod)
- Storage: 50Gi workspace + 5Gi config per pod
## User Capacity - Current Config

### Light Development Users
Resource Requirements: 1-2 vCPU, 2-4 GB RAM per user
Calculation (using conservative 2 vCPU + 2 GB RAM):
- CPU per pod: 2000m limit / 2000m per user = 1 user
- RAM per pod: 2048Mi limit / 2048Mi per user = 1 user
- Storage per pod: 50Gi / 10Gi per user = 5 users (not limiting factor)
Bottleneck: CPU and RAM (each limits a pod to 1 user)

Total capacity: 3 concurrent users (1 per pod)
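The per-pod arithmetic above can be sketched as a small calculation (a hedged sketch; the figures mirror the limits in `values.yaml` and the conservative light-dev requirement of 2 vCPU + 2 GB per user):

```javascript
// Users a pod can host: the tighter of the CPU and RAM limits wins.
function usersPerPod(cpuLimitM, ramLimitMi, cpuPerUserM, ramPerUserMi) {
  const byCpu = Math.floor(cpuLimitM / cpuPerUserM);
  const byRam = Math.floor(ramLimitMi / ramPerUserMi);
  return Math.min(byCpu, byRam); // the bottleneck resource decides
}

// 2000m / 2Gi limits vs. a 2 vCPU / 2 GB light-dev user
const perPod = usersPerPod(2000, 2048, 2000, 2048); // 1 user per pod
const totalUsers = perPod * 3; // 3 replicas → 3 concurrent light users
```

The same function shows why the Scenario 1 pods (4000m / 8Gi) fit 2 light users each.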
### Full Development Users
Resource Requirements: 2-4 vCPU, 6-8 GB RAM per user
Result: ❌ NOT SUPPORTED
- Pods only have 2Gi RAM limit
- Full development needs 6-8 GB minimum
- Would require pod resource increase to 8Gi+ RAM
## Scaling Scenarios

### Scenario 1: Small Team (10-20 Users)
Configuration:
- Replicas: 10 pods
- CPU per pod: 4000m (4 vCPU)
- RAM per pod: 8Gi (8 GB)
- Storage per pod: 100Gi
Capacity:
- Light users: 20 concurrent (2 per pod)
- Full dev users: 10 concurrent (1 per pod)
- Cost: ~$400-600/month
Terraform values.yaml changes:

```yaml
replicaCount: 10

resources:
  requests:
    memory: "4Gi"
    cpu: "2000m"
  limits:
    memory: "8Gi"
    cpu: "4000m"

persistence:
  workspace:
    size: 100Gi
```
### Scenario 2: Mid-Size Team (50-100 Users)
Configuration:
- Replicas: 30-50 pods (autoscaled)
- CPU per pod: 4000m (4 vCPU)
- RAM per pod: 8Gi (8 GB)
- Storage per pod: 100Gi
- Autoscaling: Enabled (min 30, max 50)
Capacity:
- Light users: 100 concurrent (2 per pod)
- Full dev users: 50 concurrent (1 per pod)
- Cost: ~$2,000-3,000/month
Terraform values.yaml changes:

```yaml
replicaCount: 30

autoscaling:
  enabled: true
  minReplicas: 30
  maxReplicas: 50
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 75

resources:
  requests:
    memory: "4Gi"
    cpu: "2000m"
  limits:
    memory: "8Gi"
    cpu: "4000m"
```
### Scenario 3: Enterprise (500+ Users)
Configuration:
- Replicas: 250 pods (autoscaling)
- CPU per pod: 4000m (4 vCPU)
- RAM per pod: 8Gi (8 GB)
- Storage per pod: 200Gi
- Multi-region deployment
- Regional node pools with affinity
Capacity:
- Light users: 500 concurrent (2 per pod)
- Full dev users: 250 concurrent (1 per pod)
- Cost: ~$10,000-15,000/month
Terraform values.yaml changes:

```yaml
replicaCount: 100

autoscaling:
  enabled: true
  minReplicas: 100
  maxReplicas: 250
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 75

resources:
  requests:
    memory: "6Gi"
    cpu: "3000m"
  limits:
    memory: "8Gi"
    cpu: "4000m"

persistence:
  workspace:
    size: 200Gi
```
## Recommended Next Steps

### Immediate Actions (Current Config - 3 Users Max)

❌ Current config is NOT production-ready - it only supports 3 light users.
Must do before production launch:
- Increase pod resources (lines 39-45 in `terraform/environments/prod/values.yaml`):

  ```yaml
  resources:
    requests:
      memory: "4Gi"    # Was 512Mi
      cpu: "2000m"     # Was 500m
    limits:
      memory: "8Gi"    # Was 2Gi
      cpu: "4000m"     # Was 2000m
  ```

- Scale replicas based on user count:
  - 10 users → 10 pods
  - 50 users → 50 pods
  - 100 users → 100 pods
- Enable autoscaling (lines 117-122):

  ```yaml
  autoscaling:
    enabled: true    # Was false
    minReplicas: 10
    maxReplicas: 30
  ```

- Increase workspace storage (line 30):

  ```yaml
  persistence:
    workspace:
      size: 100Gi    # Was 50Gi
  ```
### GKE Cluster Requirements

For 10-20 users:
- Node pool: 3-5 nodes
- Machine type: e2-standard-8 (8 vCPU, 32 GB RAM)
- Total cluster: 24-40 vCPU, 96-160 GB RAM

For 50-100 users:
- Node pool: 15-25 nodes
- Machine type: e2-standard-8 or e2-standard-16
- Total cluster: 120-400 vCPU, 480-1600 GB RAM

For 500+ users:
- Multiple node pools (3+ per region)
- Machine type: e2-standard-16 or e2-highmem-16
- Total cluster: 1000+ vCPU, 4000+ GB RAM
- Multi-region deployment recommended
## Cost Estimates

### Per User Monthly Cost
| User Type | vCPU | RAM | Storage | GKE Cost |
|---|---|---|---|---|
| Light Dev | 2 | 2 GB | 50 GB | $20-30 |
| Full Dev | 4 | 8 GB | 100 GB | $40-60 |
Note: Costs include GKE compute, persistent disk, and load balancer
### Total Monthly Costs
| User Count | Light Dev | Full Dev | Mixed (50/50) |
|---|---|---|---|
| 10 | $200-300 | $400-600 | $300-450 |
| 50 | $1,000-1,500 | $2,000-3,000 | $1,500-2,250 |
| 100 | $2,000-3,000 | $4,000-6,000 | $3,000-4,500 |
| 500 | $10,000-15,000 | $20,000-30,000 | $15,000-22,500 |
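The mixed column is a 50/50 blend of the two per-user rates; as a quick sanity check (a small sketch, with per-user rates taken from the table above):

```javascript
// Per-user monthly cost ranges ($/user/month), from the table above
const PER_USER = { light: [20, 30], full: [40, 60] };

// Total monthly range for n users of one type
const monthly = (n, [lo, hi]) => [n * lo, n * hi];

// 50/50 mix: half the users at each rate
function mixed(n) {
  const [ll, lh] = monthly(n / 2, PER_USER.light);
  const [fl, fh] = monthly(n / 2, PER_USER.full);
  return [ll + fl, lh + fh];
}

const row10 = mixed(10); // [300, 450] — matches the 10-user mixed column
```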
Additional costs:
- Networking egress: ~5-10% of compute
- FoundationDB cluster: $500-2,000/month (depending on size)
- Cloud Build: ~$50-200/month
- Load balancer: ~$20-50/month
## Resource Optimization Strategies

### 1. Right-Size Pods
Monitor actual usage:
```bash
# Check CPU/RAM usage per pod
kubectl top pods -n coditect-app

# Get resource usage over time
kubectl get hpa -n coditect-app --watch
```
Adjust based on metrics:
- If CPU consistently < 50%: Reduce CPU limits
- If RAM consistently < 50%: Reduce RAM limits
- If pods are OOMKilled: Increase RAM limits
### 2. Workspace Storage Optimization
Per-user storage tiers:
- Free tier: 10Gi
- Starter: 50Gi
- Pro: 100Gi
- Enterprise: 200Gi+
Implement in code:

```javascript
// Map the user's plan to a workspace PVC size,
// falling back to the free tier for unknown plans
const WORKSPACE_SIZES = {
  free: '10Gi',
  starter: '50Gi',
  pro: '100Gi',
  enterprise: '200Gi',
};

const workspaceSize = WORKSPACE_SIZES[user.plan] ?? WORKSPACE_SIZES.free;
```
### 3. Enable Autoscaling
HPA (Horizontal Pod Autoscaler):
```yaml
autoscaling:
  enabled: true
  minReplicas: 10
  maxReplicas: 50
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 75
```
VPA (Vertical Pod Autoscaler) - Optional:
- Automatically adjust pod resource requests/limits
- Based on historical usage patterns
- Requires VPA controller installed on cluster
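A minimal recommend-only VPA object might look like the following (a sketch: it assumes the VPA CRDs are installed and that the StatefulSet is named `coditect-combined` in the `coditect-app` namespace):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: coditect-combined-vpa   # illustrative name
  namespace: coditect-app
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: coditect-combined
  updatePolicy:
    updateMode: "Off"   # recommend-only; set "Auto" to let VPA apply changes
```

Running in `"Off"` mode first lets you compare VPA's recommendations against the manual limits above before trusting it with restarts.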
### 4. Node Pool Optimization
Use preemptible nodes for dev/staging:
- 60-80% cost savings
- Automatic recreation on preemption
- Not recommended for production
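In Terraform this is a single flag on the node pool (an illustrative sketch; the cluster resource name `google_container_cluster.primary` is assumed):

```hcl
# Illustrative dev/staging node pool using preemptible VMs
resource "google_container_node_pool" "dev_preemptible" {
  name       = "dev-preemptible"
  cluster    = google_container_cluster.primary.name
  node_count = 3

  node_config {
    machine_type = "e2-standard-8"
    preemptible  = true  # ~60-80% cheaper; VMs can be reclaimed at any time
  }
}
```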
Node affinity for workload tiers:

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: workload-tier
              operator: In
              values:
                - premium  # For paid users
```
## Monitoring & Alerts

### Key Metrics to Track
Pod-level:
- CPU utilization (target: 60-80%)
- RAM utilization (target: 60-80%)
- Disk I/O (workspace PVC)
- Active connections per pod
Cluster-level:
- Total pods running
- Node utilization
- PVC provisioning time
- Autoscaling events
### Recommended Alerts
Critical (P0):
- Pod OOMKilled (memory exhausted)
- PVC provisioning failed
- 90% of maxReplicas reached
- Node pool at capacity
Warning (P1):
- CPU utilization >80% for 10 min
- RAM utilization >80% for 10 min
- 70% of maxReplicas reached
- workspace storage >80% full
Info (P2):
- Autoscaling events
- Pod evictions
- Node scaling events
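If the Prometheus Operator is in use, the two replica-related P0 alerts could be expressed roughly as follows (a sketch: metric names assume kube-state-metrics is installed, and the rule and alert names are illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: coditect-capacity-alerts   # illustrative name
  namespace: coditect-app
spec:
  groups:
    - name: capacity.p0
      rules:
        # Pod OOMKilled: last container termination was due to memory exhaustion
        - alert: PodOOMKilled
          expr: kube_pod_container_status_last_terminated_reason{namespace="coditect-app", reason="OOMKilled"} == 1
          labels:
            severity: critical
        # HPA is within 10% of its replica ceiling
        - alert: NearMaxReplicas
          expr: |
            kube_horizontalpodautoscaler_status_current_replicas
              / kube_horizontalpodautoscaler_spec_max_replicas > 0.9
          labels:
            severity: critical
```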
### Prometheus Queries

```promql
# CPU usage per pod
rate(container_cpu_usage_seconds_total{pod=~"coditect-combined-.*"}[5m])

# RAM usage per pod
container_memory_working_set_bytes{pod=~"coditect-combined-.*"}

# Active users (based on connections)
sum(nginx_ingress_controller_requests{service="coditect-combined-service"})
```
## Migration Path - Current to Production

### Phase 1: Immediate (Today)

- ✅ StatefulSet migration completed (`./scripts/migrate-to-statefulset.sh`)
- ⏳ Test persistence (`./scripts/test-persistence.sh`)
- ⏳ Validate session affinity (`./scripts/test-session-affinity.sh`)

Current capacity: 3 light users
### Phase 2: Small Team (This Week)

- Update `terraform/environments/prod/values.yaml`:
  - Increase pod resources to 4 vCPU, 8 GB RAM
  - Scale to 10 replicas
  - Increase workspace to 100Gi
- Apply with Terraform: `terraform apply`
- Monitor with `kubectl top pods`

Target capacity: 10-20 users
### Phase 3: Production-Ready (Next Month)
- Enable autoscaling (min 10, max 50 replicas)
- Set up monitoring (Prometheus + Grafana)
- Configure alerts (OpsGenie/PagerDuty)
- Implement workspace storage tiers
- Load testing with 50+ concurrent users
Target capacity: 50-100 users
### Phase 4: Enterprise Scale (This Quarter)
- Multi-region deployment
- Regional node pools
- Advanced autoscaling (VPA + HPA)
- Cost optimization (preemptible nodes for dev)
- Advanced monitoring (distributed tracing)
Target capacity: 500+ users
## Summary

### Current State

- Configuration: 3 pods, 2 vCPU + 2 GB RAM limits each
- Capacity: 3 concurrent light users (1 per pod)
- Production-ready: ❌ NO - severely under-provisioned
### Recommended First Step

Update `terraform/environments/prod/values.yaml` with:

```yaml
replicaCount: 10

resources:
  requests:
    memory: "4Gi"
    cpu: "2000m"
  limits:
    memory: "8Gi"
    cpu: "4000m"

persistence:
  workspace:
    size: 100Gi

autoscaling:
  enabled: true
  minReplicas: 10
  maxReplicas: 30
```
This will support:
- 10-20 light development users
- 10 full development users
- Autoscaling to 30 pods during peak load
Estimated cost: $400-600/month
Next: Apply Terraform changes and monitor actual usage to optimize further.