GKE Capacity Planning - Coditect IDE

Date: 2025-10-26
Purpose: Calculate user capacity and provide scaling recommendations


Current Configuration Analysis

Deployed Resources (from terraform/environments/prod/values.yaml)

Per Pod:

resources:
  requests:
    memory: "512Mi"   # 0.5 GB
    cpu: "500m"       # 0.5 vCPU
  limits:
    memory: "2Gi"     # 2 GB
    cpu: "2000m"      # 2 vCPU

Cluster Configuration:

  • Replicas: 3 StatefulSet pods
  • Session Affinity: Enabled (users stick to one pod)
  • Storage: 50Gi workspace + 5Gi config per pod

User Capacity - Current Config

Light Development Users

Resource Requirements: 1-2 vCPU, 2-4 GB RAM per user

Calculation (assuming 2 vCPU + 2 GB RAM per user):

  • CPU per pod: 2000m limit / 2000m per user = 1 user
  • RAM per pod: 2048Mi limit / 2048Mi per user = 1 user
  • Storage per pod: 50Gi / 10Gi per user = 5 users (not limiting factor)

Bottleneck: CPU and RAM (both limited to 1 user per pod)

Total Capacity: 3 concurrent users (1 per pod)
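The per-pod math above can be sketched as a small helper. The values mirror the limits quoted in values.yaml; note this ignores node-level overhead, so it is an upper bound, not a scheduler guarantee:

```javascript
// Users a pod can host: the tighter of the CPU and RAM constraints.
function usersPerPod(pod, user) {
  const byCpu = Math.floor(pod.cpuMilli / user.cpuMilli);
  const byRam = Math.floor(pod.ramMi / user.ramMi);
  return Math.min(byCpu, byRam);
}

const pod = { cpuMilli: 2000, ramMi: 2048 };       // current pod limits
const lightUser = { cpuMilli: 2000, ramMi: 2048 }; // light-user footprint
const replicas = 3;

console.log(usersPerPod(pod, lightUser) * replicas); // → 3
```

The same function with a 4 vCPU / 8Gi pod yields 2 light users per pod, which is the packing assumed in the scaling scenarios below.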

Full Development Users

Resource Requirements: 2-4 vCPU, 6-8 GB RAM per user

Result: ❌ NOT SUPPORTED

  • Pods only have 2Gi RAM limit
  • Full development needs 6-8 GB minimum
  • Would require pod resource increase to 8Gi+ RAM
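The check behind that verdict is simple: a pod supports a user class only if both limits meet the class's minimums, and the 2Gi RAM limit falls short of the 6 GB floor. A minimal sketch (values mirror the figures above):

```javascript
// A pod supports a user class only if both limits meet the user's minimums.
const podCpuMilli = 2000;  // current 2000m CPU limit
const podRamMi = 2048;     // current 2Gi RAM limit
const fullDevMin = { cpuMilli: 2000, ramMi: 6144 }; // 2 vCPU, 6 GB floor

const supported =
  podCpuMilli >= fullDevMin.cpuMilli && podRamMi >= fullDevMin.ramMi;
console.log(supported); // → false (RAM is the blocker)
```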

Scaling Scenarios

Scenario 1: Small Team (10-20 Users)

Configuration:

  • Replicas: 10 pods
  • CPU per pod: 4000m (4 vCPU)
  • RAM per pod: 8Gi (8 GB)
  • Storage per pod: 100Gi

Capacity:

  • Light users: 20 concurrent (2 per pod)
  • Full dev users: 10 concurrent (1 per pod)
  • Cost: ~$400-600/month

Terraform values.yaml changes:

replicaCount: 10

resources:
  requests:
    memory: "4Gi"
    cpu: "2000m"
  limits:
    memory: "8Gi"
    cpu: "4000m"

persistence:
  workspace:
    size: 100Gi

Scenario 2: Mid-Size Team (50-100 Users)

Configuration:

  • Replicas: 50 pods
  • CPU per pod: 4000m (4 vCPU)
  • RAM per pod: 8Gi (8 GB)
  • Storage per pod: 100Gi
  • Autoscaling: Enabled (min 30, max 50)

Capacity:

  • Light users: 100 concurrent (2 per pod)
  • Full dev users: 50 concurrent (1 per pod)
  • Cost: ~$2,000-3,000/month

Terraform values.yaml changes:

replicaCount: 30

autoscaling:
  enabled: true
  minReplicas: 30
  maxReplicas: 50
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 75

resources:
  requests:
    memory: "4Gi"
    cpu: "2000m"
  limits:
    memory: "8Gi"
    cpu: "4000m"

Scenario 3: Enterprise (500+ Users)

Configuration:

  • Replicas: 250 pods (autoscaling)
  • CPU per pod: 4000m (4 vCPU)
  • RAM per pod: 8Gi (8 GB)
  • Storage per pod: 200Gi
  • Multi-region deployment
  • Regional node pools with affinity

Capacity:

  • Light users: 500 concurrent (2 per pod)
  • Full dev users: 250 concurrent (1 per pod)
  • Cost: ~$10,000-15,000/month

Terraform values.yaml changes:

replicaCount: 100

autoscaling:
  enabled: true
  minReplicas: 100
  maxReplicas: 250
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 75

resources:
  requests:
    memory: "6Gi"
    cpu: "3000m"
  limits:
    memory: "8Gi"
    cpu: "4000m"

persistence:
  workspace:
    size: 200Gi

Immediate Actions (Current Config - 3 Users Max)

The current config is NOT production-ready: it supports only 3 light users.

Must do before production launch:

  1. Increase pod resources (lines 39-45 in terraform/environments/prod/values.yaml):

     resources:
       requests:
         memory: "4Gi"    # Was 512Mi
         cpu: "2000m"     # Was 500m
       limits:
         memory: "8Gi"    # Was 2Gi
         cpu: "4000m"     # Was 2000m

  2. Scale replicas based on user count:

     • 10 users → 10 pods
     • 50 users → 50 pods
     • 100 users → 100 pods

  3. Enable autoscaling (lines 117-122):

     autoscaling:
       enabled: true    # Was false
       minReplicas: 10
       maxReplicas: 30

  4. Increase workspace storage (line 30):

     persistence:
       workspace:
         size: 100Gi    # Was 50Gi

GKE Cluster Requirements

For 10-20 users:

  • Node pool: 3-5 nodes
  • Machine type: e2-standard-8 (8 vCPU, 32 GB RAM)
  • Total cluster: 24-40 vCPU, 96-160 GB RAM

For 50-100 users:

  • Node pool: 15-25 nodes
  • Machine type: e2-standard-8 or e2-standard-16
  • Total cluster: 120-400 vCPU, 480-1600 GB RAM

For 500+ users:

  • Multiple node pools (3+ per region)
  • Machine type: e2-standard-16 or e2-highmem-16
  • Total cluster: 1000+ vCPU, 4000+ GB RAM
  • Multi-region deployment recommended

Cost Estimates

Per User Monthly Cost

| User Type | vCPU | RAM  | Storage | GKE Cost |
|-----------|------|------|---------|----------|
| Light Dev | 2    | 2 GB | 50 GB   | $20-30   |
| Full Dev  | 4    | 8 GB | 100 GB  | $40-60   |

Note: Costs include GKE compute, persistent disk, and load balancer

Total Monthly Costs

| User Count | Light Dev      | Full Dev       | Mixed (50/50)  |
|------------|----------------|----------------|----------------|
| 10         | $200-300       | $400-600       | $300-450       |
| 50         | $1,000-1,500   | $2,000-3,000   | $1,500-2,250   |
| 100        | $2,000-3,000   | $4,000-6,000   | $3,000-4,500   |
| 500        | $10,000-15,000 | $20,000-30,000 | $15,000-22,500 |
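These totals are per-user rates multiplied by head count (the mixed column is a 50/50 split). A sketch of the arithmetic, using the document's own estimates rather than quoted GCP pricing:

```javascript
// Monthly cost range [low, high] for a mix of light and full-dev users.
const rates = { light: [20, 30], full: [40, 60] }; // $/user/month, from the table above

function monthlyCost(lightUsers, fullUsers) {
  return [
    lightUsers * rates.light[0] + fullUsers * rates.full[0],
    lightUsers * rates.light[1] + fullUsers * rates.full[1],
  ];
}

console.log(monthlyCost(10, 0)); // → [200, 300]
console.log(monthlyCost(5, 5));  // → [300, 450]  (mixed 50/50 at 10 users)
```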

Additional costs:

  • Networking egress: ~5-10% of compute
  • FoundationDB cluster: $500-2,000/month (depending on size)
  • Cloud Build: ~$50-200/month
  • Load balancer: ~$20-50/month

Resource Optimization Strategies

1. Right-Size Pods

Monitor actual usage:

# Check CPU/RAM usage per pod
kubectl top pods -n coditect-app

# Get resource usage over time
kubectl get hpa -n coditect-app --watch

Adjust based on metrics:

  • If CPU consistently < 50%: Reduce CPU limits
  • If RAM consistently < 50%: Reduce RAM limits
  • If pods are OOMKilled: Increase RAM limits
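That decision table can be expressed as a small policy function. This is a hypothetical sketch of the rules above; in practice the OOMKilled signal comes from pod status, and utilization should be averaged over a window, not read from a single sample:

```javascript
// Suggest a resource adjustment from observed utilization (0-1) and OOM events.
function rightSize(cpuUtil, ramUtil, oomKilled) {
  if (oomKilled) return "increase RAM limits";
  if (cpuUtil < 0.5 && ramUtil < 0.5) return "reduce CPU and RAM limits";
  if (cpuUtil < 0.5) return "reduce CPU limits";
  if (ramUtil < 0.5) return "reduce RAM limits";
  return "keep current limits";
}

console.log(rightSize(0.35, 0.7, false)); // → "reduce CPU limits"
console.log(rightSize(0.7, 0.7, true));   // → "increase RAM limits"
```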

2. Workspace Storage Optimization

Per-user storage tiers:

  • Free tier: 10Gi
  • Starter: 50Gi
  • Pro: 100Gi
  • Enterprise: 200Gi+

Implement in code:

// Based on user plan, set workspace storage
const workspaceSize = {
  free: '10Gi',
  starter: '50Gi',
  pro: '100Gi',
  enterprise: '200Gi'
}[user.plan];

3. Enable Autoscaling

HPA (Horizontal Pod Autoscaler):

autoscaling:
  enabled: true
  minReplicas: 10
  maxReplicas: 50
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 75
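For reference, the HPA sizes the deployment as desired = ceil(currentReplicas × observedMetric / targetMetric). With the 70% CPU target above, a sustained spike scales out roughly like this:

```javascript
// HPA scaling formula: desired = ceil(current * observed / target).
function desiredReplicas(current, observedUtil, targetUtil) {
  return Math.ceil(current * (observedUtil / targetUtil));
}

// 10 pods averaging 90% CPU against a 70% target:
console.log(desiredReplicas(10, 90, 70)); // → 13
```

The result is clamped to the minReplicas/maxReplicas bounds, so with maxReplicas: 50 the fleet never grows past 50 regardless of load.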

VPA (Vertical Pod Autoscaler) - Optional:

  • Automatically adjust pod resource requests/limits
  • Based on historical usage patterns
  • Requires VPA controller installed on cluster

4. Node Pool Optimization

Use preemptible nodes for dev/staging:

  • 60-80% cost savings
  • Automatic recreation on preemption
  • Not recommended for production

Node affinity for workload tiers:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: workload-tier
              operator: In
              values:
                - premium   # For paid users

Monitoring & Alerts

Key Metrics to Track

Pod-level:

  • CPU utilization (target: 60-80%)
  • RAM utilization (target: 60-80%)
  • Disk I/O (workspace PVC)
  • Active connections per pod

Cluster-level:

  • Total pods running
  • Node utilization
  • PVC provisioning time
  • Autoscaling events

Alert Thresholds

Critical (P0):

  • Pod OOMKilled (memory exhausted)
  • PVC provisioning failed
  • 90% of maxReplicas reached
  • Node pool at capacity

Warning (P1):

  • CPU utilization >80% for 10 min
  • RAM utilization >80% for 10 min
  • 70% of maxReplicas reached
  • workspace storage >80% full

Info (P2):

  • Autoscaling events
  • Pod evictions
  • Node scaling events

Prometheus Queries

# CPU usage per pod
rate(container_cpu_usage_seconds_total{pod=~"coditect-combined-.*"}[5m])

# RAM usage per pod
container_memory_working_set_bytes{pod=~"coditect-combined-.*"}

# Request rate as a rough proxy for active users
sum(rate(nginx_ingress_controller_requests{service="coditect-combined-service"}[5m]))

Migration Path - Current to Production

Phase 1: Immediate (Today)

  1. ✅ StatefulSet migration completed (./scripts/migrate-to-statefulset.sh)
  2. ⏳ Test persistence (./scripts/test-persistence.sh)
  3. ⏳ Validate session affinity (./scripts/test-session-affinity.sh)

Current capacity: 3 light users

Phase 2: Small Team (This Week)

  1. Update terraform/environments/prod/values.yaml:
    • Increase pod resources to 4 vCPU, 8 GB RAM
    • Scale to 10 replicas
    • Increase workspace to 100Gi
  2. Apply with Terraform: terraform apply
  3. Monitor with kubectl top pods

Target capacity: 10-20 users

Phase 3: Production-Ready (Next Month)

  1. Enable autoscaling (min 10, max 50 replicas)
  2. Set up monitoring (Prometheus + Grafana)
  3. Configure alerts (OpsGenie/PagerDuty)
  4. Implement workspace storage tiers
  5. Load testing with 50+ concurrent users

Target capacity: 50-100 users

Phase 4: Enterprise Scale (Quarter)

  1. Multi-region deployment
  2. Regional node pools
  3. Advanced autoscaling (VPA + HPA)
  4. Cost optimization (preemptible nodes for dev)
  5. Advanced monitoring (distributed tracing)

Target capacity: 500+ users


Summary

Current State

  • Configuration: 3 pods, 2 vCPU + 2 GB RAM each
  • Capacity: 3 concurrent light users (1 per pod)
  • Production-ready: ❌ NO - severely under-provisioned

Update terraform/environments/prod/values.yaml with:

replicaCount: 10

resources:
  requests:
    memory: "4Gi"
    cpu: "2000m"
  limits:
    memory: "8Gi"
    cpu: "4000m"

persistence:
  workspace:
    size: 100Gi

autoscaling:
  enabled: true
  minReplicas: 10
  maxReplicas: 30

This will support:

  • 10-20 light development users
  • 10 full development users
  • Autoscaling to 30 pods during peak load

Estimated cost: $400-600/month


Next: Apply Terraform changes and monitor actual usage to optimize further.