GKE Capacity Planning - Coditect IDE

Date: 2025-10-26
Purpose: Calculate user capacity and provide scaling recommendations


Current Configuration Analysis

Deployed Resources (from terraform/environments/prod/values.yaml)

Per Pod:

resources:
  requests:
    memory: "512Mi"   # 0.5 GB
    cpu: "500m"       # 0.5 vCPU
  limits:
    memory: "2Gi"     # 2 GB
    cpu: "2000m"      # 2 vCPU

Cluster Configuration:

  • Replicas: 3 StatefulSet pods
  • Session Affinity: Enabled (users stick to one pod)
  • Storage: 50Gi workspace + 5Gi config per pod

User Capacity - Current Config

Light Development Users

Resource Requirements: 1-2 vCPU, 2-4 GB RAM per user

Calculation (assuming 2 vCPU + 2 GB RAM per user):

  • CPU per pod: 2000m limit / 2000m per user = 1 user
  • RAM per pod: 2048Mi limit / 2048Mi per user = 1 user
  • Storage per pod: 50Gi / 10Gi per user = 5 users (not limiting factor)

Bottleneck: CPU and RAM (both limited to 1 user per pod)

Total Capacity: 3 concurrent users (1 per pod)
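The per-pod math above can be sketched as a small helper. The values mirror the limits quoted in values.yaml; note this ignores node-level overhead, so it is an upper bound, not a scheduler guarantee:

```javascript
// Users a pod can host: the tighter of the CPU and RAM constraints.
function usersPerPod(pod, user) {
  const byCpu = Math.floor(pod.cpuMilli / user.cpuMilli);
  const byRam = Math.floor(pod.ramMi / user.ramMi);
  return Math.min(byCpu, byRam);
}

const pod = { cpuMilli: 2000, ramMi: 2048 };       // current pod limits
const lightUser = { cpuMilli: 2000, ramMi: 2048 }; // light-user footprint
const replicas = 3;

console.log(usersPerPod(pod, lightUser) * replicas); // → 3
```

The same function with a 4 vCPU / 8Gi pod yields 2 light users per pod, which is the packing assumed in the scaling scenarios below.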

Full Development Users

Resource Requirements: 2-4 vCPU, 6-8 GB RAM per user

Result: ❌ NOT SUPPORTED

  • Pods only have 2Gi RAM limit
  • Full development needs 6-8 GB minimum
  • Would require pod resource increase to 8Gi+ RAM
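The check behind that verdict is simple: a pod supports a user class only if both limits meet the class's minimums, and the 2Gi RAM limit falls short of the 6 GB floor. A minimal sketch (values mirror the figures above):

```javascript
// A pod supports a user class only if both limits meet the user's minimums.
const podCpuMilli = 2000;  // current 2000m CPU limit
const podRamMi = 2048;     // current 2Gi RAM limit
const fullDevMin = { cpuMilli: 2000, ramMi: 6144 }; // 2 vCPU, 6 GB floor

const supported =
  podCpuMilli >= fullDevMin.cpuMilli && podRamMi >= fullDevMin.ramMi;
console.log(supported); // → false (RAM is the blocker)
```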

Scaling Scenarios

Scenario 1: Small Team (10-20 Users)

Configuration:

  • Replicas: 10 pods
  • CPU per pod: 4000m (4 vCPU)
  • RAM per pod: 8Gi (8 GB)
  • Storage per pod: 100Gi

Capacity:

  • Light users: 20 concurrent (2 per pod)
  • Full dev users: 10 concurrent (1 per pod)
  • Cost: ~$400-600/month

Terraform values.yaml changes:

replicaCount: 10

resources:
  requests:
    memory: "4Gi"
    cpu: "2000m"
  limits:
    memory: "8Gi"
    cpu: "4000m"

persistence:
  workspace:
    size: 100Gi

Scenario 2: Mid-Size Team (50-100 Users)

Configuration:

  • Replicas: 50 pods
  • CPU per pod: 4000m (4 vCPU)
  • RAM per pod: 8Gi (8 GB)
  • Storage per pod: 100Gi
  • Autoscaling: Enabled (min 30, max 50)

Capacity:

  • Light users: 100 concurrent (2 per pod)
  • Full dev users: 50 concurrent (1 per pod)
  • Cost: ~$2,000-3,000/month

Terraform values.yaml changes:

replicaCount: 30

autoscaling:
  enabled: true
  minReplicas: 30
  maxReplicas: 50
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 75

resources:
  requests:
    memory: "4Gi"
    cpu: "2000m"
  limits:
    memory: "8Gi"
    cpu: "4000m"

Scenario 3: Enterprise (500+ Users)

Configuration:

  • Replicas: 250 pods (autoscaling)
  • CPU per pod: 4000m (4 vCPU)
  • RAM per pod: 8Gi (8 GB)
  • Storage per pod: 200Gi
  • Multi-region deployment
  • Regional node pools with affinity

Capacity:

  • Light users: 500 concurrent (2 per pod)
  • Full dev users: 250 concurrent (1 per pod)
  • Cost: ~$10,000-15,000/month

Terraform values.yaml changes:

replicaCount: 100

autoscaling:
  enabled: true
  minReplicas: 100
  maxReplicas: 250
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 75

resources:
  requests:
    memory: "6Gi"
    cpu: "3000m"
  limits:
    memory: "8Gi"
    cpu: "4000m"

persistence:
  workspace:
    size: 200Gi

Immediate Actions (Current Config - 3 Users Max)

The current config is NOT production-ready: it supports only 3 light users.

Must do before production launch:

  1. Increase pod resources (lines 39-45 in terraform/environments/prod/values.yaml):

     resources:
       requests:
         memory: "4Gi"    # Was 512Mi
         cpu: "2000m"     # Was 500m
       limits:
         memory: "8Gi"    # Was 2Gi
         cpu: "4000m"     # Was 2000m

  2. Scale replicas based on user count:

     • 10 users → 10 pods
     • 50 users → 50 pods
     • 100 users → 100 pods

  3. Enable autoscaling (lines 117-122):

     autoscaling:
       enabled: true    # Was false
       minReplicas: 10
       maxReplicas: 30

  4. Increase workspace storage (line 30):

     persistence:
       workspace:
         size: 100Gi    # Was 50Gi

GKE Cluster Requirements

For 10-20 users:

  • Node pool: 3-5 nodes
  • Machine type: e2-standard-8 (8 vCPU, 32 GB RAM)
  • Total cluster: 24-40 vCPU, 96-160 GB RAM

For 50-100 users:

  • Node pool: 15-25 nodes
  • Machine type: e2-standard-8 or e2-standard-16
  • Total cluster: 120-400 vCPU, 480-1600 GB RAM

For 500+ users:

  • Multiple node pools (3+ per region)
  • Machine type: e2-standard-16 or e2-highmem-16
  • Total cluster: 1000+ vCPU, 4000+ GB RAM
  • Multi-region deployment recommended

Cost Estimates

Per User Monthly Cost

| User Type | vCPU | RAM  | Storage | GKE Cost |
|-----------|------|------|---------|----------|
| Light Dev | 2    | 2 GB | 50 GB   | $20-30   |
| Full Dev  | 4    | 8 GB | 100 GB  | $40-60   |

Note: Costs include GKE compute, persistent disk, and load balancer

Total Monthly Costs

| User Count | Light Dev      | Full Dev       | Mixed (50/50)  |
|------------|----------------|----------------|----------------|
| 10         | $200-300       | $400-600       | $300-450       |
| 50         | $1,000-1,500   | $2,000-3,000   | $1,500-2,250   |
| 100        | $2,000-3,000   | $4,000-6,000   | $3,000-4,500   |
| 500        | $10,000-15,000 | $20,000-30,000 | $15,000-22,500 |
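These totals are per-user rates multiplied by head count (the mixed column is a 50/50 split). A sketch of the arithmetic, using the document's own estimates rather than quoted GCP pricing:

```javascript
// Monthly cost range [low, high] for a mix of light and full-dev users.
const rates = { light: [20, 30], full: [40, 60] }; // $/user/month, from the table above

function monthlyCost(lightUsers, fullUsers) {
  return [
    lightUsers * rates.light[0] + fullUsers * rates.full[0],
    lightUsers * rates.light[1] + fullUsers * rates.full[1],
  ];
}

console.log(monthlyCost(10, 0)); // → [200, 300]
console.log(monthlyCost(5, 5));  // → [300, 450]  (mixed 50/50 at 10 users)
```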

Additional costs:

  • Networking egress: ~5-10% of compute
  • FoundationDB cluster: $500-2,000/month (depending on size)
  • Cloud Build: ~$50-200/month
  • Load balancer: ~$20-50/month

Resource Optimization Strategies

1. Right-Size Pods

Monitor actual usage:

# Check CPU/RAM usage per pod
kubectl top pods -n coditect-app

# Get resource usage over time
kubectl get hpa -n coditect-app --watch

Adjust based on metrics:

  • If CPU consistently < 50%: Reduce CPU limits
  • If RAM consistently < 50%: Reduce RAM limits
  • If pods are OOMKilled: Increase RAM limits
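That decision table can be expressed as a small policy function. This is a hypothetical sketch of the rules above; in practice the OOMKilled signal comes from pod status, and utilization should be averaged over a window, not read from a single sample:

```javascript
// Suggest a resource adjustment from observed utilization (0-1) and OOM events.
function rightSize(cpuUtil, ramUtil, oomKilled) {
  if (oomKilled) return "increase RAM limits";
  if (cpuUtil < 0.5 && ramUtil < 0.5) return "reduce CPU and RAM limits";
  if (cpuUtil < 0.5) return "reduce CPU limits";
  if (ramUtil < 0.5) return "reduce RAM limits";
  return "keep current limits";
}

console.log(rightSize(0.35, 0.7, false)); // → "reduce CPU limits"
console.log(rightSize(0.7, 0.7, true));   // → "increase RAM limits"
```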

2. Workspace Storage Optimization

Per-user storage tiers:

  • Free tier: 10Gi
  • Starter: 50Gi
  • Pro: 100Gi
  • Enterprise: 200Gi+

Implement in code:

// Based on user plan, set workspace storage
const workspaceSize = {
  free: '10Gi',
  starter: '50Gi',
  pro: '100Gi',
  enterprise: '200Gi'
}[user.plan];

3. Enable Autoscaling

HPA (Horizontal Pod Autoscaler):

autoscaling:
  enabled: true
  minReplicas: 10
  maxReplicas: 50
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 75
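For reference, the HPA sizes the deployment as desired = ceil(currentReplicas × observedMetric / targetMetric). With the 70% CPU target above, a sustained spike scales out roughly like this:

```javascript
// HPA scaling formula: desired = ceil(current * observed / target).
function desiredReplicas(current, observedUtil, targetUtil) {
  return Math.ceil(current * (observedUtil / targetUtil));
}

// 10 pods averaging 90% CPU against a 70% target:
console.log(desiredReplicas(10, 90, 70)); // → 13
```

The result is clamped to the minReplicas/maxReplicas bounds, so with maxReplicas: 50 the fleet never grows past 50 regardless of load.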

VPA (Vertical Pod Autoscaler) - Optional:

  • Automatically adjust pod resource requests/limits
  • Based on historical usage patterns
  • Requires VPA controller installed on cluster

4. Node Pool Optimization

Use preemptible nodes for dev/staging:

  • 60-80% cost savings
  • Automatic recreation on preemption
  • Not recommended for production

Node affinity for workload tiers:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: workload-tier
              operator: In
              values:
                - premium   # For paid users

Monitoring & Alerts

Key Metrics to Track

Pod-level:

  • CPU utilization (target: 60-80%)
  • RAM utilization (target: 60-80%)
  • Disk I/O (workspace PVC)
  • Active connections per pod

Cluster-level:

  • Total pods running
  • Node utilization
  • PVC provisioning time
  • Autoscaling events

Alert Thresholds

Critical (P0):

  • Pod OOMKilled (memory exhausted)
  • PVC provisioning failed
  • 90% of maxReplicas reached
  • Node pool at capacity

Warning (P1):

  • CPU utilization >80% for 10 min
  • RAM utilization >80% for 10 min
  • 70% of maxReplicas reached
  • workspace storage >80% full

Info (P2):

  • Autoscaling events
  • Pod evictions
  • Node scaling events

Prometheus Queries

# CPU usage per pod
rate(container_cpu_usage_seconds_total{pod=~"coditect-combined-.*"}[5m])

# RAM usage per pod
container_memory_working_set_bytes{pod=~"coditect-combined-.*"}

# Request rate as a rough proxy for active users
sum(rate(nginx_ingress_controller_requests{service="coditect-combined-service"}[5m]))

Migration Path - Current to Production

Phase 1: Immediate (Today)

  1. ✅ StatefulSet migration completed (./scripts/migrate-to-statefulset.sh)
  2. ⏳ Test persistence (./scripts/test-persistence.sh)
  3. ⏳ Validate session affinity (./scripts/test-session-affinity.sh)

Current capacity: 3 light users

Phase 2: Small Team (This Week)

  1. Update terraform/environments/prod/values.yaml:
    • Increase pod resources to 4 vCPU, 8 GB RAM
    • Scale to 10 replicas
    • Increase workspace to 100Gi
  2. Apply with Terraform: terraform apply
  3. Monitor with kubectl top pods

Target capacity: 10-20 users

Phase 3: Production-Ready (Next Month)

  1. Enable autoscaling (min 10, max 50 replicas)
  2. Set up monitoring (Prometheus + Grafana)
  3. Configure alerts (OpsGenie/PagerDuty)
  4. Implement workspace storage tiers
  5. Load testing with 50+ concurrent users

Target capacity: 50-100 users

Phase 4: Enterprise Scale (Quarter)

  1. Multi-region deployment
  2. Regional node pools
  3. Advanced autoscaling (VPA + HPA)
  4. Cost optimization (preemptible nodes for dev)
  5. Advanced monitoring (distributed tracing)

Target capacity: 500+ users


Summary

Current State

  • Configuration: 3 pods, 2 vCPU + 2 GB RAM each
  • Capacity: 3 concurrent light users (1 per pod)
  • Production-ready: ❌ NO - severely under-provisioned

Update terraform/environments/prod/values.yaml with:

replicaCount: 10

resources:
  requests:
    memory: "4Gi"
    cpu: "2000m"
  limits:
    memory: "8Gi"
    cpu: "4000m"

persistence:
  workspace:
    size: 100Gi

autoscaling:
  enabled: true
  minReplicas: 10
  maxReplicas: 30

This will support:

  • 10-20 light development users
  • 10 full development users
  • Autoscaling to 30 pods during peak load

Estimated cost: $400-600/month


Next: Apply Terraform changes and monitor actual usage to optimize further.