# Theia GKE Scaling Research Summary
- **Source Document:** `theia-instance-running-on-gcp-gke-kubernetes.md`
- **Date:** 2025-10-26
- **Research Tool:** Perplexity AI
- **Status:** ✅ Complete - 60KB comprehensive analysis
## 🎯 Executive Summary
Comprehensive research on deploying and scaling the Eclipse Theia IDE on Google Kubernetes Engine (GKE), from an initial 1-50 users up to 10k-20k concurrent users. Covers pod persistence issues, storage strategies, multi-tenancy architecture, cost estimates, and production-ready Terraform + Helm deployment bundles.
## 📊 Key Topics Covered
### 1. Pod Persistence & Session Management
**Problem:** Theia pods disappear after session timeout (typically 30 minutes).
**Root Cause:** Theia Cloud auto-destroys idle pods to free resources.
**Solutions:**
- Disable or extend `sessionTimeout` in the Theia Cloud config
- Use StatefulSets instead of Deployments for stable pod identity
- Implement persistent volumes (PVCs) for workspace data
- Add the annotation `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"`
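A minimal sketch combining the StatefulSet, PVC, and eviction-annotation solutions above (the `theia-session` name and the container image are placeholders, not from the source):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: theia-session          # placeholder name
spec:
  serviceName: theia-session
  replicas: 1
  selector:
    matchLabels:
      app: theia-session
  template:
    metadata:
      labels:
        app: theia-session
      annotations:
        # Prevent the cluster autoscaler from evicting the pod mid-session
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      containers:
        - name: theia
          image: theiaide/theia:latest   # placeholder image
          volumeMounts:
            - name: workspace
              mountPath: /home/project
  volumeClaimTemplates:
    # One PVC per pod ordinal, so workspace data survives pod restarts
    - metadata:
        name: workspace
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```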
### 2. Persistent Storage Architecture
Per-User PVC Pattern:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-rwo
provisioner: pd.csi.storage.gke.io
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: pd-balanced
```
Storage Options:
- GCE Persistent Disk (CSI): Standard for per-user PVCs
- Filestore/NFS: Shared volumes for collaboration
- Hyperdisk Storage Pools: Scalable high-throughput for 1000s of volumes
- GCS/S3: Backup/archival for inactive workspaces
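With a `standard-rwo` StorageClass like the one above, a per-user PVC could be sketched as follows (the claim name is illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: workspace-user-42    # illustrative per-user name
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard-rwo
  resources:
    requests:
      storage: 10Gi
```

Because the StorageClass uses `WaitForFirstConsumer`, the disk is provisioned only once the user's pod is scheduled, in the same zone as that pod.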
### 3. Scaling Architecture (10k-20k Users)
Resource Profile per User:
| Resource | Light Session | Full Development |
|---|---|---|
| vCPU | 1-2 vCPU | 2-4 vCPU |
| RAM | 2-4 GiB | 6-8 GiB |
| Disk | 5-10 GB | 15-30 GB |
Cost Estimates (10k concurrent users):
| Mode | Monthly Cost | Notes |
|---|---|---|
| GKE Autopilot | $1.0-1.2M | Pay-per-pod, bursty workloads |
| GKE Standard | $0.8-0.9M | 75-90% utilization, manual tuning |
| Storage (100TB) | $4-6k | GCE PD Balanced |
Per-User Cost: ~$0.05-0.10 per active user-hour
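A quick sanity check that the per-user rate and the monthly totals agree, assuming the upper-bound $0.10/user-hour rate and a fully utilized ~730-hour month:

```python
# Rough cost sanity check; the $0.10/user-hour rate comes from the estimate above
users = 10_000          # concurrent users
hours_per_month = 730   # ~24 * 365 / 12
rate = 0.10             # USD per active user-hour (upper bound)

monthly_cost = users * hours_per_month * rate
print(f"${monthly_cost:,.0f}/month")  # roughly $730k, in the ballpark of the table's $0.8-1.2M
```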
### 4. Multi-Tenancy Design
Namespace-based Isolation (Recommended):
- Each user/tenant → dedicated namespace
- ResourceQuotas + LimitRanges prevent resource starvation
- RBAC + NetworkPolicies restrict inter-tenant access
- Scales to 10k+ users with proper quota tuning
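A per-namespace quota pair for this pattern might look like the following sketch (the namespace name and the limits are illustrative, sized for a few full-development sessions):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
  namespace: user-alice        # illustrative tenant namespace
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    persistentvolumeclaims: "5"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: tenant-limits
  namespace: user-alice
spec:
  limits:
    - type: Container
      # Defaults applied to containers that omit their own requests/limits
      default:
        cpu: "2"
        memory: 4Gi
      defaultRequest:
        cpu: "1"
        memory: 2Gi
```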
Pod-per-User Model:
- Strong isolation, higher control
- Limited density, higher overhead
- Use for regulated/enterprise scenarios
### 5. Autoscaling Strategy
Three-Layer Autoscaling:
- HPA (Horizontal Pod Autoscaler): Scale pods based on CPU/memory
- Cluster Autoscaler: Add/remove nodes based on pending pods
- Node Auto-Provisioning: Create new node pools on demand
Example HPA (the `scaleTargetRef` workload is assumed; note that `autoscaling/v2` expresses the CPU target via the `metrics` list rather than `targetCPUUtilizationPercentage`):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: theia-hpa
spec:
  scaleTargetRef:              # target workload assumed; not in the source
    apiVersion: apps/v1
    kind: Deployment
    name: theia
  minReplicas: 1
  maxReplicas: 10000
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
```
### 6. Security & Isolation
Layered Security:
- GKE Sandbox (gVisor): User-space kernel for untrusted code
- NetworkPolicies: Default deny-all, whitelist ingress
- Private GKE Clusters: No public node IPs
- Workload Identity: Map GCP IAM to K8s ServiceAccounts
- PodSecurity Admission: Enforce the `restricted` profile
- Shielded GKE Nodes: vTPM attestation
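The default deny-all posture can be expressed with one minimal NetworkPolicy per tenant namespace (namespace name illustrative); whitelisted ingress rules are then added as separate policies:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: user-alice   # illustrative tenant namespace
spec:
  podSelector: {}         # empty selector: applies to every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
```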
### 7. Terraform + Helm Deployment Bundle
Infrastructure as Code (`location` added here, since `google_container_cluster` requires it):

```hcl
resource "google_container_cluster" "theia" {
  for_each = toset(["us-central1", "us-east1", "europe-west1"])
  name     = "theia-${each.key}"
  location = each.key

  cluster_autoscaling {
    enabled = true
    resource_limits {
      resource_type = "cpu"
      minimum       = 200
      maximum       = 20000
    }
  }
}
```
Helm Values:

```yaml
autoscaling:
  enabled: true
  minReplicas: 1
  maxReplicas: 500
  targetCPUUtilizationPercentage: 60
persistence:
  enabled: true
  storageClass: standard-rwo
  size: 10Gi
  reclaimPolicy: Retain
resources:
  requests:
    cpu: "2000m"
    memory: "4Gi"
```
### 8. Growth Path (1-50 → 10k Users)
Phase-Based Scaling:
| Phase | Users | Architecture | Monthly Cost |
|---|---|---|---|
| MVP | 1-50 | Single regional GKE Autopilot (3 nodes) | $400-800 |
| Growth | 50-1k | Multi-node, namespace isolation | $20k-50k |
| Scale-Out | 1k-10k | Multi-pool, multi-zone regional | $800k-1.0M |
| Global | 10k-20k | Multi-cluster federation | $1.0M+ |
Key Principle: Same Terraform + Helm configuration across all phases - no redesign needed
## 🔑 Key Insights for t2 Project
### Relevant to Current Socket.IO Issues
1. **Session Timeout Kills Pods**
   - Default 30-minute inactivity timeout destroys IDE containers
   - Related to the Socket.IO 400 errors: pods may be terminating mid-session
   - Solution: extend or disable `sessionTimeout` in the Theia Cloud config
2. **WebSocket Connection Management**
   - Known issue: the Theia IDE WebSocket disconnects every 30 seconds in K8s
   - Reference: StackOverflow #64452006
   - Related to the t2 Socket.IO issue: session affinity + a WebSocket annotation are required
3. **Ingress/Load Balancer Configuration**
   - Requires WebSocket-specific annotations for GKE Ingress
   - Session affinity (`CLIENT_IP` or `GENERATED_COOKIE`) is essential
   - Backend timeout must exceed the WebSocket keepalive interval
4. **Health Check Endpoints**
   - Critical for GKE load balancers to route correctly
   - Missing health checks → 400 errors from stale backends
   - Recommendation: implement `/health` and `/ready` endpoints
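On GKE, the affinity, timeout, and health-check points above map onto a `BackendConfig` attached to the Theia Service. A hedged sketch (resource names, port numbers, and the TTL are illustrative assumptions):

```yaml
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: theia-backendconfig    # illustrative name
spec:
  timeoutSec: 120              # must exceed the WebSocket keepalive interval
  sessionAffinity:
    affinityType: "GENERATED_COOKIE"
    affinityCookieTtlSec: 3600 # illustrative TTL
  healthCheck:
    type: HTTP
    requestPath: /health       # matches the recommended health endpoint
---
apiVersion: v1
kind: Service
metadata:
  name: theia                  # illustrative service name
  annotations:
    cloud.google.com/backend-config: '{"default": "theia-backendconfig"}'
spec:
  selector:
    app: theia
  ports:
    - port: 80
      targetPort: 3000         # illustrative Theia container port
```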
### Architecture Lessons for t2
1. **Multi-Tenant Pod Strategy**
   - Current t2 approach: combined pods (Frontend + Theia + NGINX)
   - Research recommendation: namespace-based isolation for scale
   - Cost optimization: multi-tenant pods vs. per-user pods (~100x cost reduction)
2. **Persistent Storage Pattern**
   - Use CSI-backed dynamic PVC provisioning
   - `volumeBindingMode: WaitForFirstConsumer` for topology-aware provisioning
   - Hyperdisk Storage Pools for elastic capacity at scale
3. **Autoscaling Best Practices**
   - HPA for pods, Cluster Autoscaler for nodes, NAP for node pools
   - Mix in Spot VMs for 50-70% cost savings on transient sessions
   - Regional clusters for multi-zone resilience
4. **Security Hardening**
   - GKE Sandbox (gVisor) for untrusted user code
   - NetworkPolicies for tenant isolation
   - Private GKE clusters + Workload Identity
## 📋 Action Items for t2 Project
### Immediate (Socket.IO Fix)
- Add WebSocket annotation to Ingress (P0)
- Verify session affinity enabled on backend service (P0)
- Create `/health` and `/ready` endpoints for Theia pods (P0)
- Increase backend timeout to 120s (P1)
- Implement connection draining optimization (P1)
### Short-term (Sprint 3)
- Implement persistent workspace storage (PVCs per user)
- Configure the Theia Cloud session timeout (disable or extend to 4 hours)
- Add a StatefulSet for Theia pods (if session persistence is critical)
- Implement pod anti-affinity for HA
### Medium-term (Scaling Architecture)
- Review namespace-based multi-tenancy strategy
- Design PVC lifecycle automation (provision on demand, cleanup on expiry)
- Implement HPA + CA + NAP autoscaling
- Cost analysis: Per-user pods vs multi-tenant pods
### Long-term (10k+ Users)
- Multi-cluster federation strategy
- GKE Sandbox (gVisor) for security isolation
- Hyperdisk Storage Pools for scalable PVCs
- Monitoring & alerting for large-scale deployment
## 🔗 References
**Source Document:** 60KB comprehensive research (1,558 lines)
**Key External References:**
- Theia Cloud Release 1.0
- GKE Planning Large Clusters
- K8s Multi-Tenancy
- GKE Persistent Volumes
- Hyperdisk Storage Pools
**Related t2 Documentation:**
- `socket.io-issue/analysis-troubleshooting-guide.md` - Socket.IO 400 error root causes
- `docs/DEFINITIVE-V5-architecture.md` - V5 system design
- `docs/10-execution-plans/phased-deployment-checklist.md` - Current sprint status
**Last Updated:** 2025-10-26
**Status:** ✅ Research complete, action items identified