
Theia GKE Scaling Research Summary

Source Document: theia-instance-running-on-gcp-gke-kubernetes.md
Date: 2025-10-26
Research Tool: Perplexity AI
Status: ✅ Complete - 60KB comprehensive analysis


🎯 Executive Summary​

Comprehensive research on deploying and scaling the Eclipse Theia IDE on Google Kubernetes Engine (GKE), from an initial 1-50 users up to 10k-20k concurrent users. Covers pod persistence issues, storage strategies, multi-tenancy architecture, cost estimates, and production-ready Terraform + Helm deployment bundles.


📊 Key Topics Covered​

1. Pod Persistence & Session Management​

Problem: Theia pods disappear after session timeout (typically 30 minutes)
Root Cause: Theia Cloud auto-destroys idle pods to free resources
Solutions:

  • Disable/extend sessionTimeout in the Theia Cloud config
  • Use StatefulSets instead of Deployments for pod identity
  • Implement persistent volumes (PVCs) for workspace data
  • Add annotation cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
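
The eviction annotation from the last bullet can be sketched as a pod-template fragment (the surrounding Deployment/StatefulSet spec is omitted):

```yaml
# Fragment of a Deployment or StatefulSet spec. The annotation tells the
# Cluster Autoscaler not to evict this pod when draining nodes on scale-down.
template:
  metadata:
    annotations:
      cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
```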

2. Persistent Storage Architecture​

Per-User PVC Pattern:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-rwo
provisioner: pd.csi.storage.gke.io
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: pd-balanced
```

Storage Options:

  • GCE Persistent Disk (CSI): Standard for per-user PVCs
  • Filestore/NFS: Shared volumes for collaboration
  • Hyperdisk Storage Pools: Scalable high-throughput for 1000s of volumes
  • GCS/S3: Backup/archival for inactive workspaces
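
As a sketch of the per-user PVC pattern, a claim bound to the `standard-rwo` StorageClass above might look like this (the claim name is a hypothetical per-user naming convention):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: workspace-user-1234   # hypothetical per-user naming scheme
spec:
  accessModes:
    - ReadWriteOnce           # one node at a time; fits a single-user workspace
  storageClassName: standard-rwo
  resources:
    requests:
      storage: 10Gi
```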

3. Scaling Architecture (10k-20k Users)​

Resource Profile per User:

| Resource | Light Session | Full Development |
|----------|---------------|------------------|
| vCPU     | 1-2 vCPU      | 2-4 vCPU         |
| RAM      | 2-4 GiB       | 6-8 GiB          |
| Disk     | 5-10 GB       | 15-30 GB         |
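
The "Full Development" profile translates to a container `resources` block along these lines (requests taken from the lower bounds of the table; the limits are an assumption):

```yaml
resources:
  requests:
    cpu: "2"      # lower bound of the 2-4 vCPU profile
    memory: 6Gi   # lower bound of the 6-8 GiB profile
  limits:
    cpu: "4"      # assumed cap at the upper bound of the profile
    memory: 8Gi
```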

Cost Estimates (10k concurrent users):

| Mode            | Monthly Cost | Notes                              |
|-----------------|--------------|------------------------------------|
| GKE Autopilot   | $1.0-1.2M    | Pay-per-pod, bursty workloads      |
| GKE Standard    | $0.8-0.9M    | 75-90% utilization, manual tuning  |
| Storage (100TB) | $4-6k        | GCE PD Balanced                    |

Per-User Cost: ~$0.05-0.10 per active user-hour

4. Multi-Tenancy Design​

Namespace-based Isolation (Recommended):

  • Each user/tenant → dedicated namespace
  • ResourceQuotas + LimitRanges prevent resource starvation
  • RBAC + NetworkPolicies restrict inter-tenant access
  • Scales to 10k+ users with proper quota tuning
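
A sketch of the per-namespace guardrails (namespace name and numbers are illustrative):

```yaml
# Caps the total resources a tenant namespace can request.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: user-quota
  namespace: user-1234            # hypothetical per-user namespace
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    persistentvolumeclaims: "2"
---
# Supplies defaults so pods without explicit requests still get bounded.
apiVersion: v1
kind: LimitRange
metadata:
  name: user-limits
  namespace: user-1234
spec:
  limits:
    - type: Container
      defaultRequest:
        cpu: "1"
        memory: 2Gi
      default:
        cpu: "2"
        memory: 4Gi
```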

Pod-per-User Model:

  • Strong isolation, higher control
  • Limited density, higher overhead
  • Use for regulated/enterprise scenarios

5. Autoscaling Strategy​

Three-Layer Autoscaling:

  1. HPA (Horizontal Pod Autoscaler): Scale pods based on CPU/memory
  2. Cluster Autoscaler: Add/remove nodes based on pending pods
  3. Node Auto-Provisioning: Create new node pools on demand

Example HPA (in autoscaling/v2 the CPU target is expressed via the metrics list rather than targetCPUUtilizationPercentage; the scale target name is illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: theia-hpa
spec:
  scaleTargetRef:
    kind: Deployment
    name: theia   # illustrative target workload
  minReplicas: 1
  maxReplicas: 10000
  metrics:
    - type: Resource
      resource:
        name: cpu
        target: {type: Utilization, averageUtilization: 60}
```

6. Security & Isolation​

Layered Security:

  • GKE Sandbox (gVisor): User-space kernel for untrusted code
  • NetworkPolicies: Default deny-all, whitelist ingress
  • Private GKE Clusters: No public node IPs
  • Workload Identity: Map GCP IAM to K8s ServiceAccounts
  • PodSecurity Admission: Enforce restricted profile
  • Shielded GKE Nodes: VTPM attestation
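
The "default deny-all, whitelist ingress" stance can be sketched as a pair of NetworkPolicies (tenant namespace and the ingress-nginx label are assumptions about the environment):

```yaml
# Deny all traffic to and from every pod in the tenant namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: user-1234          # hypothetical tenant namespace
spec:
  podSelector: {}               # empty selector = every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
---
# Whitelist ingress to Theia pods from the ingress controller's namespace only.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-ingress
  namespace: user-1234
spec:
  podSelector:
    matchLabels:
      app: theia
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx   # assumes ingress-nginx
```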

7. Terraform + Helm Deployment Bundle​

Infrastructure as Code:

```hcl
resource "google_container_cluster" "theia" {
  for_each = toset(["us-central1", "us-east1", "europe-west1"])
  name     = "theia-${each.key}"
  location = each.key   # one regional cluster per region

  cluster_autoscaling {
    enabled = true
    resource_limits {
      resource_type = "cpu"
      minimum       = 200
      maximum       = 20000
    }
  }
}
```

Helm Values:

```yaml
autoscaling:
  enabled: true
  minReplicas: 1
  maxReplicas: 500
  targetCPUUtilizationPercentage: 60

persistence:
  enabled: true
  storageClass: standard-rwo
  size: 10Gi
  reclaimPolicy: Retain

resources:
  requests:
    cpu: "2000m"
    memory: "4Gi"
```

8. Growth Path (1-50 → 10k Users)​

Phase-Based Scaling:

| Phase     | Users   | Architecture                             | Monthly Cost |
|-----------|---------|------------------------------------------|--------------|
| MVP       | 1-50    | Single regional GKE Autopilot (3 nodes)  | $400-800     |
| Growth    | 50-1k   | Multi-node, namespace isolation          | $20k-50k     |
| Scale-Out | 1k-10k  | Multi-pool, multi-zone regional          | $800k-1.0M   |
| Global    | 10k-20k | Multi-cluster federation                 | $1.0M+       |

Key Principle: The same Terraform + Helm configuration carries across all phases - no redesign needed


🔑 Key Insights for t2 Project​

Relevant to Current Socket.IO Issues​

  1. Session Timeout Kills Pods:

    • Default 30-minute inactivity timeout destroys IDE containers
    • Related to Socket.IO 400 errors: Pods may be terminating mid-session
    • Solution: Extend/disable sessionTimeout in the Theia Cloud config
  2. WebSocket Connection Management:

    • Known issue: Theia IDE WebSocket disconnects every 30 seconds in K8s
    • Reference: StackOverflow #64452006
    • Related to t2 Socket.IO issue: Session affinity + WebSocket annotation required
  3. Ingress/Load Balancer Configuration:

    • Requires WebSocket-specific annotations for GKE Ingress
    • Session affinity (CLIENT_IP or GENERATED_COOKIE) essential
    • Backend timeout must exceed WebSocket keepalive interval
  4. Health Check Endpoints:

    • Critical for GKE load balancers to route correctly
    • Missing health checks → 400 errors from stale backends
    • Recommendation: Implement /health and /ready endpoints
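
Points 3 and 4 above map onto a GKE BackendConfig; a sketch, assuming the /health endpoint exists and that the exact timeouts still need tuning:

```yaml
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: theia-backendconfig     # hypothetical name
spec:
  timeoutSec: 120               # must exceed the WebSocket keepalive interval
  sessionAffinity:
    affinityType: "GENERATED_COOKIE"
    affinityCookieTtlSec: 3600
  healthCheck:
    type: HTTP
    requestPath: /health        # assumes the /health endpoint is implemented
    checkIntervalSec: 15
```

The config is attached to the backend Service with the annotation `cloud.google.com/backend-config: '{"default": "theia-backendconfig"}'`.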

Architecture Lessons for t2​

  1. Multi-Tenant Pod Strategy:

    • Current t2 approach: Combined pods (Frontend + Theia + NGINX)
    • Research recommendation: Namespace-based isolation for scale
    • Cost optimization: Multi-tenant pods vs per-user pods (100x cost reduction)
  2. Persistent Storage Pattern:

    • Use CSI-backed dynamic PVC provisioning
    • volumeBindingMode: WaitForFirstConsumer for topology-aware provisioning
    • Hyperdisk Storage Pools for elastic capacity at scale
  3. Autoscaling Best Practices:

    • HPA for pods, CA for nodes, NAP for node pools
    • Mix Spot VMs for 50-70% cost savings on transient sessions
    • Regional clusters for multi-zone resilience
  4. Security Hardening:

    • GKE Sandbox (gVisor) for untrusted user code
    • NetworkPolicies for tenant isolation
    • Private GKE clusters + Workload Identity
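
The Spot VM point above can be expressed directly in the pod spec on GKE Autopilot, which schedules pods with this nodeSelector onto Spot capacity (pod name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: theia-transient-session           # placeholder name
spec:
  nodeSelector:
    cloud.google.com/gke-spot: "true"     # request Spot capacity on Autopilot
  terminationGracePeriodSeconds: 25       # Spot preemption gives ~30s of notice
  containers:
    - name: theia
      image: theiaide/theia               # placeholder image
```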

📋 Action Items for t2 Project​

Immediate (Socket.IO Fix)​

  • Add WebSocket annotation to Ingress (P0)
  • Verify session affinity enabled on backend service (P0)
  • Create /health and /ready endpoints for Theia pods (P0)
  • Increase backend timeout to 120s (P1)
  • Implement connection draining optimization (P1)

Short-term (Sprint 3)​

  • Implement persistent workspace storage (PVCs per user)
  • Configure Theia Cloud session timeout (disable or extend to 4 hours)
  • Add StatefulSet for Theia pods (if session persistence is critical)
  • Implement pod anti-affinity for HA
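
The first three short-term items combine naturally into a StatefulSet with a per-user volumeClaimTemplate; a sketch with illustrative names, image, mount path, and size:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: theia-user              # illustrative per-user workload name
spec:
  serviceName: theia
  replicas: 1
  selector:
    matchLabels:
      app: theia
  template:
    metadata:
      labels:
        app: theia
    spec:
      containers:
        - name: theia
          image: theiaide/theia           # placeholder image
          volumeMounts:
            - name: workspace
              mountPath: /home/project    # assumed workspace directory
  volumeClaimTemplates:
    - metadata:
        name: workspace                   # PVC provisioned per replica
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: standard-rwo
        resources:
          requests:
            storage: 10Gi
```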

Medium-term (Scaling Architecture)​

  • Review namespace-based multi-tenancy strategy
  • Design PVC lifecycle automation (provision on demand, cleanup on expiry)
  • Implement HPA + CA + NAP autoscaling
  • Cost analysis: Per-user pods vs multi-tenant pods

Long-term (10k+ Users)​

  • Multi-cluster federation strategy
  • GKE Sandbox (gVisor) for security isolation
  • Hyperdisk Storage Pools for scalable PVCs
  • Monitoring & alerting for large-scale deployment

🔗 References​

Source Document: 60KB comprehensive research (1,558 lines)


Last Updated: 2025-10-26
Status: ✅ Research complete, action items identified