Socket.IO & Theia Persistence - Critical Insights
Date: 2025-10-26
Context: Pod persistence issues + Socket.IO 400 Bad Request errors
Status: 🔴 CRITICAL - Production pods not persisting user data
🚨 Critical Issues Identified
1. Pod Persistence Problem (User-Reported)
Symptom: "Sessions on GKE are not persistent - pods don't save what's there between logins/sessions"
Root Causes (from Theia GKE research):
- Default 30-minute session timeout destroys idle pods
- No PersistentVolumeClaims (PVCs) configured for workspace data
- Theia Cloud auto-cleanup deletes inactive IDE containers
- Ephemeral pod design: no storage is mounted at the /home/projector workspace directory
Impact: Users lose all code, files, and session state when:
- Logging out
- Session timeout (30 min idle)
- Pod restart/crash
- Node draining/maintenance
2. Socket.IO 400 Errors (Related)
Current Status (from CLAUDE.md):
- ✅ Root Cause #1 FIXED: CDN caching disabled
- ⏳ Root Cause #2 IN PROGRESS: Session affinity missing
- 📋 Additional fixes needed: WebSocket annotation, health checks
Connection to Persistence:
- Session affinity failure → Socket.IO connects to wrong pod
- Pod termination mid-session → 400 error on reconnection
- Missing health checks → load balancer routes to terminating pods
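To make the link between affinity and the 400 errors concrete, here is a minimal Python simulation. It is a sketch, not the project's code. It assumes a simplified model of Socket.IO's HTTP long-polling: the pod that answers the handshake keeps the session ID in its own memory, and any other pod rejects that ID with HTTP 400. The `Pod` class and the `round_robin`, `client_ip_affinity`, and `run_session` helpers are hypothetical names, and the hash-based router only approximates how ClientIP affinity pins a client to one pod.

```python
import itertools
import zlib


class Pod:
    """Simplified stand-in for one Socket.IO server pod (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.sessions = set()  # session IDs live only in this pod's memory

    def handshake(self, client_ip):
        sid = f"{self.name}-{client_ip}"
        self.sessions.add(sid)
        return sid

    def poll(self, sid):
        # A session ID this pod has never seen is rejected with HTTP 400
        return 200 if sid in self.sessions else 400


def round_robin(pods):
    """Router without affinity: each request goes to the next pod in turn."""
    cycle = itertools.cycle(pods)
    return lambda client_ip: next(cycle)


def client_ip_affinity(pods):
    """Router with ClientIP-style affinity: one IP always maps to the same pod."""
    return lambda client_ip: pods[zlib.crc32(client_ip.encode()) % len(pods)]


def run_session(route, client_ip, polls=4):
    """Handshake once, then poll; return the HTTP status of each poll."""
    sid = route(client_ip).handshake(client_ip)
    return [route(client_ip).poll(sid) for _ in range(polls)]


def make_pods():
    return [Pod("pod-0"), Pod("pod-1"), Pod("pod-2")]


no_affinity = run_session(round_robin(make_pods()), "203.0.113.7")
sticky = run_session(client_ip_affinity(make_pods()), "203.0.113.7")
print("without affinity:", no_affinity)
print("with affinity:   ", sticky)
```

In this run, three of the four polls without affinity land on a pod that never saw the handshake and return 400, while every poll with affinity returns 200.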
✅ Required Fixes (Priority Order)
P0 - Immediate (Stop Data Loss)
1. Add Persistent Volume Claims (PVCs)
File: k8s/coditect-combined-deployment.yaml (or StatefulSet)
Add to pod spec:
spec:
  template:
    spec:
      volumes:
        - name: workspace-data
          persistentVolumeClaim:
            claimName: theia-workspace-pvc
      containers:
        - name: theia
          volumeMounts:
            - mountPath: /home/theia
              name: workspace-data
Create PVC manifest:
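apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: theia-workspace-pvc
  namespace: coditect-app
spec:
  accessModes:
    # ReadWriteOnce: the volume can be mounted by one node at a time,
    # so every pod using this claim must run on the same node
    - ReadWriteOnce
  storageClassName: standard-rwo
  resources:
    requests:
      storage: 20Gi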
2. Disable Theia Cloud Session Timeout
File: Theia Cloud config (if using the Theia Cloud operator)
{
  "sessionTimeout": 0,
  "closeAfterDisconnect": false
}
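Alternative (if not using the Theia Cloud operator): add the annotation below to the pods. It only stops the cluster autoscaler from evicting them during node scale-down; it does not change any session timeout.
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"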
P0 - Socket.IO Fixes (Prevent 400 Errors)
3. Add WebSocket Annotation to Ingress
File: k8s/coditect-combined-ingress.yaml
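metadata:
  annotations:
    # Note: the community ingress-nginx controller proxies WebSockets by default;
    # the NGINX Inc. controller documents this setting as nginx.org/websocket-services
    nginx.ingress.kubernetes.io/websocket-services: "coditect-combined-service"
    # Keep long-lived WebSocket connections open for up to 1 hour
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"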
4. Verify Session Affinity on Service
File: k8s/coditect-combined-service.yaml
spec:
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800 # 3 hours
5. Create Health Check Endpoints
File: NGINX config in theia pod
location /health {
    # default_type sets Content-Type; add_header would append a second Content-Type header
    default_type text/plain;
    return 200 "healthy\n";
}
location /ready {
    default_type text/plain;
    return 200 "ready\n";
}
Update deployment health checks:
livenessProbe:
  httpGet:
    path: /health
    port: 80
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 80
  initialDelaySeconds: 10
  periodSeconds: 5
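As a rough worked example of what these probe settings imply, the Python sketch below estimates how long a failing pod can go undetected. It assumes the Kubernetes default `failureThreshold` of 3, which the probe settings above do not set explicitly, and it ignores probe timeouts. The function name is hypothetical.

```python
def worst_case_detection_seconds(period_seconds, failure_threshold=3):
    """Rough upper bound on how long a failure goes unacted on.

    The kubelet acts only after `failure_threshold` consecutive failed probes,
    which are spaced `period_seconds` apart. Probe timeouts are ignored here.
    """
    return period_seconds * failure_threshold


# periodSeconds values from the probes above; failureThreshold=3 is an assumed default
liveness_window = worst_case_detection_seconds(period_seconds=10)
readiness_window = worst_case_detection_seconds(period_seconds=5)

print(f"container restarted after roughly {liveness_window}s of failed /health checks")
print(f"pod removed from Service endpoints after roughly {readiness_window}s of failed /ready checks")
```

Under these assumptions, a pod that stops answering `/ready` can keep receiving traffic for up to about 15 seconds, which is the window the connection-draining settings need to cover.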
P1 - Medium-term (Scale & Reliability)
6. Migrate to StatefulSet (if session persistence is critical)
StatefulSets provide:
- Stable pod identity (pod-0, pod-1, etc.)
- A persistent volume per pod (PVCs created automatically from volumeClaimTemplates)
- Ordered deployment and scaling
- Ordered, graceful termination (pods are stopped in reverse ordinal order)
Example:
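apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: theia-statefulset
spec:
  serviceName: "theia"
  replicas: 3
  # selector and template are required fields; the label and image below are placeholders
  selector:
    matchLabels:
      app: theia            # assumed label
  template:
    metadata:
      labels:
        app: theia          # must match the selector
    spec:
      containers:
        - name: theia
          image: <theia-image>   # placeholder, use the project's image
          volumeMounts:
            - name: workspace
              mountPath: /home/theia
  volumeClaimTemplates:
    - metadata:
        name: workspace
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: "standard-rwo"
        resources:
          requests:
            storage: 20Gi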
7. Implement Connection Draining
File: Deployment spec
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 120
      containers:
        - name: theia
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 30"]
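The two numbers in this spec interact: the time spent in the preStop hook counts against the termination grace period, so only the remainder is left for the container to close connections after it receives SIGTERM. A small Python sketch of that arithmetic follows; the function name is hypothetical.

```python
def shutdown_budget_seconds(termination_grace_period, pre_stop_sleep):
    """Seconds left between SIGTERM and SIGKILL once the preStop hook has finished."""
    remaining = termination_grace_period - pre_stop_sleep
    if remaining <= 0:
        raise ValueError("preStop hook consumes the whole termination grace period")
    return remaining


# Values from the spec above: terminationGracePeriodSeconds=120, preStop sleep of 30s
budget = shutdown_budget_seconds(termination_grace_period=120, pre_stop_sleep=30)
print(f"{budget}s remain for in-flight Socket.IO connections to close")
```

With the values above, the 30-second sleep gives the load balancer time to stop routing to the pod, and 90 seconds remain for open sessions to disconnect cleanly.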
📋 Verification Steps
After Implementing PVCs
- Create test file in pod:
kubectl exec -it coditect-combined-<pod-id> -n coditect-app -- bash
echo "Test persistence" > /home/theia/test.txt
cat /home/theia/test.txt
- Delete pod:
kubectl delete pod coditect-combined-<pod-id> -n coditect-app
- Verify file persists in new pod:
kubectl get pods -n coditect-app # Get new pod ID
kubectl exec -it coditect-combined-<new-pod-id> -n coditect-app -- cat /home/theia/test.txt
Expected: the command prints "Test persistence", confirming the file survived the pod restart
After Socket.IO Fixes
- Test WebSocket connection:
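curl -i -N --http1.1 -H "Connection: Upgrade" -H "Upgrade: websocket" \
  -H "Sec-WebSocket-Version: 13" -H "Sec-WebSocket-Key: $(openssl rand -base64 16)" https://coditect.ai/theia
# Expect "HTTP/1.1 101 Switching Protocols" with "Connection: Upgrade"; a plain `curl -I` sends no Upgrade request.
# Adjust the URL to the actual WebSocket endpoint if it differs.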
- Verify session affinity:
kubectl get service coditect-combined-service -n coditect-app -o yaml | grep -A 5 sessionAffinity
- Test health endpoints:
curl https://coditect.ai/health
curl https://coditect.ai/ready
🔗 Related Documentation
Primary References:
- theia-instance-running-on-gcp-gke-kubernetes.md - Full 60KB research
- theia-gke-scaling-research-summary.md - Executive summary
Socket.IO Investigation:
Current Project Status:
- CLAUDE.md - Project overview
- docs/10-execution-plans/phased-deployment-checklist.md
💡 Quick Win Implementation Order
Fastest path to fixing both issues (2-3 hours):
- ✅ Add PVC to deployment (30 min) - Stops data loss immediately
- ✅ Add WebSocket annotation to Ingress (15 min) - Fixes Socket.IO
- ✅ Verify session affinity (15 min) - Ensures sticky sessions
- ✅ Create health endpoints (30 min) - Prevents stale routing
- ✅ Test end-to-end (60 min) - Verify persistence + Socket.IO working
Deploy & verify (30 min):
kubectl apply -f k8s/pvc.yaml
kubectl apply -f k8s/coditect-combined-deployment.yaml
kubectl apply -f k8s/coditect-combined-ingress.yaml
kubectl rollout status deployment/coditect-combined -n coditect-app
Last Updated: 2025-10-26
Status: 🔴 ACTION REQUIRED - Pod persistence and Socket.IO fixes needed
Priority: P0 CRITICAL - Production data loss occurring