Socket.IO & Theia Persistence - Critical Insights

Date: 2025-10-26
Context: Pod persistence issues + Socket.IO 400 Bad Request errors
Status: 🔴 CRITICAL - Production pods not persisting user data


🚨 Critical Issues Identified

1. Pod Persistence Problem (User-Reported)

Symptom: "Sessions on GKE are not persistent - pods don't save what's there between logins/sessions"

Root Causes (from Theia GKE research):

  1. Default 30-minute session timeout destroys idle pods
  2. No PersistentVolumeClaims (PVCs) configured for workspace data
  3. Theia Cloud auto-cleanup deletes inactive IDE containers
  4. Ephemeral pod design - no persistent storage mounted at /home/project or the workspace directory

Impact: Users lose all code, files, and session state when:

  • Logging out
  • Session timeout (30 min idle)
  • Pod restart/crash
  • Node draining/maintenance

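Current Status of the Socket.IO 400 errors (from CLAUDE.md; the root-cause numbers below refer to the Socket.IO investigation, not the persistence list above):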

  • ✅ Root Cause #1 FIXED: CDN caching disabled
  • ⏳ Root Cause #2 IN PROGRESS: Session affinity missing
  • 📋 Additional fixes needed: WebSocket annotation, health checks

Connection to Persistence:

  • Session affinity failure → Socket.IO connects to wrong pod
  • Pod termination mid-session → 400 error on reconnection
  • Missing health checks → load balancer routes to terminating pods

✅ Required Fixes (Priority Order)

P0 - Immediate (Stop Data Loss)

1. Add Persistent Volume Claims (PVCs)

File: k8s/coditect-combined-deployment.yaml (or StatefulSet)

Add to pod spec:

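spec:
  template:
    spec:
      volumes:
        - name: workspace-data
          persistentVolumeClaim:
            claimName: theia-workspace-pvc
      containers:
        - name: theia
          volumeMounts:
            # Must be the directory Theia actually writes workspace data to
            # (the root-cause list above mentions /home/project).
            - mountPath: /home/theia
              name: workspace-data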

Create PVC manifest:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: theia-workspace-pvc
  namespace: coditect-app
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard-rwo
  resources:
    requests:
      storage: 20Gi
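
One caveat with a ReadWriteOnce claim: the volume can be attached to only one node at a time, so a Deployment's default rolling update can leave the replacement pod waiting for the volume while the old pod still holds it. A minimal sketch of one way around that, assuming a single replica is acceptable for now, is to switch the Deployment to the Recreate strategy. The fragment is an assumption about how the existing Deployment could be adjusted, not a copy of the project's manifest.

```yaml
# Fragment for k8s/coditect-combined-deployment.yaml (sketch, assumes one replica)
spec:
  replicas: 1          # a single RWO claim cannot be shared across nodes
  strategy:
    type: Recreate     # stop the old pod before starting its replacement
```

Recreate trades a short outage on each rollout for a clean volume handover. Per-pod volumes with more than one replica are what the StatefulSet option under P1 addresses.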

2. Disable Theia Cloud Session Timeout

File: Theia Cloud config (if using the Theia Cloud operator)

{
  "sessionTimeout": 0,
  "closeAfterDisconnect": false
}

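Alternative (if not using the Theia Cloud operator): add this annotation to the pod template so the cluster autoscaler does not evict the pod during scale-down. It only affects autoscaler evictions, not application-level session timeouts.

metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"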
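
The annotation above covers cluster-autoscaler scale-down only. Other voluntary disruptions, such as a node drain for maintenance, are governed by PodDisruptionBudgets. The following is a sketch under assumptions: the budget name and the pod label are hypothetical and need to match the labels actually used by the Deployment.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: theia-pdb              # hypothetical name
  namespace: coditect-app
spec:
  maxUnavailable: 0            # block voluntary evictions of matching pods
  selector:
    matchLabels:
      app: coditect-combined   # assumed label; use the Deployment's real pod labels
```

With maxUnavailable: 0, a drain waits until the budget is relaxed or the pod is deleted by hand, so this is a stopgap while workspace data still lives on ephemeral storage rather than a permanent setting.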

P0 - Socket.IO Fixes (Prevent 400 Errors)

3. Add WebSocket Annotation to Ingress

File: k8s/coditect-combined-ingress.yaml

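metadata:
  annotations:
    # The community ingress-nginx controller proxies WebSockets by default;
    # only the timeouts below are needed to keep long-lived connections open.
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    # websocket-services is an annotation of the NGINX Inc. controller and
    # uses the nginx.org/ prefix. Enable it only if that controller is in use:
    # nginx.org/websocket-services: "coditect-combined-service"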

4. Verify Session Affinity on Service

File: k8s/coditect-combined-service.yaml

spec:
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800 # 3 hours
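
Service-level ClientIP affinity is enforced by kube-proxy. The community ingress-nginx controller sends traffic directly to pod endpoints, so this setting may have no effect on requests that arrive through the Ingress. If that controller is in use, cookie-based affinity on the Ingress is the usual alternative. A minimal sketch, assuming ingress-nginx (the cookie name is hypothetical):

```yaml
# Fragment for k8s/coditect-combined-ingress.yaml (sketch, assumes ingress-nginx)
metadata:
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/affinity-mode: "persistent"
    nginx.ingress.kubernetes.io/session-cookie-name: "coditect-route"  # hypothetical cookie name
    nginx.ingress.kubernetes.io/session-cookie-max-age: "10800"        # 3 hours, matching the Service timeout
```

Cookie affinity keeps a browser pinned to one pod even when several users share a source IP, which matters for Socket.IO long-polling, where every request in a session must reach the pod that issued the session ID.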

5. Create Health Check Endpoints

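File: NGINX config in Theia pod

# default_type sets the Content-Type of the text returned below;
# add_header would send a second, conflicting Content-Type header.
location /health {
    default_type text/plain;
    return 200 "healthy\n";
}

location /ready {
    default_type text/plain;
    return 200 "ready\n";
}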

Update deployment health checks:

livenessProbe:
  httpGet:
    path: /health
    port: 80
  initialDelaySeconds: 30
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /ready
    port: 80
  initialDelaySeconds: 10
  periodSeconds: 5

P1 - Medium-term (Scale & Reliability)

6. Migrate to StatefulSet (if session persistence is critical)

StatefulSets provide:

  • Stable pod identity (pod-0, pod-1, etc.)
  • Persistent volume per pod (automatic PVC creation)
  • Ordered deployment/scaling
  • Ordered, graceful termination when scaling down

Example:

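apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: theia-statefulset
  namespace: coditect-app
spec:
  serviceName: "theia"   # must name an existing headless Service
  replicas: 3
  selector:
    matchLabels:
      app: theia          # illustrative label
  template:
    metadata:
      labels:
        app: theia
    spec:
      containers:
        - name: theia
          image: <theia-image>   # placeholder: reuse the image from the current Deployment
          volumeMounts:
            - name: workspace
              mountPath: /home/theia
  volumeClaimTemplates:
    - metadata:
        name: workspace
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: "standard-rwo"
        resources:
          requests:
            storage: 20Gi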
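
A StatefulSet's serviceName has to point at a headless Service, which is what gives each pod its stable DNS identity. A minimal sketch of that Service follows. The selector label and port are assumptions and must match the pod labels and container port used by the StatefulSet.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: theia                # matches serviceName in the StatefulSet
  namespace: coditect-app
spec:
  clusterIP: None            # headless: DNS resolves to individual pod IPs
  selector:
    app: theia               # assumed label; must match the StatefulSet pod labels
  ports:
    - name: http
      port: 80               # assumed port, matching the probes above
```

Each pod then resolves as <pod-name>.theia.coditect-app.svc.cluster.local, for example theia-statefulset-0.theia.coditect-app.svc.cluster.local.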

7. Implement Connection Draining

File: Deployment spec

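spec:
  template:
    spec:
      # Must exceed the preStop delay plus the time Theia needs to shut down
      terminationGracePeriodSeconds: 120
      containers:
        - name: theia
          lifecycle:
            preStop:
              exec:
                # Keep serving for 30s so the load balancer can stop routing
                # new connections here before the container receives SIGTERM
                command: ["/bin/sh", "-c", "sleep 30"]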
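
If traffic enters through GKE Ingress (a Google Cloud load balancer) rather than an in-cluster NGINX controller, the load balancer has its own backend timeout and draining settings, configured through a BackendConfig resource. The sketch below assumes that setup. The BackendConfig name is hypothetical, and the values are starting points rather than tested settings.

```yaml
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: coditect-backendconfig   # hypothetical name
  namespace: coditect-app
spec:
  timeoutSec: 3600               # backend timeout; also caps WebSocket connection lifetime
  connectionDraining:
    drainingTimeoutSec: 60       # let in-flight connections finish before removal
  sessionAffinity:
    affinityType: "CLIENT_IP"
---
# Attach the BackendConfig to the existing Service (fragment)
apiVersion: v1
kind: Service
metadata:
  name: coditect-combined-service
  namespace: coditect-app
  annotations:
    cloud.google.com/backend-config: '{"default": "coditect-backendconfig"}'
```

The load balancer's draining timeout and the pod's preStop delay work together: the first stops new connections at the edge, the second keeps the container alive while existing ones wind down.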

📋 Verification Steps

After Implementing PVCs

  1. Create test file in pod:
     kubectl exec -it coditect-combined-<pod-id> -n coditect-app -- bash
     echo "Test persistence" > /home/theia/test.txt
     cat /home/theia/test.txt
  2. Delete pod:
     kubectl delete pod coditect-combined-<pod-id> -n coditect-app
  3. Verify file persists in new pod:
     kubectl get pods -n coditect-app  # Get new pod ID
     kubectl exec -it coditect-combined-<new-pod-id> -n coditect-app -- cat /home/theia/test.txt

Expected: "Test persistence" still exists

After Socket.IO Fixes

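  1. Test WebSocket upgrade (a plain curl -I sends no Upgrade headers, so request the upgrade explicitly over HTTP/1.1):
     curl -i -N --http1.1 --max-time 5 \
       -H "Connection: Upgrade" -H "Upgrade: websocket" \
       -H "Sec-WebSocket-Version: 13" \
       -H "Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==" \
       https://coditect.ai/theia
     # Expect "HTTP/1.1 101 Switching Protocols" if this path serves a WebSocket endpoint
  2. Verify session affinity:
     kubectl get service coditect-combined-service -n coditect-app -o yaml | grep -A 5 sessionAffinity
  3. Test health endpoints:
     curl https://coditect.ai/health
     curl https://coditect.ai/ready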

💡 Quick Win Implementation Order

Fastest path to fixing both issues (2-3 hours):

  1. Add PVC to deployment (30 min) - Stops data loss immediately
  2. Add WebSocket annotation to Ingress (15 min) - Fixes Socket.IO
  3. Verify session affinity (15 min) - Ensures sticky sessions
  4. Create health endpoints (30 min) - Prevents stale routing
  5. Test end-to-end (60 min) - Verify persistence + Socket.IO working

Deploy & verify (30 min):

kubectl apply -f k8s/pvc.yaml
kubectl apply -f k8s/coditect-combined-deployment.yaml
kubectl apply -f k8s/coditect-combined-ingress.yaml
kubectl rollout status deployment/coditect-combined -n coditect-app

Last Updated: 2025-10-26
Status: 🔴 ACTION REQUIRED - Pod persistence and Socket.IO fixes needed
Priority: P0 CRITICAL - Production data loss occurring