StatefulSet Migration Guide - Production Deployment
Date: 2025-10-26
Objective: Migrate from ephemeral Deployment to persistent StatefulSet
Estimated Time: 2-4 hours
Downtime: ~2-5 minutes during migration
Overview
This guide migrates the Coditect combined service (Frontend + theia) from a stateless Deployment to a StatefulSet with persistent storage. After the migration:
- User data persists across pod restarts
- Session affinity routes users to the same pod
- High availability with 3 replicas
- Automatic PVC creation per pod
- Graceful shutdown with connection draining
Architecture Changes
Before (Current State)
Deployment: coditect-combined-v5 (3 replicas)
└── Pods: Random names, ephemeral storage
    ├── coditect-combined-v5-abc123 (no storage)
    ├── coditect-combined-v5-def456 (no storage)
    └── coditect-combined-v5-ghi789 (no storage)
Service: Round-robin load balancing
└── Users routed to random pods
After (Target State)
StatefulSet: coditect-combined (3 replicas)
└── Pods: Stable names, persistent storage
    ├── coditect-combined-0 → workspace-coditect-combined-0 (50GB)
    ├── coditect-combined-1 → workspace-coditect-combined-1 (50GB)
    └── coditect-combined-2 → workspace-coditect-combined-2 (50GB)
Service: Session affinity (ClientIP + Cookies)
└── Users stick to same pod
Components Created
1. StatefulSet (k8s/theia-statefulset.yaml)
Key Features:
- 3 replicas with stable pod names (coditect-combined-0, 1, 2)
- Automatic PVC creation via volumeClaimTemplates
- 2 volumes per pod:
  - /workspace (50GB) - User files, code, projects
  - /home/theia/.theia (5GB) - theia config, settings
- Graceful shutdown: 120s termination grace period
- Environment variables: POD_NAME, POD_IP injected for debugging
PVC Template:
volumeClaimTemplates:
  - metadata:
      name: workspace
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: standard-rwo
      resources:
        requests:
          storage: 50Gi
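Kubernetes names each PVC created from a volumeClaimTemplate as <template-name>-<statefulset-name>-<ordinal>, which is where names such as workspace-coditect-combined-0 come from. The sketch below prints the names to expect for a given replica count; pvc_names is a hypothetical helper and not part of the migration scripts.

```shell
#!/bin/sh
# Print the PVC names derived from a volumeClaimTemplate:
#   <template-name>-<statefulset-name>-<ordinal>
# pvc_names is a hypothetical helper used only for illustration.
pvc_names() (
  template="$1"; statefulset="$2"; replicas="$3"
  i=0
  while [ "$i" -lt "$replicas" ]; do
    printf '%s-%s-%s\n' "$template" "$statefulset" "$i"
    i=$((i + 1))
  done
)

# Names to expect for the "workspace" template with 3 replicas.
pvc_names workspace coditect-combined 3
```

Comparing this list with the output of kubectl get pvc is a quick way to spot a missing or misnamed claim.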
2. Headless Service (theia-headless)
Purpose: Required for StatefulSet DNS resolution
DNS Pattern: <pod-name>.<service-name>.<namespace>.svc.cluster.local
Examples:
coditect-combined-0.theia-headless.coditect-app.svc.cluster.local
coditect-combined-1.theia-headless.coditect-app.svc.cluster.local
coditect-combined-2.theia-headless.coditect-app.svc.cluster.local
3. Load Balancer Service (coditect-combined-service)
Features:
- Session affinity: ClientIP + 3-hour timeout
- BackendConfig annotation: Links to session affinity config
- NEG annotation: GKE network endpoint groups
Config:
sessionAffinity: ClientIP
sessionAffinityConfig:
  clientIP:
    timeoutSeconds: 10800 # 3 hours
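To confirm that the live Service carries this configuration, the two fields can be read back with a jsonpath query. This is a minimal sketch; check_service_affinity is a hypothetical helper name, and it assumes kubectl is already pointed at the right cluster.

```shell
#!/bin/sh
# Succeed only if the Service reports ClientIP affinity with the expected timeout.
# check_service_affinity is a hypothetical helper, not part of the migration scripts.
check_service_affinity() (
  svc="$1"; ns="$2"; want_timeout="$3"
  got="$(kubectl get service "$svc" -n "$ns" \
    -o jsonpath='{.spec.sessionAffinity} {.spec.sessionAffinityConfig.clientIP.timeoutSeconds}')" || exit 1
  [ "$got" = "ClientIP $want_timeout" ]
)
```

Example: check_service_affinity coditect-combined-service coditect-app 10800 && echo "affinity OK"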
4. BackendConfig (backend-config-stateful.yaml)
Session Affinity:
- Type: CLIENT_IP
- Cookie TTL: 3 hours (only takes effect when the affinity type is GENERATED_COOKIE)
- Connection draining: 120s
Timeouts:
- Backend timeout: 3600s (1 hour) for WebSocket
- Health check: 10s interval, 5s timeout
Health Checks:
- Path: /health
- Healthy threshold: 2 checks
- Unhealthy threshold: 3 checks
5. Ingress (ingress-stateful.yaml)
WebSocket Support:
nginx.ingress.kubernetes.io/websocket-services: "coditect-combined-service"
Session Affinity:
nginx.ingress.kubernetes.io/affinity: "cookie"
nginx.ingress.kubernetes.io/affinity-mode: "persistent"
nginx.ingress.kubernetes.io/session-cookie-name: "coditect-affinity"
nginx.ingress.kubernetes.io/session-cookie-max-age: "10800" # 3 hours
Timeouts (for long WebSocket connections):
proxy-connect-timeout: "3600"
proxy-send-timeout: "3600"
proxy-read-timeout: "3600"
Migration Procedure
Prerequisites
- GKE cluster access:
gcloud container clusters get-credentials codi-poc-e2-cluster --zone=us-central1-a
- Namespace access:
kubectl config set-context --current --namespace=coditect-app
- Backup current state (optional but recommended):
kubectl get all -n coditect-app -o yaml > backup-$(date +%Y%m%d-%H%M%S).yaml
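Note that kubectl get all does not include Ingresses, PVCs, or ConfigMaps. A slightly wider backup can be sketched as follows; backup_namespace is a hypothetical helper, and Secrets are left out here so the backup file does not contain credentials, which is a choice of this sketch rather than a requirement.

```shell
#!/bin/sh
# Dump the common workload objects plus Ingresses, PVCs and ConfigMaps to a
# timestamped YAML file and print the file name.
# backup_namespace is a hypothetical helper, not part of the migration scripts.
backup_namespace() (
  ns="$1"
  out="backup-${ns}-$(date +%Y%m%d-%H%M%S).yaml"
  kubectl get all,ingress,pvc,configmap -n "$ns" -o yaml > "$out" || exit 1
  printf '%s\n' "$out"
)
```

Example: backup_namespace coditect-app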
Step-by-Step Migration
Option A: Automated Migration (Recommended)
# Run the migration script
./scripts/migrate-to-statefulset.sh
What it does:
- Verifies GKE access
- Creates BackendConfig
- Deletes old Deployment (asks for confirmation)
- Creates StatefulSet with PVCs
- Updates Ingress
- Waits for rollout
- Runs basic persistence test
- Displays final status
Expected output:
Migration Complete!
StatefulSet Status: 3/3 pods ready
Pods: coditect-combined-0, coditect-combined-1, coditect-combined-2
PVCs: workspace-coditect-combined-0 (Bound), workspace-coditect-combined-1 (Bound), workspace-coditect-combined-2 (Bound)
Option B: Manual Migration
# 1. Create BackendConfig
kubectl apply -f k8s/backend-config-stateful.yaml
# 2. Delete old Deployment (run whichever command matches the Deployment name in your cluster)
kubectl delete deployment coditect-combined-v5 -n coditect-app
# OR
kubectl delete deployment coditect-combined -n coditect-app
# 3. Create StatefulSet
kubectl apply -f k8s/theia-statefulset.yaml
# 4. Wait for pods to be ready
kubectl rollout status statefulset/coditect-combined -n coditect-app
# 5. Verify PVCs created
kubectl get pvc -n coditect-app
# 6. Update Ingress
kubectl apply -f k8s/ingress-stateful.yaml
# 7. Verify all components
kubectl get all -n coditect-app -l app=coditect-combined
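Step 5 above is a visual check. If the manual path is scripted, the same check can be turned into a polling loop that waits until the expected number of PVCs report Bound. This is a sketch under the assumption that STATUS is the second column of kubectl get pvc --no-headers; both function names are hypothetical.

```shell
#!/bin/sh
# Count PVCs in the namespace whose STATUS column is "Bound".
count_bound_pvcs() (
  ns="$1"
  kubectl get pvc -n "$ns" --no-headers |
    awk '$2 == "Bound" { n++ } END { print n + 0 }'
)

# Poll until at least $2 PVCs are Bound; give up after $3 attempts (default 30).
wait_for_bound_pvcs() (
  ns="$1"; want="$2"; attempts="${3:-30}"
  while [ "$attempts" -gt 0 ]; do
    [ "$(count_bound_pvcs "$ns")" -ge "$want" ] && exit 0
    attempts=$((attempts - 1))
    sleep 10
  done
  exit 1
)
```

Example: wait_for_bound_pvcs coditect-app 3 || echo "PVCs not bound yet; see kubectl describe pvc"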
Testing & Validation
Test 1: Persistence Test (Critical)
./scripts/test-persistence.sh
What it tests:
- Creates file in pod's /workspace
- Verifies file exists
- Deletes pod (simulates crash)
- Verifies file persists in new pod
- Checks PVCs are mounted correctly
Expected result: File persists across pod restart
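The steps listed above can be sketched as a single function. This is not the content of scripts/test-persistence.sh, which is not shown here; persistence_check, the marker file name, and the retry counts are assumptions, and the container image is assumed to provide sh and cat.

```shell
#!/bin/sh
# Write a marker into /workspace, delete the pod, then confirm the marker is
# still there once the StatefulSet has recreated the pod under the same name.
# persistence_check is a hypothetical helper; $3 optionally overrides the marker.
persistence_check() (
  pod="$1"; ns="$2"
  marker="${3:-persist-$(date +%s)}"
  file=/workspace/.persistence-check

  kubectl exec "$pod" -n "$ns" -- sh -c "printf '%s' '$marker' > $file" || exit 1
  kubectl delete pod "$pod" -n "$ns" --wait=true || exit 1

  # The replacement pod needs time to schedule and reattach its PVC, so retry.
  tries=60
  while [ "$tries" -gt 0 ]; do
    if got="$(kubectl exec "$pod" -n "$ns" -- cat "$file" 2>/dev/null)"; then
      [ "$got" = "$marker" ]
      exit $?
    fi
    tries=$((tries - 1))
    sleep 5
  done
  exit 1
)
```

Example: persistence_check coditect-combined-0 coditect-app && echo "file survived pod restart"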
Test 2: Session Affinity Test
./scripts/test-session-affinity.sh
What it tests:
- Ingress cookie-based affinity configured
- Service ClientIP affinity configured
- BackendConfig session affinity configured
- Simulates multiple requests with cookies
- Verifies pod distribution
Expected result: Session affinity working
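One of these checks, whether the Ingress actually issues the affinity cookie, can be reproduced by hand from response headers. The filter below is a sketch; has_affinity_cookie is a hypothetical helper, and the default cookie name is taken from the Ingress annotations in this guide.

```shell
#!/bin/sh
# Read HTTP response headers on stdin and succeed if a Set-Cookie header
# issues the named affinity cookie (default: coditect-affinity).
# has_affinity_cookie is a hypothetical helper, not part of the test scripts.
has_affinity_cookie() (
  name="${1:-coditect-affinity}"
  grep -qi "^set-cookie:[[:space:]]*${name}="
)
```

Example: curl -s -o /dev/null -D - https://coditect.ai/theia | has_affinity_cookie && echo "affinity cookie issued"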
Test 3: Multi-User Scenario (Manual)
User 1:
1. Open: https://coditect.ai/theia
2. Create file: /workspace/user1-file.txt
3. Note pod name from logs/DevTools
4. Logout
5. Login again
6. Verify: Same pod + file still exists
User 2 (different browser/IP):
1. Open: https://coditect.ai/theia
2. Create file: /workspace/user2-file.txt
3. Note pod name (may be different from User 1)
4. Verify: If the pod differs from User 1's, User 1's file is NOT visible (separate workspaces)
Expected behavior:
- Each user routed to specific pod (session affinity)
- User's files persist on their pod
- Users on different pods have separate workspaces
User → Pod Routing Strategy
How Session Affinity Works
Initial Assignment (first visit):
1. User visits https://coditect.ai/theia
2. Ingress routes to any available pod (round-robin)
3. Ingress sets cookie: coditect-affinity=<hash>
4. Pod serves theia IDE
5. User creates files in /workspace (on this pod)
Subsequent Visits (with cookie):
1. User visits again with cookie
2. Ingress reads cookie: coditect-affinity=<hash>
3. Ingress routes to SAME pod (sticky session)
4. User sees same /workspace files
Session Affinity Layers:
- Ingress Level: Cookie-based (nginx.ingress.kubernetes.io/affinity: "cookie")
- Service Level: ClientIP (sessionAffinity: ClientIP)
- BackendConfig Level: CLIENT_IP (affinityType: "CLIENT_IP")
Fallback Strategy:
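- If the cookie expires (3 hours): the user is re-assigned to a random pod
- If the pod is down: traffic goes to a healthy pod (warning: the user's files stay on the original pod's PVC and are not visible from the new pod)
- If the user clears cookies: a new assignment is made (warning: possibly a different pod with a different workspace)
Improvement (Future):
- Store user→pod mapping in Redis or database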
- API endpoint to return assigned pod for user
- Frontend checks user's assigned pod before loading theia
- Explicit pod routing via query param:
/theia?pod=0
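As a lighter alternative to a stored mapping, the pod ordinal can be derived deterministically from a user identifier. Note that this is a different technique from the Redis/database mapping proposed above: it needs no lookup store, but changing the replica count reassigns most users. pod_for_user and the example user ID are hypothetical.

```shell
#!/bin/sh
# Map a user identifier to a pod ordinal in [0, replicas) with a CRC checksum.
# pod_for_user is a hypothetical helper; the same input always yields the same ordinal.
pod_for_user() (
  user="$1"; replicas="$2"
  sum="$(printf '%s' "$user" | cksum | cut -d' ' -f1)"
  printf '%s\n' "$((sum % replicas))"
)

# Build the explicit routing path for a (hypothetical) user with 3 replicas.
printf '/theia?pod=%s\n' "$(pod_for_user alice@example.com 3)"
```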
Monitoring & Troubleshooting
Check StatefulSet Status
# StatefulSet
kubectl get statefulset coditect-combined -n coditect-app
# Pods
kubectl get pods -n coditect-app -l app=coditect-combined -o wide
# PVCs
kubectl get pvc -n coditect-app
# Events
kubectl get events -n coditect-app --sort-by='.lastTimestamp'
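For scripted monitoring, the ready and desired replica counts can be compared directly instead of reading the table output. This is a minimal sketch; statefulset_ready is a hypothetical helper. Kubernetes omits readyReplicas when it is zero, so an empty value is treated as not ready.

```shell
#!/bin/sh
# Print "<ready>/<desired>" and succeed only when every replica is ready.
# statefulset_ready is a hypothetical helper, not part of the migration scripts.
statefulset_ready() (
  name="$1"; ns="$2"
  status="$(kubectl get statefulset "$name" -n "$ns" \
    -o jsonpath='{.status.readyReplicas}/{.spec.replicas}')" || exit 1
  printf '%s\n' "$status"
  ready="${status%/*}"; want="${status#*/}"
  [ -n "$ready" ] && [ "$ready" = "$want" ]
)
```

Example: statefulset_ready coditect-combined coditect-app || echo "StatefulSet not fully ready"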
Check Pod Logs
# Specific pod
kubectl logs coditect-combined-0 -n coditect-app
# Follow logs
kubectl logs -f coditect-combined-0 -n coditect-app
# All pods
kubectl logs -n coditect-app -l app=coditect-combined --tail=20
Check PVC Mounting
# Exec into pod
kubectl exec -it coditect-combined-0 -n coditect-app -- bash
# Inside pod:
df -h /workspace
ls -la /workspace
Common Issues
Issue 1: Pods stuck in Pending
- Cause: PVC provisioning slow
- Check:
kubectl describe pod coditect-combined-0 -n coditect-app
- Solution: Wait 2-5 minutes for GCE PD to provision
Issue 2: Old data not migrated
- Cause: Old Deployment had no PVCs
- Solution: Data cannot be migrated (was ephemeral)
- Workaround: Users must recreate files
Issue 3: Session affinity not working
- Cause: Ingress annotations not applied
- Check:
kubectl get ingress coditect-ingress -n coditect-app -o yaml
- Solution: Re-apply ingress:
kubectl apply -f k8s/ingress-stateful.yaml
Issue 4: Different pod on each visit
- Cause: Cookies not being set/read
- Check: Browser DevTools → Application → Cookies → coditect-affinity
- Solution: Verify Ingress affinity annotations, check HTTPS enabled
Rollback Procedure
If issues occur, roll back to the Deployment:
# 1. Delete StatefulSet (keeps PVCs by default)
kubectl delete statefulset coditect-combined -n coditect-app
# 2. Re-create old Deployment
kubectl apply -f k8s/k8s-combined-deployment.yaml
# 3. Wait for rollout
kubectl rollout status deployment/coditect-combined-v5 -n coditect-app
Note: PVCs remain after rollback. They can be deleted manually once the stored data is no longer needed:
kubectl delete pvc -n coditect-app -l app=coditect-combined
Scaling StatefulSet
Scale up:
kubectl scale statefulset coditect-combined --replicas=5 -n coditect-app
What happens:
- New pods created: coditect-combined-3, coditect-combined-4
- New PVCs auto-created for each new pod
- New users are distributed across all 5 pods; existing users stay on their current pod while their affinity cookie is valid
Scale down:
kubectl scale statefulset coditect-combined --replicas=2 -n coditect-app
What happens:
- Pod coditect-combined-2 deleted
- PVC workspace-coditect-combined-2 retained (manual deletion required)
- Users on pod-2 reconnect to pod-0 or pod-1 (warning: session is lost)
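Because retained PVCs must be cleaned up by hand, it helps to list which claims no longer have a pod after a scale-down. The filter below is a sketch: it reads PVC names on stdin and prints those whose trailing ordinal is at or above the new replica count. orphaned_pvcs is a hypothetical helper, and it assumes every PVC name passed in ends with a pod ordinal.

```shell
#!/bin/sh
# Read PVC names on stdin; print those whose trailing ordinal is >= replicas.
# orphaned_pvcs is a hypothetical helper, not part of the migration scripts.
orphaned_pvcs() (
  replicas="$1"
  awk -v n="$replicas" '{
    ordinal = $1
    sub(/^.*-/, "", ordinal)   # keep only the text after the last "-"
    if (ordinal ~ /^[0-9]+$/ && ordinal + 0 >= n + 0) print $1
  }'
)
```

Example: kubectl get pvc -n coditect-app -l app=coditect-combined -o name | orphaned_pvcs 2

The function only prints candidates; reviewing the list before running kubectl delete on any of them is the safer workflow.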
Post-Migration Checklist
- StatefulSet has 3/3 pods running
- 3 PVCs created and bound
- Ingress updated with session affinity
- BackendConfig applied
- Persistence test passed (file survives pod restart)
- Session affinity test passed (users stick to same pod)
- Manual user test passed (login → create file → logout → login → file exists)
- Health checks passing (/health endpoint)
- WebSocket connections working (Socket.IO green indicator)
- No errors in pod logs
- Old Deployment deleted
- Monitoring configured (optional)
Success Criteria
Before Migration:
- User data lost on pod restart
- User data lost on deployment update
- Users routed to random pods
After Migration:
- User data persists across pod restarts
- User data persists across deployments
- Users routed to same pod (session affinity)
- High availability (3 replicas)
- Automatic storage provisioning
Last Updated: 2025-10-26
Status: Ready for production migration
Estimated Completion: 2-4 hours