
StatefulSet Migration Guide - Production Deployment

Date: 2025-10-26
Objective: Migrate from ephemeral Deployment to persistent StatefulSet
Estimated Time: 2-4 hours
Downtime: ~2-5 minutes during migration


🎯 Overview​

This guide migrates the Coditect combined service (Frontend + Theia) from a stateless Deployment to a StatefulSet with persistent storage. This ensures:

  • ✅ User data persists across pod restarts
  • ✅ Session affinity routes users to the same pod
  • ✅ High availability with 3 replicas
  • ✅ Automatic PVC creation per pod
  • ✅ Graceful shutdown with connection draining


Architecture Changes

Before (Current State)​

Deployment: coditect-combined-v5 (3 replicas)
└── Pods: Random names, ephemeral storage
    ├── coditect-combined-v5-abc123 ❌ No storage
    ├── coditect-combined-v5-def456 ❌ No storage
    └── coditect-combined-v5-ghi789 ❌ No storage

Service: Round-robin load balancing
└── Users routed to random pods ❌

After (Target State)​

StatefulSet: coditect-combined (3 replicas)
└── Pods: Stable names, persistent storage
    ├── coditect-combined-0 ✅ workspace-coditect-combined-0 (50GB)
    ├── coditect-combined-1 ✅ workspace-coditect-combined-1 (50GB)
    └── coditect-combined-2 ✅ workspace-coditect-combined-2 (50GB)

Service: Session affinity (ClientIP + Cookies)
└── Users stick to same pod ✅

🔧 Components Created

1. StatefulSet (k8s/theia-statefulset.yaml)​

Key Features:

  • 3 replicas with stable pod names (coditect-combined-0, 1, 2)
  • Automatic PVC creation via volumeClaimTemplates
  • 2 volumes per pod:
    • /workspace (50GB) - User files, code, projects
    • /home/theia/.theia (5GB) - Theia config, settings
  • Graceful shutdown: 120s termination grace period
  • Environment variables: POD_NAME, POD_IP injected for debugging

PVC Template:

volumeClaimTemplates:
- metadata:
    name: workspace
  spec:
    accessModes: [ "ReadWriteOnce" ]
    storageClassName: standard-rwo
    resources:
      requests:
        storage: 50Gi
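Kubernetes names each claim generated from a volumeClaimTemplate as <template-name>-<statefulset-name>-<ordinal>, which is where names such as workspace-coditect-combined-0 come from. The sketch below lists the names to expect for a given replica count; expected_pvcs is an illustrative helper, not part of the migration scripts.

```shell
#!/bin/sh
# Print the PVC names a StatefulSet is expected to create for one
# volumeClaimTemplate: <template>-<statefulset>-<ordinal>.
# expected_pvcs is an illustrative helper name (an assumption).
expected_pvcs() {
  template="$1"; sts="$2"; replicas="$3"
  i=0
  while [ "$i" -lt "$replicas" ]; do
    echo "${template}-${sts}-${i}"
    i=$((i + 1))
  done
}
```

For example, expected_pvcs workspace coditect-combined 3 prints the three claim names shown in the target-state diagram, which can be compared against the output of kubectl get pvc.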

2. Headless Service (theia-headless)​

Purpose: Required for StatefulSet DNS resolution

DNS Pattern: <pod-name>.<service-name>.<namespace>.svc.cluster.local

Examples:

  • coditect-combined-0.theia-headless.coditect-app.svc.cluster.local
  • coditect-combined-1.theia-headless.coditect-app.svc.cluster.local
  • coditect-combined-2.theia-headless.coditect-app.svc.cluster.local
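The DNS pattern above can be assembled mechanically from the StatefulSet name, the pod ordinal, the headless Service name, and the namespace. A minimal sketch, assuming the default cluster domain cluster.local; pod_fqdn is a hypothetical helper name.

```shell
#!/bin/sh
# Build the stable per-pod DNS name provided by a headless Service:
# <pod-name>.<service-name>.<namespace>.svc.cluster.local
# pod_fqdn is an illustrative helper; cluster.local is the default cluster
# domain and may differ on clusters configured otherwise.
pod_fqdn() {
  sts="$1"; ordinal="$2"; svc="$3"; ns="$4"
  echo "${sts}-${ordinal}.${svc}.${ns}.svc.cluster.local"
}
```

Calling pod_fqdn coditect-combined 0 theia-headless coditect-app yields the first name in the list above. These names resolve only from inside the cluster.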

3. Load Balancer Service (coditect-combined-service)​

Features:

  • Session affinity: ClientIP + 3-hour timeout
  • BackendConfig annotation: Links to session affinity config
  • NEG annotation: GKE network endpoint groups

Config:

sessionAffinity: ClientIP
sessionAffinityConfig:
  clientIP:
    timeoutSeconds: 10800 # 3 hours
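One way to confirm the Service carries these settings after it is applied is to read the two fields back with kubectl's jsonpath output and compare them. This is a sketch only; the helper names affinity_matches and check_service_affinity are assumptions, and the second function needs cluster access.

```shell
#!/bin/sh
# Compare a Service's affinity settings against the values in this guide.
# Arguments are the two fields as printed by kubectl's jsonpath output.
affinity_matches() {
  affinity="$1"; timeout="$2"
  [ "$affinity" = "ClientIP" ] && [ "$timeout" = "10800" ]
}

# Fetch the live values and check them (requires cluster access).
check_service_affinity() {
  svc="${1:-coditect-combined-service}"; ns="${2:-coditect-app}"
  affinity="$(kubectl get service "$svc" -n "$ns" \
    -o jsonpath='{.spec.sessionAffinity}')" || return 1
  timeout="$(kubectl get service "$svc" -n "$ns" \
    -o jsonpath='{.spec.sessionAffinityConfig.clientIP.timeoutSeconds}')" || return 1
  affinity_matches "$affinity" "$timeout"
}
```

Running check_service_affinity with no arguments uses the Service name and namespace from this guide and returns non-zero if either field differs.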

4. BackendConfig (backend-config-stateful.yaml)​

Session Affinity:

  • Type: CLIENT_IP
  • Cookie TTL: 3 hours (takes effect only with GENERATED_COOKIE affinity; CLIENT_IP affinity does not use a cookie)
  • Connection draining: 120s

Timeouts:

  • Backend timeout: 3600s (1 hour) for WebSocket
  • Health check: 10s interval, 5s timeout

Health Checks:

  • Path: /health
  • Healthy threshold: 2 checks
  • Unhealthy threshold: 3 checks

5. Ingress (ingress-stateful.yaml)​

WebSocket Support:

nginx.ingress.kubernetes.io/websocket-services: "coditect-combined-service"

Session Affinity:

nginx.ingress.kubernetes.io/affinity: "cookie"
nginx.ingress.kubernetes.io/affinity-mode: "persistent"
nginx.ingress.kubernetes.io/session-cookie-name: "coditect-affinity"
nginx.ingress.kubernetes.io/session-cookie-max-age: "10800" # 3 hours

Timeouts (for long WebSocket connections):

nginx.ingress.kubernetes.io/proxy-connect-timeout: "3600"
nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"

🚀 Migration Procedure

Prerequisites​

  1. GKE cluster access:
     gcloud container clusters get-credentials codi-poc-e2-cluster --zone=us-central1-a
  2. Namespace access:
     kubectl config set-context --current --namespace=coditect-app
  3. Back up current state (optional but recommended; note that kubectl get all does not include Ingress or BackendConfig objects, so export those separately if needed):
     kubectl get all -n coditect-app -o yaml > backup-$(date +%Y%m%d-%H%M%S).yaml
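Before starting, it can help to confirm that the CLI tools used throughout this guide are installed. The following is a small preflight sketch; require_cmds is a hypothetical helper and is not taken from the migration script.

```shell
#!/bin/sh
# Report any required CLI tools that are missing from PATH.
# Returns 0 when all are present, 1 otherwise.
# require_cmds is an illustrative helper name (an assumption).
require_cmds() {
  missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A typical call would be require_cmds gcloud kubectl, optionally followed by kubectl auth can-i delete deployments -n coditect-app to confirm the current account is allowed to perform the destructive step.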

Step-by-Step Migration​

Option A: Automated Migration (Script)

# Run the migration script
./scripts/migrate-to-statefulset.sh

What it does:

  1. ✅ Verifies GKE access
  2. ✅ Creates BackendConfig
  3. ✅ Deletes old Deployment (asks for confirmation)
  4. ✅ Creates StatefulSet with PVCs
  5. ✅ Updates Ingress
  6. ✅ Waits for rollout
  7. ✅ Runs basic persistence test
  8. ✅ Displays final status

Expected output:

Migration Complete!
StatefulSet Status: 3/3 pods ready
Pods: coditect-combined-0, coditect-combined-1, coditect-combined-2
PVCs: workspace-coditect-combined-0 (Bound), workspace-coditect-combined-1 (Bound), workspace-coditect-combined-2 (Bound)

Option B: Manual Migration​

# 1. Create BackendConfig
kubectl apply -f k8s/backend-config-stateful.yaml

# 2. Delete old Deployment
kubectl delete deployment coditect-combined-v5 -n coditect-app
# OR
kubectl delete deployment coditect-combined -n coditect-app

# 3. Create StatefulSet
kubectl apply -f k8s/theia-statefulset.yaml

# 4. Wait for pods to be ready
kubectl rollout status statefulset/coditect-combined -n coditect-app

# 5. Verify PVCs created
kubectl get pvc -n coditect-app

# 6. Update Ingress
kubectl apply -f k8s/ingress-stateful.yaml

# 7. Verify all components
kubectl get all -n coditect-app -l app=coditect-combined
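Step 5 above relies on reading the PVC list by eye. A scripted version might look like the sketch below, which fails if any claim is not yet Bound (or if no claims exist at all). The helper names all_pvcs_bound and check_pvcs are assumptions; only check_pvcs needs cluster access.

```shell
#!/bin/sh
# Read "name phase" pairs on stdin and fail if any claim is not Bound,
# or if no claims were listed. Prints each claim that is still waiting.
all_pvcs_bound() {
  awk '
    NF >= 2 { seen = 1 }
    NF >= 2 && $2 != "Bound" { print "not bound: " $1 " (" $2 ")"; bad = 1 }
    END { if (bad || !seen) exit 1 }
  '
}

# Feed it live data (requires cluster access).
check_pvcs() {
  ns="${1:-coditect-app}"
  kubectl get pvc -n "$ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.phase}{"\n"}{end}' |
    all_pvcs_bound
}
```

check_pvcs can be re-run until it succeeds, which also covers the case where disk provisioning takes a few minutes.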

🧪 Testing & Validation

Test 1: Persistence Test (Critical)​

./scripts/test-persistence.sh

What it tests:

  1. ✅ Creates file in pod's /workspace
  2. ✅ Verifies file exists
  3. ✅ Deletes pod (simulates crash)
  4. ✅ Verifies file persists in new pod
  5. ✅ Checks PVCs are mounted correctly

Expected result: File persists across pod restart ✅
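The contents of scripts/test-persistence.sh are not reproduced in this guide. As a rough illustration of the steps listed above, a check along these lines could be used; the marker file path, retry count, and the function name persistence_check are all assumptions rather than the script's actual implementation.

```shell
#!/bin/sh
# Sketch of a persistence check: write a marker, delete the pod, wait for
# the StatefulSet to recreate it, then read the marker back.
# The marker path and retry settings are illustrative choices.
persistence_check() {
  pod="$1"; ns="$2"; marker="$3"
  file="/workspace/.persistence-test"

  # 1. Write the marker into the PVC-backed directory.
  kubectl exec "$pod" -n "$ns" -- \
    sh -c 'printf "%s\n" "$1" > "$2"' sh "$marker" "$file" || return 1

  # 2. Delete the pod; the StatefulSet recreates it under the same name.
  kubectl delete pod "$pod" -n "$ns" || return 1

  # 3. Retry until the recreated pod can serve the file (up to ~5 minutes).
  found=""
  attempts=0
  while [ "$attempts" -lt 30 ]; do
    found="$(kubectl exec "$pod" -n "$ns" -- cat "$file" 2>/dev/null)" && break
    attempts=$((attempts + 1))
    sleep 10
  done

  # 4. The marker must match what was written before the restart.
  [ "$found" = "$marker" ]
}
```

An invocation such as persistence_check coditect-combined-0 coditect-app "test-$(date +%s)" returns zero only if the file written before the pod deletion is readable with the same content afterwards.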

Test 2: Session Affinity Test​

./scripts/test-session-affinity.sh

What it tests:

  1. ✅ Ingress cookie-based affinity configured
  2. ✅ Service ClientIP affinity configured
  3. ✅ BackendConfig session affinity configured
  4. ✅ Simulates multiple requests with cookies
  5. ✅ Verifies pod distribution

Expected result: Session affinity working ✅

Test 3: Multi-User Scenario (Manual)​

User 1:

1. Open: https://coditect.ai/theia
2. Create file: /workspace/user1-file.txt
3. Note pod name from logs/DevTools
4. Logout
5. Login again
6. Verify: Same pod + file still exists ✅

User 2 (different browser/IP):

1. Open: https://coditect.ai/theia
2. Create file: /workspace/user2-file.txt
3. Note pod name (may be different from User 1)
4. Verify: User 1's file NOT visible (separate workspaces) ✅

Expected behavior:

  • Each user routed to specific pod (session affinity) ✅
  • User's files persist on their pod ✅
  • Users on different pods have separate workspaces ✅

User → Pod Routing Strategy

How Session Affinity Works​

Initial Assignment (first visit):

1. User visits https://coditect.ai/theia
2. Ingress routes to any available pod (round-robin)
3. Ingress sets cookie: coditect-affinity=<hash>
4. Pod serves the Theia IDE
5. User creates files in /workspace (on this pod)

Subsequent Visits (with cookie):

1. User visits again with cookie
2. Ingress reads cookie: coditect-affinity=<hash>
3. Ingress routes to SAME pod (sticky session)
4. User sees same /workspace files ✅
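The cookie exchange described above can be observed from the command line by inspecting response headers. The sketch below extracts the affinity cookie value from headers read on standard input; affinity_cookie is a hypothetical helper, and the cookie name defaults to the coditect-affinity name configured on the Ingress.

```shell
#!/bin/sh
# Extract the value of a named cookie from HTTP response headers on stdin.
# Header names are matched case-insensitively; prints nothing if absent.
# affinity_cookie is an illustrative helper name (an assumption).
affinity_cookie() {
  name="${1:-coditect-affinity}"
  awk -v name="$name" '
    tolower($0) ~ /^set-cookie:/ {
      sub(/^[^:]*:[ \t]*/, "")        # drop the header name
      split($0, parts, ";")           # first part is name=value
      if (index(parts[1], name "=") == 1) {
        value = substr(parts[1], length(name) + 2)
        sub(/\r$/, "", value)         # strip a trailing carriage return
        print value
        exit
      }
    }
  '
}
```

For instance, curl -s -o /dev/null -D - https://coditect.ai/theia | affinity_cookie should print a value on a first visit if cookie affinity is active. Sending that value back with curl's -b option on later requests mimics a returning browser.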

Session Affinity Layers:

  1. Ingress Level: Cookie-based (nginx.ingress.kubernetes.io/affinity: "cookie")
  2. Service Level: ClientIP (sessionAffinity: ClientIP)
  3. BackendConfig Level: CLIENT_IP (affinityType: "CLIENT_IP")

Fallback Strategy:

  • If cookie expires (3 hours): Re-assign to random pod
  • If pod is down: Route to healthy pod (user loses session data ⚠️)
  • If user clears cookies: New assignment (fresh workspace ⚠️)

Improvement (Future):

  • Store user→pod mapping in Redis or database
  • API endpoint to return assigned pod for user
  • Frontend checks user's assigned pod before loading Theia
  • Explicit pod routing via query param: /theia?pod=0
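To make the explicit-routing idea concrete, here is one possible way to derive a pod ordinal for a user. It is deliberately not the stored mapping the list describes: it substitutes a stateless hash of the user ID (POSIX cksum), so it needs no Redis or database, at the cost of reassigning most users whenever the replica count changes. The function name pod_for_user is an assumption.

```shell
#!/bin/sh
# Map a user ID to a pod ordinal by hashing (CRC via POSIX cksum).
# This is a stateless stand-in for a stored user-to-pod mapping; changing
# the replica count reassigns most users, which a stored mapping avoids.
pod_for_user() {
  user="$1"; replicas="$2"
  crc="$(printf '%s' "$user" | cksum | cut -d ' ' -f 1)"
  echo $((crc % replicas))
}
```

The result could feed the query parameter mentioned above, for example /theia?pod=$(pod_for_user alice 3). The same user always maps to the same ordinal as long as the replica count stays fixed.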

📊 Monitoring & Troubleshooting

Check StatefulSet Status​

# StatefulSet
kubectl get statefulset coditect-combined -n coditect-app

# Pods
kubectl get pods -n coditect-app -l app=coditect-combined -o wide

# PVCs
kubectl get pvc -n coditect-app

# Events
kubectl get events -n coditect-app --sort-by='.lastTimestamp'
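For use in scripts or alerts, the ready count can be reduced to a pass/fail check instead of being read from the table output. A minimal sketch, with sts_ready and check_sts as hypothetical helper names; check_sts needs cluster access.

```shell
#!/bin/sh
# Return 0 when a "ready/desired" string such as "3/3" shows every
# replica ready. An empty ready count (no pods ready yet) counts as 0.
sts_ready() {
  ready="${1%%/*}"; desired="${1##*/}"
  [ -n "$desired" ] && [ "$desired" -gt 0 ] && [ "${ready:-0}" -eq "$desired" ]
}

# Query the live StatefulSet and evaluate it (requires cluster access).
check_sts() {
  sts="${1:-coditect-combined}"; ns="${2:-coditect-app}"
  sts_ready "$(kubectl get statefulset "$sts" -n "$ns" \
    -o jsonpath='{.status.readyReplicas}/{.spec.replicas}')"
}
```

check_sts returns zero only when the number of ready replicas equals the desired count, matching the "3/3 pods ready" line in the expected output.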

Check Pod Logs​

# Specific pod
kubectl logs coditect-combined-0 -n coditect-app

# Follow logs
kubectl logs -f coditect-combined-0 -n coditect-app

# All pods
kubectl logs -n coditect-app -l app=coditect-combined --tail=20

Check PVC Mounting​

# Exec into pod
kubectl exec -it coditect-combined-0 -n coditect-app -- bash

# Inside pod:
df -h /workspace
ls -la /workspace

Common Issues​

Issue 1: Pods stuck in Pending

  • Cause: PVC provisioning slow
  • Check: kubectl describe pod coditect-combined-0 -n coditect-app
  • Solution: Wait 2-5 minutes for GCE PD to provision

Issue 2: Old data not migrated

  • Cause: Old Deployment had no PVCs
  • Solution: Data cannot be migrated (was ephemeral)
  • Workaround: Users must recreate files

Issue 3: Session affinity not working

  • Cause: Ingress annotations not applied
  • Check: kubectl get ingress coditect-ingress -n coditect-app -o yaml
  • Solution: Re-apply ingress: kubectl apply -f k8s/ingress-stateful.yaml

Issue 4: Different pod on each visit

  • Cause: Cookies not being set/read
  • Check: Browser DevTools → Application → Cookies → coditect-affinity
  • Solution: Verify Ingress affinity annotations, check HTTPS enabled

🔄 Rollback Procedure

If issues occur, rollback to Deployment:

# 1. Delete StatefulSet (keeps PVCs by default)
kubectl delete statefulset coditect-combined -n coditect-app

# 2. Re-create old Deployment
kubectl apply -f k8s/k8s-combined-deployment.yaml

# 3. Wait for rollout
kubectl rollout status deployment/coditect-combined-v5 -n coditect-app

Note: PVCs remain after rollback. Can be deleted manually if needed:

kubectl delete pvc -n coditect-app -l app=coditect-combined

📈 Scaling StatefulSet

Scale up:

kubectl scale statefulset coditect-combined --replicas=5 -n coditect-app

What happens:

  • New pods created: coditect-combined-3, coditect-combined-4
  • New PVCs auto-created for each new pod
  • New users are load-balanced across all 5 pods; existing users stay pinned to their current pod by session affinity

Scale down:

kubectl scale statefulset coditect-combined --replicas=2 -n coditect-app

What happens:

  • Pod coditect-combined-2 deleted
  • PVC workspace-coditect-combined-2 retained (manual deletion required)
  • Users on pod-2 reconnect to pod-0 or pod-1 (lose session ⚠️)
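Because claims are retained after a scale-down, it is useful to list which ones no longer have a matching pod before deciding whether to delete them. The sketch below filters PVC names by their trailing ordinal; orphaned_pvcs is an illustrative helper name, and it assumes the names on standard input all belong to this StatefulSet.

```shell
#!/bin/sh
# Read PVC names on stdin and print those whose trailing ordinal is
# >= the current replica count, i.e. claims left behind by a scale-down.
# orphaned_pvcs is an illustrative helper name (an assumption).
orphaned_pvcs() {
  replicas="$1"
  while IFS= read -r pvc; do
    ordinal="${pvc##*-}"
    case "$ordinal" in
      ''|*[!0-9]*) continue ;;   # skip names without a numeric suffix
    esac
    [ "$ordinal" -ge "$replicas" ] && echo "$pvc"
  done
  return 0
}
```

Piping the output of kubectl get pvc -n coditect-app -l app=coditect-combined -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' into orphaned_pvcs 2 would list workspace-coditect-combined-2 after the scale-down above. Review the list before deleting anything, since removing a claim discards that pod's user files.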

📋 Post-Migration Checklist

  • StatefulSet has 3/3 pods running
  • 3 PVCs created and bound
  • Ingress updated with session affinity
  • BackendConfig applied
  • Persistence test passed (file survives pod restart)
  • Session affinity test passed (users stick to same pod)
  • Manual user test passed (login → create file → logout → login → file exists)
  • Health checks passing (/health endpoint)
  • WebSocket connections working (Socket.IO green indicator)
  • No errors in pod logs
  • Old Deployment deleted
  • Monitoring configured (optional)

🎯 Success Criteria​

Before Migration:

  • ❌ User data lost on pod restart
  • ❌ User data lost on deployment update
  • ❌ Users routed to random pods

After Migration:

  • ✅ User data persists across pod restarts
  • ✅ User data persists across deployments
  • ✅ Users routed to same pod (session affinity)
  • ✅ High availability (3 replicas)
  • ✅ Automatic storage provisioning


Last Updated: 2025-10-26
Status: ✅ Ready for production migration
Estimated Completion: 2-4 hours