Hybrid Storage Migration - Standard → Hybrid Cutover
Date: 2025-10-29
Type: Zero-Downtime Production Migration
Estimated Duration: 10-15 minutes
Risk Level: ✅ LOW (proven in testing, simple rollback)
Executive Summary
Goal: Switch production traffic from standard deployment (50GB PVCs) to hybrid deployment (10GB PVCs) for immediate cost savings and better resource utilization.
Key Benefits:
- 💰 Cost Savings: $24/month immediate savings (73% storage reduction)
- 🚀 Performance: Faster pod startup (less PVC data)
- 📦 Architecture: System tools in Docker image (shared), user files in PVCs
- ✅ Low Risk: Hybrid proven stable (26h uptime), same image as production once Phase 2 completes, simple rollback
No Data Loss: Fresh migration, no existing user data to preserve
Pre-Migration Status
Current Production (Standard)
StatefulSet: coditect-combined
Pods: 3 (2/3 ready, 1 terminating)
Image: fe55d53d-43b0-4d39-b5b6-4da0d5c7363e
Storage: 50GB workspace + 5GB config = 55GB per pod (165GB total)
Traffic: ✅ Receiving production traffic via coditect-combined-service
Hybrid Testing Deployment
StatefulSet: coditect-combined-hybrid
Pods: 3/3 ready
Image: 904176d4-4627-4fb4-8ba6-e57e9e4028fd (older but stable)
Storage: 10GB workspace + 5GB config = 15GB per pod (45GB total)
Traffic: ❌ No traffic (isolated testing)
Uptime: 26 hours without issues
Ingress Configuration
Current Routing:
coditect.ai → Ingress → coditect-combined-service → coditect-combined-0/1/2
Target Routing:
coditect.ai → Ingress → coditect-combined-service-hybrid → coditect-combined-hybrid-0/1/2
Migration Plan
Phase 1: Update Cloud Build Configuration (3 min)
Change cloudbuild to deploy to hybrid:
File: cloudbuild-combined.yaml
Changes:
- Line 57: k8s/theia-statefulset.yaml → k8s/theia-statefulset-hybrid.yaml
- Line 69: statefulset/coditect-combined → statefulset/coditect-combined-hybrid
- Line 83: statefulset/coditect-combined → statefulset/coditect-combined-hybrid
Verification:
git diff cloudbuild-combined.yaml
git add cloudbuild-combined.yaml
git commit -m "build: Switch to hybrid storage deployment"
git push origin main
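The three changes listed above are plain string substitutions, so they can be scripted instead of edited by hand. The following is a minimal sketch, assuming the manifest path and the `statefulset/coditect-combined` resource name are the only references to the standard deployment in cloudbuild-combined.yaml. The function name `switch_to_hybrid` is hypothetical.

```shell
# Rewrite standard-deployment references to their hybrid equivalents.
# Reads text on stdin and writes the result to stdout. Safe to run twice:
# names that already end in -hybrid are left unchanged.
switch_to_hybrid() {
  sed -E \
    -e 's#k8s/theia-statefulset\.yaml#k8s/theia-statefulset-hybrid.yaml#g' \
    -e 's#(statefulset/coditect-combined)([^-[:alnum:]]|$)#\1-hybrid\2#g'
}

# Example (review the diff before committing):
#   switch_to_hybrid < cloudbuild-combined.yaml > cloudbuild-combined.yaml.new
#   diff cloudbuild-combined.yaml cloudbuild-combined.yaml.new
```

Writing to a separate file first keeps the original intact until the diff has been reviewed.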
Phase 2: Update Hybrid Pods to Latest Image (5 min)
Update hybrid StatefulSet to use latest production image:
# Get current production image
PROD_IMAGE=$(kubectl get statefulset coditect-combined -n coditect-app -o jsonpath='{.spec.template.spec.containers[0].image}')
echo "Production image: $PROD_IMAGE"
# Update hybrid to same image
kubectl set image statefulset/coditect-combined-hybrid \
  combined="$PROD_IMAGE" \
  -n coditect-app
# Wait for rollout
kubectl rollout status statefulset/coditect-combined-hybrid -n coditect-app --timeout=5m
Expected Output:
statefulset rolling update complete 3 pods at revision coditect-combined-hybrid-xxxxx
Verification:
# Verify all pods running
kubectl get pods -n coditect-app -l app=coditect-combined-hybrid
# Should show all 3 pods Running with READY 1/1
NAME READY STATUS RESTARTS AGE
coditect-combined-hybrid-0 1/1 Running 0 2m
coditect-combined-hybrid-1 1/1 Running 0 2m
coditect-combined-hybrid-2 1/1 Running 0 2m
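The readiness check above can be automated rather than read by eye. This is a sketch only, assuming the default `kubectl get pods` column layout shown in the sample output (NAME, READY, STATUS, ...). The function name `all_pods_ready` is hypothetical.

```shell
# Succeed only if every pod in `kubectl get pods` output (read from stdin)
# is Running with all of its containers ready (READY column n/n).
all_pods_ready() {
  awk '
    NR > 1 {                      # skip the header row
      pods++
      split($2, ready, "/")       # READY column, e.g. "1/1"
      if (ready[1] != ready[2] || $3 != "Running") bad++
    }
    END { if (pods == 0 || bad > 0) exit 1 }
  '
}

# Example:
#   kubectl get pods -n coditect-app -l app=coditect-combined-hybrid \
#     | all_pods_ready && echo "hybrid pods ready"
```

An empty pod list counts as a failure, so a wrong label selector does not pass silently.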
Phase 3: Update Ingress Routing (2 min)
Switch Ingress to route to hybrid service:
File: k8s/ingress.yaml (or wherever Ingress is defined)
Change:
# BEFORE
backend:
  service:
    name: coditect-combined-service
    port:
      number: 80
# AFTER
backend:
  service:
    name: coditect-combined-service-hybrid
    port:
      number: 80
Apply:
# Update Ingress
kubectl apply -f k8s/ingress.yaml
# Wait for backend update (30-60 seconds)
sleep 60
# Verify backends healthy
kubectl describe ingress coditect-production-ingress -n coditect-app | grep -A 10 "Backends"
Expected Output:
Backends:
k8s1-...-coditect-combined-service-hybrid-80-...: HEALTHY
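A fixed `sleep 60` either waits longer than needed or not long enough. One alternative is to poll until the backend reports healthy. The sketch below is a generic retry helper. The names `wait_until` and `backends_healthy` are hypothetical, and the grep pattern assumes the output format shown under Expected Output above.

```shell
# Run a command repeatedly until it succeeds or the timeout expires.
# Usage: wait_until <timeout-seconds> <interval-seconds> <command> [args...]
wait_until() {
  _wu_timeout=$1
  _wu_interval=$2
  shift 2
  _wu_elapsed=0
  while :; do
    if "$@"; then
      return 0                    # condition met
    fi
    if [ "$_wu_elapsed" -ge "$_wu_timeout" ]; then
      return 1                    # gave up after the timeout
    fi
    sleep "$_wu_interval"
    _wu_elapsed=$(( _wu_elapsed + _wu_interval ))
  done
}

# Example: poll every 5 s for up to 120 s instead of a fixed sleep.
# [^N] keeps UNHEALTHY from matching.
#   backends_healthy() {
#     kubectl describe ingress coditect-production-ingress -n coditect-app \
#       | grep -q 'coditect-combined-service-hybrid.*[^N]HEALTHY'
#   }
#   wait_until 120 5 backends_healthy || echo "backend not healthy, consider rollback"
```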
Phase 4: Verify Production Traffic (2 min)
Test production URLs:
# Test frontend
curl -I https://coditect.ai
# Expected: HTTP/2 200
# Test Theia IDE
curl -I https://coditect.ai/theia
# Expected: HTTP/2 200
# Test V5 API
curl https://api.coditect.ai/health
# Expected: {"status":"ok"}
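# Check WebSocket support (Upgrade needs HTTP/1.1 and a GET handshake, so -I will not work)
curl -i -N --http1.1 --max-time 5 \
  -H "Connection: Upgrade" -H "Upgrade: websocket" \
  -H "Sec-WebSocket-Version: 13" -H "Sec-WebSocket-Key: $(openssl rand -base64 16)" \
  https://coditect.ai
# Expected on a WebSocket-enabled path: HTTP/1.1 101 Switching Protocols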
Verify pods receiving traffic:
# Check pod logs for incoming requests
kubectl logs -f coditect-combined-hybrid-0 -n coditect-app --tail=50
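The URL checks above can be turned into pass/fail assertions so a failed check is not missed in the output. This is a sketch under the assumption that the first header line has the usual `HTTP/<version> <code>` shape. The function names are hypothetical.

```shell
# Print the status code from the first line of `curl -I` output (stdin).
# Carriage returns are stripped because HTTP headers end in CRLF.
http_status() {
  tr -d '\r' | awk 'NR == 1 { print $2 }'
}

# Succeed if the response headers on stdin report the expected status code.
expect_status() {
  [ "$(http_status)" = "$1" ]
}

# Example:
#   curl -sI https://coditect.ai       | expect_status 200 || echo "frontend check failed"
#   curl -sI https://coditect.ai/theia | expect_status 200 || echo "theia check failed"
```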
Phase 5: Cleanup Old Standard Deployment (Optional - Later)
⚠️ DO NOT run immediately - wait 24-48 hours to ensure stability
# Scale down standard deployment (keep PVCs)
kubectl scale statefulset/coditect-combined --replicas=0 -n coditect-app
# After 48 hours of stable hybrid operation:
# Delete standard StatefulSet (keeps PVCs for recovery)
kubectl delete statefulset/coditect-combined -n coditect-app
# Delete standard service
kubectl delete service/coditect-combined-service -n coditect-app
# After 7 days, delete PVCs (point of no return)
kubectl delete pvc workspace-coditect-combined-0 -n coditect-app
kubectl delete pvc workspace-coditect-combined-1 -n coditect-app
kubectl delete pvc workspace-coditect-combined-2 -n coditect-app
kubectl delete pvc theia-config-coditect-combined-0 -n coditect-app
kubectl delete pvc theia-config-coditect-combined-1 -n coditect-app
kubectl delete pvc theia-config-coditect-combined-2 -n coditect-app
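The six PVC names follow the StatefulSet naming pattern `<claim-template>-<statefulset>-<ordinal>`, so the list can be generated rather than typed. This is a sketch with a hypothetical helper name. The claim template names `workspace` and `theia-config` are taken from the commands above.

```shell
# Print the PVC names a StatefulSet creates, one per line.
# Usage: statefulset_pvc_names <statefulset> <replicas> <claim-template>...
# The body runs in a subshell so its variables do not leak to the caller.
statefulset_pvc_names() (
  sts=$1
  replicas=$2
  shift 2
  for claim in "$@"; do
    i=0
    while [ "$i" -lt "$replicas" ]; do
      echo "${claim}-${sts}-${i}"
      i=$(( i + 1 ))
    done
  done
)

# Example: preview the deletion first, then drop --dry-run=client to execute.
#   statefulset_pvc_names coditect-combined 3 workspace theia-config \
#     | xargs kubectl delete pvc -n coditect-app --dry-run=client
```

Previewing with a client-side dry run gives one last chance to confirm that no hybrid PVC is in the list before the point of no return.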
Rollback Plan
If issues occur, roll back in <2 minutes:
Quick Rollback (Ingress Only)
# Switch Ingress back to standard service
kubectl patch ingress coditect-production-ingress -n coditect-app --type='json' \
-p='[{"op": "replace", "path": "/spec/rules/0/http/paths/0/backend/service/name", "value": "coditect-combined-service"}]'
# Verify
kubectl describe ingress coditect-production-ingress -n coditect-app
Downtime: <30 seconds (Ingress update propagation)
Full Rollback (Scale Up Standard)
# If standard was scaled down
kubectl scale statefulset/coditect-combined --replicas=3 -n coditect-app
# Wait for pods ready
kubectl rollout status statefulset/coditect-combined -n coditect-app
# Switch Ingress back
kubectl patch ingress coditect-production-ingress -n coditect-app --type='json' \
-p='[{"op": "replace", "path": "/spec/rules/0/http/paths/0/backend/service/name", "value": "coditect-combined-service"}]'
Downtime: 2-3 minutes (pod startup + Ingress switch)
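Both rollback paths use the same JSON patch and differ only in when it is applied. A small helper can build the patch for either direction of the switch. This is a sketch with a hypothetical function name, and like the commands above it assumes the backend sits at rule 0, path 0 of the Ingress.

```shell
# Print a JSON patch that points the first Ingress rule/path at a service.
# Usage: ingress_backend_patch <service-name>
ingress_backend_patch() {
  printf '[{"op": "replace", "path": "/spec/rules/0/http/paths/0/backend/service/name", "value": "%s"}]\n' "$1"
}

# Example: roll back to the standard service.
#   kubectl patch ingress coditect-production-ingress -n coditect-app --type='json' \
#     -p "$(ingress_backend_patch coditect-combined-service)"
```

Patching the live object does not update k8s/ingress.yaml, so the file in Git needs the same change afterward.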
Verification Checklist
Pre-Migration Checks
- Hybrid pods all running (3/3)
- Hybrid pods healthy (liveness/readiness passing)
- Standard pods status noted (for comparison)
- No active users (MVP, no production users yet)
- Backup of current Ingress config
- Git changes committed
Post-Migration Checks
- Hybrid pods receiving traffic (check logs)
- https://coditect.ai responding (200 OK)
- https://coditect.ai/theia loading Theia IDE
- https://api.coditect.ai/health responding
- WebSocket connections working
- Pod resource usage normal (<1 GB RAM, <1 CPU)
- No errors in pod logs
- Ingress backends show HEALTHY
- Session affinity working (sticky sessions)
24-Hour Post-Migration
- No pod restarts
- Memory usage stable
- No CrashLoopBackOff
- Logs clean (no errors)
- Performance normal
Migration Timeline
| Step | Task | Duration | Cumulative |
|---|---|---|---|
| 1 | Update cloudbuild-combined.yaml | 3 min | 3 min |
| 2 | Update hybrid pods to latest image | 5 min | 8 min |
| 3 | Update Ingress routing | 2 min | 10 min |
| 4 | Verify production traffic | 2 min | 12 min |
| 5 | Monitor for issues | 3 min | 15 min |
Total: 12-15 minutes
Downtime: <1 minute (Ingress switch only)
Cost Impact
Before Migration
Storage (3 pods):
- workspace PVCs: 3 × 50 GB = 150 GB
- Config PVCs: 3 × 5 GB = 15 GB
- Total: 165 GB × $0.20/GB/month = $33/month
Compute (3 pods):
- Pod resources: 512Mi-2Gi RAM, 500m-2000m CPU
- Estimated: ~$150/month (unchanged)
Total: ~$183/month
After Migration
Storage (3 pods):
- workspace PVCs: 3 × 10 GB = 30 GB
- Config PVCs: 3 × 5 GB = 15 GB
- Total: 45 GB × $0.20/GB/month = $9/month
Compute (3 pods):
- Pod resources: Same (512Mi-2Gi RAM, 500m-2000m CPU)
- Estimated: ~$150/month (unchanged)
Total: ~$159/month
Savings: $24/month ($288/year) = 13% total cost reduction
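The figures above can be reproduced with integer arithmetic. The sketch below works in cents to avoid floating point. It takes the $0.20/GB/month rate and the ~$150/month compute estimate from this section, and the helper names are hypothetical.

```shell
# Monthly storage cost in cents: pods × (workspace GB + config GB) × rate.
# Usage: storage_cost_cents <pods> <workspace-gb> <config-gb> <cents-per-gb>
storage_cost_cents() {
  echo $(( $1 * ($2 + $3) * $4 ))
}

# Percentage of part in whole, rounded to the nearest integer.
pct() {
  echo $(( ($1 * 100 + $2 / 2) / $2 ))
}

before=$(storage_cost_cents 3 50 5 20)   # 3300 cents = $33
after=$(storage_cost_cents 3 10 5 20)    #  900 cents = $9
saved=$(( before - after ))              # 2400 cents = $24
compute=15000                            # ~$150/month, unchanged

echo "Monthly savings: \$$(( saved / 100 ))"                              # $24
echo "Yearly savings: \$$(( saved * 12 / 100 ))"                          # $288
echo "Storage reduction: $(pct "$saved" "$before")%"                      # 73%
echo "Total cost reduction: $(pct "$saved" $(( compute + before )))%"     # 13%
```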
Risk Assessment
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Hybrid pods crash on load | LOW (proven stable 26h) | HIGH | Quick rollback (<2 min) |
| Ingress routing fails | VERY LOW (simple config) | HIGH | Rollback Ingress patch |
| User workspace >10 GB | VERY LOW (MVP, no users) | MEDIUM | Expand PVC if needed |
| Health check failures | LOW (same image) | HIGH | Monitor logs, rollback |
| Session affinity breaks | VERY LOW (same config) | MEDIUM | Verify sticky sessions |
Overall Risk: ✅ LOW
Confidence: HIGH (hybrid proven stable, simple rollback, no user data)
Success Criteria
Migration is successful when:
- ✅ Hybrid pods receiving production traffic
- ✅ https://coditect.ai responding normally
- ✅ Theia IDE loading without errors
- ✅ No pod restarts or crashes
- ✅ Ingress backends showing HEALTHY
- ✅ WebSocket connections working
- ✅ Cost reduced by $24/month
- ✅ No functional degradation
Communication Plan
Before Migration:
- No user communication needed (MVP, no production users)
- Team notification: "Migrating to hybrid storage for cost optimization"
During Migration:
- Monitor Slack channel for any alerts
- Be ready for quick rollback
After Migration:
- Confirm success in team channel
- Update documentation to reflect hybrid as production
- Monitor for 24 hours
Related Documentation
- Architecture Decision: docs/07-adr/ADR-028-PART-1-HYBRID-STORAGE-PROBLEM-analysis.md
- Implementation Plan: docs/07-adr/adr-028-part-2-hybrid-storage-decision-implementation.md
- Kubernetes Configs:
  - Standard: k8s/theia-statefulset.yaml
  - Hybrid: k8s/theia-statefulset-hybrid.yaml
Appendix A: Command Reference
Quick Status Check
# Check all deployments
kubectl get statefulset -n coditect-app
kubectl get pods -n coditect-app -l app=coditect-combined
kubectl get pods -n coditect-app -l app=coditect-combined-hybrid
# Check services
kubectl get svc -n coditect-app | grep combined
# Check Ingress
kubectl get ingress -n coditect-app
kubectl describe ingress coditect-production-ingress -n coditect-app
# Check PVCs
kubectl get pvc -n coditect-app | grep -E "NAME|combined"
Monitor Migration
# Watch pod status (in separate terminal)
watch -n 2 'kubectl get pods -n coditect-app -l app=coditect-combined-hybrid'
# Stream logs
kubectl logs -f coditect-combined-hybrid-0 -n coditect-app
# Check events
kubectl get events -n coditect-app --sort-by='.lastTimestamp' | tail -20
Performance Metrics
# Pod resource usage
kubectl top pods -n coditect-app -l app=coditect-combined-hybrid
# PVC usage
kubectl exec coditect-combined-hybrid-0 -n coditect-app -- df -h /workspace
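With the workspace PVC reduced to 10 GB, it helps to turn the `df` check into a threshold alert. This is a sketch, assuming the container's `df` supports the POSIX `-P` flag, which keeps each filesystem on one line with the percentage in column 5. The function names are hypothetical.

```shell
# Print the used-capacity percentage (digits only) from `df -P` output on stdin.
pvc_usage_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed if usage is at or above the threshold percentage given as $1.
pvc_over_threshold() {
  [ "$(pvc_usage_pct)" -ge "$1" ]
}

# Example: warn when the workspace volume passes 80%.
#   kubectl exec coditect-combined-hybrid-0 -n coditect-app -- df -P /workspace \
#     | pvc_over_threshold 80 && echo "workspace PVC above 80%, consider expanding"
```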
Appendix B: Troubleshooting
Issue: Hybrid pods not receiving traffic
Symptoms: Ingress updated but traffic still going to standard
Solution:
# Check Ingress configuration
kubectl get ingress coditect-production-ingress -n coditect-app -o yaml | grep -A 5 "backend"
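# Nudge the Ingress controller to resync by touching an annotation
# (the key is arbitrary; a metadata change typically triggers a new reconcile)
kubectl annotate ingress coditect-production-ingress -n coditect-app \
  kubectl.kubernetes.io/restartedAt="$(date +%s)" --overwrite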
# Wait 30 seconds for propagation
sleep 30
Issue: Health checks failing
Symptoms: Pods showing 0/1 ready
Solution:
# Check health endpoint
kubectl exec coditect-combined-hybrid-0 -n coditect-app -- curl -I http://localhost:80/health
# Check pod logs for errors
kubectl logs coditect-combined-hybrid-0 -n coditect-app --tail=100
# If NGINX not starting, check start script
kubectl describe pod coditect-combined-hybrid-0 -n coditect-app
Issue: Out of disk space (unlikely)
Symptoms: workspace PVC full at 10 GB
Solution:
# Expand PVC online (requires allowVolumeExpansion: true on the StorageClass)
kubectl patch pvc workspace-coditect-combined-hybrid-0 -n coditect-app \
-p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
# Verify expansion
kubectl get pvc workspace-coditect-combined-hybrid-0 -n coditect-app
Migration Prepared: 2025-10-29
Status: ✅ Ready to Execute
Next Step: Execute Phase 1 (Update cloudbuild-combined.yaml)