Hybrid Storage Migration - Standard → Hybrid Cutover

Date: 2025-10-29
Type: Zero-Downtime Production Migration
Estimated Duration: 10-15 minutes
Risk Level: ✅ LOW (proven in testing, simple rollback)


Executive Summary

Goal: Switch production traffic from standard deployment (50GB PVCs) to hybrid deployment (10GB PVCs) for immediate cost savings and better resource utilization.

Key Benefits:

  • 💰 Cost Savings: $24/month immediate savings (73% storage reduction)
  • 🚀 Performance: Faster pod startup (less PVC data)
  • 📦 Architecture: System tools in Docker image (shared), user files in PVCs
  • ✅ Zero Risk: Same image, proven stable (26h uptime), simple rollback

No Data Loss: Fresh migration, no existing user data to preserve


Pre-Migration Status

Current Production (Standard)

  • StatefulSet: coditect-combined
  • Pods: 3 (2/3 ready, 1 terminating)
  • Image: fe55d53d-43b0-4d39-b5b6-4da0d5c7363e
  • Storage: 50GB workspace + 5GB config = 55GB per pod (165GB total)
  • Traffic: ✅ Receiving production traffic via coditect-combined-service

Hybrid Testing Deployment

  • StatefulSet: coditect-combined-hybrid
  • Pods: 3/3 ready
  • Image: 904176d4-4627-4fb4-8ba6-e57e9e4028fd (older but stable)
  • Storage: 10GB workspace + 5GB config = 15GB per pod (45GB total)
  • Traffic: ❌ No traffic (isolated testing)
  • Uptime: 26 hours without issues

Ingress Configuration

Current Routing:

coditect.ai → Ingress → coditect-combined-service → coditect-combined-0/1/2

Target Routing:

coditect.ai → Ingress → coditect-combined-service-hybrid → coditect-combined-hybrid-0/1/2

Migration Plan

Phase 1: Update Cloud Build Configuration (3 min)

Change cloudbuild to deploy to hybrid:

File: cloudbuild-combined.yaml

Changes:

  1. Line 57: k8s/theia-statefulset.yaml → k8s/theia-statefulset-hybrid.yaml
  2. Line 69: statefulset/coditect-combined → statefulset/coditect-combined-hybrid
  3. Line 83: statefulset/coditect-combined → statefulset/coditect-combined-hybrid

Verification:

git diff cloudbuild-combined.yaml
git add cloudbuild-combined.yaml
git commit -m "build: Switch to hybrid storage deployment"
git push origin main
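The three substitutions can also be scripted with sed instead of edited by hand. A sketch against a stand-in fragment (hypothetical content — the real cloudbuild-combined.yaml has more context around these lines):

```shell
# Stand-in fragment (hypothetical content) to demonstrate the three renames.
cat > /tmp/cloudbuild-demo.yaml <<'EOF'
      - k8s/theia-statefulset.yaml
      - statefulset/coditect-combined
EOF

# Apply the standard → hybrid substitutions in place (GNU sed).
sed -i \
  -e 's|k8s/theia-statefulset\.yaml|k8s/theia-statefulset-hybrid.yaml|' \
  -e 's|statefulset/coditect-combined|statefulset/coditect-combined-hybrid|' \
  /tmp/cloudbuild-demo.yaml

grep -c hybrid /tmp/cloudbuild-demo.yaml   # → 2
```

Running the same two expressions over the real file covers lines 57, 69, and 83 in one pass; review with `git diff` before committing.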

Phase 2: Update Hybrid Pods to Latest Image (5 min)

Update hybrid StatefulSet to use latest production image:

# Get current production image
PROD_IMAGE=$(kubectl get statefulset coditect-combined -n coditect-app -o jsonpath='{.spec.template.spec.containers[0].image}')
echo "Production image: $PROD_IMAGE"

# Update hybrid to same image
kubectl set image statefulset/coditect-combined-hybrid \
combined=$PROD_IMAGE \
-n coditect-app

# Wait for rollout
kubectl rollout status statefulset/coditect-combined-hybrid -n coditect-app --timeout=5m

Expected Output:

statefulset rolling update complete 3 pods at revision coditect-combined-hybrid-xxxxx

Verification:

# Verify all pods running
kubectl get pods -n coditect-app -l app=coditect-combined-hybrid

# Should show 3/3 ready
NAME READY STATUS RESTARTS AGE
coditect-combined-hybrid-0 1/1 Running 0 2m
coditect-combined-hybrid-1 1/1 Running 0 2m
coditect-combined-hybrid-2 1/1 Running 0 2m
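The 3/3-ready check can be automated with a small awk helper (hypothetical, not part of the runbook tooling) that fails unless every pod line reports full readiness. Shown here against sample output; in practice, pipe `kubectl get pods` into it:

```shell
# Exit 0 only if every pod row (header skipped) shows READY n/n.
check_ready() {
  awk 'NR > 1 { split($2, r, "/"); if (r[1] != r[2]) bad++ } END { exit (bad > 0) }'
}

# Demo with sample kubectl output instead of a live cluster.
printf '%s\n' \
  'NAME                         READY   STATUS    RESTARTS   AGE' \
  'coditect-combined-hybrid-0   1/1     Running   0          2m' \
  'coditect-combined-hybrid-1   1/1     Running   0          2m' \
  'coditect-combined-hybrid-2   1/1     Running   0          2m' \
  | check_ready && echo "all pods ready"
# → all pods ready
```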

Phase 3: Update Ingress Routing (2 min)

Switch Ingress to route to hybrid service:

File: k8s/ingress.yaml (or wherever Ingress is defined)

Change:

# BEFORE
backend:
  service:
    name: coditect-combined-service
    port:
      number: 80

# AFTER
backend:
  service:
    name: coditect-combined-service-hybrid
    port:
      number: 80

Apply:

# Update Ingress
kubectl apply -f k8s/ingress.yaml

# Wait for backend update (30-60 seconds)
sleep 60

# Verify backends healthy
kubectl describe ingress coditect-production-ingress -n coditect-app | grep -A 10 "Backends"

Expected Output:

Backends:
k8s1-...-coditect-combined-service-hybrid-80-...: HEALTHY
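Rather than a fixed `sleep 60`, a retry loop can poll until a check passes. A sketch of a generic `wait_for` helper (hypothetical name), demonstrated with a stub check standing in for the `kubectl describe ingress … | grep` pipeline above:

```shell
# Retry a command until it succeeds or attempts run out.
wait_for() {
  attempts=$1; shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    sleep 1   # in production, use a longer interval (e.g. 10s)
    i=$((i + 1))
  done
  return 1
}

# Demo: a stand-in check that succeeds on the 3rd call.
count_file=$(mktemp)
echo 0 > "$count_file"
fake_check() {
  n=$(( $(cat "$count_file") + 1 ))
  echo "$n" > "$count_file"
  [ "$n" -ge 3 ]
}
wait_for 5 fake_check && echo "backend healthy after $(cat "$count_file") checks"
# → backend healthy after 3 checks
```

For the real check, replace `fake_check` with a function wrapping the describe/grep pipeline and pass it to `wait_for`.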

Phase 4: Verify Production Traffic (2 min)

Test production URLs:

# Test frontend
curl -I https://coditect.ai
# Expected: HTTP/2 200

# Test theia IDE
curl -I https://coditect.ai/theia
# Expected: HTTP/2 200

# Test V5 API
curl https://api.coditect.ai/health
# Expected: {"status":"ok"}

# Check WebSocket support
curl -I -H "Connection: Upgrade" -H "Upgrade: websocket" https://coditect.ai
# Expected: Connection: upgrade header in response
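The curl probes above can be wrapped into one smoke test that fails fast on any unexpected status. A sketch with a stubbed `fetch_status` so it runs offline; for real use, swap in the commented curl invocation:

```shell
# Stub so the sketch runs offline; real version of the body:
#   curl -s -o /dev/null -w '%{http_code}' "$1"
fetch_status() { echo 200; }

# Read "url expected-status" pairs from stdin; report PASS/FAIL per line.
smoke_test() {
  fail=0
  while read -r url expected; do
    got=$(fetch_status "$url")
    if [ "$got" = "$expected" ]; then
      echo "PASS $url ($got)"
    else
      echo "FAIL $url (got $got, want $expected)"
      fail=1
    fi
  done
  return $fail
}

smoke_test <<'EOF'
https://coditect.ai 200
https://coditect.ai/theia 200
https://api.coditect.ai/health 200
EOF
```

With the stub, all three lines print PASS; against production, any non-matching status makes the function return non-zero, which can gate the rest of the migration.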

Verify pods receiving traffic:

# Check pod logs for incoming requests
kubectl logs -f coditect-combined-hybrid-0 -n coditect-app --tail=50

Phase 5: Cleanup Old Standard Deployment (Optional - Later)

⚠️ DO NOT run immediately - wait 24-48 hours to ensure stability

# Scale down standard deployment (keep PVCs)
kubectl scale statefulset/coditect-combined --replicas=0 -n coditect-app

# After 48 hours of stable hybrid operation:

# Delete standard StatefulSet (keeps PVCs for recovery)
kubectl delete statefulset/coditect-combined -n coditect-app

# Delete standard service
kubectl delete service/coditect-combined-service -n coditect-app

# After 7 days, delete PVCs (point of no return)
kubectl delete pvc workspace-coditect-combined-0 -n coditect-app
kubectl delete pvc workspace-coditect-combined-1 -n coditect-app
kubectl delete pvc workspace-coditect-combined-2 -n coditect-app
kubectl delete pvc theia-config-coditect-combined-0 -n coditect-app
kubectl delete pvc theia-config-coditect-combined-1 -n coditect-app
kubectl delete pvc theia-config-coditect-combined-2 -n coditect-app
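The six PVC deletions follow one naming pattern ({workspace|theia-config}-coditect-combined-{0..2}), so they can be generated with a loop. Shown as a dry run that echoes each command; drop the `echo` to execute (point of no return):

```shell
# Dry run: print each delete command instead of executing it.
for prefix in workspace theia-config; do
  for i in 0 1 2; do
    echo kubectl delete pvc "${prefix}-coditect-combined-${i}" -n coditect-app
  done
done
```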

Rollback Plan

If issues occur, rollback in <2 minutes:

Quick Rollback (Ingress Only)

# Switch Ingress back to standard service
kubectl patch ingress coditect-production-ingress -n coditect-app --type='json' \
-p='[{"op": "replace", "path": "/spec/rules/0/http/paths/0/backend/service/name", "value": "coditect-combined-service"}]'

# Verify
kubectl describe ingress coditect-production-ingress -n coditect-app

Downtime: <30 seconds (Ingress update propagation)

Full Rollback (Scale Up Standard)

# If standard was scaled down
kubectl scale statefulset/coditect-combined --replicas=3 -n coditect-app

# Wait for pods ready
kubectl rollout status statefulset/coditect-combined -n coditect-app

# Switch Ingress back
kubectl patch ingress coditect-production-ingress -n coditect-app --type='json' \
-p='[{"op": "replace", "path": "/spec/rules/0/http/paths/0/backend/service/name", "value": "coditect-combined-service"}]'

Downtime: 2-3 minutes (pod startup + Ingress switch)


Verification Checklist

Pre-Migration Checks

  • Hybrid pods all running (3/3)
  • Hybrid pods healthy (liveness/readiness passing)
  • Standard pods status noted (for comparison)
  • No active users (MVP, no production traffic yet)
  • Backup of current Ingress config
  • Git changes committed

Post-Migration Checks

24-Hour Post-Migration

  • No pod restarts
  • Memory usage stable
  • No CrashLoopBackOff
  • Logs clean (no errors)
  • Performance normal

Migration Timeline

| Step | Task | Duration | Cumulative |
|------|------|----------|------------|
| 1 | Update cloudbuild.yaml | 3 min | 3 min |
| 2 | Update hybrid pods to latest image | 5 min | 8 min |
| 3 | Update Ingress routing | 2 min | 10 min |
| 4 | Verify production traffic | 2 min | 12 min |
| 5 | Monitor for issues | 3 min | 15 min |

Total: 12-15 minutes

Downtime: <1 minute (Ingress switch only)


Cost Impact

Before Migration

Storage (3 pods):
- workspace PVCs: 3 × 50 GB = 150 GB
- Config PVCs: 3 × 5 GB = 15 GB
- Total: 165 GB × $0.20/GB/month = $33/month

Compute (3 pods):
- Pod resources: 512Mi-2Gi RAM, 500m-2000m CPU
- Estimated: ~$150/month (unchanged)

Total: ~$183/month

After Migration

Storage (3 pods):
- workspace PVCs: 3 × 10 GB = 30 GB
- Config PVCs: 3 × 5 GB = 15 GB
- Total: 45 GB × $0.20/GB/month = $9/month

Compute (3 pods):
- Pod resources: Same (512Mi-2Gi RAM, 500m-2000m CPU)
- Estimated: ~$150/month (unchanged)

Total: ~$159/month

Savings: $24/month ($288/year) = 13% total cost reduction
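The dollar figures follow directly from the storage totals; a quick arithmetic check, working in integer cents to avoid floating point:

```shell
# Storage totals from the tables above (GB), PD price assumed at $0.20/GB/month.
before_gb=$(( 150 + 15 ))   # standard: 3 x 50GB workspace + 3 x 5GB config
after_gb=$(( 30 + 15 ))     # hybrid:   3 x 10GB workspace + 3 x 5GB config
before_cents=$(( before_gb * 20 ))
after_cents=$(( after_gb * 20 ))
echo "before: \$$(( before_cents / 100 ))/mo, after: \$$(( after_cents / 100 ))/mo, saves \$$(( (before_cents - after_cents) / 100 ))/mo"
# → before: $33/mo, after: $9/mo, saves $24/mo
```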


Risk Assessment

| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| Hybrid pods crash on load | LOW (proven stable 26h) | HIGH | Quick rollback (<2 min) |
| Ingress routing fails | VERY LOW (simple config) | HIGH | Rollback Ingress patch |
| User workspace >10 GB | VERY LOW (MVP, no users) | MEDIUM | Expand PVC if needed |
| Health check failures | LOW (same image) | HIGH | Monitor logs, rollback |
| Session affinity breaks | VERY LOW (same config) | MEDIUM | Verify sticky sessions |

Overall Risk: ✅ LOW

Confidence: HIGH (hybrid proven stable, simple rollback, no user data)


Success Criteria

Migration is successful when:

  1. ✅ Hybrid pods receiving production traffic
  2. ✅ https://coditect.ai responding normally
  3. ✅ theia IDE loading without errors
  4. ✅ No pod restarts or crashes
  5. ✅ Ingress backends showing HEALTHY
  6. ✅ WebSocket connections working
  7. ✅ Cost reduced by $24/month
  8. ✅ No functional degradation

Communication Plan

Before Migration:

  • No user communication needed (MVP, no production users)
  • Team notification: "Migrating to hybrid storage for cost optimization"

During Migration:

  • Monitor Slack channel for any alerts
  • Be ready for quick rollback

After Migration:

  • Confirm success in team channel
  • Update documentation to reflect hybrid as production
  • Monitor for 24 hours


Appendix A: Command Reference

Quick Status Check

# Check all deployments
kubectl get statefulset -n coditect-app
kubectl get pods -n coditect-app -l app=coditect-combined
kubectl get pods -n coditect-app -l app=coditect-combined-hybrid

# Check services
kubectl get svc -n coditect-app | grep combined

# Check Ingress
kubectl get ingress -n coditect-app
kubectl describe ingress coditect-production-ingress -n coditect-app

# Check PVCs
kubectl get pvc -n coditect-app | grep -E "NAME|combined"

Monitor Migration

# Watch pod status (in separate terminal)
watch -n 2 'kubectl get pods -n coditect-app -l app=coditect-combined-hybrid'

# Stream logs
kubectl logs -f coditect-combined-hybrid-0 -n coditect-app

# Check events
kubectl get events -n coditect-app --sort-by='.lastTimestamp' | tail -20

Performance Metrics

# Pod resource usage
kubectl top pods -n coditect-app -l app=coditect-combined-hybrid

# PVC usage
kubectl exec coditect-combined-hybrid-0 -n coditect-app -- df -h /workspace
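The df output can be turned into a simple threshold alarm. A sketch of a hypothetical `check_usage` helper, demonstrated with sample df output; in practice, pipe the `kubectl exec … df -h /workspace` output into it:

```shell
# Warn (exit 1) when the mounted volume's Use% exceeds the given limit.
check_usage() {
  awk -v limit="$1" 'NR == 2 {
    sub(/%/, "", $5)                                      # "23%" -> "23"
    if ($5 + 0 > limit) { print "WARN: " $5 "% used"; exit 1 }
    print "OK: " $5 "% used"
  }'
}

# Demo with sample df output instead of a live pod.
printf '%s\n' \
  'Filesystem      Size  Used Avail Use% Mounted on' \
  '/dev/sdb        9.8G  2.1G  7.2G  23% /workspace' \
  | check_usage 80
# → OK: 23% used
```

A non-zero exit at, say, 80% gives early warning well before the 10 GB workspace PVC fills (see the expansion procedure in Appendix B).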

Appendix B: Troubleshooting

Issue: Hybrid pods not receiving traffic

Symptoms: Ingress updated but traffic still going to standard

Solution:

# Check Ingress configuration
kubectl get ingress coditect-production-ingress -n coditect-app -o yaml | grep -A 5 "backend"

# Force Ingress reload
kubectl annotate ingress coditect-production-ingress -n coditect-app \
kubectl.kubernetes.io/restartedAt="$(date +%s)" --overwrite

# Wait 30 seconds for propagation
sleep 30

Issue: Health checks failing

Symptoms: Pods showing 0/1 ready

Solution:

# Check health endpoint
kubectl exec coditect-combined-hybrid-0 -n coditect-app -- curl -I http://localhost:80/health

# Check pod logs for errors
kubectl logs coditect-combined-hybrid-0 -n coditect-app --tail=100

# If NGINX not starting, check start script
kubectl describe pod coditect-combined-hybrid-0 -n coditect-app

Issue: Out of disk space (unlikely)

Symptoms: workspace PVC full at 10 GB

Solution:

# Expand PVC (dynamic, no downtime)
kubectl patch pvc workspace-coditect-combined-hybrid-0 -n coditect-app \
-p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'

# Verify expansion
kubectl get pvc workspace-coditect-combined-hybrid-0 -n coditect-app

Migration Prepared: 2025-10-29
Status: ✅ Ready to Execute
Next Step: Execute Phase 1 (Update cloudbuild-combined.yaml)