Socket.IO Troubleshooting Quick Reference
Print this page for rapid incident response
Emergency Quick Fixes (< 5 minutes)
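# 1. Add WebSocket support (most likely fix - 85% success rate)
# Confirm this annotation against the current GKE Ingress docs before relying on it;
# the documented control for WebSocket connection lifetime is BackendConfig timeoutSec.
kubectl annotate ingress -n coditect-app coditect-production-ingress \
cloud.google.com/websocket-max-idle-timeout="86400" --overwrite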
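# 2. Create health endpoint (70% success rate)
# Insert the location inside the first server block; appending it to the end of the
# file would place it outside any server block and fail nginx -t.
# This changes one running pod only and is lost when the pod restarts.
kubectl exec -n coditect-app deployment/coditect-combined -- bash -c '
sed -i "0,/server[[:space:]]*{/s//&\n    location \/health { return 200 \"healthy\"; }/" /etc/nginx/sites-available/default
nginx -t && nginx -s reload
'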
# 3. Wait for GKE to apply changes
sleep 180
# 4. Test
curl -s -o /dev/null -w "%{http_code}\n" "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling"
# Expected: 200 (was 400)
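Steps 3 and 4 wait a fixed 180 seconds and then test once. A minimal sketch of an alternative is to poll the same check until it returns the expected status. `wait_for_status` and `probe_external` are hypothetical helper names, not part of the existing tooling, and the 18 x 10 s budget is an assumption chosen to match the original wait.

```bash
#!/usr/bin/env bash
# Poll a check until it prints the expected HTTP status, instead of a fixed sleep.
# Usage: wait_for_status EXPECTED MAX_ATTEMPTS INTERVAL_SECONDS COMMAND [ARGS...]
# The command must print only the status code (hypothetical helper).
wait_for_status() {
  local expected=$1 attempts=$2 interval=$3
  shift 3
  local i code
  for ((i = 1; i <= attempts; i++)); do
    # A failing command counts as "no status" rather than aborting the loop
    code=$("$@" 2>/dev/null) || code=""
    if [[ $code == "$expected" ]]; then
      echo "attempt $i: got $code"
      return 0
    fi
    # Sleep between attempts, but not after the last one
    if ((i < attempts)); then sleep "$interval"; fi
  done
  echo "gave up after $attempts attempts (last status: ${code:-none})" >&2
  return 1
}

# The external check from step 4, wrapped so it prints only the status code.
# The URL is quoted so the shell does not treat "&" as a background operator.
probe_external() {
  curl -s -o /dev/null -w "%{http_code}" \
    "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling"
}

# Example: up to 18 attempts, 10 seconds apart (roughly the same 180 s budget)
# wait_for_status 200 18 10 probe_external
```

Polling returns as soon as the load balancer has picked up the change and still fails loudly if it never does.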
Rapid Diagnostics (< 2 minutes)
# Test internal Socket.IO (should work)
kubectl exec -n coditect-app deployment/coditect-combined -- \
curl -s "http://localhost:3000/socket.io/?EIO=4&transport=polling" | grep -q sid && \
echo "✅ Internal OK" || echo "❌ Internal FAIL"
# Test external Socket.IO (currently fails)
curl -s "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling" | grep -q sid && \
echo "✅ External OK" || echo "❌ External FAIL"
# Check health endpoint
kubectl exec -n coditect-app deployment/coditect-combined -- \
curl -s -o /dev/null -w "Health: %{http_code}\n" http://localhost/health
# Check GKE backend health (-w keeps UNHEALTHY from matching HEALTHY)
gcloud compute backend-services get-health \
"$(kubectl get ingress -n coditect-app coditect-production-ingress \
-o jsonpath='{.metadata.annotations.ingress\.kubernetes\.io/backends}' | \
jq -r 'keys[0]')" --global | grep -qw HEALTHY && \
echo "✅ Backend healthy" || echo "❌ Backend unhealthy"
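The Socket.IO checks above treat any response containing the string `sid` as a success, which an HTML error page could also satisfy. A stricter sketch parses the Engine.IO v4 polling handshake, which is an "open" packet: the digit `0` followed by a JSON object carrying `sid`. `extract_sid` and `check_handshake` are hypothetical helper names.

```bash
#!/usr/bin/env bash
# Extract the session id from an Engine.IO v4 polling handshake body on stdin.
# A successful handshake looks like:
#   0{"sid":"...","upgrades":["websocket"],"pingInterval":25000,...}
# Prints the sid, or nothing when the body is not an open packet.
extract_sid() {
  sed -n 's/^0{.*"sid":"\([^"]*\)".*/\1/p'
}

# Report OK/FAIL for a handshake body on stdin (hypothetical helper).
check_handshake() {
  local sid
  sid=$(extract_sid)
  if [ -n "$sid" ]; then
    echo "OK sid=$sid"
  else
    echo "FAIL no sid in response"
    return 1
  fi
}

# Example:
# curl -s "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling" | check_handshake
```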
Key Configuration Values
| Component | Setting | Current | Recommended |
|---|---|---|---|
| Ingress | WebSocket timeout | ❌ Not set | ✅ 86400s |
| BackendConfig | Timeout | 3600s | 86400s |
| BackendConfig | Draining | 300s | 30s |
| BackendConfig | Session affinity | ❌ None | ✅ CLIENT_IP |
| Health check | Path | /health | /health |
| Health check | Status | ❌ 404 | ✅ 200 |
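As a rough sketch, the BackendConfig rows of the "Recommended" column could be expressed as a manifest like the one below. The resource name and namespace come from the commands in this reference; `render_backendconfig` is a hypothetical helper, and the `healthCheck` type is an assumption. Applying it would replace the existing spec, so any fields already set on the live object need to be merged in first.

```bash
#!/usr/bin/env bash
# Print a BackendConfig manifest carrying the recommended values from the table.
# Assumption: the health check is plain HTTP on the serving port.
render_backendconfig() {
  cat <<'EOF'
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: coditect-backend-config
  namespace: coditect-app
spec:
  timeoutSec: 86400
  connectionDraining:
    drainingTimeoutSec: 30
  sessionAffinity:
    affinityType: "CLIENT_IP"
  healthCheck:
    type: HTTP
    requestPath: /health
EOF
}

# Review the difference against the live object before applying:
# render_backendconfig | kubectl diff -f -
# render_backendconfig | kubectl apply -f -
```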
Root Cause Probability
█████████████████ 85% GKE missing WebSocket annotation
████████████ 60% No session affinity at LB level
████████ 40% Health check endpoint 404
██████ 30% Host/Origin header corruption
████ 20% Connection draining too long
Configuration Commands
View Current Config
# Ingress annotations
kubectl get ingress -n coditect-app coditect-production-ingress \
-o jsonpath='{.metadata.annotations}' | jq .
# BackendConfig
kubectl get backendconfig -n coditect-app coditect-backend-config -o yaml
# GKE backend service
BACKEND=$(kubectl get ingress -n coditect-app coditect-production-ingress \
-o jsonpath='{.metadata.annotations.ingress\.kubernetes\.io/backends}' | jq -r 'keys[0]')
gcloud compute backend-services describe $BACKEND --global
Backup Before Changes
# Create backups (ALWAYS do this first!)
kubectl get ingress -n coditect-app coditect-production-ingress \
-o yaml > /tmp/ingress-backup-$(date +%Y%m%d-%H%M%S).yaml
kubectl get backendconfig -n coditect-app coditect-backend-config \
-o yaml > /tmp/backendconfig-backup-$(date +%Y%m%d-%H%M%S).yaml
Rollback Commands
# Restore from the most recent backup (a bare glob would apply every backup file)
kubectl apply -f "$(ls -t /tmp/ingress-backup-*.yaml | head -1)"
kubectl apply -f "$(ls -t /tmp/backendconfig-backup-*.yaml | head -1)"
# Remove health endpoint (delete only the single-line location block)
kubectl exec -n coditect-app deployment/coditect-combined -- \
sed -i '/location \/health {/d' /etc/nginx/sites-available/default
kubectl exec -n coditect-app deployment/coditect-combined -- nginx -s reload
# Wait for GKE
sleep 180
Monitoring Commands
# Watch Socket.IO logs in real-time
kubectl logs -n coditect-app -l app=coditect-combined -f | grep socket.io
# Count 400 errors in last 100 lines
kubectl logs -n coditect-app -l app=coditect-combined --tail=100 | \
grep socket.io | grep -c "400"
# Check active connections
kubectl exec -n coditect-app deployment/coditect-combined -- \
netstat -an | grep :3000 | grep ESTABLISHED | wc -l
# Monitor GKE backend health
watch -n 10 'gcloud compute backend-services get-health \
$(kubectl get ingress -n coditect-app coditect-production-ingress \
-o jsonpath="{.metadata.annotations.ingress\.kubernetes\.io/backends}" | \
jq -r "keys[0]") --global'
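The `grep -c "400"` count above also matches lines where 400 appears as a byte count or inside a timestamp. A narrower sketch, assuming the pods log in nginx's default combined format (the status code follows the quoted request line), counts only Socket.IO requests with a given status. `count_socketio_status` is a hypothetical helper; adjust the field logic if the log format differs.

```bash
#!/usr/bin/env bash
# Count Socket.IO requests with a given HTTP status in access-log lines on stdin.
# Assumes nginx combined log format:
#   IP - - [time] "METHOD /path HTTP/1.1" STATUS BYTES "referer" "user-agent"
count_socketio_status() {
  local status=$1
  # Split on double quotes: $2 is the request line, $3 is " STATUS BYTES "
  awk -F'"' -v want="$status" '
    $2 ~ /socket\.io/ { split($3, f, " "); if (f[1] == want) n++ }
    END { print n + 0 }
  '
}

# Example:
# kubectl logs -n coditect-app -l app=coditect-combined --tail=100 | count_socketio_status 400
```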
Test Suite
# Full test suite (copy-paste ready)
echo "=== Socket.IO Test Suite ===" && \
echo -n "1. Internal direct (3000): " && \
kubectl exec -n coditect-app deployment/coditect-combined -- \
curl -s "http://localhost:3000/socket.io/?EIO=4&transport=polling" | \
grep -q sid && echo "✅ PASS" || echo "❌ FAIL" && \
echo -n "2. Internal nginx (80): " && \
kubectl exec -n coditect-app deployment/coditect-combined -- \
curl -s "http://localhost/theia/socket.io/?EIO=4&transport=polling" | \
grep -q sid && echo "✅ PASS" || echo "❌ FAIL" && \
echo -n "3. External (HTTPS): " && \
curl -s "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling" | \
grep -q sid && echo "✅ PASS" || echo "❌ FAIL" && \
echo -n "4. Health endpoint: " && \
curl -s -o /dev/null -w "%{http_code}" https://coditect.ai/health | \
grep -q 200 && echo "✅ PASS" || echo "❌ FAIL"
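The chained one-liner prints a result per check but gives no overall exit status. One possible restructuring, sketched below, wraps each check in a small helper that tallies results so the suite can be used in scripts. `run_check`, `has_sid`, and the counters are hypothetical names introduced for illustration.

```bash
#!/usr/bin/env bash
# Tally PASS/FAIL per check so the suite can report a summary and exit status.
passed=0
failed=0

# Usage: run_check LABEL COMMAND [ARGS...]
# Runs the command quietly and records whether it succeeded.
run_check() {
  local label=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "$label: PASS"
    passed=$((passed + 1))
  else
    echo "$label: FAIL"
    failed=$((failed + 1))
  fi
}

# Succeeds when the URL's response body carries a "sid" key.
has_sid() {
  curl -s "$1" | grep -q '"sid"'
}

# Example:
# run_check "3. External (HTTPS)" has_sid "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling"
# echo "passed=$passed failed=$failed"; [ "$failed" -eq 0 ]
```

Call `run_check` directly rather than inside `$(...)`, since a subshell would discard the counter updates.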
Error Code Reference
| Status | Meaning | Likely Cause | Quick Fix |
|---|---|---|---|
| 200 | ✅ OK | Working! | - |
| 400 | Bad Request | WebSocket headers missing | Add annotation |
| 404 | Not Found | Path routing issue | Check ingress path |
| 502 | Bad Gateway | Backend down | Check pod status |
| 503 | Unavailable | Health check failing | Create /health |
| 504 | Gateway Timeout | Backend slow | Check pod logs |
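For scripted triage, the table above can be turned into a lookup. This is only a sketch: `explain_status` is a hypothetical helper, and the messages restate the table rows rather than adding new diagnoses.

```bash
#!/usr/bin/env bash
# Map an HTTP status to the likely cause and quick fix from the reference table.
# Returns non-zero for statuses the table does not cover.
explain_status() {
  case $1 in
    200) echo "OK: working" ;;
    400) echo "Bad Request: WebSocket headers missing -> add annotation" ;;
    404) echo "Not Found: path routing issue -> check ingress path" ;;
    502) echo "Bad Gateway: backend down -> check pod status" ;;
    503) echo "Unavailable: health check failing -> create /health" ;;
    504) echo "Gateway Timeout: backend slow -> check pod logs" ;;
    *)
      echo "Status $1 is not covered by this reference"
      return 1
      ;;
  esac
}

# Example:
# code=$(curl -s -o /dev/null -w "%{http_code}" "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling")
# explain_status "$code"
```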
Escalation Path
- L1: Run diagnostics script: ./socketio-diagnostics.sh
- L2: Apply quick fixes (WebSocket annotation + health endpoint)
- L3: Review detailed analysis: socket-io-investigation-analysis.md
- L4: Implement comprehensive fixes: fix-implementation-guide.md
- L5: Engage Google Cloud support with diagnostic output
Key Files
| File | Purpose | Location |
|---|---|---|
| executive-summary.md | One-page overview | /home/claude/ |
| investigation-runbook.md | Detailed procedures | /home/claude/ |
| diagnostic-decision-tree.md | Troubleshooting tree | /home/claude/ |
| socketio-diagnostics.sh | Automated tests | /home/claude/ |
| fix-implementation-guide.md | Fix procedures | /home/claude/ |
| architecture-diagrams.md | Visual diagrams | /home/claude/ |
Pro Tips
✅ DO:
- Always backup before changes
- Test internal connectivity first
- Wait 2-5 min after GKE changes
- Apply fixes incrementally
- Monitor logs during tests
❌ DON'T:
- Skip backups
- Apply all fixes at once without testing
- Forget to wait for GKE reconciliation
- Test immediately after changes
- Make changes during peak hours
Learning Resources
- GKE WebSocket: https://cloud.google.com/load-balancing/docs/https#websocket
- Socket.IO Protocol: https://socket.io/docs/v4/how-it-works/
- BackendConfig: https://cloud.google.com/kubernetes-engine/docs/how-to/ingress-features
Quick Notes Section
Date: _______________________
Issue: _______________________
Fix Applied: _______________________
Result: _______________________
Notes: _______________________
_______________________
_______________________
Last Updated: October 20, 2025
Version: 1.0
Owner: DevOps Team