Socket.IO Troubleshooting Quick Reference

Print this page for rapid incident response


🔥 Emergency Quick Fixes (< 5 minutes)

# 1. Add WebSocket support (most likely fix - 85% success rate)
kubectl annotate ingress -n coditect-app coditect-production-ingress \
cloud.google.com/websocket-max-idle-timeout="86400" --overwrite

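# 2. Create health endpoint (70% success rate)
# Caution: nginx accepts `location` only inside a `server { }` block. If
# `nginx -t` fails after the append, move the new line inside the server
# block, then rerun `nginx -t && nginx -s reload`.
kubectl exec -n coditect-app deployment/coditect-combined -- bash -c '
echo "location /health { return 200 \"healthy\"; }" >> /etc/nginx/sites-available/default
nginx -t && nginx -s reload
'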

# 3. Wait for GKE to apply changes
sleep 180

# 4. Test (quote the URL so the shell does not treat & as "run in background")
curl -s -o /dev/null -w "%{http_code}\n" 'https://coditect.ai/theia/socket.io/?EIO=4&transport=polling'
# Expected: 200 (was 400)
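
The fixed `sleep 180` in step 3 can be replaced by polling the endpoint until it answers. The following is a minimal sketch, assuming `curl` is available on the machine running the fix. The function name `wait_for_status` and its default timeout and interval are illustrative and not part of the original runbook.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll a URL until it returns the expected HTTP status
# or the timeout elapses. Defaults (300 s timeout, 10 s interval) are assumptions.
wait_for_status() {
  local url="$1" expected="${2:-200}" timeout="${3:-300}" interval="${4:-10}"
  local elapsed=0 code=""
  while [ "$elapsed" -lt "$timeout" ]; do
    # -s hides progress, -o discards the body, -w prints only the status code.
    # "|| true" keeps the loop alive when curl itself fails (it then prints 000).
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url") || true
    if [ "$code" = "$expected" ]; then
      echo "Got $code after ${elapsed}s"
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "Timed out after ${timeout}s (last status: ${code:-none})" >&2
  return 1
}
```

For example, `wait_for_status 'https://coditect.ai/theia/socket.io/?EIO=4&transport=polling' 200 300 10` returns as soon as the endpoint reports 200, and exits non-zero if it never does within five minutes.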

๐Ÿ” Rapid Diagnostics (< 2 minutes)โ€‹

# Test internal Socket.IO (should work)
# The URL is quoted so that & is not interpreted by the shell.
kubectl exec -n coditect-app deployment/coditect-combined -- \
curl -s 'http://localhost:3000/socket.io/?EIO=4&transport=polling' | grep -q sid && \
echo "✅ Internal OK" || echo "❌ Internal FAIL"

# Test external Socket.IO (currently fails)
curl -s 'https://coditect.ai/theia/socket.io/?EIO=4&transport=polling' | grep -q sid && \
echo "✅ External OK" || echo "❌ External FAIL"

# Check health endpoint
kubectl exec -n coditect-app deployment/coditect-combined -- \
curl -s -o /dev/null -w "Health: %{http_code}\n" http://localhost/health

# Check GKE backend health
# grep -w matches HEALTHY as a whole word, so UNHEALTHY does not count as a pass.
gcloud compute backend-services get-health \
$(kubectl get ingress -n coditect-app coditect-production-ingress \
-o jsonpath='{.metadata.annotations.ingress\.kubernetes\.io/backends}' | \
jq -r 'keys[0]') --global | grep -qw HEALTHY && \
echo "✅ Backend healthy" || echo "❌ Backend unhealthy"
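
The `grep -q sid` checks above only confirm that a session ID came back. An Engine.IO v4 polling handshake body normally looks like `0{"sid":"...","upgrades":["websocket"],...}`, so the same response can also show whether a WebSocket upgrade is being offered. The sketch below assumes that body shape; `check_handshake` is a hypothetical helper name, and it uses `sed` and `grep` rather than a JSON parser, so treat it as a rough check only.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a Socket.IO polling handshake body on stdin and
# report the session ID and whether "websocket" appears in the upgrades list.
check_handshake() {
  local body sid
  body=$(cat)
  # Pull the value of the "sid" field out of the JSON payload.
  sid=$(printf '%s' "$body" | sed -n 's/.*"sid":"\([^"]*\)".*/\1/p')
  if [ -z "$sid" ]; then
    echo "no sid in response"
    return 1
  fi
  # Look for "websocket" inside the upgrades array (before its closing bracket).
  if printf '%s' "$body" | grep -q '"upgrades":\[[^]]*"websocket"'; then
    echo "sid=$sid websocket-upgrade=offered"
  else
    echo "sid=$sid websocket-upgrade=not-offered"
  fi
}
```

Usage: `curl -s 'https://coditect.ai/theia/socket.io/?EIO=4&transport=polling' | check_handshake`. A non-zero exit status means no session ID was found, which corresponds to the FAIL cases above.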

📋 Key Configuration Values

| Component | Setting | Current | Recommended |
|---|---|---|---|
| Ingress | WebSocket timeout | ❌ Not set | ✅ 86400s |
| BackendConfig | Timeout | 3600s | 86400s |
| BackendConfig | Draining | 300s | 30s |
| BackendConfig | Session affinity | ❌ None | ✅ CLIENT_IP |
| Health check | Path | /health | /health |
| Health check | Status | ❌ 404 | ✅ 200 |
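
The recommended BackendConfig values in the table can be expressed as a manifest. This is a sketch only: the field names follow the GKE `BackendConfig` resource (`cloud.google.com/v1`) as commonly documented, the health-check port 80 is an assumption based on the nginx checks in this reference, and `render_backendconfig` is a hypothetical helper name. Verify the fields against the cluster before applying.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a BackendConfig manifest carrying the
# "Recommended" column values. Port 80 is an assumption, not a confirmed value.
render_backendconfig() {
  cat <<'EOF'
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: coditect-backend-config
  namespace: coditect-app
spec:
  timeoutSec: 86400
  connectionDraining:
    drainingTimeoutSec: 30
  sessionAffinity:
    affinityType: "CLIENT_IP"
  healthCheck:
    type: HTTP
    requestPath: /health
    port: 80
EOF
}
```

Running `render_backendconfig | kubectl diff -f -` previews the change against the live object. Since the live BackendConfig may carry fields not shown here, review the diff before piping the same output to `kubectl apply -f -`.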

🎯 Root Cause Probability

█████████████████ 85% GKE missing WebSocket annotation
████████████ 60% No session affinity at LB level
████████ 40% Health check endpoint 404
██████ 30% Host/Origin header corruption
████ 20% Connection draining too long

🔧 Configuration Commands

View Current Config

# Ingress annotations
kubectl get ingress -n coditect-app coditect-production-ingress \
-o jsonpath='{.metadata.annotations}' | jq .

# BackendConfig
kubectl get backendconfig -n coditect-app coditect-backend-config -o yaml

# GKE backend service
BACKEND=$(kubectl get ingress -n coditect-app coditect-production-ingress \
-o jsonpath='{.metadata.annotations.ingress\.kubernetes\.io/backends}' | jq -r 'keys[0]')
gcloud compute backend-services describe $BACKEND --global

Backup Before Changes

# Create backups (ALWAYS do this first!)
kubectl get ingress -n coditect-app coditect-production-ingress \
-o yaml > /tmp/ingress-backup-$(date +%Y%m%d-%H%M%S).yaml

kubectl get backendconfig -n coditect-app coditect-backend-config \
-o yaml > /tmp/backendconfig-backup-$(date +%Y%m%d-%H%M%S).yaml

🔄 Rollback Commands

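# Restore from the most recent backup (ls -t lists the newest file first;
# a bare glob would pass every backup file to kubectl at once)
kubectl apply -f "$(ls -t /tmp/ingress-backup-*.yaml | head -n 1)"
kubectl apply -f "$(ls -t /tmp/backendconfig-backup-*.yaml | head -n 1)"

# Remove health endpoint (deletes only the one-line location entry)
kubectl exec -n coditect-app deployment/coditect-combined -- \
sed -i '/location \/health {/d' /etc/nginx/sites-available/default
kubectl exec -n coditect-app deployment/coditect-combined -- \
bash -c 'nginx -t && nginx -s reload'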

# Wait for GKE
sleep 180

📊 Monitoring Commands

# Watch Socket.IO logs in real-time
kubectl logs -n coditect-app -l app=coditect-combined -f | grep socket.io

# Count 400 errors in last 100 lines
kubectl logs -n coditect-app -l app=coditect-combined --tail=100 | \
grep socket.io | grep -c "400"

# Check active connections
kubectl exec -n coditect-app deployment/coditect-combined -- \
netstat -an | grep :3000 | grep ESTABLISHED | wc -l

# Monitor GKE backend health
watch -n 10 'gcloud compute backend-services get-health \
$(kubectl get ingress -n coditect-app coditect-production-ingress \
-o jsonpath="{.metadata.annotations.ingress\.kubernetes\.io/backends}" | \
jq -r "keys[0]") --global'

🧪 Test Suite

# Full test suite (copy-paste ready)
# URLs are quoted so that & is not interpreted by the shell.
echo "=== Socket.IO Test Suite ===" && \
echo -n "1. Internal direct (3000): " && \
kubectl exec -n coditect-app deployment/coditect-combined -- \
curl -s 'http://localhost:3000/socket.io/?EIO=4&transport=polling' | \
grep -q sid && echo "✅ PASS" || echo "❌ FAIL" && \
echo -n "2. Internal nginx (80): " && \
kubectl exec -n coditect-app deployment/coditect-combined -- \
curl -s 'http://localhost/theia/socket.io/?EIO=4&transport=polling' | \
grep -q sid && echo "✅ PASS" || echo "❌ FAIL" && \
echo -n "3. External (HTTPS): " && \
curl -s 'https://coditect.ai/theia/socket.io/?EIO=4&transport=polling' | \
grep -q sid && echo "✅ PASS" || echo "❌ FAIL" && \
echo -n "4. Health endpoint: " && \
curl -s -o /dev/null -w "%{http_code}" https://coditect.ai/health | \
grep -q 200 && echo "✅ PASS" || echo "❌ FAIL"

🚨 Error Code Reference

| Status | Meaning | Likely Cause | Quick Fix |
|---|---|---|---|
| 200 | ✅ OK | Working! | - |
| 400 | Bad Request | WebSocket headers missing | Add annotation |
| 404 | Not Found | Path routing issue | Check ingress path |
| 502 | Bad Gateway | Backend down | Check pod status |
| 503 | Unavailable | Health check failing | Create /health |
| 504 | Gateway Timeout | Backend slow | Check pod logs |
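
The table above can be wrapped in a small lookup so a test script prints the suggested next step next to the status code. This is a sketch: `status_hint` is a hypothetical helper name, and the strings simply restate the table rows.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map an HTTP status code to the likely cause and
# quick fix listed in the error code reference table.
status_hint() {
  case "$1" in
    200) echo "OK: working" ;;
    400) echo "Bad Request: WebSocket headers missing; add annotation" ;;
    404) echo "Not Found: path routing issue; check ingress path" ;;
    502) echo "Bad Gateway: backend down; check pod status" ;;
    503) echo "Unavailable: health check failing; create /health" ;;
    504) echo "Gateway Timeout: backend slow; check pod logs" ;;
    *)   echo "Status $1: not covered by this reference" ;;
  esac
}
```

Usage: `status_hint "$(curl -s -o /dev/null -w '%{http_code}' 'https://coditect.ai/theia/socket.io/?EIO=4&transport=polling')"`.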

📞 Escalation Path

  1. L1: Run diagnostics script: ./socketio-diagnostics.sh
  2. L2: Apply quick fixes (WebSocket annotation + health endpoint)
  3. L3: Review detailed analysis: socket-io-investigation-analysis.md
  4. L4: Implement comprehensive fixes: fix-implementation-guide.md
  5. L5: Engage Google Cloud support with diagnostic output

🔑 Key Files

| File | Purpose | Location |
|---|---|---|
| executive-summary.md | One-page overview | /home/claude/ |
| investigation-runbook.md | Detailed procedures | /home/claude/ |
| diagnostic-decision-tree.md | Troubleshooting tree | /home/claude/ |
| socketio-diagnostics.sh | Automated tests | /home/claude/ |
| fix-implementation-guide.md | Fix procedures | /home/claude/ |
| architecture-diagrams.md | Visual diagrams | /home/claude/ |

💡 Pro Tips

✅ DO:

  • Always back up before changes
  • Test internal connectivity first
  • Wait 2-5 min after GKE changes
  • Apply fixes incrementally
  • Monitor logs during tests

โŒ DON'T:

  • Skip backups
  • Apply all fixes at once without testing
  • Forget to wait for GKE reconciliation
  • Test immediately after changes
  • Make changes during peak hours

๐Ÿ“ Quick Notes Sectionโ€‹

Date: _______________________
Issue: _______________________
Fix Applied: _______________________
Result: _______________________
Notes: _______________________
_______________________
_______________________

Last Updated: October 20, 2025
Version: 1.0
Owner: DevOps Team

Print this page and keep it handy for rapid incident response!