Socket.IO Fix Implementation Guide
Target Issue: Socket.IO 400 errors from external requests (internal requests work)
Risk Level: MEDIUM (requires GKE load balancer reconfiguration)
Downtime: None expected (rolling changes)
Rollback Time: 2-5 minutes
Pre-Implementation Checklist
- Diagnostics completed (run socketio-diagnostics.sh)
- Current configuration backed up
- kubectl access verified: kubectl get pods -n coditect-app
- gcloud CLI authenticated: gcloud auth list
- Maintenance window scheduled (optional, changes are non-disruptive)
- Rollback procedure reviewed
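The tooling items on this checklist can be wrapped in a quick preflight script. This is a minimal sketch (tool names taken from this guide) that reports missing CLIs without aborting on the first gap:

```shell
#!/usr/bin/env bash
# Preflight helper: report whether each required CLI is on PATH.

check_tool() {
  # Prints OK/MISSING for a single command; always returns 0 so the
  # preflight reports everything instead of stopping at the first gap.
  if command -v "$1" >/dev/null 2>&1; then
    echo "OK      $1"
  else
    echo "MISSING $1"
  fi
}

preflight() {
  for tool in kubectl gcloud jq curl; do
    check_tool "$tool"
  done
}

preflight
```

Run it before starting; any MISSING line means the corresponding checklist item needs attention first.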
Fix Priority Matrix
Based on diagnostic results, apply fixes in this order:
| Priority | Fix | Probability of Success | Risk | Time |
|---|---|---|---|---|
| P0 | Add WebSocket annotation to Ingress | 85% | LOW | 5 min |
| P0 | Create health check endpoint | 70% | LOW | 5 min |
| P1 | Configure session affinity in BackendConfig | 60% | LOW | 10 min |
| P1 | Update timeout settings | 30% | LOW | 10 min |
| P2 | Reduce connection draining | 20% | MEDIUM | 10 min |
FIX #1: Add WebSocket Support to GKE Ingress [P0]
Problem: GKE L7 load balancer doesn't forward WebSocket Upgrade headers by default
Symptoms:
- 400 errors on Socket.IO connections
- Works internally but fails externally
- Headers missing in nginx logs
Solution: Add WebSocket idle timeout annotation to Ingress
Implementation
# 1. Verify current annotations
kubectl get ingress -n coditect-app coditect-production-ingress \
-o jsonpath='{.metadata.annotations}' | jq .
# 2. Backup current ingress config
kubectl get ingress -n coditect-app coditect-production-ingress \
-o yaml > /tmp/ingress-backup-$(date +%Y%m%d-%H%M%S).yaml
# 3. Add WebSocket annotation (24 hour idle timeout)
kubectl annotate ingress -n coditect-app coditect-production-ingress \
cloud.google.com/websocket-max-idle-timeout="86400" \
--overwrite
# 4. Verify annotation applied
kubectl get ingress -n coditect-app coditect-production-ingress \
-o jsonpath='{.metadata.annotations.cloud\.google\.com/websocket-max-idle-timeout}'
# Expected output: 86400
Validation
# Wait 2-5 minutes for GKE to reconcile
# Test Socket.IO connection
curl -s -o /dev/null -w "Status: %{http_code}\n" \
"https://coditect.ai/theia/socket.io/?EIO=4&transport=polling"
# Expected: Status: 200 (was 400 before)
# Test WebSocket upgrade
curl -s -I "https://coditect.ai/theia/socket.io/?EIO=4&transport=websocket" \
-H "Upgrade: websocket" \
-H "Connection: Upgrade" \
-H "Sec-WebSocket-Version: 13" \
-H "Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==" | \
grep -E "(HTTP|Upgrade|Connection)"
# Expected: HTTP/1.1 101 Switching Protocols (or 200 if polling fallback works)
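The Sec-WebSocket-Key above is the fixed sample nonce from RFC 6455, which is fine for a smoke test. For repeated testing, a fresh key (16 random bytes, base64-encoded, per RFC 6455) can be generated each time; this sketch assumes openssl is available:

```shell
# RFC 6455 requires Sec-WebSocket-Key to be 16 random bytes, base64-encoded
WS_KEY=$(openssl rand -base64 16)
echo "Sec-WebSocket-Key: $WS_KEY"
```

Substitute $WS_KEY for the sample key in the curl command above.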
Rollback
# Remove annotation (trailing "-" deletes the key; no --overwrite needed)
kubectl annotate ingress -n coditect-app coditect-production-ingress \
cloud.google.com/websocket-max-idle-timeout-
# Or restore the most recent backup
kubectl apply -f "$(ls -t /tmp/ingress-backup-*.yaml | head -1)"
FIX #2: Create Health Check Endpoint [P0]
Problem: BackendConfig health check points to non-existent /health endpoint
Symptoms:
- Backend marked as unhealthy
- Intermittent connection failures
- 503 errors during health checks
Solution: Create /health endpoint in nginx OR update BackendConfig path
Option A: Create /health Endpoint (Recommended)
# 1. Get current nginx config
POD=$(kubectl get pod -n coditect-app -l app=coditect-combined \
-o jsonpath='{.items[0].metadata.name}')
kubectl exec -n coditect-app "$POD" -- \
cat /etc/nginx/sites-available/default > /tmp/nginx-config-backup.txt
# 2. Create health endpoint via ConfigMap or direct injection
cat <<'EOF' > /tmp/health-endpoint.conf
# Health check endpoint for GKE load balancer
location /health {
    access_log off;
    return 200 "healthy\n";
    add_header Content-Type text/plain;
}
EOF
# 3. Apply to all pods (use ConfigMap for persistence)
kubectl create configmap -n coditect-app nginx-health-config \
--from-file=health.conf=/tmp/health-endpoint.conf \
--dry-run=client -o yaml | kubectl apply -f -
# 4. Mount ConfigMap in deployment (requires deployment update)
# OR inject directly into running pods (temporary)
kubectl exec -n coditect-app "$POD" -- bash -c '
CONF=/etc/nginx/sites-available/default
# The location block must live inside the server {} block, so splice it in
# just before the final closing brace (appending to the end of the file
# would leave it at top level and break nginx -t).
# Assumes the server block closes on the file'"'"'s last line starting with "}".
LAST=$(grep -n "^}" "$CONF" | tail -1 | cut -d: -f1)
head -n $((LAST - 1)) "$CONF" > /tmp/default.new
cat >> /tmp/default.new <<EOF
    # Health check endpoint for GKE load balancer
    location /health {
        access_log off;
        return 200 "healthy\\n";
        add_header Content-Type text/plain;
    }
}
EOF
cp /tmp/default.new "$CONF"
'
# 5. Test configuration and reload
kubectl exec -n coditect-app "$POD" -- nginx -t
kubectl exec -n coditect-app "$POD" -- nginx -s reload
# 6. Verify endpoint works
kubectl exec -n coditect-app "$POD" -- \
curl -s -o /dev/null -w "%{http_code}" http://localhost/health
# Expected: 200
Option B: Update BackendConfig Path (Alternative)
# Use existing endpoint instead (e.g., /theia/)
kubectl patch backendconfig -n coditect-app coditect-backend-config \
--type=merge -p '
spec:
  healthCheck:
    requestPath: /theia/
    checkIntervalSec: 15
    timeoutSec: 5
    healthyThreshold: 2
    unhealthyThreshold: 5
'
# Wait for GKE to apply
sleep 120
Validation
# Check endpoint responds
curl -s https://coditect.ai/health
# Expected: healthy
# Check GKE backend health
BACKEND=$(kubectl get ingress -n coditect-app coditect-production-ingress \
-o jsonpath='{.metadata.annotations.ingress\.kubernetes\.io/backends}' | \
jq -r 'keys[0]')
gcloud compute backend-services get-health "$BACKEND" --global
# Expected: All backends HEALTHY
Rollback
# Restore original nginx config (the backup lives on the local machine,
# so stream it into the pod instead of referencing it from inside)
kubectl exec -i -n coditect-app "$POD" -- \
bash -c 'cat > /etc/nginx/sites-available/default' \
< /tmp/nginx-config-backup.txt
kubectl exec -n coditect-app "$POD" -- nginx -t
kubectl exec -n coditect-app "$POD" -- nginx -s reload
FIX #3: Configure Session Affinity [P1]
Problem: Requests distributed across multiple pods break Socket.IO sessions
Symptoms:
- Intermittent 400 errors
- Connection drops after handshake
- Different pods handling handshake vs upgrade
Solution: Configure CLIENT_IP session affinity in BackendConfig
Implementation
# 1. Backup current BackendConfig
kubectl get backendconfig -n coditect-app coditect-backend-config \
-o yaml > /tmp/backendconfig-backup-$(date +%Y%m%d-%H%M%S).yaml
# 2. Update BackendConfig with session affinity
kubectl patch backendconfig -n coditect-app coditect-backend-config \
--type=merge -p '
spec:
  sessionAffinity:
    affinityType: "CLIENT_IP"
'
# Note: affinityCookieTtlSec applies only to GENERATED_COOKIE affinity,
# so it is omitted for CLIENT_IP
# 3. Verify changes applied
kubectl get backendconfig -n coditect-app coditect-backend-config \
-o jsonpath='{.spec.sessionAffinity}'
# Expected: {"affinityType":"CLIENT_IP"}
# 4. Wait for GKE to reconcile (2-5 minutes)
echo "Waiting for GKE to apply session affinity..."
sleep 180
Validation
# Make 5 sequential requests from the same client IP
# (CLIENT_IP affinity keys on source address, not cookies)
for i in {1..5}; do
curl -s "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling&test=$i" \
> /dev/null
sleep 1
done
# Check pod logs to see if same pod handled all requests
kubectl logs -n coditect-app -l app=coditect-combined --tail=20 | \
grep "socket.io" | awk '{print $1}' | sort | uniq -c
# Expected: All requests from same client IP hit same pod
# Output example:
# 5 10.128.0.12 (all requests to same pod)
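The grep/awk/uniq tally in the last step can be dry-run locally before pointing it at live logs. Assuming access-log lines begin with the client/pod address (a hypothetical format for illustration), the same pipeline summarizes per-source request counts:

```shell
# Summarize how many socket.io requests each source handled, using
# sample access-log lines (hypothetical format: "<ip> GET <uri> <status>")
LOG_SAMPLE='10.128.0.12 GET /theia/socket.io/?EIO=4&transport=polling 200
10.128.0.12 GET /theia/socket.io/?EIO=4&transport=polling 200
10.128.0.13 GET /theia/socket.io/?EIO=4&transport=polling 200'

TALLY=$(printf '%s\n' "$LOG_SAMPLE" | grep "socket.io" | \
  awk '{print $1}' | sort | uniq -c | sort -rn)
echo "$TALLY"
```

With session affinity working, the live tally should collapse to a single source for requests from one client.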
Rollback
# Remove session affinity
kubectl patch backendconfig -n coditect-app coditect-backend-config \
--type=merge -p '
spec:
  sessionAffinity: null
'
# Or restore the most recent backup
kubectl apply -f "$(ls -t /tmp/backendconfig-backup-*.yaml | head -1)"
FIX #4: Optimize Timeout Settings [P1]
Problem: Backend timeout may be too short for long-lived WebSocket connections
Symptoms:
- Connections drop after 1 hour
- No reconnection attempts
- Silent failures in long sessions
Solution: Increase backend timeout to 24 hours
Implementation
# 1. Backup current BackendConfig (if not already done)
kubectl get backendconfig -n coditect-app coditect-backend-config \
-o yaml > /tmp/backendconfig-backup-$(date +%Y%m%d-%H%M%S).yaml
# 2. Update timeout to 24 hours (86400 seconds)
kubectl patch backendconfig -n coditect-app coditect-backend-config \
--type=merge -p '
spec:
  timeoutSec: 86400
'
# 3. Verify changes
kubectl get backendconfig -n coditect-app coditect-backend-config \
-o jsonpath='{.spec.timeoutSec}'
# Expected: 86400
# 4. Wait for GKE to apply
sleep 180
# 5. Verify GKE backend service updated
BACKEND=$(kubectl get ingress -n coditect-app coditect-production-ingress \
-o jsonpath='{.metadata.annotations.ingress\.kubernetes\.io/backends}' | \
jq -r 'keys[0]')
gcloud compute backend-services describe "$BACKEND" --global \
--format="value(timeoutSec)"
# Expected: 86400
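Timeout values expressed in raw seconds are easy to misread; a tiny helper renders them as hours and minutes for a sanity check before applying them:

```shell
# Render a timeoutSec value as hours and minutes for a quick sanity check
secs_to_hours() {
  echo "$(( $1 / 3600 ))h $(( ($1 % 3600) / 60 ))m"
}

secs_to_hours 86400   # 24h 0m
secs_to_hours 10800   # 3h 0m
```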
Validation
# Long-running connection test (run in background, logging to a file)
(
echo "Starting long-lived connection test..."
START_TIME=$(date +%s)
# Poll the Socket.IO endpoint; treat anything but HTTP 200 as a failure
# (curl exits 0 even on a 400, so check the status code explicitly)
while true; do
  CURRENT_TIME=$(date +%s)
  ELAPSED=$((CURRENT_TIME - START_TIME))
  HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
  "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling")
  if [ "$HTTP_CODE" = "200" ]; then
    echo "[$ELAPSED seconds] Connection still alive"
  else
    echo "[$ELAPSED seconds] Connection FAILED (status: $HTTP_CODE)"
    break
  fi
  sleep 300 # Check every 5 minutes
  # Stop after 2 hours
  if [ $ELAPSED -gt 7200 ]; then
    echo "Test completed: Connection survived 2 hours"
    break
  fi
done
) > /tmp/long-connection-test.log 2>&1 &
echo "Long-lived connection test started in background"
echo "Check progress with: tail -f /tmp/long-connection-test.log"
Rollback
# Restore original timeout
kubectl patch backendconfig -n coditect-app coditect-backend-config \
--type=merge -p '
spec:
  timeoutSec: 3600
'
# Or restore the most recent backup
kubectl apply -f "$(ls -t /tmp/backendconfig-backup-*.yaml | head -1)"
FIX #5: Reduce Connection Draining [P2]
Problem: Long connection draining kills active WebSocket connections during updates
Symptoms:
- Disconnections during deployments
- Slow rolling updates
- Pod restarts drop connections
Solution: Reduce draining timeout to 30 seconds
Implementation
# 1. Update connection draining
kubectl patch backendconfig -n coditect-app coditect-backend-config \
--type=merge -p '
spec:
  connectionDraining:
    drainingTimeoutSec: 30
'
# 2. Verify changes
kubectl get backendconfig -n coditect-app coditect-backend-config \
-o jsonpath='{.spec.connectionDraining.drainingTimeoutSec}'
# Expected: 30
# 3. Wait for GKE to apply
sleep 180
Validation
# Test availability during a rolling update.
# A single polling request returns immediately, so backgrounding one curl
# proves nothing; instead, poll in a loop while the rollout proceeds.
FAILURES=0
kubectl rollout restart deployment -n coditect-app coditect-combined
for i in {1..12}; do
  HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
  "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling")
  [ "$HTTP_CODE" = "200" ] || FAILURES=$((FAILURES + 1))
  sleep 5
done
if [ "$FAILURES" -eq 0 ]; then
  echo "No failed requests during rolling update"
else
  echo "$FAILURES of 12 requests failed during rolling update"
fi
Rollback
# Restore original draining timeout
kubectl patch backendconfig -n coditect-app coditect-backend-config \
--type=merge -p '
spec:
  connectionDraining:
    drainingTimeoutSec: 300
'
COMPREHENSIVE FIX: Apply All Changes at Once
For maximum efficiency, apply all fixes simultaneously:
#!/bin/bash
# Comprehensive Socket.IO fix script
set -euo pipefail
NAMESPACE="coditect-app"
INGRESS="coditect-production-ingress"
BACKENDCONFIG="coditect-backend-config"
DEPLOYMENT="coditect-combined"
echo "=== Applying comprehensive Socket.IO fixes ==="
# Backup current configurations
echo "Creating backups..."
kubectl get ingress -n "$NAMESPACE" "$INGRESS" \
-o yaml > "/tmp/ingress-backup-$(date +%Y%m%d-%H%M%S).yaml"
kubectl get backendconfig -n "$NAMESPACE" "$BACKENDCONFIG" \
-o yaml > "/tmp/backendconfig-backup-$(date +%Y%m%d-%H%M%S).yaml"
# Fix #1: Add WebSocket annotation
echo "Adding WebSocket support annotation..."
kubectl annotate ingress -n "$NAMESPACE" "$INGRESS" \
cloud.google.com/websocket-max-idle-timeout="86400" \
--overwrite
# Fix #2: Create health endpoint
echo "Creating health endpoint..."
POD=$(kubectl get pod -n "$NAMESPACE" -l "app=$DEPLOYMENT" \
-o jsonpath='{.items[0].metadata.name}')
kubectl exec -n "$NAMESPACE" "$POD" -- bash -c '
CONF=/etc/nginx/sites-available/default
if ! grep -q "location /health" "$CONF"; then
  # Splice the block in before the final closing brace so it lands inside
  # the server {} block (appending would leave it at top level and break
  # nginx -t). Assumes the server block closes on the last "}" line.
  LAST=$(grep -n "^}" "$CONF" | tail -1 | cut -d: -f1)
  head -n $((LAST - 1)) "$CONF" > /tmp/default.new
  cat >> /tmp/default.new <<EOF
    # Health check endpoint for GKE load balancer
    location /health {
        access_log off;
        return 200 "healthy\\n";
        add_header Content-Type text/plain;
    }
}
EOF
  cp /tmp/default.new "$CONF"
  nginx -t && nginx -s reload
fi
'
# Fix #3-5: Update BackendConfig (all settings)
echo "Updating BackendConfig..."
cat <<EOF | kubectl apply -f -
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: $BACKENDCONFIG
  namespace: $NAMESPACE
spec:
  # Extended timeout for long-lived WebSocket connections
  timeoutSec: 86400  # 24 hours
  # Minimal connection draining
  connectionDraining:
    drainingTimeoutSec: 30
  # Session affinity for Socket.IO
  # (affinityCookieTtlSec applies only to GENERATED_COOKIE, so it is omitted)
  sessionAffinity:
    affinityType: "CLIENT_IP"
  # Health check configuration
  healthCheck:
    checkIntervalSec: 15
    timeoutSec: 5
    healthyThreshold: 2
    unhealthyThreshold: 5
    type: HTTP
    requestPath: /health
    port: 80
  # Enable logging for debugging
  logging:
    enable: true
    sampleRate: 1.0
EOF
echo "=== Changes applied ==="
echo "Waiting 5 minutes for GKE to reconcile..."
sleep 300
echo "=== Testing Socket.IO connection ==="
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
"https://coditect.ai/theia/socket.io/?EIO=4&transport=polling")
if [ "$HTTP_CODE" = "200" ]; then
echo "✅ SUCCESS: Socket.IO now returns 200"
else
echo "❌ FAILED: Socket.IO still returns $HTTP_CODE"
echo "Check logs for more details"
fi
echo "=== Fix complete ==="
echo "Backups saved to /tmp/*-backup-*.yaml"
Post-Implementation Validation
Comprehensive Test Suite
#!/bin/bash
# Socket.IO validation test suite
echo "=== Socket.IO Validation Test Suite ==="
PASS=0
FAIL=0
# Test 1: Polling transport
echo -n "Test 1: Polling transport... "
RESPONSE=$(curl -s "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling")
if echo "$RESPONSE" | grep -q '"sid"'; then
echo "✅ PASS"
((PASS++))
else
echo "❌ FAIL"
((FAIL++))
fi
# Test 2: WebSocket upgrade
echo -n "Test 2: WebSocket upgrade... "
STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
"https://coditect.ai/theia/socket.io/?EIO=4&transport=websocket" \
-H "Upgrade: websocket" \
-H "Connection: Upgrade" \
-H "Sec-WebSocket-Version: 13" \
-H "Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==")
if [ "$STATUS" = "101" ] || [ "$STATUS" = "200" ]; then
echo "✅ PASS"
((PASS++))
else
echo "❌ FAIL (status: $STATUS)"
((FAIL++))
fi
# Test 3: Health endpoint
echo -n "Test 3: Health endpoint... "
STATUS=$(curl -s -o /dev/null -w "%{http_code}" "https://coditect.ai/health")
if [ "$STATUS" = "200" ]; then
echo "✅ PASS"
((PASS++))
else
echo "❌ FAIL (status: $STATUS)"
((FAIL++))
fi
# Test 4: Consecutive handshakes (each handshake mints a fresh sid;
# this verifies both handshakes succeed, not that the sid persists)
echo -n "Test 4: Consecutive handshakes... "
SID1=$(curl -s "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling" | \
grep -o '"sid":"[^"]*"' | cut -d'"' -f4)
sleep 2
SID2=$(curl -s "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling" | \
grep -o '"sid":"[^"]*"' | cut -d'"' -f4)
if [ -n "$SID1" ] && [ -n "$SID2" ]; then
echo "✅ PASS"
((PASS++))
else
echo "❌ FAIL"
((FAIL++))
fi
echo ""
echo "=== Results ==="
echo "Passed: $PASS/4"
echo "Failed: $FAIL/4"
if [ $FAIL -eq 0 ]; then
echo "✅ All tests passed!"
exit 0
else
echo "❌ Some tests failed"
exit 1
fi
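Tests 1 and 4 depend on pulling the sid out of the Engine.IO open packet. That parsing can be isolated and checked against a canned payload before running the suite against the live endpoint (the sample sid below is made up):

```shell
# Extract the "sid" field from an Engine.IO open packet, e.g. '0{"sid":"...",...}'
extract_sid() {
  printf '%s' "$1" | grep -o '"sid":"[^"]*"' | cut -d'"' -f4
}

# Canned handshake payload for a dry run (not a live response)
SAMPLE='0{"sid":"Lbo5JLzTotvW3g2LAAAA","upgrades":["websocket"],"pingInterval":25000}'
SID=$(extract_sid "$SAMPLE")
echo "sid=$SID"
```

If this prints an empty sid against a live response, the endpoint is returning an error body rather than an open packet.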
Monitoring and Observability
Add Logging for Socket.IO Connections
# Add detailed logging to nginx (streamed from the local machine to avoid
# nested single-quote escaping inside kubectl exec)
cat <<'EOF' | kubectl exec -i -n coditect-app deployment/coditect-combined -- \
bash -c 'cat > /etc/nginx/conf.d/socketio-logging.conf && nginx -t && nginx -s reload'
log_format socketio_access '[$time_local] client=$remote_addr '
                           'status=$status request="$request" '
                           'upgrade=$http_upgrade connection=$http_connection '
                           'sid=$arg_sid transport=$arg_transport';
access_log /var/log/nginx/socketio-access.log socketio_access;
EOF
# Monitor Socket.IO connections in real-time
kubectl logs -n coditect-app -l app=coditect-combined -f | \
grep "socket.io"
Set Up Alerts
# Illustrative alert policy for Socket.IO 400 errors (sketch only — adapt
# to your alerting tooling, e.g. Cloud Monitoring via Config Connector
# or Terraform; the apiVersion/kind below are not a standard GKE CRD)
apiVersion: monitoring.googleapis.com/v1alpha1
kind: AlertPolicy
metadata:
  name: socketio-400-errors
spec:
  conditions:
  - displayName: High Socket.IO 400 Error Rate
    conditionThreshold:
      filter: |
        resource.type="k8s_pod"
        resource.labels.namespace_name="coditect-app"
        jsonPayload.request=~"socket.io"
        jsonPayload.status="400"
      comparison: COMPARISON_GT
      thresholdValue: 10
      duration: 300s
  notificationChannels:
  - <your-notification-channel>
Rollback Procedure
If issues arise after applying fixes:
#!/bin/bash
# Complete rollback script
set -euo pipefail
echo "=== Rolling back Socket.IO fixes ==="
# Find most recent backup
INGRESS_BACKUP=$(ls -t /tmp/ingress-backup-*.yaml | head -1)
BACKENDCONFIG_BACKUP=$(ls -t /tmp/backendconfig-backup-*.yaml | head -1)
echo "Restoring from backups:"
echo " Ingress: $INGRESS_BACKUP"
echo " BackendConfig: $BACKENDCONFIG_BACKUP"
# Restore Ingress
kubectl apply -f "$INGRESS_BACKUP"
# Restore BackendConfig
kubectl apply -f "$BACKENDCONFIG_BACKUP"
# Remove health endpoint (if added). Note: this execs into a single pod;
# repeat per pod if the snippet was injected into more than one
kubectl exec -n coditect-app deployment/coditect-combined -- bash -c '
sed -i "/# Health check endpoint/,/}/d" /etc/nginx/sites-available/default
nginx -t && nginx -s reload
'
echo "=== Rollback complete ==="
echo "Waiting for GKE to reconcile..."
sleep 180
echo "Testing Socket.IO after rollback..."
curl -s -o /dev/null -w "Status: %{http_code}\n" \
"https://coditect.ai/theia/socket.io/?EIO=4&transport=polling"
Success Criteria
- Socket.IO handshake succeeds from external browser (HTTP 200)
- WebSocket upgrade completes (HTTP 101)
- Health endpoint returns 200
- Session affinity maintains connection to same pod
- Long-lived connections survive >1 hour
- Connections persist through rolling updates
- No 400 errors in logs after 24 hours
Troubleshooting
Issue: Still getting 400 after fixes
Check:
# Verify WebSocket annotation applied
kubectl get ingress -n coditect-app coditect-production-ingress \
-o jsonpath='{.metadata.annotations.cloud\.google\.com/websocket-max-idle-timeout}'
# Check GKE ingress status
kubectl describe ingress -n coditect-app coditect-production-ingress
# Verify BackendConfig changes propagated
BACKEND=$(kubectl get ingress -n coditect-app coditect-production-ingress \
-o jsonpath='{.metadata.annotations.ingress\.kubernetes\.io/backends}' | \
jq -r 'keys[0]')
gcloud compute backend-services describe "$BACKEND" --global
Issue: Intermittent failures
Check session affinity:
# Make 10 requests from the same client IP
# (CLIENT_IP affinity keys on source address, not cookies)
for i in {1..10}; do
curl -s "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling" > /dev/null
done
# Check if same pod handled all
kubectl logs -n coditect-app -l app=coditect-combined --tail=20 | \
grep "socket.io" | awk '{print $1}' | sort | uniq -c
Issue: Health checks failing
Check endpoint:
# Test from inside pod
kubectl exec -n coditect-app deployment/coditect-combined -- \
curl -v http://localhost/health
# Test from external
curl -v https://coditect.ai/health
# Check backend health
gcloud compute backend-services get-health \
$(kubectl get ingress -n coditect-app coditect-production-ingress \
-o jsonpath='{.metadata.annotations.ingress\.kubernetes\.io/backends}' | \
jq -r 'keys[0]') \
--global
Support and Documentation
- GKE WebSocket Documentation: https://cloud.google.com/load-balancing/docs/https#websocket
- Socket.IO Protocol: https://socket.io/docs/v4/how-it-works/
- BackendConfig Reference: https://cloud.google.com/kubernetes-engine/docs/how-to/ingress-features
Contact: DevOps team for implementation assistance