Socket.IO Fix Implementation Guide

Target Issue: Socket.IO 400 errors from external requests (internal requests work)

Risk Level: MEDIUM (requires GKE load balancer reconfiguration)

Downtime: None expected (rolling changes)

Rollback Time: 2-5 minutes


Pre-Implementation Checklist

  • Diagnostics completed (run socketio-diagnostics.sh)
  • Current configuration backed up
  • kubectl access verified: kubectl get pods -n coditect-app
  • gcloud CLI authenticated: gcloud auth list
  • Maintenance window scheduled (optional, changes are non-disruptive)
  • Rollback procedure reviewed

Fix Priority Matrix

Based on diagnostic results, apply fixes in this order:

| Priority | Fix | Probability of Success | Risk | Time |
| --- | --- | --- | --- | --- |
| P0 | Add WebSocket annotation to Ingress | 85% | LOW | 5 min |
| P0 | Create health check endpoint | 70% | LOW | 5 min |
| P1 | Configure session affinity in BackendConfig | 60% | LOW | 10 min |
| P1 | Update timeout settings | 30% | LOW | 10 min |
| P2 | Reduce connection draining | 20% | MEDIUM | 10 min |

FIX #1: Add WebSocket Support to GKE Ingress [P0]

Problem: The GKE L7 load balancer supports WebSocket, but without an extended idle timeout it closes long-lived upgraded connections early, which surfaces as failed upgrades and 400 errors

Symptoms:

  • 400 errors on Socket.IO connections
  • Works internally but fails externally
  • Headers missing in nginx logs

Solution: Add WebSocket idle timeout annotation to Ingress

Implementation

# 1. Verify current annotations
kubectl get ingress -n coditect-app coditect-production-ingress \
-o jsonpath='{.metadata.annotations}' | jq .

# 2. Backup current ingress config
kubectl get ingress -n coditect-app coditect-production-ingress \
-o yaml > /tmp/ingress-backup-$(date +%Y%m%d-%H%M%S).yaml

# 3. Add WebSocket annotation (24 hour idle timeout)
kubectl annotate ingress -n coditect-app coditect-production-ingress \
cloud.google.com/websocket-max-idle-timeout="86400" \
--overwrite

# 4. Verify annotation applied
kubectl get ingress -n coditect-app coditect-production-ingress \
-o jsonpath='{.metadata.annotations.cloud\.google\.com/websocket-max-idle-timeout}'

# Expected output: 86400

Validation

# Wait 2-5 minutes for GKE to reconcile

# Test Socket.IO connection
curl -s -o /dev/null -w "Status: %{http_code}\n" \
"https://coditect.ai/theia/socket.io/?EIO=4&transport=polling"

# Expected: Status: 200 (was 400 before)

# Test WebSocket upgrade
curl -s -I "https://coditect.ai/theia/socket.io/?EIO=4&transport=websocket" \
-H "Upgrade: websocket" \
-H "Connection: Upgrade" \
-H "Sec-WebSocket-Version: 13" \
-H "Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==" | \
grep -E "(HTTP|Upgrade|Connection)"

# Expected: HTTP/1.1 101 Switching Protocols (or 200 if polling fallback works)
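
The handshake URL above recurs throughout this guide; a small helper keeps the query parameters consistent across tests. `sio_url` is a convenience introduced in this sketch, not part of the original commands:

```shell
# sio_url: build the Socket.IO handshake URL used throughout this guide.
# $1 = transport (polling or websocket); extra args become query parameters.
sio_url() {
  local transport=$1; shift
  local url="https://coditect.ai/theia/socket.io/?EIO=4&transport=${transport}"
  local p
  for p in "$@"; do
    url="${url}&${p}"
  done
  echo "$url"
}

sio_url polling test=1
# → https://coditect.ai/theia/socket.io/?EIO=4&transport=polling&test=1
```

Anywhere the literal URL appears below, `curl -s "$(sio_url polling)"` is equivalent.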

Rollback

# Remove annotation (the trailing dash deletes the key; --overwrite is not needed)
kubectl annotate ingress -n coditect-app coditect-production-ingress \
  cloud.google.com/websocket-max-idle-timeout-

# Or restore the most recent backup
kubectl apply -f "$(ls -t /tmp/ingress-backup-*.yaml | head -1)"

FIX #2: Create Health Check Endpoint [P0]

Problem: BackendConfig health check points to non-existent /health endpoint

Symptoms:

  • Backend marked as unhealthy
  • Intermittent connection failures
  • 503 errors during health checks

Solution: Create /health endpoint in nginx OR update BackendConfig path

Option A: Create /health Endpoint in nginx

# 1. Get current nginx config
POD=$(kubectl get pod -n coditect-app -l app=coditect-combined \
-o jsonpath='{.items[0].metadata.name}')

kubectl exec -n coditect-app "$POD" -- \
cat /etc/nginx/sites-available/default > /tmp/nginx-config-backup.txt

# 2. Create health endpoint via ConfigMap or direct injection
cat <<'EOF' > /tmp/health-endpoint.conf
# Health check endpoint for GKE load balancer
location /health {
    access_log off;
    return 200 "healthy\n";
    add_header Content-Type text/plain;
}
EOF

# 3. Apply to all pods (use ConfigMap for persistence)
kubectl create configmap -n coditect-app nginx-health-config \
--from-file=health.conf=/tmp/health-endpoint.conf \
--dry-run=client -o yaml | kubectl apply -f -

# 4. Mount ConfigMap in deployment (requires deployment update)
# OR inject directly into running pods (temporary)
# NOTE: a location block must live inside server{}. This inserts /health
# before the server block's final closing brace (assumed to be the last
# unindented "}" in the file); appending past it would fail nginx -t.
kubectl exec -i -n coditect-app "$POD" -- bash -s <<'SCRIPT'
set -e
cfg=/etc/nginx/sites-available/default
awk '
  { lines[NR] = $0 }
  $0 ~ /^}/ { last = NR }
  END {
    for (i = 1; i <= NR; i++) {
      if (i == last) {
        print "    # Health check endpoint for GKE load balancer"
        print "    location /health {"
        print "        access_log off;"
        print "        return 200 \"healthy\\n\";"
        print "        add_header Content-Type text/plain;"
        print "    }"
      }
      print lines[i]
    }
  }' "$cfg" > "$cfg.new" && mv "$cfg.new" "$cfg"
SCRIPT

# 5. Test configuration and reload
kubectl exec -n coditect-app "$POD" -- nginx -t
kubectl exec -n coditect-app "$POD" -- nginx -s reload

# 6. Verify endpoint works
kubectl exec -n coditect-app "$POD" -- \
curl -s -o /dev/null -w "%{http_code}" http://localhost/health

# Expected: 200

Option B: Update BackendConfig Path (Alternative)

# Use existing endpoint instead (e.g., /theia/)
kubectl patch backendconfig -n coditect-app coditect-backend-config \
  --type=merge -p '
spec:
  healthCheck:
    requestPath: /theia/
    checkIntervalSec: 15
    timeoutSec: 5
    healthyThreshold: 2
    unhealthyThreshold: 5
'

# Wait for GKE to apply
sleep 120

Validation

# Check endpoint responds
curl -s https://coditect.ai/health

# Expected: healthy

# Check GKE backend health
BACKEND=$(kubectl get ingress -n coditect-app coditect-production-ingress \
-o jsonpath='{.metadata.annotations.ingress\.kubernetes\.io/backends}' | \
jq -r 'keys[0]')

gcloud compute backend-services get-health "$BACKEND" --global

# Expected: All backends HEALTHY

Rollback

# Restore original nginx config (the backup lives on your workstation,
# so pipe it into the pod rather than expanding it inside the pod)
kubectl exec -i -n coditect-app "$POD" -- \
  bash -c 'cat > /etc/nginx/sites-available/default' < /tmp/nginx-config-backup.txt

kubectl exec -n coditect-app "$POD" -- nginx -t
kubectl exec -n coditect-app "$POD" -- nginx -s reload

FIX #3: Configure Session Affinity [P1]

Problem: Requests distributed across multiple pods break Socket.IO sessions

Symptoms:

  • Intermittent 400 errors
  • Connection drops after handshake
  • Different pods handling handshake vs upgrade

Solution: Configure CLIENT_IP session affinity in BackendConfig

Implementation

# 1. Backup current BackendConfig
kubectl get backendconfig -n coditect-app coditect-backend-config \
-o yaml > /tmp/backendconfig-backup-$(date +%Y%m%d-%H%M%S).yaml

# 2. Update BackendConfig with session affinity
# (affinityCookieTtlSec is only honored for GENERATED_COOKIE affinity,
# but it is harmless to store alongside CLIENT_IP)
kubectl patch backendconfig -n coditect-app coditect-backend-config \
  --type=merge -p '
spec:
  sessionAffinity:
    affinityType: "CLIENT_IP"
    affinityCookieTtlSec: 10800
'

# 3. Verify changes applied
kubectl get backendconfig -n coditect-app coditect-backend-config \
-o jsonpath='{.spec.sessionAffinity}'

# Expected: {"affinityType":"CLIENT_IP","affinityCookieTtlSec":10800}

# 4. Wait for GKE to reconcile (2-5 minutes)
echo "Waiting for GKE to apply session affinity..."
sleep 180
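
Instead of a fixed `sleep 180`, the wait can poll until the change is visible. `wait_for` is a sketch of such a helper, not part of the guide's scripts; pass it any command that succeeds once reconciliation is done:

```shell
# wait_for: retry a command until it succeeds or the deadline passes.
# Usage: wait_for <timeout_sec> <interval_sec> <command...>
wait_for() {
  local timeout=$1 interval=$2
  shift 2
  local start
  start=$(date +%s)
  until "$@"; do
    if [ $(( $(date +%s) - start )) -ge "$timeout" ]; then
      echo "wait_for: timed out after ${timeout}s" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Example: wait up to 5 minutes for the handshake to start returning 200
# wait_for 300 10 bash -c '[ "$(curl -s -o /dev/null -w "%{http_code}" \
#   "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling")" = 200 ]'
```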

Validation

# Make 5 sequential requests and check whether they hit the same backend
for i in {1..5}; do
  curl -s "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling&test=$i" \
    -c /tmp/cookies.txt -b /tmp/cookies.txt > /dev/null
  sleep 1
done

# Check pod logs to see if same pod handled all requests
kubectl logs -n coditect-app -l app=coditect-combined --tail=20 | \
grep "socket.io" | awk '{print $1}' | sort | uniq -c

# Expected: All requests from same client IP hit same pod
# Output example:
# 5 10.128.0.12 (all requests to same pod)
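
The `uniq -c` output above can be checked mechanically by counting how many distinct pods appear. `distinct_pods` is a helper of this sketch, not a command from the guide:

```shell
# distinct_pods: count distinct pods in "sort | uniq -c" output
# (one "count identifier" pair per line); 1 means affinity is holding.
distinct_pods() {
  awk 'NF { n++ } END { print n + 0 }'
}

printf '      5 10.128.0.12\n' | distinct_pods
# → 1  (all requests landed on one pod)
```

Piping the log query above through `distinct_pods` should print 1 when session affinity is working.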

Rollback

# Remove session affinity
kubectl patch backendconfig -n coditect-app coditect-backend-config \
  --type=merge -p '
spec:
  sessionAffinity: null
'

# Or restore the most recent backup
kubectl apply -f "$(ls -t /tmp/backendconfig-backup-*.yaml | head -1)"

FIX #4: Optimize Timeout Settings [P1]

Problem: Backend timeout may be too short for long-lived WebSocket connections

Symptoms:

  • Connections drop after 1 hour
  • No reconnection attempts
  • Silent failures in long sessions

Solution: Increase backend timeout to 24 hours

Implementation

# 1. Backup current BackendConfig (if not already done)
kubectl get backendconfig -n coditect-app coditect-backend-config \
-o yaml > /tmp/backendconfig-backup-$(date +%Y%m%d-%H%M%S).yaml

# 2. Update timeout to 24 hours (86400 seconds)
kubectl patch backendconfig -n coditect-app coditect-backend-config \
  --type=merge -p '
spec:
  timeoutSec: 86400
'

# 3. Verify changes
kubectl get backendconfig -n coditect-app coditect-backend-config \
-o jsonpath='{.spec.timeoutSec}'

# Expected: 86400

# 4. Wait for GKE to apply
sleep 180

# 5. Verify GKE backend service updated
BACKEND=$(kubectl get ingress -n coditect-app coditect-production-ingress \
-o jsonpath='{.metadata.annotations.ingress\.kubernetes\.io/backends}' | \
jq -r 'keys[0]')

gcloud compute backend-services describe "$BACKEND" --global \
--format="value(timeoutSec)"

# Expected: 86400

Validation

# Long-running connection test (runs in background, logs to a file)
(
  echo "Starting long-lived connection test..."
  START_TIME=$(date +%s)

  # Poll the handshake endpoint; -f makes curl fail on HTTP errors
  # (without it, a 400 response would still count as "alive")
  while true; do
    CURRENT_TIME=$(date +%s)
    ELAPSED=$((CURRENT_TIME - START_TIME))

    if curl -sf "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling" \
        > /dev/null 2>&1; then
      echo "[$ELAPSED seconds] Connection still alive"
    else
      echo "[$ELAPSED seconds] Connection FAILED"
      break
    fi

    sleep 300  # Check every 5 minutes

    # Stop after 2 hours
    if [ "$ELAPSED" -gt 7200 ]; then
      echo "Test completed: Connection survived 2 hours"
      break
    fi
  done
) > /tmp/long-connection-test.log 2>&1 &

echo "Long-lived connection test started in background"
echo "Check progress with: tail -f /tmp/long-connection-test.log"

Rollback

# Restore original timeout
kubectl patch backendconfig -n coditect-app coditect-backend-config \
  --type=merge -p '
spec:
  timeoutSec: 3600
'

# Or restore the most recent backup
kubectl apply -f "$(ls -t /tmp/backendconfig-backup-*.yaml | head -1)"

FIX #5: Reduce Connection Draining [P2]

Problem: Long connection draining kills active WebSocket connections during updates

Symptoms:

  • Disconnections during deployments
  • Slow rolling updates
  • Pod restarts drop connections

Solution: Reduce draining timeout to 30 seconds

Implementation

# 1. Update connection draining
kubectl patch backendconfig -n coditect-app coditect-backend-config \
  --type=merge -p '
spec:
  connectionDraining:
    drainingTimeoutSec: 30
'

# 2. Verify changes
kubectl get backendconfig -n coditect-app coditect-backend-config \
-o jsonpath='{.spec.connectionDraining.drainingTimeoutSec}'

# Expected: 30

# 3. Wait for GKE to apply
sleep 180

Validation

# Test during a rolling update. A plain polling request returns immediately,
# so watching a background curl PID proves nothing; instead, sample the
# endpoint repeatedly while pods cycle and count failures.

# 1. Trigger rolling update
kubectl rollout restart deployment -n coditect-app coditect-combined

# 2. Sample the endpoint every 2 seconds during the rollout
FAILURES=0
for i in {1..30}; do
  curl -sf "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling" \
    > /dev/null || FAILURES=$((FAILURES + 1))
  sleep 2
done

# 3. Report
if [ "$FAILURES" -eq 0 ]; then
  echo "No failed requests during rolling update"
else
  echo "$FAILURES/30 requests failed during rolling update"
fi

Rollback

# Restore original draining timeout
kubectl patch backendconfig -n coditect-app coditect-backend-config \
  --type=merge -p '
spec:
  connectionDraining:
    drainingTimeoutSec: 300
'

COMPREHENSIVE FIX: Apply All Changes at Once

For maximum efficiency, apply all fixes simultaneously:

#!/bin/bash
# Comprehensive Socket.IO fix script

set -euo pipefail

NAMESPACE="coditect-app"
INGRESS="coditect-production-ingress"
BACKENDCONFIG="coditect-backend-config"
DEPLOYMENT="coditect-combined"

echo "=== Applying comprehensive Socket.IO fixes ==="

# Backup current configurations
echo "Creating backups..."
kubectl get ingress -n "$NAMESPACE" "$INGRESS" \
-o yaml > "/tmp/ingress-backup-$(date +%Y%m%d-%H%M%S).yaml"

kubectl get backendconfig -n "$NAMESPACE" "$BACKENDCONFIG" \
-o yaml > "/tmp/backendconfig-backup-$(date +%Y%m%d-%H%M%S).yaml"

# Fix #1: Add WebSocket annotation
echo "Adding WebSocket support annotation..."
kubectl annotate ingress -n "$NAMESPACE" "$INGRESS" \
cloud.google.com/websocket-max-idle-timeout="86400" \
--overwrite

# Fix #2: Create health endpoint
echo "Creating health endpoint..."
POD=$(kubectl get pod -n "$NAMESPACE" -l "app=$DEPLOYMENT" \
-o jsonpath='{.items[0].metadata.name}')

kubectl exec -n "$NAMESPACE" "$POD" -- bash -c '
if ! grep -q "location /health" /etc/nginx/sites-available/default; then
cat >> /etc/nginx/sites-available/default <<EOF

# Health check endpoint for GKE load balancer
location /health {
access_log off;
return 200 "healthy\\n";
add_header Content-Type text/plain;
}
EOF
nginx -t && nginx -s reload
fi
'

# Fix #3-5: Update BackendConfig (all settings)
echo "Updating BackendConfig..."
cat <<EOF | kubectl apply -f -
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: $BACKENDCONFIG
  namespace: $NAMESPACE
spec:
  # Extended timeout for long-lived WebSocket connections
  timeoutSec: 86400  # 24 hours

  # Minimal connection draining
  connectionDraining:
    drainingTimeoutSec: 30

  # Session affinity for Socket.IO
  sessionAffinity:
    affinityType: "CLIENT_IP"
    affinityCookieTtlSec: 10800  # 3 hours

  # Health check configuration
  healthCheck:
    checkIntervalSec: 15
    timeoutSec: 5
    healthyThreshold: 2
    unhealthyThreshold: 5
    type: HTTP
    requestPath: /health
    port: 80

  # Enable logging for debugging
  logging:
    enable: true
    sampleRate: 1.0
EOF

echo "=== Changes applied ==="
echo "Waiting 5 minutes for GKE to reconcile..."
sleep 300

echo "=== Testing Socket.IO connection ==="
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
  "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling")

if [ "$HTTP_CODE" = "200" ]; then
  echo "✅ SUCCESS: Socket.IO now returns 200"
else
  echo "❌ FAILED: Socket.IO still returns $HTTP_CODE"
  echo "Check logs for more details"
fi

echo "=== Fix complete ==="
echo "Backups saved to /tmp/*-backup-*.yaml"

Post-Implementation Validation

Comprehensive Test Suite

#!/bin/bash
# Socket.IO validation test suite

echo "=== Socket.IO Validation Test Suite ==="

PASS=0
FAIL=0

# Test 1: Polling transport
echo -n "Test 1: Polling transport... "
RESPONSE=$(curl -s "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling")
if echo "$RESPONSE" | grep -q '"sid"'; then
  echo "✅ PASS"
  ((PASS++))
else
  echo "❌ FAIL"
  ((FAIL++))
fi

# Test 2: WebSocket upgrade
echo -n "Test 2: WebSocket upgrade... "
STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
  "https://coditect.ai/theia/socket.io/?EIO=4&transport=websocket" \
  -H "Upgrade: websocket" \
  -H "Connection: Upgrade")
if [ "$STATUS" = "101" ] || [ "$STATUS" = "200" ]; then
  echo "✅ PASS"
  ((PASS++))
else
  echo "❌ FAIL (status: $STATUS)"
  ((FAIL++))
fi

# Test 3: Health endpoint
echo -n "Test 3: Health endpoint... "
STATUS=$(curl -s -o /dev/null -w "%{http_code}" "https://coditect.ai/health")
if [ "$STATUS" = "200" ]; then
  echo "✅ PASS"
  ((PASS++))
else
  echo "❌ FAIL (status: $STATUS)"
  ((FAIL++))
fi

# Test 4: Repeated handshakes (each should return a session id; true
# session affinity is verified via pod logs, as in Fix #3)
echo -n "Test 4: Repeated handshakes... "
SID1=$(curl -s "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling" | \
  grep -o '"sid":"[^"]*"' | cut -d'"' -f4)
sleep 2
SID2=$(curl -s "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling" | \
  grep -o '"sid":"[^"]*"' | cut -d'"' -f4)

if [ -n "$SID1" ] && [ -n "$SID2" ]; then
  echo "✅ PASS"
  ((PASS++))
else
  echo "❌ FAIL"
  ((FAIL++))
fi

echo ""
echo "=== Results ==="
echo "Passed: $PASS/4"
echo "Failed: $FAIL/4"

if [ $FAIL -eq 0 ]; then
  echo "✅ All tests passed!"
  exit 0
else
  echo "❌ Some tests failed"
  exit 1
fi
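
The sid parsing in Tests 1 and 4 can be factored into one helper, shown here against a sample Engine.IO v4 handshake body (the payload below is illustrative, not captured from the live service):

```shell
# extract_sid: pull the Engine.IO session id out of a polling handshake body.
extract_sid() {
  grep -o '"sid":"[^"]*"' <<< "$1" | cut -d'"' -f4
}

# Sample handshake payload in the Engine.IO v4 shape
BODY='0{"sid":"lv_VI97HAXpY6yYWAAAC","upgrades":["websocket"],"pingInterval":25000}'
extract_sid "$BODY"
# → lv_VI97HAXpY6yYWAAAC
```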

Monitoring and Observability

Add Logging for Socket.IO Connections

# Add detailed logging to nginx (piped via stdin so quoting stays sane)
POD=$(kubectl get pod -n coditect-app -l app=coditect-combined \
  -o jsonpath='{.items[0].metadata.name}')

kubectl exec -i -n coditect-app "$POD" -- \
  bash -c 'cat > /etc/nginx/conf.d/socketio-logging.conf' <<'EOF'
log_format socketio_access '[$time_local] client=$remote_addr '
                           'status=$status request="$request" '
                           'upgrade=$http_upgrade connection=$http_connection '
                           'sid=$arg_sid transport=$arg_transport';

access_log /var/log/nginx/socketio-access.log socketio_access;
EOF

kubectl exec -n coditect-app "$POD" -- nginx -t
kubectl exec -n coditect-app "$POD" -- nginx -s reload

# Monitor Socket.IO connections in real-time
kubectl logs -n coditect-app -l app=coditect-combined -f | \
grep "socket.io"

Set Up Alerts

# GKE monitoring alert for Socket.IO errors
apiVersion: monitoring.googleapis.com/v1alpha1
kind: AlertPolicy
metadata:
  name: socketio-400-errors
spec:
  conditions:
    - displayName: High Socket.IO 400 Error Rate
      conditionThreshold:
        filter: |
          resource.type="k8s_pod"
          resource.labels.namespace_name="coditect-app"
          jsonPayload.request=~"socket.io"
          jsonPayload.status="400"
        comparison: COMPARISON_GT
        thresholdValue: 10
        duration: 300s
  notificationChannels:
    - <your-notification-channel>

Rollback Procedure

If issues arise after applying fixes:

#!/bin/bash
# Complete rollback script

set -euo pipefail

echo "=== Rolling back Socket.IO fixes ==="

# Find most recent backup
INGRESS_BACKUP=$(ls -t /tmp/ingress-backup-*.yaml | head -1)
BACKENDCONFIG_BACKUP=$(ls -t /tmp/backendconfig-backup-*.yaml | head -1)

echo "Restoring from backups:"
echo " Ingress: $INGRESS_BACKUP"
echo " BackendConfig: $BACKENDCONFIG_BACKUP"

# Restore Ingress
kubectl apply -f "$INGRESS_BACKUP"

# Restore BackendConfig
kubectl apply -f "$BACKENDCONFIG_BACKUP"

# Remove health endpoint (if added)
kubectl exec -n coditect-app deployment/coditect-combined -- bash -c '
sed -i "/# Health check endpoint/,/}/d" /etc/nginx/sites-available/default
nginx -t && nginx -s reload
'

echo "=== Rollback complete ==="
echo "Waiting for GKE to reconcile..."
sleep 180

echo "Testing Socket.IO after rollback..."
curl -s -o /dev/null -w "Status: %{http_code}\n" \
"https://coditect.ai/theia/socket.io/?EIO=4&transport=polling"

Success Criteria

✅ Socket.IO handshake succeeds from an external browser (HTTP 200)
✅ WebSocket upgrade completes (HTTP 101)
✅ Health endpoint returns 200
✅ Session affinity keeps a client on the same pod
✅ Long-lived connections survive >1 hour
✅ Connections persist through rolling updates
✅ No 400 errors in logs after 24 hours
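
The last criterion can be checked mechanically against the socketio access log from the Monitoring section. `count_400s` is a sketch that assumes that log's `status=` field format:

```shell
# count_400s: count Socket.IO requests with status=400 in access-log lines
# read from stdin (field format from the socketio_access log_format).
count_400s() {
  grep 'socket.io' | grep -c 'status=400'
}

# Sample log lines (illustrative, in the socketio_access format)
printf '%s\n' \
  'client=1.2.3.4 status=400 request="GET /theia/socket.io/?EIO=4 HTTP/1.1"' \
  'client=1.2.3.4 status=200 request="GET /theia/socket.io/?EIO=4 HTTP/1.1"' \
  | count_400s
# → 1
```

Tailing the pod's /var/log/nginx/socketio-access.log through `count_400s` should report 0 once the fixes have settled.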


Troubleshooting

Issue: Still getting 400 after fixes

Check:

# Verify WebSocket annotation applied
kubectl get ingress -n coditect-app coditect-production-ingress \
-o jsonpath='{.metadata.annotations.cloud\.google\.com/websocket-max-idle-timeout}'

# Check GKE ingress status
kubectl describe ingress -n coditect-app coditect-production-ingress

# Verify BackendConfig changes propagated
BACKEND=$(kubectl get ingress -n coditect-app coditect-production-ingress \
-o jsonpath='{.metadata.annotations.ingress\.kubernetes\.io/backends}' | \
jq -r 'keys[0]')

gcloud compute backend-services describe "$BACKEND" --global

Issue: Intermittent failures

Check session affinity:

# Make 10 requests with same client
for i in {1..10}; do
curl -s "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling" \
-c /tmp/cookies.txt -b /tmp/cookies.txt > /dev/null
done

# Check if same pod handled all
kubectl logs -n coditect-app -l app=coditect-combined --tail=20 | \
grep "socket.io" | awk '{print $1}' | sort | uniq -c

Issue: Health checks failing

Check endpoint:

# Test from inside pod
kubectl exec -n coditect-app deployment/coditect-combined -- \
curl -v http://localhost/health

# Test from external
curl -v https://coditect.ai/health

# Check backend health
gcloud compute backend-services get-health \
$(kubectl get ingress -n coditect-app coditect-production-ingress \
-o jsonpath='{.metadata.annotations.ingress\.kubernetes\.io/backends}' | \
jq -r 'keys[0]') \
--global

Support and Documentation

Contact: DevOps team for implementation assistance