Socket.IO Fix Implementation Guide

Target Issue: Socket.IO 400 errors from external requests (internal requests work)

Risk Level: MEDIUM (requires GKE load balancer reconfiguration)

Downtime: None expected (rolling changes)

Rollback Time: 2-5 minutes


Pre-Implementation Checklist

  • Diagnostics completed (run socketio-diagnostics.sh)
  • Current configuration backed up
  • kubectl access verified: kubectl get pods -n coditect-app
  • gcloud CLI authenticated: gcloud auth list
  • Maintenance window scheduled (optional, changes are non-disruptive)
  • Rollback procedure reviewed

Fix Priority Matrix

Based on diagnostic results, apply fixes in this order:

| Priority | Fix | Probability of Success | Risk | Time |
| --- | --- | --- | --- | --- |
| P0 | Add WebSocket annotation to Ingress | 85% | LOW | 5 min |
| P0 | Create health check endpoint | 70% | LOW | 5 min |
| P1 | Configure session affinity in BackendConfig | 60% | LOW | 10 min |
| P1 | Update timeout settings | 30% | LOW | 10 min |
| P2 | Reduce connection draining | 20% | MEDIUM | 10 min |

FIX #1: Add WebSocket Support to GKE Ingress [P0]

Problem: The GKE L7 load balancer supports WebSocket, but without an extended idle timeout it closes long-lived upgraded connections early, which surfaces as failed upgrades and 400 errors

Symptoms:

  • 400 errors on Socket.IO connections
  • Works internally but fails externally
  • Headers missing in nginx logs

Solution: Add WebSocket idle timeout annotation to Ingress

Implementation

# 1. Verify current annotations
kubectl get ingress -n coditect-app coditect-production-ingress \
-o jsonpath='{.metadata.annotations}' | jq .

# 2. Backup current ingress config
kubectl get ingress -n coditect-app coditect-production-ingress \
-o yaml > /tmp/ingress-backup-$(date +%Y%m%d-%H%M%S).yaml

# 3. Add WebSocket annotation (24 hour idle timeout)
kubectl annotate ingress -n coditect-app coditect-production-ingress \
cloud.google.com/websocket-max-idle-timeout="86400" \
--overwrite

# 4. Verify annotation applied
kubectl get ingress -n coditect-app coditect-production-ingress \
-o jsonpath='{.metadata.annotations.cloud\.google\.com/websocket-max-idle-timeout}'

# Expected output: 86400

Validation

# Wait 2-5 minutes for GKE to reconcile

# Test Socket.IO connection
curl -s -o /dev/null -w "Status: %{http_code}\n" \
"https://coditect.ai/theia/socket.io/?EIO=4&transport=polling"

# Expected: Status: 200 (was 400 before)

# Test WebSocket upgrade
curl -s -I "https://coditect.ai/theia/socket.io/?EIO=4&transport=websocket" \
-H "Upgrade: websocket" \
-H "Connection: Upgrade" \
-H "Sec-WebSocket-Version: 13" \
-H "Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==" | \
grep -E "(HTTP|Upgrade|Connection)"

# Expected: HTTP/1.1 101 Switching Protocols (or 200 if polling fallback works)
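
The handshake URL above recurs throughout this guide; a small helper keeps the query parameters consistent across tests. `sio_url` is a convenience introduced in this sketch, not part of the original commands:

```shell
# sio_url: build the Socket.IO handshake URL used throughout this guide.
# $1 = transport (polling or websocket); extra args become query parameters.
sio_url() {
  local transport=$1; shift
  local url="https://coditect.ai/theia/socket.io/?EIO=4&transport=${transport}"
  local p
  for p in "$@"; do
    url="${url}&${p}"
  done
  echo "$url"
}

sio_url polling test=1
# → https://coditect.ai/theia/socket.io/?EIO=4&transport=polling&test=1
```

Anywhere the literal URL appears below, `curl -s "$(sio_url polling)"` is equivalent.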

Rollback

# Remove annotation (the trailing dash deletes the key; --overwrite is not needed)
kubectl annotate ingress -n coditect-app coditect-production-ingress \
  cloud.google.com/websocket-max-idle-timeout-

# Or restore the most recent backup
kubectl apply -f "$(ls -t /tmp/ingress-backup-*.yaml | head -1)"

FIX #2: Create Health Check Endpoint [P0]

Problem: BackendConfig health check points to non-existent /health endpoint

Symptoms:

  • Backend marked as unhealthy
  • Intermittent connection failures
  • 503 errors during health checks

Solution: Create /health endpoint in nginx OR update BackendConfig path

Option A: Create /health Endpoint in nginx

# 1. Get current nginx config
POD=$(kubectl get pod -n coditect-app -l app=coditect-combined \
-o jsonpath='{.items[0].metadata.name}')

kubectl exec -n coditect-app "$POD" -- \
cat /etc/nginx/sites-available/default > /tmp/nginx-config-backup.txt

# 2. Create health endpoint via ConfigMap or direct injection
cat <<'EOF' > /tmp/health-endpoint.conf
# Health check endpoint for GKE load balancer
location /health {
    access_log off;
    return 200 "healthy\n";
    add_header Content-Type text/plain;
}
EOF

# 3. Apply to all pods (use ConfigMap for persistence)
kubectl create configmap -n coditect-app nginx-health-config \
--from-file=health.conf=/tmp/health-endpoint.conf \
--dry-run=client -o yaml | kubectl apply -f -

# 4. Mount ConfigMap in deployment (requires deployment update)
# OR inject directly into running pods (temporary)
# NOTE: a location block must live inside server{}. This inserts /health
# before the server block's final closing brace (assumed to be the last
# unindented "}" in the file); appending past it would fail nginx -t.
kubectl exec -i -n coditect-app "$POD" -- bash -s <<'SCRIPT'
set -e
cfg=/etc/nginx/sites-available/default
awk '
  { lines[NR] = $0 }
  $0 ~ /^}/ { last = NR }
  END {
    for (i = 1; i <= NR; i++) {
      if (i == last) {
        print "    # Health check endpoint for GKE load balancer"
        print "    location /health {"
        print "        access_log off;"
        print "        return 200 \"healthy\\n\";"
        print "        add_header Content-Type text/plain;"
        print "    }"
      }
      print lines[i]
    }
  }' "$cfg" > "$cfg.new" && mv "$cfg.new" "$cfg"
SCRIPT

# 5. Test configuration and reload
kubectl exec -n coditect-app "$POD" -- nginx -t
kubectl exec -n coditect-app "$POD" -- nginx -s reload

# 6. Verify endpoint works
kubectl exec -n coditect-app "$POD" -- \
curl -s -o /dev/null -w "%{http_code}" http://localhost/health

# Expected: 200

Option B: Update BackendConfig Path (Alternative)

# Use existing endpoint instead (e.g., /theia/)
kubectl patch backendconfig -n coditect-app coditect-backend-config \
  --type=merge -p '
spec:
  healthCheck:
    requestPath: /theia/
    checkIntervalSec: 15
    timeoutSec: 5
    healthyThreshold: 2
    unhealthyThreshold: 5
'

# Wait for GKE to apply
sleep 120

Validation

# Check endpoint responds
curl -s https://coditect.ai/health

# Expected: healthy

# Check GKE backend health
BACKEND=$(kubectl get ingress -n coditect-app coditect-production-ingress \
-o jsonpath='{.metadata.annotations.ingress\.kubernetes\.io/backends}' | \
jq -r 'keys[0]')

gcloud compute backend-services get-health "$BACKEND" --global

# Expected: All backends HEALTHY

Rollback

# Restore original nginx config (the backup lives on your workstation,
# so pipe it into the pod rather than expanding it inside the pod)
kubectl exec -i -n coditect-app "$POD" -- \
  bash -c 'cat > /etc/nginx/sites-available/default' < /tmp/nginx-config-backup.txt

kubectl exec -n coditect-app "$POD" -- nginx -t
kubectl exec -n coditect-app "$POD" -- nginx -s reload

FIX #3: Configure Session Affinity [P1]

Problem: Requests distributed across multiple pods break Socket.IO sessions

Symptoms:

  • Intermittent 400 errors
  • Connection drops after handshake
  • Different pods handling handshake vs upgrade

Solution: Configure CLIENT_IP session affinity in BackendConfig

Implementation

# 1. Backup current BackendConfig
kubectl get backendconfig -n coditect-app coditect-backend-config \
-o yaml > /tmp/backendconfig-backup-$(date +%Y%m%d-%H%M%S).yaml

# 2. Update BackendConfig with session affinity
# (affinityCookieTtlSec is only honored for GENERATED_COOKIE affinity,
# but it is harmless to store alongside CLIENT_IP)
kubectl patch backendconfig -n coditect-app coditect-backend-config \
  --type=merge -p '
spec:
  sessionAffinity:
    affinityType: "CLIENT_IP"
    affinityCookieTtlSec: 10800
'

# 3. Verify changes applied
kubectl get backendconfig -n coditect-app coditect-backend-config \
-o jsonpath='{.spec.sessionAffinity}'

# Expected: {"affinityType":"CLIENT_IP","affinityCookieTtlSec":10800}

# 4. Wait for GKE to reconcile (2-5 minutes)
echo "Waiting for GKE to apply session affinity..."
sleep 180
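
Instead of a fixed `sleep 180`, the wait can poll until the change is visible. `wait_for` is a sketch of such a helper, not part of the guide's scripts; pass it any command that succeeds once reconciliation is done:

```shell
# wait_for: retry a command until it succeeds or the deadline passes.
# Usage: wait_for <timeout_sec> <interval_sec> <command...>
wait_for() {
  local timeout=$1 interval=$2
  shift 2
  local start
  start=$(date +%s)
  until "$@"; do
    if [ $(( $(date +%s) - start )) -ge "$timeout" ]; then
      echo "wait_for: timed out after ${timeout}s" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Example: wait up to 5 minutes for the handshake to start returning 200
# wait_for 300 10 bash -c '[ "$(curl -s -o /dev/null -w "%{http_code}" \
#   "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling")" = 200 ]'
```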

Validation

# Make 5 sequential requests and check whether they hit the same backend
for i in {1..5}; do
  curl -s "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling&test=$i" \
    -c /tmp/cookies.txt -b /tmp/cookies.txt > /dev/null
  sleep 1
done

# Check pod logs to see if same pod handled all requests
kubectl logs -n coditect-app -l app=coditect-combined --tail=20 | \
grep "socket.io" | awk '{print $1}' | sort | uniq -c

# Expected: All requests from same client IP hit same pod
# Output example:
# 5 10.128.0.12 (all requests to same pod)
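
The `uniq -c` output above can be checked mechanically by counting how many distinct pods appear. `distinct_pods` is a helper of this sketch, not a command from the guide:

```shell
# distinct_pods: count distinct pods in "sort | uniq -c" output
# (one "count identifier" pair per line); 1 means affinity is holding.
distinct_pods() {
  awk 'NF { n++ } END { print n + 0 }'
}

printf '      5 10.128.0.12\n' | distinct_pods
# → 1  (all requests landed on one pod)
```

Piping the log query above through `distinct_pods` should print 1 when session affinity is working.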

Rollback

# Remove session affinity
kubectl patch backendconfig -n coditect-app coditect-backend-config \
  --type=merge -p '
spec:
  sessionAffinity: null
'

# Or restore the most recent backup
kubectl apply -f "$(ls -t /tmp/backendconfig-backup-*.yaml | head -1)"

FIX #4: Optimize Timeout Settings [P1]

Problem: Backend timeout may be too short for long-lived WebSocket connections

Symptoms:

  • Connections drop after 1 hour
  • No reconnection attempts
  • Silent failures in long sessions

Solution: Increase backend timeout to 24 hours

Implementation

# 1. Backup current BackendConfig (if not already done)
kubectl get backendconfig -n coditect-app coditect-backend-config \
-o yaml > /tmp/backendconfig-backup-$(date +%Y%m%d-%H%M%S).yaml

# 2. Update timeout to 24 hours (86400 seconds)
kubectl patch backendconfig -n coditect-app coditect-backend-config \
  --type=merge -p '
spec:
  timeoutSec: 86400
'

# 3. Verify changes
kubectl get backendconfig -n coditect-app coditect-backend-config \
-o jsonpath='{.spec.timeoutSec}'

# Expected: 86400

# 4. Wait for GKE to apply
sleep 180

# 5. Verify GKE backend service updated
BACKEND=$(kubectl get ingress -n coditect-app coditect-production-ingress \
-o jsonpath='{.metadata.annotations.ingress\.kubernetes\.io/backends}' | \
jq -r 'keys[0]')

gcloud compute backend-services describe "$BACKEND" --global \
--format="value(timeoutSec)"

# Expected: 86400

Validation

# Long-running connection test (runs in background, logs to a file)
(
  echo "Starting long-lived connection test..."
  START_TIME=$(date +%s)

  # Poll the handshake endpoint; -f makes curl fail on HTTP errors
  # (without it, a 400 response would still count as "alive")
  while true; do
    CURRENT_TIME=$(date +%s)
    ELAPSED=$((CURRENT_TIME - START_TIME))

    if curl -sf "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling" \
        > /dev/null 2>&1; then
      echo "[$ELAPSED seconds] Connection still alive"
    else
      echo "[$ELAPSED seconds] Connection FAILED"
      break
    fi

    sleep 300  # Check every 5 minutes

    # Stop after 2 hours
    if [ "$ELAPSED" -gt 7200 ]; then
      echo "Test completed: Connection survived 2 hours"
      break
    fi
  done
) > /tmp/long-connection-test.log 2>&1 &

echo "Long-lived connection test started in background"
echo "Check progress with: tail -f /tmp/long-connection-test.log"

Rollback

# Restore original timeout
kubectl patch backendconfig -n coditect-app coditect-backend-config \
  --type=merge -p '
spec:
  timeoutSec: 3600
'

# Or restore the most recent backup
kubectl apply -f "$(ls -t /tmp/backendconfig-backup-*.yaml | head -1)"

FIX #5: Reduce Connection Draining [P2]

Problem: Long connection draining kills active WebSocket connections during updates

Symptoms:

  • Disconnections during deployments
  • Slow rolling updates
  • Pod restarts drop connections

Solution: Reduce draining timeout to 30 seconds

Implementation

# 1. Update connection draining
kubectl patch backendconfig -n coditect-app coditect-backend-config \
  --type=merge -p '
spec:
  connectionDraining:
    drainingTimeoutSec: 30
'

# 2. Verify changes
kubectl get backendconfig -n coditect-app coditect-backend-config \
-o jsonpath='{.spec.connectionDraining.drainingTimeoutSec}'

# Expected: 30

# 3. Wait for GKE to apply
sleep 180

Validation

# Test during a rolling update. A plain polling request returns immediately,
# so watching a background curl PID proves nothing; instead, sample the
# endpoint repeatedly while pods cycle and count failures.

# 1. Trigger rolling update
kubectl rollout restart deployment -n coditect-app coditect-combined

# 2. Sample the endpoint every 2 seconds during the rollout
FAILURES=0
for i in {1..30}; do
  curl -sf "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling" \
    > /dev/null || FAILURES=$((FAILURES + 1))
  sleep 2
done

# 3. Report
if [ "$FAILURES" -eq 0 ]; then
  echo "No failed requests during rolling update"
else
  echo "$FAILURES/30 requests failed during rolling update"
fi

Rollback

# Restore original draining timeout
kubectl patch backendconfig -n coditect-app coditect-backend-config \
  --type=merge -p '
spec:
  connectionDraining:
    drainingTimeoutSec: 300
'

COMPREHENSIVE FIX: Apply All Changes at Once

For maximum efficiency, apply all fixes simultaneously:

#!/bin/bash
# Comprehensive Socket.IO fix script

set -euo pipefail

NAMESPACE="coditect-app"
INGRESS="coditect-production-ingress"
BACKENDCONFIG="coditect-backend-config"
DEPLOYMENT="coditect-combined"

echo "=== Applying comprehensive Socket.IO fixes ==="

# Backup current configurations
echo "Creating backups..."
kubectl get ingress -n "$NAMESPACE" "$INGRESS" \
-o yaml > "/tmp/ingress-backup-$(date +%Y%m%d-%H%M%S).yaml"

kubectl get backendconfig -n "$NAMESPACE" "$BACKENDCONFIG" \
-o yaml > "/tmp/backendconfig-backup-$(date +%Y%m%d-%H%M%S).yaml"

# Fix #1: Add WebSocket annotation
echo "Adding WebSocket support annotation..."
kubectl annotate ingress -n "$NAMESPACE" "$INGRESS" \
cloud.google.com/websocket-max-idle-timeout="86400" \
--overwrite

# Fix #2: Create health endpoint
echo "Creating health endpoint..."
POD=$(kubectl get pod -n "$NAMESPACE" -l "app=$DEPLOYMENT" \
-o jsonpath='{.items[0].metadata.name}')

kubectl exec -n "$NAMESPACE" "$POD" -- bash -c '
if ! grep -q "location /health" /etc/nginx/sites-available/default; then
cat >> /etc/nginx/sites-available/default <<EOF

# Health check endpoint for GKE load balancer
location /health {
access_log off;
return 200 "healthy\\n";
add_header Content-Type text/plain;
}
EOF
nginx -t && nginx -s reload
fi
'

# Fix #3-5: Update BackendConfig (all settings)
echo "Updating BackendConfig..."
cat <<EOF | kubectl apply -f -
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: $BACKENDCONFIG
  namespace: $NAMESPACE
spec:
  # Extended timeout for long-lived WebSocket connections
  timeoutSec: 86400  # 24 hours

  # Minimal connection draining
  connectionDraining:
    drainingTimeoutSec: 30

  # Session affinity for Socket.IO
  sessionAffinity:
    affinityType: "CLIENT_IP"
    affinityCookieTtlSec: 10800  # 3 hours

  # Health check configuration
  healthCheck:
    checkIntervalSec: 15
    timeoutSec: 5
    healthyThreshold: 2
    unhealthyThreshold: 5
    type: HTTP
    requestPath: /health
    port: 80

  # Enable logging for debugging
  logging:
    enable: true
    sampleRate: 1.0
EOF

echo "=== Changes applied ==="
echo "Waiting 5 minutes for GKE to reconcile..."
sleep 300

echo "=== Testing Socket.IO connection ==="
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
  "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling")

if [ "$HTTP_CODE" = "200" ]; then
  echo "✅ SUCCESS: Socket.IO now returns 200"
else
  echo "❌ FAILED: Socket.IO still returns $HTTP_CODE"
  echo "Check logs for more details"
fi

echo "=== Fix complete ==="
echo "Backups saved to /tmp/*-backup-*.yaml"

Post-Implementation Validation

Comprehensive Test Suite

#!/bin/bash
# Socket.IO validation test suite

echo "=== Socket.IO Validation Test Suite ==="

PASS=0
FAIL=0

# Test 1: Polling transport
echo -n "Test 1: Polling transport... "
RESPONSE=$(curl -s "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling")
if echo "$RESPONSE" | grep -q '"sid"'; then
  echo "✅ PASS"
  ((PASS++))
else
  echo "❌ FAIL"
  ((FAIL++))
fi

# Test 2: WebSocket upgrade
echo -n "Test 2: WebSocket upgrade... "
STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
  "https://coditect.ai/theia/socket.io/?EIO=4&transport=websocket" \
  -H "Upgrade: websocket" \
  -H "Connection: Upgrade")
if [ "$STATUS" = "101" ] || [ "$STATUS" = "200" ]; then
  echo "✅ PASS"
  ((PASS++))
else
  echo "❌ FAIL (status: $STATUS)"
  ((FAIL++))
fi

# Test 3: Health endpoint
echo -n "Test 3: Health endpoint... "
STATUS=$(curl -s -o /dev/null -w "%{http_code}" "https://coditect.ai/health")
if [ "$STATUS" = "200" ]; then
  echo "✅ PASS"
  ((PASS++))
else
  echo "❌ FAIL (status: $STATUS)"
  ((FAIL++))
fi

# Test 4: Repeated handshakes (each should return a session id; true
# session affinity is verified via pod logs, as in Fix #3)
echo -n "Test 4: Repeated handshakes... "
SID1=$(curl -s "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling" | \
  grep -o '"sid":"[^"]*"' | cut -d'"' -f4)
sleep 2
SID2=$(curl -s "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling" | \
  grep -o '"sid":"[^"]*"' | cut -d'"' -f4)

if [ -n "$SID1" ] && [ -n "$SID2" ]; then
  echo "✅ PASS"
  ((PASS++))
else
  echo "❌ FAIL"
  ((FAIL++))
fi

echo ""
echo "=== Results ==="
echo "Passed: $PASS/4"
echo "Failed: $FAIL/4"

if [ $FAIL -eq 0 ]; then
  echo "✅ All tests passed!"
  exit 0
else
  echo "❌ Some tests failed"
  exit 1
fi
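
The sid parsing in Tests 1 and 4 can be factored into one helper, shown here against a sample Engine.IO v4 handshake body (the payload below is illustrative, not captured from the live service):

```shell
# extract_sid: pull the Engine.IO session id out of a polling handshake body.
extract_sid() {
  grep -o '"sid":"[^"]*"' <<< "$1" | cut -d'"' -f4
}

# Sample handshake payload in the Engine.IO v4 shape
BODY='0{"sid":"lv_VI97HAXpY6yYWAAAC","upgrades":["websocket"],"pingInterval":25000}'
extract_sid "$BODY"
# → lv_VI97HAXpY6yYWAAAC
```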

Monitoring and Observability

Add Logging for Socket.IO Connections

# Add detailed logging to nginx (piped via stdin so quoting stays sane)
POD=$(kubectl get pod -n coditect-app -l app=coditect-combined \
  -o jsonpath='{.items[0].metadata.name}')

kubectl exec -i -n coditect-app "$POD" -- \
  bash -c 'cat > /etc/nginx/conf.d/socketio-logging.conf' <<'EOF'
log_format socketio_access '[$time_local] client=$remote_addr '
                           'status=$status request="$request" '
                           'upgrade=$http_upgrade connection=$http_connection '
                           'sid=$arg_sid transport=$arg_transport';

access_log /var/log/nginx/socketio-access.log socketio_access;
EOF

kubectl exec -n coditect-app "$POD" -- nginx -t
kubectl exec -n coditect-app "$POD" -- nginx -s reload

# Monitor Socket.IO connections in real-time
kubectl logs -n coditect-app -l app=coditect-combined -f | \
grep "socket.io"

Set Up Alerts

# GKE monitoring alert for Socket.IO errors
apiVersion: monitoring.googleapis.com/v1alpha1
kind: AlertPolicy
metadata:
  name: socketio-400-errors
spec:
  conditions:
    - displayName: High Socket.IO 400 Error Rate
      conditionThreshold:
        filter: |
          resource.type="k8s_pod"
          resource.labels.namespace_name="coditect-app"
          jsonPayload.request=~"socket.io"
          jsonPayload.status="400"
        comparison: COMPARISON_GT
        thresholdValue: 10
        duration: 300s
  notificationChannels:
    - <your-notification-channel>

Rollback Procedure

If issues arise after applying fixes:

#!/bin/bash
# Complete rollback script

set -euo pipefail

echo "=== Rolling back Socket.IO fixes ==="

# Find most recent backup
INGRESS_BACKUP=$(ls -t /tmp/ingress-backup-*.yaml | head -1)
BACKENDCONFIG_BACKUP=$(ls -t /tmp/backendconfig-backup-*.yaml | head -1)

echo "Restoring from backups:"
echo " Ingress: $INGRESS_BACKUP"
echo " BackendConfig: $BACKENDCONFIG_BACKUP"

# Restore Ingress
kubectl apply -f "$INGRESS_BACKUP"

# Restore BackendConfig
kubectl apply -f "$BACKENDCONFIG_BACKUP"

# Remove health endpoint (if added)
kubectl exec -n coditect-app deployment/coditect-combined -- bash -c '
sed -i "/# Health check endpoint/,/}/d" /etc/nginx/sites-available/default
nginx -t && nginx -s reload
'

echo "=== Rollback complete ==="
echo "Waiting for GKE to reconcile..."
sleep 180

echo "Testing Socket.IO after rollback..."
curl -s -o /dev/null -w "Status: %{http_code}\n" \
"https://coditect.ai/theia/socket.io/?EIO=4&transport=polling"

Success Criteria

✅ Socket.IO handshake succeeds from an external browser (HTTP 200)
✅ WebSocket upgrade completes (HTTP 101)
✅ Health endpoint returns 200
✅ Session affinity keeps a client on the same pod
✅ Long-lived connections survive >1 hour
✅ Connections persist through rolling updates
✅ No 400 errors in logs after 24 hours
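
The last criterion can be checked mechanically against the socketio access log from the Monitoring section. `count_400s` is a sketch that assumes that log's `status=` field format:

```shell
# count_400s: count Socket.IO requests with status=400 in access-log lines
# read from stdin (field format from the socketio_access log_format).
count_400s() {
  grep 'socket.io' | grep -c 'status=400'
}

# Sample log lines (illustrative, in the socketio_access format)
printf '%s\n' \
  'client=1.2.3.4 status=400 request="GET /theia/socket.io/?EIO=4 HTTP/1.1"' \
  'client=1.2.3.4 status=200 request="GET /theia/socket.io/?EIO=4 HTTP/1.1"' \
  | count_400s
# → 1
```

Tailing the pod's /var/log/nginx/socketio-access.log through `count_400s` should report 0 once the fixes have settled.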


Troubleshooting

Issue: Still getting 400 after fixes

Check:

# Verify WebSocket annotation applied
kubectl get ingress -n coditect-app coditect-production-ingress \
-o jsonpath='{.metadata.annotations.cloud\.google\.com/websocket-max-idle-timeout}'

# Check GKE ingress status
kubectl describe ingress -n coditect-app coditect-production-ingress

# Verify BackendConfig changes propagated
BACKEND=$(kubectl get ingress -n coditect-app coditect-production-ingress \
-o jsonpath='{.metadata.annotations.ingress\.kubernetes\.io/backends}' | \
jq -r 'keys[0]')

gcloud compute backend-services describe "$BACKEND" --global

Issue: Intermittent failures

Check session affinity:

# Make 10 requests with same client
for i in {1..10}; do
curl -s "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling" \
-c /tmp/cookies.txt -b /tmp/cookies.txt > /dev/null
done

# Check if same pod handled all
kubectl logs -n coditect-app -l app=coditect-combined --tail=20 | \
grep "socket.io" | awk '{print $1}' | sort | uniq -c

Issue: Health checks failing

Check endpoint:

# Test from inside pod
kubectl exec -n coditect-app deployment/coditect-combined -- \
curl -v http://localhost/health

# Test from external
curl -v https://coditect.ai/health

# Check backend health
gcloud compute backend-services get-health \
$(kubectl get ingress -n coditect-app coditect-production-ingress \
-o jsonpath='{.metadata.annotations.ingress\.kubernetes\.io/backends}' | \
jq -r 'keys[0]') \
--global

Support and Documentation

Contact: DevOps team for implementation assistance