Socket.IO Investigation Runbook

Purpose: Systematically diagnose Socket.IO 400 errors in a GKE environment

Duration: 30-60 minutes

Prerequisites:

kubectl access to coditect-app namespace
gcloud CLI configured
Browser with DevTools access
Working hypothesis: the fault lies between the GKE load balancer and nginx (internal tests pass)

Phase 1: Baseline Verification (5 min)

1.1 Confirm Internal Functionality

# Test direct Socket.IO endpoint
kubectl exec -n coditect-app deployment/coditect-combined -- \
  curl -s -i "http://localhost:3000/socket.io/?EIO=4&transport=polling" | \
  head -20

# Expected: HTTP 200 OK with handshake body

# Test through nginx
kubectl exec -n coditect-app deployment/coditect-combined -- \
  curl -s -i "http://localhost/theia/socket.io/?EIO=4&transport=polling" | \
  head -20

# Expected: HTTP 200 OK with handshake body

✅ Pass: Both return 200 OK
❌ Fail: If either fails, issue is in nginx/theia config (not GKE)

1.2 Verify External Failure

# From local machine (outside cluster)
curl -s -i "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling" | \
  head -20

# Expected: HTTP 400 Bad Request

✅ Pass: Returns 400 (confirms external-only failure)
❌ Fail: If 200, issue may have self-resolved
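
The pass/fail rules from 1.1 and 1.2 can be collected into one decision helper. This is a minimal sketch: `classify_baseline` is a hypothetical name, and it only takes the three status codes you observed as arguments; it does not run the requests itself.

```shell
# Hypothetical helper: map the three baseline status codes to the layer to investigate.
# Arguments: direct (port 3000), nginx (port 80 inside the pod), external (through GKE)
classify_baseline() {
  local direct="$1" nginx="$2" external="$3"
  if [ "$direct" != "200" ]; then
    echo "app: Socket.IO server on port 3000 is failing"
  elif [ "$nginx" != "200" ]; then
    echo "nginx: proxy config inside the pod is failing"
  elif [ "$external" != "200" ]; then
    echo "edge: failure is between the load balancer and nginx"
  else
    echo "ok: all three paths return 200"
  fi
}

classify_baseline 200 200 400
# prints: edge: failure is between the load balancer and nginx
```

Only the "edge" result justifies continuing with the GKE-focused phases below.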

Phase 2: Header Analysis (10 min)

2.1 Add Request Header Logging

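# Create the debug log format. log_format is only valid in the http context,
# so it goes in its own file rather than inside a server block.
cat <<'EOF' > /tmp/nginx-debug-config.conf
log_format socket_debug '$remote_addr - [$time_local] "$request" '
                        'status=$status '
                        'upgrade="$http_upgrade" '
                        'connection="$http_connection" '
                        'host="$http_host" '
                        'origin="$http_origin" '
                        'x_forwarded_for="$http_x_forwarded_for" '
                        'x_forwarded_proto="$http_x_forwarded_proto"';
EOF

# Pick one pod and use it for every step, so the copy and the reload hit the same container
POD=$(kubectl get pod -n coditect-app -l app=coditect-combined \
  -o jsonpath='{.items[0].metadata.name}')

# Install the format where nginx loads it before the site config
# (assumes the stock Debian layout, where nginx.conf includes conf.d/*.conf)
kubectl cp /tmp/nginx-debug-config.conf \
  coditect-app/$POD:/etc/nginx/conf.d/socket_debug.conf

# Add the access_log directive to the existing Socket.IO location block.
# Appending a second "server { listen 80; }" block would not change the
# existing location, so the directive is inserted in place.
# Assumes the location line contains "socket" and "io"; edit by hand if it does not.
kubectl exec -n coditect-app "$POD" -- bash -c '
  sed -i "/location.*socket.*io/a access_log /var/log/nginx/socket_debug.log socket_debug;" \
    /etc/nginx/sites-available/default
  nginx -t && nginx -s reload
'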

2.2 Capture Headers from Failing Request

# Trigger request from browser (or curl from external machine)
curl -v "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling" \
  -H "Origin: https://coditect.ai" \
  -H "User-Agent: Mozilla/5.0" 2>&1 | \
  tee /tmp/external-request-headers.txt

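# Check the nginx debug log for received headers.
# The socket_debug format writes to a file inside the container, not to stdout,
# so kubectl logs does not show it. With several replicas, target the pod
# where logging was enabled.
kubectl exec -n coditect-app deployment/coditect-combined -- \
  tail -5 /var/log/nginx/socket_debug.log

# Save logs
kubectl exec -n coditect-app deployment/coditect-combined -- \
  tail -50 /var/log/nginx/socket_debug.log > /tmp/nginx-socket-debug.log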

2.3 Compare Internal vs External Headers

# Internal request (working)
kubectl exec -n coditect-app deployment/coditect-combined -- \
  curl -v "http://localhost/theia/socket.io/?EIO=4&transport=polling" \
  -H "Host: localhost" \
  -H "Origin: http://localhost" 2>&1 | \
  grep -E "^(>|<)" > /tmp/internal-headers.txt

# External request (failing) - already captured above

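# Compare like with like: keep only the header lines (> sent, < received)
# from the external capture, as was done for the internal one
grep -E "^(>|<)" /tmp/external-request-headers.txt | \
  diff -u /tmp/internal-headers.txt - | \
  tee /tmp/header-diff.txt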

Key Questions:

Is Upgrade: websocket present in external request?
Is Connection: upgrade present?
Is Host header modified?
Are X-Forwarded-* headers present?
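
A raw diff of two `curl -v` captures is dominated by value differences (hostnames, TLS details). To answer the questions above it can help to compare header names only. The following is a sketch under assumptions: `request_header_names` is a hypothetical helper, and the two demo files stand in for the real captures in /tmp.

```shell
# Hypothetical helper: reduce `curl -v` output to sorted, lowercased request header names.
request_header_names() {
  grep -E '^> [A-Za-z0-9-]+:' "$1" | \
    sed -E 's/^> ([A-Za-z0-9-]+):.*/\1/' | \
    tr 'A-Z' 'a-z' | sort -u
}

# Demo captures (stand-ins for /tmp/internal-headers.txt and /tmp/external-request-headers.txt)
printf '> GET /x HTTP/1.1\n> Host: localhost\n> Origin: http://localhost\n< HTTP/1.1 200 OK\n' \
  > /tmp/demo-internal.txt
printf '> GET /x HTTP/1.1\n> Host: coditect.ai\n> User-Agent: Mozilla/5.0\n> Origin: https://coditect.ai\n' \
  > /tmp/demo-external.txt

request_header_names /tmp/demo-internal.txt > /tmp/demo-internal.names
request_header_names /tmp/demo-external.txt > /tmp/demo-external.names

# Header names present only in the external request
comm -13 /tmp/demo-internal.names /tmp/demo-external.names
# prints: user-agent
```

This compares what curl sent, not what nginx received; headers added or removed by the load balancer only show up in the socket_debug log on the nginx side.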

Phase 3: GKE Backend Investigation (15 min)

3.1 Get Backend Service Details

# Extract backend service name from ingress annotations
BACKENDS=$(kubectl get ingress -n coditect-app coditect-production-ingress \
  -o jsonpath='{.metadata.annotations.ingress\.kubernetes\.io/backends}')

echo "Backend services: $BACKENDS"

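# Get first backend service name.
# jq's keys are sorted alphabetically, so keys[0] may be the default backend
# rather than the application backend; check the echoed list above.
BACKEND_NAME=$(echo "$BACKENDS" | jq -r 'keys[0]')
echo "Primary backend: $BACKEND_NAME"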

# Describe backend configuration
gcloud compute backend-services describe $BACKEND_NAME --global \
  --format="yaml" > /tmp/gke-backend-config.yaml

cat /tmp/gke-backend-config.yaml

3.2 Check Critical Backend Settings

# Check timeout (should be ≥3600 for long connections)
TIMEOUT=$(gcloud compute backend-services describe $BACKEND_NAME --global \
  --format="value(timeoutSec)")
echo "Backend timeout: ${TIMEOUT}s"
if [ "${TIMEOUT:-0}" -lt 3600 ]; then
  echo "⚠️ WARNING: Timeout too short for Socket.IO ($TIMEOUT < 3600)"
fi

# Check session affinity
AFFINITY=$(gcloud compute backend-services describe $BACKEND_NAME --global \
  --format="value(sessionAffinity)")
echo "Session affinity: $AFFINITY"
if [ "$AFFINITY" != "CLIENT_IP" ]; then
  echo "⚠️ WARNING: Session affinity not set to CLIENT_IP"
fi

# Check connection draining
DRAINING=$(gcloud compute backend-services describe $BACKEND_NAME --global \
  --format="value(connectionDraining.drainingTimeoutSec)")
echo "Connection draining: ${DRAINING}s"
if [ "${DRAINING:-0}" -gt 60 ]; then
  echo "⚠️ WARNING: Connection draining too long (${DRAINING}s > 60s)"
fi

# Check health check
HEALTH_CHECKS=$(gcloud compute backend-services describe $BACKEND_NAME --global \
  --format="value(healthChecks)")
echo "Health checks: $HEALTH_CHECKS"

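# Get health check details.
# For an HTTP health check, requestPath and port are nested under httpHealthCheck
# (httpsHealthCheck or http2HealthCheck for the other types).
for HC in $HEALTH_CHECKS; do
  HC_NAME=$(basename "$HC")
  gcloud compute health-checks describe "$HC_NAME" --global \
    --format="yaml(type,checkIntervalSec,httpHealthCheck)"
done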
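
The threshold checks in 3.2 repeat the same pattern and break when gcloud returns an empty value. A small helper can make them uniform. This is a sketch only: `check_setting` is a hypothetical name, and the sample values below are illustrative, not real cluster output.

```shell
# Hypothetical helper: warn when a numeric setting is missing or out of range.
# Usage: check_setting NAME VALUE OP LIMIT
#   OP is a test operator such as -lt or -gt; the warning fires when "VALUE OP LIMIT" is true.
check_setting() {
  local name="$1" value="$2" op="$3" limit="$4"
  case "$value" in
    ''|*[!0-9]*)
      echo "WARN $name: no numeric value returned" ;;
    *)
      if [ "$value" "$op" "$limit" ]; then
        echo "WARN $name: $value violates limit $limit"
      else
        echo "OK $name: $value"
      fi ;;
  esac
}

check_setting timeoutSec 30 -lt 3600           # WARN timeoutSec: 30 violates limit 3600
check_setting drainingTimeoutSec "" -gt 60     # WARN drainingTimeoutSec: no numeric value returned
check_setting drainingTimeoutSec 30 -gt 60     # OK drainingTimeoutSec: 30
```

In the runbook it would be called with the captured variables, for example `check_setting timeoutSec "$TIMEOUT" -lt 3600`.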

3.3 Verify Health Check Endpoint

# Get health check path (assumes an HTTP-type health check, where the path
# is nested under httpHealthCheck)
HC_PATH=$(gcloud compute health-checks describe \
  $(gcloud compute backend-services describe $BACKEND_NAME --global \
    --format="value(healthChecks)" | head -1 | xargs basename) \
  --global --format="value(httpHealthCheck.requestPath)")

echo "Health check path: $HC_PATH"

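# Test health check endpoint and capture the status code
HC_STATUS=$(kubectl exec -n coditect-app deployment/coditect-combined -- \
  curl -s -o /dev/null -w "%{http_code}" "http://localhost${HC_PATH}")
echo "Health check status: $HC_STATUS"

# If the endpoint does not return 200, create a health endpoint.
# (curl exits 0 on a 404 unless -f is given, so test the status code, not $?)
if [ "$HC_STATUS" != "200" ]; then
  echo "⚠️ Health check endpoint missing or failing!"
  echo "Creating /health endpoint..."

  # A location block is only valid inside server { }, so insert it after the
  # first "server {" line rather than appending it to the end of the file.
  # Assumes that first "server {" is the active server block.
  kubectl exec -n coditect-app deployment/coditect-combined -- bash -c '
    sed -i "0,/server *{/s|server *{|&\n    location = /health { access_log off; default_type text/plain; return 200 \"healthy\"; }|" \
      /etc/nginx/sites-available/default

    nginx -t && nginx -s reload
  '
fi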

Phase 4: BackendConfig Analysis (10 min)

4.1 Review Current BackendConfig

kubectl get backendconfig -n coditect-app coditect-backend-config -o yaml | \
  tee /tmp/current-backendconfig.yaml

# Key fields to verify:
# - spec.timeoutSec: Should be ≥3600
# - spec.connectionDraining.drainingTimeoutSec: Should be ≤60
# - spec.sessionAffinity: Should have CLIENT_IP affinity
# - spec.healthCheck.requestPath: Should be a valid endpoint

4.2 Generate Optimized BackendConfig

cat <<'EOF' > /tmp/optimized-backendconfig.yaml
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: coditect-backend-config-v2
  namespace: coditect-app
spec:
  # Extended timeout for long-lived WebSocket connections
  timeoutSec: 86400  # 24 hours
  
  # Minimal connection draining to allow quick failover
  connectionDraining:
    drainingTimeoutSec: 30
  
  # Session affinity for Socket.IO connection persistence
  sessionAffinity:
    affinityType: "CLIENT_IP"
    affinityCookieTtlSec: 10800  # 3 hours; only meaningful for GENERATED_COOKIE affinity
  
  # Health check configuration
  healthCheck:
    checkIntervalSec: 15
    timeoutSec: 5
    healthyThreshold: 2
    unhealthyThreshold: 3
    type: HTTP
    requestPath: /health  # Ensure this endpoint exists
    port: 80
  
  # Logging for debugging
  logging:
    enable: true
    sampleRate: 1.0
EOF

# Show diff
echo "=== Differences from current config ==="
diff -u /tmp/current-backendconfig.yaml /tmp/optimized-backendconfig.yaml || true

Phase 5: Live Traffic Testing (10 min)

5.1 Browser DevTools Inspection

echo "
=== BROWSER TEST PROCEDURE ===

1. Open browser to: https://coditect.ai/theia/
2. Open DevTools (F12) → Network tab
3. Filter by: 'socket.io'
4. Attempt to load IDE
5. Find failing Socket.IO request
6. Right-click → Copy → Copy as cURL

Expected observations:
- Request URL: https://coditect.ai/theia/socket.io/?EIO=4&transport=polling
- Status: 400 Bad Request
- Headers to check:
  * Request: Upgrade, Connection, Origin, Host
  * Response: Look for error messages
"

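# Wait for user to paste the cURL command (it usually spans several lines)
echo "Paste the cURL command here, then press Ctrl-D:"
cat > /tmp/browser-curl.sh

# Execute and analyze.
# -v must be given to curl itself; an argument after the script name
# would be passed to the script and ignored.
sed '1s/^curl /curl -v /' /tmp/browser-curl.sh | \
  bash 2>&1 | tee /tmp/browser-curl-output.txt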

5.2 WebSocket Handshake Simulation

# Simulate full Socket.IO handshake sequence

# Step 1: Polling transport (returns 400 and no sid while the external failure is present)
echo "=== Step 1: Initial polling handshake ==="
HANDSHAKE=$(curl -s "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling")
echo "Handshake response: $HANDSHAKE"

# Extract session ID
SID=$(echo "$HANDSHAKE" | grep -o '"sid":"[^"]*"' | cut -d'"' -f4)
echo "Session ID: $SID"

if [ -z "$SID" ]; then
  echo "❌ FAILED: No session ID received"
  exit 1
fi

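# Step 2: WebSocket upgrade (likely to fail)
# --http1.1: the Upgrade handshake is an HTTP/1.1 mechanism, and curl may
#            otherwise negotiate HTTP/2 over TLS.
# --max-time: stops curl from waiting on the open socket after a successful 101.
echo -e "\n=== Step 2: WebSocket upgrade ==="
curl -v --http1.1 --max-time 10 \
  "https://coditect.ai/theia/socket.io/?EIO=4&transport=websocket&sid=$SID" \
  -H "Upgrade: websocket" \
  -H "Connection: Upgrade" \
  -H "Sec-WebSocket-Version: 13" \
  -H "Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==" \
  -H "Origin: https://coditect.ai" 2>&1 | \
  tee /tmp/websocket-upgrade-attempt.txt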

# Check for 101 Switching Protocols (success) or 400 (failure)
grep -E "(101 Switching Protocols|400 Bad Request)" /tmp/websocket-upgrade-attempt.txt
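
A 101 response is only a valid WebSocket handshake if its Sec-WebSocket-Accept header matches the key that was sent. Per RFC 6455, the value is the base64-encoded SHA-1 of the key concatenated with a fixed GUID. The sketch below computes it for comparison; `ws_accept` is a hypothetical helper name, and it assumes bash (for `printf` hex escapes) and GNU coreutils (`sha1sum`, `base64`).

```shell
# Hypothetical helper: compute the Sec-WebSocket-Accept value for a given key (RFC 6455).
ws_accept() {
  local key="$1" guid="258EAFA5-E914-47DA-95CA-C5AB0DC85B11" hex
  # SHA-1 of key + GUID, as a hex string
  hex=$(printf '%s' "${key}${guid}" | sha1sum | cut -d' ' -f1)
  # Convert the hex digest to raw bytes, then base64-encode them
  printf "$(printf '%s' "$hex" | sed 's/../\\x&/g')" | base64
}

ws_accept "dGhlIHNhbXBsZSBub25jZQ=="
# prints: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```

If the upgrade succeeds, `grep -i "sec-websocket-accept" /tmp/websocket-upgrade-attempt.txt` should show this same value for the key used in Step 2.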

Phase 6: Network Path Tracing (10 min)

6.1 Trace Request Path

# Identify pod handling request
echo "=== Active pods ==="
kubectl get pods -n coditect-app -l app=coditect-combined -o wide

# Monitor real-time logs during test
echo "=== Starting log monitoring ==="
kubectl logs -n coditect-app -l app=coditect-combined -f --tail=0 &
LOG_PID=$!

echo "Trigger a Socket.IO request now..."
sleep 10

kill $LOG_PID

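# Check which pod received the request (--prefix tags each line with its pod).
# The status filter assumes the default access log format, where the status
# code follows the quoted request line.
kubectl logs -n coditect-app -l app=coditect-combined --prefix --tail=100 | \
  grep -E "(socket\.io|theia)" | \
  grep -v '" 200 ' | \
  tail -20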

6.2 Test Session Affinity

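# Make multiple requests, track which pod handles each.
# Each request carries a unique User-Agent so its access-log line can be found
# (assumes the nginx access log goes to stdout and includes the User-Agent);
# --prefix tags every log line with the pod that wrote it.
echo "=== Testing session affinity ==="

for i in {1..5}; do
  TAG="affinity-probe-$$-$i"
  curl -s -o /dev/null -A "$TAG" \
    "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling"
  sleep 1
  POD=$(kubectl logs -n coditect-app -l app=coditect-combined --prefix --tail=20 | \
    grep "$TAG" | tail -1 | awk '{print $1}')
  echo "Request $i → Pod: $POD"
done

# All requests should hit the same pod if affinity is working.
# An empty pod value means the request never reached nginx.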
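
Reading five pod names by eye is error-prone once the loop is run with more requests. The verdict can be reduced to a count of distinct pods. This is a minimal sketch; `affinity_verdict` is a hypothetical helper that expects one pod identifier per line on stdin, and the pod names in the demo are made up.

```shell
# Hypothetical helper: read one pod identifier per line and report whether affinity held.
affinity_verdict() {
  local distinct
  # Drop empty lines, then count distinct identifiers
  distinct=$(sed '/^$/d' | sort -u | wc -l | tr -d ' ')
  case "$distinct" in
    0) echo "no data" ;;
    1) echo "sticky: all requests hit one pod" ;;
    *) echo "not sticky: requests spread over $distinct pods" ;;
  esac
}

printf 'pod-a\npod-a\npod-a\n' | affinity_verdict
# prints: sticky: all requests hit one pod
printf 'pod-a\npod-b\npod-a\n' | affinity_verdict
# prints: not sticky: requests spread over 2 pods
```

With a single replica the result is always "sticky", so the test is only meaningful when at least two pods are running.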

Phase 7: Diagnosis Summary

7.1 Generate Report

# Unquoted EOF so that $(date) below is expanded when the report is written
cat <<EOF > /tmp/diagnostic-report.md
# Socket.IO Diagnostic Report

**Date**: $(date)
**Cluster**: coditect-production
**Namespace**: coditect-app

## Test Results

### Phase 1: Baseline Verification
- Internal Socket.IO (port 3000): [PASS/FAIL]
- Through nginx (port 80): [PASS/FAIL]
- External through GKE: [PASS/FAIL]

### Phase 2: Header Analysis
- Upgrade header preserved: [YES/NO]
- Connection header preserved: [YES/NO]
- Host header correct: [YES/NO]
- Origin header correct: [YES/NO]

**Header Diff**: See /tmp/header-diff.txt

### Phase 3: GKE Backend Configuration
- Timeout setting: [VALUE]s
- Session affinity: [VALUE]
- Connection draining: [VALUE]s
- Health check path: [VALUE]
- Health check status: [VALUE]

### Phase 4: BackendConfig
**Current Configuration**: See /tmp/current-backendconfig.yaml
**Recommended Changes**: See /tmp/optimized-backendconfig.yaml

### Phase 5: Live Traffic Testing
**Browser Request**: See /tmp/browser-curl-output.txt
**WebSocket Upgrade**: See /tmp/websocket-upgrade-attempt.txt

### Phase 6: Network Path
**Pod Distribution**: [FINDINGS]
**Session Affinity**: [WORKING/BROKEN]

## Root Cause

[Based on above tests, identify most likely cause]

## Recommended Actions

1. [Priority 1 action]
2. [Priority 2 action]
3. [Priority 3 action]

## Supporting Data

- Full logs: /tmp/nginx-socket-debug.log
- GKE backend config: /tmp/gke-backend-config.yaml
- Header comparison: /tmp/header-diff.txt

EOF

cat /tmp/diagnostic-report.md

Quick Fix Attempts

Fix #1: Update BackendConfig (Low Risk)

# Apply optimized BackendConfig
kubectl apply -f /tmp/optimized-backendconfig.yaml

# Update service annotation to use new config
kubectl patch service -n coditect-app coditect-combined-service -p '
{
  "metadata": {
    "annotations": {
      "cloud.google.com/backend-config": "{\"default\": \"coditect-backend-config-v2\"}"
    }
  }
}'

# Wait for GKE to reconcile (2-5 minutes)
echo "Waiting for GKE to apply changes..."
sleep 120

# Test
curl -s -o /dev/null -w "Status: %{http_code}\n" \
  "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling"

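Since reconciliation takes anywhere from 2 to 5 minutes, a fixed `sleep 120` can test too early. Polling until the expected status appears is one alternative. The following is a sketch, not part of the original procedure: `wait_for` and `probe` are hypothetical helpers, and the attempt count and delay are illustrative.

```shell
# Hypothetical helper: poll a command until it prints the expected value.
# Usage: wait_for EXPECTED MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_for() {
  local expected="$1" attempts="$2" delay="$3" i=1 got
  shift 3
  while [ "$i" -le "$attempts" ]; do
    got=$("$@")
    if [ "$got" = "$expected" ]; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "gave up after $attempts attempt(s), last value: $got"
  return 1
}

# Probe used throughout this runbook: print only the HTTP status code
probe() {
  curl -s -o /dev/null -w "%{http_code}" \
    "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling"
}

# Example: check every 15 seconds for up to 5 minutes
# wait_for 200 20 15 probe
```

The same call can replace the `sleep 120` steps in the other fixes.
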
Fix #2: Relax Health Checks Temporarily (Medium Risk)

# Raise the unhealthy threshold and check interval so pods are not marked
# unhealthy as quickly (this relaxes the health check; it does not disable it)
kubectl patch backendconfig -n coditect-app coditect-backend-config --type=merge -p '
spec:
  healthCheck:
    unhealthyThreshold: 10
    checkIntervalSec: 60
'

# Wait and test
sleep 120
curl -s -o /dev/null -w "Status: %{http_code}\n" \
  "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling"

Fix #3: Add Explicit WebSocket Support (High Risk - Requires Ingress Update)

# Check if WebSocket annotation exists
kubectl get ingress -n coditect-app coditect-production-ingress \
  -o jsonpath='{.metadata.annotations}' | jq .

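# Add WebSocket idle-timeout annotation (if missing).
# Note: Google Cloud HTTP(S) load balancers proxy WebSocket upgrades without
# extra configuration, and connection lifetime is bounded by the backend
# service timeout set in Fix #1. Confirm that the ingress controller in use
# recognizes this annotation before relying on it.
kubectl annotate ingress -n coditect-app coditect-production-ingress \
  cloud.google.com/websocket-max-idle-timeout="86400" \
  --overwrite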

# Wait for GKE ingress controller to reconcile
sleep 120

# Test
curl -s -o /dev/null -w "Status: %{http_code}\n" \
  "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling"

Emergency Rollback

# If changes make things worse, revert to original config
kubectl apply -f /tmp/current-backendconfig.yaml

kubectl patch service -n coditect-app coditect-combined-service -p '
{
  "metadata": {
    "annotations": {
      "cloud.google.com/backend-config": "{\"default\": \"coditect-backend-config\"}"
    }
  }
}'

kubectl annotate ingress -n coditect-app coditect-production-ingress \
  cloud.google.com/websocket-max-idle-timeout- \
  --overwrite

Success Metrics

After applying fixes, verify:

# 1. Socket.IO handshake succeeds
curl -s "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling" | \
  grep -q '"sid"' && echo "✅ Handshake working" || echo "❌ Still failing"

# 2. WebSocket upgrade succeeds
# (Requires WebSocket client - test in browser)

# 3. Connection persists >1 hour
# (Long-running test)

# 4. Pod rolling update doesn't kill connections
kubectl rollout restart deployment -n coditect-app coditect-combined
# Monitor for disconnections during rollout

Appendix: Useful Commands

# Watch backend service health
# (double quotes so $BACKEND_NAME is expanded before watch runs the command)
watch -n 5 "gcloud compute backend-services get-health $BACKEND_NAME --global"

# Monitor nginx access logs in real-time
kubectl logs -n coditect-app -l app=coditect-combined -f | grep socket.io

# Check GKE load balancer logs
gcloud logging read "resource.type=http_load_balancer AND resource.labels.url_map_name=k8s-um-coditect-app-coditect-production-ingress" --limit 50 --format json

# Force backend service update
gcloud compute backend-services update $BACKEND_NAME --global --timeout=86400

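# Test from multiple external IPs (check session affinity).
# 1.2.3.4 and 5.6.7.8 are placeholders; --interface only works with
# addresses assigned to this machine.
for IP in 1.2.3.4 5.6.7.8; do
  curl -s --interface $IP "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling"
done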
