
Socket.IO 400 Error Analysis: GKE Load Balancer Investigation

Executive Summary

Problem: Socket.IO connections fail with HTTP 400 errors when accessed through GKE Ingress, despite working perfectly within the cluster.

Impact: WebSocket-dependent features (theia IDE real-time updates, terminal sessions) are non-functional for external users.

Root Cause Hypothesis: The GKE L7 load balancer is altering or rejecting Socket.IO handshake requests because of WebSocket configuration, health check interference, or header transformation.


Evidence Chain

✅ Working Components

  1. Direct Socket.IO Access (Port 3000)

    localhost:3000/socket.io/?EIO=4&transport=polling
    → HTTP 200 OK
    → Valid handshake response
  2. Nginx Proxy Layer (Port 80)

    localhost/theia/socket.io/?EIO=4&transport=polling
    → HTTP 200 OK
    → Proper path rewriting
    → Headers preserved
  3. Service Configuration

    sessionAffinity: ClientIP
    sessionAffinityConfig:
      clientIP:
        timeoutSeconds: 10800 # 3 hours

❌ Failing Component

  1. External Access Through GKE Ingress
    https://coditect.ai/theia/socket.io/?EIO=4&transport=polling
    → HTTP 400 Bad Request
    → Occurs AFTER passing nginx
    → Only fails from external clients

Architecture Layers

┌─────────────────────────────────────────────────────────────┐
│ Browser                                                     │
│   ↓ GET /theia/socket.io/?EIO=4&transport=polling           │
└───────────────────────────────────────┬─────────────────────┘
                                        ↓ HTTPS (TLS termination)
┌─────────────────────────────────────────────────────────────┐
│ GKE L7 Load Balancer (Google Cloud)                         │
│ • Ingress Controller: coditect-production-ingress           │
│ • IP: 34.8.51.57                                            │
│ • BackendConfig: coditect-backend-config                    │
│   - timeoutSec: 3600                                        │
│   - connectionDraining: 300s                                │
│   - healthCheck: /health endpoint                           │
│ • Issues:                                                   │
│   ⚠️ May strip/modify WebSocket headers                     │
│   ⚠️ Health checks may interfere with connections           │
│   ⚠️ Session affinity not configured at LB level            │
└───────────────────────────────────────┬─────────────────────┘
                                        ↓ HTTP (internal)
┌─────────────────────────────────────────────────────────────┐
│ Kubernetes Service: coditect-combined-service               │
│ • Type: ClusterIP                                           │
│ • SessionAffinity: ClientIP (10800s)                        │
│ • Port: 80 → targetPort: 80                                 │
│ ✅ Working correctly                                        │
└───────────────────────────────────────┬─────────────────────┘
                                        ↓ HTTP
┌─────────────────────────────────────────────────────────────┐
│ Pod: coditect-combined                                      │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Nginx (Port 80)                                         │ │
│ │ • location ~ ^/theia/socket\.io/                        │ │
│ │ • proxy_pass http://localhost:3000                      │ │
│ │ • Upgrade/Connection headers configured                 │ │
│ │ ✅ Working correctly                                    │ │
│ └────────────────────────────┬────────────────────────────┘ │
│                              │                              │
│                              ↓ HTTP                         │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ theia IDE (Port 3000)                                   │ │
│ │ • Socket.IO Server: /socket.io/                         │ │
│ │ • Version: EIO 4                                        │ │
│ │ ✅ Working correctly                                    │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘

Failure Point: Between GKE Load Balancer and Pod


Socket.IO Protocol Requirements

Handshake Sequence

1. Client → Server: GET /socket.io/?EIO=4&transport=polling
Headers:
- Host: coditect.ai
- Origin: https://coditect.ai
- Cookie: io=<session_id> (if reconnecting)

2. Server → Client: HTTP 200 OK
Body: 0{"sid":"<session_id>","upgrades":["websocket"],"pingInterval":25000,"pingTimeout":5000}

3. Client → Server: GET /socket.io/?EIO=4&transport=websocket&sid=<session_id>
Headers:
- Upgrade: websocket
- Connection: Upgrade
- Sec-WebSocket-Key: <key>

4. Server → Client: HTTP 101 Switching Protocols
Headers:
- Upgrade: websocket
- Connection: Upgrade
- Sec-WebSocket-Accept: <accept>
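
The handshake above can be checked by hand when comparing captured traffic. The following is a minimal sketch, not the server's actual implementation: it parses the Engine.IO open packet from step 2 and computes the Sec-WebSocket-Accept value that step 4 should carry for a given Sec-WebSocket-Key, following RFC 6455. The function names are illustrative.

```python
import base64
import hashlib
import json

# Fixed GUID defined by RFC 6455 for deriving Sec-WebSocket-Accept.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def parse_open_packet(body: str) -> dict:
    """Parse an Engine.IO v4 polling handshake body such as '0{"sid": ...}'."""
    if not body.startswith("0"):
        raise ValueError("not an Engine.IO open packet (expected leading '0')")
    return json.loads(body[1:])


def websocket_accept(sec_websocket_key: str) -> str:
    """Return the Sec-WebSocket-Accept value expected for a given key."""
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")
```

If the accept value seen by the browser does not match what `websocket_accept` returns for the key it sent, something between client and server rewrote the upgrade exchange.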

Critical Headers for Socket.IO

| Header | Purpose | GKE Risk |
|---|---|---|
| Upgrade: websocket | Initiate WebSocket upgrade | May be stripped |
| Connection: Upgrade | Request the protocol switch | May be modified |
| Sec-WebSocket-Key | WebSocket handshake | Must be preserved |
| Sec-WebSocket-Version | Protocol version | Must be preserved |
| Origin | CORS validation | May be modified |
| Host | Virtual host routing | May be modified |
| X-Forwarded-For | Client IP (for affinity) | Added by GKE |
| X-Forwarded-Proto | Original protocol | Added by GKE |

GKE Load Balancer Known Issues

Issue #1: WebSocket Header Stripping

Symptoms: HTTP 400 on WebSocket upgrade requests

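Cause (unverified): The GKE L7 LB may be dropping or altering the Upgrade header before it reaches nginx. Google documents native WebSocket support on its HTTP(S) load balancers with no extra configuration, so header loss needs to be confirmed rather than assumed.

Detection:

# Compare headers received by backend (only useful if the nginx access log
# records $http_upgrade, as in the log format under Q1)
kubectl logs -n coditect-app -l app=coditect-combined --tail=100 | \
grep -i "upgrade"

Fix: There is no load balancer switch that enables WebSocket support. If the logs confirm header loss, capture the full request at nginx (Q1) before changing the Ingress or BackendConfig.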


Issue #2: Session Affinity Mismatch

Symptoms: Different pods handle handshake vs upgrade

Cause: Session affinity configured at Service level but not BackendConfig

Detection:

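# Check if requests hit different pods
# (--prefix tags each line with its pod; $2 is the peer address in the default nginx log format)
kubectl logs -n coditect-app -l app=coditect-combined --tail=200 --prefix | \
grep "socket.io" | \
awk '{print $1, $2}' | \
sort | uniq -c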
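
Once the log lines are reduced to (client, pod) pairs, the affinity check is a simple grouping. The sketch below assumes those pairs have already been extracted; the pod names and addresses in the usage note are hypothetical. Behind the load balancer the client address would normally come from X-Forwarded-For, not from the connection's peer address.

```python
from collections import defaultdict


def clients_on_multiple_pods(records):
    """Given (client_ip, pod_name) pairs, return clients served by more than one pod."""
    pods_by_client = defaultdict(set)
    for client_ip, pod_name in records:
        pods_by_client[client_ip].add(pod_name)
    # Keep only clients whose requests were spread across pods.
    return {
        client: sorted(pods)
        for client, pods in pods_by_client.items()
        if len(pods) > 1
    }
```

For example, `clients_on_multiple_pods([("203.0.113.7", "pod-a"), ("203.0.113.7", "pod-b")])` reports that client against both pods, which is the pattern Issue #2 predicts.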

Fix: Add affinity to BackendConfig:

spec:
  sessionAffinity:
    affinityType: "CLIENT_IP"
    affinityCookieTtlSec: 10800 # only takes effect with GENERATED_COOKIE affinity

Issue #3: Health Check Interference

Symptoms: Intermittent 400 errors during health checks

Cause: The /health endpoint targeted by the health check may not exist, which would cause the backend to be marked unhealthy

Detection:

# Check health endpoint
kubectl exec -n coditect-app deployment/coditect-combined -- \
curl -s -o /dev/null -w "%{http_code}" http://localhost/health

Fix: Update health check path or create endpoint


Issue #4: Connection Draining During Updates

Symptoms: Socket.IO disconnects during deployments

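Cause: Connection draining gives existing connections up to 300s to finish after a pod is removed from the backend; any Socket.IO connection still open when that window ends is closed by the load balancer

Detection: Check deployment logs during rolling updates

Fix: Reduce the draining timeout to 30s or disable it only if fast rollouts matter more than session continuity; a shorter window closes sessions sooner, so clients must be able to reconnect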


Issue #5: Backend Timeout Insufficiency

Symptoms: Long-lived WebSocket connections terminate after 1 hour

Cause: For WebSocket traffic the backend service timeout caps total connection lifetime, idle or not, so timeoutSec: 3600 ends every session at 1 hour

Detection: Monitor connection duration metrics

Fix: Increase to 86400 (24 hours)


Request Lifecycle Comparison

Internal Request (✅ Working)

curl 'http://localhost/theia/socket.io/?EIO=4&transport=polling'

Step 1: curl → nginx:80
Headers: {Host: localhost, ...}

Step 2: nginx → theia:3000
Rewrite: /theia/socket.io/ → /socket.io/
Headers: {Upgrade: ✅, Connection: ✅, Host: localhost}

Step 3: theia response
Status: 200 OK
Body: 0{"sid":"...","upgrades":["websocket"]}

External Request (❌ Failing)

Browser → https://coditect.ai/theia/socket.io/?EIO=4&transport=polling

Step 1: Browser → GKE LB:443 (TLS termination)
Headers: {Host: coditect.ai, Origin: https://coditect.ai, ...}

Step 2: GKE LB → nginx:80
Headers: {
Host: coditect.ai OR modified ❓
X-Forwarded-For: <client_ip> ✅
X-Forwarded-Proto: https ✅
Upgrade: ??? (may be stripped ⚠️)
Connection: ??? (may be modified ⚠️)
}

Step 3: nginx → theia:3000
Rewrite: /theia/socket.io/ → /socket.io/
Headers: Forwarded from GKE (potentially corrupted ⚠️)

Step 4: theia response
Status: 400 Bad Request ❌
Reason: Invalid headers or missing required fields
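
The comparison above comes down to diffing the headers nginx sees on the internal path against those it sees on the external path. A minimal sketch of that diff follows, assuming both header sets have been captured as dictionaries; the list of headers treated as expected load balancer additions is an assumption and should be adjusted to what is actually observed.

```python
# Headers the load balancer is expected to add; treated as noise (assumed list).
EXPECTED_LB_HEADERS = frozenset(
    {"x-forwarded-for", "x-forwarded-proto", "via", "x-cloud-trace-context"}
)


def diff_headers(internal: dict, external: dict, ignore=EXPECTED_LB_HEADERS) -> dict:
    """Return {header: (internal_value, external_value)} for headers that differ."""
    # Header names are case-insensitive, so normalize before comparing.
    a = {name.lower(): value for name, value in internal.items()}
    b = {name.lower(): value for name, value in external.items()}
    changed = {}
    for name in sorted(set(a) | set(b)):
        if name in ignore:
            continue
        if a.get(name) != b.get(name):
            changed[name] = (a.get(name), b.get(name))
    return changed
```

A `None` on the external side of an entry means the header never arrived, which is the signature of stripping; two different values point to modification.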

Hypothesis Ranking

| # | Hypothesis | Probability | Impact | Test Priority |
|---|---|---|---|---|
| 1 | GKE LB strips WebSocket Upgrade headers | 85% | Critical | 🔥 P0 |
| 2 | Session affinity not configured at LB level | 60% | High | 🔥 P0 |
| 3 | Health check endpoint /health returns 404 | 40% | Medium | 🟡 P1 |
| 4 | Origin/Host header corruption | 30% | Medium | 🟡 P1 |
| 5 | Connection draining kills active sessions | 20% | Low | 🟢 P2 |
| 6 | Backend timeout too short | 10% | Low | 🟢 P2 |

Key Questions to Answer

Q1: What headers does theia actually receive?

Method: Add request logging to nginx

log_format socket_io_debug '$remote_addr - $remote_user [$time_local] '
'"$request" $status $body_bytes_sent '
'"$http_referer" "$http_user_agent" '
'upgrade=$http_upgrade connection=$http_connection '
'host=$http_host origin=$http_origin';

access_log /var/log/nginx/socket_io_debug.log socket_io_debug;

Test:

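# Trigger external request, then read the dedicated log file
# (kubectl logs only shows it if the file is linked to stdout):
kubectl exec -n coditect-app deployment/coditect-combined -- \
tail -n 50 /var/log/nginx/socket_io_debug.log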
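
Lines written with the socket_io_debug format can be scanned automatically for upgrade requests that arrived without their Upgrade header. The sketch below assumes the exact field order of the log_format shown under Q1 and that nginx writes "-" for an absent header; the function names are illustrative.

```python
import re

# Matches the tail of a line written with the socket_io_debug log_format.
LINE_RE = re.compile(
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \d+ "[^"]*" "[^"]*" '
    r"upgrade=(?P<upgrade>.*?) connection=(?P<connection>.*?) "
    r"host=(?P<host>.*?) origin=(?P<origin>.*)$"
)


def parse_debug_line(line: str):
    """Return the logged request fields as a dict, or None if the line does not match."""
    match = LINE_RE.search(line.strip())
    return match.groupdict() if match else None


def upgrade_header_lost(entry: dict) -> bool:
    """True when a websocket-transport request arrived without 'Upgrade: websocket'."""
    wants_websocket = "transport=websocket" in entry["request"]
    return wants_websocket and entry["upgrade"].lower() != "websocket"
```

Polling requests never carry an Upgrade header, so only entries whose request line contains `transport=websocket` are flagged.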

Q2: Does the health check endpoint exist?

Test:

kubectl exec -n coditect-app deployment/coditect-combined -- \
curl -v http://localhost/health

Expected: 200 OK if the endpoint exists; 404 Not Found would support hypothesis 3


Q3: Is session affinity working at GKE level?

Test: Make multiple requests, check if same backend pod handles them

// In browser console. Assumes the backend adds an x-pod-name response header;
// without it, the last value logs as null.
for (let i = 0; i < 10; i++) {
  fetch('/theia/socket.io/?EIO=4&transport=polling')
    .then(r => console.log(i, r.status, r.headers.get('x-pod-name')));
}

Q4: What does GKE backend service configuration show?

Test:

# Get backend service name
BACKEND=$(kubectl get ingress -n coditect-app coditect-production-ingress \
-o jsonpath='{.metadata.annotations.ingress\.kubernetes\.io/backends}' | \
jq -r 'keys[0]')

# Describe backend service
gcloud compute backend-services describe "$BACKEND" --global \
--format="yaml(timeoutSec,sessionAffinity,affinityCookieTtlSec,connectionDraining,healthChecks)"
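
The described configuration can be checked against the settings this investigation cares about. This is a sketch under assumptions: it expects the backend service as a dictionary (for instance, the same describe command with `--format=json`, parsed with `json.loads`), and the one-hour minimum timeout is taken from the current BackendConfig, not from any Google recommendation.

```python
def check_backend_service(cfg: dict, min_timeout_sec: int = 3600) -> list:
    """Return human-readable findings for a backend service description."""
    findings = []
    # Affinity set only on the Kubernetes Service does not reach the load balancer.
    if cfg.get("sessionAffinity", "NONE") == "NONE":
        findings.append("no session affinity at load balancer level")
    # Backend services default to a 30 second timeout when none is set.
    timeout = cfg.get("timeoutSec", 30)
    if timeout < min_timeout_sec:
        findings.append(f"timeoutSec {timeout} is below {min_timeout_sec}")
    if not cfg.get("healthChecks"):
        findings.append("no health check attached")
    return findings
```

An empty list means none of these three checks fired; it does not prove the load balancer is configured correctly for Socket.IO.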

Q5: Are WebSocket connections even reaching nginx?

Test: Enable nginx debug logging temporarily (the debug level only produces output if nginx was built with --with-debug)

kubectl exec -n coditect-app deployment/coditect-combined -- \
sed -i 's/error_log.*/error_log \/var\/log\/nginx\/error.log debug;/' /etc/nginx/nginx.conf

kubectl exec -n coditect-app deployment/coditect-combined -- \
nginx -s reload

# Trigger request, then check logs
kubectl logs -n coditect-app -l app=coditect-combined --tail=100 | \
grep -i "websocket\|upgrade"

Next Steps

  1. Immediate: Run diagnostic test suite (see investigation-runbook.md)
  2. Short-term: Implement header logging in nginx
  3. Medium-term: Configure BackendConfig for WebSocket support
  4. Long-term: Add comprehensive observability for Socket.IO connections

Success Criteria

  • Socket.IO handshake succeeds from external browser
  • WebSocket upgrade completes successfully
  • Long-lived connections survive >1 hour
  • Connections survive pod rolling updates
  • Session affinity maintains connection to same pod
  • Health checks don't interfere with active connections

References