Socket.IO 400 Error Analysis: GKE Load Balancer Investigation
Executive Summary
Problem: Socket.IO connections fail with HTTP 400 errors when accessed through GKE Ingress, despite working perfectly within the cluster.
Impact: WebSocket-dependent features (Theia IDE real-time updates, terminal sessions) are non-functional for external users.
Root Cause Hypothesis: GKE L7 Load Balancer is corrupting or rejecting Socket.IO handshake requests due to improper WebSocket configuration, health check interference, or header transformation.
Evidence Chain
✅ Working Components
- Direct Socket.IO Access (Port 3000)
  localhost:3000/socket.io/?EIO=4&transport=polling
  → HTTP 200 OK
  → Valid handshake response
- Nginx Proxy Layer (Port 80)
  localhost/theia/socket.io/?EIO=4&transport=polling
  → HTTP 200 OK
  → Proper path rewriting
  → Headers preserved
- Service Configuration
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800 # 3 hours
❌ Failing Component
- External Access Through GKE Ingress
  https://coditect.ai/theia/socket.io/?EIO=4&transport=polling
  → HTTP 400 Bad Request
  → Occurs AFTER passing nginx
  → Only fails from external clients
Architecture Layers
┌─────────────────────────────────────────────────────────────┐
│ Browser │
│ ↓ GET /theia/socket.io/?EIO=4&transport=polling │
└───────────────────────────────────────┬─────────────────────┘
│
↓ HTTPS (TLS termination)
┌─────────────────────────────────────────────────────────────┐
│ GKE L7 Load Balancer (Google Cloud) │
│ • Ingress Controller: coditect-production-ingress │
│ • IP: 34.8.51.57 │
│ • BackendConfig: coditect-backend-config │
│ - timeoutSec: 3600 │
│ - connectionDraining: 300s │
│ - healthCheck: /health endpoint │
│ • Issues: │
│ ⚠️ May strip/modify WebSocket headers │
│ ⚠️ Health checks may interfere with connections │
│ ⚠️ Session affinity not configured at LB level │
└───────────────────────────────────────┬─────────────────────┘
│
↓ HTTP (internal)
┌─────────────────────────────────────────────────────────────┐
│ Kubernetes Service: coditect-combined-service │
│ • Type: ClusterIP │
│ • SessionAffinity: ClientIP (10800s) │
│ • Port: 80 → targetPort: 80 │
│ ✅ Working correctly │
└───────────────────────────────────────┬─────────────────────┘
│
↓ HTTP
┌─────────────────────────────────────────────────────────────┐
│ Pod: coditect-combined │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Nginx (Port 80) │ │
│ │ • location ~ ^/theia/socket\.io/ │ │
│ │ • proxy_pass http://localhost:3000 │ │
│ │ • Upgrade/Connection headers configured │ │
│ │ ✅ Working correctly │ │
│ └────────────────────────────┬────────────────────────────┘ │
│ │ │
│ ↓ HTTP │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ theia IDE (Port 3000) │ │
│ │ • Socket.IO Server: /socket.io/ │ │
│ │ • Version: EIO 4 │ │
│ │ ✅ Working correctly │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Failure Point: Between GKE Load Balancer and Pod
Socket.IO Protocol Requirements
Handshake Sequence
1. Client → Server: GET /socket.io/?EIO=4&transport=polling
Headers:
- Host: coditect.ai
- Origin: https://coditect.ai
- Cookie: io=<session_id> (if reconnecting)
2. Server → Client: HTTP 200 OK
Body: 0{"sid":"<session_id>","upgrades":["websocket"],"pingInterval":25000,"pingTimeout":5000}
3. Client → Server: GET /socket.io/?EIO=4&transport=websocket&sid=<session_id>
Headers:
- Upgrade: websocket
- Connection: Upgrade
- Sec-WebSocket-Key: <key>
4. Server → Client: HTTP 101 Switching Protocols
Headers:
- Upgrade: websocket
- Connection: Upgrade
- Sec-WebSocket-Accept: <accept>
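The handshake steps above can be checked offline with a small helper. This is a minimal sketch, not part of the deployment: the function names `parse_open_packet` and `websocket_accept` are hypothetical. It parses the Engine.IO "open" packet from step 2 and computes the `Sec-WebSocket-Accept` value that step 4 should return for a given `Sec-WebSocket-Key`, following RFC 6455.

```python
import base64
import hashlib
import json

# Fixed GUID defined by RFC 6455 for deriving Sec-WebSocket-Accept.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def parse_open_packet(body: str) -> dict:
    """Parse an Engine.IO v4 'open' packet: the digit 0 followed by JSON."""
    if not body.startswith("0"):
        raise ValueError(f"not an open packet: {body[:20]!r}")
    return json.loads(body[1:])


def websocket_accept(key: str) -> str:
    """Return base64(SHA-1(key + GUID)), the value the server must echo back."""
    digest = hashlib.sha1((key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


if __name__ == "__main__":
    body = '0{"sid":"abc123","upgrades":["websocket"],"pingInterval":25000,"pingTimeout":5000}'
    packet = parse_open_packet(body)
    print("sid:", packet["sid"], "upgrades:", packet["upgrades"])
    print("accept:", websocket_accept("dGhlIHNhbXBsZSBub25jZQ=="))
```

If the polling response captured through the Ingress fails `parse_open_packet`, the handshake broke before any WebSocket upgrade was attempted, which narrows down which layer to inspect.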
Critical Headers for Socket.IO
| Header | Purpose | GKE Risk |
|---|---|---|
| Upgrade: websocket | Initiate WebSocket upgrade | May be stripped |
| Connection: Upgrade | Signal the protocol switch | May be modified |
| Sec-WebSocket-Key | WebSocket handshake | Must be preserved |
| Sec-WebSocket-Version | Protocol version | Must be preserved |
| Origin | CORS validation | May be modified |
| Host | Virtual host routing | May be modified |
| X-Forwarded-For | Client IP (for affinity) | Added by GKE |
| X-Forwarded-Proto | Original protocol | Added by GKE |
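To turn the table into a concrete check, the headers a client sent can be compared with the headers the backend logged. The sketch below is illustrative only: `diff_headers` is a hypothetical helper, and both header dictionaries are assumed to have been captured separately (for example from the browser's network tab and from backend request logging).

```python
def diff_headers(sent: dict, received: dict, watched: list) -> dict:
    """Classify each watched header as preserved, modified, stripped, added, or absent."""
    # HTTP header names are case-insensitive, so normalise before comparing.
    sent_l = {name.lower(): value for name, value in sent.items()}
    recv_l = {name.lower(): value for name, value in received.items()}
    report = {}
    for name in watched:
        key = name.lower()
        in_sent, in_recv = key in sent_l, key in recv_l
        if in_sent and not in_recv:
            report[name] = "stripped"
        elif in_recv and not in_sent:
            report[name] = "added"
        elif not in_sent:
            report[name] = "absent"
        elif sent_l[key] == recv_l[key]:
            report[name] = "preserved"
        else:
            report[name] = "modified"
    return report


if __name__ == "__main__":
    sent = {"Host": "coditect.ai", "Upgrade": "websocket", "Connection": "Upgrade"}
    received = {"host": "coditect.ai", "connection": "keep-alive", "x-forwarded-proto": "https"}
    watched = ["Upgrade", "Connection", "Host", "X-Forwarded-Proto", "Origin"]
    for header, verdict in diff_headers(sent, received, watched).items():
        print(f"{header}: {verdict}")
```

A "stripped" or "modified" verdict for Upgrade or Connection would support the header-transformation hypothesis; "preserved" across the board would point elsewhere.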
GKE Load Balancer Known Issues
Issue #1: WebSocket Header Stripping
Symptoms: HTTP 400 on WebSocket upgrade requests
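Cause (unverified): The GKE L7 LB may not forward Upgrade headers to the backend. The failing request above uses transport=polling, which carries no Upgrade header, so header stripping alone would not explain a 400 on the initial handshake.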
Detection:
# Compare headers received by backend
kubectl logs -n coditect-app -l app=coditect-combined --tail=100 | \
grep -i "upgrade"
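Fix: None confirmed yet. Google Cloud HTTP(S) load balancers proxy WebSocket upgrades natively, so confirm from backend logs that the headers are actually missing before changing configuration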
Issue #2: Session Affinity Mismatch
Symptoms: Different pods handle handshake vs upgrade
Cause: Session affinity configured at Service level but not BackendConfig
Detection:
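# Check if requests hit different pods.
# --prefix puts the pod name in field 1; with nginx's default combined
# log format, field 10 is then the HTTP status.
kubectl logs -n coditect-app -l app=coditect-combined --prefix --tail=200 | \
  grep "socket.io" | \
  awk '{print $1, $10}' | \
  sort | uniq -c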
Fix: Add affinity to BackendConfig:
spec:
  sessionAffinity:
    affinityType: "CLIENT_IP"
    affinityCookieTtlSec: 10800 # only used when affinityType is "GENERATED_COOKIE"
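An affinity mismatch shows up as one Socket.IO session ID being served by more than one pod. The sketch below assumes you have already extracted (pod name, request path) pairs from the logs, for instance from `kubectl logs --prefix` output; the function name and the sample pod names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import parse_qs, urlsplit


def sids_split_across_pods(records):
    """Return {sid: [pods]} for every session ID seen on more than one pod.

    records: iterable of (pod_name, request_path) tuples.
    """
    pods_by_sid = defaultdict(set)
    for pod, path in records:
        # The initial handshake has no sid yet, so such requests are skipped.
        sid = parse_qs(urlsplit(path).query).get("sid", [None])[0]
        if sid is not None:
            pods_by_sid[sid].add(pod)
    return {sid: sorted(pods) for sid, pods in pods_by_sid.items() if len(pods) > 1}


if __name__ == "__main__":
    sample = [
        ("pod-a", "/theia/socket.io/?EIO=4&transport=polling"),
        ("pod-a", "/theia/socket.io/?EIO=4&transport=polling&sid=abc"),
        ("pod-b", "/theia/socket.io/?EIO=4&transport=websocket&sid=abc"),
        ("pod-a", "/theia/socket.io/?EIO=4&transport=polling&sid=xyz"),
    ]
    print(sids_split_across_pods(sample))
```

An empty result over a representative log window suggests affinity is holding; any entry means a follow-up request landed on a pod that never issued that session.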
Issue #3: Health Check Interference
Symptoms: Intermittent 400 errors during health checks
Cause: The health check targets /health, which may not exist; failed checks would cause the backend to be marked unhealthy
Detection:
# Check health endpoint
kubectl exec -n coditect-app deployment/coditect-combined -- \
curl -s -o /dev/null -w "%{http_code}" http://localhost/health
Fix: Update health check path or create endpoint
Issue #4: Connection Draining During Updates
Symptoms: Socket.IO disconnects during deployments
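Cause: When a pod is removed, connections still open after the 300s draining timeout are closed by the load balancer
Detection: Check deployment logs during rolling updates
Fix: Size the draining timeout to how long sessions need to wind down, and rely on Socket.IO client reconnection for connections that outlive it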
Issue #5: Backend Timeout Insufficiency
Symptoms: Long-lived WebSocket connections terminate after 1 hour
Cause: timeoutSec: 3600 (1 hour) is borderline for long sessions
Detection: Monitor connection duration metrics
Fix: Increase to 86400 (24 hours)
Request Lifecycle Comparison
Internal Request (✅ Working)
curl 'http://localhost/theia/socket.io/?EIO=4&transport=polling'
Step 1: curl → nginx:80
Headers: {Host: localhost, ...}
Step 2: nginx → theia:3000
Rewrite: /theia/socket.io/ → /socket.io/
Headers: {Upgrade: ✅, Connection: ✅, Host: localhost}
Step 3: theia response
Status: 200 OK
Body: 0{"sid":"...","upgrades":["websocket"]}
External Request (❌ Failing)
Browser → https://coditect.ai/theia/socket.io/?EIO=4&transport=polling
Step 1: Browser → GKE LB:443 (TLS termination)
Headers: {Host: coditect.ai, Origin: https://coditect.ai, ...}
Step 2: GKE LB → nginx:80
Headers: {
Host: coditect.ai OR modified ❓
X-Forwarded-For: <client_ip> ✅
X-Forwarded-Proto: https ✅
Upgrade: ??? (may be stripped ⚠️)
Connection: ??? (may be modified ⚠️)
}
Step 3: nginx → theia:3000
Rewrite: /theia/socket.io/ → /socket.io/
Headers: Forwarded from GKE (potentially corrupted ⚠️)
Step 4: theia response
Status: 400 Bad Request ❌
Reason: Invalid headers or missing required fields
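The body of the 400 response can help tell which layer produced it. The Engine.IO server reports errors as a small JSON object with a numeric code, while a proxy or load balancer typically returns HTML or plain text. The sketch below is an assumption-laden aid, not a diagnosis: `classify_400` is a hypothetical helper, and the code table reflects the Engine.IO server's published error codes, which should be checked against the version Theia bundles.

```python
import json

# Error codes as published for the Engine.IO server (verify against your version).
ENGINE_IO_ERRORS = {
    0: "Transport unknown",
    1: "Session ID unknown",
    2: "Bad handshake method",
    3: "Bad request",
    4: "Forbidden",
    5: "Unsupported protocol version",
}


def classify_400(body: str) -> str:
    """Guess which layer produced a 400 based on the response body."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return "non-JSON body: probably generated by a proxy or load balancer"
    code = payload.get("code") if isinstance(payload, dict) else None
    if isinstance(code, int) and code in ENGINE_IO_ERRORS:
        return f"Engine.IO error {code}: {ENGINE_IO_ERRORS[code]}"
    return "JSON body without a known Engine.IO error code"


if __name__ == "__main__":
    print(classify_400('{"code":1,"message":"Session ID unknown"}'))
    print(classify_400("<html><body>400 Bad Request</body></html>"))
```

Under these assumptions, "Session ID unknown" would be consistent with the affinity hypothesis, whereas a non-JSON body would suggest the request never reached Theia.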
Hypothesis Ranking
| # | Hypothesis | Probability | Impact | Test Priority |
|---|---|---|---|---|
| 1 | GKE LB strips WebSocket Upgrade headers | 85% | Critical | 🔥 P0 |
| 2 | Session affinity not configured at LB level | 60% | High | 🔥 P0 |
| 3 | Health check endpoint /health returns 404 | 40% | Medium | 🟡 P1 |
| 4 | Origin/Host header corruption | 30% | Medium | 🟡 P1 |
| 5 | Connection draining kills active sessions | 20% | Low | 🟢 P2 |
| 6 | Backend timeout too short | 10% | Low | 🟢 P2 |
Key Questions to Answer
Q1: What headers does theia actually receive?
Method: Add request logging to nginx
log_format socket_io_debug '$remote_addr - $remote_user [$time_local] '
'"$request" $status $body_bytes_sent '
'"$http_referer" "$http_user_agent" '
'upgrade=$http_upgrade connection=$http_connection '
'host=$http_host origin=$http_origin';
access_log /var/log/nginx/socket_io_debug.log socket_io_debug;
Test:
# Trigger external request, then:
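# The access_log directive above writes to a file, not stdout, so read the file directly:
kubectl exec -n coditect-app deployment/coditect-combined -- \
  tail -n 50 /var/log/nginx/socket_io_debug.log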
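Once lines in the socket_io_debug format are available, they can be scanned for WebSocket requests that arrived without an Upgrade header. This is a minimal sketch that assumes the exact log_format shown above and relies on nginx writing "-" for an empty variable; the function names are hypothetical.

```python
import re

# Matches the tail of a line written with the socket_io_debug log_format.
LINE_RE = re.compile(
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \d+ '
    r'"[^"]*" "[^"]*" '
    r"upgrade=(?P<upgrade>.*?) connection=(?P<connection>.*?) "
    r"host=(?P<host>.*?) origin=(?P<origin>.*)$"
)


def parse_line(line: str):
    """Return the captured fields as a dict, or None if the line does not match."""
    match = LINE_RE.search(line.rstrip("\n"))
    return match.groupdict() if match else None


def stripped_upgrades(lines):
    """Return entries for websocket-transport requests logged with no Upgrade header."""
    hits = []
    for line in lines:
        entry = parse_line(line)
        if entry and "transport=websocket" in entry["request"] and entry["upgrade"] == "-":
            hits.append(entry)
    return hits


if __name__ == "__main__":
    sample = (
        '203.0.113.7 - - [01/Jan/2025:00:00:00 +0000] '
        '"GET /theia/socket.io/?EIO=4&transport=websocket&sid=abc HTTP/1.1" 400 11 '
        '"-" "Mozilla/5.0" upgrade=- connection=keep-alive '
        'host=coditect.ai origin=https://coditect.ai'
    )
    for hit in stripped_upgrades([sample]):
        print(hit["status"], hit["request"], "connection =", hit["connection"])
```

Any hit is direct evidence that the Upgrade header was lost before nginx; no hits across failing requests would weaken that hypothesis.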
Q2: Does the health check endpoint exist?
Test:
kubectl exec -n coditect-app deployment/coditect-combined -- \
curl -v http://localhost/health
Expected: 200 OK if the endpoint exists; 404 Not Found means the health check is probing a missing path
Q3: Is session affinity working at GKE level?
Test: Make multiple requests, check if same backend pod handles them
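// In browser console. Assumes the backend adds an x-pod-name response header;
// nothing shown in this document configures one, so it may log null.
for (let i = 0; i < 10; i++) {
  fetch('/theia/socket.io/?EIO=4&transport=polling')
    .then(r => console.log(i, r.status, r.headers.get('x-pod-name')));
}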
Q4: What does GKE backend service configuration show?
Test:
# Get backend service name
BACKEND=$(kubectl get ingress -n coditect-app coditect-production-ingress \
-o jsonpath='{.metadata.annotations.ingress\.kubernetes\.io/backends}' | \
jq -r 'keys[0]')
# Describe backend service
gcloud compute backend-services describe "$BACKEND" --global \
  --format="yaml(timeoutSec,sessionAffinity,affinityCookieTtlSec,connectionDraining,healthChecks)"
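The described backend service can be checked against the expectations in this document automatically. The sketch below assumes the same describe command is run with `--format=json` and the result is loaded into a dict; `audit_backend_service` and its default threshold are hypothetical, with the 3600-second floor taken from the BackendConfig value quoted earlier.

```python
import json


def audit_backend_service(cfg: dict, min_timeout_sec: int = 3600) -> list:
    """Return human-readable findings for a backend service description."""
    findings = []
    # A missing field is treated the same as the API default of NONE.
    if cfg.get("sessionAffinity", "NONE") == "NONE":
        findings.append("sessionAffinity is NONE at the load balancer")
    timeout = cfg.get("timeoutSec")
    if timeout is None or timeout < min_timeout_sec:
        findings.append(f"timeoutSec={timeout} is below {min_timeout_sec}")
    if not cfg.get("healthChecks"):
        findings.append("no health check attached")
    return findings


if __name__ == "__main__":
    described = json.loads(
        '{"timeoutSec": 30, "sessionAffinity": "NONE",'
        ' "connectionDraining": {"drainingTimeoutSec": 300},'
        ' "healthChecks": ["https://example.invalid/healthChecks/hc"]}'
    )
    for finding in audit_backend_service(described):
        print("-", finding)
```

If the live values differ from what the BackendConfig declares, the BackendConfig is likely not attached to the Service that the Ingress routes to.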
Q5: Are WebSocket connections even reaching nginx?
Test: Enable nginx debug logging temporarily
kubectl exec -n coditect-app deployment/coditect-combined -- \
sed -i 's/error_log.*/error_log \/var\/log\/nginx\/error.log debug;/' /etc/nginx/nginx.conf
kubectl exec -n coditect-app deployment/coditect-combined -- \
nginx -s reload
# Trigger request, then check logs
kubectl logs -n coditect-app -l app=coditect-combined --tail=100 | \
grep -i "websocket\|upgrade"
Next Steps
- Immediate: Run diagnostic test suite (see investigation-runbook.md)
- Short-term: Implement header logging in nginx
- Medium-term: Configure BackendConfig for WebSocket support
- Long-term: Add comprehensive observability for Socket.IO connections
Success Criteria
- Socket.IO handshake succeeds from external browser
- WebSocket upgrade completes successfully
- Long-lived connections survive >1 hour
- Connections survive pod rolling updates
- Session affinity maintains connection to same pod
- Health checks don't interfere with active connections