Socket.IO Diagnostic Decision Tree

START: Socket.IO returns 400 from browser

├─ Q1: Does Socket.IO work internally?
│ │ Test: kubectl exec <pod> -- curl "localhost:3000/socket.io/?EIO=4&transport=polling"
│ │
│ ├─ NO → Issue in theia/Node.js configuration
│ │ └─ Check: theia Socket.IO server settings
│ │ - Port binding
│ │ - CORS configuration
│ │ - Socket.IO version compatibility
│ │
│ └─ YES → Continue to Q2

├─ Q2: Does Socket.IO work through nginx internally?
│ │ Test: kubectl exec <pod> -- curl "localhost/theia/socket.io/?EIO=4&transport=polling"
│ │
│ ├─ NO → Issue in nginx proxy configuration
│ │ └─ Check: nginx location blocks
│ │ - Path rewriting
│ │ - Upgrade/Connection headers
│ │ - Proxy timeout settings
│ │
│ └─ YES → Continue to Q3 ⚠️ [CURRENT STATE]

├─ Q3: What HTTP status code from external request?
│ │ Test: curl -i "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling"
│ │
│ ├─ 502/503/504 → Backend connectivity issue
│ │ └─ Check: GKE backend health
│ │ - Health check endpoint
│ │ - Pod readiness
│ │ - Backend service configuration
│ │
│ ├─ 404 → Path routing issue
│ │ └─ Check: Ingress path rules
│ │ - Path prefix matching
│ │ - Service backend mapping
│ │
│ ├─ 400 → Request validation failure ⚠️ [CURRENT]
│ │ └─ Continue to Q4
│ │
│ └─ 401/403 → Authentication/Authorization issue
│ └─ Check: Ingress auth annotations

├─ Q4: Are WebSocket headers present in request?
│ │ Test: Browser DevTools → Network → Headers
│ │
│ ├─ NO → Client not sending WebSocket upgrade
│ │ └─ Check: Socket.IO client configuration
│ │ - Transport settings
│ │ - Browser compatibility
│ │ - JavaScript errors
│ │
│ └─ YES → Continue to Q5

├─ Q5: Do WebSocket headers reach nginx?
│ │ Test: nginx access logs with header logging
│ │
│ ├─ NO → GKE Load Balancer stripping headers ⚠️ [LIKELY ROOT CAUSE]
│ │ └─ FIX BRANCH A: GKE Header Issues
│ │ │
│ │ ├─ A1: Add WebSocket support annotation
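│ │ │ Action: cloud.google.com/websocket-max-idle-timeout="86400"
│ │ │ Check: Confirm the annotation exists in the GKE Ingress docs first;
│ │ │ the GKE LB proxies WebSocket natively, bounded by timeoutSec (A2)
│ │ │ Impact: HIGH if the annotation takes effect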
│ │ │ Risk: LOW
│ │ │
│ │ ├─ A2: Update BackendConfig timeout
│ │ │ Action: timeoutSec: 86400
│ │ │ Impact: MEDIUM - Prevents premature disconnects
│ │ │ Risk: LOW
│ │ │
│ │ └─ A3: Verify TLS backend protocol
│ │ Action: Ensure backend uses HTTP, not HTTPS
│ │ Impact: HIGH - Protocol mismatch causes issues
│ │ Risk: LOW
│ │
│ └─ YES → Headers reach nginx → Continue to Q6

├─ Q6: Does nginx forward headers to theia?
│ │ Test: theia application logs / tcpdump
│ │
│ ├─ NO → nginx proxy_set_header misconfiguration
│ │ └─ Check: nginx location block
│ │ - proxy_set_header Upgrade $http_upgrade
│ │ - proxy_set_header Connection $connection_upgrade
│ │ - Missing map $http_upgrade $connection_upgrade
│ │
│ └─ YES → Headers reach theia → Continue to Q7

├─ Q7: Are request headers valid for Socket.IO?
│ │ Test: Compare working vs failing request headers
│ │
│ ├─ Headers corrupted/invalid → GKE transformation issue
│ │ └─ FIX BRANCH B: Header Transformation Issues
│ │ │
│ │ ├─ B1: Check Host header modification
│ │ │ Issue: Host changed from coditect.ai to internal
│ │ │ Fix: Ensure proxy_set_header Host $host
│ │ │
│ │ ├─ B2: Check Origin header
│ │ │ Issue: Origin missing or modified
│ │ │ Fix: Pass Origin through unchanged and set X-Forwarded-* headers
│ │ │
│ │ └─ B3: Check query parameter preservation
│ │ Issue: ?EIO=4&transport=polling stripped
│ │ Fix: Keep the query string in rewrite/proxy_pass rules ($is_args$args or $request_uri)
│ │
│ └─ Headers valid → Continue to Q8

├─ Q8: Does health check endpoint exist?
│ │ Test: curl localhost/health
│ │
│ ├─ NO (404) → Unhealthy backend affecting requests ⚠️ [POSSIBLE]
│ │ └─ FIX BRANCH C: Health Check Issues
│ │ │
│ │ ├─ C1: Create /health endpoint
│ │ │ Action: Add nginx location /health { return 200; }
│ │ │ Impact: MEDIUM - Stabilizes backend health
│ │ │ Risk: LOW
│ │ │
│ │ ├─ C2: Change health check path
│ │ │ Action: Update BackendConfig healthCheck.requestPath
│ │ │ Impact: MEDIUM - Use existing endpoint
│ │ │ Risk: LOW
│ │ │
│ │ └─ C3: Increase unhealthy threshold
│ │ Action: unhealthyThreshold: 10
│ │ Impact: LOW - Prevents premature marking
│ │ Risk: LOW
│ │
│ └─ YES (200) → Health checks passing → Continue to Q9

├─ Q9: Is session affinity configured at GKE level?
│ │ Test: gcloud compute backend-services describe
│ │
│ ├─ NO → Requests hitting different pods ⚠️ [POSSIBLE]
│ │ └─ FIX BRANCH D: Session Affinity Issues
│ │ │
│ │ ├─ D1: Add session affinity to BackendConfig
│ │ │ Action: sessionAffinity.affinityType: CLIENT_IP
│ │ │ Impact: HIGH - Ensures connection persistence
│ │ │ Risk: LOW
│ │ │
│ │ └─ D2: Switch to cookie-based affinity
│ │ Action: affinityType: GENERATED_COOKIE, affinityCookieTtlSec: 10800
│ │ Impact: MEDIUM - Alternative to CLIENT_IP (one affinity type per backend)
│ │ Risk: LOW
│ │
│ └─ YES → Session affinity configured → Continue to Q10

├─ Q10: Are there multiple pods?
│ │ Test: kubectl get pods -l app=coditect-combined
│ │
│ ├─ YES (>1 pod) → Potential session affinity failure
│ │ └─ Check: Do multiple requests hit same pod?
│ │ Test: Make 5 requests, check pod IPs in logs
│ │ │
│ │ ├─ Different pods → Session affinity not working
│ │ │ └─ FIX BRANCH D (above)
│ │ │
│ │ └─ Same pod → Session affinity working → Continue to Q11
│ │
│ └─ NO (1 pod) → Session affinity irrelevant → Continue to Q11

├─ Q11: Does polling transport work but not WebSocket?
│ │ Test: Force polling-only transport in client
│ │
│ ├─ Polling works → WebSocket-specific issue
│ │ └─ FIX BRANCH E: WebSocket Protocol Issues
│ │ │
│ │ ├─ E1: Check Sec-WebSocket-Key validation
│ │ │ Issue: Invalid key format
│ │ │ Fix: Ensure GKE preserves exact key
│ │ │
│ │ ├─ E2: Check WebSocket version
│ │ │ Issue: Version mismatch (need v13)
│ │ │ Fix: Verify Sec-WebSocket-Version: 13
│ │ │
│ │ └─ E3: Check for WebSocket-blocking proxy
│ │ Issue: Corporate firewall/proxy
│ │ Fix: Use polling-only or different port
│ │
│ └─ Polling also fails → Generic Socket.IO issue → Continue to Q12

├─ Q12: Is EIO version compatible?
│ │ Test: Check Socket.IO client/server versions
│ │
│ ├─ Mismatch → Version incompatibility
│ │ └─ Check: theia Socket.IO server version vs client
│ │ Common: EIO 3 (Socket.IO v2) vs EIO 4 (Socket.IO v3+)
│ │ Fix: Upgrade client or server to match
│ │
│ └─ Compatible → Continue to Q13

├─ Q13: Are there CORS errors in browser console?
│ │ Test: Browser DevTools → Console
│ │
│ ├─ YES → CORS configuration issue
│ │ └─ FIX BRANCH F: CORS Issues
│ │ │
│ │ ├─ F1: Add CORS headers in nginx
│ │ │ Headers: Access-Control-Allow-Origin
│ │ │ Impact: HIGH - Allows cross-origin requests
│ │ │ Risk: LOW (if properly scoped)
│ │ │
│ │ └─ F2: Configure Socket.IO CORS
│ │ Action: Update theia Socket.IO server CORS settings
│ │ Impact: HIGH - Server-side CORS validation
│ │ Risk: LOW
│ │
│ └─ NO → No CORS errors → Continue to Q14

├─ Q14: Does the same request work from different networks?
│ │ Test: Try from mobile data, different ISP, VPN
│ │
│ ├─ Works from some networks → Network-specific blocking
│ │ └─ Check: Corporate firewall, ISP restrictions
│ │ - WebSocket ports blocked
│ │ - DPI (Deep Packet Inspection) interfering
│ │ - Try different port or protocol
│ │
│ └─ Fails from all networks → Not network-specific → Q15

└─ Q15: Are there recent deployment changes?
   │ Test: Review git commits, kubectl rollout history
   │
   ├─ YES → Regression introduced
   │  └─ Action: kubectl rollout undo deployment/coditect-combined
   │     Then investigate the change that caused the regression
   │
   └─ NO → Check for external factors
      └─ Final checks:
         - GKE cluster version upgrade
         - Certificate expiration
         - DNS changes
         - Rate limiting
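
The header checks in Q4, Q5, and FIX BRANCH E can be scripted once a request's headers have been copied out of DevTools or an nginx log. The following is a minimal Python sketch, not part of the runbook: the function name websocket_upgrade_problems is hypothetical, and the rules it applies are the standard WebSocket opening-handshake requirements (Upgrade: websocket, a Connection header containing Upgrade, Sec-WebSocket-Version: 13, and a Sec-WebSocket-Key that is a base64-encoded 16-byte value).

```python
import base64
import binascii


def websocket_upgrade_problems(headers):
    """Return a list of problems with the WebSocket upgrade headers (empty if none)."""
    # HTTP header names are case-insensitive, so normalise them before lookup.
    h = {k.lower(): v.strip() for k, v in headers.items()}
    problems = []

    if h.get("upgrade", "").lower() != "websocket":
        problems.append("Upgrade header missing or not 'websocket'")

    # Connection may carry several tokens, e.g. "keep-alive, Upgrade".
    tokens = [t.strip().lower() for t in h.get("connection", "").split(",")]
    if "upgrade" not in tokens:
        problems.append("Connection header does not include 'Upgrade'")

    if h.get("sec-websocket-version") != "13":
        problems.append("Sec-WebSocket-Version is not 13")

    # The key must decode to exactly 16 bytes (check E1).
    try:
        decoded = base64.b64decode(h.get("sec-websocket-key", ""), validate=True)
    except (binascii.Error, ValueError):
        decoded = b""
    if len(decoded) != 16:
        problems.append("Sec-WebSocket-Key is not a base64-encoded 16-byte value")

    return problems
```

An empty list for the headers the browser sent (Q4) combined with a non-empty list for the headers nginx logged (Q5) would point at the hop in between, which is the case FIX BRANCH A addresses.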

═══════════════════════════════════════════════════════════════

RECOMMENDED INVESTIGATION ORDER (Based on Current Evidence)
═══════════════════════════════════════════════════════════════

Current State:
✅ Q1: YES - Internal Socket.IO works (localhost:3000)
✅ Q2: YES - Through nginx works (localhost/theia/socket.io)
✅ Q3: 400 - External returns Bad Request
⚠️ Q4: UNKNOWN - Need to check browser headers
⚠️ Q5: UNKNOWN - Need nginx header logging

Priority Investigation Path:
┌─────────────────────────────────────────────────────────────┐
│ 1. Q5: Check if headers reach nginx [PHASE 2 of runbook]    │
│    → Add header logging to nginx                            │
│    → Capture failing request headers                        │
│    → Compare with working internal request                  │
│                                                             │
│ 2. Q8: Verify health check endpoint [PHASE 3 of runbook]    │
│    → curl localhost/health                                  │
│    → Check GKE backend health status                        │
│                                                             │
│ 3. Q9: Check session affinity [PHASE 3 of runbook]          │
│    → gcloud compute backend-services describe               │
│    → Verify CLIENT_IP affinity configured                   │
│                                                             │
│ 4. Q5 Result Analysis:                                      │
│    IF headers NOT reaching nginx:                           │
│       → FIX BRANCH A: GKE WebSocket configuration           │
│    ELSE IF headers reaching nginx but corrupted:            │
│       → FIX BRANCH B: Header transformation fixes           │
│    ELSE:                                                    │
│       → Continue to Q6-Q7                                   │
└─────────────────────────────────────────────────────────────┘
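
The Q5 result analysis in step 4 amounts to a header comparison followed by a three-way branch. A minimal Python sketch of that logic follows. It assumes two captures are available as plain dictionaries: a reference (what the browser sent, or the working internal request) and what nginx logged for the failing request. The names diff_headers, q5_next_step, and WATCHED are hypothetical, and the watched set is an assumption drawn from the headers the tree mentions.

```python
# Headers the tree cares about: the upgrade set (Q5) plus Host and Origin (B1, B2).
UPGRADE = {"upgrade", "connection", "sec-websocket-key", "sec-websocket-version"}
WATCHED = UPGRADE | {"host", "origin"}


def diff_headers(reference, observed, watched=WATCHED):
    """Return (missing, changed) watched header names, relative to the reference."""
    ref = {k.lower(): v for k, v in reference.items()}
    obs = {k.lower(): v for k, v in observed.items()}
    missing = sorted(k for k in ref if k in watched and k not in obs)
    changed = sorted(k for k in ref if k in watched and k in obs and ref[k] != obs[k])
    return missing, changed


def q5_next_step(missing, changed):
    """Apply the IF / ELSE IF / ELSE from the priority investigation path."""
    if UPGRADE & set(missing):
        return "FIX BRANCH A"       # upgrade headers never reached nginx
    if missing or changed:
        return "FIX BRANCH B"       # headers arrived but were altered or dropped
    return "Continue to Q6-Q7"
```

When the reference is the internal request from Q2, Host is expected to differ (localhost versus coditect.ai), so it is worth passing a watched set without "host" in that case.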

═══════════════════════════════════════════════════════════════

MOST LIKELY ROOT CAUSES (Probability-Ranked; estimates are independent and do not sum to 100%)
═══════════════════════════════════════════════════════════════

1. [85%] GKE Load Balancer stripping WebSocket headers
├─ Symptoms: 400 error, works internally
├─ Fix: Add WebSocket annotation to Ingress
└─ Branch: FIX BRANCH A1

2. [60%] Session affinity not configured at GKE LB level
├─ Symptoms: Intermittent failures, different pod handling requests
├─ Fix: Add sessionAffinity to BackendConfig
└─ Branch: FIX BRANCH D1

3. [40%] Health check endpoint missing (returns 404)
├─ Symptoms: Backend marked unhealthy, requests fail
├─ Fix: Create /health endpoint or change health check path
└─ Branch: FIX BRANCH C1

4. [30%] Host/Origin headers corrupted by GKE
├─ Symptoms: 400 error, CORS issues
├─ Fix: Verify header preservation in nginx
└─ Branch: FIX BRANCH B1

5. [20%] Connection draining killing active sessions
├─ Symptoms: Disconnects during updates
├─ Fix: Reduce connectionDraining.drainingTimeoutSec to 30
└─ Branch: Implicit in BackendConfig update
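
The body of the 400 response can help separate these causes before any configuration is changed. When the 400 is produced by the Engine.IO server behind Socket.IO, it normally carries a small JSON body such as {"code":1,"message":"Session ID unknown"}. The Python sketch below assumes that body format; the codes and messages are Engine.IO's own, while the suggested next step for each code is an interpretation of the tree above, not an official mapping. The names ENGINE_IO_HINTS and explain_400 are hypothetical.

```python
import json

# Engine.IO error codes that accompany a 400, with this sketch's suggested next step.
ENGINE_IO_HINTS = {
    0: ("Transport unknown", "B3: query string (?EIO=4&transport=polling) lost or absent"),
    1: ("Session ID unknown", "FIX BRANCH D: request reached a pod that does not own the session"),
    2: ("Bad handshake method", "Handshake was not a GET; inspect proxy rewrites"),
    3: ("Bad request", "FIX BRANCH A/B: upgrade or Origin headers missing or malformed"),
    5: ("Unsupported protocol version", "Q12: EIO version mismatch"),
}


def explain_400(body):
    """Suggest where to look next, given the body of a 400 response."""
    try:
        message, hint = ENGINE_IO_HINTS[json.loads(body)["code"]]
    except (ValueError, TypeError, KeyError):
        # Not JSON, or no recognised code: the application probably never saw the request.
        return "No recognised Engine.IO error code: the 400 may originate at nginx or the load balancer"
    return f"{message} -> {hint}"
```

A body that is HTML or empty suggests the request was rejected before it reached theia, which fits causes 1 and 4 better than causes 2 and 3.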

═══════════════════════════════════════════════════════════════

NEXT ACTIONS
═══════════════════════════════════════════════════════════════

Execute in order:

1. Run Phase 2 of investigation-runbook.md (Header Analysis)
Time: 10 minutes
Risk: LOW - Read-only diagnostics

2. Run Phase 3 of investigation-runbook.md (GKE Backend Investigation)
Time: 15 minutes
Risk: LOW - Read-only diagnostics

3. Based on results, apply appropriate FIX BRANCH
Time: 5 minutes + 2-5 minute GKE reconciliation
Risk: MEDIUM - Configuration changes

4. Verify with Phase 5 (Live Traffic Testing)
Time: 10 minutes
Risk: LOW - Testing only

Total estimated time: 45-60 minutes
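
For the verification in step 4, the Q3 test can be repeated from a script. This is a rough Python sketch using only the standard library. It assumes the endpoint from Q3 and the EIO=4 polling query string from B3; the function names handshake_url, classify_status, and probe are hypothetical. A request without the query string is rejected by Engine.IO regardless of proxy configuration, so the probe always includes it.

```python
import urllib.error
import urllib.request
from urllib.parse import urlencode


def handshake_url(base, eio=4):
    """Build the Engine.IO polling handshake URL for a Socket.IO endpoint."""
    return base.rstrip("/") + "/?" + urlencode({"EIO": eio, "transport": "polling"})


def classify_status(status):
    """Map an HTTP status code to the matching Q3 branch of the tree."""
    if status == 200:
        return "OK"
    if status in (502, 503, 504):
        return "Backend connectivity issue"
    if status == 404:
        return "Path routing issue"
    if status == 400:
        return "Request validation failure (continue to Q4)"
    if status in (401, 403):
        return "Authentication/Authorization issue"
    return "Not covered by Q3"


def probe(base, timeout=10):
    """Send one handshake request and return the HTTP status code."""
    try:
        with urllib.request.urlopen(handshake_url(base), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is still what we want.
        return err.code
```

Usage, run from outside the cluster: classify_status(probe("https://coditect.ai/theia/socket.io/")). Running it several times in a row also gives a quick signal for intermittent failures of the kind described under session affinity.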