Socket.IO CDN Caching Fix

Date: 2025-10-20
Issue: Socket.IO 400 Bad Request errors preventing theia IDE from loading
Status: ✅ FIXED - CDN disabled

Problem

Socket.IO polling requests were failing with HTTP 400 errors when accessing theia IDE at https://coditect.ai/theia.

Symptoms:

  • Repeated 400 errors on /theia/socket.io/?EIO=4&transport=polling&t=XXX&sid=YYY
  • WebSocket upgrade failures: wss://coditect.ai/theia/socket.io/?EIO=4&transport=websocket&sid=XXX failed
  • theia IDE not loading in browser
  • BUT Socket.IO worked perfectly inside the cluster when tested with curl

Root Cause

GCP Cloud CDN was caching Socket.IO polling responses and serving them to requests that carried different session IDs.

BackendConfig had:

spec:
  cdn:
    cachePolicy:
      includeHost: true
      includeProtocol: true
      includeQueryString: false # ← PROBLEM: Socket.IO query params ignored
    enabled: true # ← CDN caching Socket.IO requests

Why this broke Socket.IO:

  1. Socket.IO uses query parameters for session management: ?sid=XXXXX&t=YYYYY
  2. CDN cache policy had includeQueryString: false → query params were left out of the cache key
  3. All Socket.IO requests got the same cached response regardless of session ID
  4. The cached response contained a stale session ID → 400 Bad Request
  5. WebSocket upgrade also failed because session handshake never completed
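
The collision described in the steps above can be sketched in a few lines of Python. This is a simplified model, not Cloud CDN's actual key derivation: the `cache_key` function is hypothetical and the `sid` values are made up. It only shows why dropping the query string makes two different sessions share one cache entry.

```python
from urllib.parse import urlsplit


def cache_key(url: str, include_query_string: bool) -> str:
    """Build a simplified cache key from protocol, host, path and (optionally) query."""
    parts = urlsplit(url)
    key = f"{parts.scheme}://{parts.netloc}{parts.path}"
    if include_query_string and parts.query:
        key += "?" + parts.query
    return key


# Two polling requests that belong to different Socket.IO sessions.
# The sid values are invented for illustration.
req_a = "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling&sid=AAAA"
req_b = "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling&sid=BBBB"

# includeQueryString: false -> both sessions map to the same cache entry.
collides = cache_key(req_a, False) == cache_key(req_b, False)

# includeQueryString: true -> each session gets its own cache entry.
distinct = cache_key(req_a, True) != cache_key(req_b, True)

print(collides, distinct)  # True True
```

With the query string excluded, both requests reduce to the key `https://coditect.ai/theia/socket.io/`, so whichever response was cached first is returned to every later session.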

Solution

Disabled CDN in BackendConfig to allow Socket.IO session-based polling to work correctly.

File: k8s/backend-config-no-cdn.yaml

apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: coditect-backend-config
  namespace: coditect-app
spec:
  # CDN DISABLED - Socket.IO requires session affinity without caching
  cdn:
    enabled: false

  connectionDraining:
    drainingTimeoutSec: 60

  healthCheck:
    checkIntervalSec: 10
    healthyThreshold: 2
    port: 80
    requestPath: /health
    timeoutSec: 5
    type: HTTP
    unhealthyThreshold: 3

  # Session affinity REQUIRED for Socket.IO
  sessionAffinity:
    affinityCookieTtlSec: 86400 # 24-hour cookie TTL; applies to GENERATED_COOKIE affinity, not CLIENT_IP
    affinityType: CLIENT_IP

  # 24-hour timeout for long-lived WebSocket connections
  timeoutSec: 86400

Applied with:

kubectl apply -f k8s/backend-config-no-cdn.yaml

Verification

# Check CDN is disabled
kubectl get backendconfig -n coditect-app coditect-backend-config -o jsonpath='{.spec.cdn.enabled}'
# Output: false

# Check ingress still references the BackendConfig
kubectl get ingress -n coditect-app coditect-production-ingress -o jsonpath='{.metadata.annotations.cloud\.google\.com/backend-config}'
# Output: {"default": "coditect-backend-config"}

Timeline

Builds #25-31: nginx Configuration (Oct 19-20, 2025)

  • Build #29: Added /bundle.js location block
  • Build #30: Added Connection header mapping for WebSocket
  • Build #31: Added dedicated Socket.IO location block with regex matching
  • Result: Socket.IO worked INSIDE cluster but failed from browser

BackendConfig Fix (Oct 20, 2025)

  • Discovery: CDN caching was the root cause (not nginx)
  • Fix: Disabled CDN in BackendConfig
  • Status: Applied and propagating to GCP load balancer

nginx Configuration (Already Correct)

The nginx configuration in Build #31 was already correct. It properly routes Socket.IO with:

# Socket.IO - MUST come before /theia location for proper matching
location ~ ^/theia/socket\.io/ {
    rewrite ^/theia(.*)$ $1 break;
    proxy_pass http://localhost:3000;
    proxy_http_version 1.1;

    # WebSocket upgrade support
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection $connection_upgrade;

    # Socket.IO requirements
    proxy_buffering off;
    proxy_cache off;

    # Timeouts (24 hours)
    proxy_read_timeout 86400;
    proxy_send_timeout 86400;
}

Tested inside cluster:

# Direct to theia backend - works
curl "http://localhost:3000/socket.io/?EIO=4&transport=polling"
# HTTP/1.1 200 OK
# 0{"sid":"mExhI5ZBcHQt0b0TAASG","upgrades":["websocket"],...}

# Through nginx - works
curl "http://localhost/theia/socket.io/?EIO=4&transport=polling"
# HTTP/1.1 200 OK
# 0{"sid":"VzxUlVmTF3Drtpr7AATQ","upgrades":["websocket"],...}
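
The response bodies above are Engine.IO "open" packets: the character `0` followed by a JSON object that carries the new session ID. A minimal sketch of checking such a body in Python follows. The `parse_open_packet` helper is hypothetical, and the sample string contains only the fields visible in the curl output (the fields elided there are left out).

```python
import json


def parse_open_packet(body: str) -> dict:
    """Parse an Engine.IO 'open' packet: the character '0' followed by JSON."""
    if not body.startswith("0"):
        raise ValueError(f"not an Engine.IO open packet: {body[:20]!r}")
    payload = json.loads(body[1:])
    if "sid" not in payload:
        raise ValueError("open packet has no 'sid' field")
    return payload


# Sample body modeled on the first curl output above, truncated to the
# fields that are actually shown there.
sample = '0{"sid":"mExhI5ZBcHQt0b0TAASG","upgrades":["websocket"]}'
handshake = parse_open_packet(sample)
print(handshake["sid"], handshake["upgrades"])
```

A fresh `sid` on every handshake is what the in-cluster tests showed; a repeated `sid` across independent handshakes would point to a cached response.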

Testing

After the GCP load balancer picks up the BackendConfig change (typically 2-5 minutes):

  1. Visit https://coditect.ai/theia
  2. Open browser DevTools Console
  3. Check for Socket.IO errors:
    • ✅ Should see no 400 errors
    • ✅ Should see successful Socket.IO connection
    • ✅ theia IDE should load
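
The browser check above could be complemented by a scripted one. The sketch below is an assumption-laden helper, not part of the deployed tooling: `served_from_cache` and `check_handshake` are hypothetical names, and the heuristic assumes the CDN marks cache hits with an `Age` response header, which is common CDN behavior but should be confirmed for this setup.

```python
import urllib.error
import urllib.request
from typing import Mapping, Tuple


def served_from_cache(headers: Mapping[str, str]) -> bool:
    """Heuristic (assumption): a response carrying an Age header came from a cache."""
    return any(name.lower() == "age" for name in headers)


def check_handshake(url: str, timeout: float = 10.0) -> Tuple[int, bool, str]:
    """Fetch a polling handshake; return (status, cache-hit heuristic, body)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return resp.status, served_from_cache(dict(resp.headers.items())), body
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise HTTPError; report them instead of crashing.
        body = err.read().decode("utf-8", errors="replace")
        return err.code, served_from_cache(dict(err.headers.items())), body
```

Calling `check_handshake("https://coditect.ai/theia/socket.io/?EIO=4&transport=polling")` from outside the cluster should, once the change has propagated, give status 200 with the cache heuristic reporting `False`.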

Future Optimization (Optional)

There are two ways to re-enable CDN while preserving Socket.IO functionality:

Option 1: Include query string in cache key (NOT recommended for Socket.IO):

spec:
  cdn:
    cachePolicy:
      includeQueryString: true # Include Socket.IO session params
    enabled: true

Problem: The CDN still caches Socket.IO responses, just under different keys. Session IDs expire quickly, so the cached entries have no reuse value.

Option 2: Path-based CDN bypass (BETTER):

Use separate BackendConfigs for static content (with CDN) and dynamic content (without CDN).

# backend-config-static.yaml (for /, /assets/*, etc.)
spec:
  cdn:
    enabled: true
    cachePolicy:
      includeQueryString: false

# backend-config-dynamic.yaml (for /theia/*, /api/*)
spec:
  cdn:
    enabled: false

Then annotate services with different BackendConfigs:

metadata:
  annotations:
    cloud.google.com/backend-config: '{"ports": {"80":"backend-config-static"}}'

Complexity: Requires splitting service into multiple backends with path-based routing.
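
To make the intended split concrete, the sketch below models which BackendConfig a request path would end up with under Option 2. It is an illustration only: the rule table and `backend_config_for` are hypothetical, the longest-prefix match stands in for the load balancer's real URL map, and `/api/v5/health` is an invented example path.

```python
# Hypothetical rule table: path prefix -> BackendConfig name from Option 2.
RULES = {
    "/theia/": "backend-config-dynamic",
    "/api/": "backend-config-dynamic",
    "/": "backend-config-static",
}


def backend_config_for(path: str) -> str:
    """Return the BackendConfig for the longest matching path prefix."""
    matches = [prefix for prefix in RULES if path.startswith(prefix)]
    if not matches:
        raise ValueError(f"path must start with '/': {path!r}")
    return RULES[max(matches, key=len)]


for p in ("/", "/assets/app.css", "/theia/socket.io/", "/api/v5/health"):
    print(p, "->", backend_config_for(p))
```

Under this model only `/` and `/assets/*` would be eligible for CDN caching, while every Socket.IO request under `/theia/` reaches the backend uncached.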

References

  • Build History: See CLAUDE.md "API URL Configuration" section
  • nginx Config: nginx-combined.conf:26-49 (Socket.IO location block)
  • BackendConfig: k8s/backend-config-no-cdn.yaml
  • Ingress: kubectl get ingress -n coditect-app coditect-production-ingress

Key Lessons

  1. CDN caching and WebSocket/Socket.IO don't mix - Session-based protocols need every request to reach the backend uncached
  2. Test at multiple layers - Internal cluster tests showed nginx was correct; external tests revealed CDN issue
  3. Query parameters matter - Socket.IO session IDs in query params must not be ignored
  4. Session affinity is critical - Both service-level and ingress-level session affinity required
  5. GCP BackendConfig propagation - Takes 2-5 minutes for load balancer to pick up config changes