Additional Research Pathways & Recommendations

Based on: Analysis of Investigation Package (files.zip)
Date: October 20, 2025
Integration with: analysis-troubleshooting-guide.md


Executive Summary

The comprehensive investigation package (files.zip) contains 8 documents totaling ~150 pages of analysis, procedures, and diagnostic tools. This document identifies additional research pathways and recommended next steps based on comparing:

  1. Our actual investigation findings (analysis-troubleshooting-guide.md)
  2. The reference documentation's diagnostic frameworks (files.zip)

I. High-Value Research Pathways (Prioritized)

A. WebSocket Annotation Testing [P0 - 85% Success Probability]

Source: fix-implementation-guide.md, executive-summary.md

Why This Matters: The reference documentation identifies this as the highest-probability fix (85%), but we have not tested it yet because we were focused on session affinity.

Implementation:

kubectl annotate ingress -n coditect-app coditect-production-ingress \
  cloud.google.com/websocket-max-idle-timeout="86400" \
  --overwrite

Expected Impact:

  • Enables GKE L7 load balancer to properly handle WebSocket protocol
  • Preserves Upgrade: websocket headers through load balancer
  • Supports long-lived WebSocket connections (24 hours)

Testing Procedure:

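# Test WebSocket upgrade
# The URL is quoted so the shell does not treat & as a background operator.
# --http1.1 is set because the Upgrade handshake is an HTTP/1.1 mechanism.
curl -v --http1.1 "https://coditect.ai/theia/socket.io/?EIO=4&transport=websocket" \
  -H "Upgrade: websocket" \
  -H "Connection: Upgrade" \
  -H "Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==" \
  -H "Sec-WebSocket-Version: 13"

# Expected: HTTP/1.1 101 Switching Protocols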

Priority: Apply AFTER session affinity propagates (next fix to test)
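
Validating the handshake response: a 101 status on its own does not prove the upgrade survived the load balancer, so the response headers are worth checking too. The sketch below is a hypothetical helper (checkUpgradeResponse is not part of the investigation package) that inspects a raw response head as printed by curl -i. It relies on the accept value that RFC 6455 pairs with the sample key used in the curl command.

```typescript
// Accept value that RFC 6455 section 1.3 pairs with the sample key
// "dGhlIHNhbXBsZSBub25jZQ==" used in the curl command.
const SAMPLE_ACCEPT = "s3pPLMBiTxaQ9kYGzzhZRbK+xOo=";

interface UpgradeCheck {
  ok: boolean;
  problems: string[];
}

// Hypothetical helper: inspects a raw response head (status line plus
// headers) and lists everything that is wrong with it.
function checkUpgradeResponse(rawHead: string, expectedAccept: string): UpgradeCheck {
  const lines = rawHead.split(/\r?\n/).filter((l) => l.trim() !== "");
  const problems: string[] = [];

  const statusLine = lines[0] || "";
  const status = Number(statusLine.split(" ")[1]);
  if (status !== 101) {
    problems.push("expected status 101, got: " + (statusLine || "(empty response)"));
  }

  // Header names are case-insensitive, so index them in lower case.
  const headers: Record<string, string> = {};
  for (const line of lines.slice(1)) {
    const idx = line.indexOf(":");
    if (idx > 0) {
      headers[line.slice(0, idx).trim().toLowerCase()] = line.slice(idx + 1).trim();
    }
  }

  if ((headers["upgrade"] || "").toLowerCase() !== "websocket") {
    problems.push("Upgrade header is missing or not 'websocket'");
  }
  const connectionTokens = (headers["connection"] || "")
    .toLowerCase()
    .split(",")
    .map((t) => t.trim());
  if (connectionTokens.indexOf("upgrade") === -1) {
    problems.push("Connection header does not contain 'Upgrade'");
  }
  if (headers["sec-websocket-accept"] !== expectedAccept) {
    problems.push("Sec-WebSocket-Accept does not match the key that was sent");
  }
  return { ok: problems.length === 0, problems: problems };
}
```

Passing the captured head and SAMPLE_ACCEPT returns ok: true only when all four conditions hold; otherwise the problems list shows which part of the handshake was lost.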


B. Automated Diagnostic Script [P0 - Validation Tool]

Source: socketio-diagnostics.sh (400 lines)

Capabilities:

  1. Phase 1: Environment validation (kubectl, gcloud, curl)
  2. Phase 2: Header analysis (WebSocket headers through load balancer)
  3. Phase 3: GKE backend investigation (BackendConfig, session affinity)
  4. Phase 4: BackendConfig deep analysis (CDN, timeouts, health checks)
  5. Phase 5: Health check verification (endpoint availability)
  6. Phase 6: Session affinity testing (multi-request routing)
  7. Phase 7: WebSocket handshake simulation (full protocol test)
  8. Summary Report: Consolidated findings with recommendations
  9. Auto-fix mode: --fix flag applies recommended fixes

Usage:

cd socket.io-issue
chmod +x socketio-diagnostics.sh

# Run diagnostics only
./socketio-diagnostics.sh

# Run with verbose output
./socketio-diagnostics.sh --verbose

# Run diagnostics and auto-apply fixes
./socketio-diagnostics.sh --fix

# Check results
cat /tmp/socketio-diagnostics-*/summary.txt

Value:

  • Comprehensive automated testing (30 minutes runtime)
  • Identifies issues we may have missed
  • Generates actionable report
  • Can auto-apply fixes in production

Priority: Run AFTER applying WebSocket annotation


C. Health Check Endpoint Investigation [P1 - 70% Fix]

Source: diagnostic-decision-tree.md Q8, fix-implementation-guide.md Fix #2

Confirmed Issue:

kubectl exec -n coditect-app deployment/coditect-combined -- \
  curl -s -o /dev/null -w "%{http_code}\n" http://localhost/health

# Result: 404 ← Endpoint doesn't exist!

Impact:

  • BackendConfig health check points to /health
  • 404 responses mark backend as unhealthy
  • May cause intermittent connection failures
  • Load balancer may route around "unhealthy" pods

Two Options:

Option A: Create /health endpoint in nginx (Recommended)

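location /health {
    access_log off;
    # default_type sets the Content-Type of the returned body;
    # add_header would emit a second Content-Type header instead.
    default_type text/plain;
    return 200 "healthy\n";
}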

Option B: Change BackendConfig health check path

healthCheck:
  requestPath: /  # Use existing endpoint instead

Implementation Priority: P1 (implement after WebSocket annotation validates)
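
Why a 404 takes pods out of rotation: the sketch below simulates how a backend's health state could evolve over a sequence of probe results. It assumes only HTTP 200 counts as a passing probe and that the state flips after a configured number of consecutive contrary results. The function name, the threshold values and the starting state are illustrative assumptions, not values read from our BackendConfig.

```typescript
interface Thresholds {
  healthyThreshold: number;   // consecutive passes needed to become healthy
  unhealthyThreshold: number; // consecutive failures needed to become unhealthy
}

// Hypothetical simulation: returns the health state after each probe.
function simulateHealth(
  probeStatusCodes: number[],
  thresholds: Thresholds,
  initiallyHealthy: boolean
): boolean[] {
  let healthy = initiallyHealthy;
  let streak = 0; // consecutive probes that contradict the current state
  const states: boolean[] = [];

  for (const code of probeStatusCodes) {
    const passed = code === 200; // assumption: only 200 is a pass
    if (passed === healthy) {
      streak = 0;
    } else {
      streak += 1;
      const needed = healthy ? thresholds.unhealthyThreshold : thresholds.healthyThreshold;
      if (streak >= needed) {
        healthy = !healthy;
        streak = 0;
      }
    }
    states.push(healthy);
  }
  return states;
}
```

With both thresholds set to 2, a pod that starts healthy and answers /health with 404 is marked unhealthy on the second probe and stays that way until two consecutive 200 responses arrive.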


D. Header Analysis & Tracing [P1 - Diagnostic]

Source: investigation-runbook.md Phase 2

Purpose: Verify WebSocket headers reach nginx through GKE load balancer

Procedure:

# Enable detailed header logging in nginx.
# Assumes the stock nginx.conf includes conf.d/*.conf inside its http block;
# appending to the end of nginx.conf would place log_format outside http.
kubectl exec -i -n coditect-app deployment/coditect-combined -- \
  tee /etc/nginx/conf.d/log-headers.conf > /dev/null <<'EOF'
log_format headers '$remote_addr - $remote_user [$time_local] '
                   '"$request" $status $body_bytes_sent '
                   '"$http_referer" "$http_user_agent" '
                   '"Upgrade: $http_upgrade" '
                   '"Connection: $http_connection"';
EOF

# Update server block to use headers format
kubectl exec -n coditect-app deployment/coditect-combined -- \
  sed -i 's/access_log .*/access_log \/var\/log\/nginx\/access.log headers;/' \
  /etc/nginx/sites-available/default

# Validate the configuration, then reload nginx
kubectl exec -n coditect-app deployment/coditect-combined -- \
  sh -c 'nginx -t && nginx -s reload'

# Make test request (URL quoted so the shell does not interpret &)
curl -v --http1.1 "https://coditect.ai/theia/socket.io/?EIO=4&transport=websocket" \
  -H "Upgrade: websocket" \
  -H "Connection: Upgrade"

# Check logs
kubectl logs -n coditect-app deployment/coditect-combined | \
  grep "Upgrade:" | tail -5

What to Look For:

  • ✅ Headers present → GKE preserving headers (good)
  • ❌ Headers missing → GKE stripping headers (need WebSocket annotation)
  • ❌ Headers corrupted → GKE transformation issue (deeper investigation)

Priority: Run if WebSocket annotation doesn't fix issue
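
Classifying log lines: the three outcomes under "What to Look For" can be sketched as a small classifier over lines written in the headers log format. The function names are hypothetical. The sketch assumes nginx writes "-" for an empty variable and treats any other unexpected value as corrupted.

```typescript
type HeaderVerdict = "present" | "missing" | "corrupted";

// Pull one quoted "Name: value" field out of a log line. A field that is
// absent, empty, or logged as "-" is reported as "-".
function loggedHeader(line: string, name: string): string {
  const match = new RegExp('"' + name + ': ([^"]*)"').exec(line);
  const value = match ? match[1] : undefined;
  return value === undefined || value.trim() === "" ? "-" : value.trim();
}

// Hypothetical classifier mirroring the three outcomes listed above.
function classifyLogLine(line: string): HeaderVerdict {
  const upgrade = loggedHeader(line, "Upgrade").toLowerCase();
  const connection = loggedHeader(line, "Connection").toLowerCase();

  if (upgrade === "-") {
    return "missing"; // Upgrade never reached nginx
  }
  const tokens = connection.split(",").map((t) => t.trim());
  const intact = upgrade === "websocket" && tokens.indexOf("upgrade") !== -1;
  return intact ? "present" : "corrupted";
}
```

Running it over the last few lines from kubectl logs gives a quick verdict per request without reading each line by eye.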


E. Network Path Tracing [P2 - Advanced Diagnostic]

Source: investigation-runbook.md Phase 4

Purpose: Trace complete request path from browser to backend pod

Tools & Commands:

# 1. Check GKE forwarding rules
gcloud compute forwarding-rules list --format="table(name,IPAddress,target)" | \
  grep coditect

# 2. Describe target proxy
gcloud compute target-https-proxies describe \
  k8s2-ts-e6wek3rw-coditect-app-coditect-production-ingr-x8lkl70c \
  --format="yaml(sslCertificates,urlMap)"

# 3. Check URL map routing
gcloud compute url-maps describe \
  k8s2-um-e6wek3rw-coditect-app-coditect-production-ingr-x8lkl70c \
  --format="yaml(defaultService,pathMatchers)"

# 4. Verify backend service health
gcloud compute backend-services get-health \
  k8s1-28b74fc1-coditect-app-coditect-combined-service-8-b2e75de7 \
  --global

# 5. Check certificate status
gcloud compute ssl-certificates describe \
  mcrt-1e9fda11-9f24-455a-bd4b-074f776e3282 \
  --format="yaml(certificate,expireTime)"

When to Use:

  • If other fixes don't resolve issue
  • To understand complete request flow
  • For documentation and training
  • During post-mortem analysis

II. Long-Term Improvements & Monitoring

A. Monitoring Dashboard

Purpose: Proactive detection of Socket.IO issues

Metrics to Track:

1. Connection Success Rate
   - socket_io_handshake_success_total / socket_io_handshake_attempts_total
   - Target: >99.9%

2. WebSocket Upgrade Rate
   - socket_io_websocket_upgrades_total / socket_io_connections_total
   - Target: >95% (allow 5% for polling-only clients)

3. Median Session Duration
   - histogram_quantile(0.5, sum by (le) (rate(socket_io_session_duration_seconds_bucket[5m])))
   - Target: >3600 seconds (1 hour)

4. 400 Error Rate
   - socket_io_http_400_errors_total
   - Target: 0 errors/hour

5. Backend Health Status
   - gke_backend_healthy_instances / gke_backend_total_instances
   - Target: 100%
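
Checking a snapshot against the targets: as a rough sketch, the five targets above can be written as one function that reports which ones a set of readings misses. The Snapshot shape and its field names are assumptions for illustration and do not correspond to an existing exporter.

```typescript
// Hypothetical snapshot of the five tracked metrics over one window.
interface Snapshot {
  handshakeSuccess: number;
  handshakeAttempts: number;
  websocketUpgrades: number;
  connections: number;
  sessionDurationSeconds: number; // typical session length in the window
  http400PerHour: number;
  healthyBackends: number;
  totalBackends: number;
}

// A window with no traffic has no meaningful ratio; it is reported as 0
// here so that missing data shows up as a missed target (an assumption).
function ratio(numerator: number, denominator: number): number {
  return denominator === 0 ? 0 : numerator / denominator;
}

// Returns the names of the targets that the snapshot fails to meet.
function missedTargets(s: Snapshot): string[] {
  const missed: string[] = [];
  if (ratio(s.handshakeSuccess, s.handshakeAttempts) <= 0.999) {
    missed.push("connection success rate (>99.9%)");
  }
  if (ratio(s.websocketUpgrades, s.connections) <= 0.95) {
    missed.push("websocket upgrade rate (>95%)");
  }
  if (s.sessionDurationSeconds <= 3600) {
    missed.push("session duration (>3600 s)");
  }
  if (s.http400PerHour > 0) {
    missed.push("400 error rate (0/hour)");
  }
  if (ratio(s.healthyBackends, s.totalBackends) < 1) {
    missed.push("backend health (100%)");
  }
  return missed;
}
```

An empty result means every target is met. The same comparisons could later be expressed as alert rules once real metric names exist.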

Implementation:

# Create Prometheus ServiceMonitor
kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: socket-io-monitor
  namespace: coditect-app
spec:
  selector:
    matchLabels:
      app: coditect-combined
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s
EOF

# Create Grafana dashboard (import JSON)
curl -X POST http://grafana:3000/api/dashboards/db \
  -H "Content-Type: application/json" \
  -d @socket-io-dashboard.json

B. Automated Health Checks

Purpose: Continuous validation of Socket.IO functionality

External Synthetic Monitoring:

# cron job every 5 minutes
*/5 * * * * /usr/local/bin/socket-io-health-check.sh

# Contents of /usr/local/bin/socket-io-health-check.sh:
#!/bin/bash
ENDPOINT="https://coditect.ai/theia/socket.io/?EIO=4&transport=polling"

# Test initial handshake (curl reports 000 if the request itself fails)
STATUS=$(curl -s -o /dev/null --max-time 10 -w "%{http_code}" "$ENDPOINT")

if [ "$STATUS" != "200" ]; then
  # Alert on failure; status is sent as a string so "000" stays valid JSON
  curl -s -X POST https://alerts.example.com/webhook \
    -H "Content-Type: application/json" \
    -d "{\"service\":\"socket-io\",\"status\":\"$STATUS\",\"severity\":\"critical\"}"
fi

Internal Pod-Level Checks:

# Add to coditect-combined deployment
livenessProbe:
  httpGet:
    path: /socket.io/?EIO=4&transport=polling
    port: 3000
  initialDelaySeconds: 30
  periodSeconds: 60
  timeoutSeconds: 5
  failureThreshold: 3

C. Regression Testing

Purpose: Prevent future Socket.IO regressions

Test Suite:

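// tests/e2e/socket-io.test.ts
// Assumes a Jest-style runner and the `ws` package for the WebSocket client.
import WebSocket from 'ws';

const BASE = 'https://coditect.ai/theia/socket.io/?EIO=4';

// Perform the polling handshake and return the session id from the open packet.
async function handshake(): Promise<string> {
  const response = await fetch(`${BASE}&transport=polling`);
  const body = await response.text();
  return JSON.parse(body.substring(1)).sid; // Remove leading '0'
}

// Resolve once the socket opens; reject if it errors first.
function waitForOpen(ws: WebSocket): Promise<void> {
  return new Promise((resolve, reject) => {
    ws.once('open', () => resolve());
    ws.once('error', reject);
  });
}

describe('Socket.IO Integration', () => {
  it('should complete initial handshake', async () => {
    const response = await fetch(`${BASE}&transport=polling`);
    expect(response.status).toBe(200);

    const body = await response.text();
    const data = JSON.parse(body.substring(1)); // Remove leading '0'
    expect(data).toHaveProperty('sid');
    expect(data.upgrades).toContain('websocket');
  });

  it('should maintain session across requests', async () => {
    // Initial handshake
    const sid = await handshake();

    // Subsequent request with sid. An idle poll can be held open until the
    // next server ping, so this test allows a longer timeout.
    const response2 = await fetch(`${BASE}&transport=polling&sid=${sid}`);
    expect(response2.status).toBe(200); // Not 400!
  }, 60_000);

  it('should upgrade to WebSocket', async () => {
    const ws = new WebSocket(
      'wss://coditect.ai/theia/socket.io/?EIO=4&transport=websocket'
    );

    await waitForOpen(ws);
    expect(ws.readyState).toBe(WebSocket.OPEN);
    ws.close();
  });
});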
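
Decoding the polling payload: the "remove leading '0'" step in the test suite works because every Engine.IO v4 packet starts with a one-digit type code, and 0 marks the open packet that carries the session id. The sketch below shows that framing under the assumption of text-only packets (binary packets are not handled). The function names are hypothetical.

```typescript
// Engine.IO v4 packet types, indexed by their one-digit type code.
const PACKET_TYPES = ["open", "close", "ping", "pong", "message", "upgrade", "noop"];

interface EnginePacket {
  type: string;
  data: string;
}

interface OpenHandshake {
  sid: string;
  upgrades: string[];
}

// A polling response may carry several packets separated by the
// record-separator character (0x1e).
function decodePayload(body: string): EnginePacket[] {
  return body.split("\x1e").map((raw) => {
    const code = raw.length > 0 ? "0123456".indexOf(raw.charAt(0)) : -1;
    const type = code >= 0 ? PACKET_TYPES[code] : undefined;
    if (type === undefined) {
      throw new Error("unrecognized Engine.IO packet: " + JSON.stringify(raw));
    }
    return { type: type, data: raw.slice(1) };
  });
}

// Extract the handshake fields the tests care about from an open packet.
function parseOpen(packet: EnginePacket): OpenHandshake {
  if (packet.type !== "open") {
    throw new Error("expected an open packet, got " + packet.type);
  }
  const parsed = JSON.parse(packet.data);
  if (typeof parsed.sid !== "string" || !Array.isArray(parsed.upgrades)) {
    throw new Error("open packet is missing sid or upgrades");
  }
  return { sid: parsed.sid, upgrades: parsed.upgrades };
}
```

Decoding first and parsing second makes a failed handshake easier to diagnose: an HTML error page or a close packet is reported as such instead of surfacing as a JSON syntax error.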

CI/CD Integration:

# .github/workflows/socket-io-tests.yml
name: Socket.IO Regression Tests
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
      - run: npm install
      - run: npm run test:socket-io
      - name: Notify on failure
        if: failure()
        run: |
          curl -X POST ${{ secrets.SLACK_WEBHOOK }} \
            -d '{"text":"Socket.IO regression tests failed!"}'

D. Dedicated WebSocket Gateway [Future Consideration]

Source: executive-summary.md Long-term recommendations

Architecture:

Browser
   ↓
GKE Ingress (websocket.coditect.ai)
   ↓
WebSocket Gateway (nginx + Socket.IO proxy)
   ↓ (internal only)
theia Backend (coditect-combined)

Benefits:

  • Isolate WebSocket traffic from static content
  • Specialized configuration for long-lived connections
  • Independent scaling for WebSocket vs HTTP
  • Better observability and monitoring
  • Easier to troubleshoot

Implementation (future):

# websocket-gateway-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: websocket-gateway
  namespace: coditect-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: websocket-gateway
  template:
    metadata:
      labels:
        app: websocket-gateway
    spec:
      containers:
        - name: nginx
          image: nginx:latest
          ports:
            - containerPort: 80
          volumeMounts:
            - name: config
              mountPath: /etc/nginx/nginx.conf
              subPath: nginx.conf
      volumes:
        - name: config
          configMap:
            name: websocket-gateway-config

Priority: Consider in Q1 2026 if Socket.IO traffic grows significantly


III. Documentation & Knowledge Management

A. Runbook Creation [High Priority]

Purpose: Standardize Socket.IO troubleshooting procedures

Structure:

# Socket.IO Troubleshooting Runbook

## Quick Diagnostics (5 minutes)
1. Test external endpoint
2. Check GCP backend session affinity
3. Verify Service annotation
4. Check pod health

## Common Issues
1. Session affinity missing → Add Service annotation
2. CDN caching enabled → Disable CDN in BackendConfig
3. WebSocket headers stripped → Add Ingress annotation
4. Health endpoint missing → Create /health in nginx

## Escalation Criteria
- Issue not resolved after 30 minutes
- Multiple services affected
- Customer-reported outages

## Post-Resolution
- Update incident log
- Run regression tests
- Review monitoring dashboards

Location: docs/runbooks/socket-io-troubleshooting.md


B. Incident Report Template

Purpose: Document future Socket.IO issues for pattern analysis

Template:

# Socket.IO Incident Report

**Date**: YYYY-MM-DD
**Duration**: X hours
**Severity**: P0/P1/P2
**Impact**: Description

## Timeline
- HH:MM: Issue detected
- HH:MM: Investigation started
- HH:MM: Root cause identified
- HH:MM: Fix applied
- HH:MM: Validated and resolved

## Root Cause
[Detailed description]

## Resolution
[Fixes applied]

## Lessons Learned
- What went well
- What could be improved
- Action items

## Preventive Measures
[Future safeguards]

C. Architecture Documentation Update [Required]

Add Section: "Socket.IO Configuration Requirements"

Content:

## Socket.IO Configuration Requirements

### GKE Ingress
- WebSocket annotation: `cloud.google.com/websocket-max-idle-timeout="86400"`
- Backend timeout: 24 hours minimum

### Service (NEG-based)
- **CRITICAL**: Must have BackendConfig annotation
- Format: `cloud.google.com/backend-config='{"default":"coditect-backend-config"}'`

### BackendConfig
- CDN: Disabled (Socket.IO incompatible)
- Session affinity: CLIENT_IP
- Affinity cookie TTL: 86400 (24 hours; only takes effect with GENERATED_COOKIE affinity)
- Timeout: 86400 (24 hours)
- Health check: Use /health endpoint

### nginx
- WebSocket headers: Upgrade + Connection
- Proxy buffering: off
- Proxy cache: off
- Read/send timeout: 86400

### Application
- Socket.IO server on port 3000
- Path: /socket.io/ (default)
- CORS: Configured for coditect.ai
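
Checking a BackendConfig against these requirements: the BackendConfig items above lend themselves to an automated check. The sketch below assumes the spec has been loaded into an object (for example from kubectl get backendconfig -o json) and covers only the CDN, affinity type, timeout and health check path. The interface and function names are illustrative.

```typescript
// Subset of the BackendConfig spec that the requirements refer to.
interface BackendConfigSpec {
  cdn?: { enabled?: boolean };
  sessionAffinity?: { affinityType?: string; affinityCookieTtlSec?: number };
  timeoutSec?: number;
  healthCheck?: { requestPath?: string };
}

// Hypothetical check: returns one message per unmet requirement.
function checkBackendConfig(spec: BackendConfigSpec): string[] {
  const problems: string[] = [];

  if (spec.cdn !== undefined && spec.cdn.enabled === true) {
    problems.push("CDN must be disabled for Socket.IO traffic");
  }

  const affinity = spec.sessionAffinity ? spec.sessionAffinity.affinityType : undefined;
  if (affinity !== "CLIENT_IP") {
    problems.push("sessionAffinity.affinityType must be CLIENT_IP");
  }

  if (spec.timeoutSec === undefined || spec.timeoutSec < 86400) {
    problems.push("timeoutSec must be at least 86400 (24 hours)");
  }

  const path = spec.healthCheck ? spec.healthCheck.requestPath : undefined;
  if (path !== "/health") {
    problems.push("healthCheck.requestPath must be /health");
  }

  return problems;
}
```

An empty list means the object satisfies the four checked items. The Ingress, Service and nginx requirements would need separate checks.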

IV. Team Training & Knowledge Transfer

A. Troubleshooting Workshop

Duration: 2 hours

Agenda:

  1. Hour 1: Technical deep dive
    • Socket.IO protocol fundamentals
    • GKE load balancer architecture
    • NEG vs non-NEG services
    • BackendConfig requirements
  2. Hour 2: Hands-on troubleshooting
    • Use diagnostic decision tree
    • Run automated diagnostic script
    • Simulate common failures
    • Practice fix application

Materials:

  • This investigation package (all 10 documents)
  • Live GKE cluster access
  • Sample failure scenarios

B. On-Call Playbook [High Priority]

Quick Reference Card:

Socket.IO 400 Errors - On-Call Guide
=====================================

1. CHECK: Is it affecting all users?
   → Yes: P0 incident, page team
   → No: P1, investigate during business hours

2. QUICK TEST (prints only the HTTP status code):
   curl -s -o /dev/null -w "%{http_code}\n" \
     "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling"
   → 400: Continue to step 3
   → 200: Issue resolved or intermittent

3. CHECK SESSION AFFINITY:
   gcloud compute backend-services describe \
     k8s1-28b74fc1-coditect-app-coditect-combined-service-8-b2e75de7 \
     --global --format="value(sessionAffinity)"
   → NONE: Apply Service annotation (see runbook)
   → CLIENT_IP: Continue to step 4

4. CHECK SERVICE ANNOTATION:
   kubectl get service -n coditect-app coditect-combined-service \
     -o jsonpath='{.metadata.annotations.cloud\.google\.com/backend-config}'
   → Empty: Add annotation (see runbook)
   → Present: Check WebSocket annotation (step 5)

5. CHECK WEBSOCKET ANNOTATION:
   kubectl get ingress -n coditect-app coditect-production-ingress \
     -o jsonpath='{.metadata.annotations.cloud\.google\.com/websocket-max-idle-timeout}'
   → Empty: Add annotation (see runbook)
   → Present: Escalate to platform engineering

ESCALATION: Platform Engineering team
RUNBOOK: docs/runbooks/socket-io-troubleshooting.md
DIAGNOSTIC SCRIPT: socket.io-issue/socketio-diagnostics.sh
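
Steps 2 to 5 of the card form a small decision tree, which can be sketched as a single function for use in a chat-ops bot or a script wrapper. The Observations shape and the handling of status codes other than 200 and 400 are assumptions; the card itself only covers those two codes.

```typescript
// Values collected by running the commands from steps 2 to 5.
interface Observations {
  pollingStatus: number;           // HTTP status from the quick test
  sessionAffinity: string;         // e.g. "NONE" or "CLIENT_IP"
  backendConfigAnnotation: string; // empty string if the annotation is absent
  websocketAnnotation: string;     // empty string if the annotation is absent
}

// Hypothetical helper: returns the next action the card prescribes.
function nextAction(o: Observations): string {
  if (o.pollingStatus === 200) {
    return "Issue resolved or intermittent";
  }
  if (o.pollingStatus !== 400) {
    // Assumption: the card does not cover other codes, so hand off.
    return "Status not covered by this card: escalate to platform engineering";
  }
  if (o.sessionAffinity === "NONE") {
    return "Apply Service annotation (see runbook)";
  }
  if (o.backendConfigAnnotation.trim() === "") {
    return "Add Service annotation (see runbook)";
  }
  if (o.websocketAnnotation.trim() === "") {
    return "Add Ingress WebSocket annotation (see runbook)";
  }
  return "Escalate to platform engineering";
}
```

The checks are ordered exactly as on the card, so the first failing step determines the answer and later observations are ignored.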

V. Action Items

Immediate (Next 24 hours)

  • Apply WebSocket annotation to Ingress (P0)
  • Wait for session affinity propagation completion (P0)
  • Run automated diagnostic script (socketio-diagnostics.sh) (P0)
  • Create /health endpoint in nginx (P1)
  • Validate with browser testing (P0)

Short-term (Next week)

  • Create Socket.IO monitoring dashboard (P1)
  • Implement automated health checks (P1)
  • Document findings in architecture docs (P1)
  • Create troubleshooting runbook (P1)
  • Add regression tests to CI/CD (P2)

Medium-term (Next month)

  • Conduct Socket.IO troubleshooting workshop (P2)
  • Create on-call playbook (P1)
  • Review all services for annotation gaps (P2)
  • Evaluate dedicated WebSocket gateway (P3)
  • Implement header analysis in diagnostic script (P3)

VI. Success Metrics

Track these to validate improvements:

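| Metric                          | Baseline | Target   | Timeline |
|---------------------------------|----------|----------|----------|
| Socket.IO availability          | 0%       | 99.9%    | 1 week   |
| MTTR (Mean Time To Resolution)  | 4+ hours | <30 min  | 1 month  |
| False positive incidents        | TBD      | <1/month | 3 months |
| Team troubleshooting confidence | TBD      | 90%      | 3 months |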

VII. Conclusion

The comprehensive investigation package (files.zip) provides excellent frameworks that complement our actual findings. Key takeaways:

  1. Service annotation requirement for NEG-based services (not in reference docs)
  2. WebSocket annotation as high-probability fix (85%) still needs testing
  3. Automated diagnostic tools ready for immediate use
  4. Long-term improvements clearly defined and prioritized

Next Steps: Execute immediate actions (P0), then progressively implement short-term and medium-term improvements for production-grade Socket.IO reliability.


Document Created: October 20, 2025
Last Updated: October 20, 2025 13:40 UTC
Version: 1.0

Related Documents:

  • analysis-troubleshooting-guide.md (our investigation)
  • files.zip contents (reference documentation)