Additional Research Pathways & Recommendations

Based on Analysis of Investigation Package (files.zip)
Date: October 20, 2025
Integration with: analysis-troubleshooting-guide.md

Executive Summary

The comprehensive investigation package (files.zip) contains 8 documents totaling ~150 pages of analysis, procedures, and diagnostic tools. This document identifies additional research pathways and recommended next steps based on comparing:

Our actual investigation findings (analysis-troubleshooting-guide.md)
The reference documentation's comprehensive diagnostic frameworks (files.zip)

I. High-Value Research Pathways (Prioritized)

A. WebSocket Annotation Testing [P0 - 85% Success Probability]

Source: fix-implementation-guide.md, executive-summary.md

Why This Matters: The reference documentation rates this as the highest-probability fix (85%), but we have not tested it yet because the investigation has focused on session affinity.

Implementation:

kubectl annotate ingress -n coditect-app coditect-production-ingress \
  cloud.google.com/websocket-max-idle-timeout="86400" \
  --overwrite

Expected Impact (per the reference documentation, not yet verified):

Enables GKE L7 load balancer to properly handle WebSocket protocol
Preserves Upgrade: websocket headers through load balancer
Supports long-lived WebSocket connections (24 hours)

Testing Procedure:

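# Test WebSocket upgrade
# Quote the URL so the shell does not treat "&" as a background operator;
# --http1.1 keeps curl from negotiating HTTP/2, where the Upgrade header does not apply.
curl -v --http1.1 "https://coditect.ai/theia/socket.io/?EIO=4&transport=websocket" \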
  -H "Upgrade: websocket" \
  -H "Connection: Upgrade" \
  -H "Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==" \
  -H "Sec-WebSocket-Version: 13"

# Expected: HTTP/1.1 101 Switching Protocols
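
Beyond the 101 status line, a complete handshake check can also compare the server's Sec-WebSocket-Accept header with the value derived from the request key, as defined in RFC 6455. The following is a minimal sketch, assuming Node.js and its built-in crypto module; the function names `expectedAccept` and `isValidUpgrade` are illustrative and not part of the investigation package.

```typescript
import { createHash } from 'node:crypto';

// GUID fixed by RFC 6455 for deriving Sec-WebSocket-Accept.
const WS_GUID = '258EAFA5-E914-47DA-95CA-C5AB0DC85B11';

// Expected Sec-WebSocket-Accept value for a given Sec-WebSocket-Key.
function expectedAccept(key: string): string {
  return createHash('sha1').update(key + WS_GUID).digest('base64');
}

// True when the status and headers describe a valid upgrade for `key`.
// Assumption: the caller has lower-cased the header names.
function isValidUpgrade(
  status: number,
  headers: Record<string, string>,
  key: string,
): boolean {
  return (
    status === 101 &&
    (headers['upgrade'] ?? '').toLowerCase() === 'websocket' &&
    headers['sec-websocket-accept'] === expectedAccept(key)
  );
}
```

For the key used in the curl command, `expectedAccept('dGhlIHNhbXBsZSBub25jZQ==')` yields `s3pPLMBiTxaQ9kYGzzhZRbK+xOo=`, the sample value given in RFC 6455.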

Priority: Apply AFTER session affinity propagates (next fix to test)

B. Automated Diagnostic Script [P0 - Validation Tool]

Source: socketio-diagnostics.sh (400 lines)

Capabilities:

Phase 1: Environment validation (kubectl, gcloud, curl)
Phase 2: Header analysis (WebSocket headers through load balancer)
Phase 3: GKE backend investigation (BackendConfig, session affinity)
Phase 4: BackendConfig deep analysis (CDN, timeouts, health checks)
Phase 5: Health check verification (endpoint availability)
Phase 6: Session affinity testing (multi-request routing)
Phase 7: WebSocket handshake simulation (full protocol test)
Summary Report: Consolidated findings with recommendations
Auto-fix mode: --fix flag applies recommended fixes

Usage:

cd socket.io-issue
chmod +x socketio-diagnostics.sh

# Run diagnostics only
./socketio-diagnostics.sh

# Run with verbose output
./socketio-diagnostics.sh --verbose

# Run diagnostics and auto-apply fixes
./socketio-diagnostics.sh --fix

# Check results
cat /tmp/socketio-diagnostics-*/summary.txt

Value:

Comprehensive automated testing (30 minutes runtime)
Identifies issues we may have missed
Generates actionable report
Can auto-apply fixes in production

Priority: Run AFTER applying WebSocket annotation

C. Health Check Endpoint Investigation [P1 - 70% Fix]

Source: diagnostic-decision-tree.md Q8, fix-implementation-guide.md Fix #2

Confirmed Issue:

kubectl exec -n coditect-app deployment/coditect-combined -- \
  curl -s -o /dev/null -w "%{http_code}\n" http://localhost/health

# Result: 404  ← Endpoint doesn't exist!

Impact:

BackendConfig health check points to /health
404 responses mark backend as unhealthy
May cause intermittent connection failures
Load balancer may route around "unhealthy" pods

Two Options:

Option A: Create /health endpoint in nginx (Recommended)

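location /health {
    access_log off;
    # default_type sets the Content-Type of the body generated by "return";
    # add_header would append a second Content-Type header instead.
    default_type text/plain;
    return 200 "healthy\n";
}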

Option B: Change BackendConfig health check path

healthCheck:
  requestPath: /  # Use existing endpoint instead

Implementation Priority: P1 (implement after WebSocket annotation validates)

D. Header Analysis & Tracing [P1 - Diagnostic]

Source: investigation-runbook.md Phase 2

Purpose: Verify WebSocket headers reach nginx through GKE load balancer

Procedure:

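# Enable detailed header logging in nginx.
# The quoted heredoc ('EOF') stops the shell from expanding the $nginx variables.
# log_format is only valid inside the http block, so it is written to conf.d
# rather than appended to the end of nginx.conf.
# Assumption: nginx.conf includes /etc/nginx/conf.d/*.conf inside its http block.
# Note: "kubectl exec deployment/..." targets a single pod of the deployment.
kubectl exec -i -n coditect-app deployment/coditect-combined -- \
  tee /etc/nginx/conf.d/00-headers-log-format.conf > /dev/null <<'EOF'
log_format headers '$remote_addr - $remote_user [$time_local] '
                   '"$request" $status $body_bytes_sent '
                   '"$http_referer" "$http_user_agent" '
                   '"Upgrade: $http_upgrade" '
                   '"Connection: $http_connection"';
EOF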

# Update server block to use headers format
kubectl exec -n coditect-app deployment/coditect-combined -- \
  sed -i 's/access_log .*/access_log \/var\/log\/nginx\/access.log headers;/' \
  /etc/nginx/sites-available/default

# Reload nginx
kubectl exec -n coditect-app deployment/coditect-combined -- \
  nginx -s reload

# Make test request
curl -v --http1.1 "https://coditect.ai/theia/socket.io/?EIO=4&transport=websocket" \
  -H "Upgrade: websocket" \
  -H "Connection: Upgrade"

# Check logs
kubectl logs -n coditect-app deployment/coditect-combined | \
  grep "Upgrade:" | tail -5

What to Look For:

✅ Headers present → GKE preserving headers (good)
❌ Headers missing → GKE stripping headers (need WebSocket annotation)
❌ Headers corrupted → GKE transformation issue (deeper investigation)
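
The three outcomes above can be checked mechanically against lines written with the `headers` log format. The sketch below is illustrative only: `classifyLogLine` is a hypothetical helper, and it relies on nginx writing "-" for a variable that is empty.

```typescript
type HeaderVerdict = 'present' | 'missing' | 'corrupted';

// Classify one access-log line written with the `headers` log_format.
// nginx logs "-" for an empty variable, so "Upgrade: -" means the header
// never reached nginx.
function classifyLogLine(line: string): HeaderVerdict {
  const upgrade = /"Upgrade: ([^"]*)"/.exec(line)?.[1] ?? '-';
  const connection = /"Connection: ([^"]*)"/.exec(line)?.[1] ?? '-';
  if (upgrade === '-') return 'missing';

  const upgradeOk = upgrade.toLowerCase() === 'websocket';
  // Connection may carry a comma-separated token list, e.g. "keep-alive, Upgrade".
  const connectionOk = connection
    .split(',')
    .some((token) => token.trim().toLowerCase() === 'upgrade');
  return upgradeOk && connectionOk ? 'present' : 'corrupted';
}
```

A line ending in `"Upgrade: websocket" "Connection: Upgrade"` is classified as `present`, while `"Upgrade: -" "Connection: keep-alive"` is classified as `missing`.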

Priority: Run if WebSocket annotation doesn't fix issue

E. Network Path Tracing [P2 - Advanced Diagnostic]

Source: investigation-runbook.md Phase 4

Purpose: Trace complete request path from browser to backend pod

Tools & Commands:

# 1. Check GKE forwarding rules
gcloud compute forwarding-rules list --format="table(name,IPAddress,target)" | \
  grep coditect

# 2. Describe target proxy
gcloud compute target-https-proxies describe \
  k8s2-ts-e6wek3rw-coditect-app-coditect-production-ingr-x8lkl70c \
  --format="yaml(sslCertificates,urlMap)"

# 3. Check URL map routing
gcloud compute url-maps describe \
  k8s2-um-e6wek3rw-coditect-app-coditect-production-ingr-x8lkl70c \
  --format="yaml(defaultService,pathMatchers)"

# 4. Verify backend service health
gcloud compute backend-services get-health \
  k8s1-28b74fc1-coditect-app-coditect-combined-service-8-b2e75de7 \
  --global

# 5. Check certificate status
gcloud compute ssl-certificates describe \
  mcrt-1e9fda11-9f24-455a-bd4b-074f776e3282 \
  --format="yaml(certificate,expireTime)"

When to Use:

If other fixes don't resolve issue
To understand complete request flow
For documentation and training
During post-mortem analysis

II. Long-Term Improvements & Monitoring

A. Socket.IO Monitoring Dashboard [Recommended]

Purpose: Proactive detection of Socket.IO issues

Metrics to Track:

1. Connection Success Rate
   - rate(socket_io_handshake_success_total[5m]) / rate(socket_io_handshake_attempts_total[5m])
   - Target: >99.9%

2. WebSocket Upgrade Rate
   - rate(socket_io_websocket_upgrades_total[5m]) / rate(socket_io_connections_total[5m])
   - Target: >95% (allow 5% for polling-only clients)

3. Median Session Duration
   - histogram_quantile(0.5, sum by (le) (rate(socket_io_session_duration_seconds_bucket[5m])))
   - Target: >3600 seconds (1 hour)

4. 400 Error Rate
   - increase(socket_io_http_400_errors_total[1h])
   - Target: 0 errors/hour

5. Backend Health Status
   - gke_backend_healthy_instances / gke_backend_total_instances
   - Target: 100%

Note: these metric names are illustrative and must be exported by the application or an exporter. Prometheus's standard metric-name syntax does not allow dots, hence the socket_io_ prefix.
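
To make the targets concrete, the sketch below evaluates a snapshot of the underlying counters against the five thresholds listed above. It is an illustration under stated assumptions: the `SocketIoSnapshot` shape, its field names, and the target labels are hypothetical and do not come from an existing exporter.

```typescript
// Snapshot of the values behind the five dashboard metrics (illustrative names).
interface SocketIoSnapshot {
  handshakeSuccess: number;
  handshakeAttempts: number;
  websocketUpgrades: number;
  connections: number;
  sessionDurationP50Seconds: number;
  http400LastHour: number;
  healthyBackends: number;
  totalBackends: number;
}

// Ratio that reports "no data" (null) instead of dividing by zero.
function ratio(numerator: number, denominator: number): number | null {
  return denominator > 0 ? numerator / denominator : null;
}

// Labels of the targets that the snapshot currently misses.
function missedTargets(s: SocketIoSnapshot): string[] {
  const missed: string[] = [];
  const success = ratio(s.handshakeSuccess, s.handshakeAttempts);
  const upgrades = ratio(s.websocketUpgrades, s.connections);
  const healthy = ratio(s.healthyBackends, s.totalBackends);

  if (success !== null && !(success > 0.999)) missed.push('connection-success-rate');
  if (upgrades !== null && !(upgrades > 0.95)) missed.push('websocket-upgrade-rate');
  if (!(s.sessionDurationP50Seconds > 3600)) missed.push('session-duration');
  if (s.http400LastHour > 0) missed.push('http-400-errors');
  if (healthy !== null && healthy < 1) missed.push('backend-health');
  return missed;
}
```

An empty result means every target is met; ratios with no traffic are skipped rather than reported as failures.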

Implementation:

# Create Prometheus ServiceMonitor
kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: socket-io-monitor
  namespace: coditect-app
spec:
  selector:
    matchLabels:
      app: coditect-combined
  endpoints:
  - port: metrics
    path: /metrics
    interval: 30s
EOF

# Create Grafana dashboard (import JSON)
curl -X POST http://grafana:3000/api/dashboards/db \
  -H "Content-Type: application/json" \
  -d @socket-io-dashboard.json

B. Automated Health Checks [Recommended]

Purpose: Continuous validation of Socket.IO functionality

External Synthetic Monitoring:

# crontab entry: run every 5 minutes
*/5 * * * * /usr/local/bin/socket-io-health-check.sh

# /usr/local/bin/socket-io-health-check.sh
#!/bin/bash
ENDPOINT="https://coditect.ai/theia/socket.io/?EIO=4&transport=polling"

# Test initial handshake; curl reports 000 when the request itself fails
STATUS=$(curl -s -o /dev/null --max-time 10 -w "%{http_code}" "$ENDPOINT")

if [ "$STATUS" != "200" ]; then
  # Alert on failure. The status is sent as a string because 000 is not a valid JSON number.
  curl -s -X POST https://alerts.example.com/webhook \
    -H "Content-Type: application/json" \
    -d "{\"service\":\"socket-io\",\"status\":\"$STATUS\",\"severity\":\"critical\"}"
fi

Internal Pod-Level Checks:

# Add to coditect-combined deployment
livenessProbe:
  httpGet:
    path: /socket.io/?EIO=4&transport=polling
    port: 3000
  initialDelaySeconds: 30
  periodSeconds: 60
  timeoutSeconds: 5
  failureThreshold: 3

C. Regression Testing in CI/CD [Recommended]

Purpose: Prevent future Socket.IO regressions

Test Suite:

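// tests/e2e/socket-io.test.ts
// Assumes Jest globals, Node 18+ (global fetch), and the `ws` package.
import WebSocket from 'ws';

const HTTP_BASE = 'https://coditect.ai/theia/socket.io/?EIO=4';
const WS_BASE = 'wss://coditect.ai/theia/socket.io/?EIO=4';

// Perform the polling handshake and return the parsed payload.
async function handshake(): Promise<{ sid: string; upgrades: string[] }> {
  const response = await fetch(`${HTTP_BASE}&transport=polling`);
  expect(response.status).toBe(200);
  const body = await response.text();
  return JSON.parse(body.substring(1)); // Remove leading '0' (Engine.IO "open" packet type)
}

// Resolve once the socket opens; reject on a connection error.
function waitForOpen(ws: WebSocket): Promise<void> {
  return new Promise((resolve, reject) => {
    ws.once('open', () => resolve());
    ws.once('error', reject);
  });
}

describe('Socket.IO Integration', () => {
  it('should complete initial handshake', async () => {
    const data = await handshake();
    expect(data).toHaveProperty('sid');
    expect(data.upgrades).toContain('websocket');
  });

  // A valid poll may be held open until the server's next ping (25 s by default),
  // so this test gets a longer timeout than Jest's default.
  it('should maintain session across requests', async () => {
    const { sid } = await handshake();

    // Subsequent request with sid
    const response = await fetch(`${HTTP_BASE}&transport=polling&sid=${sid}`);
    expect(response.status).toBe(200); // Not 400!
  }, 30_000);

  it('should upgrade to WebSocket', async () => {
    const ws = new WebSocket(`${WS_BASE}&transport=websocket`);

    await waitForOpen(ws);
    expect(ws.readyState).toBe(WebSocket.OPEN);
    ws.close();
  });
});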
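
The handshake parsing that the first test relies on can also be exercised without network access. The sketch below assumes the Engine.IO v4 polling format, in which the body is the packet type "0" followed by a JSON object; `parseHandshake` is a hypothetical helper, not part of the existing test suite.

```typescript
interface EngineIoHandshake {
  sid: string;
  upgrades: string[];
  pingInterval?: number;
  pingTimeout?: number;
}

// Parse an Engine.IO v4 polling handshake body such as
//   0{"sid":"abc","upgrades":["websocket"],"pingInterval":25000,"pingTimeout":20000}
// The leading "0" is the Engine.IO "open" packet type.
function parseHandshake(body: string): EngineIoHandshake {
  if (!body.startsWith('0')) {
    throw new Error(`expected an "open" packet, got: ${body.slice(0, 20)}`);
  }
  const data = JSON.parse(body.slice(1));
  if (typeof data.sid !== 'string' || !Array.isArray(data.upgrades)) {
    throw new Error('handshake payload is missing sid or upgrades');
  }
  return data as EngineIoHandshake;
}
```

Throwing on a wrong packet type makes a misrouted or cached response fail with a clear message instead of a JSON parse error.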

CI/CD Integration:

# .github/workflows/socket-io-tests.yml
name: Socket.IO Regression Tests
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
      - run: npm install
      - run: npm run test:socket-io
      - name: Notify on failure
        if: failure()
        run: |
          curl -X POST ${{ secrets.SLACK_WEBHOOK }} \
            -d '{"text":"Socket.IO regression tests failed!"}'

D. Dedicated WebSocket Gateway [Future Consideration]

Source: executive-summary.md Long-term recommendations

Architecture:

Browser
  ↓
GKE Ingress (websocket.coditect.ai)
  ↓
WebSocket Gateway (nginx + Socket.IO proxy)
  ↓ (internal only)
theia Backend (coditect-combined)

Benefits:

Isolate WebSocket traffic from static content
Specialized configuration for long-lived connections
Independent scaling for WebSocket vs HTTP
Better observability and monitoring
Easier to troubleshoot

Implementation (future):

# websocket-gateway-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: websocket-gateway
  namespace: coditect-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: websocket-gateway
  template:
    metadata:
      labels:
        app: websocket-gateway
    spec:
      containers:
      - name: nginx
        image: nginx:latest
        ports:
        - containerPort: 80
        volumeMounts:
        - name: config
          mountPath: /etc/nginx/nginx.conf
          subPath: nginx.conf
      volumes:
      - name: config
        configMap:
          name: websocket-gateway-config

Priority: Consider in Q1 2026 if Socket.IO traffic grows significantly

III. Documentation & Knowledge Management

A. Runbook Creation [High Priority]

Purpose: Standardize Socket.IO troubleshooting procedures

Structure:

# Socket.IO Troubleshooting Runbook

## Quick Diagnostics (5 minutes)
1. Test external endpoint
2. Check GCP backend session affinity
3. Verify Service annotation
4. Check pod health

## Common Issues
1. Session affinity missing → Add Service annotation
2. CDN caching enabled → Disable CDN in BackendConfig
3. WebSocket headers stripped → Add Ingress annotation
4. Health endpoint missing → Create /health in nginx

## Escalation Criteria
- Issue not resolved after 30 minutes
- Multiple services affected
- Customer-reported outages

## Post-Resolution
- Update incident log
- Run regression tests
- Review monitoring dashboards

Location: docs/runbooks/socket-io-troubleshooting.md

B. Incident Report Template [Recommended]

Purpose: Document future Socket.IO issues for pattern analysis

Template:

# Socket.IO Incident Report

**Date**: YYYY-MM-DD
**Duration**: X hours
**Severity**: P0/P1/P2
**Impact**: Description

## Timeline
- HH:MM: Issue detected
- HH:MM: Investigation started
- HH:MM: Root cause identified
- HH:MM: Fix applied
- HH:MM: Validated and resolved

## Root Cause
[Detailed description]

## Resolution
[Fixes applied]

## Lessons Learned
- What went well
- What could be improved
- Action items

## Preventive Measures
[Future safeguards]

C. Architecture Documentation Update [Required]

Add Section: "Socket.IO Configuration Requirements"

Content:

## Socket.IO Configuration Requirements

### GKE Ingress
- WebSocket annotation: `cloud.google.com/websocket-max-idle-timeout="86400"`
- Backend timeout: 24 hours minimum

### Service (NEG-based)
- **CRITICAL**: Must have BackendConfig annotation
- Format: `cloud.google.com/backend-config='{"default":"coditect-backend-config"}'`

### BackendConfig
- CDN: Disabled (Socket.IO incompatible)
- Session affinity: CLIENT_IP
- Affinity cookie TTL: 86400 (24 hours; only takes effect with GENERATED_COOKIE affinity, not CLIENT_IP)
- Timeout: 86400 (24 hours)
- Health check: Use /health endpoint

### nginx
- WebSocket headers: Upgrade + Connection
- Proxy buffering: off
- Proxy cache: off
- Read/send timeout: 86400

### Application
- Socket.IO server on port 3000
- Path: /socket.io/ (default)
- CORS: Configured for coditect.ai
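
The BackendConfig items in this checklist lend themselves to an automated check. The sketch below validates a parsed BackendConfig `spec` against them; `backendConfigViolations` is a hypothetical helper, and the interface covers only the fields the checklist mentions.

```typescript
// Subset of a GKE BackendConfig `spec` covering the checklist items.
interface BackendConfigSpec {
  cdn?: { enabled?: boolean };
  sessionAffinity?: { affinityType?: string };
  timeoutSec?: number;
  healthCheck?: { requestPath?: string };
}

const DAY_SECONDS = 86400;

// Returns one message per checklist item that the spec does not satisfy.
function backendConfigViolations(spec: BackendConfigSpec): string[] {
  const violations: string[] = [];
  if (spec.cdn?.enabled === true) {
    violations.push('CDN must be disabled');
  }
  if (spec.sessionAffinity?.affinityType !== 'CLIENT_IP') {
    violations.push('session affinity must be CLIENT_IP');
  }
  if ((spec.timeoutSec ?? 0) < DAY_SECONDS) {
    violations.push('timeoutSec must be at least 86400');
  }
  if (spec.healthCheck?.requestPath !== '/health') {
    violations.push('health check must use /health');
  }
  return violations;
}
```

An absent `cdn` block is treated as disabled, which is an assumption of this sketch; the other three items must be set explicitly to pass.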

IV. Team Training & Knowledge Transfer

A. Socket.IO Troubleshooting Workshop [Recommended]

Duration: 2 hours

Agenda:

Hour 1: Technical deep dive
- Socket.IO protocol fundamentals
- GKE load balancer architecture
- NEG vs non-NEG services
- BackendConfig requirements
Hour 2: Hands-on troubleshooting
- Use diagnostic decision tree
- Run automated diagnostic script
- Simulate common failures
- Practice fix application

Materials:

This investigation package (all 10 documents)
Live GKE cluster access
Sample failure scenarios

B. On-Call Playbook [High Priority]

Quick Reference Card:

Socket.IO 400 Errors - On-Call Guide
=====================================

1. CHECK: Is it affecting all users?
   → Yes: P0 incident, page team
   → No: P1, investigate during business hours

2. QUICK TEST:
   curl -s -o /dev/null -w "%{http_code}\n" "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling"
   → 400: Continue to step 3
   → 200: Issue resolved or intermittent

3. CHECK SESSION AFFINITY:
   gcloud compute backend-services describe \
     k8s1-28b74fc1-coditect-app-coditect-combined-service-8-b2e75de7 \
     --global --format="value(sessionAffinity)"
   → NONE: Apply Service annotation (see runbook)
   → CLIENT_IP: Continue to step 4

4. CHECK SERVICE ANNOTATION:
   kubectl get service -n coditect-app coditect-combined-service \
     -o jsonpath='{.metadata.annotations.cloud\.google\.com/backend-config}'
   → Empty: Add annotation (see runbook)
   → Present: Check WebSocket annotation (step 5)

5. CHECK WEBSOCKET ANNOTATION:
   kubectl get ingress -n coditect-app coditect-production-ingress \
     -o jsonpath='{.metadata.annotations.cloud\.google\.com/websocket-max-idle-timeout}'
   → Empty: Add annotation (see runbook)
   → Present: Escalate to platform engineering

ESCALATION: Platform Engineering team
RUNBOOK: docs/runbooks/socket-io-troubleshooting.md
DIAGNOSTIC SCRIPT: socket.io-issue/socketio-diagnostics.sh
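
Steps 2 through 5 of the card form a small decision tree, which can be written down as a pure function for use in tooling or training. This is a sketch only: `OnCallObservation` and `nextAction` are hypothetical names, and treating any non-200 status or any affinity value other than NONE as "continue" is an assumption, since the card only names 400/200 and NONE/CLIENT_IP.

```typescript
// Observations gathered in steps 2-5 of the quick reference card.
interface OnCallObservation {
  pollingStatus: number;               // step 2: HTTP status of the polling request
  sessionAffinity: string;             // step 3: backend service sessionAffinity
  hasBackendConfigAnnotation: boolean; // step 4: Service annotation present
  hasWebSocketAnnotation: boolean;     // step 5: Ingress annotation present
}

// Returns the first action the card prescribes for the given observations.
function nextAction(o: OnCallObservation): string {
  if (o.pollingStatus === 200) return 'Issue resolved or intermittent';
  if (o.sessionAffinity === 'NONE') return 'Apply Service annotation (see runbook)';
  if (!o.hasBackendConfigAnnotation) return 'Add Service annotation (see runbook)';
  if (!o.hasWebSocketAnnotation) return 'Add Ingress annotation (see runbook)';
  return 'Escalate to platform engineering';
}
```

Because the checks run in card order, the function always returns the earliest unmet step, matching how an on-call engineer would walk the list.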

V. Summary of Recommended Actions

Immediate (Next 24 hours)

Wait for session affinity propagation to complete (P0)
Apply WebSocket annotation to Ingress (P0)
Run automated diagnostic script (socketio-diagnostics.sh) (P0)
Create /health endpoint in nginx (P1)
Validate with browser testing (P0)

Short-term (Next week)

Create Socket.IO monitoring dashboard (P1)
Implement automated health checks (P1)
Document findings in architecture docs (P1)
Create troubleshooting runbook (P1)
Add regression tests to CI/CD (P2)

Medium-term (Next month)

Conduct Socket.IO troubleshooting workshop (P2)
Create on-call playbook (P1)
Review all services for annotation gaps (P2)
Evaluate dedicated WebSocket gateway (P3)
Implement header analysis in diagnostic script (P3)

VI. Success Metrics

Track these to validate improvements:

Metric	Baseline	Target	Timeline
Socket.IO availability	0%	99.9%	1 week
MTTR (Mean Time To Resolution)	4+ hours	<30 min	1 month
False positive incidents	TBD	<1/month	3 months
Team troubleshooting confidence	TBD	90%	3 months

VII. Conclusion

The comprehensive investigation package (files.zip) provides excellent frameworks that complement our actual findings. Key gaps and takeaways:

Service annotation requirement for NEG-based services (not in reference docs)
WebSocket annotation as high-probability fix (85%) still needs testing
Automated diagnostic tools ready for immediate use
Long-term improvements clearly defined and prioritized

Next Steps: Execute immediate actions (P0), then progressively implement short-term and medium-term improvements for production-grade Socket.IO reliability.

Document Created: October 20, 2025
Last Updated: October 20, 2025 13:40 UTC
Version: 1.0
Related Documents:

analysis-troubleshooting-guide.md (our investigation)
files.zip contents (reference documentation)
