Additional Research Pathways & Recommendations
Based on Analysis of Investigation Package (files.zip)
Date: October 20, 2025
Integration with: analysis-troubleshooting-guide.md
Executive Summary
The comprehensive investigation package (files.zip) contains 8 documents totaling ~150 pages of analysis, procedures, and diagnostic tools. This document identifies additional research pathways and recommended next steps based on comparing:
- Our actual investigation findings (analysis-troubleshooting-guide.md)
- The reference documentation's comprehensive diagnostic frameworks (files.zip)
I. High-Value Research Pathways (Prioritized)
A. WebSocket Annotation Testing [P0 - 85% Success Probability]
Source: fix-implementation-guide.md, executive-summary.md
Why This Matters: The reference documentation identifies this as the highest probability fix (85%) but we haven't tested it yet because we were focusing on session affinity.
Implementation:
kubectl annotate ingress -n coditect-app coditect-production-ingress \
cloud.google.com/websocket-max-idle-timeout="86400" \
--overwrite
Expected Impact:
- Enables GKE L7 load balancer to properly handle WebSocket protocol
- Preserves `Upgrade: websocket` headers through load balancer
- Supports long-lived WebSocket connections (24 hours)
Testing Procedure:
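# Test WebSocket upgrade
# The URL is quoted so the shell does not treat "&" as a background operator.
# --http1.1 is required because the Upgrade header is not valid over HTTP/2.
curl -v --http1.1 "https://coditect.ai/theia/socket.io/?EIO=4&transport=websocket" \
  -H "Upgrade: websocket" \
  -H "Connection: Upgrade" \
  -H "Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==" \
  -H "Sec-WebSocket-Version: 13"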
# Expected: HTTP/1.1 101 Switching Protocols
Priority: Apply AFTER session affinity propagates (next fix to test)
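The `101 Switching Protocols` expectation can also be checked programmatically, for example inside a scripted probe. The following is a minimal sketch and not part of the investigation package: `HandshakeResponse` and `isWebSocketUpgrade` are hypothetical names, and the check only looks at the status code and the two headers sent in the test above.

```typescript
// Hypothetical shape for a captured handshake response (illustration only).
interface HandshakeResponse {
  status: number;
  headers: Record<string, string>;
}

// Returns true when the response looks like a completed WebSocket upgrade:
// status 101, "Upgrade: websocket", and a Connection header listing "upgrade".
function isWebSocketUpgrade(res: HandshakeResponse): boolean {
  // Header names are case-insensitive, so normalise them before lookup.
  const lower: Record<string, string> = {};
  for (const name of Object.keys(res.headers)) {
    lower[name.toLowerCase()] = res.headers[name].trim().toLowerCase();
  }
  const upgrade = lower["upgrade"] ?? "";
  // Connection may carry several comma-separated tokens, e.g. "keep-alive, Upgrade".
  const connectionTokens = (lower["connection"] ?? "")
    .split(",")
    .map((token) => token.trim());
  return (
    res.status === 101 &&
    upgrade === "websocket" &&
    connectionTokens.indexOf("upgrade") !== -1
  );
}
```

A response with status 400, or a 101 without the `Upgrade` header, returns `false`, which matches the failure modes we would expect if the load balancer strips the headers.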
B. Automated Diagnostic Script [P0 - Validation Tool]
Source: socketio-diagnostics.sh (400 lines)
Capabilities:
- Phase 1: Environment validation (kubectl, gcloud, curl)
- Phase 2: Header analysis (WebSocket headers through load balancer)
- Phase 3: GKE backend investigation (BackendConfig, session affinity)
- Phase 4: BackendConfig deep analysis (CDN, timeouts, health checks)
- Phase 5: Health check verification (endpoint availability)
- Phase 6: Session affinity testing (multi-request routing)
- Phase 7: WebSocket handshake simulation (full protocol test)
- Summary Report: Consolidated findings with recommendations
- Auto-fix mode: `--fix` flag applies recommended fixes
Usage:
cd socket.io-issue
chmod +x socketio-diagnostics.sh
# Run diagnostics only
./socketio-diagnostics.sh
# Run with verbose output
./socketio-diagnostics.sh --verbose
# Run diagnostics and auto-apply fixes
./socketio-diagnostics.sh --fix
# Check results
cat /tmp/socketio-diagnostics-*/summary.txt
Value:
- Comprehensive automated testing (30 minutes runtime)
- Identifies issues we may have missed
- Generates actionable report
- Can auto-apply fixes in production
Priority: Run AFTER applying WebSocket annotation
C. Health Check Endpoint Investigation [P1 - 70% Success Probability]
Source: diagnostic-decision-tree.md Q8, fix-implementation-guide.md Fix #2
Confirmed Issue:
kubectl exec -n coditect-app deployment/coditect-combined -- \
curl -s -o /dev/null -w "%{http_code}\n" http://localhost/health
# Result: 404 ← Endpoint doesn't exist!
Impact:
- BackendConfig health check points to `/health`
- 404 responses mark backend as unhealthy
- May cause intermittent connection failures
- Load balancer may route around "unhealthy" pods
Two Options:
Option A: Create /health endpoint in nginx (Recommended)
location = /health {
    access_log off;
    # default_type sets the Content-Type; add_header would emit a second Content-Type header
    default_type text/plain;
    return 200 "healthy\n";
}
Option B: Change BackendConfig health check path
healthCheck:
  requestPath: /  # Use existing endpoint instead
Implementation Priority: P1 (implement after WebSocket annotation validates)
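To make the impact of the 404 concrete, the sketch below replays a sequence of probe results the way a threshold-based health checker would. It is an illustration under assumptions: `replayProbes` and `HealthCheckPolicy` are hypothetical names, the threshold values must come from the actual BackendConfig, and only HTTP 200 is treated as a passing probe.

```typescript
type BackendState = "HEALTHY" | "UNHEALTHY";

// Thresholds are placeholders for illustration; read the real values from the BackendConfig.
interface HealthCheckPolicy {
  healthyThreshold: number;   // consecutive successes needed to become HEALTHY
  unhealthyThreshold: number; // consecutive failures needed to become UNHEALTHY
}

// Replays probe status codes in order and returns the resulting backend state.
// Only HTTP 200 counts as a successful probe, so a 404 from /health is a failure.
function replayProbes(
  statuses: number[],
  policy: HealthCheckPolicy,
  initial: BackendState = "HEALTHY",
): BackendState {
  let state: BackendState = initial;
  let successes = 0;
  let failures = 0;
  for (const status of statuses) {
    if (status === 200) {
      successes += 1;
      failures = 0;
      if (successes >= policy.healthyThreshold) state = "HEALTHY";
    } else {
      failures += 1;
      successes = 0;
      if (failures >= policy.unhealthyThreshold) state = "UNHEALTHY";
    }
  }
  return state;
}
```

With a policy of two consecutive failures, `replayProbes([404, 404], policy)` ends in `"UNHEALTHY"`, which is the state a pod without a `/health` endpoint would stay in.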
D. Header Analysis & Tracing [P1 - Diagnostic]
Source: investigation-runbook.md Phase 2
Purpose: Verify WebSocket headers reach nginx through GKE load balancer
Procedure:
# Enable detailed header logging in nginx.
# log_format is only valid inside the http block, so appending it to the end of
# nginx.conf would break the config. Write it to conf.d instead (this assumes
# nginx.conf includes /etc/nginx/conf.d/*.conf, as the stock configuration does).
# The quoted 'EOF' keeps the local shell from expanding the nginx variables.
kubectl exec -i -n coditect-app deployment/coditect-combined -- \
  tee /etc/nginx/conf.d/header-logging.conf > /dev/null <<'EOF'
log_format headers '$remote_addr - $remote_user [$time_local] '
                   '"$request" $status $body_bytes_sent '
                   '"$http_referer" "$http_user_agent" '
                   '"Upgrade: $http_upgrade" '
                   '"Connection: $http_connection"';
EOF
# Update server block to use headers format
kubectl exec -n coditect-app deployment/coditect-combined -- \
  sed -i 's/access_log .*/access_log \/var\/log\/nginx\/access.log headers;/' \
  /etc/nginx/sites-available/default
# Validate the configuration, then reload nginx
kubectl exec -n coditect-app deployment/coditect-combined -- \
  sh -c 'nginx -t && nginx -s reload'
# Make test request (URL quoted because of "&"; --http1.1 so the Upgrade header is sent as-is)
curl -v --http1.1 "https://coditect.ai/theia/socket.io/?EIO=4&transport=websocket" \
  -H "Upgrade: websocket" \
  -H "Connection: Upgrade"
# Check logs
kubectl logs -n coditect-app deployment/coditect-combined | \
  grep "Upgrade:" | tail -5
What to Look For:
- ✅ Headers present → GKE preserving headers (good)
- ❌ Headers missing → GKE stripping headers (need WebSocket annotation)
- ❌ Headers corrupted → GKE transformation issue (deeper investigation)
Priority: Run if WebSocket annotation doesn't fix issue
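The three outcomes listed under "What to Look For" can be applied mechanically to each logged line. The sketch below assumes the log line ends with the `"Upgrade: ..." "Connection: ..."` fields from the `headers` log format, and that nginx writes `-` for an empty variable; `classifyHeaderLog` is a hypothetical helper, not part of socketio-diagnostics.sh.

```typescript
type HeaderVerdict = "present" | "missing" | "corrupted";

// Classifies one access-log line written with the "headers" log format.
function classifyHeaderLog(line: string): HeaderVerdict {
  // Extracts the logged value of one header; falls back to "-" when the field is absent.
  const field = (name: string): string =>
    new RegExp(`"${name}: ([^"]*)"`).exec(line)?.[1]?.trim().toLowerCase() ?? "-";

  const upgrade = field("Upgrade");
  const connection = field("Connection");

  // "-" (or an empty value) means the header never reached nginx.
  if (upgrade === "-" || upgrade === "" || connection === "-" || connection === "") {
    return "missing";
  }
  // Connection may list several tokens, e.g. "keep-alive, Upgrade".
  const tokens = connection.split(",").map((token) => token.trim());
  return upgrade === "websocket" && tokens.indexOf("upgrade") !== -1
    ? "present"
    : "corrupted";
}
```

A `"missing"` verdict points at the load balancer stripping headers, while `"corrupted"` means a value arrived but is not what the client sent.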
E. Network Path Tracing [P2 - Advanced Diagnostic]
Source: investigation-runbook.md Phase 4
Purpose: Trace complete request path from browser to backend pod
Tools & Commands:
# 1. Check GKE forwarding rules
gcloud compute forwarding-rules list --format="table(name,IPAddress,target)" | \
grep coditect
# 2. Describe target proxy
gcloud compute target-https-proxies describe \
k8s2-ts-e6wek3rw-coditect-app-coditect-production-ingr-x8lkl70c \
--format="yaml(sslCertificates,urlMap)"
# 3. Check URL map routing
gcloud compute url-maps describe \
k8s2-um-e6wek3rw-coditect-app-coditect-production-ingr-x8lkl70c \
--format="yaml(defaultService,pathMatchers)"
# 4. Verify backend service health
gcloud compute backend-services get-health \
k8s1-28b74fc1-coditect-app-coditect-combined-service-8-b2e75de7 \
--global
# 5. Check certificate status
gcloud compute ssl-certificates describe \
mcrt-1e9fda11-9f24-455a-bd4b-074f776e3282 \
--format="yaml(certificate,expireTime)"
When to Use:
- If other fixes don't resolve issue
- To understand complete request flow
- For documentation and training
- During post-mortem analysis
II. Long-Term Improvements & Monitoring
A. Socket.IO Monitoring Dashboard [Recommended]
Purpose: Proactive detection of Socket.IO issues
Metrics to Track (names are placeholders; Prometheus metric names cannot contain dots, so the prefix is written `socket_io_`):
1. Connection Success Rate
   - socket_io_handshake_success_total / socket_io_handshake_attempts_total
   - Target: >99.9%
2. WebSocket Upgrade Rate
   - socket_io_websocket_upgrades_total / socket_io_connections_total
   - Target: >95% (allow 5% for polling-only clients)
3. Median Session Duration
   - histogram_quantile(0.5, sum by (le) (rate(socket_io_session_duration_seconds_bucket[5m])))
   - Target: >3600 seconds (1 hour)
4. 400 Error Rate
   - socket_io_http_400_errors_total
   - Target: 0 errors/hour
5. Backend Health Status
   - gke_backend_healthy_instances / gke_backend_total_instances
   - Target: 100%
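As a worked example of how these targets translate into pass/fail checks, the sketch below evaluates the four ratio-style metrics from raw counter values. The field names are hypothetical and simply mirror the metrics listed above; the session-duration target is left out because it needs histogram data rather than plain counters.

```typescript
// Raw counter values for one evaluation window (hypothetical field names).
interface SocketIoCounters {
  handshakeSuccess: number;
  handshakeAttempts: number;
  websocketUpgrades: number;
  connections: number;
  http400Errors: number;
  healthyInstances: number;
  totalInstances: number;
}

// One boolean per target: true means the target is met.
interface TargetReport {
  connectionSuccessRate: boolean; // > 99.9%
  websocketUpgradeRate: boolean;  // > 95%
  http400Errors: boolean;         // 0 errors in the window
  backendHealth: boolean;         // 100% of instances healthy
}

// Ratio helper that reports 0 instead of dividing by zero when there is no traffic.
const ratio = (numerator: number, denominator: number): number =>
  denominator === 0 ? 0 : numerator / denominator;

function checkTargets(c: SocketIoCounters): TargetReport {
  return {
    connectionSuccessRate: ratio(c.handshakeSuccess, c.handshakeAttempts) > 0.999,
    websocketUpgradeRate: ratio(c.websocketUpgrades, c.connections) > 0.95,
    http400Errors: c.http400Errors === 0,
    backendHealth: c.totalInstances > 0 && c.healthyInstances === c.totalInstances,
  };
}
```

For example, 9,995 successful handshakes out of 10,000 attempts gives 99.95% and meets the first target, while 990 out of 1,000 (99%) does not.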
Implementation:
# Create Prometheus ServiceMonitor
kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: socket-io-monitor
  namespace: coditect-app
spec:
  selector:
    matchLabels:
      app: coditect-combined
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s
EOF
# Create Grafana dashboard (import JSON)
curl -X POST http://grafana:3000/api/dashboards/db \
-H "Content-Type: application/json" \
-d @socket-io-dashboard.json
B. Automated Health Checks [Recommended]
Purpose: Continuous validation of Socket.IO functionality
External Synthetic Monitoring:
# cron job every 5 minutes
*/5 * * * * /usr/local/bin/socket-io-health-check.sh
# socket-io-health-check.sh
#!/bin/bash
ENDPOINT="https://coditect.ai/theia/socket.io/?EIO=4&transport=polling"
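# Test initial handshake (curl reports 000 if the connection itself fails)
STATUS=$(curl -s -o /dev/null --max-time 10 -w "%{http_code}" "$ENDPOINT")
if [ "$STATUS" != "200" ]; then
  # Alert on failure; status is sent as a string so that "000" remains valid JSON
  curl -s -X POST https://alerts.example.com/webhook \
    -H "Content-Type: application/json" \
    -d "{\"service\":\"socket-io\",\"status\":\"$STATUS\",\"severity\":\"critical\"}"
fi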
Internal Pod-Level Checks:
# Add to coditect-combined deployment
livenessProbe:
  httpGet:
    path: /socket.io/?EIO=4&transport=polling
    port: 3000
  initialDelaySeconds: 30
  periodSeconds: 60
  timeoutSeconds: 5
  failureThreshold: 3
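The probe settings above determine how long a broken Socket.IO endpoint can go unnoticed before the container is restarted. The sketch below is a rough estimate only: `restartWindowSeconds` is a hypothetical helper, and it ignores scheduling jitter and the time the restart itself takes.

```typescript
// The three probe fields that drive detection time.
interface ProbeTiming {
  periodSeconds: number;
  timeoutSeconds: number;
  failureThreshold: number;
}

// Estimates how long after the endpoint breaks the liveness probe gives up.
// earliest: the failure starts just before a probe and every probe fails immediately.
// latest:   the failure starts just after a successful probe and every probe runs into its timeout.
function restartWindowSeconds(p: ProbeTiming): { earliest: number; latest: number } {
  return {
    earliest: p.periodSeconds * (p.failureThreshold - 1),
    latest: p.periodSeconds * p.failureThreshold + p.timeoutSeconds,
  };
}

// Values from the livenessProbe above.
const probeWindow = restartWindowSeconds({
  periodSeconds: 60,
  timeoutSeconds: 5,
  failureThreshold: 3,
});
console.log(probeWindow); // { earliest: 120, latest: 185 }
```

With these values a failure is acted on after roughly two to three minutes, which is why the external check every 5 minutes is a complement to the probe and not a replacement for it.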
C. Regression Testing in CI/CD [Recommended]
Purpose: Prevent future Socket.IO regressions
Test Suite:
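// tests/e2e/socket-io.test.ts
// Assumes a runtime with global fetch and WebSocket (for example a recent Node.js release).
const BASE_URL = 'https://coditect.ai/theia/socket.io/?EIO=4';

// Engine.IO prefixes the handshake payload with the packet type '0' (open).
async function extractSid(response: Response): Promise<string> {
  const body = await response.text();
  return JSON.parse(body.substring(1)).sid;
}

// Resolves once the socket opens; rejects if the connection fails.
function waitForOpen(ws: WebSocket): Promise<void> {
  return new Promise((resolve, reject) => {
    ws.addEventListener('open', () => resolve());
    ws.addEventListener('error', () => reject(new Error('WebSocket failed to open')));
  });
}

describe('Socket.IO Integration', () => {
  it('should complete initial handshake', async () => {
    const response = await fetch(`${BASE_URL}&transport=polling`);
    expect(response.status).toBe(200);
    const body = await response.text();
    const data = JSON.parse(body.substring(1)); // Remove leading '0'
    expect(data).toHaveProperty('sid');
    expect(data.upgrades).toContain('websocket');
  });

  it('should maintain session across requests', async () => {
    // Initial handshake
    const response1 = await fetch(`${BASE_URL}&transport=polling`);
    const sid = await extractSid(response1);
    // Subsequent request with sid: POST the Socket.IO CONNECT packet ('40').
    // A POST returns immediately, whereas a GET long-polls until the server has data.
    const response2 = await fetch(`${BASE_URL}&transport=polling&sid=${sid}`, {
      method: 'POST',
      headers: { 'Content-Type': 'text/plain;charset=UTF-8' },
      body: '40',
    });
    expect(response2.status).toBe(200); // Not 400!
  });

  it('should upgrade to WebSocket', async () => {
    const ws = new WebSocket(
      'wss://coditect.ai/theia/socket.io/?EIO=4&transport=websocket'
    );
    await waitForOpen(ws);
    expect(ws.readyState).toBe(WebSocket.OPEN);
    ws.close();
  });
});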
CI/CD Integration:
# .github/workflows/socket-io-tests.yml
name: Socket.IO Regression Tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
      - run: npm install
      - run: npm run test:socket-io
      - name: Notify on failure
        if: failure()
        run: |
          curl -X POST ${{ secrets.SLACK_WEBHOOK }} \
            -d '{"text":"Socket.IO regression tests failed!"}'
D. Dedicated WebSocket Gateway [Future Consideration]
Source: executive-summary.md Long-term recommendations
Architecture:
Browser
↓
GKE Ingress (websocket.coditect.ai)
↓
WebSocket Gateway (nginx + Socket.IO proxy)
↓ (internal only)
theia Backend (coditect-combined)
Benefits:
- Isolate WebSocket traffic from static content
- Specialized configuration for long-lived connections
- Independent scaling for WebSocket vs HTTP
- Better observability and monitoring
- Easier to troubleshoot
Implementation (future):
# websocket-gateway-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: websocket-gateway
  namespace: coditect-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: websocket-gateway
  template:
    metadata:
      labels:
        app: websocket-gateway
    spec:
      containers:
        - name: nginx
          image: nginx:latest
          ports:
            - containerPort: 80
          volumeMounts:
            - name: config
              mountPath: /etc/nginx/nginx.conf
              subPath: nginx.conf
      volumes:
        - name: config
          configMap:
            name: websocket-gateway-config
Priority: Consider in Q1 2026 if Socket.IO traffic grows significantly
III. Documentation & Knowledge Management
A. Runbook Creation [High Priority]
Purpose: Standardize Socket.IO troubleshooting procedures
Structure:
# Socket.IO Troubleshooting Runbook
## Quick Diagnostics (5 minutes)
1. Test external endpoint
2. Check GCP backend session affinity
3. Verify Service annotation
4. Check pod health
## Common Issues
1. Session affinity missing → Add Service annotation
2. CDN caching enabled → Disable CDN in BackendConfig
3. WebSocket headers stripped → Add Ingress annotation
4. Health endpoint missing → Create /health in nginx
## Escalation Criteria
- Issue not resolved after 30 minutes
- Multiple services affected
- Customer-reported outages
## Post-Resolution
- Update incident log
- Run regression tests
- Review monitoring dashboards
Location: docs/runbooks/socket-io-troubleshooting.md
B. Incident Report Template [Recommended]
Purpose: Document future Socket.IO issues for pattern analysis
Template:
# Socket.IO Incident Report
**Date**: YYYY-MM-DD
**Duration**: X hours
**Severity**: P0/P1/P2
**Impact**: Description
## Timeline
- HH:MM: Issue detected
- HH:MM: Investigation started
- HH:MM: Root cause identified
- HH:MM: Fix applied
- HH:MM: Validated and resolved
## Root Cause
[Detailed description]
## Resolution
[Fixes applied]
## Lessons Learned
- What went well
- What could be improved
- Action items
## Preventive Measures
[Future safeguards]
C. Architecture Documentation Update [Required]
Add Section: "Socket.IO Configuration Requirements"
Content:
## Socket.IO Configuration Requirements
### GKE Ingress
- WebSocket annotation: `cloud.google.com/websocket-max-idle-timeout="86400"`
- Backend timeout: 24 hours minimum
### Service (NEG-based)
- **CRITICAL**: Must have BackendConfig annotation
- Format: `cloud.google.com/backend-config='{"default":"coditect-backend-config"}'`
### BackendConfig
- CDN: Disabled (Socket.IO incompatible)
- Session affinity: CLIENT_IP
- Affinity cookie TTL: 86400 (24 hours; only takes effect with GENERATED_COOKIE affinity, not CLIENT_IP)
- Timeout: 86400 (24 hours)
- Health check: Use /health endpoint
### nginx
- WebSocket headers: Upgrade + Connection
- Proxy buffering: off
- Proxy cache: off
- Read/send timeout: 86400
### Application
- Socket.IO server on port 3000
- Path: /socket.io/ (default)
- CORS: Configured for coditect.ai
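The BackendConfig part of this checklist lends itself to an automated audit. The sketch below checks a parsed BackendConfig `spec` against the requirements above. `auditBackendConfig` is a hypothetical helper; the interface covers only the fields the checklist mentions, and the input is assumed to be already parsed (for example from `kubectl get backendconfig -o json`).

```typescript
// Subset of the BackendConfig spec that the checklist refers to.
interface BackendConfigSpec {
  cdn?: { enabled?: boolean };
  sessionAffinity?: { affinityType?: string; affinityCookieTtlSec?: number };
  timeoutSec?: number;
  healthCheck?: { requestPath?: string };
}

const REQUIRED_TIMEOUT_SEC = 86400; // 24 hours, as required above

// Returns a list of human-readable problems; an empty list means the spec passes.
function auditBackendConfig(spec: BackendConfigSpec): string[] {
  const problems: string[] = [];
  if (spec.cdn?.enabled === true) {
    problems.push("cdn.enabled must be false (Socket.IO is incompatible with CDN caching)");
  }
  if (spec.sessionAffinity?.affinityType !== "CLIENT_IP") {
    problems.push("sessionAffinity.affinityType should be CLIENT_IP");
  }
  if ((spec.timeoutSec ?? 0) < REQUIRED_TIMEOUT_SEC) {
    problems.push(`timeoutSec should be at least ${REQUIRED_TIMEOUT_SEC}`);
  }
  if (spec.healthCheck?.requestPath !== "/health") {
    problems.push("healthCheck.requestPath should be /health");
  }
  return problems;
}
```

An empty spec yields three problems (affinity, timeout, health check), since an absent `cdn` block is treated as CDN disabled.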
IV. Team Training & Knowledge Transfer
A. Socket.IO Troubleshooting Workshop [Recommended]
Duration: 2 hours
Agenda:
- Hour 1: Technical deep dive
  - Socket.IO protocol fundamentals
  - GKE load balancer architecture
  - NEG vs non-NEG services
  - BackendConfig requirements
- Hour 2: Hands-on troubleshooting
  - Use diagnostic decision tree
  - Run automated diagnostic script
  - Simulate common failures
  - Practice fix application
Materials:
- This investigation package (all 10 documents)
- Live GKE cluster access
- Sample failure scenarios
B. On-Call Playbook [High Priority]
Quick Reference Card:
Socket.IO 400 Errors - On-Call Guide
=====================================
1. CHECK: Is it affecting all users?
   → Yes: P0 incident, page team
   → No: P1, investigate during business hours
2. QUICK TEST (prints only the HTTP status; URL quoted because of "&"):
   curl -s -o /dev/null -w "%{http_code}\n" \
     "https://coditect.ai/theia/socket.io/?EIO=4&transport=polling"
   → 400: Continue to step 3
   → 200: Issue resolved or intermittent
3. CHECK SESSION AFFINITY:
   gcloud compute backend-services describe \
     k8s1-28b74fc1-coditect-app-coditect-combined-service-8-b2e75de7 \
     --global --format="value(sessionAffinity)"
   → NONE: Apply Service annotation (see runbook)
   → CLIENT_IP: Continue to step 4
4. CHECK SERVICE ANNOTATION:
   kubectl get service -n coditect-app coditect-combined-service \
     -o jsonpath='{.metadata.annotations.cloud\.google\.com/backend-config}'
   → Empty: Add annotation (see runbook)
   → Present: Check WebSocket annotation (step 5)
5. CHECK WEBSOCKET ANNOTATION:
   kubectl get ingress -n coditect-app coditect-production-ingress \
     -o jsonpath='{.metadata.annotations.cloud\.google\.com/websocket-max-idle-timeout}'
   → Empty: Add annotation (see runbook)
   → Present: Escalate to platform engineering
ESCALATION: Platform Engineering team
RUNBOOK: docs/runbooks/socket-io-troubleshooting.md
DIAGNOSTIC SCRIPT: /home/claude/socketio-diagnostics.sh
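Steps 2 to 5 of the card form a simple decision chain, which can be encoded so that tooling and humans follow the same logic. The sketch below is an illustration only: `Observations`, `OnCallAction`, and `nextAction` are hypothetical names, and the inputs are the raw outputs of the commands on the card.

```typescript
// Raw results of the commands in steps 2-5 (empty string = command printed nothing).
interface Observations {
  pollingStatus: number;       // step 2: HTTP status of the polling handshake
  sessionAffinity: string;     // step 3: "NONE", "CLIENT_IP", ...
  serviceAnnotation: string;   // step 4: backend-config annotation on the Service
  websocketAnnotation: string; // step 5: WebSocket annotation on the Ingress
}

type OnCallAction =
  | "RESOLVED_OR_INTERMITTENT"
  | "NOT_COVERED_BY_CARD"
  | "APPLY_SERVICE_ANNOTATION"
  | "ADD_WEBSOCKET_ANNOTATION"
  | "ESCALATE_TO_PLATFORM_ENGINEERING";

// Walks the card top to bottom and returns the first action that applies.
function nextAction(o: Observations): OnCallAction {
  if (o.pollingStatus === 200) return "RESOLVED_OR_INTERMITTENT";
  // The card only describes 200 and 400; anything else needs separate triage.
  if (o.pollingStatus !== 400) return "NOT_COVERED_BY_CARD";
  if (o.sessionAffinity !== "CLIENT_IP") return "APPLY_SERVICE_ANNOTATION";
  if (o.serviceAnnotation.trim() === "") return "APPLY_SERVICE_ANNOTATION";
  if (o.websocketAnnotation.trim() === "") return "ADD_WEBSOCKET_ANNOTATION";
  return "ESCALATE_TO_PLATFORM_ENGINEERING";
}
```

Steps 3 and 4 map to the same action because both point at the missing Service annotation; they stay separate checks so the card can report which one failed.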
V. Summary of Recommended Actions
Immediate (Next 24 hours)
- Wait for session affinity propagation to complete (P0)
- Apply WebSocket annotation to Ingress (P0)
- Run automated diagnostic script (`socketio-diagnostics.sh`) (P0)
- Create `/health` endpoint in nginx (P1)
- Validate with browser testing (P0)
Short-term (Next week)
- Create Socket.IO monitoring dashboard (P1)
- Implement automated health checks (P1)
- Document findings in architecture docs (P1)
- Create troubleshooting runbook (P1)
- Add regression tests to CI/CD (P2)
Medium-term (Next month)
- Conduct Socket.IO troubleshooting workshop (P2)
- Create on-call playbook (P1)
- Review all services for annotation gaps (P2)
- Evaluate dedicated WebSocket gateway (P3)
- Implement header analysis in diagnostic script (P3)
VI. Success Metrics
Track these to validate improvements:
| Metric | Baseline | Target | Timeline |
|---|---|---|---|
| Socket.IO availability | 0% | 99.9% | 1 week |
| MTTR (Mean Time To Resolution) | 4+ hours | <30 min | 1 month |
| False positive incidents | TBD | <1/month | 3 months |
| Team troubleshooting confidence | TBD | 90% | 3 months |
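MTTR in the table above can be computed directly from the "Issue detected" and "Validated and resolved" timestamps in the incident report template. The sketch below assumes those timestamps are recorded as ISO 8601 strings; `Incident` and `mttrMinutes` are hypothetical names.

```typescript
// One incident, using the first and last timeline entries of the report template.
interface Incident {
  detectedAt: string; // ISO 8601, e.g. "2025-10-20T09:00:00Z"
  resolvedAt: string; // ISO 8601
}

// Mean time to resolution in minutes; returns 0 when there are no incidents.
function mttrMinutes(incidents: Incident[]): number {
  if (incidents.length === 0) return 0;
  const totalMs = incidents.reduce(
    (sum, incident) =>
      sum + (Date.parse(incident.resolvedAt) - Date.parse(incident.detectedAt)),
    0,
  );
  return totalMs / incidents.length / 60000;
}
```

For example, one 4-hour incident and one 20-minute incident average to 130 minutes, still well above the 30-minute target.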
VII. Conclusion
The investigation package (files.zip) provides diagnostic frameworks that complement our own findings. Key takeaways:
- Service annotation requirement for NEG-based services (not in reference docs)
- WebSocket annotation as high-probability fix (85%) still needs testing
- Automated diagnostic tools ready for immediate use
- Long-term improvements clearly defined and prioritized
Next Steps: Execute immediate actions (P0), then progressively implement short-term and medium-term improvements for production-grade Socket.IO reliability.
Document Created: October 20, 2025
Last Updated: October 20, 2025 13:40 UTC
Version: 1.0
Related Documents:
- analysis-troubleshooting-guide.md (our investigation)
- files.zip contents (reference documentation)