
Critical Infrastructure Roadmap

Date: 2025-10-06
Priority: URGENT - Core infrastructure blockers for V5 MVP


🚨 Critical Issues to Resolve

1. WebSocket Pod OS Connection Issue

Problem: WebSocket connections to user workspace pods fail due to OS-level networking
Status: 🔴 BLOCKING
Impact: Real-time IDE communication is broken, which limits interactive features

Root Cause Analysis Needed:

# Check WebSocket gateway deployment
kubectl get deployment -n coditect-app | grep websocket

# Check pod-to-pod networking
kubectl exec -n coditect-app <websocket-pod> -- ping <workspace-pod-ip>

# Check service mesh / network policy
kubectl get networkpolicies -n coditect-app
kubectl describe svc -n coditect-app | grep -i websocket
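
The `ping` check above only exercises ICMP, while the WebSocket gateway speaks TCP, and minimal container images often ship without `ping`. As a complementary check, here is a minimal sketch of a TCP reachability probe that could run from inside a pod. The function name `tcp_reachable` is hypothetical, and the port to probe (8765 in the examples in this document) is an assumption about the gateway's listener.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Returns true if a TCP connection to `addr` can be opened within `timeout`.
/// This tests the same transport a WebSocket handshake would use.
pub fn tcp_reachable(addr: SocketAddr, timeout: Duration) -> bool {
    TcpStream::connect_timeout(&addr, timeout).is_ok()
}
```

A `false` result for `<workspace-pod-ip>:8765` alongside a successful `ping` would point toward a port-level restriction (for example a network policy) rather than missing routing.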

Potential Fixes:

  1. Option A: Use Kubernetes Service for pod-to-pod communication

    • Create ClusterIP service for each workspace pod
    • WebSocket gateway connects via service DNS
    • Example: workspace-{user_id}.coditect-app.svc.cluster.local
  2. Option B: Deploy WebSocket gateway as sidecar

    • Run WebSocket proxy in same pod as workspace
    • Communicate via localhost (127.0.0.1)
    • No inter-pod networking needed
  3. Option C: Use Istio/Envoy service mesh

    • Add service mesh for transparent proxying
    • Automatic mTLS and load balancing
    • More complex, but production-grade

Recommended: Option B (sidecar) for simplicity and reliability
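
For comparison, Option A depends on deriving a stable service DNS name per user. The sketch below builds the name from the example pattern above; the function name `workspace_service_host` and the validation rule (the user id must keep `workspace-{user_id}` a valid DNS label) are assumptions, not part of the current codebase.

```rust
/// Builds the in-cluster DNS name for a per-user workspace Service (Option A).
/// Rejects ids that would not form a valid DNS label: only lowercase
/// alphanumerics and '-', no trailing '-', at most 63 characters in total.
pub fn workspace_service_host(user_id: &str) -> Result<String, String> {
    let label = format!("workspace-{}", user_id);
    let valid_chars = user_id
        .chars()
        .all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '-');
    if user_id.is_empty() || !valid_chars || user_id.ends_with('-') || label.len() > 63 {
        return Err(format!("user id {:?} does not form a valid DNS label", user_id));
    }
    Ok(format!("{}.coditect-app.svc.cluster.local", label))
}
```

This extra naming and validation step is one reason the sidecar option is simpler: with Option B the gateway address is always `localhost`.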

Implementation:

# Workspace pod with WebSocket sidecar
apiVersion: v1
kind: Pod
metadata:
  name: workspace
  namespace: user-{user_id}
spec:
  containers:
    # Main workspace container
    - name: workspace
      image: gcr.io/serene-voltage-464305-n2/t2-workspace-pod:latest
      ports:
        - containerPort: 3000 # theia
      env:
        - name: WEBSOCKET_URL
          value: "ws://localhost:8765" # Connect to sidecar

    # WebSocket sidecar
    - name: websocket-gateway
      image: gcr.io/serene-voltage-464305-n2/coditect-websocket:latest
      ports:
        - containerPort: 8765 # WebSocket server
      env:
        - name: FDB_CLUSTER_STRING
          value: "coditect:production@10.128.0.8:4500"
        - name: JWT_SECRET
          valueFrom:
            secretKeyRef:
              name: jwt-secret
              key: secret
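
On the workspace side, the container needs to turn `WEBSOCKET_URL` into a host and port before connecting to the sidecar. The following is a minimal sketch of that step; `parse_ws_url` and `sidecar_endpoint` are hypothetical helper names, and falling back to `ws://localhost:8765` when the variable is unset is an assumption based on the manifest above.

```rust
/// Splits a `ws://host:port[/path]` URL into host and port.
/// Returns None for other schemes or when the port is missing or invalid.
pub fn parse_ws_url(url: &str) -> Option<(String, u16)> {
    let rest = url.strip_prefix("ws://")?;
    let authority = rest.split('/').next()?;
    let (host, port) = authority.rsplit_once(':')?;
    if host.is_empty() {
        return None;
    }
    Some((host.to_string(), port.parse().ok()?))
}

/// Reads WEBSOCKET_URL from the environment, defaulting to the sidecar address.
pub fn sidecar_endpoint() -> Option<(String, u16)> {
    let url = std::env::var("WEBSOCKET_URL")
        .unwrap_or_else(|_| "ws://localhost:8765".to_string());
    parse_ws_url(&url)
}
```

Because containers in one pod share a network namespace, `localhost:8765` resolves to the sidecar without any Service or pod IP lookup.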

2. JWT Authorization + FoundationDB Integration

Problem: JWT validation not integrated with FDB user lookup
Status: 🟡 PARTIAL - JWT generation works, validation needs FDB
Impact: Auth works, but sessions are not stored in or retrieved from FDB

Current Flow (Incomplete):

1. User login → API generates JWT
2. JWT contains {userId, tenantId}
3. Frontend stores JWT in localStorage
4. Frontend sends JWT in Authorization header
5. ❌ API doesn't validate JWT against FDB session
6. ❌ No session persistence in FDB

Target Flow (Complete):

1. User login → API generates JWT
2. API stores session in FDB: tenant_id/session/{session_id}
3. JWT contains {userId, tenantId, sessionId}
4. Frontend sends JWT in Authorization header
5. ✅ API validates JWT signature
6. ✅ API looks up session in FDB
7. ✅ API verifies session is active (not expired/terminated)
8. ✅ Request proceeds with user context
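
Steps 5 through 8 of the target flow reduce to a small decision function, which can be sketched independently of Actix and FoundationDB. All names here (`AuthOutcome`, `SessionRecord`, `authorize`) are hypothetical, and timestamps are plain epoch seconds for illustration; the 24-hour timeout used in the usage note matches the middleware shown later in this section.

```rust
/// Possible results of validating a request against its session.
#[derive(Debug, PartialEq)]
pub enum AuthOutcome {
    Allowed,
    InvalidSignature,
    SessionNotFound,
    SessionInactive,
    SessionExpired,
}

/// The subset of session state the decision depends on.
pub struct SessionRecord {
    pub status: String,
    pub last_access_epoch_secs: u64,
}

/// Applies the checks in flow order: signature, lookup, status, idle timeout.
pub fn authorize(
    signature_valid: bool,
    session: Option<&SessionRecord>,
    now_epoch_secs: u64,
    timeout_secs: u64,
) -> AuthOutcome {
    if !signature_valid {
        return AuthOutcome::InvalidSignature;
    }
    let session = match session {
        Some(s) => s,
        None => return AuthOutcome::SessionNotFound,
    };
    if session.status != "active" {
        return AuthOutcome::SessionInactive;
    }
    let idle = now_epoch_secs.saturating_sub(session.last_access_epoch_secs);
    if idle > timeout_secs {
        return AuthOutcome::SessionExpired;
    }
    AuthOutcome::Allowed
}
```

With `timeout_secs = 86_400`, a session last accessed one hour ago is allowed and one idle for more than a day is rejected. Keeping this logic free of I/O makes it straightforward to unit test before wiring it into the middleware.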

Implementation (Rust/Actix Web):

// src/middleware/jwt_auth.rs
use actix_web::{
    dev::ServiceRequest,
    error::{ErrorInternalServerError, ErrorUnauthorized},
    Error, HttpMessage,
};
use actix_web_httpauth::extractors::bearer::BearerAuth;
use foundationdb::Database;
use jsonwebtoken::{decode, DecodingKey, Validation};
use serde::{Deserialize, Serialize};

use crate::models::session::Session;

// Claims carried in the JWT payload; `decode` requires `Deserialize`.
#[derive(Debug, Serialize, Deserialize)]
pub struct JwtClaims {
    pub user_id: String,
    pub tenant_id: String,
    pub session_id: String,
    pub exp: usize,
}

// Identity attached to the request after validation.
// Not defined elsewhere in this document; fields mirror step 6 below.
#[derive(Debug, Clone)]
pub struct UserContext {
    pub user_id: String,
    pub tenant_id: String,
    pub session_id: String,
}

pub async fn validate_jwt(
    req: ServiceRequest,
    credentials: BearerAuth,
    fdb: Database,
) -> Result<ServiceRequest, Error> {
    let token = credentials.token();

    // 1. Verify JWT signature (Validation::default() also checks `exp`)
    let secret = std::env::var("JWT_SECRET")
        .map_err(|_| ErrorInternalServerError("JWT_SECRET not set"))?;
    let token_data = decode::<JwtClaims>(
        token,
        &DecodingKey::from_secret(secret.as_bytes()),
        &Validation::default(),
    )
    .map_err(|_| ErrorUnauthorized("Invalid token"))?;

    let claims = token_data.claims;

    // 2. Look up session in FDB
    // FDB, serde and chrono errors are not Actix errors, so each is mapped explicitly.
    let fdb_tx = fdb.create_trx().map_err(ErrorInternalServerError)?;
    let session_key = format!("{}/session/{}", claims.tenant_id, claims.session_id);

    let session_data = fdb_tx
        .get(session_key.as_bytes(), false)
        .await
        .map_err(ErrorInternalServerError)?;

    let data = match session_data {
        Some(data) => data,
        None => return Err(ErrorUnauthorized("Session not found")),
    };
    let session: Session =
        serde_json::from_slice(&data).map_err(ErrorInternalServerError)?;

    // 3. Verify session is active
    if session.status != "active" {
        return Err(ErrorUnauthorized("Session inactive"));
    }

    // 4. Check session expiration (24-hour idle timeout)
    let now = chrono::Utc::now();
    let last_access = chrono::DateTime::parse_from_rfc3339(&session.last_access_at)
        .map_err(ErrorInternalServerError)?;
    let session_timeout = chrono::Duration::hours(24);

    if now.signed_duration_since(last_access) > session_timeout {
        return Err(ErrorUnauthorized("Session expired"));
    }

    // 5. Update last access time
    let mut updated_session = session.clone();
    updated_session.last_access_at = now.to_rfc3339();
    let encoded = serde_json::to_vec(&updated_session).map_err(ErrorInternalServerError)?;
    fdb_tx.set(session_key.as_bytes(), &encoded);
    fdb_tx
        .commit()
        .await
        .map_err(|e| ErrorInternalServerError(format!("FDB commit failed: {:?}", e)))?;

    // 6. Attach user context to request
    req.extensions_mut().insert(UserContext {
        user_id: claims.user_id,
        tenant_id: claims.tenant_id,
        session_id: claims.session_id,
    });

    Ok(req)
}

Session Model (FDB):

// src/models/session.rs
use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Session {
    pub session_id: String,
    pub user_id: String,
    pub tenant_id: String,
    pub workspace_id: String,
    pub pod_namespace: Option<String>,
    pub pod_name: Option<String>,
    pub websocket_connected: bool,
    pub status: String,         // "active" | "suspended" | "terminated"
    pub created_at: String,     // ISO 8601
    pub last_access_at: String, // ISO 8601
    pub metadata: Option<serde_json::Value>,
}

impl Session {
    // Key layout: {tenant_id}/session/{session_id}
    pub fn fdb_key(&self) -> String {
        format!("{}/session/{}", self.tenant_id, self.session_id)
    }

    pub fn is_active(&self) -> bool {
        self.status == "active"
    }

    // A session is expired once it has been idle longer than `timeout_hours`.
    // An unparsable timestamp is treated as expired instead of panicking.
    pub fn is_expired(&self, timeout_hours: i64) -> bool {
        let now = chrono::Utc::now();
        let timeout = chrono::Duration::hours(timeout_hours);

        match chrono::DateTime::parse_from_rfc3339(&self.last_access_at) {
            Ok(last_access) => now.signed_duration_since(last_access) > timeout,
            Err(_) => true,
        }
    }
}

Deliverables:

  • Create JWT validation middleware with FDB lookup
  • Implement session expiration logic
  • Update last_access_at on each request
  • Add session cleanup job (delete expired sessions)
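
The cleanup deliverable can be prototyped against an in-memory ordered map before touching FoundationDB, since a `BTreeMap` scan over the `{tenant_id}/session/` prefix behaves like an FDB range read. This is a sketch under that assumption; `StoredSession`, `CleanupStats` and `cleanup_sessions` are hypothetical names, and timestamps are epoch seconds rather than the ISO 8601 strings used by the session model.

```rust
use std::collections::BTreeMap;

/// Minimal stand-in for a stored session value.
pub struct StoredSession {
    pub status: String,
    pub last_access_epoch_secs: u64,
}

/// Counters a cleanup run could log.
pub struct CleanupStats {
    pub scanned: usize,
    pub deleted: usize,
}

/// Deletes sessions of one tenant that have been idle longer than `timeout_secs`.
pub fn cleanup_sessions(
    store: &mut BTreeMap<String, StoredSession>,
    tenant_id: &str,
    now_epoch_secs: u64,
    timeout_secs: u64,
) -> CleanupStats {
    let prefix = format!("{}/session/", tenant_id);
    let mut scanned = 0;
    let mut expired_keys: Vec<String> = Vec::new();

    // Ordered scan starting at the prefix, stopping at the first foreign key.
    for (key, session) in store
        .range(prefix.clone()..)
        .take_while(|(k, _)| k.starts_with(&prefix))
    {
        scanned += 1;
        let idle = now_epoch_secs.saturating_sub(session.last_access_epoch_secs);
        if idle > timeout_secs {
            expired_keys.push(key.clone());
        }
    }

    for key in &expired_keys {
        store.remove(key);
    }
    CleanupStats { scanned, deleted: expired_keys.len() }
}
```

Collecting keys first and deleting afterwards keeps the scan read-only, which maps naturally onto a read range followed by `clear` calls in a real transaction.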

3. Coditect Server + Knowledge Base as a Service (KBaaS)

Problem: Coditect server not deployed, Knowledge Base service not integrated with GCP
Status: 🔴 NOT DEPLOYED
Impact: No centralized knowledge management, agent coordination limited

Architecture:

┌───────────────────────────────────────────────────────┐
│ Google Cloud Platform                                 │
│                                                       │
│ ┌───────────────────────────────────────────────────┐ │
│ │ GKE Cluster (codi-poc-e2-cluster)                 │ │
│ │                                                   │ │
│ │  ┌──────────────┐  ┌─────────────────────────┐    │ │
│ │  │ theia IDE    │  │ Coditect Server         │    │ │
│ │  │ (Frontend)   │  │ (Knowledge Hub)         │    │ │
│ │  └──────────────┘  └─────────────────────────┘    │ │
│ │         ↓                       ↓                 │ │
│ │  ┌───────────────────────────────────────────┐    │ │
│ │  │ FoundationDB Cluster (StatefulSet)        │    │ │
│ │  │ - User data                               │    │ │
│ │  │ - Session state                           │    │ │
│ │  │ - Knowledge base index                    │    │ │
│ │  └───────────────────────────────────────────┘    │ │
│ └───────────────────────────────────────────────────┘ │
│                                                       │
│ ┌───────────────────────────────────────────────────┐ │
│ │ Cloud Storage (Knowledge Artifacts)               │ │
│ │ gs://serene-voltage-464305-n2-knowledge/          │ │
│ │ - Documentation embeddings                        │ │
│ │ - Code snippets                                   │ │
│ │ - Agent templates                                 │ │
│ └───────────────────────────────────────────────────┘ │
│                                                       │
│ ┌───────────────────────────────────────────────────┐ │
│ │ Vertex AI (Embeddings + RAG)                      │ │
│ │ - text-embedding-004                              │ │
│ │ - Context retrieval for agents                    │ │
│ └───────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────┘

Coditect Server Features:

  1. Knowledge Indexing:

    • Index all documentation from V4
    • Create embeddings using Vertex AI
    • Store in FoundationDB + Cloud Storage
  2. Agent Coordination:

    • Central registry of available agents
    • Task distribution and load balancing
    • Agent-to-agent communication (A2A protocol)
  3. Session Management:

    • Centralized session state
    • WebSocket connection registry
    • Pod lifecycle management
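
To make the agent coordination feature more concrete, here is a minimal sketch of a registry that hands each task to the least-loaded agent with a matching capability. Nothing here reflects an existing Coditect API: `AgentRegistry`, `AgentInfo`, the capability strings and the tie-break by agent id are all assumptions for illustration, and the A2A protocol itself is out of scope.

```rust
use std::collections::HashMap;

/// What the registry tracks per agent.
#[derive(Debug, Clone, PartialEq)]
pub struct AgentInfo {
    pub capability: String,
    pub active_tasks: usize,
}

/// In-memory registry of available agents, keyed by agent id.
#[derive(Default)]
pub struct AgentRegistry {
    agents: HashMap<String, AgentInfo>,
}

impl AgentRegistry {
    /// Adds (or replaces) an agent with no tasks assigned.
    pub fn register(&mut self, agent_id: &str, capability: &str) {
        self.agents.insert(
            agent_id.to_string(),
            AgentInfo { capability: capability.to_string(), active_tasks: 0 },
        );
    }

    /// Assigns one task to the least-loaded agent offering `capability`.
    /// Ties are broken by agent id so the choice is deterministic.
    pub fn assign(&mut self, capability: &str) -> Option<String> {
        let chosen = self
            .agents
            .iter()
            .filter(|(_, info)| info.capability == capability)
            .min_by(|(id_a, a), (id_b, b)| {
                a.active_tasks.cmp(&b.active_tasks).then_with(|| id_a.cmp(id_b))
            })
            .map(|(id, _)| id.clone())?;
        self.agents.get_mut(&chosen)?.active_tasks += 1;
        Some(chosen)
    }
}
```

In a deployed server this state would presumably live in FoundationDB rather than process memory, so that all replicas see the same load figures.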

Implementation:

// src/coditect-server/main.rs
use actix_web::{web, App, HttpServer};
use foundationdb::{options::StreamingMode, Database, RangeOption};
use serde::{Deserialize, Serialize};
// Assumed client wrapper exposing `new(project, model)` and
// `embed_text(&str) -> Vec<f32>`.
use vertex_ai::EmbeddingClient;

pub struct CoditectServer {
    fdb: Database,
    embedding_client: EmbeddingClient,
    storage_bucket: String,
}

impl CoditectServer {
    pub async fn new() -> Result<Self, Box<dyn std::error::Error>> {
        // Connect to FoundationDB
        let fdb = Database::default()?;

        // Initialize Vertex AI
        let embedding_client = EmbeddingClient::new(
            "serene-voltage-464305-n2",
            "text-embedding-004",
        )
        .await?;

        // Cloud Storage bucket
        let storage_bucket = "gs://serene-voltage-464305-n2-knowledge".to_string();

        Ok(Self {
            fdb,
            embedding_client,
            storage_bucket,
        })
    }

    // Index documentation
    pub async fn index_knowledge(
        &self,
        content: &str,
        metadata: serde_json::Value,
    ) -> Result<String, Box<dyn std::error::Error>> {
        // 1. Generate embedding
        let embedding = self.embedding_client.embed_text(content).await?;

        // 2. Store in FDB with embedding
        let doc_id = uuid::Uuid::new_v4().to_string();
        let fdb_tx = self.fdb.create_trx()?;

        let knowledge_entry = KnowledgeEntry {
            id: doc_id.clone(),
            content: content.to_string(),
            embedding,
            metadata,
            created_at: chrono::Utc::now().to_rfc3339(),
        };

        fdb_tx.set(
            format!("knowledge/{}", doc_id).as_bytes(),
            serde_json::to_vec(&knowledge_entry)?.as_slice(),
        );
        fdb_tx.commit().await?;

        // 3. Upload to Cloud Storage (backup)
        let _storage_path = format!("{}/docs/{}.json", self.storage_bucket, doc_id);
        // Upload logic here...

        Ok(doc_id)
    }

    // Retrieve relevant knowledge (RAG)
    pub async fn retrieve_knowledge(
        &self,
        query: &str,
        top_k: usize,
    ) -> Result<Vec<KnowledgeEntry>, Box<dyn std::error::Error>> {
        // 1. Generate query embedding
        let query_embedding = self.embedding_client.embed_text(query).await?;

        // 2. Scan the `knowledge/` keyspace (brute-force cosine similarity).
        //    The end key is a byte string: "\xff" is not valid in a Rust `str` literal.
        //    Assumes the `RangeOption`-based `get_range` of the `foundationdb` crate;
        //    a single call may not return every key in a large keyspace.
        let fdb_tx = self.fdb.create_trx()?;
        let mut range = RangeOption::from((&b"knowledge/"[..], &b"knowledge/\xff"[..]));
        range.mode = StreamingMode::WantAll;
        let knowledge_docs = fdb_tx.get_range(&range, 1, false).await?;

        // 3. Rank by similarity, skipping entries that fail to deserialize
        let mut results: Vec<(KnowledgeEntry, f64)> = knowledge_docs
            .iter()
            .filter_map(|kv| serde_json::from_slice::<KnowledgeEntry>(kv.value()).ok())
            .map(|entry| {
                let similarity = cosine_similarity(&query_embedding, &entry.embedding);
                (entry, similarity)
            })
            .collect();

        // `total_cmp` never panics, unlike `partial_cmp(..).unwrap()` on NaN
        results.sort_by(|a, b| b.1.total_cmp(&a.1));

        Ok(results.into_iter().take(top_k).map(|(entry, _)| entry).collect())
    }
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct KnowledgeEntry {
    pub id: String,
    pub content: String,
    pub embedding: Vec<f32>,
    pub metadata: serde_json::Value,
    pub created_at: String,
}

fn cosine_similarity(a: &[f32], b: &[f32]) -> f64 {
    let dot_product: f32 = a.iter().zip(b.iter()).map(|(x, y)| x * y).sum();
    let magnitude_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let magnitude_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    // Zero-magnitude vectors would divide by zero and yield NaN
    if magnitude_a == 0.0 || magnitude_b == 0.0 {
        return 0.0;
    }
    (dot_product / (magnitude_a * magnitude_b)) as f64
}
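
The ranking step can be exercised without FoundationDB or Vertex AI. The sketch below is a self-contained variant with hypothetical names (`safe_cosine`, `top_k`) that operates on in-memory `(id, embedding)` pairs and returns 0.0 for zero-magnitude or mismatched vectors; it illustrates the brute-force approach only and is not the server's actual retrieval path.

```rust
/// Cosine similarity that returns 0.0 for mismatched or zero-magnitude vectors.
pub fn safe_cosine(a: &[f32], b: &[f32]) -> f64 {
    if a.len() != b.len() {
        return 0.0;
    }
    let dot: f64 = a.iter().zip(b).map(|(x, y)| f64::from(*x) * f64::from(*y)).sum();
    let mag_a = a.iter().map(|x| f64::from(*x).powi(2)).sum::<f64>().sqrt();
    let mag_b = b.iter().map(|x| f64::from(*x).powi(2)).sum::<f64>().sqrt();
    if mag_a == 0.0 || mag_b == 0.0 {
        0.0
    } else {
        dot / (mag_a * mag_b)
    }
}

/// Returns the ids of the `k` documents most similar to `query`, best first.
pub fn top_k<'a>(query: &[f32], docs: &'a [(&'a str, Vec<f32>)], k: usize) -> Vec<&'a str> {
    let mut scored: Vec<(&str, f64)> = docs
        .iter()
        .map(|(id, embedding)| (*id, safe_cosine(query, embedding)))
        .collect();
    // Descending order; total_cmp is well defined for every f64 value.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.into_iter().take(k).map(|(id, _)| id).collect()
}
```

A full scan like this is linear in the number of stored documents, so the latency target for retrieval may eventually call for an approximate nearest-neighbor index instead.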

Deployment:

# k8s/coditect-server-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coditect-server
  namespace: coditect-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: coditect-server
  template:
    metadata:
      labels:
        app: coditect-server
    spec:
      containers:
        - name: coditect-server
          image: gcr.io/serene-voltage-464305-n2/coditect-server:latest
          ports:
            - containerPort: 8080
          env:
            - name: FDB_CLUSTER_STRING
              value: "coditect:production@10.128.0.8:4500"
            - name: GCP_PROJECT
              value: "serene-voltage-464305-n2"
            - name: KNOWLEDGE_BUCKET
              value: "gs://serene-voltage-464305-n2-knowledge"
            - name: VERTEX_AI_REGION
              value: "us-central1"
          volumeMounts:
            - name: fdb-cluster-config
              mountPath: /etc/foundationdb
      volumes:
        - name: fdb-cluster-config
          configMap:
            name: fdb-cluster-config
---
apiVersion: v1
kind: Service
metadata:
  name: coditect-server
  namespace: coditect-app
spec:
  selector:
    app: coditect-server
  ports:
    - port: 8080
      targetPort: 8080

Deliverables:

  • Build Coditect server container
  • Create Cloud Storage bucket for knowledge artifacts
  • Enable Vertex AI embeddings API
  • Deploy to GKE
  • Index V4 documentation
  • Create agent coordination endpoints

🎯 Priority Task List

Week 1: WebSocket Fix (URGENT)

  1. Diagnose WebSocket Issue:

    # Check current WebSocket gateway deployment
    kubectl get all -n coditect-app | grep websocket

    # Test pod-to-pod connectivity
    kubectl exec -n coditect-app <api-pod> -- ping <workspace-pod-ip>

    # Check logs
    kubectl logs -n coditect-app -l app=coditect-websocket --tail=100
  2. Implement Sidecar Solution:

    • Update workspace pod YAML with WebSocket sidecar
    • Build combined image (workspace + WebSocket)
    • Test localhost communication
    • Deploy to test namespace
    • Verify WebSocket connections work
  3. Update API to use new WebSocket endpoint:

    • Change WebSocket URL from ws://<pod-ip>:8765 to ws://localhost:8765
    • Update frontend WebSocket client
    • Test end-to-end

Week 2: JWT + FDB Integration (HIGH)

  1. Implement JWT Middleware:

    • Create jwt_auth.rs with FDB session lookup
    • Add session expiration logic
    • Update last_access_at on each request
    • Add to API routes
  2. Update Login/Register:

    • Store session in FDB on login
    • Include sessionId in JWT payload
    • Return session data to frontend
  3. Session Cleanup Job:

    • Create CronJob to delete expired sessions
    • Run daily at 2 AM UTC
    • Log cleanup statistics

Week 3-4: Coditect Server + KBaaS (MEDIUM)

  1. Infrastructure Setup:

    • Create Cloud Storage bucket: serene-voltage-464305-n2-knowledge
    • Enable Vertex AI API
    • Create service account with Vertex AI permissions
  2. Build Coditect Server:

    • Implement knowledge indexing
    • Implement RAG retrieval
    • Create agent coordination endpoints
    • Add health check endpoint
  3. Deploy and Index:

    • Deploy to GKE
    • Index all V4 documentation
    • Test knowledge retrieval
    • Integrate with agents

📊 Success Metrics

Week 1 (WebSocket):

  • ✅ WebSocket connections stable (no disconnects)
  • ✅ Pod-to-pod communication working
  • ✅ Real-time messages flowing

Week 2 (JWT/FDB):

  • ✅ JWT validation against FDB sessions
  • ✅ Session expiration enforced
  • ✅ No unauthorized access

Week 3-4 (KBaaS):

  • ✅ Documentation fully indexed
  • ✅ RAG retrieval working (< 500ms)
  • ✅ Agents using knowledge base
  • ✅ 95%+ relevance on queries
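
The "< 500ms" retrieval metric needs a concrete definition to be checkable. One possible reading, sketched below, is that the 95th-percentile latency over a set of samples must stay under the target; the choice of p95, the nearest-rank method, and the names `percentile_ms` and `meets_latency_target` are assumptions, since the roadmap does not state how the figure is measured.

```rust
/// Nearest-rank percentile over latency samples in milliseconds.
/// Returns None for an empty sample set or a percentile outside 0..=100.
pub fn percentile_ms(samples: &[u64], pct: f64) -> Option<u64> {
    if samples.is_empty() || !(0.0..=100.0).contains(&pct) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    let rank = ((pct / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}

/// True when the p95 latency is strictly below `target_ms`.
pub fn meets_latency_target(samples: &[u64], target_ms: u64) -> bool {
    matches!(percentile_ms(samples, 95.0), Some(p95) if p95 < target_ms)
}
```

Using a percentile instead of the mean keeps a few slow outliers from being hidden by many fast requests.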

Related Documents:

  • V5 Migration Plan: docs/v4-analysis/v5-migration-plan-serena-reuse.md
  • Current Deployment: docs/v4-analysis/current-deployment-status.md
  • V4 Frontend Integration: docs/v4-analysis/v4-theia-frontend-integration.md
  • V4 CLAUDE.md: coditect-v4/CLAUDE.md

Next Steps:

  1. Fix WebSocket pod OS connection (sidecar pattern)
  2. Integrate JWT validation with FDB sessions
  3. Deploy Coditect server with KBaaS to GKE