CLAUDE.md
This file provides guidance to Claude Code and other AI assistants when working with this codebase.
🚀 Starting a New Session?
LATEST STATUS: ✅ Build #18 OPERATIONAL - Pods Healthy, Ready for MVP Scaling (Build #18 Success Report)
- Build #18 (2025-10-28): ✅ DEPLOYED AND OPERATIONAL
- ✅ CrashLoopBackOff Fixed: Permission denied errors resolved (non-root execution working)
- ✅ All Pods Healthy: coditect-combined-0 and coditect-combined-1 serving traffic
- ✅ Services Running: theia IDE, CODI2 monitoring, File monitor, NGINX all operational
- ✅ Image: `8449bd02-7a28-4de2-8e26-7618396b3c2f`
- ✅ Security: Non-root user (coditect, UID 1001, GID 1001)
- 📄 Fix Applied: Changed log directories from `/var/log/*` to `/app/logs/*` (user-writable)
- ⚠️ MVP SCALING REQUIREMENT (MVP Scaling Analysis):
- Current: 3 pods = 3-6 user capacity
- Required: 10-15 pods for 20 user pilot/beta testing
- Cost Impact: $150/month → $500-750/month
- Action: Scale to 10 pods + deploy HPA before MVP launch
- Command: `kubectl scale statefulset/coditect-combined --replicas=10 -n coditect-app`
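The scale-plus-HPA action above could be expressed roughly as follows. This is a sketch, not the checked-in `k8s/coditect-combined-hpa.yaml`; the resource name and namespace come from this document, while the replica bounds and CPU target are illustrative assumptions:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: coditect-combined-hpa        # hypothetical name
  namespace: coditect-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: coditect-combined
  minReplicas: 10                    # MVP floor from the scaling analysis above
  maxReplicas: 15                    # pilot ceiling for ~20 users
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70     # illustrative threshold
```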
- 📊 Comprehensive Status Report (2025-10-28T00:50:00Z Status Report):
- Complete Build #18 analysis and deployment results
- MVP capacity planning and cost analysis
- Risk assessment: Capacity shortage blocking MVP without scaling
- Next actions: Scale immediately, deploy HPA, test with beta users
- Previous Achievements:
- Build #18 Features: Multi-llm CLI suite (7 providers), zsh + oh-my-zsh, Fixed extensions (38 VSIX), Complete branding
- Multi-llm: Claude CLI, OpenAI CLI, Aider, Shell-GPT, Grok CLI, Anthropic SDK, Gemini CLI
- StatefulSet Migration: Complete with persistent storage (100 GB workspace + 10 GB config per pod)
- Starter Configuration: 10-20 user capacity tier ready (4 vCPU, 8 GB RAM per pod)
- Socket.IO Investigation (Oct 20, 2025):
- ✅ Root Cause #1 - CDN Caching: GCP CDN was caching Socket.IO requests with stale session IDs → FIXED
- Solution Applied: Disabled CDN in BackendConfig (`k8s/backend-config-no-cdn.yaml`)
- Validation: CDN headers removed from responses
- ⏳ Root Cause #2 - Session Affinity Missing: GCP backend service shows `SESSION_AFFINITY: NONE` → IN PROGRESS
- Problem: Service missing BackendConfig annotation (NEG requirement)
- Solution Applied: Annotated Service with BackendConfig reference
- Status: Propagating to GCP load balancer (2-5 minutes)
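Taken together, the two root-cause fixes above amount to a BackendConfig roughly like this. It is a sketch under stated assumptions, not the checked-in `k8s/backend-config-no-cdn.yaml`; the resource name and cookie TTL are illustrative:

```yaml
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: coditect-backend-config      # hypothetical name
  namespace: coditect-app
spec:
  cdn:
    enabled: false                   # Root Cause #1: stop CDN caching Socket.IO responses
  sessionAffinity:
    affinityType: "GENERATED_COOKIE" # Root Cause #2: sticky sessions for Socket.IO
    affinityCookieTtlSec: 3600       # illustrative TTL
```

The Service then points at it via the `cloud.google.com/backend-config` annotation, which is what "Annotated Service with BackendConfig reference" above refers to.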
- 📋 Additional Fixes Recommended (from reference docs):
- P0: WebSocket annotation to Ingress (85% success probability)
- P0: Health check endpoint creation (70% success probability)
- P1: Increased backend timeout and connection draining optimization
- P0: Add PVCs for workspace persistence ✨ NEW (prevents data loss)
- 🗄️ Hybrid Storage Architecture (ADR-028, Part 2) (2025-10-28):
- ✅ Problem Identified: Pod-local storage causes data loss on scale-down events
- ✅ Solution Designed: Hybrid Storage = Shared Base (Docker layers) + User PVCs (10 GB each)
- ✅ Cost Savings: $7.05/user/month (96% cheaper than NFS $205/month)
- ✅ Performance: <1ms latency with GCE Persistent Disk SSD (15K-30K IOPS)
- ✅ Scalability: 10 user slots per pod, dynamic assignment via backend routing
- 📋 Implementation Timeline: 30-38 hours (6 phases: Image layers, PVC provisioning, StatefulSet slots, routing, testing, backups)
- 📊 Daily Backups: VolumeSnapshots with 7-day retention for disaster recovery
- 🔧 Status: Design complete, ready for implementation (Phase 1: Modify Dockerfile)
- Comprehensive Documentation:
- `docs/11-analysis/socket-io-theia-persistence-insights.md` - START HERE - Critical fixes for data loss + Socket.IO (P0 priority)
- `docs/11-analysis/theia-gke-scaling-research-summary.md` - theia pod persistence, scaling 1-50 → 10k users
- `docs/11-analysis/theia-instance-running-on-gcp-gke-kubernetes.md` - Full 60KB Perplexity research (PVCs, session timeout, StatefulSets)
- `socket.io-issue/analysis-troubleshooting-guide.md` - Complete Socket.IO investigation findings, diagnostic commands, fix procedures
- `socket.io-issue/additional-research-pathways.md` - Next steps and long-term improvements
- `socket.io-issue/README.md` - Investigation package index (~150 pages of analysis)
- `socket.io-issue/executive-summary.md` - High-level overview with 5 prioritized fixes
- `socket.io-issue/diagnostic-decision-tree.md` - Structured troubleshooting logic (15 questions)
- `socket.io-issue/fix-implementation-guide.md` - Detailed fix procedures with validation tests
- `socket.io-issue/socketio-diagnostics.sh` - Automated diagnostic script (400 lines)
- Next Steps (Priority Order):
- 🔴 P0 CRITICAL: Add PVCs to deployment (stop data loss immediately) - 30 min
- 🔴 P0 CRITICAL: Disable theia Cloud session timeout - 15 min
- Validate session affinity propagation (check GCP backend service)
- Apply WebSocket annotation to Ingress - 15 min
- Create `/health` and `/ready` endpoints in nginx - 30 min
- Run automated diagnostic script for comprehensive validation
- Test at https://coditect.ai/theia after all fixes applied
- Verify persistence: Create file in pod → delete pod → check file still exists
- Architecture: See "Production Architecture (GKE)" section below
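The `/health` and `/ready` endpoints called for in the steps above could be served as minimal static NGINX responses. This fragment is illustrative only; the actual locations, bodies, and health-check wiring may differ:

```nginx
# Illustrative nginx fragment: static endpoints for load-balancer health checks
location /health {
    access_log off;
    default_type text/plain;
    return 200 "healthy\n";
}
location /ready {
    access_log off;
    default_type text/plain;
    return 200 "ready\n";
}
```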
IMPORTANT: See docs/reading-order.md for the complete list of documents to read before beginning work.
Critical first reads:
- `docs/10-execution-plans/2025-10-27-build-18-final-configuration.md` - START HERE - Build #18 Final Configuration (non-root, multi-llm, security hardening)
- `docs/10-execution-plans/2025-10-20-build-23-theia-localhost-fix-checkpoint.md` - Build #23 checkpoint (localhost:3000 fix, 5-build saga)
- `thoughts/shared/research/2025-10-19t21-26-15z-sprint-2-3-checkpoint.md` - Sprint 2-3 checkpoint (validation + .coditect migration)
- `docs/10-execution-plans/2025-10-19-sprint-2-validation-and-sprint-3-plan.md` - Sprint 2 validation tests + Sprint 3 implementation plan
- `docs/07-adr/adr-027-coditect-skills-container-integration.md` - .claude → .coditect migration architectural decision (3-phase plan)
- `docs/11-analysis/multi-cli-configuration-discovery-patterns.md` - Multi-CLI compatibility research (9 tools analyzed)
- `phased-deployment-checklist.md` - Current sprint status with checkboxes
- `docs/testing/testing-strategy.md` - Complete testing methodology and framework
- `.claude/CLAUDE.md` - Autonomous Development Mode & Context Engineering
- `docs/DEFINITIVE-V5-architecture.md` - Complete V5 system design
- `docs/11-analysis/monitor-codi-container-provisioning-strategy.md` - Container provisioning strategy (HYBRID approach)
Build #17 Session Exports (2025-10-27):
- `docs/09-sessions/2025-10-27-EXPORT-BUILD17-SESSION1.txt` - Session 1: Build #17 deployment readiness (Docker SUCCESS, kubectl fix, all fixes applied)
- `docs/09-sessions/2025-10-27-EXPORT-BUILD17-SESSION2.txt` - Session 2: Architecture evolution documentation (ADR-028 theia IDE, ADR-029 StatefulSet migration)
🛠️ Quick Reference Commands
# Development
npm run dev # Start dev server
npm run build # Production build
npm run type-check # TypeScript check
# Backend (Rust)
cd backend
cargo build --release # Build backend
cargo test # Run unit tests
# GKE Deployment (PRODUCTION)
kubectl get pods -n coditect-app # View all pods
kubectl get services -n coditect-app # View services
kubectl port-forward -n coditect-app service/coditect-api-v5-service 8080:80 # Test V5 API locally
kubectl logs -f deployment/coditect-api-v5 -n coditect-app # View V5 logs
# Testing & Quality Assurance
./scripts/test-runner.sh # Run comprehensive test suite
npm run test # Frontend unit tests
cargo test # Backend unit tests (from backend/)
# File Monitor (Rust-based audit logging)
scripts/monitor/start-file-monitor.sh # Start daemon (use --poll on WSL2!)
tail -f .coditect/logs/events.log # View JSON events
# Combined Build & Deploy (Frontend + theia)
npm run build # Build frontend first (creates dist/)
gcloud builds submit --config cloudbuild-combined.yaml . # Deploy to GKE
⚠️ Note: All production services run on GKE, not Cloud Run
⚡ Skills - Check FIRST Before Reinventing
CRITICAL: Before implementing ANY workflow, check if a skill already exists!
Quick Skill Lookup
Step 1: Check Registry
# View all available skills (14 total)
cat .claude/skills/REGISTRY.json | jq '.skills[] | {name, description, tags}'
Step 2: Use Skill If Match Found
# Example: Deployment workflow
cd .claude/skills/build-deploy-workflow
./core/deploy.sh --build-num=20 --changes="Feature X"
# Example: Git commit
cd .claude/skills/git-workflow-automation
./core/git-helper.sh --commit --message="Fix bug" --type=fix
# Example: Code editor (autonomous modification)
"Use code-editor skill to implement user profile editing with backend + frontend"
Available Production Skills (5 High-Value)
| Skill | Use When | Time Saved / Token Efficiency | Integration |
|---|---|---|---|
| code-editor ✨ | Multi-file code modifications (3+ files) | 30-40% token reduction | Orchestrator Phase 3 |
| build-deploy-workflow | Building & deploying to GKE | 40 min (45→5) | Deployment automation |
| gcp-resource-cleanup | Cleaning legacy GCP resources | 28 min (30→2) | Cost optimization |
| git-workflow-automation | Git operations, commits, PRs | 8 min (10→2) | Git workflows |
| cross-file-documentation-update | Syncing 4 doc files | 13 min (15→2) | Documentation sync |
Additional Technical Skills (9 Total)
- deployment-archeology - Find previous successful deployments
- foundationdb-queries - FDB query patterns, tenant isolation
- rust-backend-patterns - Actix-web patterns for T2
- search-strategies - Grep/Glob optimization
- framework-patterns - Event-driven, FSM patterns
- evaluation-framework - llm-as-judge patterns
- production-patterns - Circuit breakers, error handling
- communication-protocols - Multi-agent handoff
- google-cloud-build - Cloud Build optimization
- internal-comms - Team communication
- multi-agent-workflow - Token management, orchestration
Token Efficiency Strategy
Why Skills First:
- ❌ Reinventing solution: 5,000-10,000 tokens
- ✅ Using existing skill: 1,500-2,500 tokens (70-80% reduction)
Pattern:
- Check `.claude/skills/REGISTRY.json` for matching skill
- Load full SKILL.md only if match found (progressive disclosure)
- Execute skill's proven workflow
- Fall back to agent/custom solution only if no skill exists
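The skills-first pattern above can be sketched in TypeScript. The `SkillEntry` shape mirrors the `name`/`description`/`tags` fields shown in the jq query; `findSkill` and the sample entries are hypothetical, not the actual REGISTRY.json contents:

```typescript
// Hypothetical sketch of the skills-first lookup pattern.
interface SkillEntry {
  name: string;
  description: string;
  tags: string[];
}

// Match task keywords against skill tags; load the full SKILL.md only when a
// match exists (progressive disclosure keeps token usage low).
function findSkill(registry: SkillEntry[], keywords: string[]): SkillEntry | undefined {
  const wanted = new Set(keywords.map((k) => k.toLowerCase()));
  return registry.find((s) => s.tags.some((t) => wanted.has(t.toLowerCase())));
}

// Illustrative entries only (not the real registry contents).
const registry: SkillEntry[] = [
  { name: "build-deploy-workflow", description: "Build & deploy to GKE", tags: ["deploy", "gke"] },
  { name: "git-workflow-automation", description: "Git operations", tags: ["git", "commit"] },
];

const match = findSkill(registry, ["deploy"]); // matches build-deploy-workflow
```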
Registry Update (after adding new skills):
python3 .claude/scripts/build-skill-registry.py
See: .claude/skills/README.md for complete skill documentation
🔍 Deployment Archeology - Finding Previous Successful Builds
When deployments fail, use this process to find and restore working configurations:
Quick Process
# 1. Get deployment creation date
kubectl get deployment <NAME> -n coditect-app -o jsonpath='{.metadata.creationTimestamp}'
# 2. Find successful builds on that date
gcloud builds list --filter="createTime>='YYYY-MM-DDT00:00:00Z'" --limit=20
# 3. Analyze successful build config
gcloud builds describe <BUILD_ID> --format="yaml(steps,options)"
# 4. Check git history for archived files
git log --all --full-history -- <FILENAME>
# 5. Restore from archive if needed
cp docs/99-archive/deployment-obsolete/<FILE> ./<FILE>
Example: Combined Service Recovery (Oct 18, 2025)
Problem: New dockerfile.combined failing to build
Solution Found:
- Deployment created: 2025-10-13T09:58:29Z
- Successful build: 6e95a4d9 at 09:50:07Z (8 min before)
- Used file: `dockerfile.local-test` (archived, not dockerfile.combined!)
- Machine: E2_HIGHCPU_32 (32 CPUs, NOT 8 CPUs)
- Node memory: 8GB heap (`NODE_OPTIONS=--max_old_space_size=8192`)
- Deploy method: gke-deploy (NOT kubectl)
Key Files:
- Working Dockerfile: `dockerfile.local-test` (restored from `docs/99-archive/`)
- Cloud Build config: `cloudbuild-combined.yaml` (updated with proven settings)
- Pre-requisite: Frontend must be built first (`npm run build` creates `dist/`)
See: .claude/skills/deployment-archeology/SKILL.md for complete process
🔧 API URL Configuration - Production Best Practices
Critical Issue Fixed (Oct 20, 2025): Frontend was calling http://localhost:8080 instead of /api/v5
The Problem
Deployed frontend bundle contained wrong API URL:
// Deployed bundle (WRONG)
VITE_API_URL=http://localhost:8080
Root Cause: .env file being included in Docker build despite .dockerignore, and Vite processing it at build time.
The Solution Journey (12 Attempts)
| Attempt | Strategy | Result | Why It Failed/Worked |
|---|---|---|---|
| #9 | Added ENV VITE_API_URL=/api/v5 to Dockerfile | ❌ Failed | .env processed by Vite before ENV set |
| #10 | Added RUN rm -f .env* + ENV | ❌ Failed | Docker layer cache reused old build |
| #11 | Fixed .env source file to /api/v5 | ❌ Failed | Docker cache didn't detect .env content change |
| #12 | Hardcoded /api/v5 in TypeScript source | ✅ SUCCESS | Source change invalidates cache |
Production Best Practice
File: src/services/api-client.ts
BEFORE (Environment-Dependent):
// Relies on environment variables - can break in deployment
const API_BASE_URL = import.meta.env.VITE_API_URL || '/api/v5';
AFTER (Production-Ready):
// API base URL - hardcoded for production reliability
// Relative path works in all environments (dev, staging, prod)
const API_BASE_URL = '/api/v5';
Why This Works:
- ✅ Source code change invalidates Docker layer cache
- ✅ No dependency on environment variables
- ✅ Relative path `/api/v5` works in ALL environments
- ✅ Clear and explicit - no hidden configuration
- ✅ Eliminates entire class of deployment issues
Lessons Learned
- Avoid .env files for production config - Too easy for them to be included in builds
- Hardcode relative paths in source - More reliable than environment variables
- Docker layer caching - ENV changes and file deletions don't invalidate previous layers
- Source code changes - Only way to guarantee Docker cache invalidation
Verification
# Check deployed bundle (should NOT contain VITE_API_URL)
curl -s https://coditect.ai/assets/*.js | grep -o 'VITE_API_URL[^,}]*'
# Expected: Nothing (hardcoded, no env var in bundle)
# Or check for hardcoded path
curl -s https://coditect.ai/assets/*.js | grep -o '/api/v5'
# Expected: /api/v5
Build #12 Details
- Status: SUCCESS (10m20s)
- Build ID: `13e4134c-818e-4192-9963-c7dce7a02265`
- Image: `us-central1-docker.pkg.dev/serene-voltage-464305-n2/coditect/coditect-combined:13e4134c-818e-4192-9963-c7dce7a02265`
- Change: Hardcoded `/api/v5` in `src/services/api-client.ts:6`
- Pending: Deploy and verify
Project Overview
Coditect AI IDE (T2) - Browser-based IDE built on Eclipse theia with:
- 16+ local llm models via LM Studio (no cloud)
- Multi-agent system (MCP + A2A protocols)
- Multi-session architecture for parallel work
- FoundationDB persistence + OPFS browser cache
- Privacy-first: local-only processing
Key Architectural Decision
Foundation: Eclipse theia (EPL 2.0) - NOT building from scratch
- ✅ Saves 6-12 months development time
- ✅ Free commercial use (no license fees)
- ✅ VS Code extension compatible
- ✅ Monaco editor + terminal included
Critical: We're customizing theia with extensions, not building a new IDE.
Architecture Evolution
Comprehensive documentation of design evolution:
- ADR-028: theia IDE Integration Evolution (2025-10-27)
- Why we adopted Eclipse theia - Decision rationale and comparison with custom IDE
- Technical integration - 68 theia packages, custom branding, VS Code extensions
- Benefits realized - 9-10 months time savings, $225K cost savings, 3x more features at MVP
- Current architecture - Combined deployment with frontend + theia on GKE
- Future roadmap - MCP integration, multi-llm chat panel, collaborative editing
- Lessons learned - Production patterns and best practices
- ADR-029: StatefulSet Persistent Storage Migration (2025-10-27)
- The problem - Data loss on pod restarts (critical user impact)
- The solution - Migration from Deployment → StatefulSet with persistent volumes
- Storage architecture - 100 GB workspace + 10 GB config per user (PVCs)
- Capacity planning - Starter (10-20 users), Production (50-100), Enterprise (500+)
- Cost analysis - $4-8/user/month storage costs (competitive with Gitpod, GitHub Codespaces)
- Migration process - 4-step deployment with zero downtime
- Lessons learned - Kubernetes best practices for stateful applications
🏗️ Production Architecture (GKE)
ALL services run on Google Kubernetes Engine (GKE), NOT Cloud Run:
┌─────────────────────────────────────────────────┐
│ GKE Ingress (34.8.51.57) │
│ - coditect.ai, www.coditect.ai, api.coditect.ai│
└──────────────┬──────────────────────────────────┘
│
┌───────┴────────┐
│ │
┌──────▼────────┐ ┌───▼──────────────────┐
│ coditect- │ │ coditect-api-v5 │ ← NEW Rust API
│ combined │ │ (3 pods) │
│ (3 pods) │ │ - Actix-web + FDB │
│ ├─ V5 React │ │ - JWT auth │
│ ├─ theia IDE │ │ - Port 8080 │
│ └─ NGINX │ └──────────────────────┘
└───────────────┘
│
┌──────▼──────────────────────────────────┐
│ FoundationDB │
│ - foundationdb-0/1/2 (StatefulSet) │
│ - fdb-proxy (2 pods) │
│ - Internal LB: 10.128.0.10:4500 │
└──────────────────────────────────────────┘
Current Deployment Status (Oct 19, 2025):
- ✅ coditect-api-v5 (11d old) - V5 Rust backend with JWT auth - WORKING
- ✅ coditect-combined (5d old) - Frontend + theia with MCP SDK fix - WORKING
- 3/3 pods Running, health checks passing
- Bundled backend resolves ESM/CJS incompatibility
- Health check endpoint: `/` with 60s timeout
- ❌ coditect-api-v2 (19d old) - LEGACY V2 API, TO BE DELETED in Sprint 3
- ❌ Cloud Run deployment - MISTAKEN deployment, TO BE DELETED
Sprint 3 Goals:
- Integrate frontend with V5 Rust API (replace V2 API calls)
- Enable LM Studio multi-llm features (16+ models)
- Delete legacy V2 API and Cloud Run deployment
- End-to-end testing with real user workflows
🗄️ Storage Architecture - Hybrid Approach
Problem: Pod-local storage tied workspace files to specific pods → data loss on scale-down
Solution: Hybrid Storage = Shared Base (Docker image layers) + User-Specific PVCs
Architecture Overview
┌──────────────────────────────────────────────────────┐
│ Docker Image Layers (Shared Base - Read-Only) │
│ ├─ System dependencies (git, npm, python, etc.) │
│ ├─ .coditect configs (5 agents, 2 skills, 15 tools) │
│ ├─ Multi-llm CLIs (7 providers: Claude, OpenAI...) │
│ └─ theia extensions (38 VSIX plugins) │
└──────────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────────┐
│ StatefulSet with Pre-Attached PVC Slots │
│ │
│ Pod-0 (10 user slots) │
│ ├─ /workspace/slot-0 → workspace-user-001 (10 GB) │
│ ├─ /workspace/slot-1 → workspace-user-002 (10 GB) │
│ ├─ ... │
│ └─ /workspace/slot-9 → workspace-user-010 (10 GB) │
│ │
│ Pod-1 (10 user slots) │
│ ├─ /workspace/slot-0 → workspace-user-011 (10 GB) │
│ ├─ ... │
│ └─ /workspace/slot-9 → workspace-user-020 (10 GB) │
│ │
│ Pod-2 (10 user slots) │
│ ├─ ... │
└──────────────────────────────────────────────────────┘
Key Components
1. Shared Base (Docker Image Layers)
- Size: ~5 GB compressed (baked into image)
- Content: System tools, .coditect configs, theia extensions, multi-llm CLIs
- Storage: Pulled once per node, shared across all pods
- Cost: $0 (no PVC charges, included in image)
2. User-Specific PVCs
- Size: 10 GB per user (GCE Persistent Disk SSD)
- Performance: <1ms latency, 15K-30K IOPS
- Access Mode: ReadWriteOnce (single pod mounting)
- Lifecycle: Independent of pods - persists across scale-down/up
- Cost: $0.20/GB/month × 10 GB = $2.00/user/month
3. Pre-Attached PVC Slots
- Slots per pod: 10 (configurable)
- Total capacity: 3 pods × 10 slots = 30 concurrent users (minimum)
- Scaling: HPA adds pods as needed (max 30 pods = 300 users)
- Assignment: Backend routes users to pods with free slots via Kubernetes API
User Experience
Login Flow:
- User logs in → Backend checks for existing assignment
- If exists: Route to assigned pod + slot
- If new: Find pod with free slot → Create assignment → Attach PVC
- User workspace mounted at `/workspace` (transparent to user)
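The slot-assignment step of the login flow above can be sketched in TypeScript. This is a hypothetical illustration: the real backend would consult the Kubernetes API and a persistent assignment store, whereas here assignments live in an in-memory Map and pod names are made up:

```typescript
// Hypothetical sketch of login-flow slot assignment (not the production code).
type Assignment = { pod: string; slot: number };

const SLOTS_PER_POD = 10;

function assignSlot(
  existing: Map<string, Assignment>, // userId -> current assignment
  pods: string[],                    // healthy pods with pre-attached PVC slots
  userId: string,
): Assignment | null {
  // Returning user: route to the already-assigned pod + slot.
  const current = existing.get(userId);
  if (current) return current;

  // New user: first pod with a free slot wins.
  for (const pod of pods) {
    const used = new Set(
      [...existing.values()].filter((a) => a.pod === pod).map((a) => a.slot),
    );
    for (let slot = 0; slot < SLOTS_PER_POD; slot++) {
      if (!used.has(slot)) {
        const assignment: Assignment = { pod, slot };
        existing.set(userId, assignment);
        return assignment;
      }
    }
  }
  return null; // every slot full -> signal HPA / manual scale-up
}
```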
Persistence Guarantee:
- User creates file in workspace → Written to user's PVC
- Pod scales down → PVC remains (not deleted)
- Pod scales up → Backend finds user's PVC → Mounts to new pod + slot
- User logs in again → Same files, same state (100% persistence)
Cost Comparison
| Option | Storage Type | Monthly Cost (20 users) | Latency | POSIX |
|---|---|---|---|---|
| Hybrid ✅ | Image + PVCs | $141 ($7.05/user) | <1ms | ✅ |
| User PVCs | 10 GB PVCs | $500 ($25/user) | <1ms | ✅ |
| Google Cloud Storage | GCS buckets | $600 ($30/user) | 50-100ms | ❌ |
| NFS | Filestore | $4,100 ($205/user) | 1-5ms | ✅ |
Cost Breakdown (Hybrid):
- Shared base: $0 (Docker image layers, no PVC)
- User PVCs: $2.00/user/month (10 GB SSD)
- Pod compute: $4.80/user/month (4 vCPU, 8 GB RAM, 2 users/pod)
- Backup snapshots: $0.26/user/month (10 GB × $0.026/GB/month)
- Total: $7.05/user/month
Savings: 96% cheaper than NFS, 72% cheaper than plain user PVCs
Implementation Phases
See: docs/07-adr/adr-028-part-2-hybrid-storage-decision-implementation.md for complete implementation plan
Timeline: 30-38 hours (4-5 days)
| Phase | Task | Duration | Status |
|---|---|---|---|
| 1 | Modify Dockerfile (image layers) | 2h | ⏳ Pending |
| 2 | User PVC provisioning script | 3h | ⏳ Pending |
| 3 | StatefulSet with pre-attached slots | 6-8h | ⏳ Pending |
| 4 | Session-based routing logic | 12-16h | ⏳ Pending |
| 5 | Testing & validation | 5-6h | ⏳ Pending |
| 6 | Backup strategy (VolumeSnapshots) | 2-3h | ⏳ Pending |
Backup & Disaster Recovery
Daily Backups (VolumeSnapshots):
- Schedule: 2 AM daily (CronJob)
- Retention: 7 days
- Scope: All user PVCs
- Recovery: create a new PVC whose `dataSource` references the VolumeSnapshot (note: `kubectl create` has no `pvc --from-snapshot` subcommand; restore goes through a PVC manifest)
Cost: $0.026/GB/month × 10 GB × 7 snapshots = $1.82/user total ($0.26/user/month amortized)
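A restore from one of these daily snapshots is performed by creating a new PVC whose `dataSource` points at the VolumeSnapshot. The names below are illustrative placeholders, not real resources:

```yaml
# Sketch: restore a user workspace PVC from a VolumeSnapshot (names illustrative)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: workspace-user-001-restored
  namespace: coditect-app
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
  dataSource:
    name: workspace-user-001-snap-2025-10-28   # the VolumeSnapshot to restore from
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
```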
Related Documentation
- Problem Analysis: `docs/07-adr/ADR-028-PART-1-HYBRID-STORAGE-PROBLEM-analysis.md`
- Implementation Plan: `docs/07-adr/adr-028-part-2-hybrid-storage-decision-implementation.md`
- Initial Analysis: `docs/11-analysis/2025-10-28-persistent-storage-dynamic-pods.md`
- HPA Configuration: `k8s/coditect-combined-hpa.yaml`
Kubernetes Deployment YAMLs:
- Current Deployment (✅ WORKING): `k8s/theia-statefulset.yaml` - StatefulSet with 50Gi workspace per pod
- Hybrid Deployment (🚧 TESTING): `k8s/theia-statefulset-hybrid.yaml` - Reduced to 10Gi workspace (Phase 1)
🔑 Critical Architecture Insights
Non-obvious patterns that drive the entire codebase:
1. Eclipse theia ≠ VS Code
- theia is a framework for building VS Code-like IDEs
- Uses dependency injection (InversifyJS)
- Extensions register via `ContainerModule.bind()`
- Look for: `*-module.ts` and `*-contribution.ts`, NOT main.ts
2. MCP Tools vs A2A Messages
- MCP: AI agent → external tool (llm, file ops, database)
- A2A: AI agent → AI agent (coordination, delegation)
- Example: Code gen agent uses MCP to call LM Studio, then A2A to request review
3. Session Isolation Architecture
- Every action happens in session context (sessionId)
- Sessions ≠ browser tabs (logical workspaces)
- Multiple sessions can exist in same tab
- FDB pattern: All keys prefixed `tenant_id/session_id/...`
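The key-prefix discipline above can be sketched as a small helper. The string-join form is an assumption for readability; real FoundationDB code would typically use tuple encoding, but the prefix rule is the same:

```typescript
// Sketch of the tenant/session key convention (illustrative helpers).
function fdbKey(tenantId: string, sessionId: string, ...parts: string[]): string {
  return [tenantId, sessionId, ...parts].join("/");
}

// Range scans bounded by this prefix can never cross tenant or session lines.
function sessionPrefix(tenantId: string, sessionId: string): string {
  return `${tenantId}/${sessionId}/`;
}
```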
4. OPFS vs FoundationDB Split
- FoundationDB: Source of truth (sessions, files, agent state)
- OPFS: Browser cache for offline/performance
- Sync pattern: Write to FDB, cache to OPFS, read OPFS with FDB fallback
- Critical: NEVER use OPFS as primary storage
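The sync pattern above (write to FDB, cache to OPFS, read OPFS with FDB fallback) can be sketched like this. The stores are stubbed with Maps; the real `fdbService`/`opfsService` named in this document are async services, so treat this as illustrative shape only:

```typescript
// Sketch of the FDB-primary / OPFS-cache pattern (stores stubbed with Maps).
interface KVStore {
  get(key: string): string | undefined;
  set(key: string, value: string): void;
}

class MapStore implements KVStore {
  private data = new Map<string, string>();
  get(key: string) { return this.data.get(key); }
  set(key: string, value: string) { this.data.set(key, value); }
}

// Writes go to FDB first (source of truth), then the OPFS cache is updated.
function write(fdb: KVStore, opfs: KVStore, key: string, value: string): void {
  fdb.set(key, value);
  opfs.set(key, value);
}

// Reads prefer the OPFS cache; on a miss, fall back to FDB and repopulate.
function read(fdb: KVStore, opfs: KVStore, key: string): string | undefined {
  const cached = opfs.get(key);
  if (cached !== undefined) return cached;
  const value = fdb.get(key);
  if (value !== undefined) opfs.set(key, value);
  return value;
}
```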
5. Agent Execution Model
- Agents are stateless - state lives in sessionId context
- Agent "memory" = FDB queries with session filters
- Sub-agents are specialized skill modules, not child processes
6. V4 Reference Usage
- V4 = custom web app with K8s pods
- T2 = theia extensions (different architecture)
- V4 useful for: FDB patterns, agent logic, MCP/A2A examples
- V4 NOT useful for: UI, file ops, IDE features (theia has these)
Technology Stack
Core
| Component | Technology | Version | Purpose |
|---|---|---|---|
| IDE Framework | Eclipse theia | 1.45+ | Foundation |
| Frontend | React + TypeScript | 18 + 5.3 | UI layer |
| Editor | Monaco Editor | 0.45 | Code editing |
| Terminal | xterm.js | 5.3 | Terminal |
| UI | Chakra UI | 2.8 | Components |
| State | Zustand | 4.4 | State mgmt |
| DB | FoundationDB | 7.1+ | Persistence |
| Browser Storage | OPFS | - | Cache |
Protocols
| Protocol | Purpose |
|---|---|
| MCP | Model Context Protocol (Anthropic) - Tools/Resources |
| A2A | Agent2Agent (Google/Linux) - Agent collaboration |
| LM Studio API | OpenAI-compatible local llm |
| Claude Code API | Anthropic AI assistant |
📁 Project Structure
See: docs/project-structure.md for complete directory tree.
Quick overview:
/workspace/PROJECTS/t2/
├── .claude/ # Claude Code config (6 agents, 24 commands, 2 submodules)
├── docs/ # All documentation (see DOCUMENTATION-index.md)
├── src/ # V5 Frontend (React + theia)
├── backend/ # Rust backend (Actix-web, deployed to GCP)
├── .theia/ # theia config (16+ models, 4 MCP servers, 3 agents)
└── archive/ # V4 reference materials (submodules)
⚠️ Important:
- `src/` = V5 Frontend (ACTIVE)
- `backend/` = V5 Backend (ACTIVE)
- `archive/v4-reference/` = V4 Reference (NOT ACTIVE - reference only)
Development Workflows
See: docs/development-guide.md for detailed code examples and workflows.
Common Tasks
- Add New Agent - Extend `Agent` base class, use MCP for tools, A2A for collaboration
- Add theia Extension - Use `ContainerModule.bind()` with dependency injection
- Add MCP Tool - Register via `server.setRequestHandler()`
When Working on Code
If asked to build IDE features:
- STOP - Check if theia already has it
- ✅ File explorer, editor tabs, terminal, settings → theia has it
- Build as theia extension, don't reinvent
If asked about persistence:
- Use `fdbService` (primary) or `opfsService` (cache)
- Don't create new persistence
🤖 Using Specialized Agents
The .claude/agents/ directory contains 12 specialized sub-agents that can be invoked for specific tasks. These agents work alongside you to handle focused responsibilities.
Agent Categories:
- 8 Original agents (codebase analysis, research, organization)
- 4 TDD-focused agents ✨ NEW (2025-10-20) (validation, quality gates, research)
When to Use Agents
Invoke agents proactively when tasks match their specializations:
# Example: Analyzing codebase implementation
"Use codebase-analyzer to understand how authentication works in auth.rs"
# Example: Finding files and locations
"Use codebase-locator to find all session management files"
# Example: Organizing project structure
"Use project-organizer to clean up the root directory"
# Example: TDD validation
"Use tdd-validator to run tests before marking task complete"
# Example: Quality gate validation
"Use quality-gate to validate security, performance, and accessibility"
Available Agents
| Agent | Purpose | When to Invoke |
|---|---|---|
| orchestrator | Multi-agent coordination | Complex workflows (full-stack features, security audits) |
| codebase-analyzer | Analyze implementation details | Understanding HOW code works |
| codebase-locator | Find code locations | Searching for specific components |
| codebase-pattern-finder | Identify patterns | Finding similar implementations |
| thoughts-analyzer | Analyze decision-making | Understanding thought processes |
| thoughts-locator | Find documentation | Searching thoughts/ directory |
| web-search-researcher | Web research | Gathering external information |
| project-organizer | Maintain clean structure | Organizing files/directories |
| tdd-validator ✨ | TDD validation | Before marking tasks complete, enforcing RED-GREEN-REFACTOR |
| quality-gate ✨ | Comprehensive quality check | Pre-deployment validation (security, performance, accessibility) |
| completion-gate ✨ | Binary COMPLETE/INCOMPLETE | Evidence-based task completion validation |
| research-agent ✨ | Technical research | Implementation decisions, library comparisons, best practices |
Project Organizer Agent (NEW)
Primary responsibility: Maintain production-ready directory structure.
Use this agent when:
- Root directory is cluttered with research docs, session exports, or analysis files
- Need to reorganize files into proper subdirectories
- Want to audit project structure for production readiness
- Cleaning up after long research/implementation sessions
Example usage:
# Clean up cluttered root directory
"Use project-organizer to analyze the root directory and move files to proper locations"
# Audit project structure
"Use project-organizer to check if our directory structure follows production standards"
# Organize after session
"Use project-organizer to move session exports and research docs to appropriate folders"
What it does:
- ✅ Analyzes directory structure
- ✅ Categorizes misplaced files (session exports, research docs, status reports)
- ✅ Creates organization plan with target locations
- ✅ Executes moves using `git mv` (preserves history)
- ✅ Commits changes with descriptive messages
Agent capabilities:
- Knows production-ready directory structure for T2 project
- Follows organizational rules (see `.claude/agents/project-organizer.md`)
- Uses `git mv` to preserve file history
- Creates target directories if needed
- Groups related moves in atomic commits
Organizational rules (enforced by agent):
✅ Root should contain: package.json, tsconfig.json, vite.config.ts,
README.md, CLAUDE.md, docker files, k8s manifests
❌ Root should NOT contain: Research docs, session exports, status reports,
analysis docs, implementation plans, checkpoint files
→ Target locations:
- Session exports → docs/09-sessions/
- Research documents → docs/11-analysis/
- Status reports → docs/10-execution-plans/
- Implementation plans → docs/10-execution-plans/
- Development guides → docs/01-getting-started/
- Reference materials → docs/reference/
- Sprint checkpoints → thoughts/shared/research/
Workflow with project-organizer:
- Agent analyzes root directory
- Agent creates organization plan (table of moves)
- Agent presents plan for approval
- Upon approval, agent executes moves with `git mv`
- Agent commits changes and pushes to repository
See: .claude/agents/project-organizer.md for complete rules and categorization logic.
How to Invoke Agents
Direct invocation:
"Use [agent-name] to [specific task]"
Parallel invocation (multiple agents):
"Use codebase-locator and codebase-analyzer in parallel to find and understand the authentication system"
Agent coordination:
"Use project-organizer to clean root, then use codebase-analyzer to verify no broken imports"
Agent Best Practices
- Be specific - Give agents clear, focused tasks
- Use right agent - Match task to agent specialization
- Review results - Agents return reports for you to act on
- Combine agents - Use multiple agents for complex workflows
See: .claude/CLAUDE.md for complete agent documentation and autonomous development mode.
Architecture Decision Records (ADRs)
Always read relevant ADRs before making changes.
Most critical:
| ADR | Decision | When to Read |
|---|---|---|
| ADR-014 | Eclipse theia | READ THIS FIRST |
| ADR-010 | MCP Protocol | Tool/resource work |
| ADR-013 | Agentic Architecture | Agent system work |
| ADR-004 | FoundationDB | Persistence changes |
| ADR-007 | Multi-Session | Session features |
Full list: See docs/07-adr/ (24 ADRs total)
⚠️ Common Pitfalls
Top 5 mistakes that break architecture:
- Building features theia has → Search theia docs first
- Using global state → Use dependency injection (`@inject`)
- Calling llms directly → Use MCP (`mcpClient.callTool`)
- Copying V4 UI → Use theia widgets, not V4 components
- OPFS as primary storage → Write to FDB first, cache to OPFS
Full list: See docs/development-guide.md#troubleshooting (10 total)
Environment Setup
# Configure Claude Code output token limit
export CLAUDE_CODE_MAX_OUTPUT_TOKENS=8192
Docker (Recommended)
docker-compose up -d # Start all services
# Access: http://localhost:3000
Local Development
npm install
npm run dev
# Access: http://localhost:5173
LM Studio Configuration
- Host: `host.docker.internal` (Docker) or `localhost` (local)
- Port: `1234`
- API: OpenAI-compatible at `/v1`
- Models: 16+ available (qwen, llama, deepseek, etc.)
See: docs/01-getting-started/development-modes.md for deployment modes (Container-Only, Volume Mount, Remote SSH, Native Desktop)
File Monitor (Audit Logging)
Rust-based file system monitoring for compliance.
# Start daemon (use --poll on WSL2!)
scripts/monitor/start-file-monitor.sh
# View events
tail -f .coditect/logs/events.log
See: docs/file-monitor/dual-log-configuration.md
Git Workflow
Strategy:
- Create meaningful commits
- Use conventional commit format
- Reference issues/ADRs
- Keep commits atomic
Repository: https://github.com/coditect-ai/LM-Studio-multiple-llm-IDE.git
See: docs/git-workflow.md for detailed git configuration, conventional commits, hooks, and troubleshooting.
Important Constraints
What NOT to Do
- ❌ Don't build IDE features from scratch → Use theia
- ❌ Don't create custom persistence → Use fdbService/opfsService
- ❌ Don't make agents without sessions → Always pass sessionId
- ❌ Don't bypass MCP → Use mcpClient for llm access
- ❌ Don't commit secrets → Use .env variables
What TO Do
- ✅ Build theia extensions for new IDE features
- ✅ Use existing services (llmService, mcpClient, a2aClient)
- ✅ Follow agent patterns (extend Agent base class)
- ✅ Make everything session-aware (pass sessionId)
- ✅ Use MCP for tools, A2A for agents
- ✅ Document major decisions (create ADRs)
Success Criteria
When implementing features:
- ✅ Works in theia - Extensions integrate properly
- ✅ Session-aware - All state tied to sessionId
- ✅ Uses MCP/A2A - Protocols used correctly
- ✅ Privacy-first - No cloud calls (except optional Claude)
- ✅ Well-documented - ADRs for major decisions
- ✅ Type-safe - Full TypeScript coverage
Remember
- We're building on theia, not from scratch
- Use MCP for tools, A2A for agents
- Everything is session-aware
- Privacy-first: local llms only
- Document major decisions as ADRs
- Eclipse theia = EPL 2.0 = Free commercial use
When in doubt:
- Read ADR-014 (theia decision)
- Check if theia already has the feature
- Build as extension, not standalone
- Follow existing patterns in codebase
📖 Quick Links
- Reading Order: `docs/reading-order.md`
- Project Structure: `docs/project-structure.md`
- Development Guide: `docs/development-guide.md`
- Git Workflow: `docs/git-workflow.md`
- Documentation Index: `docs/DOCUMENTATION-index.md`
- Architecture: `docs/DEFINITIVE-V5-architecture.md`
- ADRs: `docs/07-adr/`