CLAUDE.md
This file provides guidance to Claude Code and other AI assistants when working with this codebase.
🚀 Starting a New Session?
LATEST STATUS: ✅ Build #18 OPERATIONAL - Pods Healthy, Ready for MVP Scaling (Build #18 Success Report)
- Build #18 (2025-10-28): ✅ DEPLOYED AND OPERATIONAL
- ✅ CrashLoopBackOff Fixed: Permission denied errors resolved (non-root execution working)
- ✅ All Pods Healthy: coditect-combined-0 and coditect-combined-1 serving traffic
- ✅ Services Running: theia IDE, CODI2 monitoring, File monitor, NGINX all operational
- ✅ Image: `8449bd02-7a28-4de2-8e26-7618396b3c2f`
- ✅ Security: Non-root user (coditect, UID 1001, GID 1001)
- 📄 Fix Applied: Changed log directories from `/var/log/*` to `/app/logs/*` (user-writable)
- ⚠️ MVP SCALING REQUIREMENT (MVP Scaling Analysis):
- Current: 3 pods = 3-6 user capacity
- Required: 10-15 pods for 20 user pilot/beta testing
- Cost Impact: $150/month → $500-750/month
- Action: Scale to 10 pods + deploy HPA before MVP launch
- Command: `kubectl scale statefulset/coditect-combined --replicas=10 -n coditect-app`
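The scale-plus-HPA action above could be expressed roughly as follows. This is a sketch, not the checked-in `k8s/coditect-combined-hpa.yaml`; the resource name and namespace come from this document, while the replica bounds and CPU target are illustrative assumptions:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: coditect-combined-hpa        # hypothetical name
  namespace: coditect-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: coditect-combined
  minReplicas: 10                    # MVP floor from the scaling analysis above
  maxReplicas: 15                    # pilot ceiling for ~20 users
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70     # illustrative threshold
```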
- 📊 Comprehensive Status Report (2025-10-28T00:50:00Z Status Report):
- Complete Build #18 analysis and deployment results
- MVP capacity planning and cost analysis
- Risk assessment: Capacity shortage blocking MVP without scaling
- Next actions: Scale immediately, deploy HPA, test with beta users
- Previous Achievements:
- Build #18 Features: Multi-llm CLI suite (7 providers), zsh + oh-my-zsh, Fixed extensions (38 VSIX), Complete branding
- Multi-llm: Claude CLI, OpenAI CLI, Aider, Shell-GPT, Grok CLI, Anthropic SDK, Gemini CLI
- StatefulSet Migration: Complete with persistent storage (100 GB workspace + 10 GB config per pod)
- Starter Configuration: 10-20 user capacity tier ready (4 vCPU, 8 GB RAM per pod)
- Socket.IO Investigation (Oct 20, 2025):
- ✅ Root Cause #1 - CDN Caching: GCP CDN was caching Socket.IO requests with stale session IDs → FIXED
- Solution Applied: Disabled CDN in BackendConfig (`k8s/backend-config-no-cdn.yaml`)
- Validation: CDN headers removed from responses
- ⏳ Root Cause #2 - Session Affinity Missing: GCP backend service shows `SESSION_AFFINITY: NONE` → IN PROGRESS
- Problem: Service missing BackendConfig annotation (NEG requirement)
- Solution Applied: Annotated Service with BackendConfig reference
- Status: Propagating to GCP load balancer (2-5 minutes)
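Taken together, the two root-cause fixes above amount to a BackendConfig roughly like this. It is a sketch under stated assumptions, not the checked-in `k8s/backend-config-no-cdn.yaml`; the resource name and cookie TTL are illustrative:

```yaml
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: coditect-backend-config      # hypothetical name
  namespace: coditect-app
spec:
  cdn:
    enabled: false                   # Root Cause #1: stop CDN caching Socket.IO responses
  sessionAffinity:
    affinityType: "GENERATED_COOKIE" # Root Cause #2: sticky sessions for Socket.IO
    affinityCookieTtlSec: 3600       # illustrative TTL
```

The Service then points at it via the `cloud.google.com/backend-config` annotation, which is what "Annotated Service with BackendConfig reference" above refers to.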
- 📋 Additional Fixes Recommended (from reference docs):
- P0: WebSocket annotation to Ingress (85% success probability)
- P0: Health check endpoint creation (70% success probability)
- P1: Increased backend timeout and connection draining optimization
- P0: Add PVCs for workspace persistence ✨ NEW (prevents data loss)
- 🗄️ Hybrid Storage Architecture (ADR-028, Part 2) (2025-10-28):
- ✅ Problem Identified: Pod-local storage causes data loss on scale-down events
- ✅ Solution Designed: Hybrid Storage = Shared Base (Docker layers) + User PVCs (10 GB each)
- ✅ Cost Savings: $7.05/user/month (96% cheaper than NFS $205/month)
- ✅ Performance: <1ms latency with GCE Persistent Disk SSD (15K-30K IOPS)
- ✅ Scalability: 10 user slots per pod, dynamic assignment via backend routing
- 📋 Implementation Timeline: 30-38 hours (6 phases: Image layers, PVC provisioning, StatefulSet slots, routing, testing, backups)
- 📊 Daily Backups: VolumeSnapshots with 7-day retention for disaster recovery
- 🔧 Status: Design complete, ready for implementation (Phase 1: Modify Dockerfile)
- Comprehensive Documentation:
- `docs/11-analysis/socket-io-theia-persistence-insights.md` - START HERE - Critical fixes for data loss + Socket.IO (P0 priority)
- `docs/11-analysis/theia-gke-scaling-research-summary.md` - theia pod persistence, scaling 1-50 → 10k users
- `docs/11-analysis/theia-instance-running-on-gcp-gke-kubernetes.md` - Full 60KB Perplexity research (PVCs, session timeout, StatefulSets)
- `socket.io-issue/analysis-troubleshooting-guide.md` - Complete Socket.IO investigation findings, diagnostic commands, fix procedures
- `socket.io-issue/additional-research-pathways.md` - Next steps and long-term improvements
- `socket.io-issue/README.md` - Investigation package index (~150 pages of analysis)
- `socket.io-issue/executive-summary.md` - High-level overview with 5 prioritized fixes
- `socket.io-issue/diagnostic-decision-tree.md` - Structured troubleshooting logic (15 questions)
- `socket.io-issue/fix-implementation-guide.md` - Detailed fix procedures with validation tests
- `socket.io-issue/socketio-diagnostics.sh` - Automated diagnostic script (400 lines)
- Next Steps (Priority Order):
- 🔴 P0 CRITICAL: Add PVCs to deployment (stop data loss immediately) - 30 min
- 🔴 P0 CRITICAL: Disable theia Cloud session timeout - 15 min
- Validate session affinity propagation (check GCP backend service)
- Apply WebSocket annotation to Ingress - 15 min
- Create `/health` and `/ready` endpoints in nginx - 30 min
- Run automated diagnostic script for comprehensive validation
- Test at https://coditect.ai/theia after all fixes applied
- Verify persistence: Create file in pod → delete pod → check file still exists
- Architecture: See "Production Architecture (GKE)" section below
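The `/health` and `/ready` endpoints called for in the steps above could be served as minimal static NGINX responses. This fragment is illustrative only; the actual locations, bodies, and health-check wiring may differ:

```nginx
# Illustrative nginx fragment: static endpoints for load-balancer health checks
location /health {
    access_log off;
    default_type text/plain;
    return 200 "healthy\n";
}
location /ready {
    access_log off;
    default_type text/plain;
    return 200 "ready\n";
}
```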
IMPORTANT: See docs/reading-order.md for the complete list of documents to read before beginning work.
Critical first reads:
- `docs/10-execution-plans/2025-10-27-build-18-final-configuration.md` - START HERE - Build #18 Final Configuration (non-root, multi-llm, security hardening)
- `docs/10-execution-plans/2025-10-20-build-23-theia-localhost-fix-checkpoint.md` - Build #23 checkpoint (localhost:3000 fix, 5-build saga)
- `thoughts/shared/research/2025-10-19t21-26-15z-sprint-2-3-checkpoint.md` - Sprint 2-3 checkpoint (validation + .coditect migration)
- `docs/10-execution-plans/2025-10-19-sprint-2-validation-and-sprint-3-plan.md` - Sprint 2 validation tests + Sprint 3 implementation plan
- `docs/07-adr/adr-027-coditect-skills-container-integration.md` - .claude → .coditect migration architectural decision (3-phase plan)
- `docs/11-analysis/multi-cli-configuration-discovery-patterns.md` - Multi-CLI compatibility research (9 tools analyzed)
- `phased-deployment-checklist.md` - Current sprint status with checkboxes
- `docs/testing/testing-strategy.md` - Complete testing methodology and framework
- `.claude/CLAUDE.md` - Autonomous Development Mode & Context Engineering
- `docs/DEFINITIVE-V5-architecture.md` - Complete V5 system design
- `docs/11-analysis/monitor-codi-container-provisioning-strategy.md` - Container provisioning strategy (HYBRID approach)
Build #17 Session Exports (2025-10-27):
- `docs/09-sessions/2025-10-27-EXPORT-BUILD17-SESSION1.txt` - Session 1: Build #17 deployment readiness (Docker SUCCESS, kubectl fix, all fixes applied)
- `docs/09-sessions/2025-10-27-EXPORT-BUILD17-SESSION2.txt` - Session 2: Architecture evolution documentation (ADR-028 theia IDE, ADR-029 StatefulSet migration)
🛠️ Quick Reference Commands
# Development
npm run dev # Start dev server
npm run build # Production build
npm run type-check # TypeScript check
# Backend (Rust)
cd backend
cargo build --release # Build backend
cargo test # Run unit tests
# GKE Deployment (PRODUCTION)
kubectl get pods -n coditect-app # View all pods
kubectl get services -n coditect-app # View services
kubectl port-forward -n coditect-app service/coditect-api-v5-service 8080:80 # Test V5 API locally
kubectl logs -f deployment/coditect-api-v5 -n coditect-app # View V5 logs
# Testing & Quality Assurance
./scripts/test-runner.sh # Run comprehensive test suite
npm run test # Frontend unit tests
cargo test # Backend unit tests (from backend/)
# File Monitor (Rust-based audit logging)
scripts/monitor/start-file-monitor.sh # Start daemon (use --poll on WSL2!)
tail -f .coditect/logs/events.log # View JSON events
# Combined Build & Deploy (Frontend + theia)
npm run build # Build frontend first (creates dist/)
gcloud builds submit --config cloudbuild-combined.yaml . # Deploy to GKE
⚠️ Note: All production services run on GKE, not Cloud Run
⚡ Skills - Check FIRST Before Reinventing
CRITICAL: Before implementing ANY workflow, check if a skill already exists!
Quick Skill Lookup
Step 1: Check Registry
# View all available skills (14 total)
cat .claude/skills/REGISTRY.json | jq '.skills[] | {name, description, tags}'
Step 2: Use Skill If Match Found
# Example: Deployment workflow
cd .claude/skills/build-deploy-workflow
./core/deploy.sh --build-num=20 --changes="Feature X"
# Example: Git commit
cd .claude/skills/git-workflow-automation
./core/git-helper.sh --commit --message="Fix bug" --type=fix
# Example: Code editor (autonomous modification)
"Use code-editor skill to implement user profile editing with backend + frontend"
Available Production Skills (5 High-Value)
| Skill | Use When | Time Saved / Token Efficiency | Integration |
|---|---|---|---|
| code-editor ✨ | Multi-file code modifications (3+ files) | 30-40% token reduction | Orchestrator Phase 3 |
| build-deploy-workflow | Building & deploying to GKE | 40 min (45→5) | Deployment automation |
| gcp-resource-cleanup | Cleaning legacy GCP resources | 28 min (30→2) | Cost optimization |
| git-workflow-automation | Git operations, commits, PRs | 8 min (10→2) | Git workflows |
| cross-file-documentation-update | Syncing 4 doc files | 13 min (15→2) | Documentation sync |
Additional Technical Skills (9 Total)
- deployment-archeology - Find previous successful deployments
- foundationdb-queries - FDB query patterns, tenant isolation
- rust-backend-patterns - Actix-web patterns for T2
- search-strategies - Grep/Glob optimization
- framework-patterns - Event-driven, FSM patterns
- evaluation-framework - llm-as-judge patterns
- production-patterns - Circuit breakers, error handling
- communication-protocols - Multi-agent handoff
- google-cloud-build - Cloud Build optimization
- internal-comms - Team communication
- multi-agent-workflow - Token management, orchestration
Token Efficiency Strategy
Why Skills First:
- ❌ Reinventing solution: 5,000-10,000 tokens
- ✅ Using existing skill: 1,500-2,500 tokens (70-80% reduction)
Pattern:
- Check `.claude/skills/REGISTRY.json` for matching skill
- Load full SKILL.md only if match found (progressive disclosure)
- Execute skill's proven workflow
- Fall back to agent/custom solution only if no skill exists
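The skills-first pattern above can be sketched in TypeScript. The `SkillEntry` shape mirrors the `name`/`description`/`tags` fields shown in the jq query; `findSkill` and the sample entries are hypothetical, not the actual REGISTRY.json contents:

```typescript
// Hypothetical sketch of the skills-first lookup pattern.
interface SkillEntry {
  name: string;
  description: string;
  tags: string[];
}

// Match task keywords against skill tags; load the full SKILL.md only when a
// match exists (progressive disclosure keeps token usage low).
function findSkill(registry: SkillEntry[], keywords: string[]): SkillEntry | undefined {
  const wanted = new Set(keywords.map((k) => k.toLowerCase()));
  return registry.find((s) => s.tags.some((t) => wanted.has(t.toLowerCase())));
}

// Illustrative entries only (not the real registry contents).
const registry: SkillEntry[] = [
  { name: "build-deploy-workflow", description: "Build & deploy to GKE", tags: ["deploy", "gke"] },
  { name: "git-workflow-automation", description: "Git operations", tags: ["git", "commit"] },
];

const match = findSkill(registry, ["deploy"]); // matches build-deploy-workflow
```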
Registry Update (after adding new skills):
python3 .claude/scripts/build-skill-registry.py
See: .claude/skills/README.md for complete skill documentation
🔍 Deployment Archeology - Finding Previous Successful Builds
When deployments fail, use this process to find and restore working configurations:
Quick Process
# 1. Get deployment creation date
kubectl get deployment <NAME> -n coditect-app -o jsonpath='{.metadata.creationTimestamp}'
# 2. Find successful builds on that date
gcloud builds list --filter="createTime>='YYYY-MM-DDT00:00:00Z'" --limit=20
# 3. Analyze successful build config
gcloud builds describe <BUILD_ID> --format="yaml(steps,options)"
# 4. Check git history for archived files
git log --all --full-history -- <FILENAME>
# 5. Restore from archive if needed
cp docs/99-archive/deployment-obsolete/<FILE> ./<FILE>
Example: Combined Service Recovery (Oct 18, 2025)
Problem: New dockerfile.combined failing to build
Solution Found:
- Deployment created: 2025-10-13T09:58:29Z
- Successful build: 6e95a4d9 at 09:50:07Z (8 min before)
- Used file: `dockerfile.local-test` (archived, not dockerfile.combined!)
- Machine: E2_HIGHCPU_32 (32 CPUs, NOT 8 CPUs)
- Node memory: 8GB heap (`NODE_OPTIONS=--max_old_space_size=8192`)
- Deploy method: gke-deploy (NOT kubectl)
Key Files:
- Working Dockerfile: `dockerfile.local-test` (restored from `docs/99-archive/`)
- Cloud Build config: `cloudbuild-combined.yaml` (updated with proven settings)
- Pre-requisite: Frontend must be built first (`npm run build` creates `dist/`)
See: .claude/skills/deployment-archeology/SKILL.md for complete process
🔧 API URL Configuration - Production Best Practices
Critical Issue Fixed (Oct 20, 2025): Frontend was calling http://localhost:8080 instead of /api/v5
The Problem
Deployed frontend bundle contained wrong API URL:
// Deployed bundle (WRONG)
VITE_API_URL=http://localhost:8080
Root Cause: .env file being included in Docker build despite .dockerignore, and Vite processing it at build time.
The Solution Journey (12 Attempts)
| Attempt | Strategy | Result | Why It Failed/Worked |
|---|---|---|---|
| #9 | Added ENV VITE_API_URL=/api/v5 to Dockerfile | ❌ Failed | .env processed by Vite before ENV set |
| #10 | Added RUN rm -f .env* + ENV | ❌ Failed | Docker layer cache reused old build |
| #11 | Fixed .env source file to /api/v5 | ❌ Failed | Docker cache didn't detect .env content change |
| #12 | Hardcoded /api/v5 in TypeScript source | ✅ SUCCESS | Source change invalidates cache |
Production Best Practice
File: src/services/api-client.ts
BEFORE (Environment-Dependent):
// Relies on environment variables - can break in deployment
const API_BASE_URL = import.meta.env.VITE_API_URL || '/api/v5';
AFTER (Production-Ready):
// API base URL - hardcoded for production reliability
// Relative path works in all environments (dev, staging, prod)
const API_BASE_URL = '/api/v5';
Why This Works:
- ✅ Source code change invalidates Docker layer cache
- ✅ No dependency on environment variables
- ✅ Relative path `/api/v5` works in ALL environments
- ✅ Clear and explicit - no hidden configuration
- ✅ Eliminates entire class of deployment issues
Lessons Learned
- Avoid .env files for production config - Too easy for them to be included in builds
- Hardcode relative paths in source - More reliable than environment variables
- Docker layer caching - ENV changes and file deletions don't invalidate previous layers
- Source code changes - Only way to guarantee Docker cache invalidation
Verification
# Check deployed bundle (should NOT contain VITE_API_URL)
curl -s https://coditect.ai/assets/*.js | grep -o 'VITE_API_URL[^,}]*'
# Expected: Nothing (hardcoded, no env var in bundle)
# Or check for hardcoded path
curl -s https://coditect.ai/assets/*.js | grep -o '/api/v5'
# Expected: /api/v5
Build #12 Details
- Status: SUCCESS (10m20s)
- Build ID: `13e4134c-818e-4192-9963-c7dce7a02265`
- Image: `us-central1-docker.pkg.dev/serene-voltage-464305-n2/coditect/coditect-combined:13e4134c-818e-4192-9963-c7dce7a02265`
- Change: Hardcoded `/api/v5` in `src/services/api-client.ts:6`
- Pending: Deploy and verify
Project Overview
Coditect AI IDE (T2) - Browser-based IDE built on Eclipse theia with:
- 16+ local llm models via LM Studio (no cloud)
- Multi-agent system (MCP + A2A protocols)
- Multi-session architecture for parallel work
- FoundationDB persistence + OPFS browser cache
- Privacy-first: local-only processing
Key Architectural Decision
Foundation: Eclipse theia (EPL 2.0) - NOT building from scratch
- ✅ Saves 6-12 months development time
- ✅ Free commercial use (no license fees)
- ✅ VS Code extension compatible
- ✅ Monaco editor + terminal included
Critical: We're customizing theia with extensions, not building a new IDE.
Architecture Evolution
Comprehensive documentation of design evolution:
- ADR-028: theia IDE Integration Evolution (2025-10-27)
- Why we adopted Eclipse theia - Decision rationale and comparison with custom IDE
- Technical integration - 68 theia packages, custom branding, VS Code extensions
- Benefits realized - 9-10 months time savings, $225K cost savings, 3x more features at MVP
- Current architecture - Combined deployment with frontend + theia on GKE
- Future roadmap - MCP integration, multi-llm chat panel, collaborative editing
- Lessons learned - Production patterns and best practices
- ADR-029: StatefulSet Persistent Storage Migration (2025-10-27)
- The problem - Data loss on pod restarts (critical user impact)
- The solution - Migration from Deployment → StatefulSet with persistent volumes
- Storage architecture - 100 GB workspace + 10 GB config per user (PVCs)
- Capacity planning - Starter (10-20 users), Production (50-100), Enterprise (500+)
- Cost analysis - $4-8/user/month storage costs (competitive with Gitpod, GitHub Codespaces)
- Migration process - 4-step deployment with zero downtime
- Lessons learned - Kubernetes best practices for stateful applications
🏗️ Production Architecture (GKE)
ALL services run on Google Kubernetes Engine (GKE), NOT Cloud Run:
┌─────────────────────────────────────────────────┐
│ GKE Ingress (34.8.51.57) │
│ - coditect.ai, www.coditect.ai, api.coditect.ai│
└──────────────┬──────────────────────────────────┘
│
┌───────┴────────┐
│ │
┌──────▼────────┐ ┌───▼──────────────────┐
│ coditect- │ │ coditect-api-v5 │ ← NEW Rust API
│ combined │ │ (3 pods) │
│ (3 pods) │ │ - Actix-web + FDB │
│ ├─ V5 React │ │ - JWT auth │
│ ├─ theia IDE │ │ - Port 8080 │
│ └─ NGINX │ └──────────────────────┘
└───────────────┘
│
┌──────▼──────────────────────────────────┐
│ FoundationDB │
│ - foundationdb-0/1/2 (StatefulSet) │
│ - fdb-proxy (2 pods) │
│ - Internal LB: 10.128.0.10:4500 │
└──────────────────────────────────────────┘
Current Deployment Status (Oct 19, 2025):
- ✅ coditect-api-v5 (11d old) - V5 Rust backend with JWT auth - WORKING
- ✅ coditect-combined (5d old) - Frontend + theia with MCP SDK fix - WORKING
- 3/3 pods Running, health checks passing
- Bundled backend resolves ESM/CJS incompatibility
- Health check endpoint: `/` with 60s timeout
- ❌ coditect-api-v2 (19d old) - LEGACY V2 API, TO BE DELETED in Sprint 3
- ❌ Cloud Run deployment - MISTAKEN deployment, TO BE DELETED
Sprint 3 Goals:
- Integrate frontend with V5 Rust API (replace V2 API calls)
- Enable LM Studio multi-llm features (16+ models)
- Delete legacy V2 API and Cloud Run deployment
- End-to-end testing with real user workflows
🗄️ Storage Architecture - Hybrid Approach
Problem: Pod-local storage tied workspace files to specific pods → data loss on scale-down
Solution: Hybrid Storage = Shared Base (Docker image layers) + User-Specific PVCs
Architecture Overview
┌──────────────────────────────────────────────────────┐
│ Docker Image Layers (Shared Base - Read-Only) │
│ ├─ System dependencies (git, npm, python, etc.) │
│ ├─ .coditect configs (5 agents, 2 skills, 15 tools) │
│ ├─ Multi-llm CLIs (7 providers: Claude, OpenAI...) │
│ └─ theia extensions (38 VSIX plugins) │
└──────────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────────┐
│ StatefulSet with Pre-Attached PVC Slots │
│ │
│ Pod-0 (10 user slots) │
│ ├─ /workspace/slot-0 → workspace-user-001 (10 GB) │
│ ├─ /workspace/slot-1 → workspace-user-002 (10 GB) │
│ ├─ ... │
│ └─ /workspace/slot-9 → workspace-user-010 (10 GB) │
│ │
│ Pod-1 (10 user slots) │
│ ├─ /workspace/slot-0 → workspace-user-011 (10 GB) │
│ ├─ ... │
│ └─ /workspace/slot-9 → workspace-user-020 (10 GB) │
│ │
│ Pod-2 (10 user slots) │
│ ├─ ... │
└──────────────────────────────────────────────────────┘
Key Components
1. Shared Base (Docker Image Layers)
- Size: ~5 GB compressed (baked into image)
- Content: System tools, .coditect configs, theia extensions, multi-llm CLIs
- Storage: Pulled once per node, shared across all pods
- Cost: $0 (no PVC charges, included in image)
2. User-Specific PVCs
- Size: 10 GB per user (GCE Persistent Disk SSD)
- Performance: <1ms latency, 15K-30K IOPS
- Access Mode: ReadWriteOnce (single pod mounting)
- Lifecycle: Independent of pods - persists across scale-down/up
- Cost: $0.20/GB/month × 10 GB = $2.00/user/month
3. Pre-Attached PVC Slots
- Slots per pod: 10 (configurable)
- Total capacity: 3 pods × 10 slots = 30 concurrent users (minimum)
- Scaling: HPA adds pods as needed (max 30 pods = 300 users)
- Assignment: Backend routes users to pods with free slots via Kubernetes API
User Experience
Login Flow:
- User logs in → Backend checks for existing assignment
- If exists: Route to assigned pod + slot
- If new: Find pod with free slot → Create assignment → Attach PVC
- User workspace mounted at `/workspace` (transparent to user)
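The slot-assignment step of the login flow above can be sketched in TypeScript. This is a hypothetical illustration: the real backend would consult the Kubernetes API and a persistent assignment store, whereas here assignments live in an in-memory Map and pod names are made up:

```typescript
// Hypothetical sketch of login-flow slot assignment (not the production code).
type Assignment = { pod: string; slot: number };

const SLOTS_PER_POD = 10;

function assignSlot(
  existing: Map<string, Assignment>, // userId -> current assignment
  pods: string[],                    // healthy pods with pre-attached PVC slots
  userId: string,
): Assignment | null {
  // Returning user: route to the already-assigned pod + slot.
  const current = existing.get(userId);
  if (current) return current;

  // New user: first pod with a free slot wins.
  for (const pod of pods) {
    const used = new Set(
      [...existing.values()].filter((a) => a.pod === pod).map((a) => a.slot),
    );
    for (let slot = 0; slot < SLOTS_PER_POD; slot++) {
      if (!used.has(slot)) {
        const assignment: Assignment = { pod, slot };
        existing.set(userId, assignment);
        return assignment;
      }
    }
  }
  return null; // every slot full -> signal HPA / manual scale-up
}
```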
Persistence Guarantee:
- User creates file in workspace → Written to user's PVC
- Pod scales down → PVC remains (not deleted)
- Pod scales up → Backend finds user's PVC → Mounts to new pod + slot
- User logs in again → Same files, same state (100% persistence)
Cost Comparison
| Option | Storage Type | Monthly Cost (20 users) | Latency | POSIX |
|---|---|---|---|---|
| Hybrid ✅ | Image + PVCs | $141 ($7.05/user) | <1ms | ✅ |
| User PVCs | 10 GB PVCs | $500 ($25/user) | <1ms | ✅ |
| Google Cloud Storage | GCS buckets | $600 ($30/user) | 50-100ms | ❌ |
| NFS | Filestore | $4,100 ($205/user) | 1-5ms | ✅ |
Cost Breakdown (Hybrid):
- Shared base: $0 (Docker image layers, no PVC)
- User PVCs: $2.00/user/month (10 GB SSD)
- Pod compute: $4.80/user/month (4 vCPU, 8 GB RAM, 2 users/pod)
- Backup snapshots: $0.26/user/month (10 GB × $0.026/GB/month)
- Total: $7.05/user/month
Savings: 96% cheaper than NFS, 72% cheaper than plain user PVCs
Implementation Phases
See: docs/07-adr/adr-028-part-2-hybrid-storage-decision-implementation.md for complete implementation plan
Timeline: 30-38 hours (4-5 days)
| Phase | Task | Duration | Status |
|---|---|---|---|
| 1 | Modify Dockerfile (image layers) | 2h | ⏳ Pending |
| 2 | User PVC provisioning script | 3h | ⏳ Pending |
| 3 | StatefulSet with pre-attached slots | 6-8h | ⏳ Pending |
| 4 | Session-based routing logic | 12-16h | ⏳ Pending |
| 5 | Testing & validation | 5-6h | ⏳ Pending |
| 6 | Backup strategy (VolumeSnapshots) | 2-3h | ⏳ Pending |
Backup & Disaster Recovery
Daily Backups (VolumeSnapshots):
- Schedule: 2 AM daily (CronJob)
- Retention: 7 days
- Scope: All user PVCs
- Recovery: create a new PVC whose `dataSource` references the VolumeSnapshot (note: `kubectl create` has no `pvc --from-snapshot` subcommand; restore goes through a PVC manifest)
Cost: $0.026/GB/month × 10 GB × 7 snapshots = $1.82/user total ($0.26/user/month amortized)
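A restore from one of these daily snapshots is performed by creating a new PVC whose `dataSource` points at the VolumeSnapshot. The names below are illustrative placeholders, not real resources:

```yaml
# Sketch: restore a user workspace PVC from a VolumeSnapshot (names illustrative)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: workspace-user-001-restored
  namespace: coditect-app
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
  dataSource:
    name: workspace-user-001-snap-2025-10-28   # the VolumeSnapshot to restore from
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
```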
Related Documentation
- Problem Analysis: `docs/07-adr/ADR-028-PART-1-HYBRID-STORAGE-PROBLEM-analysis.md`
- Implementation Plan: `docs/07-adr/adr-028-part-2-hybrid-storage-decision-implementation.md`
- Initial Analysis: `docs/11-analysis/2025-10-28-persistent-storage-dynamic-pods.md`
- HPA Configuration: `k8s/coditect-combined-hpa.yaml`
Kubernetes Deployment YAMLs:
- Current Deployment (✅ WORKING): `k8s/theia-statefulset.yaml` - StatefulSet with 50Gi workspace per pod
- Hybrid Deployment (🚧 TESTING): `k8s/theia-statefulset-hybrid.yaml` - Reduced to 10Gi workspace (Phase 1)
🔑 Critical Architecture Insights
Non-obvious patterns that drive the entire codebase:
1. Eclipse theia ≠ VS Code
- theia is a framework for building VS Code-like IDEs
- Uses dependency injection (InversifyJS)
- Extensions register via `ContainerModule.bind()`
- Look for: `*-module.ts` and `*-contribution.ts`, NOT main.ts
2. MCP Tools vs A2A Messages
- MCP: AI agent → external tool (llm, file ops, database)
- A2A: AI agent → AI agent (coordination, delegation)
- Example: Code gen agent uses MCP to call LM Studio, then A2A to request review
3. Session Isolation Architecture
- Every action happens in session context (sessionId)
- Sessions ≠ browser tabs (logical workspaces)
- Multiple sessions can exist in same tab
- FDB pattern: All keys prefixed `tenant_id/session_id/...`
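The key-prefix discipline above can be sketched as a small helper. The string-join form is an assumption for readability; real FoundationDB code would typically use tuple encoding, but the prefix rule is the same:

```typescript
// Sketch of the tenant/session key convention (illustrative helpers).
function fdbKey(tenantId: string, sessionId: string, ...parts: string[]): string {
  return [tenantId, sessionId, ...parts].join("/");
}

// Range scans bounded by this prefix can never cross tenant or session lines.
function sessionPrefix(tenantId: string, sessionId: string): string {
  return `${tenantId}/${sessionId}/`;
}
```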
4. OPFS vs FoundationDB Split
- FoundationDB: Source of truth (sessions, files, agent state)
- OPFS: Browser cache for offline/performance
- Sync pattern: Write to FDB, cache to OPFS, read OPFS with FDB fallback
- Critical: NEVER use OPFS as primary storage
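The sync pattern above (write to FDB, cache to OPFS, read OPFS with FDB fallback) can be sketched like this. The stores are stubbed with Maps; the real `fdbService`/`opfsService` named in this document are async services, so treat this as illustrative shape only:

```typescript
// Sketch of the FDB-primary / OPFS-cache pattern (stores stubbed with Maps).
interface KVStore {
  get(key: string): string | undefined;
  set(key: string, value: string): void;
}

class MapStore implements KVStore {
  private data = new Map<string, string>();
  get(key: string) { return this.data.get(key); }
  set(key: string, value: string) { this.data.set(key, value); }
}

// Writes go to FDB first (source of truth), then the OPFS cache is updated.
function write(fdb: KVStore, opfs: KVStore, key: string, value: string): void {
  fdb.set(key, value);
  opfs.set(key, value);
}

// Reads prefer the OPFS cache; on a miss, fall back to FDB and repopulate.
function read(fdb: KVStore, opfs: KVStore, key: string): string | undefined {
  const cached = opfs.get(key);
  if (cached !== undefined) return cached;
  const value = fdb.get(key);
  if (value !== undefined) opfs.set(key, value);
  return value;
}
```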
5. Agent Execution Model
- Agents are stateless - state lives in sessionId context
- Agent "memory" = FDB queries with session filters
- Sub-agents are specialized skill modules, not child processes
6. V4 Reference Usage
- V4 = custom web app with K8s pods
- T2 = theia extensions (different architecture)
- V4 useful for: FDB patterns, agent logic, MCP/A2A examples
- V4 NOT useful for: UI, file ops, IDE features (theia has these)
Technology Stack
Core
| Component | Technology | Version | Purpose |
|---|---|---|---|
| IDE Framework | Eclipse theia | 1.45+ | Foundation |
| Frontend | React + TypeScript | 18 + 5.3 | UI layer |
| Editor | Monaco Editor | 0.45 | Code editing |
| Terminal | xterm.js | 5.3 | Terminal |
| UI | Chakra UI | 2.8 | Components |
| State | Zustand | 4.4 | State mgmt |
| DB | FoundationDB | 7.1+ | Persistence |
| Browser Storage | OPFS | - | Cache |
Protocols
| Protocol | Purpose |
|---|---|
| MCP | Model Context Protocol (Anthropic) - Tools/Resources |
| A2A | Agent2Agent (Google/Linux) - Agent collaboration |
| LM Studio API | OpenAI-compatible local llm |
| Claude Code API | Anthropic AI assistant |
📁 Project Structure
See: docs/project-structure.md for complete directory tree.
Quick overview:
/workspace/PROJECTS/t2/
├── .claude/ # Claude Code config (6 agents, 24 commands, 2 submodules)
├── docs/ # All documentation (see DOCUMENTATION-index.md)
├── src/ # V5 Frontend (React + theia)
├── backend/ # Rust backend (Actix-web, deployed to GCP)
├── .theia/ # theia config (16+ models, 4 MCP servers, 3 agents)
└── archive/ # V4 reference materials (submodules)
⚠️ Important:
- `src/` = V5 Frontend (ACTIVE)
- `backend/` = V5 Backend (ACTIVE)
- `archive/v4-reference/` = V4 Reference (NOT ACTIVE - reference only)
Development Workflows
See: docs/development-guide.md for detailed code examples and workflows.
Common Tasks
- Add New Agent - Extend `Agent` base class, use MCP for tools, A2A for collaboration
- Add theia Extension - Use `ContainerModule.bind()` with dependency injection
- Add MCP Tool - Register via `server.setRequestHandler()`
When Working on Code
If asked to build IDE features:
- STOP - Check if theia already has it
- ✅ File explorer, editor tabs, terminal, settings → theia has it
- Build as theia extension, don't reinvent
If asked about persistence:
- Use `fdbService` (primary) or `opfsService` (cache)
- Don't create new persistence
🤖 Using Specialized Agents
The .claude/agents/ directory contains 12 specialized sub-agents that can be invoked for specific tasks. These agents work alongside you to handle focused responsibilities.
Agent Categories:
- 8 Original agents (codebase analysis, research, organization)
- 4 TDD-focused agents ✨ NEW (2025-10-20) (validation, quality gates, research)
When to Use Agents
Invoke agents proactively when tasks match their specializations:
# Example: Analyzing codebase implementation
"Use codebase-analyzer to understand how authentication works in auth.rs"
# Example: Finding files and locations
"Use codebase-locator to find all session management files"
# Example: Organizing project structure
"Use project-organizer to clean up the root directory"
# Example: TDD validation
"Use tdd-validator to run tests before marking task complete"
# Example: Quality gate validation
"Use quality-gate to validate security, performance, and accessibility"
Available Agents
| Agent | Purpose | When to Invoke |
|---|---|---|
| orchestrator | Multi-agent coordination | Complex workflows (full-stack features, security audits) |
| codebase-analyzer | Analyze implementation details | Understanding HOW code works |
| codebase-locator | Find code locations | Searching for specific components |
| codebase-pattern-finder | Identify patterns | Finding similar implementations |
| thoughts-analyzer | Analyze decision-making | Understanding thought processes |
| thoughts-locator | Find documentation | Searching thoughts/ directory |
| web-search-researcher | Web research | Gathering external information |
| project-organizer | Maintain clean structure | Organizing files/directories |
| tdd-validator ✨ | TDD validation | Before marking tasks complete, enforcing RED-GREEN-REFACTOR |
| quality-gate ✨ | Comprehensive quality check | Pre-deployment validation (security, performance, accessibility) |
| completion-gate ✨ | Binary COMPLETE/INCOMPLETE | Evidence-based task completion validation |
| research-agent ✨ | Technical research | Implementation decisions, library comparisons, best practices |
Project Organizer Agent (NEW)
Primary responsibility: Maintain production-ready directory structure.
Use this agent when:
- Root directory is cluttered with research docs, session exports, or analysis files
- Need to reorganize files into proper subdirectories
- Want to audit project structure for production readiness
- Cleaning up after long research/implementation sessions
Example usage:
# Clean up cluttered root directory
"Use project-organizer to analyze the root directory and move files to proper locations"
# Audit project structure
"Use project-organizer to check if our directory structure follows production standards"
# Organize after session
"Use project-organizer to move session exports and research docs to appropriate folders"
What it does:
- ✅ Analyzes directory structure
- ✅ Categorizes misplaced files (session exports, research docs, status reports)
- ✅ Creates organization plan with target locations
- ✅ Executes moves using `git mv` (preserves history)
- ✅ Commits changes with descriptive messages
Agent capabilities:
- Knows production-ready directory structure for T2 project
- Follows organizational rules (see `.claude/agents/project-organizer.md`)
- Uses `git mv` to preserve file history
- Creates target directories if needed
- Groups related moves in atomic commits
Organizational rules (enforced by agent):
✅ Root should contain: package.json, tsconfig.json, vite.config.ts,
README.md, CLAUDE.md, docker files, k8s manifests
❌ Root should NOT contain: Research docs, session exports, status reports,
analysis docs, implementation plans, checkpoint files
→ Target locations:
- Session exports → docs/09-sessions/
- Research documents → docs/11-analysis/
- Status reports → docs/10-execution-plans/
- Implementation plans → docs/10-execution-plans/
- Development guides → docs/01-getting-started/
- Reference materials → docs/reference/
- Sprint checkpoints → thoughts/shared/research/
Workflow with project-organizer:
- Agent analyzes root directory
- Agent creates organization plan (table of moves)
- Agent presents plan for approval
- Upon approval, agent executes moves with `git mv`
- Agent commits changes and pushes to repository
See: .claude/agents/project-organizer.md for complete rules and categorization logic.
How to Invoke Agents
Direct invocation:
"Use [agent-name] to [specific task]"
Parallel invocation (multiple agents):
"Use codebase-locator and codebase-analyzer in parallel to find and understand the authentication system"
Agent coordination:
"Use project-organizer to clean root, then use codebase-analyzer to verify no broken imports"
Agent Best Practices
- Be specific - Give agents clear, focused tasks
- Use right agent - Match task to agent specialization
- Review results - Agents return reports for you to act on
- Combine agents - Use multiple agents for complex workflows
See: .claude/CLAUDE.md for complete agent documentation and autonomous development mode.
Architecture Decision Records (ADRs)
Always read relevant ADRs before making changes.
Most critical:
| ADR | Decision | When to Read |
|---|---|---|
| ADR-014 | Eclipse theia | READ THIS FIRST |
| ADR-010 | MCP Protocol | Tool/resource work |
| ADR-013 | Agentic Architecture | Agent system work |
| ADR-004 | FoundationDB | Persistence changes |
| ADR-007 | Multi-Session | Session features |
Full list: See docs/07-adr/ (24 ADRs total)
⚠️ Common Pitfalls
Top 5 mistakes that break architecture:
- Building features theia has → Search theia docs first
- Using global state → Use dependency injection (`@inject`)
- Calling llms directly → Use MCP (`mcpClient.callTool`)
- Copying V4 UI → Use theia widgets, not V4 components
- OPFS as primary storage → Write to FDB first, cache to OPFS
Full list: See docs/development-guide.md#troubleshooting (10 total)
Environment Setup
# Configure Claude Code output token limit
export CLAUDE_CODE_MAX_OUTPUT_TOKENS=8192
Docker (Recommended)
docker-compose up -d # Start all services
# Access: http://localhost:3000
Local Development
npm install
npm run dev
# Access: http://localhost:5173
LM Studio Configuration
- Host: `host.docker.internal` (Docker) or `localhost` (local)
- Port: `1234`
- API: OpenAI-compatible at `/v1`
- Models: 16+ available (qwen, llama, deepseek, etc.)
See: docs/01-getting-started/development-modes.md for deployment modes (Container-Only, Volume Mount, Remote SSH, Native Desktop)
File Monitor (Audit Logging)
Rust-based file system monitoring for compliance.
# Start daemon (use --poll on WSL2!)
scripts/monitor/start-file-monitor.sh
# View events
tail -f .coditect/logs/events.log
See: docs/file-monitor/dual-log-configuration.md
Git Workflow
Strategy:
- Create meaningful commits
- Use conventional commit format
- Reference issues/ADRs
- Keep commits atomic
Repository: https://github.com/coditect-ai/LM-Studio-multiple-llm-IDE.git
See: docs/git-workflow.md for detailed git configuration, conventional commits, hooks, and troubleshooting.
Important Constraints
What NOT to Do
- ❌ Don't build IDE features from scratch → Use theia
- ❌ Don't create custom persistence → Use fdbService/opfsService
- ❌ Don't make agents without sessions → Always pass sessionId
- ❌ Don't bypass MCP → Use mcpClient for llm access
- ❌ Don't commit secrets → Use .env variables
What TO Do
- ✅ Build theia extensions for new IDE features
- ✅ Use existing services (llmService, mcpClient, a2aClient)
- ✅ Follow agent patterns (extend Agent base class)
- ✅ Make everything session-aware (pass sessionId)
- ✅ Use MCP for tools, A2A for agents
- ✅ Document major decisions (create ADRs)
Success Criteria
When implementing features:
- ✅ Works in theia - Extensions integrate properly
- ✅ Session-aware - All state tied to sessionId
- ✅ Uses MCP/A2A - Protocols used correctly
- ✅ Privacy-first - No cloud calls (except optional Claude)
- ✅ Well-documented - ADRs for major decisions
- ✅ Type-safe - Full TypeScript coverage
Remember
- We're building on theia, not from scratch
- Use MCP for tools, A2A for agents
- Everything is session-aware
- Privacy-first: local llms only
- Document major decisions as ADRs
- Eclipse theia = EPL 2.0 = Free commercial use
When in doubt:
- Read ADR-014 (theia decision)
- Check if theia already has the feature
- Build as extension, not standalone
- Follow existing patterns in codebase
📖 Quick Links
- Reading Order: `docs/reading-order.md`
- Project Structure: `docs/project-structure.md`
- Development Guide: `docs/development-guide.md`
- Git Workflow: `docs/git-workflow.md`
- Documentation Index: `docs/DOCUMENTATION-index.md`
- Architecture: `docs/DEFINITIVE-V5-architecture.md`
- ADRs: `docs/07-adr/`