ADR-102: Universal Semantic Index Architecture
Status
ACCEPTED - 2026-01-23
Context
CODITECT currently has fragmented indexing:
- Session messages: Indexed with embeddings (143K+ records in the embeddings table)
- Framework components: Indexed in platform.db (2,331 components) but NO embeddings
- Customer code/docs: NOT indexed at all
This creates several problems:
- /cxq --semantic only searches past conversations, not available tools
- Customers cannot search their own codebase semantically
- No change detection means re-indexing everything on each run
- No unified search across sessions + components + customer content
Current State
┌─────────────────────────────────────────────────────────────┐
│ CURRENT FRAGMENTED STATE │
├─────────────────────────────────────────────────────────────┤
│ │
│ context.db platform.db │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ messages (287K) │ │ components (2.3K)│ │
│ │ embeddings (143K)│ │ NO EMBEDDINGS │ │
│ │ decisions │ └──────────────────┘ │
│ │ tool_analytics │ │
│ └──────────────────┘ Customer Code/Docs │
│ ┌──────────────────┐ │
│ │ NOT INDEXED │ │
│ │ NO EMBEDDINGS │ │
│ └──────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Decision
Implement a Universal Semantic Index with:
- Hash-based change tracking for all content types
- Unified embedding storage with content type classification
- Customer-extensible indexing for code, docs, and custom content
- Incremental updates - only re-embed changed content
Architecture
┌─────────────────────────────────────────────────────────────┐
│ UNIVERSAL SEMANTIC INDEX (USI) │
├─────────────────────────────────────────────────────────────┤
│ │
│ context.db │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ unified_embeddings │ │
│ │ ┌─────────────────────────────────────────────────┐ │ │
│ │ │ id | content_hash | content_type | source_path │ │ │
│ │ │ embedding | model | chunk_index | metadata │ │ │
│ │ │ created_at | updated_at │ │ │
│ │ └─────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ content_type enum: │ │
│ │ • message - Session messages (existing) │ │
│ │ • component - Agents, commands, skills, hooks │ │
│ │ • document - ADRs, guides, references │ │
│ │ • code - Customer source files │ │
│ │ • custom - Customer-defined content │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ content_hashes │ │
│ │ ┌─────────────────────────────────────────────────┐ │ │
│ │ │ file_path | content_hash | file_size | mtime │ │ │
│ │ │ content_type | indexed_at | chunk_count │ │ │
│ │ └─────────────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Hash-Based Change Detection
import hashlib
from pathlib import Path


def compute_content_hash(file_path: Path) -> str:
    """Compute the SHA256 hash of a file's content."""
    with open(file_path, 'rb') as f:
        return hashlib.sha256(f.read()).hexdigest()


def needs_reindex(file_path: Path, db_hash: str) -> bool:
    """Return True if the file's current hash differs from the stored hash."""
    current_hash = compute_content_hash(file_path)
    return current_hash != db_hash
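As a rough illustration of how the hash comparison could drive an incremental run, the sketch below classifies files against the content_hashes table and records new hashes after indexing. The helper names (sha256_of, classify_files, record_hash) are hypothetical and not part of the CODITECT codebase; the sketch assumes the content_hashes schema from the Database Schema section and an SQLite build with upsert support (3.24+).

```python
import hashlib
import sqlite3
from pathlib import Path


def sha256_of(path: Path) -> str:
    """SHA256 of the file's bytes, read in blocks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def classify_files(conn: sqlite3.Connection, paths: list[Path]) -> dict[str, list[Path]]:
    """Sort files into new / changed / unchanged against content_hashes."""
    result: dict[str, list[Path]] = {"new": [], "changed": [], "unchanged": []}
    for path in paths:
        row = conn.execute(
            "SELECT content_hash FROM content_hashes WHERE file_path = ?",
            (str(path),),
        ).fetchone()
        if row is None:
            result["new"].append(path)
        elif row[0] != sha256_of(path):
            result["changed"].append(path)
        else:
            result["unchanged"].append(path)
    return result


def record_hash(conn: sqlite3.Connection, path: Path, content_type: str) -> None:
    """Insert or refresh the tracking row once a file has been (re)indexed."""
    stat = path.stat()
    conn.execute(
        """
        INSERT INTO content_hashes
            (file_path, content_hash, file_size, mtime, content_type)
        VALUES (?, ?, ?, ?, ?)
        ON CONFLICT(file_path) DO UPDATE SET
            content_hash = excluded.content_hash,
            file_size = excluded.file_size,
            mtime = excluded.mtime,
            indexed_at = CURRENT_TIMESTAMP,
            last_checked = CURRENT_TIMESTAMP
        """,
        (str(path), sha256_of(path), stat.st_size, stat.st_mtime, content_type),
    )
```

Only the files in the "new" and "changed" buckets would need to be re-embedded; record_hash is called after embedding succeeds so that a failed run is retried next time.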
Customer Content Integration
Customers can index their own content via configuration:
~/.coditect/config/index-config.json:
{
  "customer_indexes": [
    {
      "name": "my-project-code",
      "type": "code",
      "paths": ["~/projects/my-app/src"],
      "patterns": ["**/*.py", "**/*.ts", "**/*.js"],
      "exclude": ["**/node_modules/**", "**/__pycache__/**"],
      "chunk_size": 1000,
      "chunk_overlap": 200
    },
    {
      "name": "my-project-docs",
      "type": "document",
      "paths": ["~/projects/my-app/docs"],
      "patterns": ["**/*.md"],
      "exclude": []
    },
    {
      "name": "company-wiki",
      "type": "custom",
      "paths": ["/shared/wiki"],
      "patterns": ["**/*.md", "**/*.txt"],
      "metadata": {"source": "confluence-export"}
    }
  ],
  "embedding_model": "all-MiniLM-L6-v2",
  "chunk_size_default": 500,
  "chunk_overlap_default": 100
}
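One possible way to consume this configuration is sketched below: load the file, resolve per-index chunk settings against the defaults, select matching files, and split text into overlapping windows. All function names are hypothetical. The sketch assumes chunk_size and chunk_overlap are measured in characters, which the ADR does not specify, and that exclude patterns are matched against the full POSIX-style path.

```python
import json
from fnmatch import fnmatch
from pathlib import Path


def load_config(path: Path) -> dict:
    """Parse index-config.json into a plain dict."""
    return json.loads(path.read_text(encoding="utf-8"))


def effective_chunking(index: dict, config: dict) -> tuple[int, int]:
    """Per-index chunk settings, falling back to the *_default keys."""
    size = index.get("chunk_size", config["chunk_size_default"])
    overlap = index.get("chunk_overlap", config["chunk_overlap_default"])
    if not 0 <= overlap < size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    return size, overlap


def select_files(index: dict) -> list[Path]:
    """Files under each configured path that match a pattern and no exclude."""
    excludes = index.get("exclude", [])
    files: set[Path] = set()
    for root in index["paths"]:
        base = Path(root).expanduser()
        for pattern in index["patterns"]:
            for path in base.glob(pattern):
                excluded = any(fnmatch(path.as_posix(), ex) for ex in excludes)
                if path.is_file() and not excluded:
                    files.add(path)
    return sorted(files)


def chunk_text(text: str, size: int, overlap: int) -> list[str]:
    """Split text into windows of `size` characters sharing `overlap` characters."""
    if not text:
        return []
    step = size - overlap
    last_start = max(len(text) - overlap, 1)
    return [text[start:start + size] for start in range(0, last_start, step)]
```

With the configuration above, my-project-code would be chunked at 1000/200, while my-project-docs and company-wiki would fall back to 500/100.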
Search API
# Unified search across all content types
results = cxq_search(
    query="kubernetes deployment patterns",
    content_types=["component", "document", "code", "message"],  # Filter by type
    limit=20,
    threshold=0.7
)
# Returns results from:
# - Agents that mention kubernetes
# - ADRs about deployment
# - Customer code with k8s patterns
# - Past sessions discussing kubernetes
CLI Integration
# Index framework components (run during /cx)
/cx # Auto-indexes changed components
# Index customer project
/cx --index-project ~/my-app # Index code + docs
# Search across everything
/cxq --semantic "authentication flow"
/cxq --semantic "error handling" --type code
/cxq --semantic "deployment" --type component,document
# Manage indexes
/index --status # Show index health
/index --rebuild component # Force rebuild specific type
/index --add-path ~/new-project # Add new index path
Customer Use Cases
1. Developer Onboarding
New developer joins team, asks Claude:
"How does authentication work in this codebase?"
Claude searches:
- Customer's auth code files → finds auth.service.ts, middleware/auth.py
- Customer's docs → finds docs/AUTH-GUIDE.md
- CODITECT components → finds authentication-authorization skill
- Past sessions → finds previous discussions about auth bugs
2. Code Discovery
Developer asks:
"Find all API rate limiting implementations"
Claude searches:
- Customer code → finds rate limiter middleware
- CODITECT patterns → finds rate-limiting-patterns skill
- ADRs → finds decisions about rate limit values
3. Architecture Research
Architect asks:
"What patterns do we use for event-driven systems?"
Claude searches:
- Customer code → finds Kafka consumers, event handlers
- CODITECT components → finds event-driven-architecture skill
- ADRs → finds ADR-045-event-sourcing-decision.md
- Past sessions → finds previous architecture discussions
4. Bug Investigation
Developer asks:
"Has anyone seen this TypeError before?"
Claude searches:
- error_solutions table → finds past fixes
- Past sessions → finds similar error discussions
- Customer code → finds related error handlers
Database Schema
-- Unified embedding storage
CREATE TABLE unified_embeddings (
id INTEGER PRIMARY KEY AUTOINCREMENT,
content_hash TEXT NOT NULL, -- SHA256 of content
content_type TEXT NOT NULL, -- message|component|document|code|custom
source_path TEXT, -- File path or message ID
source_name TEXT, -- Human-readable name
chunk_index INTEGER DEFAULT 0, -- For multi-chunk documents
chunk_total INTEGER DEFAULT 1, -- Total chunks
content_preview TEXT, -- First 500 chars
embedding BLOB NOT NULL, -- 384-dim vector (all-MiniLM-L6-v2)
model TEXT NOT NULL, -- Embedding model name
metadata TEXT, -- JSON: extra info
customer_index TEXT, -- Customer index name (null for framework)
created_at TEXT DEFAULT CURRENT_TIMESTAMP,
updated_at TEXT DEFAULT CURRENT_TIMESTAMP,
UNIQUE(content_hash, chunk_index)
);
-- Hash tracking for change detection
CREATE TABLE content_hashes (
id INTEGER PRIMARY KEY AUTOINCREMENT,
file_path TEXT UNIQUE NOT NULL,
content_hash TEXT NOT NULL, -- SHA256 of content
file_size INTEGER,
mtime REAL, -- File modification time
content_type TEXT NOT NULL,
customer_index TEXT, -- Customer index name
chunk_count INTEGER DEFAULT 1,
indexed_at TEXT DEFAULT CURRENT_TIMESTAMP,
last_checked TEXT DEFAULT CURRENT_TIMESTAMP
);
-- Customer index configuration
CREATE TABLE customer_indexes (
id INTEGER PRIMARY KEY AUTOINCREMENT,
name TEXT UNIQUE NOT NULL,
config TEXT NOT NULL, -- JSON configuration
file_count INTEGER DEFAULT 0,
embedding_count INTEGER DEFAULT 0,
last_indexed TEXT,
created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
-- Indexes for fast lookup
CREATE INDEX idx_unified_embeddings_type ON unified_embeddings(content_type);
CREATE INDEX idx_unified_embeddings_hash ON unified_embeddings(content_hash);
CREATE INDEX idx_unified_embeddings_customer ON unified_embeddings(customer_index);
CREATE INDEX idx_content_hashes_type ON content_hashes(content_type);
CREATE INDEX idx_content_hashes_customer ON content_hashes(customer_index);
Implementation Phases
Phase 1: Hash-Based Change Tracking (H.5.7.1)
- Add content_hashes table
- Implement hash computation and comparison
- Integrate with /cx pipeline
- Skip unchanged files during indexing
Phase 2: Component Embeddings (H.5.7.2)
- Add unified_embeddings table
- Index all 2,331 framework components
- Integrate with /cxq --semantic
- Add --type component filter
Phase 3: Customer Code Indexing (H.5.7.3)
- Add customer_indexes table
- Implement index-config.json parsing
- Add /cx --index-project command
- Chunking strategy for large files
Phase 4: Unified Search (H.5.7.4)
- Hybrid search across all content types
- Relevance scoring with type weighting
- Result grouping by content type
- Performance optimization
Consequences
Positive
- Semantic search everywhere - Find agents, skills, code, docs by meaning
- Incremental updates - Only re-embed changed content (estimated 10-100x faster than a full re-index)
- Customer extensibility - Index any codebase or doc collection
- Unified experience - One search across all knowledge sources
- Better agent discovery - /which can use semantic matching
Negative
- Storage increase - ~1.5 KB per embedding if stored as float32 (384 dims × 4 bytes), multiplied by the number of indexed chunks
- Initial indexing time - ~5-10 minutes for full framework
- Embedding model dependency - Requires sentence-transformers
- Complexity - More tables, more code paths
Mitigations
- Storage: Embeddings compress well; 384-dim vectors are compact
- Initial time: One-time cost; incremental updates are fast
- Dependency: Already required for semantic search; graceful fallback
- Complexity: Clear separation of concerns; phased rollout
Performance Estimates
| Operation | Time | Notes |
|---|---|---|
| Hash check (1 file) | <1ms | One file read + SHA256; scales with file size |
| Hash check (2,331 files) | ~2s | Parallel file I/O |
| Embed 1 component | ~50ms | GPU: ~5ms |
| Embed 2,331 components | ~2 min | Initial only |
| Incremental embed (10 changed) | ~500ms | Typical /cx run |
| Semantic search | ~100ms | Vector similarity + ranking |
Security Considerations
- Customer code isolation - Each customer index is separate
- Path validation - Prevent indexing outside allowed paths
- Sensitive file exclusion - Default exclude for .env, secrets, credentials
- Hash-only tracking - Full content is not stored; only hashes, embeddings, and a short preview (content_preview, first 500 chars)
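Path validation and sensitive file exclusion could be combined into a single gate applied before any file is hashed or embedded. The sketch below is one way to do that; is_indexable and the pattern list are hypothetical, with the patterns chosen only to illustrate the ".env, secrets, credentials" defaults named above. It requires Python 3.9+ for Path.is_relative_to.

```python
from fnmatch import fnmatch
from pathlib import Path

# Illustrative defaults only; the actual exclusion list is not defined in this ADR.
DEFAULT_SENSITIVE_PATTERNS = (
    ".env", ".env.*", "*.pem", "*.key", "*secret*", "*credential*",
)


def is_indexable(path: Path, allowed_roots: list[Path]) -> bool:
    """True if path resolves inside an allowed root and matches no sensitive pattern."""
    # resolve() follows symlinks and collapses "..", so escapes are caught here.
    resolved = path.expanduser().resolve()
    for root in allowed_roots:
        base = root.expanduser().resolve()
        if resolved.is_relative_to(base):
            # Check every path component below the root, not just the file name.
            parts = resolved.relative_to(base).parts
            return not any(
                fnmatch(part.lower(), pattern)
                for part in parts
                for pattern in DEFAULT_SENSITIVE_PATTERNS
            )
    return False
```

Resolving before comparing means a symlink inside an allowed root that points elsewhere is rejected rather than silently indexed.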
Related
- ADR-079: Trajectory Visualization System (tool analytics)
- ADR-089: Two-Database Architecture (context.db vs platform.db)
- ADR-101: Database Usage Patterns (query patterns)
- ADR-103: Four-Database Separation Architecture (supersedes database layout)
- MCP Server: mcp-semantic-search (current implementation)
- Task: H.5.7 - Universal Semantic Index
Implementation Note
This ADR defines the semantic indexing concepts. ADR-103 defines WHERE the indexes live:
- Framework embeddings → platform-index.db
- Customer session embeddings → context.db
- Customer project embeddings → projects.db (unified, cloud-integrated)
Cloud Integration (ADR-103)
Projects in projects.db are identified by globally unique UUIDs assigned by CODITECT cloud:
| Field | Purpose |
|---|---|
| project_uuid | CODITECT cloud UUID (primary identifier) |
| github_repo_url | Auto-detected from git remote |
| tenant_id | Multi-tenant organization isolation |
| team_id | Team-scoped access control |
| parent_project_uuid | Monorepo/submodule hierarchy |
All embeddings in projects.db reference project_uuid, not local auto-increment IDs, enabling cloud sync with auth.coditect.ai.
References
- Sentence Transformers - Embedding model
- all-MiniLM-L6-v2 - 384-dim embeddings
- Reciprocal Rank Fusion - Hybrid scoring