
ADR-102: Universal Semantic Index Architecture

Status

ACCEPTED - 2026-01-23

Context

CODITECT currently has fragmented indexing:

  • Session messages: Indexed with embeddings (143K+ records in embeddings table)
  • Framework components: Indexed in platform.db (2,331 components) but NO embeddings
  • Customer code/docs: NOT indexed at all

This creates several problems:

  1. /cxq --semantic only searches past conversations, not available tools
  2. Customers cannot search their own codebase semantically
  3. No change detection means re-indexing everything on each run
  4. No unified search across sessions + components + customer content

Current State

┌─────────────────────────────────────────────────────────────┐
│                  CURRENT FRAGMENTED STATE                   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  context.db                  platform.db                    │
│  ┌──────────────────┐        ┌──────────────────┐           │
│  │ messages (287K)  │        │ components (2.3K)│           │
│  │ embeddings (143K)│        │ NO EMBEDDINGS    │           │
│  │ decisions        │        └──────────────────┘           │
│  │ tool_analytics   │                                       │
│  └──────────────────┘        Customer Code/Docs             │
│                              ┌──────────────────┐           │
│                              │ NOT INDEXED      │           │
│                              │ NO EMBEDDINGS    │           │
│                              └──────────────────┘           │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Decision

Implement a Universal Semantic Index with:

  1. Hash-based change tracking for all content types
  2. Unified embedding storage with content type classification
  3. Customer-extensible indexing for code, docs, and custom content
  4. Incremental updates - only re-embed changed content

Architecture

┌─────────────────────────────────────────────────────────────┐
│               UNIVERSAL SEMANTIC INDEX (USI)                │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  context.db                                                 │
│  ┌──────────────────────────────────────────────────────┐   │
│  │                  unified_embeddings                  │   │
│  │ ┌─────────────────────────────────────────────────┐  │   │
│  │ │ id | content_hash | content_type | source_path  │  │   │
│  │ │ embedding | model | chunk_index | metadata      │  │   │
│  │ │ created_at | updated_at                         │  │   │
│  │ └─────────────────────────────────────────────────┘  │   │
│  │                                                      │   │
│  │ content_type enum:                                   │   │
│  │   • message   - Session messages (existing)          │   │
│  │   • component - Agents, commands, skills, hooks      │   │
│  │   • document  - ADRs, guides, references             │   │
│  │   • code      - Customer source files                │   │
│  │   • custom    - Customer-defined content             │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                             │
│  ┌──────────────────────────────────────────────────────┐   │
│  │                    content_hashes                    │   │
│  │ ┌─────────────────────────────────────────────────┐  │   │
│  │ │ file_path | content_hash | file_size | mtime    │  │   │
│  │ │ content_type | indexed_at | chunk_count         │  │   │
│  │ └─────────────────────────────────────────────────┘  │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Hash-Based Change Detection

import hashlib
from pathlib import Path


def compute_content_hash(file_path: Path) -> str:
    """Compute the SHA-256 hash of a file's content."""
    with open(file_path, 'rb') as f:
        return hashlib.sha256(f.read()).hexdigest()


def needs_reindex(file_path: Path, db_hash: str) -> bool:
    """Return True if the file's current hash differs from the stored hash."""
    current_hash = compute_content_hash(file_path)
    return current_hash != db_hash
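
The two helpers above can be combined with the content_hashes table to skip unchanged files. The following is a minimal sketch, assuming a reduced version of that table and two hypothetical helper names (find_changed_files, record_hash) that do not appear in this ADR:

```python
import hashlib
import sqlite3
from pathlib import Path

# Reduced, illustrative subset of the content_hashes table.
SCHEMA = """
CREATE TABLE IF NOT EXISTS content_hashes (
    file_path    TEXT UNIQUE NOT NULL,
    content_hash TEXT NOT NULL,
    content_type TEXT NOT NULL,
    indexed_at   TEXT DEFAULT CURRENT_TIMESTAMP
)
"""


def compute_content_hash(file_path: Path) -> str:
    """SHA-256 of the file's bytes."""
    return hashlib.sha256(file_path.read_bytes()).hexdigest()


def find_changed_files(conn: sqlite3.Connection, paths: list[Path]) -> list[Path]:
    """Return paths that are new or whose hash differs from the stored one."""
    changed = []
    for path in paths:
        row = conn.execute(
            "SELECT content_hash FROM content_hashes WHERE file_path = ?",
            (str(path),),
        ).fetchone()
        if row is None or row[0] != compute_content_hash(path):
            changed.append(path)
    return changed


def record_hash(conn: sqlite3.Connection, path: Path, content_type: str) -> None:
    """Store the current hash once the file has been (re)embedded."""
    conn.execute(
        "INSERT OR REPLACE INTO content_hashes "
        "(file_path, content_hash, content_type) VALUES (?, ?, ?)",
        (str(path), compute_content_hash(path), content_type),
    )
```

In this sketch a file is re-embedded only when find_changed_files returns it, and record_hash is called after the embedding step succeeds, so a failed run leaves the old hash in place and the file is retried next time.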

Customer Content Integration

Customers can index their own content via configuration:

~/.coditect/config/index-config.json:

{
  "customer_indexes": [
    {
      "name": "my-project-code",
      "type": "code",
      "paths": ["~/projects/my-app/src"],
      "patterns": ["**/*.py", "**/*.ts", "**/*.js"],
      "exclude": ["**/node_modules/**", "**/__pycache__/**"],
      "chunk_size": 1000,
      "chunk_overlap": 200
    },
    {
      "name": "my-project-docs",
      "type": "document",
      "paths": ["~/projects/my-app/docs"],
      "patterns": ["**/*.md"],
      "exclude": []
    },
    {
      "name": "company-wiki",
      "type": "custom",
      "paths": ["/shared/wiki"],
      "patterns": ["**/*.md", "**/*.txt"],
      "metadata": {"source": "confluence-export"}
    }
  ],
  "embedding_model": "all-MiniLM-L6-v2",
  "chunk_size_default": 500,
  "chunk_overlap_default": 100
}
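
To illustrate how one customer_indexes entry might be turned into a file list, here is a small sketch. The function name resolve_index_files is hypothetical, and the matching rules (pathlib globbing for patterns, fnmatch against the full path for exclude) are assumptions; the ADR does not specify the glob dialect.

```python
from fnmatch import fnmatch
from pathlib import Path


def resolve_index_files(entry: dict) -> list[Path]:
    """Expand one customer_indexes entry into a sorted list of files."""
    excludes = entry.get("exclude", [])
    found = set()
    for root in entry["paths"]:
        base = Path(root).expanduser()  # handles "~/projects/..."
        for pattern in entry["patterns"]:
            for candidate in base.glob(pattern):
                if not candidate.is_file():
                    continue
                # fnmatch's "*" also matches "/", so "**/node_modules/**"
                # matches any path that contains a node_modules directory.
                if any(fnmatch(candidate.as_posix(), ex) for ex in excludes):
                    continue
                found.add(candidate)
    return sorted(found)
```

Collecting into a set avoids indexing a file twice when it matches more than one pattern.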

Search API

# Unified search across all content types
results = cxq_search(
    query="kubernetes deployment patterns",
    content_types=["component", "document", "code", "message"],  # Filter by type
    limit=20,
    threshold=0.7
)

# Returns results from:
#   - Agents that mention kubernetes
#   - ADRs about deployment
#   - Customer code with k8s patterns
#   - Past sessions discussing kubernetes
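
The internals of cxq_search are not shown in this ADR. As a rough illustration of what the content_types, limit, and threshold parameters imply, the sketch below does a brute-force cosine-similarity scan. It assumes embeddings are stored as packed float32 BLOBs and that the query has already been embedded; search_rows and unpack_vector are hypothetical names.

```python
import math
from array import array


def unpack_vector(blob: bytes) -> list[float]:
    """Decode a BLOB of packed float32 values (assumed storage format)."""
    vec = array("f")
    vec.frombytes(blob)
    return vec.tolist()


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_rows(rows, query_vec, content_types=None, limit=20, threshold=0.7):
    """Rank rows of (source_path, content_type, embedding_blob) by similarity."""
    hits = []
    for source_path, content_type, blob in rows:
        if content_types and content_type not in content_types:
            continue  # type filter applied before scoring
        score = cosine(query_vec, unpack_vector(blob))
        if score >= threshold:
            hits.append((score, source_path, content_type))
    hits.sort(key=lambda h: h[0], reverse=True)
    return hits[:limit]
```

A linear scan like this is only meant to make the semantics concrete; a production index would likely use a vectorized or approximate nearest-neighbour search.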

CLI Integration

# Index framework components (run during /cx)
/cx # Auto-indexes changed components

# Index customer project
/cx --index-project ~/my-app # Index code + docs

# Search across everything
/cxq --semantic "authentication flow"
/cxq --semantic "error handling" --type code
/cxq --semantic "deployment" --type component,document

# Manage indexes
/index --status # Show index health
/index --rebuild component # Force rebuild specific type
/index --add-path ~/new-project # Add new index path

Customer Use Cases

1. Developer Onboarding

New developer joins team, asks Claude:

"How does authentication work in this codebase?"

Claude searches:

  • Customer's auth code files → finds auth.service.ts, middleware/auth.py
  • Customer's docs → finds docs/AUTH-GUIDE.md
  • CODITECT components → finds authentication-authorization skill
  • Past sessions → finds previous discussions about auth bugs

2. Code Discovery

Developer asks:

"Find all API rate limiting implementations"

Claude searches:

  • Customer code → finds rate limiter middleware
  • CODITECT patterns → finds rate-limiting-patterns skill
  • ADRs → finds decisions about rate limit values

3. Architecture Research

Architect asks:

"What patterns do we use for event-driven systems?"

Claude searches:

  • Customer code → finds Kafka consumers, event handlers
  • CODITECT components → finds event-driven-architecture skill
  • ADRs → finds ADR-045-event-sourcing-decision.md
  • Past sessions → finds previous architecture discussions

4. Bug Investigation

Developer asks:

"Has anyone seen this TypeError before?"

Claude searches:

  • error_solutions table → finds past fixes
  • Past sessions → finds similar error discussions
  • Customer code → finds related error handlers

Database Schema

-- Unified embedding storage
CREATE TABLE unified_embeddings (
    id              INTEGER PRIMARY KEY AUTOINCREMENT,
    content_hash    TEXT NOT NULL,       -- SHA256 of content
    content_type    TEXT NOT NULL,       -- message|component|document|code|custom
    source_path     TEXT,                -- File path or message ID
    source_name     TEXT,                -- Human-readable name
    chunk_index     INTEGER DEFAULT 0,   -- For multi-chunk documents
    chunk_total     INTEGER DEFAULT 1,   -- Total chunks
    content_preview TEXT,                -- First 500 chars
    embedding       BLOB NOT NULL,       -- 384-dim vector (all-MiniLM-L6-v2)
    model           TEXT NOT NULL,       -- Embedding model name
    metadata        TEXT,                -- JSON: extra info
    customer_index  TEXT,                -- Customer index name (null for framework)
    created_at      TEXT DEFAULT CURRENT_TIMESTAMP,
    updated_at      TEXT DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(content_hash, chunk_index)
);

-- Hash tracking for change detection
CREATE TABLE content_hashes (
    id             INTEGER PRIMARY KEY AUTOINCREMENT,
    file_path      TEXT UNIQUE NOT NULL,
    content_hash   TEXT NOT NULL,        -- SHA256 of content
    file_size      INTEGER,
    mtime          REAL,                 -- File modification time
    content_type   TEXT NOT NULL,
    customer_index TEXT,                 -- Customer index name
    chunk_count    INTEGER DEFAULT 1,
    indexed_at     TEXT DEFAULT CURRENT_TIMESTAMP,
    last_checked   TEXT DEFAULT CURRENT_TIMESTAMP
);

-- Customer index configuration
CREATE TABLE customer_indexes (
    id              INTEGER PRIMARY KEY AUTOINCREMENT,
    name            TEXT UNIQUE NOT NULL,
    config          TEXT NOT NULL,       -- JSON configuration
    file_count      INTEGER DEFAULT 0,
    embedding_count INTEGER DEFAULT 0,
    last_indexed    TEXT,
    created_at      TEXT DEFAULT CURRENT_TIMESTAMP
);

-- Indexes for fast lookup
CREATE INDEX idx_unified_embeddings_type ON unified_embeddings(content_type);
CREATE INDEX idx_unified_embeddings_hash ON unified_embeddings(content_hash);
CREATE INDEX idx_unified_embeddings_customer ON unified_embeddings(customer_index);
CREATE INDEX idx_content_hashes_type ON content_hashes(content_type);
CREATE INDEX idx_content_hashes_customer ON content_hashes(customer_index);
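
As a usage illustration for the schema above, the sketch below writes one chunk embedding with Python's sqlite3 module. It uses a reduced subset of unified_embeddings, assumes vectors are serialized as packed float32 values, and introduces a hypothetical helper named store_embedding.

```python
import hashlib
import sqlite3
from array import array

# Reduced, illustrative subset of unified_embeddings.
SCHEMA = """
CREATE TABLE IF NOT EXISTS unified_embeddings (
    id           INTEGER PRIMARY KEY AUTOINCREMENT,
    content_hash TEXT NOT NULL,
    content_type TEXT NOT NULL,
    source_path  TEXT,
    chunk_index  INTEGER DEFAULT 0,
    embedding    BLOB NOT NULL,
    model        TEXT NOT NULL,
    UNIQUE(content_hash, chunk_index)
)
"""


def store_embedding(conn, text, vector, content_type, source_path,
                    chunk_index=0, model="all-MiniLM-L6-v2"):
    """Insert or overwrite the embedding row for one chunk of content."""
    content_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    blob = array("f", vector).tobytes()  # float32 packing is an assumption
    conn.execute(
        "INSERT OR REPLACE INTO unified_embeddings "
        "(content_hash, content_type, source_path, chunk_index, embedding, model) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (content_hash, content_type, source_path, chunk_index, blob, model),
    )
    return content_hash
```

Because uniqueness is defined on (content_hash, chunk_index), storing the same content twice overwrites the existing row rather than adding a second one. The same constraint means two different files with identical content would share a single row.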

Implementation Phases

Phase 1: Hash-Based Change Tracking (H.5.7.1)

  • Add content_hashes table
  • Implement hash computation and comparison
  • Integrate with /cx pipeline
  • Skip unchanged files during indexing

Phase 2: Component Embeddings (H.5.7.2)

  • Add unified_embeddings table
  • Index all 2,331 framework components
  • Integrate with /cxq --semantic
  • Add --type component filter

Phase 3: Customer Code Indexing (H.5.7.3)

  • Add customer_indexes table
  • Implement index-config.json parsing
  • Add /cx --index-project command
  • Chunking strategy for large files
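
The chunking strategy is not specified further in this ADR. One common approach, given the chunk_size and chunk_overlap settings in index-config.json, is a sliding window. The sketch below assumes both values are measured in characters (the ADR does not state the unit) and uses a hypothetical function name, chunk_text.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into windows of chunk_size that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

With the example settings (1000 and 200), a 2,500-character file yields three chunks starting at offsets 0, 800, and 1600. The number of chunks produced corresponds to chunk_count in content_hashes and chunk_total in unified_embeddings.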

Phase 4: Unified Search (H.5.7.4)

  • Hybrid search across all content types
  • Relevance scoring with type weighting
  • Result grouping by content type
  • Performance optimization
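
Type weighting and grouping could be sketched as follows. The weights are purely illustrative placeholders, not values from this ADR, and rank_and_group is a hypothetical name.

```python
from collections import defaultdict

# Hypothetical weights: values below 1.0 demote a content type.
TYPE_WEIGHTS = {
    "code": 1.0,
    "component": 0.9,
    "document": 0.9,
    "custom": 0.8,
    "message": 0.7,
}


def rank_and_group(hits, weights=TYPE_WEIGHTS):
    """Group (similarity, source_path, content_type) hits by content type.

    Each hit's similarity is multiplied by its type weight, and each
    group is sorted by the weighted score, best first.
    """
    grouped = defaultdict(list)
    for similarity, source_path, content_type in hits:
        weighted = similarity * weights.get(content_type, 1.0)
        grouped[content_type].append((weighted, source_path))
    for items in grouped.values():
        items.sort(reverse=True)
    return dict(grouped)
```

Multiplicative weights keep the ordering within a type unchanged and only affect how types compare with each other if the groups are later merged into one list.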

Consequences

Positive

  1. Semantic search everywhere - Find agents, skills, code, docs by meaning
  2. Incremental updates - Only re-embed changed content (10-100x faster)
  3. Customer extensibility - Index any codebase or doc collection
  4. Unified experience - One search across all knowledge sources
  5. Better agent discovery - /which can use semantic matching

Negative

  1. Storage increase - ~1KB per embedding × content count
  2. Initial indexing time - ~5-10 minutes for full framework
  3. Embedding model dependency - Requires sentence-transformers
  4. Complexity - More tables, more code paths

Mitigations

  • Storage: Embeddings compress well; 384-dim vectors are compact
  • Initial time: One-time cost; incremental updates are fast
  • Dependency: Already required for semantic search; graceful fallback
  • Complexity: Clear separation of concerns; phased rollout
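
For a rough sense of scale, the storage cost can be estimated from the vector dimensions. The calculation below assumes uncompressed float32 values (4 bytes each), which the ADR does not state explicitly; under that assumption a 384-dim vector takes 1,536 bytes, the same order of magnitude as the ~1KB estimate above.

```python
DIMENSIONS = 384        # all-MiniLM-L6-v2 output size, per the schema comment
BYTES_PER_FLOAT32 = 4   # assumption: vectors stored as raw float32


def embedding_bytes(count: int, dims: int = DIMENSIONS,
                    bytes_per_value: int = BYTES_PER_FLOAT32) -> int:
    """Raw vector storage for `count` embeddings, excluding row overhead."""
    return count * dims * bytes_per_value


per_vector = embedding_bytes(1)        # 1,536 bytes
framework = embedding_bytes(2_331)     # 3,580,416 bytes, about 3.4 MiB
sessions = embedding_bytes(143_000)    # 219,648,000 bytes, about 209 MiB
```

These figures cover only the embedding column; content_preview, metadata, and SQLite index overhead come on top.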

Performance Estimates

Operation                        Time     Notes
-------------------------------  -------  ---------------------------
Hash check (1 file)              <1ms     O(1) file read + SHA256
Hash check (2,331 files)         ~2s      Parallel file I/O
Embed 1 component                ~50ms    GPU: ~5ms
Embed 2,331 components           ~2 min   Initial only
Incremental embed (10 changed)   ~500ms   Typical /cx run
Semantic search                  ~100ms   Vector similarity + ranking
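
The "Parallel file I/O" entry could be realized with a thread pool, since file reads release the GIL and hashlib does so for larger buffers. This is a minimal sketch with hypothetical names (hash_file, hash_files) and an arbitrary worker count; it makes no claim about the timings in the table.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def hash_file(path: Path) -> tuple[Path, str]:
    """SHA-256 of one file, read in 1 MiB blocks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(block)
    return path, digest.hexdigest()


def hash_files(paths, max_workers: int = 8) -> dict[Path, str]:
    """Hash many files concurrently and return a {path: hex digest} map."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(hash_file, paths))
```

The resulting map can be compared against the content_hash column of content_hashes to decide which files need re-embedding.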

Security Considerations

  1. Customer code isolation - Each customer index is separate
  2. Path validation - Prevent indexing outside allowed paths
  3. Sensitive file exclusion - Default exclude for .env, secrets, credentials
  4. Hash-only tracking - Full file content is not stored; only hashes, embeddings, and a short preview (content_preview, first 500 chars) are kept
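
Path validation and sensitive file exclusion could look roughly like the sketch below. The pattern list is a hypothetical set of defaults (the ADR names only .env, secrets, and credentials), and the function names are illustrative. Path.is_relative_to requires Python 3.9 or later.

```python
from fnmatch import fnmatch
from pathlib import Path

# Hypothetical default exclusions, matched against the lowercased file name.
SENSITIVE_PATTERNS = [".env", ".env.*", "*secret*", "*credential*", "*.pem", "*.key"]


def is_within(path: Path, allowed_roots: list[Path]) -> bool:
    """True if path, after resolving symlinks and '..', lies under a root."""
    resolved = path.resolve()
    return any(resolved.is_relative_to(root.resolve()) for root in allowed_roots)


def is_sensitive(path: Path) -> bool:
    """True if the file name matches one of the sensitive patterns."""
    name = path.name.lower()
    return any(fnmatch(name, pattern) for pattern in SENSITIVE_PATTERNS)


def may_index(path: Path, allowed_roots: list[Path]) -> bool:
    """Allow indexing only for non-sensitive files inside the allowed roots."""
    return is_within(path, allowed_roots) and not is_sensitive(path)
```

Resolving the path before the containment check is what defeats traversal via ".." segments or symlinks that point outside the configured roots.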

Related

  • ADR-079: Trajectory Visualization System (tool analytics)
  • ADR-089: Two-Database Architecture (context.db vs platform.db)
  • ADR-101: Database Usage Patterns (query patterns)
  • ADR-103: Four-Database Separation Architecture (supersedes database layout)
  • MCP Server: mcp-semantic-search (current implementation)
  • Task: H.5.7 - Universal Semantic Index

Implementation Note

This ADR defines the semantic indexing concepts. ADR-103 defines WHERE the indexes live:

  • Framework embeddings → platform-index.db
  • Customer session embeddings → context.db
  • Customer project embeddings → projects.db (unified, cloud-integrated)
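
One way to read this mapping in code is a small routing function. The rule used here (session messages go to context.db, anything carrying a customer index name goes to projects.db, everything else is framework content) is an interpretation of the list above, not something ADR-103 is quoted as defining; target_database is a hypothetical name.

```python
from typing import Optional


def target_database(content_type: str, customer_index: Optional[str] = None) -> str:
    """Pick the database file for an embedding (interpretive sketch)."""
    if content_type == "message":
        return "context.db"          # customer session embeddings
    if customer_index is not None:
        return "projects.db"         # customer project embeddings
    return "platform-index.db"       # framework embeddings
```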

Cloud Integration (ADR-103)

Projects in projects.db are identified by globally unique UUIDs assigned by CODITECT cloud:

Field                 Purpose
--------------------  ----------------------------------------
project_uuid          CODITECT cloud UUID (primary identifier)
github_repo_url       Auto-detected from git remote
tenant_id             Multi-tenant organization isolation
team_id               Team-scoped access control
parent_project_uuid   Monorepo/submodule hierarchy

All embeddings in projects.db reference project_uuid, not local auto-increment IDs, enabling cloud sync with auth.coditect.ai.

References