ADR-102: Universal Semantic Index Architecture

Status

ACCEPTED - 2026-01-23

Context

CODITECT currently has fragmented indexing:

Session messages: Indexed with embeddings (143K+ records in embeddings table)
Framework components: Indexed in platform.db (2,331 components) but NO embeddings
Customer code/docs: NOT indexed at all

This creates several problems:

/cxq --semantic only searches past conversations, not available tools
Customers cannot search their own codebase semantically
No change detection means re-indexing everything on each run
No unified search across sessions + components + customer content

Current State

┌─────────────────────────────────────────────────────────────┐
│                    CURRENT FRAGMENTED STATE                  │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  context.db                    platform.db                   │
│  ┌──────────────────┐         ┌──────────────────┐          │
│  │ messages (287K)  │         │ components (2.3K)│          │
│  │ embeddings (143K)│         │ NO EMBEDDINGS    │          │
│  │ decisions        │         └──────────────────┘          │
│  │ tool_analytics   │                                        │
│  └──────────────────┘         Customer Code/Docs             │
│                               ┌──────────────────┐          │
│                               │ NOT INDEXED      │          │
│                               │ NO EMBEDDINGS    │          │
│                               └──────────────────┘          │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Decision

Implement a Universal Semantic Index with:

Hash-based change tracking for all content types
Unified embedding storage with content type classification
Customer-extensible indexing for code, docs, and custom content
Incremental updates - only re-embed changed content

Architecture

┌─────────────────────────────────────────────────────────────┐
│              UNIVERSAL SEMANTIC INDEX (USI)                  │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  context.db                                                  │
│  ┌──────────────────────────────────────────────────────┐   │
│  │                 unified_embeddings                    │   │
│  │  ┌─────────────────────────────────────────────────┐ │   │
│  │  │ id | content_hash | content_type | source_path │ │   │
│  │  │ embedding | model | chunk_index | metadata     │ │   │
│  │  │ created_at | updated_at                        │ │   │
│  │  └─────────────────────────────────────────────────┘ │   │
│  │                                                       │   │
│  │  content_type enum:                                   │   │
│  │  • message     - Session messages (existing)          │   │
│  │  • component   - Agents, commands, skills, hooks      │   │
│  │  • document    - ADRs, guides, references             │   │
│  │  • code        - Customer source files                │   │
│  │  • custom      - Customer-defined content             │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                              │
│  ┌──────────────────────────────────────────────────────┐   │
│  │                 content_hashes                        │   │
│  │  ┌─────────────────────────────────────────────────┐ │   │
│  │  │ file_path | content_hash | file_size | mtime   │ │   │
│  │  │ content_type | indexed_at | chunk_count        │ │   │
│  │  └─────────────────────────────────────────────────┘ │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Hash-Based Change Detection

import hashlib
from pathlib import Path
from typing import Optional

def compute_content_hash(file_path: Path) -> str:
    """Compute the SHA-256 hex digest of the file's bytes."""
    with open(file_path, 'rb') as f:
        return hashlib.sha256(f.read()).hexdigest()

def needs_reindex(file_path: Path, db_hash: Optional[str]) -> bool:
    """Return True if the file is new (no stored hash) or its content changed."""
    current_hash = compute_content_hash(file_path)
    return current_hash != db_hash
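
The hash comparison above can be paired with the content_hashes table so that unchanged files are skipped on each run. The following is a minimal sketch, assuming a SQLite connection and a simplified subset of the content_hashes columns; the function names (sha256_of, init_db, scan_for_changes, record_hash) are hypothetical and not part of CODITECT.

```python
import hashlib
import sqlite3
from pathlib import Path


def sha256_of(path: Path) -> str:
    """SHA-256 hex digest of the file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def init_db(conn: sqlite3.Connection) -> None:
    # Simplified subset of the content_hashes table (assumption for this sketch).
    conn.execute(
        "CREATE TABLE IF NOT EXISTS content_hashes ("
        " file_path TEXT UNIQUE NOT NULL,"
        " content_hash TEXT NOT NULL,"
        " content_type TEXT NOT NULL)"
    )


def scan_for_changes(conn: sqlite3.Connection, paths: list[Path]) -> list[Path]:
    """Return the files that are new or whose hash differs from the stored one."""
    changed = []
    for path in paths:
        row = conn.execute(
            "SELECT content_hash FROM content_hashes WHERE file_path = ?",
            (str(path),),
        ).fetchone()
        if row is None or row[0] != sha256_of(path):
            changed.append(path)
    return changed


def record_hash(conn: sqlite3.Connection, path: Path, content_type: str) -> None:
    """Store the current hash once the file has been (re-)embedded."""
    conn.execute(
        "INSERT INTO content_hashes (file_path, content_hash, content_type)"
        " VALUES (?, ?, ?)"
        " ON CONFLICT(file_path) DO UPDATE SET content_hash = excluded.content_hash",
        (str(path), sha256_of(path), content_type),
    )
```

Recording the hash only after embedding succeeds means a failed run leaves the file marked as changed, so it is retried on the next pass.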

Customer Content Integration

Customers can index their own content via configuration:

~/.coditect/config/index-config.json:

{
  "customer_indexes": [
    {
      "name": "my-project-code",
      "type": "code",
      "paths": ["~/projects/my-app/src"],
      "patterns": ["**/*.py", "**/*.ts", "**/*.js"],
      "exclude": ["**/node_modules/**", "**/__pycache__/**"],
      "chunk_size": 1000,
      "chunk_overlap": 200
    },
    {
      "name": "my-project-docs",
      "type": "document",
      "paths": ["~/projects/my-app/docs"],
      "patterns": ["**/*.md"],
      "exclude": []
    },
    {
      "name": "company-wiki",
      "type": "custom",
      "paths": ["/shared/wiki"],
      "patterns": ["**/*.md", "**/*.txt"],
      "metadata": {"source": "confluence-export"}
    }
  ],
  "embedding_model": "all-MiniLM-L6-v2",
  "chunk_size_default": 500,
  "chunk_overlap_default": 100
}
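
To illustrate how such a configuration might be consumed, here is a minimal sketch that expands one customer_indexes entry into a file list and resolves its chunk settings. It assumes that per-index chunk_size and chunk_overlap fall back to the top-level defaults when omitted, which the config above suggests but does not state; load_config, resolve_index_files and chunk_settings are hypothetical names.

```python
import fnmatch
import json
from pathlib import Path


def load_config(path: Path) -> dict:
    """Read index-config.json into a dict."""
    return json.loads(path.read_text(encoding="utf-8"))


def resolve_index_files(index: dict) -> list[Path]:
    """Expand one customer_indexes entry into a sorted list of matching files."""
    excludes = index.get("exclude", [])
    found = set()
    for root in index["paths"]:
        base = Path(root).expanduser()  # handles "~/projects/..."
        for pattern in index["patterns"]:
            for path in base.glob(pattern):
                # fnmatch's "*" also matches "/", so "**/node_modules/**"
                # excludes anything below a node_modules directory.
                if path.is_file() and not any(
                    fnmatch.fnmatch(path.as_posix(), ex) for ex in excludes
                ):
                    found.add(path)
    return sorted(found)


def chunk_settings(config: dict, index: dict) -> tuple[int, int]:
    """Per-index chunk settings, falling back to the top-level defaults (assumed)."""
    size = index.get("chunk_size", config["chunk_size_default"])
    overlap = index.get("chunk_overlap", config["chunk_overlap_default"])
    return size, overlap
```

With the example configuration, the my-project-code entry would resolve to (1000, 200) and the my-project-docs entry to the defaults (500, 100).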

Search API

# Unified search across all content types
results = cxq_search(
    query="kubernetes deployment patterns",
    content_types=["component", "document", "code"],  # Filter by type
    limit=20,
    threshold=0.7
)

# Returns results from:
# - Agents that mention kubernetes
# - ADRs about deployment
# - Customer code with k8s patterns
# Past sessions discussing kubernetes are returned only if "message"
# is added to content_types.
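
For intuition about what a type-filtered search does underneath, the sketch below runs a brute-force cosine comparison over unified_embeddings. It is not the cxq_search implementation: search_embeddings is a hypothetical name, the BLOB layout (little-endian float32) is an assumption, and the query vector is passed in directly rather than produced by the embedding model.

```python
import math
import sqlite3
import struct


def unpack_vector(blob: bytes) -> tuple:
    """Decode a BLOB of little-endian float32 values (assumed storage format)."""
    return struct.unpack(f"<{len(blob) // 4}f", blob)


def cosine(a, b) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_embeddings(conn, query_vec, content_types, limit=20, threshold=0.7):
    """Score every row of the requested types and keep those above threshold."""
    placeholders = ",".join("?" for _ in content_types)
    rows = conn.execute(
        "SELECT source_path, content_type, embedding FROM unified_embeddings"
        f" WHERE content_type IN ({placeholders})",
        list(content_types),
    )
    scored = []
    for source_path, content_type, blob in rows:
        score = cosine(query_vec, unpack_vector(blob))
        if score >= threshold:
            scored.append((score, source_path, content_type))
    scored.sort(reverse=True)  # highest similarity first
    return scored[:limit]
```

A linear scan like this is only practical for small tables; it is shown here to make the roles of content_types, limit and threshold concrete.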

CLI Integration

# Index framework components (run during /cx)
/cx                              # Auto-indexes changed components

# Index customer project
/cx --index-project ~/my-app     # Index code + docs

# Search across everything
/cxq --semantic "authentication flow"
/cxq --semantic "error handling" --type code
/cxq --semantic "deployment" --type component,document

# Manage indexes
/index --status                  # Show index health
/index --rebuild component       # Force rebuild specific type
/index --add-path ~/new-project  # Add new index path

Customer Use Cases

1. Developer Onboarding

A new developer joins the team and asks Claude:

"How does authentication work in this codebase?"

Claude searches:

Customer's auth code files → finds auth.service.ts, middleware/auth.py
Customer's docs → finds docs/AUTH-GUIDE.md
CODITECT components → finds authentication-authorization skill
Past sessions → finds previous discussions about auth bugs

2. Code Discovery

Developer asks:

"Find all API rate limiting implementations"

Claude searches:

Customer code → finds rate limiter middleware
CODITECT patterns → finds rate-limiting-patterns skill
ADRs → finds decisions about rate limit values

3. Architecture Research

Architect asks:

"What patterns do we use for event-driven systems?"

Claude searches:

Customer code → finds Kafka consumers, event handlers
CODITECT components → finds event-driven-architecture skill
ADRs → finds ADR-045-event-sourcing-decision.md
Past sessions → finds previous architecture discussions

4. Bug Investigation

Developer asks:

"Has anyone seen this TypeError before?"

Claude searches:

error_solutions table → finds past fixes
Past sessions → finds similar error discussions
Customer code → finds related error handlers

Database Schema

-- Unified embedding storage
CREATE TABLE unified_embeddings (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    content_hash TEXT NOT NULL,           -- SHA256 of content
    content_type TEXT NOT NULL,           -- message|component|document|code|custom
    source_path TEXT,                     -- File path or message ID
    source_name TEXT,                     -- Human-readable name
    chunk_index INTEGER DEFAULT 0,        -- For multi-chunk documents
    chunk_total INTEGER DEFAULT 1,        -- Total chunks
    content_preview TEXT,                 -- First 500 chars
    embedding BLOB NOT NULL,              -- 384-dim vector (all-MiniLM-L6-v2)
    model TEXT NOT NULL,                  -- Embedding model name
    metadata TEXT,                        -- JSON: extra info
    customer_index TEXT,                  -- Customer index name (null for framework)
    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
    updated_at TEXT DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(content_hash, chunk_index)
);

-- Hash tracking for change detection
CREATE TABLE content_hashes (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    file_path TEXT UNIQUE NOT NULL,
    content_hash TEXT NOT NULL,           -- SHA256 of content
    file_size INTEGER,
    mtime REAL,                           -- File modification time
    content_type TEXT NOT NULL,
    customer_index TEXT,                  -- Customer index name
    chunk_count INTEGER DEFAULT 1,
    indexed_at TEXT DEFAULT CURRENT_TIMESTAMP,
    last_checked TEXT DEFAULT CURRENT_TIMESTAMP
);

-- Customer index configuration
CREATE TABLE customer_indexes (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT UNIQUE NOT NULL,
    config TEXT NOT NULL,                 -- JSON configuration
    file_count INTEGER DEFAULT 0,
    embedding_count INTEGER DEFAULT 0,
    last_indexed TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);

-- Indexes for fast lookup
CREATE INDEX idx_unified_embeddings_type ON unified_embeddings(content_type);
CREATE INDEX idx_unified_embeddings_hash ON unified_embeddings(content_hash);
CREATE INDEX idx_unified_embeddings_customer ON unified_embeddings(customer_index);
CREATE INDEX idx_content_hashes_type ON content_hashes(content_type);
CREATE INDEX idx_content_hashes_customer ON content_hashes(customer_index);
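
The UNIQUE(content_hash, chunk_index) constraint lends itself to an upsert when a chunk is re-embedded. The sketch below shows one way this could look with Python's sqlite3 module; upsert_embedding and pack_vector are hypothetical helpers, the float32 BLOB layout is an assumption, and ON CONFLICT requires SQLite 3.24 or newer.

```python
import json
import sqlite3
import struct


def pack_vector(vec) -> bytes:
    """Encode floats as little-endian float32 (an assumed BLOB layout)."""
    return struct.pack(f"<{len(vec)}f", *vec)


def upsert_embedding(conn, content_hash, content_type, source_path,
                     chunk_index, vec, model, metadata=None):
    """Insert a chunk embedding, or refresh it if (content_hash, chunk_index) exists."""
    conn.execute(
        """
        INSERT INTO unified_embeddings
            (content_hash, content_type, source_path, chunk_index,
             embedding, model, metadata)
        VALUES (?, ?, ?, ?, ?, ?, ?)
        ON CONFLICT(content_hash, chunk_index) DO UPDATE SET
            embedding = excluded.embedding,
            model = excluded.model,
            metadata = excluded.metadata,
            updated_at = CURRENT_TIMESTAMP
        """,
        (content_hash, content_type, source_path, chunk_index,
         pack_vector(vec), model, json.dumps(metadata or {})),
    )
```

Packed this way, a 384-dimensional vector occupies 384 × 4 = 1,536 bytes per row, before the preview and metadata columns.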

Implementation Phases

Phase 1: Hash-Based Change Tracking (H.5.7.1)

Add content_hashes table
Implement hash computation and comparison
Integrate with /cx pipeline
Skip unchanged files during indexing

Phase 2: Component Embeddings (H.5.7.2)

Add unified_embeddings table
Index all 2,331 framework components
Integrate with /cxq --semantic
Add --type component filter

Phase 3: Customer Code Indexing (H.5.7.3)

Add customer_indexes table
Implement index-config.json parsing
Add /cx --index-project command
Chunking strategy for large files
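
One possible chunking strategy is a sliding window with overlap, matching the chunk_size and chunk_overlap settings in index-config.json. The sketch below assumes both values are measured in characters, which this ADR does not specify; chunk_text is a hypothetical helper.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> list[str]:
    """Split text into windows of chunk_size characters overlapping by chunk_overlap."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

With chunk_size=1000 and chunk_overlap=200, each window advances by 800 characters, so a 2,500-character file yields three chunks. The overlap keeps text that straddles a boundary visible in both neighbouring chunks.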

Phase 4: Unified Search (H.5.7.4)

Hybrid search across all content types
Relevance scoring with type weighting
Result grouping by content type
Performance optimization
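
The References section names Reciprocal Rank Fusion for hybrid scoring. As a rough sketch of how Phase 4 could combine rankings, the code below fuses several ranked lists of result ids, applies per-type weights, and groups the output by content type. The function names, the weight values and the constant k=60 (a commonly used default) are assumptions, not decisions recorded in this ADR.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of ids: score(id) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def apply_type_weights(fused, type_of, weights):
    """Scale fused scores by a per-content-type weight (weights are hypothetical)."""
    weighted = [(i, s * weights.get(type_of[i], 1.0)) for i, s in fused]
    return sorted(weighted, key=lambda kv: kv[1], reverse=True)


def group_by_type(results, type_of):
    """Group (id, score) pairs by content type, preserving score order."""
    groups = defaultdict(list)
    for item_id, score in results:
        groups[type_of[item_id]].append((item_id, score))
    return dict(groups)
```

Because the fusion uses only rank positions, it can merge a semantic ranking with a keyword ranking without normalising their raw scores against each other.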

Consequences

Positive

Semantic search everywhere - Find agents, skills, code, docs by meaning
Incremental updates - Only re-embed changed content (10-100x faster)
Customer extensibility - Index any codebase or doc collection
Unified experience - One search across all knowledge sources
Better agent discovery - /which can use semantic matching

Negative

Storage increase - ~1.5 KB per embedding vector if stored as float32 (384 × 4 bytes), plus preview and metadata, × content count
Initial indexing time - ~5-10 minutes for full framework
Embedding model dependency - Requires sentence-transformers
Complexity - More tables, more code paths

Mitigations

Storage: Embeddings compress well; 384-dim vectors are compact
Initial time: One-time cost; incremental updates are fast
Dependency: Already required for semantic search; graceful fallback
Complexity: Clear separation of concerns; phased rollout

Performance Estimates

Operation	Time	Notes
Hash check (1 file)	<1ms	One file read + SHA256; O(n) in file size
Hash check (2,331 files)	~2s	Parallel file I/O
Embed 1 component	~50ms	GPU: ~5ms
Embed 2,331 components	~2 min	Initial only
Incremental embed (10 changed)	~500ms	Typical /cx run
Semantic search	~100ms	Vector similarity + ranking

Security Considerations

Customer code isolation - Each customer index is separate
Path validation - Prevent indexing outside allowed paths
Sensitive file exclusion - Default exclude for .env, secrets, credentials
Hash-only tracking - Full content is not stored; only hashes, embeddings, and a short preview (content_preview, first 500 chars)
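
Path validation and sensitive-file exclusion could be enforced before any file is hashed or embedded. The following is a minimal sketch under stated assumptions: the pattern list is a hypothetical default built from the names mentioned above (.env, secrets, credentials), the helper names are illustrative, and Path.is_relative_to requires Python 3.9 or newer.

```python
import fnmatch
from pathlib import Path

# Hypothetical default deny-list; the ADR names .env, secrets, and credentials.
DEFAULT_SENSITIVE = [".env", ".env.*", "*secret*", "*credential*"]


def is_within_allowed_roots(path: Path, allowed_roots: list[Path]) -> bool:
    """True if the resolved path lies inside one of the allowed roots."""
    resolved = path.resolve()  # collapses ".." and follows symlinks
    return any(resolved.is_relative_to(root.resolve()) for root in allowed_roots)


def is_sensitive(path: Path, patterns=DEFAULT_SENSITIVE) -> bool:
    """True if the file name matches a sensitive-file pattern."""
    name = path.name.lower()
    return any(fnmatch.fnmatch(name, pat) for pat in patterns)


def may_index(path: Path, allowed_roots: list[Path]) -> bool:
    """A file is indexable only if it is inside an allowed root and not sensitive."""
    return is_within_allowed_roots(path, allowed_roots) and not is_sensitive(path)
```

Resolving the path before the containment check is what stops a ".." segment or a symlink from escaping the configured index paths. This sketch matches on file names only; excluding whole directories would need an additional check.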

Related

ADR-079: Trajectory Visualization System (tool analytics)
ADR-089: Two-Database Architecture (context.db vs platform.db)
ADR-101: Database Usage Patterns (query patterns)
ADR-103: Four-Database Separation Architecture (supersedes database layout)
MCP Server: mcp-semantic-search (current implementation)
Task: H.5.7 - Universal Semantic Index

Implementation Note

This ADR defines the semantic indexing concepts. ADR-103 defines WHERE the indexes live:

Framework embeddings → platform-index.db
Customer session embeddings → context.db
Customer project embeddings → projects.db (unified, cloud-integrated)

Cloud Integration (ADR-103)

Projects in projects.db are identified by globally unique UUIDs assigned by CODITECT cloud:

Field	Purpose
`project_uuid`	CODITECT cloud UUID (primary identifier)
`github_repo_url`	Auto-detected from git remote
`tenant_id`	Multi-tenant organization isolation
`team_id`	Team-scoped access control
`parent_project_uuid`	Monorepo/submodule hierarchy

All embeddings in projects.db reference project_uuid, not local auto-increment IDs, enabling cloud sync with auth.coditect.ai.

References

Sentence Transformers - Embedding model
all-MiniLM-L6-v2 - 384-dim embeddings
Reciprocal Rank Fusion - Hybrid scoring
