ADR-022: Codebase Management Indexing and Search Expansion

Status: Proposed Date: 2025-12-18 Updated: 2026-02-03 (Migrated to coditect-core per ADR-150) Deciders: Hal Casteel (Founder/CEO/CTO), CODITECT Core Team Technical Story: Expand /cx and /cxq to support full codebase indexing, symbol extraction, and cross-reference search

Context and Problem Statement

The current /cx and /cxq system successfully indexes:

76,485 messages from Claude Code sessions
134,153 decisions extracted from conversations
645,855 code patterns from assistant responses
3,880 documents across multi-workspace federated catalog

However, the actual codebase is not indexed. Developers cannot:

Search code symbols - Find function definitions, class declarations, or variable usages
Track dependencies - Understand import relationships between files
Cross-reference code - Find where a function is called or a class is instantiated
Navigate code semantically - Search by concept ("authentication handler") not just text
Understand codebase structure - Get architectural overview of modules and packages

The Problem: How do we expand /cxq to become a comprehensive codebase intelligence platform while maintaining performance?

Decision Drivers

Technical Requirements

ID	Requirement	Priority
R1	Index source files across all 64 submodules	P0
R2	Extract symbols (functions, classes, variables, imports)	P0
R3	Track file dependencies and import graphs	P1
R4	Semantic code search (concept-based, not just text)	P1
R5	Cross-reference lookup (find usages, go to definition)	P1
R6	Language-agnostic support (Python, TypeScript, Rust, Go)	P1
R7	Incremental indexing (only changed files)	P0
R8	Sub-second query performance for 100K+ symbols	P0
R9	Integration with existing doc_index and messages tables	P0
R10	Workspace-scoped queries (per-submodule isolation)	P0

Scale Projections

Metric	Current	Projected (Codebase)	Growth Factor
Documents	3,880	3,880	1x
Source Files	0	~15,000	-
Symbols	0	~500,000	-
Dependencies	0	~100,000	-
Cross-References	0	~2,000,000	-
Database Size	791 MB	~3-5 GB	4-6x

Proposed Schema Expansion

ADR-118 Compliance: These tables belong in sessions.db (Tier 3) as regenerable codebase index data.

New Tables

-- =============================================================================
-- SOURCE FILE INDEX
-- Storage: sessions.db (Tier 3 - Regenerable)
-- =============================================================================
CREATE TABLE source_files (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    file_hash TEXT UNIQUE NOT NULL,          -- SHA256 of workspace_id + relative_path
    workspace_id TEXT NOT NULL,               -- From workspace_registry
    workspace_name TEXT,                      -- Human-readable workspace
    relative_path TEXT NOT NULL,              -- Path from workspace root
    absolute_path TEXT,                       -- Full path for direct access
    language TEXT NOT NULL,                   -- python, typescript, rust, go, etc.
    file_type TEXT,                           -- source, test, config, script
    line_count INTEGER DEFAULT 0,
    byte_size INTEGER DEFAULT 0,
    content_hash TEXT,                        -- SHA256 of file content
    last_modified TEXT,                       -- File mtime
    indexed_at TEXT DEFAULT CURRENT_TIMESTAMP,

    -- Parsed metadata
    has_imports BOOLEAN DEFAULT FALSE,
    has_exports BOOLEAN DEFAULT FALSE,
    has_classes BOOLEAN DEFAULT FALSE,
    has_functions BOOLEAN DEFAULT FALSE,
    import_count INTEGER DEFAULT 0,
    export_count INTEGER DEFAULT 0,
    class_count INTEGER DEFAULT 0,
    function_count INTEGER DEFAULT 0,

    FOREIGN KEY (workspace_id) REFERENCES workspace_registry(workspace_id),
    UNIQUE(workspace_id, relative_path)
);

CREATE INDEX idx_source_workspace ON source_files(workspace_id);
CREATE INDEX idx_source_language ON source_files(language);
CREATE INDEX idx_source_type ON source_files(file_type);

-- FTS for file path and content search
CREATE VIRTUAL TABLE source_files_fts USING fts5(
    relative_path,
    language,
    content='source_files',
    content_rowid='id'
);

-- =============================================================================
-- SYMBOL INDEX (Functions, Classes, Variables, Constants)
-- Storage: sessions.db (Tier 3 - Regenerable)
-- =============================================================================
CREATE TABLE symbols (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    symbol_hash TEXT UNIQUE NOT NULL,         -- SHA256 of file_hash + name + line
    file_id INTEGER NOT NULL,                 -- FK to source_files
    workspace_id TEXT NOT NULL,               -- Denormalized for fast queries

    -- Symbol identification
    name TEXT NOT NULL,                       -- Symbol name
    qualified_name TEXT,                      -- Full qualified name (module.class.method)
    symbol_type TEXT NOT NULL,                -- function, class, method, variable, constant, interface, type

    -- Location
    line_start INTEGER NOT NULL,
    line_end INTEGER,
    column_start INTEGER,
    column_end INTEGER,

    -- Metadata
    visibility TEXT,                          -- public, private, protected, internal
    is_async BOOLEAN DEFAULT FALSE,
    is_static BOOLEAN DEFAULT FALSE,
    is_abstract BOOLEAN DEFAULT FALSE,
    is_exported BOOLEAN DEFAULT FALSE,

    -- Documentation
    docstring TEXT,                           -- Extracted docstring/JSDoc
    signature TEXT,                           -- Function signature
    return_type TEXT,                         -- Return type annotation
    parameters TEXT,                          -- JSON array of parameters

    -- Parent relationships
    parent_symbol_id INTEGER,                 -- For methods inside classes

    indexed_at TEXT DEFAULT CURRENT_TIMESTAMP,

    FOREIGN KEY (file_id) REFERENCES source_files(id) ON DELETE CASCADE,
    FOREIGN KEY (parent_symbol_id) REFERENCES symbols(id)
);

CREATE INDEX idx_symbols_file ON symbols(file_id);
CREATE INDEX idx_symbols_workspace ON symbols(workspace_id);
CREATE INDEX idx_symbols_name ON symbols(name);
CREATE INDEX idx_symbols_type ON symbols(symbol_type);
CREATE INDEX idx_symbols_qualified ON symbols(qualified_name);

-- FTS for symbol search
CREATE VIRTUAL TABLE symbols_fts USING fts5(
    name,
    qualified_name,
    docstring,
    signature,
    content='symbols',
    content_rowid='id'
);

-- =============================================================================
-- DEPENDENCY GRAPH (Imports and Exports)
-- Storage: sessions.db (Tier 3 - Regenerable)
-- =============================================================================
CREATE TABLE dependencies (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    dep_hash TEXT UNIQUE NOT NULL,            -- SHA256 of from_file + to_target + import_name

    -- Source file
    from_file_id INTEGER NOT NULL,            -- FK to source_files
    from_workspace_id TEXT NOT NULL,

    -- Target (can be file or external package)
    to_file_id INTEGER,                       -- FK to source_files (NULL if external)
    to_module TEXT NOT NULL,                  -- Module/package path
    to_workspace_id TEXT,                     -- NULL if external

    -- Import details
    import_type TEXT NOT NULL,                -- import, from_import, require, use, include
    import_name TEXT,                         -- Specific import (e.g., "Dict" from "typing")
    alias TEXT,                               -- Import alias (as X)
    is_external BOOLEAN DEFAULT FALSE,        -- External package vs internal
    is_type_only BOOLEAN DEFAULT FALSE,       -- TypeScript type-only imports

    line_number INTEGER,

    indexed_at TEXT DEFAULT CURRENT_TIMESTAMP,

    FOREIGN KEY (from_file_id) REFERENCES source_files(id) ON DELETE CASCADE,
    FOREIGN KEY (to_file_id) REFERENCES source_files(id) ON DELETE SET NULL
);

CREATE INDEX idx_deps_from ON dependencies(from_file_id);
CREATE INDEX idx_deps_to ON dependencies(to_file_id);
CREATE INDEX idx_deps_module ON dependencies(to_module);
CREATE INDEX idx_deps_external ON dependencies(is_external);

-- =============================================================================
-- CROSS-REFERENCES (Symbol Usages)
-- Storage: sessions.db (Tier 3 - Regenerable)
-- =============================================================================
CREATE TABLE cross_references (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ref_hash TEXT UNIQUE NOT NULL,            -- SHA256 of symbol_id + file_id + line

    -- The symbol being referenced
    symbol_id INTEGER NOT NULL,               -- FK to symbols
    symbol_name TEXT NOT NULL,                -- Denormalized for fast display

    -- Where it's referenced
    file_id INTEGER NOT NULL,                 -- FK to source_files
    workspace_id TEXT NOT NULL,
    line_number INTEGER NOT NULL,
    column_number INTEGER,

    -- Reference type
    ref_type TEXT NOT NULL,                   -- call, instantiation, read, write, type_annotation

    -- Context
    context_line TEXT,                        -- The line of code containing the reference

    indexed_at TEXT DEFAULT CURRENT_TIMESTAMP,

    FOREIGN KEY (symbol_id) REFERENCES symbols(id) ON DELETE CASCADE,
    FOREIGN KEY (file_id) REFERENCES source_files(id) ON DELETE CASCADE
);

CREATE INDEX idx_xref_symbol ON cross_references(symbol_id);
CREATE INDEX idx_xref_file ON cross_references(file_id);
CREATE INDEX idx_xref_workspace ON cross_references(workspace_id);
CREATE INDEX idx_xref_type ON cross_references(ref_type);

-- =============================================================================
-- CODE EMBEDDINGS (Semantic Search)
-- Storage: sessions.db (Tier 3 - Regenerable)
-- =============================================================================
CREATE TABLE code_embeddings (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source_type TEXT NOT NULL,                -- file, symbol, docstring
    source_id INTEGER NOT NULL,               -- FK to source_files or symbols
    workspace_id TEXT NOT NULL,

    -- Embedding data
    embedding BLOB NOT NULL,                  -- numpy array as bytes
    model TEXT NOT NULL,                      -- e.g., 'all-MiniLM-L6-v2'
    dimensions INTEGER NOT NULL,              -- 384 for MiniLM

    -- Source text (for regeneration)
    source_text TEXT,                         -- The text that was embedded

    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_code_emb_source ON code_embeddings(source_type, source_id);
CREATE INDEX idx_code_emb_workspace ON code_embeddings(workspace_id);

CLI Extensions

New /cxq Commands

# Index codebase for a workspace
/cxq --index-code                           # Index current workspace
/cxq --index-code --workspace NAME          # Index specific workspace
/cxq --index-code --all-workspaces          # Index all workspaces
/cxq --rebuild-code                         # Full rebuild

# Search code
/cxq --code "query"                         # Search symbols and files
/cxq --code "query" --type function         # Filter by symbol type
/cxq --code "query" --language python       # Filter by language
/cxq --code "query" --workspace NAME        # Scope to workspace

# Symbol lookup
/cxq --symbol "ClassName.method"            # Find symbol definition
/cxq --find-usages "function_name"          # Find all usages
/cxq --go-to-definition "symbol"            # Show definition location

# Dependency analysis
/cxq --imports FILE                         # Show file's imports
/cxq --importers FILE                       # Show who imports this file
/cxq --dependency-graph                     # Generate dependency graph
/cxq --external-deps                        # List external packages

# Code statistics
/cxq --code-stats                           # Codebase statistics
/cxq --code-stats --workspace NAME          # Per-workspace stats

Language Support

Phase 1 (Core Languages)

Language	Parser	Symbol Types
Python	`ast` (stdlib)	functions, classes, methods, variables, imports
TypeScript/JavaScript	`@typescript-eslint/parser`	functions, classes, interfaces, types, imports
Rust	`tree-sitter-rust`	functions, structs, traits, impl blocks, mods
Go	`go/parser`	functions, structs, interfaces, packages

Phase 2 (Extended)

Language	Parser
Java	tree-sitter-java
C/C++	tree-sitter-c, tree-sitter-cpp
Ruby	tree-sitter-ruby
Shell	tree-sitter-bash

Parser Strategy

# Language detection by extension
LANGUAGE_MAP = {
    '.py': 'python',
    '.pyi': 'python',
    '.ts': 'typescript',
    '.tsx': 'typescript',
    '.js': 'javascript',
    '.jsx': 'javascript',
    '.rs': 'rust',
    '.go': 'go',
    '.java': 'java',
    '.rb': 'ruby',
    '.sh': 'shell',
    '.bash': 'shell',
}

# Parser selection
def get_parser(language: str):
    if language == 'python':
        return PythonASTParser()
    elif language in ('typescript', 'javascript'):
        return TreeSitterParser('typescript')
    elif language == 'rust':
        return TreeSitterParser('rust')
    elif language == 'go':
        return TreeSitterParser('go')
    else:
        return GenericParser()  # Regex-based fallback

Performance Considerations

Indexing Performance

Operation	Target	Strategy
Initial index (15K files)	< 5 minutes	Parallel parsing, batch inserts
Incremental update (1 file)	< 1 second	Content hash comparison
Symbol extraction (per file)	< 100ms	Native parsers, no disk I/O
Embedding generation (per file)	< 500ms	Batch processing, GPU if available

Query Performance

Query Type	Target	Strategy
Symbol search	< 100ms	FTS5 + indexes
Find usages	< 200ms	Indexed cross_references
Dependency graph	< 500ms	Recursive CTE + caching
Semantic search	< 300ms	Vector similarity + LIMIT

Optimization Strategies

Batch Inserts - Use INSERT INTO ... VALUES with 1000 rows per batch
Write-Ahead Logging - PRAGMA journal_mode=WAL for concurrent reads
Memory-Mapped I/O - PRAGMA mmap_size=268435456 (256MB)
Covering Indexes - Include frequently queried columns in indexes
Partial Indexes - Index only relevant subsets (e.g., exported symbols)

Integration Points

With Existing Tables (ADR-118 Compliant)

┌─────────────────────────────────────────────────────────────────┐
│                 SESSIONS.DB (Tier 3 - Regenerable)              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  EXISTING (Sessions)                   NEW (Codebase)           │
│  ├── messages (76K)                    ├── source_files (15K)   │
│  ├── tool_analytics                    ├── symbols (500K)       │
│  ├── token_economics      ←──────────→ ├── dependencies (100K)  │
│  └── workspace_registry   ←──────────→ ├── cross_references (2M)│
│                                        └── code_embeddings (50K)│
│                                                                 │
│  LINKAGES:                                                      │
│  • code_patterns.source_file → source_files.id                 │
│  • All tables share workspace_id from workspace_registry        │
│  • doc_index and source_files unified search                   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

With Multi-Workspace System

All new tables include workspace_id for isolation
Queries support --workspace and --all-workspaces filters
Worktrees automatically register and index their code

Alternatives Considered

A1: Separate Database for Code

Pros: Isolation, different optimization profile Cons: Complex joins, two databases to manage, sync issues Decision: Rejected - unified database preferred

A2: Language Server Protocol (LSP) Integration

Pros: Mature tooling, real-time updates Cons: Requires running servers, memory overhead, no offline indexing Decision: Deferred to Phase 2 as enhancement

A3: External Search Engine (Elasticsearch/Meilisearch)

Pros: Faster at scale, better fuzzy matching Cons: External dependency, deployment complexity Decision: Rejected for MVP - reconsider at 1M+ symbols

Implementation Phases

Phase 1: Core Indexing (Week 1-2)

Phase 2: Cross-References (Week 3)

Implement cross-reference extraction
Find usages functionality
Go to definition

Phase 3: Dependencies (Week 4)

Import/export tracking
Dependency graph generation
External package detection

Phase 4: Semantic Search (Week 5)

Code embeddings
Concept-based search
Similar code detection

Success Metrics

Metric	Target	Measurement
Index coverage	100% of source files	Files indexed / total files
Symbol accuracy	> 95%	Manual validation sample
Query latency P95	< 200ms	Performance monitoring
Incremental index time	< 2s per file	Benchmark
User adoption	> 50% of /cxq queries	Usage analytics

ADR-020 - Context Extraction (/cx)
ADR-021 - Context Query System (/cxq)
ADR-023 - SQLite Scalability Analysis
ADR-025 - Comprehensive Entry Schema
ADR-118 - Four-Tier Database Architecture

References

Decision: Approved for implementation Migrated: 2026-02-03 (ADR-150) Review Date: 2026-06-18

Context and Problem Statement​

Decision Drivers​

Technical Requirements​

Scale Projections​

Proposed Schema Expansion​

New Tables​

CLI Extensions​

New /cxq Commands​

Language Support​

Phase 1 (Core Languages)​

Phase 2 (Extended)​

Parser Strategy​

Performance Considerations​

Indexing Performance​

Query Performance​

Optimization Strategies​

Integration Points​

With Existing Tables (ADR-118 Compliant)​

With Multi-Workspace System​

Alternatives Considered​

A1: Separate Database for Code​

A2: Language Server Protocol (LSP) Integration​

A3: External Search Engine (Elasticsearch/Meilisearch)​

Implementation Phases​

Phase 1: Core Indexing (Week 1-2)​

Phase 2: Cross-References (Week 3)​

Phase 3: Dependencies (Week 4)​

Phase 4: Semantic Search (Week 5)​

Success Metrics​

Related ADRs​

References​