ADR-022: Codebase Management Indexing and Search Expansion
Status: Proposed
Date: 2025-12-18
Updated: 2026-02-03 (Migrated to coditect-core per ADR-150)
Deciders: Hal Casteel (Founder/CEO/CTO), CODITECT Core Team
Technical Story: Expand /cx and /cxq to support full codebase indexing, symbol extraction, and cross-reference search
Context and Problem Statement
The current /cx and /cxq system successfully indexes:
- 76,485 messages from Claude Code sessions
- 134,153 decisions extracted from conversations
- 645,855 code patterns from assistant responses
- 3,880 documents across multi-workspace federated catalog
However, the actual codebase is not indexed. Developers cannot:
- Search code symbols - Find function definitions, class declarations, or variable usages
- Track dependencies - Understand import relationships between files
- Cross-reference code - Find where a function is called or a class is instantiated
- Navigate code semantically - Search by concept ("authentication handler") not just text
- Understand codebase structure - Get architectural overview of modules and packages
The Problem: How do we expand /cxq to become a comprehensive codebase intelligence platform while maintaining performance?
Decision Drivers
Technical Requirements
| ID | Requirement | Priority |
|---|---|---|
| R1 | Index source files across all 64 submodules | P0 |
| R2 | Extract symbols (functions, classes, variables, imports) | P0 |
| R3 | Track file dependencies and import graphs | P1 |
| R4 | Semantic code search (concept-based, not just text) | P1 |
| R5 | Cross-reference lookup (find usages, go to definition) | P1 |
| R6 | Language-agnostic support (Python, TypeScript, Rust, Go) | P1 |
| R7 | Incremental indexing (only changed files) | P0 |
| R8 | Sub-second query performance for 100K+ symbols | P0 |
| R9 | Integration with existing doc_index and messages tables | P0 |
| R10 | Workspace-scoped queries (per-submodule isolation) | P0 |
Scale Projections
| Metric | Current | Projected (Codebase) | Growth Factor |
|---|---|---|---|
| Documents | 3,880 | 3,880 | 1x |
| Source Files | 0 | ~15,000 | - |
| Symbols | 0 | ~500,000 | - |
| Dependencies | 0 | ~100,000 | - |
| Cross-References | 0 | ~2,000,000 | - |
| Database Size | 791 MB | ~3-5 GB | 4-6x |
Proposed Schema Expansion
ADR-118 Compliance: These tables belong in `sessions.db` (Tier 3) as regenerable codebase index data.
New Tables
```sql
-- =============================================================================
-- SOURCE FILE INDEX
-- Storage: sessions.db (Tier 3 - Regenerable)
-- =============================================================================
CREATE TABLE source_files (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    file_hash TEXT UNIQUE NOT NULL,        -- SHA256 of workspace_id + relative_path
    workspace_id TEXT NOT NULL,            -- From workspace_registry
    workspace_name TEXT,                   -- Human-readable workspace
    relative_path TEXT NOT NULL,           -- Path from workspace root
    absolute_path TEXT,                    -- Full path for direct access
    language TEXT NOT NULL,                -- python, typescript, rust, go, etc.
    file_type TEXT,                        -- source, test, config, script
    line_count INTEGER DEFAULT 0,
    byte_size INTEGER DEFAULT 0,
    content_hash TEXT,                     -- SHA256 of file content
    last_modified TEXT,                    -- File mtime
    indexed_at TEXT DEFAULT CURRENT_TIMESTAMP,
    -- Parsed metadata
    has_imports BOOLEAN DEFAULT FALSE,
    has_exports BOOLEAN DEFAULT FALSE,
    has_classes BOOLEAN DEFAULT FALSE,
    has_functions BOOLEAN DEFAULT FALSE,
    import_count INTEGER DEFAULT 0,
    export_count INTEGER DEFAULT 0,
    class_count INTEGER DEFAULT 0,
    function_count INTEGER DEFAULT 0,
    FOREIGN KEY (workspace_id) REFERENCES workspace_registry(workspace_id),
    UNIQUE(workspace_id, relative_path)
);

CREATE INDEX idx_source_workspace ON source_files(workspace_id);
CREATE INDEX idx_source_language ON source_files(language);
CREATE INDEX idx_source_type ON source_files(file_type);

-- FTS for file path and language search
CREATE VIRTUAL TABLE source_files_fts USING fts5(
    relative_path,
    language,
    content='source_files',
    content_rowid='id'
);
```
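One operational caveat: FTS5 external-content tables (`content='source_files'`) do not track writes to the base table automatically, so the indexer must mirror inserts into `source_files_fts` itself (or install `AFTER INSERT/UPDATE/DELETE` triggers). A minimal sketch of an indexer-side sync; the `register_file` helper is hypothetical, not part of the ADR:

```python
import hashlib
import sqlite3

def register_file(conn: sqlite3.Connection, workspace_id: str,
                  relative_path: str, language: str) -> int:
    # file_hash per the schema comment: SHA256 of workspace_id + relative_path
    file_hash = hashlib.sha256((workspace_id + relative_path).encode()).hexdigest()
    cur = conn.execute(
        "INSERT INTO source_files (file_hash, workspace_id, relative_path, language) "
        "VALUES (?, ?, ?, ?)",
        (file_hash, workspace_id, relative_path, language))
    # Mirror the row into the FTS index so MATCH queries can see it
    conn.execute(
        "INSERT INTO source_files_fts (rowid, relative_path, language) "
        "VALUES (?, ?, ?)",
        (cur.lastrowid, relative_path, language))
    return cur.lastrowid
```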
```sql
-- =============================================================================
-- SYMBOL INDEX (Functions, Classes, Variables, Constants)
-- Storage: sessions.db (Tier 3 - Regenerable)
-- =============================================================================
CREATE TABLE symbols (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    symbol_hash TEXT UNIQUE NOT NULL,      -- SHA256 of file_hash + name + line
    file_id INTEGER NOT NULL,              -- FK to source_files
    workspace_id TEXT NOT NULL,            -- Denormalized for fast queries
    -- Symbol identification
    name TEXT NOT NULL,                    -- Symbol name
    qualified_name TEXT,                   -- Full qualified name (module.class.method)
    symbol_type TEXT NOT NULL,             -- function, class, method, variable, constant, interface, type
    -- Location
    line_start INTEGER NOT NULL,
    line_end INTEGER,
    column_start INTEGER,
    column_end INTEGER,
    -- Metadata
    visibility TEXT,                       -- public, private, protected, internal
    is_async BOOLEAN DEFAULT FALSE,
    is_static BOOLEAN DEFAULT FALSE,
    is_abstract BOOLEAN DEFAULT FALSE,
    is_exported BOOLEAN DEFAULT FALSE,
    -- Documentation
    docstring TEXT,                        -- Extracted docstring/JSDoc
    signature TEXT,                        -- Function signature
    return_type TEXT,                      -- Return type annotation
    parameters TEXT,                       -- JSON array of parameters
    -- Parent relationships
    parent_symbol_id INTEGER,              -- For methods inside classes
    indexed_at TEXT DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (file_id) REFERENCES source_files(id) ON DELETE CASCADE,
    FOREIGN KEY (parent_symbol_id) REFERENCES symbols(id)
);

CREATE INDEX idx_symbols_file ON symbols(file_id);
CREATE INDEX idx_symbols_workspace ON symbols(workspace_id);
CREATE INDEX idx_symbols_name ON symbols(name);
CREATE INDEX idx_symbols_type ON symbols(symbol_type);
CREATE INDEX idx_symbols_qualified ON symbols(qualified_name);

-- FTS for symbol search
CREATE VIRTUAL TABLE symbols_fts USING fts5(
    name,
    qualified_name,
    docstring,
    signature,
    content='symbols',
    content_rowid='id'
);
```
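The symbol index backs queries like `/cxq --code "query" --type function`. A sketch of the underlying lookup, joining `symbols_fts` back to `symbols`; the `search_symbols` helper itself is illustrative, though the table and column names follow the schema above:

```python
import sqlite3
from typing import Optional

def search_symbols(conn: sqlite3.Connection, query: str,
                   symbol_type: Optional[str] = None, limit: int = 20):
    sql = (
        "SELECT s.name, s.qualified_name, s.symbol_type, s.line_start "
        "FROM symbols_fts "
        "JOIN symbols s ON s.id = symbols_fts.rowid "
        "WHERE symbols_fts MATCH ?")
    params = [query]
    if symbol_type:
        sql += " AND s.symbol_type = ?"
        params.append(symbol_type)
    # bm25() scores are negative; ascending order = most relevant first
    sql += " ORDER BY bm25(symbols_fts) LIMIT ?"
    params.append(limit)
    return conn.execute(sql, params).fetchall()
```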
```sql
-- =============================================================================
-- DEPENDENCY GRAPH (Imports and Exports)
-- Storage: sessions.db (Tier 3 - Regenerable)
-- =============================================================================
CREATE TABLE dependencies (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    dep_hash TEXT UNIQUE NOT NULL,         -- SHA256 of from_file + to_target + import_name
    -- Source file
    from_file_id INTEGER NOT NULL,         -- FK to source_files
    from_workspace_id TEXT NOT NULL,
    -- Target (can be file or external package)
    to_file_id INTEGER,                    -- FK to source_files (NULL if external)
    to_module TEXT NOT NULL,               -- Module/package path
    to_workspace_id TEXT,                  -- NULL if external
    -- Import details
    import_type TEXT NOT NULL,             -- import, from_import, require, use, include
    import_name TEXT,                      -- Specific import (e.g., "Dict" from "typing")
    alias TEXT,                            -- Import alias (as X)
    is_external BOOLEAN DEFAULT FALSE,     -- External package vs internal
    is_type_only BOOLEAN DEFAULT FALSE,    -- TypeScript type-only imports
    line_number INTEGER,
    indexed_at TEXT DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (from_file_id) REFERENCES source_files(id) ON DELETE CASCADE,
    FOREIGN KEY (to_file_id) REFERENCES source_files(id) ON DELETE SET NULL
);

CREATE INDEX idx_deps_from ON dependencies(from_file_id);
CREATE INDEX idx_deps_to ON dependencies(to_file_id);
CREATE INDEX idx_deps_module ON dependencies(to_module);
CREATE INDEX idx_deps_external ON dependencies(is_external);
```
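A transitive dependency walk (the basis for `/cxq --dependency-graph`) can be expressed as a recursive CTE over this table. A sketch; the `transitive_imports` helper and the depth cap are illustrative, not part of the ADR:

```python
import sqlite3

def transitive_imports(conn: sqlite3.Connection, file_id: int,
                       max_depth: int = 10):
    """Return all internal file IDs reachable from file_id via imports."""
    sql = """
        WITH RECURSIVE closure(file_id, depth) AS (
            SELECT to_file_id, 1 FROM dependencies
            WHERE from_file_id = :root AND to_file_id IS NOT NULL
            UNION
            SELECT d.to_file_id, c.depth + 1
            FROM dependencies d
            JOIN closure c ON d.from_file_id = c.file_id
            WHERE d.to_file_id IS NOT NULL AND c.depth < :max_depth
        )
        SELECT DISTINCT file_id FROM closure ORDER BY file_id
    """
    return [row[0] for row in
            conn.execute(sql, {"root": file_id, "max_depth": max_depth})]
```

The depth cap guarantees termination even on import cycles, which do occur in real codebases.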
```sql
-- =============================================================================
-- CROSS-REFERENCES (Symbol Usages)
-- Storage: sessions.db (Tier 3 - Regenerable)
-- =============================================================================
CREATE TABLE cross_references (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ref_hash TEXT UNIQUE NOT NULL,         -- SHA256 of symbol_id + file_id + line
    -- The symbol being referenced
    symbol_id INTEGER NOT NULL,            -- FK to symbols
    symbol_name TEXT NOT NULL,             -- Denormalized for fast display
    -- Where it's referenced
    file_id INTEGER NOT NULL,              -- FK to source_files
    workspace_id TEXT NOT NULL,
    line_number INTEGER NOT NULL,
    column_number INTEGER,
    -- Reference type
    ref_type TEXT NOT NULL,                -- call, instantiation, read, write, type_annotation
    -- Context
    context_line TEXT,                     -- The line of code containing the reference
    indexed_at TEXT DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (symbol_id) REFERENCES symbols(id) ON DELETE CASCADE,
    FOREIGN KEY (file_id) REFERENCES source_files(id) ON DELETE CASCADE
);

CREATE INDEX idx_xref_symbol ON cross_references(symbol_id);
CREATE INDEX idx_xref_file ON cross_references(file_id);
CREATE INDEX idx_xref_workspace ON cross_references(workspace_id);
CREATE INDEX idx_xref_type ON cross_references(ref_type);
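The lookup behind `/cxq --find-usages` is then a three-way join from `cross_references` back to `symbols` and `source_files`. A sketch with an illustrative `find_usages` helper:

```python
import sqlite3

def find_usages(conn: sqlite3.Connection, symbol_name: str):
    sql = """
        SELECT f.relative_path, x.line_number, x.ref_type, x.context_line
        FROM cross_references x
        JOIN symbols s ON s.id = x.symbol_id
        JOIN source_files f ON f.id = x.file_id
        WHERE s.name = ?
        ORDER BY f.relative_path, x.line_number
    """
    return conn.execute(sql, (symbol_name,)).fetchall()
```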
```sql
-- =============================================================================
-- CODE EMBEDDINGS (Semantic Search)
-- Storage: sessions.db (Tier 3 - Regenerable)
-- =============================================================================
CREATE TABLE code_embeddings (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source_type TEXT NOT NULL,             -- file, symbol, docstring
    source_id INTEGER NOT NULL,            -- FK to source_files or symbols
    workspace_id TEXT NOT NULL,
    -- Embedding data
    embedding BLOB NOT NULL,               -- numpy array as bytes
    model TEXT NOT NULL,                   -- e.g., 'all-MiniLM-L6-v2'
    dimensions INTEGER NOT NULL,           -- 384 for MiniLM
    -- Source text (for regeneration)
    source_text TEXT,                      -- The text that was embedded
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_code_emb_source ON code_embeddings(source_type, source_id);
CREATE INDEX idx_code_emb_workspace ON code_embeddings(workspace_id);
```
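At the projected ~50K embeddings, a brute-force scan with cosine similarity is sufficient for the sub-second target; an ANN index would only become necessary well beyond that. A sketch, assuming embeddings are stored as raw little-endian float32 bytes (the schema says "numpy array as bytes"; `array('f')` reads the same layout without the numpy dependency). Both helpers are illustrative:

```python
import array
import math
import sqlite3

def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_search(conn: sqlite3.Connection, query_vec, top_k: int = 5):
    hits = []
    for source_type, source_id, blob in conn.execute(
            "SELECT source_type, source_id, embedding FROM code_embeddings"):
        vec = array.array('f', blob)   # decode float32 bytes back to floats
        hits.append((cosine(query_vec, vec), source_type, source_id))
    hits.sort(reverse=True)            # highest similarity first
    return hits[:top_k]
```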
CLI Extensions
New /cxq Commands
```shell
# Index codebase for a workspace
/cxq --index-code                      # Index current workspace
/cxq --index-code --workspace NAME     # Index specific workspace
/cxq --index-code --all-workspaces     # Index all workspaces
/cxq --rebuild-code                    # Full rebuild

# Search code
/cxq --code "query"                    # Search symbols and files
/cxq --code "query" --type function    # Filter by symbol type
/cxq --code "query" --language python  # Filter by language
/cxq --code "query" --workspace NAME   # Scope to workspace

# Symbol lookup
/cxq --symbol "ClassName.method"       # Find symbol definition
/cxq --find-usages "function_name"     # Find all usages
/cxq --go-to-definition "symbol"       # Show definition location

# Dependency analysis
/cxq --imports FILE                    # Show file's imports
/cxq --importers FILE                  # Show who imports this file
/cxq --dependency-graph                # Generate dependency graph
/cxq --external-deps                   # List external packages

# Code statistics
/cxq --code-stats                      # Codebase statistics
/cxq --code-stats --workspace NAME     # Per-workspace stats
```
Language Support
Phase 1 (Core Languages)
| Language | Parser | Symbol Types |
|---|---|---|
| Python | ast (stdlib) | functions, classes, methods, variables, imports |
| TypeScript/JavaScript | @typescript-eslint/parser | functions, classes, interfaces, types, imports |
| Rust | tree-sitter-rust | functions, structs, traits, impl blocks, mods |
| Go | go/parser | functions, structs, interfaces, packages |
Phase 2 (Extended)
| Language | Parser |
|---|---|
| Java | tree-sitter-java |
| C/C++ | tree-sitter-c, tree-sitter-cpp |
| Ruby | tree-sitter-ruby |
| Shell | tree-sitter-bash |
Parser Strategy
```python
# Language detection by extension
LANGUAGE_MAP = {
    '.py': 'python',
    '.pyi': 'python',
    '.ts': 'typescript',
    '.tsx': 'typescript',
    '.js': 'javascript',
    '.jsx': 'javascript',
    '.rs': 'rust',
    '.go': 'go',
    '.java': 'java',
    '.rb': 'ruby',
    '.sh': 'shell',
    '.bash': 'shell',
}

# Parser selection
def get_parser(language: str):
    if language == 'python':
        return PythonASTParser()
    elif language in ('typescript', 'javascript'):
        return TreeSitterParser('typescript')
    elif language == 'rust':
        return TreeSitterParser('rust')
    elif language == 'go':
        return TreeSitterParser('go')
    else:
        return GenericParser()  # Regex-based fallback
```
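For the Python path, the stdlib `ast` module already provides everything the symbols table needs. A minimal sketch of what a `PythonASTParser` could extract (function/class names, line spans, async flag, docstring); `extract_symbols` is an illustrative helper, and a full implementation would also walk methods into `parent_symbol_id` relationships:

```python
import ast

def extract_symbols(source: str):
    symbols = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            symbols.append({
                "name": node.name,
                "symbol_type": "class" if isinstance(node, ast.ClassDef) else "function",
                "line_start": node.lineno,
                "line_end": node.end_lineno,     # requires Python 3.8+
                "is_async": isinstance(node, ast.AsyncFunctionDef),
                "docstring": ast.get_docstring(node),
            })
    return symbols
```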
Performance Considerations
Indexing Performance
| Operation | Target | Strategy |
|---|---|---|
| Initial index (15K files) | < 5 minutes | Parallel parsing, batch inserts |
| Incremental update (1 file) | < 1 second | Content hash comparison |
| Symbol extraction (per file) | < 100ms | Native parsers, no disk I/O |
| Embedding generation (per file) | < 500ms | Batch processing, GPU if available |
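The incremental-update row above relies on the `content_hash` column: a file is re-parsed only when its SHA256 no longer matches the stored value. A sketch of that check; `needs_reindex` is a hypothetical helper, the hash scheme follows the schema:

```python
import hashlib
import sqlite3

def needs_reindex(conn: sqlite3.Connection, workspace_id: str,
                  relative_path: str, data: bytes) -> bool:
    digest = hashlib.sha256(data).hexdigest()
    row = conn.execute(
        "SELECT content_hash FROM source_files "
        "WHERE workspace_id = ? AND relative_path = ?",
        (workspace_id, relative_path)).fetchone()
    # Unknown file, or content changed since last index run
    return row is None or row[0] != digest
```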
Query Performance
| Query Type | Target | Strategy |
|---|---|---|
| Symbol search | < 100ms | FTS5 + indexes |
| Find usages | < 200ms | Indexed cross_references |
| Dependency graph | < 500ms | Recursive CTE + caching |
| Semantic search | < 300ms | Vector similarity + LIMIT |
Optimization Strategies
- Batch Inserts - Use `INSERT INTO ... VALUES` with 1,000 rows per batch
- Write-Ahead Logging - `PRAGMA journal_mode=WAL` for concurrent reads
- Memory-Mapped I/O - `PRAGMA mmap_size=268435456` (256 MB)
- Covering Indexes - Include frequently queried columns in indexes
- Partial Indexes - Index only relevant subsets (e.g., exported symbols)
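The first three strategies can be sketched in a few lines of `sqlite3`. The 1,000-row batch size and both PRAGMA values come from the list above; the helper names are illustrative:

```python
import sqlite3

def open_index_db(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")     # readers don't block the writer
    conn.execute("PRAGMA mmap_size=268435456")  # 256 MB memory-mapped I/O
    return conn

def batch_insert_symbols(conn: sqlite3.Connection, rows, batch_size: int = 1000):
    sql = "INSERT INTO symbols (name, symbol_type, line_start) VALUES (?, ?, ?)"
    for i in range(0, len(rows), batch_size):
        with conn:  # one transaction per batch amortizes fsync cost
            conn.executemany(sql, rows[i:i + batch_size])
```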
Integration Points
With Existing Tables (ADR-118 Compliant)
┌─────────────────────────────────────────────────────────────────┐
│ SESSIONS.DB (Tier 3 - Regenerable) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ EXISTING (Sessions) NEW (Codebase) │
│ ├── messages (76K) ├── source_files (15K) │
│ ├── tool_analytics ├── symbols (500K) │
│ ├── token_economics ←──────────→ ├── dependencies (100K) │
│ └── workspace_registry ←──────────→ ├── cross_references (2M)│
│ └── code_embeddings (50K)│
│ │
│ LINKAGES: │
│ • code_patterns.source_file → source_files.id │
│ • All tables share workspace_id from workspace_registry │
│ • doc_index and source_files unified search │
│ │
└─────────────────────────────────────────────────────────────────┘
With Multi-Workspace System
- All new tables include `workspace_id` for isolation
- Queries support `--workspace` and `--all-workspaces` filters
- Worktrees automatically register and index their code
Alternatives Considered
A1: Separate Database for Code
Pros: Isolation, different optimization profile
Cons: Complex joins, two databases to manage, sync issues
Decision: Rejected - unified database preferred
A2: Language Server Protocol (LSP) Integration
Pros: Mature tooling, real-time updates
Cons: Requires running servers, memory overhead, no offline indexing
Decision: Deferred to Phase 2 as an enhancement
A3: External Search Engine (Elasticsearch/Meilisearch)
Pros: Faster at scale, better fuzzy matching
Cons: External dependency, deployment complexity
Decision: Rejected for MVP - reconsider at 1M+ symbols
Implementation Phases
Phase 1: Core Indexing (Week 1-2)
- Create schema migration
- Implement Python parser
- Implement TypeScript parser
- Basic symbol search
- File-level indexing
Phase 2: Cross-References (Week 3)
- Implement cross-reference extraction
- Find usages functionality
- Go to definition
Phase 3: Dependencies (Week 4)
- Import/export tracking
- Dependency graph generation
- External package detection
Phase 4: Semantic Search (Week 5)
- Code embeddings
- Concept-based search
- Similar code detection
Success Metrics
| Metric | Target | Measurement |
|---|---|---|
| Index coverage | 100% of source files | Files indexed / total files |
| Symbol accuracy | > 95% | Manual validation sample |
| Query latency P95 | < 200ms | Performance monitoring |
| Incremental index time | < 2s per file | Benchmark |
| User adoption | > 50% of /cxq queries | Usage analytics |
Related ADRs
- ADR-020 - Context Extraction (`/cx`)
- ADR-021 - Context Query System (`/cxq`)
- ADR-023 - SQLite Scalability Analysis
- ADR-025 - Comprehensive Entry Schema
- ADR-118 - Four-Tier Database Architecture
Decision: Approved for implementation
Migrated: 2026-02-03 (ADR-150)
Review Date: 2026-06-18