Skip to main content

ADR-022: Codebase Management Indexing and Search Expansion

Status: Proposed Date: 2025-12-18 Updated: 2026-02-03 (Migrated to coditect-core per ADR-150) Deciders: Hal Casteel (Founder/CEO/CTO), CODITECT Core Team Technical Story: Expand /cx and /cxq to support full codebase indexing, symbol extraction, and cross-reference search


Context and Problem Statement

The current /cx and /cxq system successfully indexes:

  • 76,485 messages from Claude Code sessions
  • 134,153 decisions extracted from conversations
  • 645,855 code patterns from assistant responses
  • 3,880 documents across multi-workspace federated catalog

However, the actual codebase is not indexed. Developers cannot:

  1. Search code symbols - Find function definitions, class declarations, or variable usages
  2. Track dependencies - Understand import relationships between files
  3. Cross-reference code - Find where a function is called or a class is instantiated
  4. Navigate code semantically - Search by concept ("authentication handler") not just text
  5. Understand codebase structure - Get architectural overview of modules and packages

The Problem: How do we expand /cxq to become a comprehensive codebase intelligence platform while maintaining performance?


Decision Drivers

Technical Requirements

IDRequirementPriority
R1Index source files across all 64 submodulesP0
R2Extract symbols (functions, classes, variables, imports)P0
R3Track file dependencies and import graphsP1
R4Semantic code search (concept-based, not just text)P1
R5Cross-reference lookup (find usages, go to definition)P1
R6Language-agnostic support (Python, TypeScript, Rust, Go)P1
R7Incremental indexing (only changed files)P0
R8Sub-second query performance for 100K+ symbolsP0
R9Integration with existing doc_index and messages tablesP0
R10Workspace-scoped queries (per-submodule isolation)P0

Scale Projections

MetricCurrentProjected (Codebase)Growth Factor
Documents3,8803,8801x
Source Files0~15,000-
Symbols0~500,000-
Dependencies0~100,000-
Cross-References0~2,000,000-
Database Size791 MB~3-5 GB4-6x

Proposed Schema Expansion

ADR-118 Compliance: These tables belong in sessions.db (Tier 3) as regenerable codebase index data.

New Tables

-- =============================================================================
-- SOURCE FILE INDEX
-- Storage: sessions.db (Tier 3 - Regenerable)
-- =============================================================================
CREATE TABLE source_files (
id INTEGER PRIMARY KEY AUTOINCREMENT,
file_hash TEXT UNIQUE NOT NULL, -- SHA256 of workspace_id + relative_path
workspace_id TEXT NOT NULL, -- From workspace_registry
workspace_name TEXT, -- Human-readable workspace
relative_path TEXT NOT NULL, -- Path from workspace root
absolute_path TEXT, -- Full path for direct access
language TEXT NOT NULL, -- python, typescript, rust, go, etc.
file_type TEXT, -- source, test, config, script
line_count INTEGER DEFAULT 0,
byte_size INTEGER DEFAULT 0,
content_hash TEXT, -- SHA256 of file content
last_modified TEXT, -- File mtime
indexed_at TEXT DEFAULT CURRENT_TIMESTAMP,

-- Parsed metadata
has_imports BOOLEAN DEFAULT FALSE,
has_exports BOOLEAN DEFAULT FALSE,
has_classes BOOLEAN DEFAULT FALSE,
has_functions BOOLEAN DEFAULT FALSE,
import_count INTEGER DEFAULT 0,
export_count INTEGER DEFAULT 0,
class_count INTEGER DEFAULT 0,
function_count INTEGER DEFAULT 0,

FOREIGN KEY (workspace_id) REFERENCES workspace_registry(workspace_id),
UNIQUE(workspace_id, relative_path)
);

CREATE INDEX idx_source_workspace ON source_files(workspace_id);
CREATE INDEX idx_source_language ON source_files(language);
CREATE INDEX idx_source_type ON source_files(file_type);

-- FTS for file path and content search
CREATE VIRTUAL TABLE source_files_fts USING fts5(
relative_path,
language,
content='source_files',
content_rowid='id'
);

-- =============================================================================
-- SYMBOL INDEX (Functions, Classes, Variables, Constants)
-- Storage: sessions.db (Tier 3 - Regenerable)
-- =============================================================================
CREATE TABLE symbols (
id INTEGER PRIMARY KEY AUTOINCREMENT,
symbol_hash TEXT UNIQUE NOT NULL, -- SHA256 of file_hash + name + line
file_id INTEGER NOT NULL, -- FK to source_files
workspace_id TEXT NOT NULL, -- Denormalized for fast queries

-- Symbol identification
name TEXT NOT NULL, -- Symbol name
qualified_name TEXT, -- Full qualified name (module.class.method)
symbol_type TEXT NOT NULL, -- function, class, method, variable, constant, interface, type

-- Location
line_start INTEGER NOT NULL,
line_end INTEGER,
column_start INTEGER,
column_end INTEGER,

-- Metadata
visibility TEXT, -- public, private, protected, internal
is_async BOOLEAN DEFAULT FALSE,
is_static BOOLEAN DEFAULT FALSE,
is_abstract BOOLEAN DEFAULT FALSE,
is_exported BOOLEAN DEFAULT FALSE,

-- Documentation
docstring TEXT, -- Extracted docstring/JSDoc
signature TEXT, -- Function signature
return_type TEXT, -- Return type annotation
parameters TEXT, -- JSON array of parameters

-- Parent relationships
parent_symbol_id INTEGER, -- For methods inside classes

indexed_at TEXT DEFAULT CURRENT_TIMESTAMP,

FOREIGN KEY (file_id) REFERENCES source_files(id) ON DELETE CASCADE,
FOREIGN KEY (parent_symbol_id) REFERENCES symbols(id)
);

CREATE INDEX idx_symbols_file ON symbols(file_id);
CREATE INDEX idx_symbols_workspace ON symbols(workspace_id);
CREATE INDEX idx_symbols_name ON symbols(name);
CREATE INDEX idx_symbols_type ON symbols(symbol_type);
CREATE INDEX idx_symbols_qualified ON symbols(qualified_name);

-- FTS for symbol search
CREATE VIRTUAL TABLE symbols_fts USING fts5(
name,
qualified_name,
docstring,
signature,
content='symbols',
content_rowid='id'
);

-- =============================================================================
-- DEPENDENCY GRAPH (Imports and Exports)
-- Storage: sessions.db (Tier 3 - Regenerable)
-- =============================================================================
CREATE TABLE dependencies (
id INTEGER PRIMARY KEY AUTOINCREMENT,
dep_hash TEXT UNIQUE NOT NULL, -- SHA256 of from_file + to_target + import_name

-- Source file
from_file_id INTEGER NOT NULL, -- FK to source_files
from_workspace_id TEXT NOT NULL,

-- Target (can be file or external package)
to_file_id INTEGER, -- FK to source_files (NULL if external)
to_module TEXT NOT NULL, -- Module/package path
to_workspace_id TEXT, -- NULL if external

-- Import details
import_type TEXT NOT NULL, -- import, from_import, require, use, include
import_name TEXT, -- Specific import (e.g., "Dict" from "typing")
alias TEXT, -- Import alias (as X)
is_external BOOLEAN DEFAULT FALSE, -- External package vs internal
is_type_only BOOLEAN DEFAULT FALSE, -- TypeScript type-only imports

line_number INTEGER,

indexed_at TEXT DEFAULT CURRENT_TIMESTAMP,

FOREIGN KEY (from_file_id) REFERENCES source_files(id) ON DELETE CASCADE,
FOREIGN KEY (to_file_id) REFERENCES source_files(id) ON DELETE SET NULL
);

CREATE INDEX idx_deps_from ON dependencies(from_file_id);
CREATE INDEX idx_deps_to ON dependencies(to_file_id);
CREATE INDEX idx_deps_module ON dependencies(to_module);
CREATE INDEX idx_deps_external ON dependencies(is_external);

-- =============================================================================
-- CROSS-REFERENCES (Symbol Usages)
-- Storage: sessions.db (Tier 3 - Regenerable)
-- =============================================================================
CREATE TABLE cross_references (
id INTEGER PRIMARY KEY AUTOINCREMENT,
ref_hash TEXT UNIQUE NOT NULL, -- SHA256 of symbol_id + file_id + line

-- The symbol being referenced
symbol_id INTEGER NOT NULL, -- FK to symbols
symbol_name TEXT NOT NULL, -- Denormalized for fast display

-- Where it's referenced
file_id INTEGER NOT NULL, -- FK to source_files
workspace_id TEXT NOT NULL,
line_number INTEGER NOT NULL,
column_number INTEGER,

-- Reference type
ref_type TEXT NOT NULL, -- call, instantiation, read, write, type_annotation

-- Context
context_line TEXT, -- The line of code containing the reference

indexed_at TEXT DEFAULT CURRENT_TIMESTAMP,

FOREIGN KEY (symbol_id) REFERENCES symbols(id) ON DELETE CASCADE,
FOREIGN KEY (file_id) REFERENCES source_files(id) ON DELETE CASCADE
);

CREATE INDEX idx_xref_symbol ON cross_references(symbol_id);
CREATE INDEX idx_xref_file ON cross_references(file_id);
CREATE INDEX idx_xref_workspace ON cross_references(workspace_id);
CREATE INDEX idx_xref_type ON cross_references(ref_type);

-- =============================================================================
-- CODE EMBEDDINGS (Semantic Search)
-- Storage: sessions.db (Tier 3 - Regenerable)
-- =============================================================================
CREATE TABLE code_embeddings (
id INTEGER PRIMARY KEY AUTOINCREMENT,
source_type TEXT NOT NULL, -- file, symbol, docstring
source_id INTEGER NOT NULL, -- FK to source_files or symbols
workspace_id TEXT NOT NULL,

-- Embedding data
embedding BLOB NOT NULL, -- numpy array as bytes
model TEXT NOT NULL, -- e.g., 'all-MiniLM-L6-v2'
dimensions INTEGER NOT NULL, -- 384 for MiniLM

-- Source text (for regeneration)
source_text TEXT, -- The text that was embedded

created_at TEXT DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_code_emb_source ON code_embeddings(source_type, source_id);
CREATE INDEX idx_code_emb_workspace ON code_embeddings(workspace_id);

CLI Extensions

New /cxq Commands

# Index codebase for a workspace
/cxq --index-code # Index current workspace
/cxq --index-code --workspace NAME # Index specific workspace
/cxq --index-code --all-workspaces # Index all workspaces
/cxq --rebuild-code # Full rebuild

# Search code
/cxq --code "query" # Search symbols and files
/cxq --code "query" --type function # Filter by symbol type
/cxq --code "query" --language python # Filter by language
/cxq --code "query" --workspace NAME # Scope to workspace

# Symbol lookup
/cxq --symbol "ClassName.method" # Find symbol definition
/cxq --find-usages "function_name" # Find all usages
/cxq --go-to-definition "symbol" # Show definition location

# Dependency analysis
/cxq --imports FILE # Show file's imports
/cxq --importers FILE # Show who imports this file
/cxq --dependency-graph # Generate dependency graph
/cxq --external-deps # List external packages

# Code statistics
/cxq --code-stats # Codebase statistics
/cxq --code-stats --workspace NAME # Per-workspace stats

Language Support

Phase 1 (Core Languages)

LanguageParserSymbol Types
Pythonast (stdlib)functions, classes, methods, variables, imports
TypeScript/JavaScript@typescript-eslint/parserfunctions, classes, interfaces, types, imports
Rusttree-sitter-rustfunctions, structs, traits, impl blocks, mods
Gogo/parserfunctions, structs, interfaces, packages

Phase 2 (Extended)

LanguageParser
Javatree-sitter-java
C/C++tree-sitter-c, tree-sitter-cpp
Rubytree-sitter-ruby
Shelltree-sitter-bash

Parser Strategy

# Language detection by extension
LANGUAGE_MAP = {
'.py': 'python',
'.pyi': 'python',
'.ts': 'typescript',
'.tsx': 'typescript',
'.js': 'javascript',
'.jsx': 'javascript',
'.rs': 'rust',
'.go': 'go',
'.java': 'java',
'.rb': 'ruby',
'.sh': 'shell',
'.bash': 'shell',
}

# Parser selection
def get_parser(language: str):
if language == 'python':
return PythonASTParser()
elif language in ('typescript', 'javascript'):
return TreeSitterParser('typescript')
elif language == 'rust':
return TreeSitterParser('rust')
elif language == 'go':
return TreeSitterParser('go')
else:
return GenericParser() # Regex-based fallback

Performance Considerations

Indexing Performance

OperationTargetStrategy
Initial index (15K files)< 5 minutesParallel parsing, batch inserts
Incremental update (1 file)< 1 secondContent hash comparison
Symbol extraction (per file)< 100msNative parsers, no disk I/O
Embedding generation (per file)< 500msBatch processing, GPU if available

Query Performance

Query TypeTargetStrategy
Symbol search< 100msFTS5 + indexes
Find usages< 200msIndexed cross_references
Dependency graph< 500msRecursive CTE + caching
Semantic search< 300msVector similarity + LIMIT

Optimization Strategies

  1. Batch Inserts - Use INSERT INTO ... VALUES with 1000 rows per batch
  2. Write-Ahead Logging - PRAGMA journal_mode=WAL for concurrent reads
  3. Memory-Mapped I/O - PRAGMA mmap_size=268435456 (256MB)
  4. Covering Indexes - Include frequently queried columns in indexes
  5. Partial Indexes - Index only relevant subsets (e.g., exported symbols)

Integration Points

With Existing Tables (ADR-118 Compliant)

┌─────────────────────────────────────────────────────────────────┐
│ SESSIONS.DB (Tier 3 - Regenerable) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ EXISTING (Sessions) NEW (Codebase) │
│ ├── messages (76K) ├── source_files (15K) │
│ ├── tool_analytics ├── symbols (500K) │
│ ├── token_economics ←──────────→ ├── dependencies (100K) │
│ └── workspace_registry ←──────────→ ├── cross_references (2M)│
│ └── code_embeddings (50K)│
│ │
│ LINKAGES: │
│ • code_patterns.source_file → source_files.id │
│ • All tables share workspace_id from workspace_registry │
│ • doc_index and source_files unified search │
│ │
└─────────────────────────────────────────────────────────────────┘

With Multi-Workspace System

  • All new tables include workspace_id for isolation
  • Queries support --workspace and --all-workspaces filters
  • Worktrees automatically register and index their code

Alternatives Considered

A1: Separate Database for Code

Pros: Isolation, different optimization profile Cons: Complex joins, two databases to manage, sync issues Decision: Rejected - unified database preferred

A2: Language Server Protocol (LSP) Integration

Pros: Mature tooling, real-time updates Cons: Requires running servers, memory overhead, no offline indexing Decision: Deferred to Phase 2 as enhancement

A3: External Search Engine (Elasticsearch/Meilisearch)

Pros: Faster at scale, better fuzzy matching Cons: External dependency, deployment complexity Decision: Rejected for MVP - reconsider at 1M+ symbols


Implementation Phases

Phase 1: Core Indexing (Week 1-2)

  • Create schema migration
  • Implement Python parser
  • Implement TypeScript parser
  • Basic symbol search
  • File-level indexing

Phase 2: Cross-References (Week 3)

  • Implement cross-reference extraction
  • Find usages functionality
  • Go to definition

Phase 3: Dependencies (Week 4)

  • Import/export tracking
  • Dependency graph generation
  • External package detection

Phase 4: Semantic Search (Week 5)

  • Code embeddings
  • Concept-based search
  • Similar code detection

Success Metrics

MetricTargetMeasurement
Index coverage100% of source filesFiles indexed / total files
Symbol accuracy> 95%Manual validation sample
Query latency P95< 200msPerformance monitoring
Incremental index time< 2s per fileBenchmark
User adoption> 50% of /cxq queriesUsage analytics

  • ADR-020 - Context Extraction (/cx)
  • ADR-021 - Context Query System (/cxq)
  • ADR-023 - SQLite Scalability Analysis
  • ADR-025 - Comprehensive Entry Schema
  • ADR-118 - Four-Tier Database Architecture

References


Decision: Approved for implementation Migrated: 2026-02-03 (ADR-150) Review Date: 2026-06-18