
ADR-023: SQLite Scalability Analysis for Codebase Intelligence Platform

Status: Accepted with Conditions
Date: 2025-12-18
Updated: 2026-02-03 (Migrated to coditect-core per ADR-150)
Deciders: Hal Casteel (Founder/CEO/CTO), CODITECT Core Team
Technical Story: Evaluate SQLite's capacity to support expanded codebase indexing at enterprise scale


Executive Summary

Question: Can SQLite support the expanded /cxq codebase indexing schema?

Answer: YES, with optimizations. SQLite can handle the projected scale (5GB, 3M+ rows) effectively with proper configuration. Migration to a heavier database is not needed until reaching 50GB+ or requiring concurrent writes from multiple processes.


Current State Analysis

ADR-118 Compliance: Analysis applies to sessions.db (Tier 3 - Regenerable) where codebase index data resides.

Database Metrics (2025-12-18)

| Metric | Value |
|---|---|
| Database size | 791 MB |
| Total rows | ~1.1 million |
| Largest table | code_patterns (645,855 rows) |
| FTS tables | 5 virtual tables |
| Embeddings | 49,595 vectors |

Table Breakdown

| Table | Rows | Purpose |
|---|---|---|
| messages | 76,485 | Session messages |
| messages_fts | 76,485 | Full-text search |
| decisions | 134,153 | Extracted decisions |
| code_patterns | 645,855 | Code pattern library |
| doc_index | 3,880 | Documentation catalog |
| embeddings | 49,595 | Semantic vectors |
| workspace_registry | 3 | Multi-workspace config |

Projected Scale After Expansion

Estimated Growth

| Table | Current | Projected | Growth |
|---|---|---|---|
| source_files | 0 | 15,000 | New |
| symbols | 0 | 500,000 | New |
| dependencies | 0 | 100,000 | New |
| cross_references | 0 | 2,000,000 | New |
| code_embeddings | 0 | 50,000 | New |
| **Total new rows** | 0 | ~2.7M | - |
| **Total all rows** | 1.1M | ~3.8M | 3.5x |

Projected Database Size

| Component | Size Estimate |
|---|---|
| Current database | 791 MB |
| source_files (15K × 0.5 KB) | 7.5 MB |
| symbols (500K × 0.5 KB) | 250 MB |
| dependencies (100K × 0.2 KB) | 20 MB |
| cross_references (2M × 0.15 KB) | 300 MB |
| code_embeddings (50K × 1.5 KB) | 75 MB |
| FTS indexes | ~500 MB |
| B-tree indexes | ~200 MB |
| **Projected total** | **~2.1 GB** |

With growth margin: ~3-5 GB


SQLite Capabilities Assessment

Theoretical Limits

| Limit | SQLite Maximum | Our Projected | Headroom |
|---|---|---|---|
| Database size | 281 TB | 5 GB | 56,000x |
| Rows per table | 2^64 | 2M | Practically unlimited |
| Columns per table | 2,000 | 25 | 80x |
| Table count | Unlimited | 50 | N/A |
| Concurrent readers | Unlimited | ~10 | N/A |
| Index count | Unlimited | ~30 | N/A |

Practical Limits

| Aspect | Limit | Mitigation |
|---|---|---|
| Single writer | 1 process at a time | WAL mode, write queuing |
| Memory usage | ~2x database size | mmap_size pragma, pagination |
| Backup while writing | Blocks writes | WAL checkpoint during idle |
| Complex joins | Can be slow | Denormalization, covering indexes |

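The write-queuing mitigation for the single-writer limit can be sketched as a dedicated writer thread fed by a queue, so worker threads never contend for the write lock. This is a minimal sketch with a hypothetical one-column `symbols` table and a throwaway database path, not the production schema:

```python
import os
import queue
import sqlite3
import tempfile
import threading

# All writes funnel through one dedicated thread, so SQLite's
# single-writer limit never surfaces as lock contention in workers.
db_path = os.path.join(tempfile.mkdtemp(), "queue_demo.db")
write_q = queue.Queue()

def writer():
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode = WAL")
    conn.execute("CREATE TABLE symbols (name TEXT)")
    while True:
        name = write_q.get()
        if name is None:  # sentinel: shut down the writer
            break
        conn.execute("INSERT INTO symbols VALUES (?)", (name,))
        conn.commit()
    conn.close()

t = threading.Thread(target=writer)
t.start()
for name in ["foo", "bar", "baz"]:
    write_q.put(name)  # any thread may enqueue; only one thread writes
write_q.put(None)
t.join()

conn = sqlite3.connect(db_path)
count = conn.execute("SELECT COUNT(*) FROM symbols").fetchone()[0]
print(count)  # 3
```

In a real deployment the sentinel and commit cadence would be tuned (e.g. batching commits), but the shape — many producers, one consumer owning the connection — is the whole idea.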
Performance Benchmarks

Query Performance Expectations

| Query Type | Rows Scanned | Expected Latency | Acceptable? |
|---|---|---|---|
| Symbol by name (indexed) | 1-10 | < 1ms | ✅ |
| FTS search | 500K | 10-50ms | ✅ |
| Cross-reference lookup | 10K | 5-20ms | ✅ |
| Dependency graph (1 level) | 1K | 10-30ms | ✅ |
| Dependency graph (recursive) | 10K | 50-200ms | ✅ |
| Semantic similarity | 50K embeddings | 100-300ms | ✅ |
| Full table scan | 2M | 500ms-2s | ⚠️ Avoid |

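The recursive dependency-graph query benchmarked above can be expressed as a recursive CTE. A minimal sketch, assuming a hypothetical `dependencies(source_id, target_id)` edge table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dependencies (source_id INTEGER, target_id INTEGER)")
# Toy graph: 1 -> 2 -> 3 -> 4, and 1 -> 5
conn.executemany("INSERT INTO dependencies VALUES (?, ?)",
                 [(1, 2), (2, 3), (3, 4), (1, 5)])

# Transitive closure of module 1's dependencies via a recursive CTE;
# UNION (not UNION ALL) deduplicates and terminates on cycles.
rows = conn.execute("""
    WITH RECURSIVE deps(id) AS (
        SELECT target_id FROM dependencies WHERE source_id = 1
        UNION
        SELECT d.target_id FROM dependencies d JOIN deps ON d.source_id = deps.id
    )
    SELECT id FROM deps ORDER BY id
""").fetchall()
print([r[0] for r in rows])  # [2, 3, 4, 5]
```

An index on `dependencies(source_id)` is what keeps the recursive step in the 50-200ms band at the projected 100K-row scale.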
Write Performance

| Operation | Expected Latency |
|---|---|
| Single INSERT | < 1ms |
| Batch INSERT (1,000 rows) | 20-50ms |
| Index update | Automatic |
| Full reindex (15K files) | 2-5 minutes |

Required Optimizations

1. WAL Mode (Write-Ahead Logging)

```sql
PRAGMA journal_mode = WAL;
PRAGMA synchronous = NORMAL;
PRAGMA wal_autocheckpoint = 1000;
```

Benefits:

  • Concurrent reads during writes
  • Faster write performance
  • Crash recovery

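Because `journal_mode` silently stays at its old value if the switch fails (e.g. on certain filesystems), it is worth verifying at connection open. A minimal sketch, using a throwaway path in place of sessions.db:

```python
import os
import sqlite3
import tempfile

# Apply the recommended pragmas, then confirm WAL actually took effect.
db_path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode = WAL")
conn.execute("PRAGMA synchronous = NORMAL")
conn.execute("PRAGMA wal_autocheckpoint = 1000")

# PRAGMA journal_mode with no argument returns the current mode.
mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
print(mode)  # 'wal' — and the setting persists in the database file
```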
2. Memory-Mapped I/O

```sql
PRAGMA mmap_size = 268435456;  -- 256 MB
PRAGMA cache_size = -64000;    -- 64 MB cache
```

Benefits:

  • Reduced disk I/O
  • Faster repeated queries
  • Lower memory pressure

3. Optimized Indexes

```sql
-- Covering indexes for common queries
CREATE INDEX idx_symbols_lookup ON symbols(
    name, symbol_type, workspace_id, file_id, line_start
);

-- Partial index for exported symbols only
CREATE INDEX idx_symbols_exported ON symbols(name, qualified_name)
WHERE is_exported = TRUE;

-- Composite index for cross-references
CREATE INDEX idx_xref_combined ON cross_references(
    symbol_id, file_id, ref_type
);
```

4. Query Optimization

```sql
-- Use EXPLAIN QUERY PLAN to verify index usage
EXPLAIN QUERY PLAN SELECT * FROM symbols WHERE name = 'foo';

-- Avoid SELECT * in production queries
SELECT id, name, symbol_type, line_start FROM symbols WHERE ...;

-- Use LIMIT for pagination
SELECT * FROM cross_references WHERE symbol_id = ? LIMIT 100 OFFSET 0;
```

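The EXPLAIN QUERY PLAN check can also be automated, e.g. as a regression test that fails if a query stops hitting its index. A sketch with a throwaway schema standing in for the real tables:

```python
import sqlite3

# Throwaway schema standing in for the real symbols table
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE symbols (id INTEGER PRIMARY KEY, name TEXT, symbol_type TEXT)"
)
conn.execute("CREATE INDEX idx_symbols_name ON symbols(name)")

# Each plan row's last column is a human-readable step description,
# e.g. "SEARCH symbols USING COVERING INDEX idx_symbols_name (name=?)"
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM symbols WHERE name = ?", ("foo",)
).fetchall()
uses_index = any("idx_symbols_name" in row[-1] for row in plan)
print(uses_index)  # True — the query hits the index, not a full scan
```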
5. Batch Operations

```python
# Instead of individual INSERTs (one commit round-trip per row)
for row in rows:
    cursor.execute("INSERT INTO symbols ...", row)  # SLOW

# Use executemany inside a single transaction
cursor.executemany(
    "INSERT INTO symbols (name, type, file_id) VALUES (?, ?, ?)",
    rows,
)
conn.commit()  # Single commit for all rows
```

6. Connection Pooling

```python
import sqlite3
import threading
from contextlib import contextmanager

# Thread-local connections for concurrent reads
_local = threading.local()

@contextmanager
def get_connection():
    if not hasattr(_local, 'conn'):
        _local.conn = sqlite3.connect(
            'sessions.db',  # ADR-118: Tier 3
            check_same_thread=False,
            timeout=30,
        )
        _local.conn.execute("PRAGMA journal_mode=WAL")
    yield _local.conn
```

Each thread lazily opens its own connection on first use and reuses it thereafter, which is safe under WAL's concurrent-reader model.

When to Consider Alternatives

Stay with SQLite If:

  • Database < 50 GB
  • Single-user or read-heavy workload
  • Zero external dependencies required
  • Offline-first operation needed
  • Sub-second queries acceptable
  • No real-time streaming requirements

Consider Migration If:

  • Database > 50 GB
  • Multiple concurrent writers needed
  • Sub-10ms query requirements
  • Native vector search (pgvector) needed
  • Full-text search with stemming/synonyms
  • Distributed deployment required

Alternative Database Comparison

If Migration Becomes Necessary

| Database | Pros | Cons | When to Use |
|---|---|---|---|
| PostgreSQL + pgvector | Native vectors, mature, powerful | Requires server, complex setup | 50GB+, multi-user, native vectors |
| DuckDB | Analytical queries, columnar | Not designed for OLTP | Heavy analytics, data warehouse |
| LanceDB | Vector-native, embedded | New, limited ecosystem | Pure vector workloads |
| SQLite + sqlite-vec | Vector extension, stays embedded | Extension maintenance | Keep embedded, add vectors |
| Elasticsearch | Fast full-text, scalable | Heavy, requires cluster | Search-primary workload |

```
Phase 1 (Current): SQLite 3.x (sessions.db - Tier 3)
    ↓ (if >50GB or concurrent writes needed)
Phase 2 (Future): PostgreSQL + pgvector
    ↓ (if massive scale needed)
Phase 3 (Enterprise): Distributed PostgreSQL (Citus)
```

Implementation Recommendations

Immediate Actions

1. Apply pragmas in ensure_schema():

```python
def ensure_schema(conn):
    conn.execute("PRAGMA journal_mode = WAL")
    conn.execute("PRAGMA synchronous = NORMAL")
    conn.execute("PRAGMA mmap_size = 268435456")
    conn.execute("PRAGMA cache_size = -64000")
    conn.execute("PRAGMA temp_store = MEMORY")
```

2. Add a database vacuum schedule:

```bash
# Weekly maintenance (cron or scheduled task)
sqlite3 ~/PROJECTS/.coditect-data/context-storage/sessions.db "VACUUM; ANALYZE;"
```

3. Monitor query performance. SQLite has no built-in slow-query log, so time queries at the application level:

```python
import logging
import time

# Log queries slower than 1 second
start = time.perf_counter()
cursor.execute(query, params)
elapsed = time.perf_counter() - start
if elapsed > 1.0:
    logging.warning("Slow query (%.2fs): %s", elapsed, query)
```

Future Optimizations (If Needed)

  1. Implement sqlite-vec for native vector operations
  2. Add read replicas for concurrent heavy read loads
  3. Implement query caching for repeated expensive queries
  4. Consider table partitioning by workspace_id

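Query caching (item 3 above) could be sketched with `functools.lru_cache`, under the assumption that cached results are explicitly invalidated with `cache_clear()` after each reindex; the table and function names here are illustrative:

```python
import functools
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE symbols (name TEXT, symbol_type TEXT)")
conn.execute("INSERT INTO symbols VALUES ('foo', 'function')")

@functools.lru_cache(maxsize=1024)
def symbol_types(name):
    # Cached lookup; call symbol_types.cache_clear() after a reindex
    return tuple(row[0] for row in conn.execute(
        "SELECT symbol_type FROM symbols WHERE name = ?", (name,)))

first = symbol_types("foo")   # hits the database
second = symbol_types("foo")  # served from the cache
print(first)  # ('function',)
print(symbol_types.cache_info().hits)  # 1
```

Results must be hashable and immutable (hence the tuple), and the cache is per-process, which fits the single-process deployment assumed throughout this ADR.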
Risk Assessment

| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Query slowdown at scale | Medium | Medium | Indexing, query optimization |
| Write contention | Low | Low | WAL mode, write queuing |
| Database corruption | Very Low | High | WAL mode, backups |
| Memory exhaustion | Low | Medium | mmap limits, pagination |
| Disk space issues | Low | Low | Monitoring, vacuuming |

Decision

Primary Decision: Continue with SQLite

Rationale:

  1. Current 791MB database with 1.1M rows performs well
  2. Projected 3-5GB with 3.8M rows is within SQLite's comfortable range
  3. Zero external dependencies aligns with offline-first philosophy
  4. Optimizations can extend capacity 10x before migration needed
  5. Migration path to PostgreSQL is clear if scale demands

Conditions for Continued Use

  1. MUST implement WAL mode and memory optimizations
  2. MUST use covering indexes for frequent queries
  3. MUST batch write operations
  4. SHOULD monitor query latency P95
  5. SHOULD run VACUUM/ANALYZE weekly

Trigger Points for Re-evaluation

  • Database size exceeds 20 GB
  • P95 query latency exceeds 500ms
  • Concurrent writer conflicts occur
  • Real-time sync requirements emerge

Appendix: SQLite Performance Tuning Checklist

```sql
-- Apply at connection open
PRAGMA journal_mode = WAL;
PRAGMA synchronous = NORMAL;
PRAGMA mmap_size = 268435456;
PRAGMA cache_size = -64000;
PRAGMA temp_store = MEMORY;
PRAGMA foreign_keys = ON;

-- Weekly maintenance
VACUUM;
ANALYZE;
PRAGMA optimize;

-- Verify settings
PRAGMA journal_mode;
PRAGMA mmap_size;
PRAGMA cache_size;
```

Related ADRs

  • ADR-020 - Context Extraction (/cx)
  • ADR-021 - Context Query System (/cxq)
  • ADR-022 - Codebase Indexing Expansion
  • ADR-118 - Four-Tier Database Architecture

Decision Status: APPROVED
Migrated: 2026-02-03 (ADR-150)
Review Date: Q2 2026 (or if trigger points reached)