ADR-023: SQLite Scalability Analysis for Codebase Intelligence Platform
Status: Accepted with Conditions
Date: 2025-12-18
Updated: 2026-02-03 (Migrated to coditect-core per ADR-150)
Deciders: Hal Casteel (Founder/CEO/CTO), CODITECT Core Team
Technical Story: Evaluate SQLite's capacity to support expanded codebase indexing at enterprise scale
Executive Summary
Question: Can SQLite support the expanded /cxq codebase indexing schema?
Answer: YES, with optimizations. SQLite can handle the projected scale (5GB, 3M+ rows) effectively with proper configuration. Migration to a heavier database is not needed until reaching 50GB+ or requiring concurrent writes from multiple processes.
Current State Analysis
ADR-118 Compliance: This analysis applies to sessions.db (Tier 3 - Regenerable), where the codebase index data resides.
Database Metrics (2025-12-18)
| Metric | Value |
|---|---|
| Database Size | 791 MB |
| Total Rows | ~1.1 million |
| Largest Table | code_patterns (645,855 rows) |
| FTS Tables | 5 virtual tables |
| Embeddings | 49,595 vectors |
Table Breakdown
| Table | Rows | Purpose |
|---|---|---|
| messages | 76,485 | Session messages |
| messages_fts | 76,485 | Full-text search |
| decisions | 134,153 | Extracted decisions |
| code_patterns | 645,855 | Code pattern library |
| doc_index | 3,880 | Documentation catalog |
| embeddings | 49,595 | Semantic vectors |
| workspace_registry | 3 | Multi-workspace config |
Projected Scale After Expansion
Estimated Growth
| Table | Current | Projected | Growth |
|---|---|---|---|
| source_files | 0 | 15,000 | New |
| symbols | 0 | 500,000 | New |
| dependencies | 0 | 100,000 | New |
| cross_references | 0 | 2,000,000 | New |
| code_embeddings | 0 | 50,000 | New |
| Total New Rows | 0 | ~2.7M | - |
| Total All Rows | 1.1M | ~3.8M | 3.5x |
Projected Database Size
| Component | Size Estimate |
|---|---|
| Current database | 791 MB |
| source_files (15K × 0.5KB) | 7.5 MB |
| symbols (500K × 0.5KB) | 250 MB |
| dependencies (100K × 0.2KB) | 20 MB |
| cross_references (2M × 0.15KB) | 300 MB |
| code_embeddings (50K × 1.5KB) | 75 MB |
| FTS indexes | ~500 MB |
| B-tree indexes | ~200 MB |
| Projected Total | ~2.1 GB |
With growth margin: ~3-5 GB
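The component estimates above can be sanity-checked with simple arithmetic. The per-row sizes are the estimates stated in the table, using 1000-based unit conversion to match the table's figures:

```python
# Sanity-check the projected-size arithmetic from the table above.
components_mb = {
    "current_database": 791,
    "source_files": 15_000 * 0.5 / 1000,          # 7.5 MB
    "symbols": 500_000 * 0.5 / 1000,              # 250 MB
    "dependencies": 100_000 * 0.2 / 1000,         # 20 MB
    "cross_references": 2_000_000 * 0.15 / 1000,  # 300 MB
    "code_embeddings": 50_000 * 1.5 / 1000,       # 75 MB
    "fts_indexes": 500,
    "btree_indexes": 200,
}
total_mb = sum(components_mb.values())
print(f"Projected total: {total_mb / 1000:.1f} GB")  # Projected total: 2.1 GB
```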
SQLite Capabilities Assessment
Theoretical Limits
| Limit | SQLite Maximum | Our Projected | Headroom |
|---|---|---|---|
| Database size | 281 TB | 5 GB | 56,000x |
| Rows per table | 2^64 | 2M | Practically unlimited |
| Columns per table | 2,000 | 25 | 80x |
| Table count | Unlimited | 50 | N/A |
| Concurrent readers | Unlimited | ~10 | N/A |
| Index count | Unlimited | ~30 | N/A |
Practical Limits
| Aspect | Limit | Mitigation |
|---|---|---|
| Single writer | 1 process at a time | WAL mode, write queuing |
| Memory usage | ~2x database size | mmap_size pragma, pagination |
| Backup while writing | Blocks writes | WAL checkpoint during idle |
| Complex joins | Can be slow | Denormalization, covering indexes |
Performance Benchmarks
Query Performance Expectations
| Query Type | Rows Scanned | Expected Latency | Acceptable? |
|---|---|---|---|
| Symbol by name (indexed) | 1-10 | < 1ms | ✅ |
| FTS search | 500K | 10-50ms | ✅ |
| Cross-reference lookup | 10K | 5-20ms | ✅ |
| Dependency graph (1 level) | 1K | 10-30ms | ✅ |
| Dependency graph (recursive) | 10K | 50-200ms | ✅ |
| Semantic similarity | 50K embeddings | 100-300ms | ✅ |
| Full table scan | 2M | 500ms-2s | ⚠️ Avoid |
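The recursive dependency-graph traversal in the table above maps to a `WITH RECURSIVE` common table expression in SQLite. A minimal sketch, assuming a hypothetical `dependencies(source_id, target_id)` table (the real schema is defined by the indexing expansion in ADR-022):

```python
import sqlite3

# Toy dependency graph: 1 -> 2 -> 3 -> 4, and 1 -> 5.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dependencies (source_id INTEGER, target_id INTEGER)")
conn.executemany(
    "INSERT INTO dependencies VALUES (?, ?)",
    [(1, 2), (2, 3), (3, 4), (1, 5)],
)

# Recursive traversal: every module reachable from module 1.
rows = conn.execute("""
    WITH RECURSIVE reachable(id) AS (
        SELECT target_id FROM dependencies WHERE source_id = 1
        UNION
        SELECT d.target_id FROM dependencies d
        JOIN reachable r ON d.source_id = r.id
    )
    SELECT id FROM reachable ORDER BY id
""").fetchall()
print([r[0] for r in rows])  # [2, 3, 4, 5]
```

`UNION` (rather than `UNION ALL`) deduplicates visited nodes, which also terminates the recursion on cyclic graphs.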
Write Performance
| Operation | Expected Latency |
|---|---|
| Single INSERT | < 1ms |
| Batch INSERT (1000 rows) | 20-50ms |
| Index update | Included in INSERT latency |
| Full reindex (15K files) | 2-5 minutes |
Required Optimizations
1. WAL Mode (Write-Ahead Logging)
PRAGMA journal_mode = WAL;
PRAGMA synchronous = NORMAL;
PRAGMA wal_autocheckpoint = 1000;
Benefits:
- Concurrent reads during writes
- Faster write performance
- Crash recovery
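The `journal_mode` pragma reports the mode actually in effect, so WAL can be verified at connection time. A quick check (a temp file stands in for sessions.db, since in-memory databases do not support WAL):

```python
import os
import sqlite3
import tempfile

# WAL is persistent in the database file; verify the pragma round-trips.
path = os.path.join(tempfile.mkdtemp(), "sessions_test.db")
conn = sqlite3.connect(path)
mode = conn.execute("PRAGMA journal_mode = WAL").fetchone()[0]
conn.execute("PRAGMA synchronous = NORMAL")
conn.execute("PRAGMA wal_autocheckpoint = 1000")
print(mode)  # wal
```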
2. Memory-Mapped I/O
PRAGMA mmap_size = 268435456; -- 256 MB
PRAGMA cache_size = -64000; -- 64 MB cache
Benefits:
- Reduced disk I/O
- Faster repeated queries
- Lower memory pressure
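These settings can also be verified with a round-trip read. Note that `PRAGMA mmap_size` may report 0 on builds compiled without memory-mapped I/O, so only `cache_size` is a reliable check everywhere:

```python
import os
import sqlite3
import tempfile

# Round-trip check for the memory pragmas (temp file as a stand-in).
path = os.path.join(tempfile.mkdtemp(), "check.db")
conn = sqlite3.connect(path)
conn.execute("PRAGMA mmap_size = 268435456")
conn.execute("PRAGMA cache_size = -64000")
cache_size = conn.execute("PRAGMA cache_size").fetchone()[0]
mmap_size = conn.execute("PRAGMA mmap_size").fetchone()[0]
print(cache_size)  # -64000
print(mmap_size)   # 268435456 where mmap is available (0 if compiled out)
```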
3. Optimized Indexes
-- Covering index for common symbol lookups
CREATE INDEX idx_symbols_lookup ON symbols(
    name, symbol_type, workspace_id, file_id, line_start
);
-- Partial index for exported symbols only
CREATE INDEX idx_symbols_exported ON symbols(name, qualified_name)
WHERE is_exported = TRUE;
-- Composite index for cross-references
CREATE INDEX idx_xref_combined ON cross_references(
    symbol_id, file_id, ref_type
);
4. Query Optimization
-- Use EXPLAIN QUERY PLAN to verify index usage
EXPLAIN QUERY PLAN SELECT * FROM symbols WHERE name = 'foo';
-- Avoid SELECT * in production queries
SELECT id, name, symbol_type, line_start FROM symbols WHERE ...;
-- Use LIMIT for pagination
SELECT * FROM cross_references WHERE symbol_id = ? LIMIT 100 OFFSET 0;
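The two techniques combine: if a query selects only columns contained in the index, `EXPLAIN QUERY PLAN` reports a covering-index search and the table itself is never touched. A sketch, assuming an illustrative `symbols` DDL (the real schema lives in the indexing expansion):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE symbols (
        id INTEGER PRIMARY KEY, name TEXT, symbol_type TEXT,
        workspace_id INTEGER, file_id INTEGER, line_start INTEGER
    )
""")
conn.execute("""
    CREATE INDEX idx_symbols_lookup ON symbols(
        name, symbol_type, workspace_id, file_id, line_start
    )
""")

# All selected columns are in the index, so the plan should be covering.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT name, symbol_type, file_id, line_start "
    "FROM symbols WHERE name = 'foo'"
).fetchall()
detail = plan[0][3]
print(detail)  # e.g. "SEARCH symbols USING COVERING INDEX idx_symbols_lookup (name=?)"
```

The exact wording of the plan detail varies between SQLite versions; checking for the substring `COVERING INDEX` is the portable test.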
5. Batch Operations
# Instead of individual INSERTs
for row in rows:
    cursor.execute("INSERT INTO symbols ...", row)  # SLOW
# Use executemany with transactions
cursor.executemany(
    "INSERT INTO symbols (name, type, file_id) VALUES (?, ?, ?)",
    rows
)
conn.commit()  # Single commit for all rows
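A self-contained version of the batched pattern, using an in-memory database and illustrative data; the single transaction means one journal sync instead of one per row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE symbols (name TEXT, type TEXT, file_id INTEGER)")
rows = [(f"sym_{i}", "function", i % 100) for i in range(10_000)]

with conn:  # one transaction for the whole batch
    conn.executemany(
        "INSERT INTO symbols (name, type, file_id) VALUES (?, ?, ?)", rows
    )

count = conn.execute("SELECT COUNT(*) FROM symbols").fetchone()[0]
print(count)  # 10000
```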
6. Connection Pooling
import sqlite3
import threading
from contextlib import contextmanager

# Thread-local connections for concurrent reads
_local = threading.local()

@contextmanager
def get_connection():
    if not hasattr(_local, 'conn'):
        _local.conn = sqlite3.connect(
            'sessions.db',  # ADR-118: Tier 3
            check_same_thread=False,
            timeout=30
        )
        _local.conn.execute("PRAGMA journal_mode=WAL")
    yield _local.conn
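Usage of the thread-local pattern can be sketched as below. Each thread lazily opens its own connection, so WAL-mode readers never contend; the temp-file path stands in for the real sessions.db:

```python
import os
import sqlite3
import tempfile
import threading
from contextlib import contextmanager

DB_PATH = os.path.join(tempfile.mkdtemp(), "sessions_test.db")  # stand-in path
_local = threading.local()

@contextmanager
def get_connection():
    # One connection per thread; created in the thread that uses it.
    if not hasattr(_local, "conn"):
        _local.conn = sqlite3.connect(DB_PATH, timeout=30)
        _local.conn.execute("PRAGMA journal_mode=WAL")
    yield _local.conn

# Seed some data from the main thread.
with get_connection() as conn:
    conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, body TEXT)")
    conn.execute("INSERT INTO messages (body) VALUES ('hello')")
    conn.commit()

# Read it back concurrently from several threads.
results = []
def reader():
    with get_connection() as conn:
        results.append(conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0])

threads = [threading.Thread(target=reader) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # [1, 1, 1, 1]
```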
When to Consider Alternatives
Stay with SQLite If:
- Database < 50 GB
- Single-user or read-heavy workload
- Zero external dependencies required
- Offline-first operation needed
- Sub-second queries acceptable
- No real-time streaming requirements
Consider Migration If:
- Database > 50 GB
- Multiple concurrent writers needed
- Sub-10ms query requirements
- Native vector search (pgvector) needed
- Full-text search with stemming/synonyms
- Distributed deployment required
Alternative Database Comparison
If Migration Becomes Necessary
| Database | Pros | Cons | When to Use |
|---|---|---|---|
| PostgreSQL + pgvector | Native vectors, mature, powerful | Requires server, complex setup | 50GB+, multi-user, native vectors |
| DuckDB | Analytical queries, columnar | Not designed for OLTP | Heavy analytics, data warehouse |
| LanceDB | Vector-native, embedded | New, limited ecosystem | Pure vector workloads |
| SQLite + sqlite-vec | Vector extension, stays embedded | Extension maintenance | Keep embedded, add vectors |
| Elasticsearch | Fast full-text, scalable | Heavy, requires cluster | Search-primary workload |
Recommended Migration Path
Phase 1 (Current): SQLite 3.x (sessions.db - Tier 3)
↓ (if >50GB or concurrent writes needed)
Phase 2 (Future): PostgreSQL + pgvector
↓ (if massive scale needed)
Phase 3 (Enterprise): Distributed PostgreSQL (Citus)
Implementation Recommendations
Immediate Actions
- Apply pragmas in ensure_schema():

def ensure_schema(conn):
    conn.execute("PRAGMA journal_mode = WAL")
    conn.execute("PRAGMA synchronous = NORMAL")
    conn.execute("PRAGMA mmap_size = 268435456")
    conn.execute("PRAGMA cache_size = -64000")
    conn.execute("PRAGMA temp_store = MEMORY")
- Add database vacuum schedule:
# Weekly maintenance (cron or scheduled task)
sqlite3 ~/PROJECTS/.coditect-data/context-storage/sessions.db "VACUUM; ANALYZE;"
- Monitor query performance: SQLite has no built-in slow-query log, so time queries at the application layer and log any that exceed a threshold (e.g. 1 second).
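Since SQLite itself does not expose a slow-query pragma, a minimal application-level sketch is to wrap query execution with a timer; the threshold and logging destination here are illustrative choices:

```python
import sqlite3
import time

SLOW_QUERY_THRESHOLD_S = 1.0  # illustrative threshold

def timed_query(conn, sql, params=()):
    """Run a query and report it if it exceeds the slow-query threshold."""
    start = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()
    elapsed = time.perf_counter() - start
    if elapsed > SLOW_QUERY_THRESHOLD_S:
        print(f"SLOW ({elapsed:.2f}s): {sql}")
    return rows

conn = sqlite3.connect(":memory:")
rows = timed_query(conn, "SELECT 1")
print(rows)  # [(1,)]
```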
Future Optimizations (If Needed)
- Implement sqlite-vec for native vector operations
- Add read replicas for concurrent heavy read loads
- Implement query caching for repeated expensive queries
- Consider table partitioning by workspace_id
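Query caching for repeated expensive queries can be sketched with functools.lru_cache; this is a deliberately naive illustration, since a real implementation would need cache invalidation on writes (for example, keying entries on SQLite's `PRAGMA data_version` counter):

```python
import sqlite3
from functools import lru_cache

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE code_patterns (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO code_patterns (name) VALUES ('singleton')")

@lru_cache(maxsize=256)
def cached_query(sql, params=()):
    # Arguments must be hashable (string + tuple), which makes them cacheable.
    return conn.execute(sql, params).fetchall()

first = cached_query("SELECT name FROM code_patterns")
second = cached_query("SELECT name FROM code_patterns")  # served from cache
print(cached_query.cache_info().hits)  # 1
```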
Risk Assessment
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Query slowdown at scale | Medium | Medium | Indexing, query optimization |
| Write contention | Low | Low | WAL mode, write queuing |
| Database corruption | Very Low | High | WAL mode, backups |
| Memory exhaustion | Low | Medium | mmap limits, pagination |
| Disk space issues | Low | Low | Monitoring, vacuuming |
Decision
Primary Decision: Continue with SQLite
Rationale:
- Current 791MB database with 1.1M rows performs well
- Projected 3-5GB with 3.8M rows is within SQLite's comfortable range
- Zero external dependencies aligns with offline-first philosophy
- Optimizations can extend capacity 10x before migration needed
- Migration path to PostgreSQL is clear if scale demands
Conditions for Continued Use
- MUST implement WAL mode and memory optimizations
- MUST use covering indexes for frequent queries
- MUST batch write operations
- SHOULD monitor query latency P95
- SHOULD run VACUUM/ANALYZE weekly
Trigger Points for Re-evaluation
- Database size exceeds 20 GB
- P95 query latency exceeds 500ms
- Concurrent writer conflicts occur
- Real-time sync requirements emerge
Appendix: SQLite Performance Tuning Checklist
-- Apply at connection open
PRAGMA journal_mode = WAL;
PRAGMA synchronous = NORMAL;
PRAGMA mmap_size = 268435456;
PRAGMA cache_size = -64000;
PRAGMA temp_store = MEMORY;
PRAGMA foreign_keys = ON;
-- Weekly maintenance
VACUUM;
ANALYZE;
PRAGMA optimize;
-- Verify settings
PRAGMA journal_mode;
PRAGMA mmap_size;
PRAGMA cache_size;
Related ADRs
- ADR-020 - Context Extraction (/cx)
- ADR-021 - Context Query System (/cxq)
- ADR-022 - Codebase Indexing Expansion
- ADR-118 - Four-Tier Database Architecture
Decision Status: APPROVED
Migrated: 2026-02-03 (ADR-150)
Review Date: Q2 2026 (or if trigger points reached)