
ADR-023: SQLite Scalability Analysis for Codebase Intelligence Platform

Status: Accepted with Conditions
Date: 2025-12-18
Updated: 2026-02-03 (Migrated to coditect-core per ADR-150)
Deciders: Hal Casteel (Founder/CEO/CTO), CODITECT Core Team
Technical Story: Evaluate SQLite's capacity to support expanded codebase indexing at enterprise scale


Executive Summary

Question: Can SQLite support the expanded /cxq codebase indexing schema?

Answer: YES, with optimizations. SQLite can handle the projected scale (5GB, 3M+ rows) effectively with proper configuration. Migration to a heavier database is not needed until reaching 50GB+ or requiring concurrent writes from multiple processes.


Current State Analysis

ADR-118 Compliance: Analysis applies to sessions.db (Tier 3 - Regenerable) where codebase index data resides.

Database Metrics (2025-12-18)

| Metric | Value |
|---|---|
| Database size | 791 MB |
| Total rows | ~1.1 million |
| Largest table | code_patterns (645,855 rows) |
| FTS tables | 5 virtual tables |
| Embeddings | 49,595 vectors |

Table Breakdown

| Table | Rows | Purpose |
|---|---|---|
| messages | 76,485 | Session messages |
| messages_fts | 76,485 | Full-text search |
| decisions | 134,153 | Extracted decisions |
| code_patterns | 645,855 | Code pattern library |
| doc_index | 3,880 | Documentation catalog |
| embeddings | 49,595 | Semantic vectors |
| workspace_registry | 3 | Multi-workspace config |

Projected Scale After Expansion

Estimated Growth

| Table | Current | Projected | Growth |
|---|---|---|---|
| source_files | 0 | 15,000 | New |
| symbols | 0 | 500,000 | New |
| dependencies | 0 | 100,000 | New |
| cross_references | 0 | 2,000,000 | New |
| code_embeddings | 0 | 50,000 | New |
| **Total new rows** | 0 | ~2.7M | - |
| **Total all rows** | 1.1M | ~3.8M | 3.5x |

Projected Database Size

| Component | Size Estimate |
|---|---|
| Current database | 791 MB |
| source_files (15K × 0.5 KB) | 7.5 MB |
| symbols (500K × 0.5 KB) | 250 MB |
| dependencies (100K × 0.2 KB) | 20 MB |
| cross_references (2M × 0.15 KB) | 300 MB |
| code_embeddings (50K × 1.5 KB) | 75 MB |
| FTS indexes | ~500 MB |
| B-tree indexes | ~200 MB |
| **Projected total** | **~2.1 GB** |

With growth margin: ~3-5 GB


SQLite Capabilities Assessment

Theoretical Limits

| Limit | SQLite Maximum | Our Projected | Headroom |
|---|---|---|---|
| Database size | 281 TB | 5 GB | 56,000x |
| Rows per table | 2^64 | 2M | Practically unlimited |
| Columns per table | 2,000 | 25 | 80x |
| Table count | Unlimited | 50 | N/A |
| Concurrent readers | Unlimited | ~10 | N/A |
| Index count | Unlimited | ~30 | N/A |

Practical Limits

| Aspect | Limit | Mitigation |
|---|---|---|
| Single writer | 1 process at a time | WAL mode, write queuing |
| Memory usage | ~2x database size | mmap_size pragma, pagination |
| Backup while writing | Blocks writes | WAL checkpoint during idle |
| Complex joins | Can be slow | Denormalization, covering indexes |

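The write-queuing mitigation for the single-writer limit can be sketched as a dedicated writer thread fed by a queue, so worker threads never contend for the write lock. This is a minimal sketch with a hypothetical one-column `symbols` table and a throwaway database path, not the production schema:

```python
import os
import queue
import sqlite3
import tempfile
import threading

# All writes funnel through one dedicated thread, so SQLite's
# single-writer limit never surfaces as lock contention in workers.
db_path = os.path.join(tempfile.mkdtemp(), "queue_demo.db")
write_q = queue.Queue()

def writer():
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode = WAL")
    conn.execute("CREATE TABLE symbols (name TEXT)")
    while True:
        name = write_q.get()
        if name is None:  # sentinel: shut down the writer
            break
        conn.execute("INSERT INTO symbols VALUES (?)", (name,))
        conn.commit()
    conn.close()

t = threading.Thread(target=writer)
t.start()
for name in ["foo", "bar", "baz"]:
    write_q.put(name)  # any thread may enqueue; only one thread writes
write_q.put(None)
t.join()

conn = sqlite3.connect(db_path)
count = conn.execute("SELECT COUNT(*) FROM symbols").fetchone()[0]
print(count)  # 3
```

In a real deployment the sentinel and commit cadence would be tuned (e.g. batching commits), but the shape — many producers, one consumer owning the connection — is the whole idea.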
Performance Benchmarks

Query Performance Expectations

| Query Type | Rows Scanned | Expected Latency | Acceptable? |
|---|---|---|---|
| Symbol by name (indexed) | 1-10 | < 1ms | ✅ |
| FTS search | 500K | 10-50ms | ✅ |
| Cross-reference lookup | 10K | 5-20ms | ✅ |
| Dependency graph (1 level) | 1K | 10-30ms | ✅ |
| Dependency graph (recursive) | 10K | 50-200ms | ✅ |
| Semantic similarity | 50K embeddings | 100-300ms | ✅ |
| Full table scan | 2M | 500ms-2s | ⚠️ Avoid |

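The recursive dependency-graph query benchmarked above can be expressed as a recursive CTE. A minimal sketch, assuming a hypothetical `dependencies(source_id, target_id)` edge table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dependencies (source_id INTEGER, target_id INTEGER)")
# Toy graph: 1 -> 2 -> 3 -> 4, and 1 -> 5
conn.executemany("INSERT INTO dependencies VALUES (?, ?)",
                 [(1, 2), (2, 3), (3, 4), (1, 5)])

# Transitive closure of module 1's dependencies via a recursive CTE;
# UNION (not UNION ALL) deduplicates and terminates on cycles.
rows = conn.execute("""
    WITH RECURSIVE deps(id) AS (
        SELECT target_id FROM dependencies WHERE source_id = 1
        UNION
        SELECT d.target_id FROM dependencies d JOIN deps ON d.source_id = deps.id
    )
    SELECT id FROM deps ORDER BY id
""").fetchall()
print([r[0] for r in rows])  # [2, 3, 4, 5]
```

An index on `dependencies(source_id)` is what keeps the recursive step in the 50-200ms band at the projected 100K-row scale.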
Write Performance

| Operation | Expected Latency |
|---|---|
| Single INSERT | < 1ms |
| Batch INSERT (1,000 rows) | 20-50ms |
| Index update | Automatic |
| Full reindex (15K files) | 2-5 minutes |

Required Optimizations

1. WAL Mode (Write-Ahead Logging)

```sql
PRAGMA journal_mode = WAL;
PRAGMA synchronous = NORMAL;
PRAGMA wal_autocheckpoint = 1000;
```

Benefits:

  • Concurrent reads during writes
  • Faster write performance
  • Crash recovery

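Because `journal_mode` silently stays at its old value if the switch fails (e.g. on certain filesystems), it is worth verifying at connection open. A minimal sketch, using a throwaway path in place of sessions.db:

```python
import os
import sqlite3
import tempfile

# Apply the recommended pragmas, then confirm WAL actually took effect.
db_path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode = WAL")
conn.execute("PRAGMA synchronous = NORMAL")
conn.execute("PRAGMA wal_autocheckpoint = 1000")

# PRAGMA journal_mode with no argument returns the current mode.
mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
print(mode)  # 'wal' — and the setting persists in the database file
```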
2. Memory-Mapped I/O

```sql
PRAGMA mmap_size = 268435456;  -- 256 MB
PRAGMA cache_size = -64000;    -- 64 MB cache
```

Benefits:

  • Reduced disk I/O
  • Faster repeated queries
  • Lower memory pressure

3. Optimized Indexes

```sql
-- Covering indexes for common queries
CREATE INDEX idx_symbols_lookup ON symbols(
    name, symbol_type, workspace_id, file_id, line_start
);

-- Partial index for exported symbols only
CREATE INDEX idx_symbols_exported ON symbols(name, qualified_name)
WHERE is_exported = TRUE;

-- Composite index for cross-references
CREATE INDEX idx_xref_combined ON cross_references(
    symbol_id, file_id, ref_type
);
```

4. Query Optimization

```sql
-- Use EXPLAIN QUERY PLAN to verify index usage
EXPLAIN QUERY PLAN SELECT * FROM symbols WHERE name = 'foo';

-- Avoid SELECT * in production queries
SELECT id, name, symbol_type, line_start FROM symbols WHERE ...;

-- Use LIMIT for pagination
SELECT * FROM cross_references WHERE symbol_id = ? LIMIT 100 OFFSET 0;
```

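The EXPLAIN QUERY PLAN check can also be automated, e.g. as a regression test that fails if a query stops hitting its index. A sketch with a throwaway schema standing in for the real tables:

```python
import sqlite3

# Throwaway schema standing in for the real symbols table
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE symbols (id INTEGER PRIMARY KEY, name TEXT, symbol_type TEXT)"
)
conn.execute("CREATE INDEX idx_symbols_name ON symbols(name)")

# Each plan row's last column is a human-readable step description,
# e.g. "SEARCH symbols USING COVERING INDEX idx_symbols_name (name=?)"
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM symbols WHERE name = ?", ("foo",)
).fetchall()
uses_index = any("idx_symbols_name" in row[-1] for row in plan)
print(uses_index)  # True — the query hits the index, not a full scan
```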
5. Batch Operations

```python
# Instead of individual INSERTs (one commit round-trip per row)
for row in rows:
    cursor.execute("INSERT INTO symbols ...", row)  # SLOW

# Use executemany inside a single transaction
cursor.executemany(
    "INSERT INTO symbols (name, type, file_id) VALUES (?, ?, ?)",
    rows,
)
conn.commit()  # Single commit for all rows
```

6. Connection Pooling

```python
import sqlite3
import threading
from contextlib import contextmanager

# Thread-local connections for concurrent reads
_local = threading.local()

@contextmanager
def get_connection():
    if not hasattr(_local, 'conn'):
        _local.conn = sqlite3.connect(
            'sessions.db',  # ADR-118: Tier 3
            check_same_thread=False,
            timeout=30,
        )
        _local.conn.execute("PRAGMA journal_mode=WAL")
    yield _local.conn
```

Each thread lazily opens its own connection on first use and reuses it thereafter, which is safe under WAL's concurrent-reader model.

When to Consider Alternatives

Stay with SQLite If:

  • Database < 50 GB
  • Single-user or read-heavy workload
  • Zero external dependencies required
  • Offline-first operation needed
  • Sub-second queries acceptable
  • No real-time streaming requirements

Consider Migration If:

  • Database > 50 GB
  • Multiple concurrent writers needed
  • Sub-10ms query requirements
  • Native vector search (pgvector) needed
  • Full-text search with stemming/synonyms
  • Distributed deployment required

Alternative Database Comparison

If Migration Becomes Necessary

| Database | Pros | Cons | When to Use |
|---|---|---|---|
| PostgreSQL + pgvector | Native vectors, mature, powerful | Requires server, complex setup | 50GB+, multi-user, native vectors |
| DuckDB | Analytical queries, columnar | Not designed for OLTP | Heavy analytics, data warehouse |
| LanceDB | Vector-native, embedded | New, limited ecosystem | Pure vector workloads |
| SQLite + sqlite-vec | Vector extension, stays embedded | Extension maintenance | Keep embedded, add vectors |
| Elasticsearch | Fast full-text, scalable | Heavy, requires cluster | Search-primary workload |

```
Phase 1 (Current): SQLite 3.x (sessions.db - Tier 3)
    ↓ (if >50GB or concurrent writes needed)
Phase 2 (Future): PostgreSQL + pgvector
    ↓ (if massive scale needed)
Phase 3 (Enterprise): Distributed PostgreSQL (Citus)
```

Implementation Recommendations

Immediate Actions

1. Apply pragmas in ensure_schema():

```python
def ensure_schema(conn):
    conn.execute("PRAGMA journal_mode = WAL")
    conn.execute("PRAGMA synchronous = NORMAL")
    conn.execute("PRAGMA mmap_size = 268435456")
    conn.execute("PRAGMA cache_size = -64000")
    conn.execute("PRAGMA temp_store = MEMORY")
```

2. Add a database vacuum schedule:

```bash
# Weekly maintenance (cron or scheduled task)
sqlite3 ~/PROJECTS/.coditect-data/context-storage/sessions.db "VACUUM; ANALYZE;"
```

3. Monitor query performance. SQLite has no built-in slow-query log, so time queries at the application level:

```python
import logging
import time

# Log queries slower than 1 second
start = time.perf_counter()
cursor.execute(query, params)
elapsed = time.perf_counter() - start
if elapsed > 1.0:
    logging.warning("Slow query (%.2fs): %s", elapsed, query)
```

Future Optimizations (If Needed)

  1. Implement sqlite-vec for native vector operations
  2. Add read replicas for concurrent heavy read loads
  3. Implement query caching for repeated expensive queries
  4. Consider table partitioning by workspace_id

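Query caching (item 3 above) could be sketched with `functools.lru_cache`, under the assumption that cached results are explicitly invalidated with `cache_clear()` after each reindex; the table and function names here are illustrative:

```python
import functools
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE symbols (name TEXT, symbol_type TEXT)")
conn.execute("INSERT INTO symbols VALUES ('foo', 'function')")

@functools.lru_cache(maxsize=1024)
def symbol_types(name):
    # Cached lookup; call symbol_types.cache_clear() after a reindex
    return tuple(row[0] for row in conn.execute(
        "SELECT symbol_type FROM symbols WHERE name = ?", (name,)))

first = symbol_types("foo")   # hits the database
second = symbol_types("foo")  # served from the cache
print(first)  # ('function',)
print(symbol_types.cache_info().hits)  # 1
```

Results must be hashable and immutable (hence the tuple), and the cache is per-process, which fits the single-process deployment assumed throughout this ADR.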
Risk Assessment

| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Query slowdown at scale | Medium | Medium | Indexing, query optimization |
| Write contention | Low | Low | WAL mode, write queuing |
| Database corruption | Very Low | High | WAL mode, backups |
| Memory exhaustion | Low | Medium | mmap limits, pagination |
| Disk space issues | Low | Low | Monitoring, vacuuming |

Decision

Primary Decision: Continue with SQLite

Rationale:

  1. Current 791MB database with 1.1M rows performs well
  2. Projected 3-5GB with 3.8M rows is within SQLite's comfortable range
  3. Zero external dependencies aligns with offline-first philosophy
  4. Optimizations can extend capacity 10x before migration needed
  5. Migration path to PostgreSQL is clear if scale demands

Conditions for Continued Use

  1. MUST implement WAL mode and memory optimizations
  2. MUST use covering indexes for frequent queries
  3. MUST batch write operations
  4. SHOULD monitor query latency P95
  5. SHOULD run VACUUM/ANALYZE weekly

Trigger Points for Re-evaluation

  • Database size exceeds 20 GB
  • P95 query latency exceeds 500ms
  • Concurrent writer conflicts occur
  • Real-time sync requirements emerge

Appendix: SQLite Performance Tuning Checklist

```sql
-- Apply at connection open
PRAGMA journal_mode = WAL;
PRAGMA synchronous = NORMAL;
PRAGMA mmap_size = 268435456;
PRAGMA cache_size = -64000;
PRAGMA temp_store = MEMORY;
PRAGMA foreign_keys = ON;

-- Weekly maintenance
VACUUM;
ANALYZE;
PRAGMA optimize;

-- Verify settings
PRAGMA journal_mode;
PRAGMA mmap_size;
PRAGMA cache_size;
```

Related ADRs

  • ADR-020 - Context Extraction (/cx)
  • ADR-021 - Context Query System (/cxq)
  • ADR-022 - Codebase Indexing Expansion
  • ADR-118 - Four-Tier Database Architecture

Decision Status: APPROVED
Migrated: 2026-02-03 (ADR-150)
Review Date: Q2 2026 (or if trigger points reached)