
Context Database Improvement Recommendations

Generated: December 29, 2025
Database Version: sessions.db v1.0
Current Size: 1,603 MB (after deduplication)


Executive Summary

The context database is functional but underutilized. Key findings:

| Metric | Current | Potential | Status |
|---|---|---|---|
| Active tables | 30/56 (54%) | 56/56 (100%) | Pending |
| Semantic search | Disabled | Full RAG capability | Pending |
| Knowledge graph | Empty | Entity-relationship extraction | Pending |
| Learning system | Empty | User progress tracking | Pending |
| Automation | Manual triggers | Scheduled + event-driven | In Progress |
| Deduplication | UNIQUE constraints added | Prevention at ingestion | ✅ DONE |
| Query views | 10 views | 19 views | ✅ DONE |
| Session insights | 1,163 sessions analyzed | All sessions | ✅ DONE |

Priority Recommendations:

  1. P0: Enable semantic embeddings for RAG search
  2. P1: Implement knowledge graph extraction
  3. P1: Add ingestion-time deduplication
  4. P2: Activate learning system
  5. P2: Schedule automated maintenance


1. Content Gaps (What's Missing)

1.1 Empty Tables Analysis

| Table | Purpose | Status | Action |
|---|---|---|---|
| doc_embeddings | Semantic document search | Empty (0 rows) | Enable embedding generation |
| entries_embeddings | Semantic entry search | Empty (0 rows) | Enable embedding generation |
| knowledge_entities | Entity extraction | Empty (0 rows) | Implement NER pipeline |
| knowledge_graph | Entity relationships | Empty (0 rows) | Implement relation extraction |
| file_backup_entries | File version tracking | Empty (0 rows) | Enable file backup hook |
| learning_* (4 tables) | User learning progress | Empty (0 rows) | Activate learning system |

1.2 Underutilized Tables

| Table | Rows | Expected Usage | Gap |
|---|---|---|---|
| session_insights | 5,753 | 1 per session | ✅ DONE, 1,163 sessions analyzed |
| project_summaries | 1 | 1 per project (~5+) | Only one project tracked |
| file_snapshots | 1 | Many per session | File tracking disabled |
| workspace_registry | 3 | All workspaces (~10+) | Manual registration only |

1.3 Missing Data Types

Currently not captured:

  • Git commit metadata (author, branch, diff stats)
  • Tool usage patterns (which tools, success rates)
  • User preferences and settings
  • External API call logs
  • Performance metrics (response times)

2. Ingestion Pipeline Improvements

2.1 Current Ingestion Points

| Script | Tables Updated | Trigger |
|---|---|---|
| unified-message-extractor.py | entries, messages | Manual (/cx) |
| context-db.py | doc_index, decisions, etc. | Manual (/cxq) |
| component-indexer.py | components, capabilities | Manual |
| component-frontmatter-indexer.py | component_frontmatter | Manual |

2.2 Proposed Ingestion Additions

A. Automatic Session Capture (Hook-based)

```python
# hooks/post-message-capture.py
# Trigger: After each Claude message
# Action: Extract and store in real-time
def on_message(message):
    msg_hash = compute_hash(message)
    if not exists_in_db(msg_hash):
        store_message(message)
        extract_decisions(message)
        extract_code_patterns(message)
```
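The `compute_hash` helper is not defined in the sketch above; a minimal, hedged implementation would hash SHA-256 over whitespace-normalized content, so trivial reformatting does not defeat deduplication:

```python
import hashlib

def compute_hash(content: str) -> str:
    """Stable content hash for deduplication: collapse runs of
    whitespace before hashing so formatting-only variants of the
    same message hash identically."""
    normalized = " ".join(content.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

The 64-character hex digest fits the existing `hash` columns that the UNIQUE indexes cover.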

B. Semantic Embedding Pipeline

```python
# scripts/generate-embeddings.py
# Trigger: After doc_index updates
# Action: Generate embeddings for semantic search
def generate_embeddings():
    unembedded = get_docs_without_embeddings()
    for doc in unembedded:
        embedding = embed(doc.content)
        store_embedding(doc.id, embedding)
```

C. Knowledge Graph Extraction

```python
# scripts/extract-knowledge-graph.py
# Trigger: After decisions/patterns update
# Action: Extract entities and relationships
def extract_knowledge():
    for decision in new_decisions():
        entities = extract_entities(decision.text)
        relations = extract_relations(decision.text)
        store_graph(entities, relations)
```
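`extract_entities` and `extract_relations` are placeholders for a real NER pipeline (e.g. spaCy). A deliberately naive, dependency-free sketch showing only the shape of their output:

```python
import re
from itertools import combinations

def extract_entities(text: str):
    # Naive stand-in for a real NER model: treat capitalized
    # tokens as candidate entities. A production pipeline would
    # use a trained model instead.
    return sorted(set(re.findall(r"\b[A-Z][A-Za-z0-9_]+\b", text)))

def extract_relations(text: str):
    # Co-occurrence within one decision text becomes a generic
    # "related_to" edge; real relation extraction would classify
    # the edge type.
    ents = extract_entities(text)
    return [(a, "related_to", b) for a, b in combinations(ents, 2)]
```

Even this crude version would populate knowledge_entities and knowledge_graph with queryable rows while a proper NER model is evaluated.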

2.3 Ingestion-Time Deduplication

Problem: Current deduplication is post-hoc (after duplicates exist).

Solution: Check before insert:

```sql
-- Pattern: INSERT OR IGNORE with UNIQUE constraint
INSERT OR IGNORE INTO decisions (decision, type, project, confidence)
VALUES (?, ?, ?, ?);

-- Pattern: UPSERT (update if exists)
INSERT INTO code_patterns (code, name, language, type, usage_count)
VALUES (?, ?, ?, ?, 1)
ON CONFLICT(code) DO UPDATE SET
    usage_count = usage_count + 1,
    last_used = datetime('now');
```

Required Changes:

  1. Add UNIQUE constraints to all content-hash columns ✅ DONE (Dec 29, 2025)
    • idx_messages_unique_hash on messages(hash)
    • idx_entries_unique_hash on entries(hash)
    • idx_error_solutions_unique_hash on error_solutions(error_hash)
    • idx_decisions_unique on decisions(decision)
    • idx_patterns_unique_code on code_patterns(code)
    • idx_doc_unique_path on doc_index(file_path)
  2. Modify ingestion scripts to use INSERT OR IGNORE
  3. Add upsert logic for counters (usage_count, access_count)
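Both SQL patterns can be exercised end to end from Python. A minimal sketch against an in-memory database; the column sets here are illustrative, not the full schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE decisions (decision TEXT UNIQUE, confidence REAL)")
conn.execute("CREATE TABLE code_patterns (code TEXT UNIQUE, usage_count INTEGER)")

# INSERT OR IGNORE: the second identical insert is silently skipped.
for _ in range(2):
    conn.execute(
        "INSERT OR IGNORE INTO decisions VALUES (?, ?)",
        ("use chroma for vector search", 0.9),
    )

# UPSERT: re-ingesting the same pattern bumps its counter instead
# of creating a duplicate row.
for _ in range(3):
    conn.execute(
        "INSERT INTO code_patterns VALUES (?, 1) "
        "ON CONFLICT(code) DO UPDATE SET usage_count = usage_count + 1",
        ("def retry(fn): ...",),
    )

n_decisions = conn.execute("SELECT COUNT(*) FROM decisions").fetchone()[0]
count = conn.execute("SELECT usage_count FROM code_patterns").fetchone()[0]
print(n_decisions, count)  # 1 3
```

This is the behavior the ingestion scripts should rely on once the UNIQUE constraints are in place: duplicates never enter the database, so post-hoc cleanup becomes unnecessary.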

3. Indexing Improvements

3.1 Current Index Coverage

| Table | Indexes | Coverage | Gap |
|---|---|---|---|
| messages | 6 | Good | Missing: content FTS |
| entries | 7 | Good | Missing: content FTS |
| decisions | 6 | Good | None |
| code_patterns | 7 | Good | None |
| doc_index | 10 | Excellent | None |
| components | 5 | Moderate | Missing: full-text name/description |

3.2 Index Verification and Additions

A. Full-Text Search Indexes (already exist; verify populated)

```sql
-- Verify FTS tables are populated
SELECT COUNT(*) FROM messages_fts;     -- Should match messages
SELECT COUNT(*) FROM decisions_fts;    -- Should match decisions
SELECT COUNT(*) FROM doc_search;       -- Should match doc_index
SELECT COUNT(*) FROM component_search; -- Should match components
```
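If an FTS table has drifted out of sync with its content table, FTS5's `'rebuild'` command repopulates it in one statement. A self-contained sketch (the `messages` schema here is a simplified stand-in, and FTS5 must be compiled in, as it is in standard Python builds):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, content TEXT)")
# External-content FTS5 table mirroring messages(content).
conn.execute(
    "CREATE VIRTUAL TABLE messages_fts USING fts5("
    "content, content='messages', content_rowid='id')"
)
conn.execute("INSERT INTO messages (content) VALUES ('enable semantic search')")
conn.execute("INSERT INTO messages (content) VALUES ('fix backup cron job')")

# Repopulate the full-text index from the content table.
conn.execute("INSERT INTO messages_fts(messages_fts) VALUES ('rebuild')")

hits = conn.execute(
    "SELECT content FROM messages_fts WHERE messages_fts MATCH 'semantic'"
).fetchall()
print(hits)  # [('enable semantic search',)]
```

Running the same `'rebuild'` against the real FTS tables is a cheap way to fix any count mismatch the verification queries above reveal.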

B. Composite Indexes for Common Queries ✅ DONE (Dec 29, 2025)

```sql
-- Frequently used query pattern: recent by role and session
CREATE INDEX IF NOT EXISTS idx_entries_session_role_time
ON entries(session_id, role, timestamp DESC);

-- Query pattern: components by type and status
CREATE INDEX IF NOT EXISTS idx_components_type_status
ON components(type, status);

-- Query pattern: docs by workspace and type
CREATE INDEX IF NOT EXISTS idx_doc_workspace_type
ON doc_index(workspace_id, doc_type);
```

C. Partial Indexes (for filtered queries) ✅ DONE (Dec 29, 2025)

```sql
-- Only index active components (538 of 1937)
CREATE INDEX IF NOT EXISTS idx_components_active
ON components(type, name) WHERE status = 'active';

-- High-confidence decisions (skipped - all decisions at 0.4 confidence)
-- CREATE INDEX IF NOT EXISTS idx_decisions_high_conf
-- ON decisions(type, project) WHERE confidence >= 0.8;
```

3.3 Semantic Search Indexes

Current: doc_embeddings and entries_embeddings are empty.

Action: Implement vector similarity search:

  • Option 1: SQLite with manual cosine similarity (slow for large datasets)
  • Option 2: sqlite-vec extension (requires compilation)
  • Option 3: External vector DB (Chroma, Qdrant) with SQLite metadata

Recommended: Option 3. A Chroma integration already exists in scripts/core/chromadb_setup.py.
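For Option 1, cosine similarity can be computed in Python over embeddings stored as BLOBs. A minimal sketch; the two-dimensional vectors and the `doc_embeddings` columns are illustrative, and real embeddings would have hundreds of dimensions:

```python
import math
import sqlite3
import struct

def pack(vec):
    # Store a float vector as a compact little-endian BLOB.
    return struct.pack(f"{len(vec)}f", *vec)

def unpack(blob):
    return struct.unpack(f"{len(blob) // 4}f", blob)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE doc_embeddings (doc_id INTEGER, embedding BLOB)")
conn.executemany(
    "INSERT INTO doc_embeddings VALUES (?, ?)",
    [(1, pack([1.0, 0.0])), (2, pack([0.0, 1.0])), (3, pack([0.9, 0.1]))],
)

query = [1.0, 0.0]
ranked = sorted(
    ((doc_id, cosine(query, unpack(blob)))
     for doc_id, blob in conn.execute("SELECT doc_id, embedding FROM doc_embeddings")),
    key=lambda t: t[1],
    reverse=True,
)
print(ranked[0][0])  # doc 1 is the closest match
```

This brute-force scan is O(n) per query, which is why Options 2 and 3 matter once the embedding tables grow past a few tens of thousands of rows.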

4. Automation Recommendations

4.1 Scheduled Tasks

| Task | Frequency | Script | Priority |
|---|---|---|---|
| Backup to GCS | Daily | backup-context-db.sh | P0 |
| Integrity check | Weekly | PRAGMA integrity_check | P0 |
| Statistics update | Weekly | ANALYZE | P1 |
| Orphan cleanup | Weekly | delete-orphans.py | P1 |
| Vacuum | Monthly | VACUUM | P2 |

Cron configuration:

```cron
# /etc/cron.d/coditect-context-db
# Note: files in /etc/cron.d require a user field (root here).
0 2 * * * root /path/to/backup-context-db.sh                 # Daily, 2 AM
0 3 * * 0 root sqlite3 sessions.db "PRAGMA integrity_check"  # Weekly, Sunday 3 AM
0 3 * * 1 root sqlite3 sessions.db "ANALYZE"                 # Weekly, Monday 3 AM
0 4 1 * * root sqlite3 sessions.db "VACUUM"                  # Monthly, 1st at 4 AM
```

4.2 Event-Driven Automation

| Event | Trigger | Action |
|---|---|---|
| Session end | /export or session close | Extract and index |
| Document save | File save in workspace | Re-index doc_index |
| Component change | Agent/command/skill modified | Update component_frontmatter |
| High-value decision | Confidence >= 0.9 | Add to knowledge graph |

Hook implementation:

```bash
#!/bin/bash
# hooks/post-session.sh (already exists; enhance)
# After session ends:
#   1. Run /cx to extract messages
#   2. Update doc_index
#   3. Generate embeddings
#   4. Extract knowledge graph

python3 scripts/unified-message-extractor.py
python3 scripts/context-db.py index-docs
python3 scripts/generate-embeddings.py --incremental
python3 scripts/extract-knowledge-graph.py --incremental
```

4.3 Maintenance Automation Script

```bash
#!/bin/bash
# scripts/db-maintenance.sh

DB_PATH="context-storage/sessions.db"

echo "=== Context Database Maintenance ==="
echo "Started: $(date)"

# 1. Integrity check
echo "Checking integrity..."
sqlite3 "$DB_PATH" "PRAGMA integrity_check" | grep -q "ok" || {
    echo "ERROR: Database integrity check failed!"
    exit 1
}

# 2. Update statistics
echo "Updating statistics..."
sqlite3 "$DB_PATH" "ANALYZE"

# 3. Clean orphans
echo "Cleaning orphaned records..."
sqlite3 "$DB_PATH" "
DELETE FROM doc_embeddings WHERE doc_id NOT IN (SELECT id FROM doc_index);
DELETE FROM entries_embeddings WHERE entry_id NOT IN (SELECT id FROM entries);
DELETE FROM summaries WHERE entry_id NOT IN (SELECT id FROM entries);
"

# 4. Report sizes via the dbstat virtual table
# (requires SQLITE_ENABLE_DBSTAT_VTAB, enabled in the stock sqlite3 CLI)
echo "Table sizes:"
sqlite3 "$DB_PATH" "
SELECT name,
       SUM(pgsize) / 1024.0 / 1024.0 AS size_mb
FROM dbstat
GROUP BY name
ORDER BY size_mb DESC
LIMIT 10;
"

echo "Completed: $(date)"
```

5. Query Interface Improvements

5.1 Current Query Methods

| Method | Interface | Limitation |
|---|---|---|
| /cxq | CLI command | Limited query types |
| context-db.py | Python script | Requires command line |
| Direct SQLite | sqlite3 shell | Requires SQL knowledge |

5.2 Proposed Enhancements

A. Natural Language Query Interface

```bash
# Current: SQL-based
/cxq --sql "SELECT * FROM decisions WHERE confidence > 0.9"

# Proposed: Natural language
/cxq "show me high-confidence decisions from last week"
/cxq "what errors have I solved before?"
/cxq "find code patterns for authentication"
```
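A first cut of the natural-language mode could simply route keywords to canned SQL templates. The column names in this sketch (`timestamp`, `last_seen`) are assumptions, and a production version would need real parsing or an LLM:

```python
import re

# Hypothetical keyword-to-template router; each regex maps a
# question topic to a canned, safe SQL query.
TEMPLATES = {
    r"\bdecisions?\b": "SELECT * FROM decisions ORDER BY timestamp DESC LIMIT 20",
    r"\berrors?\b": "SELECT * FROM error_solutions ORDER BY last_seen DESC LIMIT 20",
    r"\bpatterns?\b": "SELECT * FROM code_patterns ORDER BY usage_count DESC LIMIT 20",
}

def route(question):
    """Return a SQL template for the question, or None if no
    keyword matches (caller can then fall back to --sql mode)."""
    q = question.lower()
    for keyword, sql in TEMPLATES.items():
        if re.search(keyword, q):
            return sql
    return None

print(route("what errors have I solved before?"))
```

Because every template is fixed SQL, this approach sidesteps injection risks that naive text-to-SQL generation would introduce.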

B. Saved/Named Queries ✅ IMPLEMENTED (Dec 29, 2025)

9 new views created for instant queries (8 initial plus sessions_by_topic):

| View | Purpose | Query |
|---|---|---|
| knowledge_stats | Row counts for all knowledge tables | /cxq --knowledge-stats |
| today_activity | Last 24h activity | /cxq --today-activity |
| recent_decisions | High-confidence decisions (7 days) | /cxq --view recent_decisions |
| error_frequency | Most common errors | /cxq --error-frequency |
| patterns_by_language | Code patterns by language | /cxq --patterns-summary |
| session_activity | Session durations | /cxq --session-summary |
| component_summary | Components by type/status | /cxq --component-overview |
| docs_by_category | Documentation by category | /cxq --docs-summary |
| sessions_by_topic | Session insights by topic | SELECT * FROM sessions_by_topic WHERE topic='deployment' |
```bash
# Example: Query any view directly
sqlite3 context-storage/sessions.db "SELECT * FROM knowledge_stats"
sqlite3 context-storage/sessions.db "SELECT * FROM error_frequency LIMIT 10"

# Query sessions by topic (new)
sqlite3 context-storage/sessions.db "SELECT * FROM sessions_by_topic WHERE topic='authentication'"
sqlite3 context-storage/sessions.db "SELECT * FROM sessions_by_topic WHERE topic='deployment' ORDER BY hours DESC"
```

C. Query Result Formatting

```bash
# Current: Raw output
/cxq --decisions

# Proposed: Multiple formats
/cxq --decisions --format table     # ASCII table
/cxq --decisions --format json      # JSON output
/cxq --decisions --format markdown  # Markdown table
/cxq --decisions --format csv       # CSV export
```
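A sketch of how the format switch might be implemented, assuming query rows arrive as tuples alongside a header list (the function name and defaults are illustrative):

```python
import csv
import io
import json

def format_rows(rows, headers, fmt="markdown"):
    """Render query results as markdown (default), JSON, or CSV."""
    if fmt == "json":
        return json.dumps([dict(zip(headers, r)) for r in rows])
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(headers)
        writer.writerows(rows)
        return buf.getvalue()
    # Markdown table: header row, separator row, then data rows.
    lines = ["| " + " | ".join(headers) + " |",
             "|" + "---|" * len(headers)]
    lines += ["| " + " | ".join(str(c) for c in r) + " |" for r in rows]
    return "\n".join(lines)

print(format_rows([("auth", 0.9)], ["decision", "confidence"]))
```

Keeping formatting in one function means every /cxq subcommand gains all four output modes at once.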

5.3 Performance Query Hints

```sql
-- Use EXPLAIN QUERY PLAN to verify index usage
EXPLAIN QUERY PLAN
SELECT * FROM entries WHERE session_id = 'abc' AND role = 'assistant';
-- Should show: USING INDEX idx_entries_session

-- Common slow query patterns to avoid:
-- AVOID: LIKE '%pattern%' (cannot use an index)
-- USE: full-text search instead
SELECT * FROM decisions_fts WHERE decisions_fts MATCH 'pattern';

-- AVOID: complex JOINs without LIMIT
-- USE: a subquery with LIMIT first
SELECT * FROM entries e
JOIN (SELECT id FROM messages ORDER BY timestamp DESC LIMIT 100) m
  ON e.id = m.id;
```
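Index usage can also be checked programmatically, which makes it testable in CI. A sketch with a minimal stand-in schema (the real idx_entries_session may cover more columns):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries (id INTEGER, session_id TEXT, role TEXT, timestamp TEXT)"
)
conn.execute("CREATE INDEX idx_entries_session ON entries(session_id, role)")

# EXPLAIN QUERY PLAN rows end with a human-readable detail string.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM entries WHERE session_id = 'abc' AND role = 'assistant'"
).fetchall()
detail = plan[0][-1]
print(detail)  # e.g. "SEARCH entries USING INDEX idx_entries_session (...)"
```

Asserting that the detail string names the expected index catches silent regressions when a schema change or query rewrite stops an index from being used.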

6. Implementation Roadmap

Phase 1: Foundation (Week 1) - ✅ COMPLETE

  • Add remaining UNIQUE constraints (Dec 29, 2025)
  • Create db-maintenance.sh script (Dec 29, 2025)
  • Set up daily backup cron job
  • Create common query views (8 new views) (Dec 29, 2025)
  • Add view query commands to /cxq (Dec 29, 2025 - in progress)

Phase 2: Semantic Search (Week 2-3)

  • Populate doc_embeddings table
  • Populate entries_embeddings table
  • Integrate Chroma for vector search
  • Add /cxq --semantic query mode

Phase 3: Knowledge Graph (Week 4)

  • Implement entity extraction pipeline
  • Populate knowledge_entities
  • Implement relation extraction
  • Populate knowledge_graph

Phase 4: Automation (Week 5)

  • Add post-session hook for extraction
  • Add file-save hook for doc indexing
  • Implement event-driven triggers
  • Add monitoring and alerting

Phase 5: Query Enhancements (Week 6)

  • Natural language query parsing
  • Saved query management (views created) (Dec 29, 2025)
  • Multiple output formats
  • Query performance dashboard

7. Metrics to Track

| Metric | Current | Target | Tracking |
|---|---|---|---|
| Database size | 1,603 MB | < 2 GB | du -h sessions.db |
| Unique messages | 92,924 | Growing | SELECT COUNT(*) FROM messages |
| Query response time | Unknown | < 100 ms | Query timing |
| Index hit rate | 100% | > 90% | EXPLAIN QUERY PLAN verified |
| Backup success rate | Unknown | 100% | Backup logs |
| Duplicate rate | 0% | 0% | UNIQUE constraints prevent duplicates |
| Query views | 19 | 20+ | SELECT COUNT(*) FROM sqlite_master WHERE type='view' |
| UNIQUE constraints | 6 | 6 | All major tables protected |
| Total indexes | 164 | 160+ | Composite + partial indexes added |
| Session insights | 5,753 | All sessions | 1,163 sessions analyzed |
| Hours tracked | 1,884 | Growing | Total development hours indexed |


Author: CODITECT Framework
Review Status: Phase 1 complete, actively implementing
Last Updated: December 29, 2025
Next Review: January 2026


Changelog

| Date | Change |
|---|---|
| 2025-12-29 | Added sessions_by_topic view (19 views total) |
| 2025-12-29 | Generated session insights: 5,753 rows, 1,163 sessions, 1,884 hours tracked |
| 2025-12-29 | Added generate-session-insights.py script |
| 2025-12-29 | Added 4 performance indexes: 3 composite + 1 partial (164 total) |
| 2025-12-29 | Added /cxq view commands (--views, --today-activity, etc.) |
| 2025-12-29 | Phase 1 complete: UNIQUE constraints, views, maintenance script |
| 2025-12-29 | Initial document created |