
Context Database Improvement Recommendations

Generated: December 29, 2025
Database Version: sessions.db v1.0
Current Size: 1,603 MB (after deduplication)


Executive Summary

The context database is functional but underutilized. Key findings:

| Metric | Current | Potential | Status |
|---|---|---|---|
| Active tables | 30/56 (54%) | 56/56 (100%) | Pending |
| Semantic search | Disabled | Full RAG capability | Pending |
| Knowledge graph | Empty | Entity-relationship extraction | Pending |
| Learning system | Empty | User progress tracking | Pending |
| Automation | Manual triggers | Scheduled + event-driven | In Progress |
| Deduplication | UNIQUE constraints added | Prevention at ingestion | ✅ DONE |
| Query views | 10 views | 19 views | ✅ DONE |
| Session insights | 1,163 sessions analyzed | All sessions | ✅ DONE |

Priority Recommendations:

  1. P0: Enable semantic embeddings for RAG search
  2. P1: Implement knowledge graph extraction
  3. P1: Add ingestion-time deduplication
  4. P2: Activate learning system
  5. P2: Schedule automated maintenance


1. Content Gaps (What's Missing)

1.1 Empty Tables Analysis

| Table | Purpose | Status | Action |
|---|---|---|---|
| doc_embeddings | Semantic document search | Empty (0 rows) | Enable embedding generation |
| entries_embeddings | Semantic entry search | Empty (0 rows) | Enable embedding generation |
| knowledge_entities | Entity extraction | Empty (0 rows) | Implement NER pipeline |
| knowledge_graph | Entity relationships | Empty (0 rows) | Implement relation extraction |
| file_backup_entries | File version tracking | Empty (0 rows) | Enable file backup hook |
| learning_* (4 tables) | User learning progress | Empty (0 rows) | Activate learning system |

1.2 Underutilized Tables

| Table | Rows | Expected Usage | Gap |
|---|---|---|---|
| session_insights | 5,753 | 1 per session | ✅ DONE, 1,163 sessions analyzed |
| project_summaries | 1 | 1 per project (~5+) | Only one project tracked |
| file_snapshots | 1 | Many per session | File tracking disabled |
| workspace_registry | 3 | All workspaces (~10+) | Manual registration only |

1.3 Missing Data Types

Currently not captured:

  • Git commit metadata (author, branch, diff stats)
  • Tool usage patterns (which tools, success rates)
  • User preferences and settings
  • External API call logs
  • Performance metrics (response times)

2. Ingestion Pipeline Improvements

2.1 Current Ingestion Points

| Script | Tables Updated | Trigger |
|---|---|---|
| unified-message-extractor.py | entries, messages | Manual (/cx) |
| context-db.py | doc_index, decisions, etc. | Manual (/cxq) |
| component-indexer.py | components, capabilities | Manual |
| component-frontmatter-indexer.py | component_frontmatter | Manual |

2.2 Proposed Ingestion Additions

A. Automatic Session Capture (Hook-based)

```python
# hooks/post-message-capture.py
# Trigger: After each Claude message
# Action: Extract and store in real-time
def on_message(message):
    msg_hash = compute_hash(message)
    if not exists_in_db(msg_hash):
        store_message(message)
        extract_decisions(message)
        extract_code_patterns(message)
```
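The `compute_hash` helper is not defined in the sketch above; a minimal, hedged implementation would hash SHA-256 over whitespace-normalized content, so trivial reformatting does not defeat deduplication:

```python
import hashlib

def compute_hash(content: str) -> str:
    """Stable content hash for deduplication: collapse runs of
    whitespace before hashing so formatting-only variants of the
    same message hash identically."""
    normalized = " ".join(content.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

The 64-character hex digest fits the existing `hash` columns that the UNIQUE indexes cover.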

B. Semantic Embedding Pipeline

```python
# scripts/generate-embeddings.py
# Trigger: After doc_index updates
# Action: Generate embeddings for semantic search
def generate_embeddings():
    unembedded = get_docs_without_embeddings()
    for doc in unembedded:
        embedding = embed(doc.content)
        store_embedding(doc.id, embedding)
```

C. Knowledge Graph Extraction

```python
# scripts/extract-knowledge-graph.py
# Trigger: After decisions/patterns update
# Action: Extract entities and relationships
def extract_knowledge():
    for decision in new_decisions():
        entities = extract_entities(decision.text)
        relations = extract_relations(decision.text)
        store_graph(entities, relations)
```
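`extract_entities` and `extract_relations` are placeholders for a real NER pipeline (e.g. spaCy). A deliberately naive, dependency-free sketch showing only the shape of their output:

```python
import re
from itertools import combinations

def extract_entities(text: str):
    # Naive stand-in for a real NER model: treat capitalized
    # tokens as candidate entities. A production pipeline would
    # use a trained model instead.
    return sorted(set(re.findall(r"\b[A-Z][A-Za-z0-9_]+\b", text)))

def extract_relations(text: str):
    # Co-occurrence within one decision text becomes a generic
    # "related_to" edge; real relation extraction would classify
    # the edge type.
    ents = extract_entities(text)
    return [(a, "related_to", b) for a, b in combinations(ents, 2)]
```

Even this crude version would populate knowledge_entities and knowledge_graph with queryable rows while a proper NER model is evaluated.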

2.3 Ingestion-Time Deduplication

Problem: Current deduplication is post-hoc (after duplicates exist).

Solution: Check before insert:

```sql
-- Pattern: INSERT OR IGNORE with UNIQUE constraint
INSERT OR IGNORE INTO decisions (decision, type, project, confidence)
VALUES (?, ?, ?, ?);

-- Pattern: UPSERT (update if exists)
INSERT INTO code_patterns (code, name, language, type, usage_count)
VALUES (?, ?, ?, ?, 1)
ON CONFLICT(code) DO UPDATE SET
    usage_count = usage_count + 1,
    last_used = datetime('now');
```

Required Changes:

  1. Add UNIQUE constraints to all content-hash columns ✅ DONE (Dec 29, 2025)
    • idx_messages_unique_hash on messages(hash)
    • idx_entries_unique_hash on entries(hash)
    • idx_error_solutions_unique_hash on error_solutions(error_hash)
    • idx_decisions_unique on decisions(decision)
    • idx_patterns_unique_code on code_patterns(code)
    • idx_doc_unique_path on doc_index(file_path)
  2. Modify ingestion scripts to use INSERT OR IGNORE
  3. Add upsert logic for counters (usage_count, access_count)
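Both SQL patterns can be exercised end to end from Python. A minimal sketch against an in-memory database; the column sets here are illustrative, not the full schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE decisions (decision TEXT UNIQUE, confidence REAL)")
conn.execute("CREATE TABLE code_patterns (code TEXT UNIQUE, usage_count INTEGER)")

# INSERT OR IGNORE: the second identical insert is silently skipped.
for _ in range(2):
    conn.execute(
        "INSERT OR IGNORE INTO decisions VALUES (?, ?)",
        ("use chroma for vector search", 0.9),
    )

# UPSERT: re-ingesting the same pattern bumps its counter instead
# of creating a duplicate row.
for _ in range(3):
    conn.execute(
        "INSERT INTO code_patterns VALUES (?, 1) "
        "ON CONFLICT(code) DO UPDATE SET usage_count = usage_count + 1",
        ("def retry(fn): ...",),
    )

n_decisions = conn.execute("SELECT COUNT(*) FROM decisions").fetchone()[0]
count = conn.execute("SELECT usage_count FROM code_patterns").fetchone()[0]
print(n_decisions, count)  # 1 3
```

This is the behavior the ingestion scripts should rely on once the UNIQUE constraints are in place: duplicates never enter the database, so post-hoc cleanup becomes unnecessary.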

3. Indexing Improvements

3.1 Current Index Coverage

| Table | Indexes | Coverage | Gap |
|---|---|---|---|
| messages | 6 | Good | Missing: content FTS |
| entries | 7 | Good | Missing: content FTS |
| decisions | 6 | Good | None |
| code_patterns | 7 | Good | None |
| doc_index | 10 | Excellent | None |
| components | 5 | Moderate | Missing: full-text name/description |

3.2 Index Verification and Additions

A. Full-Text Search Indexes (already exist; verify populated)

```sql
-- Verify FTS tables are populated
SELECT COUNT(*) FROM messages_fts;     -- Should match messages
SELECT COUNT(*) FROM decisions_fts;    -- Should match decisions
SELECT COUNT(*) FROM doc_search;       -- Should match doc_index
SELECT COUNT(*) FROM component_search; -- Should match components
```
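If an FTS table has drifted out of sync with its content table, FTS5's `'rebuild'` command repopulates it in one statement. A self-contained sketch (the `messages` schema here is a simplified stand-in, and FTS5 must be compiled in, as it is in standard Python builds):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, content TEXT)")
# External-content FTS5 table mirroring messages(content).
conn.execute(
    "CREATE VIRTUAL TABLE messages_fts USING fts5("
    "content, content='messages', content_rowid='id')"
)
conn.execute("INSERT INTO messages (content) VALUES ('enable semantic search')")
conn.execute("INSERT INTO messages (content) VALUES ('fix backup cron job')")

# Repopulate the full-text index from the content table.
conn.execute("INSERT INTO messages_fts(messages_fts) VALUES ('rebuild')")

hits = conn.execute(
    "SELECT content FROM messages_fts WHERE messages_fts MATCH 'semantic'"
).fetchall()
print(hits)  # [('enable semantic search',)]
```

Running the same `'rebuild'` against the real FTS tables is a cheap way to fix any count mismatch the verification queries above reveal.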

B. Composite Indexes for Common Queries ✅ DONE (Dec 29, 2025)

```sql
-- Frequently used query pattern: recent by role and session
CREATE INDEX IF NOT EXISTS idx_entries_session_role_time
ON entries(session_id, role, timestamp DESC);

-- Query pattern: components by type and status
CREATE INDEX IF NOT EXISTS idx_components_type_status
ON components(type, status);

-- Query pattern: docs by workspace and type
CREATE INDEX IF NOT EXISTS idx_doc_workspace_type
ON doc_index(workspace_id, doc_type);
```

C. Partial Indexes (for filtered queries) ✅ DONE (Dec 29, 2025)

```sql
-- Only index active components (538 of 1937)
CREATE INDEX IF NOT EXISTS idx_components_active
ON components(type, name) WHERE status = 'active';

-- High-confidence decisions (skipped - all decisions at 0.4 confidence)
-- CREATE INDEX IF NOT EXISTS idx_decisions_high_conf
-- ON decisions(type, project) WHERE confidence >= 0.8;
```

3.3 Semantic Search Indexes

Current: doc_embeddings and entries_embeddings are empty.

Action: Implement vector similarity search:

  • Option 1: SQLite with manual cosine similarity (slow for large datasets)
  • Option 2: sqlite-vec extension (requires compilation)
  • Option 3: External vector DB (Chroma, Qdrant) with SQLite metadata

Recommended: Option 3. A Chroma integration already exists in scripts/core/chromadb_setup.py.
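For Option 1, cosine similarity can be computed in Python over embeddings stored as BLOBs. A minimal sketch; the two-dimensional vectors and the `doc_embeddings` columns are illustrative, and real embeddings would have hundreds of dimensions:

```python
import math
import sqlite3
import struct

def pack(vec):
    # Store a float vector as a compact little-endian BLOB.
    return struct.pack(f"{len(vec)}f", *vec)

def unpack(blob):
    return struct.unpack(f"{len(blob) // 4}f", blob)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE doc_embeddings (doc_id INTEGER, embedding BLOB)")
conn.executemany(
    "INSERT INTO doc_embeddings VALUES (?, ?)",
    [(1, pack([1.0, 0.0])), (2, pack([0.0, 1.0])), (3, pack([0.9, 0.1]))],
)

query = [1.0, 0.0]
ranked = sorted(
    ((doc_id, cosine(query, unpack(blob)))
     for doc_id, blob in conn.execute("SELECT doc_id, embedding FROM doc_embeddings")),
    key=lambda t: t[1],
    reverse=True,
)
print(ranked[0][0])  # doc 1 is the closest match
```

This brute-force scan is O(n) per query, which is why Options 2 and 3 matter once the embedding tables grow past a few tens of thousands of rows.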

4. Automation Recommendations

4.1 Scheduled Tasks

| Task | Frequency | Script | Priority |
|---|---|---|---|
| Backup to GCS | Daily | backup-context-db.sh | P0 |
| Integrity check | Weekly | PRAGMA integrity_check | P0 |
| Statistics update | Weekly | ANALYZE | P1 |
| Orphan cleanup | Weekly | delete-orphans.py | P1 |
| Vacuum | Monthly | VACUUM | P2 |

Cron configuration:

```cron
# /etc/cron.d/coditect-context-db
# Note: files in /etc/cron.d require a user field (root here).
0 2 * * * root /path/to/backup-context-db.sh                 # Daily, 2 AM
0 3 * * 0 root sqlite3 sessions.db "PRAGMA integrity_check"  # Weekly, Sunday 3 AM
0 3 * * 1 root sqlite3 sessions.db "ANALYZE"                 # Weekly, Monday 3 AM
0 4 1 * * root sqlite3 sessions.db "VACUUM"                  # Monthly, 1st at 4 AM
```

4.2 Event-Driven Automation

| Event | Trigger | Action |
|---|---|---|
| Session end | /export or session close | Extract and index |
| Document save | File save in workspace | Re-index doc_index |
| Component change | Agent/command/skill modified | Update component_frontmatter |
| High-value decision | Confidence >= 0.9 | Add to knowledge graph |

Hook implementation:

```bash
#!/bin/bash
# hooks/post-session.sh (already exists; enhance)
# After session ends:
#   1. Run /cx to extract messages
#   2. Update doc_index
#   3. Generate embeddings
#   4. Extract knowledge graph

python3 scripts/unified-message-extractor.py
python3 scripts/context-db.py index-docs
python3 scripts/generate-embeddings.py --incremental
python3 scripts/extract-knowledge-graph.py --incremental
```

4.3 Maintenance Automation Script

```bash
#!/bin/bash
# scripts/db-maintenance.sh

DB_PATH="context-storage/sessions.db"

echo "=== Context Database Maintenance ==="
echo "Started: $(date)"

# 1. Integrity check
echo "Checking integrity..."
sqlite3 "$DB_PATH" "PRAGMA integrity_check" | grep -q "ok" || {
    echo "ERROR: Database integrity check failed!"
    exit 1
}

# 2. Update statistics
echo "Updating statistics..."
sqlite3 "$DB_PATH" "ANALYZE"

# 3. Clean orphans
echo "Cleaning orphaned records..."
sqlite3 "$DB_PATH" "
DELETE FROM doc_embeddings WHERE doc_id NOT IN (SELECT id FROM doc_index);
DELETE FROM entries_embeddings WHERE entry_id NOT IN (SELECT id FROM entries);
DELETE FROM summaries WHERE entry_id NOT IN (SELECT id FROM entries);
"

# 4. Report sizes via the dbstat virtual table
# (requires SQLITE_ENABLE_DBSTAT_VTAB, enabled in the stock sqlite3 CLI)
echo "Table sizes:"
sqlite3 "$DB_PATH" "
SELECT name,
       SUM(pgsize) / 1024.0 / 1024.0 AS size_mb
FROM dbstat
GROUP BY name
ORDER BY size_mb DESC
LIMIT 10;
"

echo "Completed: $(date)"
```

5. Query Interface Improvements

5.1 Current Query Methods

| Method | Interface | Limitation |
|---|---|---|
| /cxq | CLI command | Limited query types |
| context-db.py | Python script | Requires command line |
| Direct SQLite | sqlite3 shell | Requires SQL knowledge |

5.2 Proposed Enhancements

A. Natural Language Query Interface

```bash
# Current: SQL-based
/cxq --sql "SELECT * FROM decisions WHERE confidence > 0.9"

# Proposed: Natural language
/cxq "show me high-confidence decisions from last week"
/cxq "what errors have I solved before?"
/cxq "find code patterns for authentication"
```
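A first cut of the natural-language mode could simply route keywords to canned SQL templates. The column names in this sketch (`timestamp`, `last_seen`) are assumptions, and a production version would need real parsing or an LLM:

```python
import re

# Hypothetical keyword-to-template router; each regex maps a
# question topic to a canned, safe SQL query.
TEMPLATES = {
    r"\bdecisions?\b": "SELECT * FROM decisions ORDER BY timestamp DESC LIMIT 20",
    r"\berrors?\b": "SELECT * FROM error_solutions ORDER BY last_seen DESC LIMIT 20",
    r"\bpatterns?\b": "SELECT * FROM code_patterns ORDER BY usage_count DESC LIMIT 20",
}

def route(question):
    """Return a SQL template for the question, or None if no
    keyword matches (caller can then fall back to --sql mode)."""
    q = question.lower()
    for keyword, sql in TEMPLATES.items():
        if re.search(keyword, q):
            return sql
    return None

print(route("what errors have I solved before?"))
```

Because every template is fixed SQL, this approach sidesteps injection risks that naive text-to-SQL generation would introduce.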

B. Saved/Named Queries ✅ IMPLEMENTED (Dec 29, 2025)

9 new views created for instant queries (8 initial plus sessions_by_topic):

| View | Purpose | Query |
|---|---|---|
| knowledge_stats | Row counts for all knowledge tables | /cxq --knowledge-stats |
| today_activity | Last 24h activity | /cxq --today-activity |
| recent_decisions | High-confidence decisions (7 days) | /cxq --view recent_decisions |
| error_frequency | Most common errors | /cxq --error-frequency |
| patterns_by_language | Code patterns by language | /cxq --patterns-summary |
| session_activity | Session durations | /cxq --session-summary |
| component_summary | Components by type/status | /cxq --component-overview |
| docs_by_category | Documentation by category | /cxq --docs-summary |
| sessions_by_topic | Session insights by topic | SELECT * FROM sessions_by_topic WHERE topic='deployment' |
```bash
# Example: Query any view directly
sqlite3 context-storage/sessions.db "SELECT * FROM knowledge_stats"
sqlite3 context-storage/sessions.db "SELECT * FROM error_frequency LIMIT 10"

# Query sessions by topic (new)
sqlite3 context-storage/sessions.db "SELECT * FROM sessions_by_topic WHERE topic='authentication'"
sqlite3 context-storage/sessions.db "SELECT * FROM sessions_by_topic WHERE topic='deployment' ORDER BY hours DESC"
```

C. Query Result Formatting

```bash
# Current: Raw output
/cxq --decisions

# Proposed: Multiple formats
/cxq --decisions --format table     # ASCII table
/cxq --decisions --format json      # JSON output
/cxq --decisions --format markdown  # Markdown table
/cxq --decisions --format csv       # CSV export
```
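A sketch of how the format switch might be implemented, assuming query rows arrive as tuples alongside a header list (the function name and defaults are illustrative):

```python
import csv
import io
import json

def format_rows(rows, headers, fmt="markdown"):
    """Render query results as markdown (default), JSON, or CSV."""
    if fmt == "json":
        return json.dumps([dict(zip(headers, r)) for r in rows])
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(headers)
        writer.writerows(rows)
        return buf.getvalue()
    # Markdown table: header row, separator row, then data rows.
    lines = ["| " + " | ".join(headers) + " |",
             "|" + "---|" * len(headers)]
    lines += ["| " + " | ".join(str(c) for c in r) + " |" for r in rows]
    return "\n".join(lines)

print(format_rows([("auth", 0.9)], ["decision", "confidence"]))
```

Keeping formatting in one function means every /cxq subcommand gains all four output modes at once.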

5.3 Performance Query Hints

```sql
-- Use EXPLAIN QUERY PLAN to verify index usage
EXPLAIN QUERY PLAN
SELECT * FROM entries WHERE session_id = 'abc' AND role = 'assistant';
-- Should show: USING INDEX idx_entries_session

-- Common slow query patterns to avoid:
-- AVOID: LIKE '%pattern%' (cannot use an index)
-- USE: full-text search instead
SELECT * FROM decisions_fts WHERE decisions_fts MATCH 'pattern';

-- AVOID: complex JOINs without LIMIT
-- USE: a subquery with LIMIT first
SELECT * FROM entries e
JOIN (SELECT id FROM messages ORDER BY timestamp DESC LIMIT 100) m
  ON e.id = m.id;
```
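Index usage can also be checked programmatically, which makes it testable in CI. A sketch with a minimal stand-in schema (the real idx_entries_session may cover more columns):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries (id INTEGER, session_id TEXT, role TEXT, timestamp TEXT)"
)
conn.execute("CREATE INDEX idx_entries_session ON entries(session_id, role)")

# EXPLAIN QUERY PLAN rows end with a human-readable detail string.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM entries WHERE session_id = 'abc' AND role = 'assistant'"
).fetchall()
detail = plan[0][-1]
print(detail)  # e.g. "SEARCH entries USING INDEX idx_entries_session (...)"
```

Asserting that the detail string names the expected index catches silent regressions when a schema change or query rewrite stops an index from being used.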

6. Implementation Roadmap

Phase 1: Foundation (Week 1) - ✅ COMPLETE

  • Add remaining UNIQUE constraints (Dec 29, 2025)
  • Create db-maintenance.sh script (Dec 29, 2025)
  • Set up daily backup cron job
  • Create common query views (8 new views) (Dec 29, 2025)
  • Add view query commands to /cxq (Dec 29, 2025 - in progress)

Phase 2: Semantic Search (Week 2-3)

  • Populate doc_embeddings table
  • Populate entries_embeddings table
  • Integrate Chroma for vector search
  • Add /cxq --semantic query mode

Phase 3: Knowledge Graph (Week 4)

  • Implement entity extraction pipeline
  • Populate knowledge_entities
  • Implement relation extraction
  • Populate knowledge_graph

Phase 4: Automation (Week 5)

  • Add post-session hook for extraction
  • Add file-save hook for doc indexing
  • Implement event-driven triggers
  • Add monitoring and alerting

Phase 5: Query Enhancements (Week 6)

  • Natural language query parsing
  • Saved query management (views created) (Dec 29, 2025)
  • Multiple output formats
  • Query performance dashboard

7. Metrics to Track

| Metric | Current | Target | Tracking |
|---|---|---|---|
| Database size | 1,603 MB | < 2 GB | du -h sessions.db |
| Unique messages | 92,924 | Growing | SELECT COUNT(*) FROM messages |
| Query response time | Unknown | < 100 ms | Query timing |
| Index hit rate | 100% | > 90% | EXPLAIN QUERY PLAN verified |
| Backup success rate | Unknown | 100% | Backup logs |
| Duplicate rate | 0% | 0% | UNIQUE constraints prevent duplicates |
| Query views | 19 | 20+ | SELECT COUNT(*) FROM sqlite_master WHERE type='view' |
| UNIQUE constraints | 6 | 6 | All major tables protected |
| Total indexes | 164 | 160+ | Composite + partial indexes added |
| Session insights | 5,753 | All sessions | 1,163 sessions analyzed |
| Hours tracked | 1,884 | Growing | Total development hours indexed |


Author: CODITECT Framework
Review Status: Phase 1 complete, actively implementing
Last Updated: December 29, 2025
Next Review: January 2026


Changelog

| Date | Change |
|---|---|
| 2025-12-29 | Added sessions_by_topic view (19 views total) |
| 2025-12-29 | Generated session insights: 5,753 rows, 1,163 sessions, 1,884 hours tracked |
| 2025-12-29 | Added generate-session-insights.py script |
| 2025-12-29 | Added 4 performance indexes: 3 composite + 1 partial (164 total) |
| 2025-12-29 | Added /cxq view commands (--views, --today-activity, etc.) |
| 2025-12-29 | Phase 1 complete: UNIQUE constraints, views, maintenance script |
| 2025-12-29 | Initial document created |