Context Database Improvement Recommendations
Generated: December 29, 2025 Database Version: context.db v1.0 Current Size: 1,603 MB (after deduplication)
Executive Summary
The context database is functional but underutilized. Key findings:
| Metric | Current | Potential | Status |
|---|---|---|---|
| Active tables | 30/56 (54%) | 56/56 (100%) | Pending |
| Semantic search | Disabled | Full RAG capability | Pending |
| Knowledge graph | Empty | Entity-relationship extraction | Pending |
| Learning system | Empty | User progress tracking | Pending |
| Automation | Manual triggers | Scheduled + event-driven | In Progress |
| Deduplication | UNIQUE constraints added | Prevention at ingestion | ✅ DONE |
| Query Views | 10 views | 19 views | ✅ DONE |
| Session Insights | 1,163 sessions analyzed | All sessions | ✅ DONE |
Priority Recommendations:
- P0: Enable semantic embeddings for RAG search
- P1: Implement knowledge graph extraction
- P1: Add ingestion-time deduplication
- P2: Activate learning system
- P2: Schedule automated maintenance
Workflow Steps
- Initialize - Set up the environment
- Configure - Apply settings
- Execute - Run the process
- Validate - Check results
- Complete - Finalize workflow
1. Content Gaps (What's Missing)
1.1 Empty Tables Analysis
| Table | Purpose | Status | Action |
|---|---|---|---|
doc_embeddings | Semantic document search | Empty (0 rows) | Enable embedding generation |
entries_embeddings | Semantic entry search | Empty (0 rows) | Enable embedding generation |
knowledge_entities | Entity extraction | Empty (0 rows) | Implement NER pipeline |
knowledge_graph | Entity relationships | Empty (0 rows) | Implement relation extraction |
file_backup_entries | File version tracking | Empty (0 rows) | Enable file backup hook |
learning_* (4 tables) | User learning progress | Empty (0 rows) | Activate learning system |
1.2 Underutilized Tables
| Table | Rows | Expected Usage | Gap |
|---|---|---|---|
session_insights | 5,753 | 1 per session | ✅ DONE - 1,163 sessions analyzed |
project_summaries | 1 | 1 per project (~5+) | Only one project tracked |
file_snapshots | 1 | Many per session | File tracking disabled |
workspace_registry | 3 | All workspaces (~10+) | Manual registration only |
1.3 Missing Data Types
Currently not captured:
- Git commit metadata (author, branch, diff stats)
- Tool usage patterns (which tools, success rates)
- User preferences and settings
- External API call logs
- Performance metrics (response times)
2. Ingestion Pipeline Improvements
2.1 Current Ingestion Points
| Script | Tables Updated | Trigger |
|---|---|---|
unified-message-extractor.py | entries, messages | Manual (/cx) |
context-db.py | doc_index, decisions, etc. | Manual (/cxq) |
component-indexer.py | components, capabilities | Manual |
component-frontmatter-indexer.py | component_frontmatter | Manual |
2.2 Recommended New Pipelines
A. Automatic Session Capture (Hook-based)
# H.P.005-HOOKS/post-message-capture.py
# Trigger: After each Claude message
# Action: Extract and store in real-time
def on_message(message):
hash = compute_hash(message)
if not exists_in_db(hash):
store_message(message)
extract_decisions(message)
extract_code_patterns(message)
B. Semantic Embedding Pipeline
# H.P.004-SCRIPTS/generate-embeddings.py
# Trigger: After doc_index updates
# Action: Generate embeddings for semantic search
def generate_embeddings():
unembedded = get_docs_without_embeddings()
for doc in unembedded:
embedding = embed(doc.content)
store_embedding(doc.id, embedding)
C. Knowledge Graph Extraction
# H.P.004-SCRIPTS/extract-knowledge-graph.py
# Trigger: After decisions/patterns update
# Action: Extract entities and relationships
def extract_knowledge():
for decision in new_decisions():
entities = extract_entities(decision.text)
relations = extract_relations(decision.text)
store_graph(entities, relations)
2.3 Ingestion-Time Deduplication
Problem: Current deduplication is post-hoc (after duplicates exist).
Solution: Check before insert:
-- Pattern: INSERT OR IGNORE with UNIQUE constraint
INSERT OR IGNORE INTO decisions (decision, type, project, confidence)
VALUES (?, ?, ?, ?);
-- Pattern: UPSERT (update if exists)
INSERT INTO code_patterns (code, name, language, type, usage_count)
VALUES (?, ?, ?, ?, 1)
ON CONFLICT(code) DO UPDATE SET
usage_count = usage_count + 1,
last_used = datetime('now');
Required Changes:
Add UNIQUE constraints to all content-hash columns✅ DONE (Dec 29, 2025)idx_messages_unique_hashon messages(hash)idx_entries_unique_hashon entries(hash)idx_error_solutions_unique_hashon error_solutions(error_hash)idx_decisions_uniqueon decisions(decision)idx_patterns_unique_codeon code_patterns(code)idx_doc_unique_pathon doc_index(file_path)
- Modify ingestion H.P.004-SCRIPTS to use
INSERT OR IGNORE - Add upsert logic for counters (usage_count, access_count)
3. Indexing Improvements
3.1 Current Index Coverage
| Table | Indexes | Coverage | Gap |
|---|---|---|---|
messages | 6 | Good | Missing: content FTS |
entries | 7 | Good | Missing: content FTS |
decisions | 6 | Good | None |
code_patterns | 7 | Good | None |
doc_index | 10 | Excellent | None |
components | 5 | Moderate | Missing: full-text name/description |
3.2 Recommended New Indexes
A. Full-Text Search Indexes (Already exist but verify populated)
-- Verify FTS tables are populated
SELECT COUNT(*) FROM messages_fts; -- Should match messages
SELECT COUNT(*) FROM decisions_fts; -- Should match decisions
SELECT COUNT(*) FROM doc_search; -- Should match doc_index
SELECT COUNT(*) FROM component_search; -- Should match components
B. Composite Indexes for Common Queries ✅ DONE (Dec 29, 2025)
-- Frequently used query pattern: recent by role and session
CREATE INDEX IF NOT EXISTS idx_entries_session_role_time
ON entries(session_id, role, timestamp DESC);
-- Query pattern: components by type and status
CREATE INDEX IF NOT EXISTS idx_components_type_status
ON components(type, status);
-- Query pattern: docs by workspace and type
CREATE INDEX IF NOT EXISTS idx_doc_workspace_type
ON doc_index(workspace_id, doc_type);
C. Partial Indexes (for filtered queries) ✅ DONE (Dec 29, 2025)
-- Only index active components (538 of 1937)
CREATE INDEX IF NOT EXISTS idx_components_active
ON components(type, name) WHERE status = 'active';
-- High-confidence decisions (skipped - all decisions at 0.4 confidence)
-- CREATE INDEX IF NOT EXISTS idx_decisions_high_conf
-- ON decisions(type, project) WHERE confidence >= 0.8;
3.3 Semantic Search Indexes
Current: doc_embeddings and entries_embeddings are empty.
Action: Implement vector similarity search:
# Option 1: Use SQLite with manual cosine similarity (slow for large datasets)
# Option 2: Use sqlite-vec extension (requires compilation)
# Option 3: Use external vector DB (Chroma, Qdrant) with SQLite metadata
# Recommended: Option 3 - Chroma integration exists in H.P.004-SCRIPTS/core/chromadb_setup.py
4. Automation Recommendations
4.1 Scheduled Tasks
| Task | Frequency | Script | Priority |
|---|---|---|---|
| Backup to GCS | Daily | backup-context-db.sh | P0 |
| Integrity check | Weekly | PRAGMA integrity_check | P0 |
| Statistics update | Weekly | ANALYZE | P1 |
| Orphan cleanup | Weekly | delete-orphans.py | P1 |
| Vacuum | Monthly | VACUUM | P2 |
Cron H.P.009-CONFIGuration:
# /etc/cron.d/coditect-context-db
0 2 * * * /path/to/backup-context-db.sh # Daily 2 AM
0 3 * * 0 sqlite3 context.db "PRAGMA integrity_check" # Weekly Sunday 3 AM
0 3 * * 1 sqlite3 context.db "ANALYZE" # Weekly Monday 3 AM
0 4 1 * * sqlite3 context.db "VACUUM" # Monthly 1st, 4 AM
4.2 Event-Driven Automation
| Event | Trigger | Action |
|---|---|---|
| Session end | /export or session close | Extract and index |
| Document save | File save in workspace | Re-index doc_index |
| Component change | Agent/command/skill modified | Update component_frontmatter |
| High-value decision | Confidence >= 0.9 | Add to knowledge graph |
Hook implementation:
# H.P.005-HOOKS/post-session.sh (already exists - enhance)
#!/bin/bash
# After session ends:
# 1. Run /cx to extract messages
# 2. Update doc_index
# 3. Generate embeddings
# 4. Extract knowledge graph
python3 H.P.004-SCRIPTS/unified-message-extractor.py
python3 H.P.004-SCRIPTS/context-db.py index-docs
python3 H.P.004-SCRIPTS/generate-embeddings.py --incremental
python3 H.P.004-SCRIPTS/extract-knowledge-graph.py --incremental
4.3 Maintenance Automation Script
#!/bin/bash
# H.P.004-SCRIPTS/db-maintenance.sh
DB_PATH="context-storage/context.db"
echo "=== Context Database Maintenance ==="
echo "Started: $(date)"
# 1. Integrity check
echo "Checking integrity..."
sqlite3 "$DB_PATH" "PRAGMA integrity_check" | grep -q "ok" || {
echo "ERROR: Database integrity check failed!"
exit 1
}
# 2. Update statistics
echo "Updating statistics..."
sqlite3 "$DB_PATH" "ANALYZE"
# 3. Clean orphans
echo "Cleaning orphaned records..."
sqlite3 "$DB_PATH" "
DELETE FROM doc_embeddings WHERE doc_id NOT IN (SELECT id FROM doc_index);
DELETE FROM entries_embeddings WHERE entry_id NOT IN (SELECT id FROM entries);
DELETE FROM summaries WHERE entry_id NOT IN (SELECT id FROM entries);
"
# 4. Report sizes
echo "Table sizes:"
sqlite3 "$DB_PATH" "
SELECT name,
(SELECT COUNT(*) FROM pragma_table_info(name)) as columns,
page_count * 4096 / 1024 / 1024.0 as size_mb
FROM sqlite_master
WHERE type='table'
AND name NOT LIKE 'sqlite_%'
ORDER BY page_count DESC
LIMIT 10;
"
echo "Completed: $(date)"
5. Query Interface Improvements
5.1 Current Query Methods
| Method | Interface | Limitation |
|---|---|---|
/cxq | CLI command | Limited query types |
context-db.py | Python script | Requires command line |
| Direct SQLite | sqlite3 shell | Requires SQL knowledge |
5.2 Recommended Query Enhancements
A. Natural Language Query Interface
# Current: SQL-based
/cxq --sql "SELECT * FROM decisions WHERE confidence > 0.9"
# Proposed: Natural language
/cxq "show me high-confidence decisions from last week"
/cxq "what errors have I solved before?"
/cxq "find code patterns for authentication"
B. Saved/Named Queries ✅ IMPLEMENTED (Dec 29, 2025)
8 new views created for instant queries:
| View | Purpose | Query |
|---|---|---|
knowledge_stats | Row counts for all knowledge tables | /cxq --knowledge-stats |
today_activity | Last 24h activity | /cxq --today-activity |
recent_decisions | High-confidence decisions (7 days) | /cxq --view recent_decisions |
error_frequency | Most common errors | /cxq --error-frequency |
patterns_by_language | Code patterns by language | /cxq --patterns-summary |
session_activity | Session durations | /cxq --session-summary |
component_summary | Components by type/status | /cxq --component-overview |
docs_by_category | Documentation by category | /cxq --docs-summary |
sessions_by_topic | Session insights by topic | SELECT * FROM sessions_by_topic WHERE topic='deployment' |
-- Example: Query any view directly
sqlite3 context-storage/context.db "SELECT * FROM knowledge_stats"
sqlite3 context-storage/context.db "SELECT * FROM error_frequency LIMIT 10"
-- Query sessions by topic (new)
sqlite3 context-storage/context.db "SELECT * FROM sessions_by_topic WHERE topic='authentication'"
sqlite3 context-storage/context.db "SELECT * FROM sessions_by_topic WHERE topic='deployment' ORDER BY hours DESC"
C. Query Result Formatting
# Current: Raw output
/cxq --decisions
# Proposed: Multiple formats
/cxq --decisions --format table # ASCII table
/cxq --decisions --format json # JSON output
/cxq --decisions --format markdown # Markdown table
/cxq --decisions --format csv # CSV export
5.3 Performance Query Hints
-- Use EXPLAIN QUERY PLAN to verify index usage
EXPLAIN QUERY PLAN
SELECT * FROM entries WHERE session_id = 'abc' AND role = 'assistant';
-- Should show: USING INDEX idx_entries_session
-- Common slow query patterns to avoid:
-- AVOID: LIKE '%pattern%' (can't use index)
-- USE: Full-text search instead
SELECT * FROM decisions_fts WHERE decisions_fts MATCH 'pattern';
-- AVOID: Complex JOINs without LIMIT
-- USE: Subqueries with LIMIT first
SELECT * FROM entries e
JOIN (SELECT id FROM messages ORDER BY timestamp DESC LIMIT 100) m
ON e.id = m.id;
6. Implementation Roadmap
Phase 1: Foundation (Week 1) - ✅ COMPLETE
- Add remaining UNIQUE constraints (Dec 29, 2025)
- Create
db-maintenance.shscript (Dec 29, 2025) - Set up daily backup cron job
- Create common query views (8 new views) (Dec 29, 2025)
- Add view query H.P.002-COMMANDS to
/cxq(Dec 29, 2025 - in progress)
Phase 2: Semantic Search (Week 2-3)
- Populate
doc_embeddingstable - Populate
entries_embeddingstable - Integrate Chroma for vector search
- Add
/cxq --semanticquery mode
Phase 3: Knowledge Graph (Week 4)
- Implement entity extraction pipeline
- Populate
knowledge_entities - Implement relation extraction
- Populate
knowledge_graph
Phase 4: Automation (Week 5)
- Add post-session hook for extraction
- Add file-save hook for doc indexing
- Implement event-driven triggers
- Add monitoring and alerting
Phase 5: Query Enhancements (Week 6)
- Natural language query parsing
- Saved query management (views created) (Dec 29, 2025)
- Multiple output formats
- Query performance dashboard
7. Metrics to Track
| Metric | Current | Target | Tracking |
|---|---|---|---|
| Database size | 1,603 MB | < 2 GB | du -h context.db |
| Unique messages | 92,924 | Growing | SELECT COUNT(*) FROM messages |
| Query response time | Unknown | < 100ms | Query timing |
| Index hit rate | 100% | > 90% | EXPLAIN QUERY PLAN verified |
| Backup success rate | Unknown | 100% | Backup logs |
| Duplicate rate | 0% | 0% | UNIQUE constraints prevent duplicates |
| Query views | 19 | 20+ | SELECT COUNT(*) FROM sqlite_master WHERE type='view' |
| UNIQUE constraints | 6 | 6 | All major tables protected |
| Total indexes | 164 | 160+ | Composite + partial indexes added |
| Session insights | 5,753 | All sessions | 1,163 sessions analyzed |
| Hours tracked | 1,884 | Growing | Total development hours indexed |
Related Documents
- DATABASE-SCHEMA.md - Complete schema reference
- DATABASE-SCHEMA.json - Machine-readable schema
- MEMORY-MANAGEMENT-GUIDE.md - Context preservation guide
- backup-context-db.sh - Backup script
- db-maintenance.sh - Maintenance script
- deduplicate-all-tables.py - Deduplication script
- generate-session-insights.py - Session insights generator (NEW)
Author: CODITECT Framework Review Status: Phase 1 complete, actively implementing Last Updated: December 29, 2025 Next Review: January 2026
Changelog
| Date | Change |
|---|---|
| 2025-12-29 | Added sessions_by_topic view (19 views total) |
| 2025-12-29 | Generated session insights: 5,753 rows, 1,163 sessions, 1,884 hours tracked |
| 2025-12-29 | Added generate-session-insights.py script |
| 2025-12-29 | Added 4 performance indexes: 3 composite + 1 partial (164 total) |
| 2025-12-29 | Added /cxq view H.P.002-COMMANDS (--views, --today-activity, etc.) |
| 2025-12-29 | Phase 1 complete: UNIQUE constraints, views, maintenance script |
| 2025-12-29 | Initial document created |