ADR-118: Four-Tier Database Architecture

Status

ACCEPTED (2026-01-28)

Context

Problem Statement

CODITECT's original database architecture evolved organically, resulting in a single monolithic context.db that contained:

Component metadata (agents, skills, commands)
Session messages (251K+ rows)
Tool analytics (8.2M+ rows)
Token economics (17.8M+ rows)
Accumulated knowledge (decisions, learnings, error_solutions)

This caused several problems:

Problem	Impact
Single point of failure	All data lost if context.db corrupts
Backup complexity	Can't prioritize critical vs regenerable data
Size bloat	12+ GB makes backup slow and expensive
Framework updates	Risk overwriting customer data
Recovery difficulty	No way to rebuild without full backup

Previous Decisions

ADR-089: Separated component data into platform.db (read-only)
ADR-103: Conceptualized four-database split but never fully implemented

Key Insight

Not all data has equal value:

Data Type	Replaceability	Example
Irreplaceable	Cannot be regenerated	Decisions, learnings, error_solutions
Regenerable	Can rebuild from source	Messages (from JSONL), analytics
Derived	Computed from other data	Embeddings, indexes

Decision

Four-Tier Database Architecture

Split data into four tiers based on criticality and regenerability:

┌─────────────────────────────────────────────────────────────────┐
│                    TIER 1: PLATFORM (Regenerable)               │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ platform.db                                              │   │
│  │ • Component metadata (agents, skills, commands)          │   │
│  │ • Capabilities, invocation patterns                      │   │
│  │ • Regenerable via: python3 scripts/component-indexer.py  │   │
│  └─────────────────────────────────────────────────────────┘   │
│  Backup: NOT REQUIRED (regenerate in ~2 minutes)                │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                  TIER 2: ORGANIZATION (Critical)                │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ org.db                           ⚠️ CRITICAL             │   │
│  │ • skill_learnings (758K+ rows) - Accumulated improvements│   │
│  │ • decisions (1.8K+ rows) - Architectural decisions       │   │
│  │ • error_solutions (475+ rows) - Problem-solution pairs   │   │
│  │ • CANNOT BE REGENERATED - represents months of learning  │   │
│  └─────────────────────────────────────────────────────────┘   │
│  Backup: DAILY with Grandfather-Father-Son (GFS) retention      │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                   TIER 3: SESSIONS (Regenerable)                │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ sessions.db                                              │   │
│  │ • messages (251K+ rows) - Extracted from JSONL           │   │
│  │ • tool_analytics (8.2M+ rows) - Tool usage metrics       │   │
│  │ • token_economics (17.8M+ rows) - Token tracking         │   │
│  │ • activity_associations - Project/track mappings         │   │
│  │ • Regenerable via: python3 scripts/unified-message-      │   │
│  │   extractor.py (source: unified_messages.jsonl)          │   │
│  └─────────────────────────────────────────────────────────┘   │
│  Backup: OPTIONAL (can regenerate from unified_messages.jsonl)  │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                   TIER 4: PROJECTS (Regenerable)                │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ projects.db                                              │   │
│  │ • projects - Registered project metadata                 │   │
│  │ • content_hashes - File change detection                 │   │
│  │ • project_embeddings - Semantic search vectors           │   │
│  │ • exclude_patterns - Per-project ignore rules            │   │
│  │ • project_tags, project_activity                         │   │
│  │ • Regenerable via: /cx --index-project <path>            │   │
│  └─────────────────────────────────────────────────────────┘   │
│  Backup: OPTIONAL (regenerable from project source files)       │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                 TIER 1b: PLATFORM INDEX (Regenerable)            │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ platform-index.db                                        │   │
│  │ • component_embeddings - Vector embeddings (384-dim)     │   │
│  │ • component_hashes - SHA256 change detection             │   │
│  │ • embedding_models - Model metadata                      │   │
│  │ • Regenerable via: /cx --reindex-framework               │   │
│  └─────────────────────────────────────────────────────────┘   │
│  Backup: NOT REQUIRED (regenerate from platform.db content)     │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                 TIER 3b: MESSAGING (Regenerable)                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ messaging.db                      (ADR-160)              │   │
│  │ • session_registry - Active session tracking             │   │
│  │ • inter_session_messages - Cross-session communication   │   │
│  │ • file_locks - Distributed file locking                  │   │
│  │ • task_claims - Task ownership coordination              │   │
│  │ • WAL mode enabled for concurrent access                 │   │
│  └─────────────────────────────────────────────────────────┘   │
│  Backup: NOT REQUIRED (ephemeral coordination data)             │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                 TIER 1c: CALL GRAPH (Regenerable)                │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ call_graph.db                                            │   │
│  │ • functions - Indexed function definitions               │   │
│  │ • calls - Function call relationships                    │   │
│  │ • files - Source file metadata                           │   │
│  │ • Regenerable via: MCP call-graph index_directory        │   │
│  └─────────────────────────────────────────────────────────┘   │
│  Backup: NOT REQUIRED (regenerate from source code)             │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                      LEGACY (Deprecated)                        │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ context.db                       ❌ DEPRECATED            │   │
│  │ • Being migrated to org.db + sessions.db                 │   │
│  │ • DO NOT USE in new code                                 │   │
│  │ • Will be removed after migration complete               │   │
│  └─────────────────────────────────────────────────────────┘   │
│  Status: MIGRATION IN PROGRESS                                  │
└─────────────────────────────────────────────────────────────────┘

Database Specifications

Tier	Database	Size	Key Tables	Backup	Recovery
1	`platform.db`	~40 MB	components (+component_origin), capabilities	None	Regenerate (~2 min)
1b	`platform-index.db`	~100 MB	component_embeddings, component_hashes	None	Regenerate via /cx --reindex-framework
1c	`call_graph.db`	Variable	functions, calls, files	None	Regenerate via MCP index
2	`org.db`	~250 MB	skill_learnings, decisions, error_solutions	CRITICAL	Restore from backup
3	`sessions.db`	~11 GB	messages, tool_analytics, token_economics, activity_associations	Optional	Regenerate via /cx
3b	`messaging.db`	~1 MB	session_registry, inter_session_messages, file_locks, task_claims	None	Ephemeral (recreate)
4	`projects.db`	Variable	projects, content_hashes, project_embeddings, exclude_patterns	Optional	Regenerate via /cx --index-project
LEGACY	`context.db`	~12 GB	(migrating)	N/A	DO NOT USE

Location

All databases stored in: ~/.coditect-data/context-storage/

(Per ADR-114 User Data Separation)

Import Pattern

from scripts.core.paths import (
    get_org_db_path,       # TIER 2: Critical knowledge
    get_sessions_db_path,  # TIER 3: Regenerable sessions
    get_projects_db_path,  # TIER 4: Project indexing
    get_context_db_path,   # LEGACY: Deprecated, do not use
)

# Correct usage
ORG_DB = get_org_db_path()           # For decisions, learnings
SESSIONS_DB = get_sessions_db_path()  # For messages, activities
PROJECTS_DB = get_projects_db_path()  # For project metadata, embeddings

# Other databases (resolved from CODITECT_DATA_DIR/context-storage/)
PLATFORM_DB = CODITECT_DATA_DIR / 'context-storage' / 'platform.db'
PLATFORM_INDEX_DB = CODITECT_DATA_DIR / 'context-storage' / 'platform-index.db'
MESSAGING_DB = CODITECT_DATA_DIR / 'context-storage' / 'messaging.db'
CALL_GRAPH_DB = CODITECT_DATA_DIR / 'context-storage' / 'call_graph.db'

# WRONG - Never use context.db in new code
# CONTEXT_DB = get_context_db_path()  # DEPRECATED

Component Origin Tracking (ADR-180)

The platform.db components table includes a component_origin column to distinguish the source of each component:

Value	Source	Description
`system`	`core/agents/`, `core/commands/`, etc.	Built-in framework components
`product`	`products/{name}/`	Product-provided components
`user`	`user-components/` (via `_user` symlinks)	User-created components

The component-indexer.py populates this column by scanning all three locations.

Migration Status

Table	Source	Target	Status
`components`	context.db	platform.db	✅ Complete
`capabilities`	context.db	platform.db	✅ Complete
`skill_learnings`	context.db	org.db	✅ Complete
`decisions`	context.db	org.db	✅ Complete
`error_solutions`	context.db	org.db	✅ Complete
`messages`	context.db	sessions.db	✅ Complete
`tool_analytics`	context.db	sessions.db	✅ Complete
`token_economics`	context.db	sessions.db	✅ Complete
`activity_associations`	(new)	sessions.db	✅ Created (J.14)

Consequences

Positive

Targeted Backups: Only backup critical data (org.db ~250MB vs 12GB)
Fast Recovery: Regenerate Tier 1 and Tier 3 from source files
Data Safety: Critical knowledge protected separately
Smaller Footprint: Daily backups are 250MB not 12GB
Clear Ownership: Know which database owns which data

Negative

Migration Effort: One-time migration of existing data
Code Updates: Scripts must use correct database
Join Complexity: Cross-database queries not possible

Risks

Risk	Mitigation
Code uses wrong database	Deprecation warnings, linting
Migration data loss	Pre-migration backup, validation
Split-brain state	Transaction boundaries at database level

Implementation

Phase 1: Schema Creation (Complete)

-- org.db tables (TIER 2)
CREATE TABLE skill_learnings (...);
CREATE TABLE decisions (...);
CREATE TABLE error_solutions (...);

-- sessions.db tables (TIER 3)
CREATE TABLE messages (...);
CREATE TABLE tool_analytics (...);
CREATE TABLE token_economics (...);
CREATE TABLE activity_associations (...);  -- J.14

Phase 2: Data Migration (Complete)

# Migrate critical data to org.db
python3 scripts/migrate-to-org-db.py

# Migrate session data to sessions.db
python3 scripts/migrate-to-sessions-db.py

Phase 3: Code Updates (Complete)

All scripts updated to use correct tier-specific imports.

Phase 4: Deprecation (Complete)

get_context_db_path() emits deprecation warnings. context.db is no longer created by the install script.

Phase 5: Install Script Integration (ADR-180)

The install script (v3.0.0) creates all databases during installation:

Install Step	Database(s)	Method
Step 5	org.db, sessions.db, messaging.db	`initialize_database()` — schema only, no seed data
Step 13	platform.db	`initialize_platform_database()` — runs `component-indexer.py`
Step 14	platform-index.db	`initialize_platform_index_database()` — schema + default model config
Step 15	projects.db	`initialize_projects_database()` — schema only, projects registered interactively

Clean seeding (ADR-180 D7): All databases are created with schema only. No pre-populated decisions, messages, or personal data. The only non-schema inserts are:

schema_versions records (metadata about the schema itself)
embedding_models default record (model configuration)

call_graph.db is created on-demand by the MCP call-graph tools, not during installation.

Backup Strategy

Tier 2 (org.db) - CRITICAL

# Daily backup with GFS retention
# Grandfather (monthly): 12 months
# Father (weekly): 4 weeks
# Son (daily): 7 days

~/.coditect/scripts/backup-org-db.sh
# → gs://coditect-backups/org.db/daily/YYYY-MM-DD/

Tier 3 (sessions.db) - Optional

# Weekly backup (optional, regenerable)
~/.coditect/scripts/backup-sessions-db.sh
# → gs://coditect-backups/sessions.db/weekly/

Recovery Procedures

Scenario	Recovery
org.db corrupt	Restore from latest GCS backup
sessions.db corrupt	Regenerate: `python3 scripts/unified-message-extractor.py --rebuild`
messaging.db corrupt	Delete and re-run install (ephemeral coordination data)
platform.db corrupt	Regenerate: `python3 scripts/component-indexer.py`
platform-index.db corrupt	Regenerate: `/cx --reindex-framework`
call_graph.db corrupt	Delete and re-index via MCP call-graph tools
projects.db corrupt	Regenerate: `python3 scripts/init_projects_db.py` then re-index projects

ADR-080: Context Extraction System (/cx)
ADR-089: Component-Customer Data Separation (superseded)
ADR-103: Four-Database Architecture (superseded, never implemented)
ADR-114: User Data Separation from Framework
ADR-121: Path Discovery Consolidation
ADR-148: Database Schema Documentation Standard (extends with field-level docs)
ADR-149: Query Language Evolution Strategy (extends with query capabilities)

Changelog

Date	Change
2026-01-28	Initial version - Four-Tier architecture
2026-01-28	Added activity_associations to sessions.db (J.14)
2026-01-28	Added TIER 4: projects.db for project indexing (J.15.1.3)
2026-02-12	ADR-180: Added messaging.db (Tier 3b), platform-index.db (Tier 1b), call_graph.db (Tier 1c)
2026-02-12	ADR-180: Added component_origin column to platform.db (system/product/user)
2026-02-12	ADR-180: Documented install script Phase 5 - clean database seeding
2026-02-12	Updated Phase 3/4 status to Complete, added recovery procedures for all databases

Track: J.15 (Memory - Project Database) Task: J.15.1.3, N.6.15.2.3 (Install Script Update)

Status​

Context​

Problem Statement​

Previous Decisions​

Key Insight​

Decision​

Four-Tier Database Architecture​

Database Specifications​

Location​

Import Pattern​

Component Origin Tracking (ADR-180)​

Migration Status​

Consequences​

Positive​

Negative​

Risks​

Implementation​

Phase 1: Schema Creation (Complete)​

Phase 2: Data Migration (Complete)​

Phase 3: Code Updates (Complete)​

Phase 4: Deprecation (Complete)​

Phase 5: Install Script Integration (ADR-180)​

Backup Strategy​

Tier 2 (org.db) - CRITICAL​

Tier 3 (sessions.db) - Optional​

Recovery Procedures​

Related​

Changelog​