ADR-148: Database Schema Documentation Standard

Status

ACCEPTED (2026-02-03)

Context

Problem Statement

CODITECT's four-tier database architecture (ADR-118) contains 94 tables across 4 databases, but documentation exists only at the high level. There is no:

Field-level documentation - What each column means and its purpose
Relationship documentation - Foreign keys and entity relationships
Data dictionary - Centralized reference for all tables and fields
Purpose mapping - Which business functions each table serves

This causes problems:

Developers don't know which table to use for new features
Field names are ambiguous (e.g., sync_status vs cloud_sync_status)
Cloud sync mapping is undocumented
Query optimization is guesswork without understanding data relationships

Current State

Database	Tables	Views	FTS	Documentation
`platform.db`	6	0	1	Schema only
`org.db`	8	1	0	Schema only
`sessions.db`	74	0	4	Schema only
`projects.db`	6	0	1	Schema only

Decision

Establish a Database Schema Documentation Standard that provides:

1. Data Dictionary Format

Each table documented in docs/reference/database/DATA-DICTIONARY.md:

## Table: `messages` (sessions.db)

**Purpose:** Stores extracted messages from all LLM sessions (Claude, Codex, Gemini, KIMI)

**Tier:** 3 (Regenerable from JSONL source files)

**Related Tables:**
- `embeddings` (1:1 - semantic vectors)
- `token_economics` (1:N - per-message tokens)
- `message_component_invocations` (1:N - skill/agent invocations)

### Fields

| Field | Type | Nullable | Default | Description |
|-------|------|----------|---------|-------------|
| `id` | INTEGER | NO | AUTO | Primary key |
| `hash` | TEXT | NO | - | SHA-256 content hash (dedup key) |
| `content` | TEXT | NO | - | Message text content |
| `role` | TEXT | NO | - | Message role: user, assistant, tool_use, tool_result |
| `source_type` | TEXT | YES | - | Source format: jsonl, export |
| `source_file` | TEXT | YES | - | Original file path |
| `session_id` | TEXT | YES | - | Claude Code session UUID |
| `checkpoint` | TEXT | YES | - | Checkpoint reference |
| `timestamp` | TEXT | YES | - | ISO 8601 timestamp |
| `extracted_at` | TEXT | YES | - | When message was extracted |
| `content_length` | INTEGER | YES | - | Character count |
| `has_code` | BOOLEAN | YES | - | Contains code blocks |
| `has_markdown` | BOOLEAN | YES | - | Contains markdown formatting |
| `agent_context` | TEXT | YES | - | JSON: agent execution context |
| `tool_use` | TEXT | YES | - | JSON: tool calls in message |
| `tool_result` | TEXT | YES | - | JSON: tool results |
| `export_metadata` | TEXT | YES | - | JSON: export-specific metadata |
| `component_invocations` | TEXT | YES | - | JSON: invoked skills/agents |
| `tenant_id` | TEXT | YES | - | Multi-tenant: organization ID |
| `user_id` | TEXT | YES | - | Multi-tenant: user ID |
| `team_id` | TEXT | YES | - | Multi-tenant: team ID |
| `project_id` | TEXT | YES | - | Multi-tenant: project ID |
| `cloud_id` | TEXT | YES | - | Cloud-assigned UUID |
| `synced_at` | TEXT | YES | - | Last cloud sync timestamp |
| `sync_status` | TEXT | YES | 'pending' | Cloud sync: pending, synced, error |

### Indexes

| Index | Columns | Purpose |
|-------|---------|---------|
| `idx_messages_tenant` | `tenant_id` | Tenant isolation |
| `idx_messages_user` | `tenant_id, user_id` | User filtering |
| `idx_messages_project` | `tenant_id, project_id` | Project filtering |
| `idx_messages_sync` | `sync_status, synced_at` | Cloud sync queries |
| `idx_messages_cloud_id` | `cloud_id` | Cloud deduplication |

### FTS5 Index

Table `messages_fts` provides full-text search on `content` and `role`.

2. Entity-Relationship Diagrams

Mermaid ER diagrams in docs/reference/database/ER-DIAGRAMS.md:

3. Cloud Sync Mapping

Document which tables sync to cloud and their PostgreSQL equivalents:

Local Table	Cloud Table	Sync Direction	Frequency
`messages`	`context.Message`	Push	Real-time
`decisions`	`context.Decision`	Push/Pull	Real-time
`skill_learnings`	`context.SkillLearning`	Push	Batch
`error_solutions`	`context.ErrorSolution`	Push/Pull	Real-time
`token_economics`	`analytics.TokenUsage`	Push	Batch
`tool_analytics`	`analytics.ToolUsage`	Push	Batch

4. Database Purpose Matrix

Quick reference for developers:

Need	Table	Database	Example
Store a decision	`decisions`	org.db	Architecture choice
Log token usage	`token_economics`	sessions.db	API call cost
Track tool calls	`tool_analytics`	sessions.db	Read/Write/Bash
Index a component	`components`	platform.db	New agent
Register a project	`projects`	projects.db	Customer repo

5. Auto-Generation Script

Create scripts/generate-data-dictionary.py that:

Connects to all 4 databases
Extracts schema via PRAGMA table_info()
Generates markdown documentation
Validates against expected schema
Detects schema drift

# Usage
python3 scripts/generate-data-dictionary.py --output docs/reference/database/DATA-DICTIONARY.md
python3 scripts/generate-data-dictionary.py --validate  # Check for drift

Database Tier Reference

Tier 1: platform.db (Component Metadata)

Purpose: Index of CODITECT framework components Regenerable: Yes (via component-indexer.py) Tables: 6

Table	Purpose
`components`	Master component registry (agents, skills, commands)
`capabilities`	Component capabilities (primary, tag, domain, action)
`triggers`	When to use/avoid components
`component_relationships`	Dependencies (invokes, alternative, complement)
`component_composability`	Orchestration capabilities
`component_usage_stats`	Runtime usage metrics

Tier 2: org.db (Critical Knowledge)

Purpose: Irreplaceable organizational knowledge Regenerable: NO - requires daily backup Tables: 8

Table	Purpose
`decisions`	Architectural/technical decisions
`error_solutions`	Error patterns with solutions
`skill_learnings`	Skill effectiveness tracking
`projects`	Project registry
`project_tracks`	PILOT track progress
`local_identity`	Auth state
`cloud_sync_state`	Sync tracking

Tier 3: sessions.db (Session Data)

Purpose: Session messages and analytics Regenerable: Yes (from JSONL exports) Tables: 74

Core Tables:

Table	Purpose
`messages`	Extracted session messages
`entries`	Raw session entries with threading
`token_economics`	Token usage and costs
`tool_analytics`	Tool execution metrics
`embeddings`	Semantic vectors for messages

Knowledge Tables:

Table	Purpose
`code_patterns`	Extracted code patterns
`knowledge_graph`	Entity relationships
`knowledge_entities`	Indexed entities

Agent Loop Tables:

Table	Purpose
`agent_sessions`	Ralph loop tracking
`checkpoints`	Execution state snapshots
`reasoning_traces`	Agent reasoning steps
`health_events`	Agent health monitoring

Learning Tables:

Table	Purpose
`learning_modules`	Learning content
`learning_progress`	User progress
`learning_badges`	Achievements

Tier 4: projects.db (Project Embeddings)

Purpose: Project-specific semantic search Regenerable: Yes (re-index from source) Tables: 6

Table	Purpose
`projects`	Registered projects with git metadata
`content_hashes`	File change detection
`project_embeddings`	Code chunk embeddings
`exclude_patterns`	Gitignore-style patterns
`project_tags`	Project categorization
`project_activity`	Activity log

Consequences

Positive

Self-documenting databases - Developers can understand schema without reading code
Consistent field naming - Data dictionary enforces naming conventions
Cloud sync clarity - Clear mapping between local and cloud
Onboarding acceleration - New contributors understand data model faster
Query optimization - Understanding relationships enables efficient queries

Negative

Maintenance burden - Documentation must be updated with schema changes
Auto-generation limitations - Scripts can't capture business context

Mitigations

Auto-generation script reduces manual maintenance
Pre-commit hook validates documentation matches schema
Schema changes require data dictionary update in same PR

Implementation

Phase 1: Data Dictionary Foundation (J.24.1)

Create docs/reference/database/DATA-DICTIONARY.md
Document all tables in sessions.db (largest database)
Create auto-generation script

Phase 2: Complete Coverage (J.24.2)

Document org.db tables (Tier 2 - critical)
Document platform.db tables (Tier 1)
Document projects.db tables (Tier 4)

Phase 3: ER Diagrams (J.24.3)

Create Mermaid ER diagrams
Document cross-database relationships
Add to architecture documentation

Phase 4: Cloud Sync Documentation (J.24.4)

Document local → cloud table mapping
Document sync direction and frequency
Create sync troubleshooting guide

ADR-118: Four-Tier Database Architecture
ADR-114: User Data Separation
ADR-149: Query Language Evolution Strategy
DATABASE-SCHEMA.md: Current schema reference (to be enhanced)

Changelog

Date	Change
2026-02-03	Initial version

Track: J.24 (Database Schema Documentation) Task: J.24.1.1

Status​

Context​

Problem Statement​

Current State​

Decision​

1. Data Dictionary Format​

2. Entity-Relationship Diagrams​

3. Cloud Sync Mapping​

4. Database Purpose Matrix​

5. Auto-Generation Script​

Database Tier Reference​

Tier 1: platform.db (Component Metadata)​

Tier 2: org.db (Critical Knowledge)​

Tier 3: sessions.db (Session Data)​

Tier 4: projects.db (Project Embeddings)​

Consequences​

Positive​

Negative​

Mitigations​

Implementation​

Phase 1: Data Dictionary Foundation (J.24.1)​

Phase 2: Complete Coverage (J.24.2)​

Phase 3: ER Diagrams (J.24.3)​

Phase 4: Cloud Sync Documentation (J.24.4)​

Related​

Changelog​