ADR-148: Database Schema Documentation Standard
Status
ACCEPTED (2026-02-03)
Context
Problem Statement
CODITECT's four-tier database architecture (ADR-118) contains 94 tables across 4 databases, but documentation exists only at the high level. There is no:
- Field-level documentation - What each column means and its purpose
- Relationship documentation - Foreign keys and entity relationships
- Data dictionary - Centralized reference for all tables and fields
- Purpose mapping - Which business functions each table serves
This causes problems:
- Developers don't know which table to use for new features
- Field names are ambiguous (e.g.,
sync_statusvscloud_sync_status) - Cloud sync mapping is undocumented
- Query optimization is guesswork without understanding data relationships
Current State
| Database | Tables | Views | FTS | Documentation |
|---|---|---|---|---|
platform.db | 6 | 0 | 1 | Schema only |
org.db | 8 | 1 | 0 | Schema only |
sessions.db | 74 | 0 | 4 | Schema only |
projects.db | 6 | 0 | 1 | Schema only |
Decision
Establish a Database Schema Documentation Standard that provides:
1. Data Dictionary Format
Each table documented in docs/reference/database/DATA-DICTIONARY.md:
## Table: `messages` (sessions.db)
**Purpose:** Stores extracted messages from all LLM sessions (Claude, Codex, Gemini, KIMI)
**Tier:** 3 (Regenerable from JSONL source files)
**Related Tables:**
- `embeddings` (1:1 - semantic vectors)
- `token_economics` (1:N - per-message tokens)
- `message_component_invocations` (1:N - skill/agent invocations)
### Fields
| Field | Type | Nullable | Default | Description |
|-------|------|----------|---------|-------------|
| `id` | INTEGER | NO | AUTO | Primary key |
| `hash` | TEXT | NO | - | SHA-256 content hash (dedup key) |
| `content` | TEXT | NO | - | Message text content |
| `role` | TEXT | NO | - | Message role: user, assistant, tool_use, tool_result |
| `source_type` | TEXT | YES | - | Source format: jsonl, export |
| `source_file` | TEXT | YES | - | Original file path |
| `session_id` | TEXT | YES | - | Claude Code session UUID |
| `checkpoint` | TEXT | YES | - | Checkpoint reference |
| `timestamp` | TEXT | YES | - | ISO 8601 timestamp |
| `extracted_at` | TEXT | YES | - | When message was extracted |
| `content_length` | INTEGER | YES | - | Character count |
| `has_code` | BOOLEAN | YES | - | Contains code blocks |
| `has_markdown` | BOOLEAN | YES | - | Contains markdown formatting |
| `agent_context` | TEXT | YES | - | JSON: agent execution context |
| `tool_use` | TEXT | YES | - | JSON: tool calls in message |
| `tool_result` | TEXT | YES | - | JSON: tool results |
| `export_metadata` | TEXT | YES | - | JSON: export-specific metadata |
| `component_invocations` | TEXT | YES | - | JSON: invoked skills/agents |
| `tenant_id` | TEXT | YES | - | Multi-tenant: organization ID |
| `user_id` | TEXT | YES | - | Multi-tenant: user ID |
| `team_id` | TEXT | YES | - | Multi-tenant: team ID |
| `project_id` | TEXT | YES | - | Multi-tenant: project ID |
| `cloud_id` | TEXT | YES | - | Cloud-assigned UUID |
| `synced_at` | TEXT | YES | - | Last cloud sync timestamp |
| `sync_status` | TEXT | YES | 'pending' | Cloud sync: pending, synced, error |
### Indexes
| Index | Columns | Purpose |
|-------|---------|---------|
| `idx_messages_tenant` | `tenant_id` | Tenant isolation |
| `idx_messages_user` | `tenant_id, user_id` | User filtering |
| `idx_messages_project` | `tenant_id, project_id` | Project filtering |
| `idx_messages_sync` | `sync_status, synced_at` | Cloud sync queries |
| `idx_messages_cloud_id` | `cloud_id` | Cloud deduplication |
### FTS5 Index
Table `messages_fts` provides full-text search on `content` and `role`.
2. Entity-Relationship Diagrams
Mermaid ER diagrams in docs/reference/database/ER-DIAGRAMS.md:
3. Cloud Sync Mapping
Document which tables sync to cloud and their PostgreSQL equivalents:
| Local Table | Cloud Table | Sync Direction | Frequency |
|---|---|---|---|
messages | context.Message | Push | Real-time |
decisions | context.Decision | Push/Pull | Real-time |
skill_learnings | context.SkillLearning | Push | Batch |
error_solutions | context.ErrorSolution | Push/Pull | Real-time |
token_economics | analytics.TokenUsage | Push | Batch |
tool_analytics | analytics.ToolUsage | Push | Batch |
4. Database Purpose Matrix
Quick reference for developers:
| Need | Table | Database | Example |
|---|---|---|---|
| Store a decision | decisions | org.db | Architecture choice |
| Log token usage | token_economics | sessions.db | API call cost |
| Track tool calls | tool_analytics | sessions.db | Read/Write/Bash |
| Index a component | components | platform.db | New agent |
| Register a project | projects | projects.db | Customer repo |
5. Auto-Generation Script
Create scripts/generate-data-dictionary.py that:
- Connects to all 4 databases
- Extracts schema via
PRAGMA table_info() - Generates markdown documentation
- Validates against expected schema
- Detects schema drift
# Usage
python3 scripts/generate-data-dictionary.py --output docs/reference/database/DATA-DICTIONARY.md
python3 scripts/generate-data-dictionary.py --validate # Check for drift
Database Tier Reference
Tier 1: platform.db (Component Metadata)
Purpose: Index of CODITECT framework components
Regenerable: Yes (via component-indexer.py)
Tables: 6
| Table | Purpose |
|---|---|
components | Master component registry (agents, skills, commands) |
capabilities | Component capabilities (primary, tag, domain, action) |
triggers | When to use/avoid components |
component_relationships | Dependencies (invokes, alternative, complement) |
component_composability | Orchestration capabilities |
component_usage_stats | Runtime usage metrics |
Tier 2: org.db (Critical Knowledge)
Purpose: Irreplaceable organizational knowledge Regenerable: NO - requires daily backup Tables: 8
| Table | Purpose |
|---|---|
decisions | Architectural/technical decisions |
error_solutions | Error patterns with solutions |
skill_learnings | Skill effectiveness tracking |
projects | Project registry |
project_tracks | PILOT track progress |
local_identity | Auth state |
cloud_sync_state | Sync tracking |
Tier 3: sessions.db (Session Data)
Purpose: Session messages and analytics Regenerable: Yes (from JSONL exports) Tables: 74
Core Tables:
| Table | Purpose |
|---|---|
messages | Extracted session messages |
entries | Raw session entries with threading |
token_economics | Token usage and costs |
tool_analytics | Tool execution metrics |
embeddings | Semantic vectors for messages |
Knowledge Tables:
| Table | Purpose |
|---|---|
code_patterns | Extracted code patterns |
knowledge_graph | Entity relationships |
knowledge_entities | Indexed entities |
Agent Loop Tables:
| Table | Purpose |
|---|---|
agent_sessions | Ralph loop tracking |
checkpoints | Execution state snapshots |
reasoning_traces | Agent reasoning steps |
health_events | Agent health monitoring |
Learning Tables:
| Table | Purpose |
|---|---|
learning_modules | Learning content |
learning_progress | User progress |
learning_badges | Achievements |
Tier 4: projects.db (Project Embeddings)
Purpose: Project-specific semantic search Regenerable: Yes (re-index from source) Tables: 6
| Table | Purpose |
|---|---|
projects | Registered projects with git metadata |
content_hashes | File change detection |
project_embeddings | Code chunk embeddings |
exclude_patterns | Gitignore-style patterns |
project_tags | Project categorization |
project_activity | Activity log |
Consequences
Positive
- Self-documenting databases - Developers can understand schema without reading code
- Consistent field naming - Data dictionary enforces naming conventions
- Cloud sync clarity - Clear mapping between local and cloud
- Onboarding acceleration - New contributors understand data model faster
- Query optimization - Understanding relationships enables efficient queries
Negative
- Maintenance burden - Documentation must be updated with schema changes
- Auto-generation limitations - Scripts can't capture business context
Mitigations
- Auto-generation script reduces manual maintenance
- Pre-commit hook validates documentation matches schema
- Schema changes require data dictionary update in same PR
Implementation
Phase 1: Data Dictionary Foundation (J.24.1)
- Create
docs/reference/database/DATA-DICTIONARY.md - Document all tables in
sessions.db(largest database) - Create auto-generation script
Phase 2: Complete Coverage (J.24.2)
- Document
org.dbtables (Tier 2 - critical) - Document
platform.dbtables (Tier 1) - Document
projects.dbtables (Tier 4)
Phase 3: ER Diagrams (J.24.3)
- Create Mermaid ER diagrams
- Document cross-database relationships
- Add to architecture documentation
Phase 4: Cloud Sync Documentation (J.24.4)
- Document local → cloud table mapping
- Document sync direction and frequency
- Create sync troubleshooting guide
Related
- ADR-118: Four-Tier Database Architecture
- ADR-114: User Data Separation
- ADR-149: Query Language Evolution Strategy
- DATABASE-SCHEMA.md: Current schema reference (to be enhanced)
Changelog
| Date | Change |
|---|---|
| 2026-02-03 | Initial version |
Track: J.24 (Database Schema Documentation) Task: J.24.1.1