Skip to main content

ADR-148: Database Schema Documentation Standard

Status

ACCEPTED (2026-02-03)

Context

Problem Statement

CODITECT's four-tier database architecture (ADR-118) contains 94 tables across 4 databases, but documentation exists only at the high level. There is no:

  1. Field-level documentation - What each column means and its purpose
  2. Relationship documentation - Foreign keys and entity relationships
  3. Data dictionary - Centralized reference for all tables and fields
  4. Purpose mapping - Which business functions each table serves

This causes problems:

  • Developers don't know which table to use for new features
  • Field names are ambiguous (e.g., sync_status vs cloud_sync_status)
  • Cloud sync mapping is undocumented
  • Query optimization is guesswork without understanding data relationships

Current State

DatabaseTablesViewsFTSDocumentation
platform.db601Schema only
org.db810Schema only
sessions.db7404Schema only
projects.db601Schema only

Decision

Establish a Database Schema Documentation Standard that provides:

1. Data Dictionary Format

Each table documented in docs/reference/database/DATA-DICTIONARY.md:

## Table: `messages` (sessions.db)

**Purpose:** Stores extracted messages from all LLM sessions (Claude, Codex, Gemini, KIMI)

**Tier:** 3 (Regenerable from JSONL source files)

**Related Tables:**
- `embeddings` (1:1 - semantic vectors)
- `token_economics` (1:N - per-message tokens)
- `message_component_invocations` (1:N - skill/agent invocations)

### Fields

| Field | Type | Nullable | Default | Description |
|-------|------|----------|---------|-------------|
| `id` | INTEGER | NO | AUTO | Primary key |
| `hash` | TEXT | NO | - | SHA-256 content hash (dedup key) |
| `content` | TEXT | NO | - | Message text content |
| `role` | TEXT | NO | - | Message role: user, assistant, tool_use, tool_result |
| `source_type` | TEXT | YES | - | Source format: jsonl, export |
| `source_file` | TEXT | YES | - | Original file path |
| `session_id` | TEXT | YES | - | Claude Code session UUID |
| `checkpoint` | TEXT | YES | - | Checkpoint reference |
| `timestamp` | TEXT | YES | - | ISO 8601 timestamp |
| `extracted_at` | TEXT | YES | - | When message was extracted |
| `content_length` | INTEGER | YES | - | Character count |
| `has_code` | BOOLEAN | YES | - | Contains code blocks |
| `has_markdown` | BOOLEAN | YES | - | Contains markdown formatting |
| `agent_context` | TEXT | YES | - | JSON: agent execution context |
| `tool_use` | TEXT | YES | - | JSON: tool calls in message |
| `tool_result` | TEXT | YES | - | JSON: tool results |
| `export_metadata` | TEXT | YES | - | JSON: export-specific metadata |
| `component_invocations` | TEXT | YES | - | JSON: invoked skills/agents |
| `tenant_id` | TEXT | YES | - | Multi-tenant: organization ID |
| `user_id` | TEXT | YES | - | Multi-tenant: user ID |
| `team_id` | TEXT | YES | - | Multi-tenant: team ID |
| `project_id` | TEXT | YES | - | Multi-tenant: project ID |
| `cloud_id` | TEXT | YES | - | Cloud-assigned UUID |
| `synced_at` | TEXT | YES | - | Last cloud sync timestamp |
| `sync_status` | TEXT | YES | 'pending' | Cloud sync: pending, synced, error |

### Indexes

| Index | Columns | Purpose |
|-------|---------|---------|
| `idx_messages_tenant` | `tenant_id` | Tenant isolation |
| `idx_messages_user` | `tenant_id, user_id` | User filtering |
| `idx_messages_project` | `tenant_id, project_id` | Project filtering |
| `idx_messages_sync` | `sync_status, synced_at` | Cloud sync queries |
| `idx_messages_cloud_id` | `cloud_id` | Cloud deduplication |

### FTS5 Index

Table `messages_fts` provides full-text search on `content` and `role`.

2. Entity-Relationship Diagrams

Mermaid ER diagrams in docs/reference/database/ER-DIAGRAMS.md:

3. Cloud Sync Mapping

Document which tables sync to cloud and their PostgreSQL equivalents:

Local TableCloud TableSync DirectionFrequency
messagescontext.MessagePushReal-time
decisionscontext.DecisionPush/PullReal-time
skill_learningscontext.SkillLearningPushBatch
error_solutionscontext.ErrorSolutionPush/PullReal-time
token_economicsanalytics.TokenUsagePushBatch
tool_analyticsanalytics.ToolUsagePushBatch

4. Database Purpose Matrix

Quick reference for developers:

NeedTableDatabaseExample
Store a decisiondecisionsorg.dbArchitecture choice
Log token usagetoken_economicssessions.dbAPI call cost
Track tool callstool_analyticssessions.dbRead/Write/Bash
Index a componentcomponentsplatform.dbNew agent
Register a projectprojectsprojects.dbCustomer repo

5. Auto-Generation Script

Create scripts/generate-data-dictionary.py that:

  1. Connects to all 4 databases
  2. Extracts schema via PRAGMA table_info()
  3. Generates markdown documentation
  4. Validates against expected schema
  5. Detects schema drift
# Usage
python3 scripts/generate-data-dictionary.py --output docs/reference/database/DATA-DICTIONARY.md
python3 scripts/generate-data-dictionary.py --validate # Check for drift

Database Tier Reference

Tier 1: platform.db (Component Metadata)

Purpose: Index of CODITECT framework components Regenerable: Yes (via component-indexer.py) Tables: 6

TablePurpose
componentsMaster component registry (agents, skills, commands)
capabilitiesComponent capabilities (primary, tag, domain, action)
triggersWhen to use/avoid components
component_relationshipsDependencies (invokes, alternative, complement)
component_composabilityOrchestration capabilities
component_usage_statsRuntime usage metrics

Tier 2: org.db (Critical Knowledge)

Purpose: Irreplaceable organizational knowledge Regenerable: NO - requires daily backup Tables: 8

TablePurpose
decisionsArchitectural/technical decisions
error_solutionsError patterns with solutions
skill_learningsSkill effectiveness tracking
projectsProject registry
project_tracksPILOT track progress
local_identityAuth state
cloud_sync_stateSync tracking

Tier 3: sessions.db (Session Data)

Purpose: Session messages and analytics Regenerable: Yes (from JSONL exports) Tables: 74

Core Tables:

TablePurpose
messagesExtracted session messages
entriesRaw session entries with threading
token_economicsToken usage and costs
tool_analyticsTool execution metrics
embeddingsSemantic vectors for messages

Knowledge Tables:

TablePurpose
code_patternsExtracted code patterns
knowledge_graphEntity relationships
knowledge_entitiesIndexed entities

Agent Loop Tables:

TablePurpose
agent_sessionsRalph loop tracking
checkpointsExecution state snapshots
reasoning_tracesAgent reasoning steps
health_eventsAgent health monitoring

Learning Tables:

TablePurpose
learning_modulesLearning content
learning_progressUser progress
learning_badgesAchievements

Tier 4: projects.db (Project Embeddings)

Purpose: Project-specific semantic search Regenerable: Yes (re-index from source) Tables: 6

TablePurpose
projectsRegistered projects with git metadata
content_hashesFile change detection
project_embeddingsCode chunk embeddings
exclude_patternsGitignore-style patterns
project_tagsProject categorization
project_activityActivity log

Consequences

Positive

  1. Self-documenting databases - Developers can understand schema without reading code
  2. Consistent field naming - Data dictionary enforces naming conventions
  3. Cloud sync clarity - Clear mapping between local and cloud
  4. Onboarding acceleration - New contributors understand data model faster
  5. Query optimization - Understanding relationships enables efficient queries

Negative

  1. Maintenance burden - Documentation must be updated with schema changes
  2. Auto-generation limitations - Scripts can't capture business context

Mitigations

  • Auto-generation script reduces manual maintenance
  • Pre-commit hook validates documentation matches schema
  • Schema changes require data dictionary update in same PR

Implementation

Phase 1: Data Dictionary Foundation (J.24.1)

  1. Create docs/reference/database/DATA-DICTIONARY.md
  2. Document all tables in sessions.db (largest database)
  3. Create auto-generation script

Phase 2: Complete Coverage (J.24.2)

  1. Document org.db tables (Tier 2 - critical)
  2. Document platform.db tables (Tier 1)
  3. Document projects.db tables (Tier 4)

Phase 3: ER Diagrams (J.24.3)

  1. Create Mermaid ER diagrams
  2. Document cross-database relationships
  3. Add to architecture documentation

Phase 4: Cloud Sync Documentation (J.24.4)

  1. Document local → cloud table mapping
  2. Document sync direction and frequency
  3. Create sync troubleshooting guide
  • ADR-118: Four-Tier Database Architecture
  • ADR-114: User Data Separation
  • ADR-149: Query Language Evolution Strategy
  • DATABASE-SCHEMA.md: Current schema reference (to be enhanced)

Changelog

DateChange
2026-02-03Initial version

Track: J.24 (Database Schema Documentation) Task: J.24.1.1