ADR-025: Comprehensive Entry Schema - Zero Data Loss Extraction

Status: Accepted Date: 2025-12-23 Updated: 2026-02-03 (Migrated to coditect-core per ADR-150) Deciders: Hal Casteel (Founder/CEO/CTO), CODITECT Core Team Technical Story: Capture ALL data from Claude Code JSONL files with zero loss, enabling complete agent session tracking and conversation reconstruction

Context and Problem Statement

The Data Loss Problem

The existing context extraction system (ADR-020, ADR-021) successfully captures user and assistant messages but:

Ignores 4 other entry types - system, queue-operation, summary, file-history-snapshot entries are discarded
Loses agent metadata - MoE expert/judge sessions lack agentId, parentUuid, isSidechain tracking
Breaks conversation threading - UUID relationships not preserved, making thread reconstruction impossible
Discards raw data - Original JSON structure lost, preventing future schema evolution

Discovery (Dec 23, 2025): Audit of 1,040 JSONL files revealed:

Entry Type	Count	% of Total	Previously Captured
assistant	95,821	52.2%	✅ Yes
user	57,290	31.2%	✅ Yes (partial)
queue-operation	17,073	9.3%	❌ No
file-history-snapshot	12,342	6.7%	❌ No
summary	988	0.5%	❌ No
system	743	0.4%	❌ No

Impact: 16.5% of all data was being discarded, including critical operational metadata.

Agent Session Tracking Gap

Agent sessions (MoE experts, judges, Task tool subagents) produce rich metadata:

{
  "parentUuid": "b101bc78-e0c9-4824-b815-3e94f6cc3ba3",
  "isSidechain": true,
  "userType": "external",
  "sessionId": "60cacdf6-403f-4350-8abc-42111016a762",
  "agentId": "ab782dc",
  "slug": "kind-chasing-hennessy",
  "type": "user",
  "message": { "role": "user", "content": "..." },
  "uuid": "62a64845-97ca-4ee9-967b-930196b50567",
  "timestamp": "2025-12-23T03:52:49.876Z"
}

Before this ADR: Only 14 of 94,000+ messages (0.01%) had agent context preserved.

Decision Drivers

Technical Requirements

R1: Zero data loss - preserve ALL fields from ALL entry types
R2: Raw JSON preservation - store original entry for future schema evolution
R3: Agent context tracking - full metadata for MoE/Task subagent sessions
R4: Conversation threading - uuid/parent_uuid relationships for thread reconstruction
R5: Type-specific optimization - specialized tables for each entry type's unique fields
R6: Backward compatibility - existing /cx and /cxq workflows unchanged

Integration Constraints

ADR-118 Compliance: All comprehensive entry tables belong in sessions.db (Tier 3 - Regenerable).

C1: Must extend existing sessions.db database (ADR-118 Tier 3)
C2: Must integrate with unified-message-extractor.py pipeline
C3: Must support incremental indexing (no full rebuilds required)
C4: Must work with existing FTS5 search infrastructure

Decision Outcome

Chosen Solution: Implement a comprehensive multi-table schema (DATABASE-SCHEMA-V2) with:

Master entries table - ALL entry types with raw_json preserved
Type-specific tables - Optimized for each entry type's unique fields
Threading table - UUID relationships for conversation reconstruction
Agent context - Full metadata for subagent session tracking

Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│              ENTRIES (master) - sessions.db (Tier 3)            │
│  All JSONL entries with common fields + full JSON blob           │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐   │
│  │   MESSAGES   │  │   SUMMARIES  │  │   FILE_SNAPSHOTS     │   │
│  │ user/assistant│ │ conversation │  │   file backups       │   │
│  └──────────────┘  └──────────────┘  └──────────────────────┘   │
│                                                                  │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐   │
│  │SYSTEM_EVENTS │  │QUEUE_OPERATIONS│ │   MESSAGE_THREADING │   │
│  │ compaction   │  │ command queue │  │   uuid relationships│   │
│  └──────────────┘  └──────────────┘  └──────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘

Database Schema

Storage Location: All tables in sessions.db (Tier 3 - Regenerable per ADR-118)

1. ENTRIES (Master Table)

Stores ALL entries with common fields extracted for indexing.

CREATE TABLE entries (
    -- Primary key
    id INTEGER PRIMARY KEY AUTOINCREMENT,

    -- Content hash for deduplication
    hash TEXT UNIQUE NOT NULL,

    -- Entry identification
    entry_type TEXT NOT NULL,           -- user, assistant, system, queue-operation, summary, file-history-snapshot
    uuid TEXT,                          -- Entry's unique ID
    parent_uuid TEXT,                   -- Parent entry UUID (threading)
    logical_parent_uuid TEXT,           -- Logical parent (for compaction)

    -- Session context
    session_id TEXT,
    agent_id TEXT,                      -- NULL for main sessions
    slug TEXT,                          -- Human-readable session name

    -- Environment
    cwd TEXT,                           -- Working directory
    git_branch TEXT,
    version TEXT,                       -- Claude Code version

    -- Timestamps
    timestamp TEXT,
    extracted_at TEXT,

    -- Flags
    is_sidechain BOOLEAN DEFAULT FALSE,
    is_meta BOOLEAN DEFAULT FALSE,
    user_type TEXT,                     -- external, internal

    -- Content (extracted for search)
    content TEXT,                       -- Main text content
    role TEXT,                          -- user, assistant, system

    -- Full original JSON (preserves everything)
    raw_json TEXT NOT NULL,

    -- Source tracking
    source_file TEXT,
    source_line INTEGER
);

-- Indexes
CREATE INDEX idx_entries_type ON entries(entry_type);
CREATE INDEX idx_entries_uuid ON entries(uuid);
CREATE INDEX idx_entries_parent ON entries(parent_uuid);
CREATE INDEX idx_entries_session ON entries(session_id);
CREATE INDEX idx_entries_agent ON entries(agent_id);
CREATE INDEX idx_entries_timestamp ON entries(timestamp);
CREATE INDEX idx_entries_role ON entries(role);

-- Full-text search
CREATE VIRTUAL TABLE entries_fts USING fts5(
    content,
    role,
    entry_type,
    content='entries',
    content_rowid='id'
);

2. SUMMARIES (Conversation Summaries)

Extracted from summary entries.

CREATE TABLE summaries (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    entry_id INTEGER REFERENCES entries(id) ON DELETE CASCADE,
    summary_text TEXT NOT NULL,
    leaf_uuid TEXT,                     -- Reference to conversation leaf
    UNIQUE(leaf_uuid)
);

CREATE INDEX idx_summaries_leaf ON summaries(leaf_uuid);

CREATE VIRTUAL TABLE summaries_fts USING fts5(
    summary_text,
    content='summaries',
    content_rowid='id'
);

3. SYSTEM_EVENTS (System Operations)

Extracted from system entries.

CREATE TABLE system_events (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    entry_id INTEGER REFERENCES entries(id) ON DELETE CASCADE,
    subtype TEXT,                       -- compact_boundary, error, etc.
    level TEXT,                         -- info, warning, error
    content TEXT,
    compact_trigger TEXT,               -- auto, manual
    compact_pre_tokens INTEGER,
    cause TEXT,
    error TEXT,
    max_retries INTEGER,
    retry_attempt INTEGER,
    retry_in_ms INTEGER
);

CREATE INDEX idx_system_events_subtype ON system_events(subtype);
CREATE INDEX idx_system_events_level ON system_events(level);

4. QUEUE_OPERATIONS (Command Queue)

Extracted from queue-operation entries.

CREATE TABLE queue_operations (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    entry_id INTEGER REFERENCES entries(id) ON DELETE CASCADE,
    operation TEXT NOT NULL,            -- enqueue, dequeue, clear
    content TEXT,                       -- Command content
    timestamp TEXT
);

CREATE INDEX idx_queue_ops_operation ON queue_operations(operation);

5. FILE_SNAPSHOTS (File History)

Extracted from file-history-snapshot entries.

CREATE TABLE file_snapshots (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    entry_id INTEGER REFERENCES entries(id) ON DELETE CASCADE,
    message_id TEXT,                    -- Links to message
    snapshot_timestamp TEXT,
    is_snapshot_update BOOLEAN DEFAULT FALSE,
    snapshot_data TEXT                  -- JSON blob of trackedFileBackups
);

CREATE TABLE file_backup_entries (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    snapshot_id INTEGER REFERENCES file_snapshots(id) ON DELETE CASCADE,
    file_path TEXT NOT NULL,
    backup_file_name TEXT,
    version INTEGER,
    backup_time TEXT
);

CREATE INDEX idx_file_backups_path ON file_backup_entries(file_path);
CREATE INDEX idx_file_backups_snapshot ON file_backup_entries(snapshot_id);

6. MESSAGE_THREADING (Conversation Graph)

Extracted relationships for graph traversal.

CREATE TABLE message_threading (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    uuid TEXT UNIQUE NOT NULL,
    parent_uuid TEXT,
    logical_parent_uuid TEXT,
    session_id TEXT,
    agent_id TEXT,
    entry_type TEXT,
    timestamp TEXT
);

CREATE INDEX idx_threading_parent ON message_threading(parent_uuid);
CREATE INDEX idx_threading_logical ON message_threading(logical_parent_uuid);
CREATE INDEX idx_threading_session ON message_threading(session_id);
CREATE INDEX idx_threading_agent ON message_threading(agent_id);

7. ENHANCED EXPORT FIELDS (v4.0 - January 2026)

New columns added to messages table to capture enhanced metadata from export TXT files.

-- Migration: Add enhanced export fields (v4.0 unified-message-extractor)
ALTER TABLE messages ADD COLUMN tool_use TEXT;        -- JSON blob of tool invocations
ALTER TABLE messages ADD COLUMN tool_result TEXT;     -- JSON blob of tool results with status
ALTER TABLE messages ADD COLUMN export_metadata TEXT; -- JSON blob of export header metadata

-- Indexes for efficient querying
CREATE INDEX idx_messages_tool_status ON messages(json_extract(tool_result, '$.status'));
CREATE INDEX idx_messages_export_model ON messages(json_extract(export_metadata, '$.model'));

tool_use Column (JSON Structure)

Stores tool invocations from assistant messages.

{
  "tool_name": "Edit",
  "parameters": {
    "file_path": "/path/to/file.py",
    "old_string": "...",
    "new_string": "..."
  }
}

tool_result Column (JSON Structure)

Stores tool execution results with status tracking.

{
  "tool_name": "Edit",
  "status": "success",           // success | error | interrupted | truncated
  "diff": {
    "lines_added": 15,
    "lines_removed": 3,
    "truncated_range": null      // [start, end] if output truncated
  },
  "error_message": null          // Present when status is "error"
}

Status Values:

Status	Description
`success`	Tool completed normally
`error`	Tool returned error message
`interrupted`	Tool was cancelled before completion
`truncated`	Output was truncated (large files)

export_metadata Column (JSON Structure)

Stores metadata extracted from export file headers.

{
  "version": "1.0.96",
  "model": "claude-opus-4-5-20251101",
  "tier": "anthropic_business",
  "working_dir": "/Users/halcasteel/PROJECTS/coditect-rollout-master"
}

Field Mapping by Entry Type

user → entries + messages

Source Field	entries	messages
type	entry_type	-
uuid	uuid	-
parentUuid	parent_uuid	-
sessionId	session_id	-
agentId	agent_id	-
cwd	cwd	-
gitBranch	git_branch	-
version	version	-
timestamp	timestamp	-
isSidechain	is_sidechain	-
isMeta	is_meta	-
userType	user_type	-
slug	slug	-
message.role	role	role
message.content	content	content
(full entry)	raw_json	-

assistant → entries + messages

Source Field	entries	messages
type	entry_type	-
uuid	uuid	-
parentUuid	parent_uuid	-
requestId	-	request_id
message.id	-	message_id
message.model	-	model
message.usage.*	-	input_tokens, etc.
(full entry)	raw_json	-

system → entries + system_events

Source Field	entries	system_events
type	entry_type	-
subtype	-	subtype
level	-	level
compactMetadata	-	compact_* fields
error	-	error

summary → entries + summaries

Source Field	entries	summaries
type	entry_type	-
summary	content	summary_text
leafUuid	-	leaf_uuid

queue-operation → entries + queue_operations

Source Field	entries	queue_operations
type	entry_type	-
operation	-	operation
content	content	content

file-history-snapshot → entries + file_snapshots + file_backup_entries

Source Field	entries	file_snapshots	file_backup_entries
type	entry_type	-	-
messageId	-	message_id	-
snapshot.trackedFileBackups	-	snapshot_data	(expanded rows)

CLI Interface

New Commands

# Index all 6 entry types
python3 scripts/context-db.py --index-comprehensive

# Rebuild from scratch
python3 scripts/context-db.py --rebuild-comprehensive

# View statistics
python3 scripts/context-db.py --comprehensive-stats

# View message threading relationships
python3 scripts/context-db.py --threading

# Follow conversation thread from UUID
python3 scripts/context-db.py --thread-from <uuid>

# List file backup entries
python3 scripts/context-db.py --file-backups
python3 scripts/context-db.py --file-backups --file-path "README.md"

# Enhanced Export Analytics (v4.0 - January 2026)
python3 scripts/context-db.py --diff-stats       # Code change statistics
python3 scripts/context-db.py --failed-tools     # Failed/interrupted tool invocations
python3 scripts/context-db.py --export-models    # Messages by Claude model

Query Examples

-- Find all agent messages
SELECT e.*, m.* FROM entries e
JOIN messages m ON m.entry_id = e.id
WHERE e.agent_id IS NOT NULL;

-- Get conversation thread
WITH RECURSIVE thread AS (
    SELECT * FROM message_threading WHERE uuid = ?
    UNION ALL
    SELECT mt.* FROM message_threading mt
    JOIN thread t ON mt.uuid = t.parent_uuid
)
SELECT * FROM thread;

-- Find all summaries for recall
SELECT s.summary_text, e.session_id, e.timestamp
FROM summaries s
JOIN entries e ON s.entry_id = e.id
ORDER BY e.timestamp DESC;

-- Get file history for a path
SELECT fbe.*, fs.snapshot_timestamp
FROM file_backup_entries fbe
JOIN file_snapshots fs ON fbe.snapshot_id = fs.id
WHERE fbe.file_path LIKE '%/README.md'
ORDER BY fs.snapshot_timestamp DESC;

-- Enhanced Export Analytics (v4.0)
-- Get code change statistics
SELECT
    COUNT(*) as total_edits,
    SUM(json_extract(tool_result, '$.diff.lines_added')) as lines_added,
    SUM(json_extract(tool_result, '$.diff.lines_removed')) as lines_removed
FROM messages
WHERE tool_result IS NOT NULL
  AND json_extract(tool_result, '$.tool_name') = 'Edit';

-- Find failed tool invocations
SELECT
    json_extract(tool_result, '$.tool_name') as tool,
    json_extract(tool_result, '$.status') as status,
    json_extract(tool_result, '$.error_message') as error,
    timestamp
FROM messages
WHERE json_extract(tool_result, '$.status') IN ('error', 'interrupted', 'truncated')
ORDER BY timestamp DESC;

-- Group messages by Claude model
SELECT
    json_extract(export_metadata, '$.model') as model,
    COUNT(*) as message_count
FROM messages
WHERE export_metadata IS NOT NULL
GROUP BY json_extract(export_metadata, '$.model')
ORDER BY message_count DESC;

Implementation Details

Files Modified

File	Changes
`scripts/context-db.py`	Added 7 new tables, `index_comprehensive_entries()`, `get_comprehensive_stats()`, CLI options
`scripts/unified-message-extractor.py`	Added `create_comprehensive_entry()`, `extract_comprehensive()`
`internal/architecture/DATABASE-SCHEMA-V2.md`	Schema design document

Migration Strategy

No migration required - New tables are additive
Existing data preserved - messages table unchanged
Incremental indexing - Run --index-comprehensive to populate new tables
Hash-based deduplication - Same entry won't be indexed twice

Consequences

Positive Outcomes

P1: Zero Data Loss

All 6 entry types captured (vs. 2 previously)
Raw JSON preserved for future schema evolution
100% of JSONL content now queryable

P2: Agent Session Tracking

Full MoE expert/judge output recovery
Parent-child session relationships
Sidechain conversation detection

P3: Conversation Reconstruction

UUID threading enables graph traversal
Complete conversation history recovery
Temporal ordering via timestamps

P4: Operational Insights

System events (compaction, errors, retries) now visible
Command queue history for debugging
File backup tracking for recovery

Negative Outcomes / Trade-offs

N1: Increased Storage

~50% more database space due to raw_json
Mitigated by: Deduplication still active

N2: Indexing Time

Full rebuild takes longer with 6 entry types
Mitigated by: Incremental indexing supported

N3: Schema Complexity

7 new tables to understand
Mitigated by: Clear separation of concerns, comprehensive documentation

Estimated Storage

~180K entries × ~2KB avg = ~360MB entries table
Type-specific tables: ~50MB total
Indexes: ~100MB
Total: ~500MB (vs. ~250MB before)

ADR-020 - Context Extraction (/cx)
ADR-021 - Context Query System (/cxq)
ADR-023 - SQLite Scalability Analysis
ADR-118 - Four-Tier Database Architecture
scripts/unified-message-extractor.py - Extraction logic
scripts/context-db.py - Database operations
internal/architecture/DATABASE-SCHEMA-V2.md - Schema design

Validation Criteria

All 6 entry types extracted from JSONL files
Agent context (agent_id, parent_uuid, is_sidechain) preserved
UUID threading enables conversation reconstruction
Raw JSON stored for every entry
FTS5 search works on entries table
--comprehensive-stats shows correct counts
--threading displays UUID relationships
--thread-from reconstructs conversation

Status: ✅ ACCEPTED Decision Date: 2025-12-23 Migrated: 2026-02-03 (ADR-150) Implementation Status: COMPLETE Maintainer: CODITECT Core Team Review Date: 2026-03-23 (quarterly review)

Changelog

v1.1.0 (2026-01-04)

Enhanced Export Fields (v4.0 unified-message-extractor)

Added tool_use column to messages table for tool invocation tracking
Added tool_result column with status tracking (success/error/interrupted/truncated)
Added export_metadata column for export header metadata (version, model, tier)
Added indexes for efficient JSON field querying
Added new CLI commands: --diff-stats, --failed-tools, --export-models
Added SQL query examples for enhanced analytics

v1.0.0 (2025-12-23)

Initial comprehensive entry schema
All 6 JSONL entry types captured
Agent context tracking
Conversation threading

Context and Problem Statement​

The Data Loss Problem​

Agent Session Tracking Gap​

Decision Drivers​

Technical Requirements​

Integration Constraints​

Decision Outcome​

Architecture Overview​

Database Schema​

1. ENTRIES (Master Table)​

2. SUMMARIES (Conversation Summaries)​

3. SYSTEM_EVENTS (System Operations)​

4. QUEUE_OPERATIONS (Command Queue)​

5. FILE_SNAPSHOTS (File History)​

6. MESSAGE_THREADING (Conversation Graph)​

7. ENHANCED EXPORT FIELDS (v4.0 - January 2026)​

tool_use Column (JSON Structure)​

tool_result Column (JSON Structure)​

export_metadata Column (JSON Structure)​

Field Mapping by Entry Type​

user → entries + messages​

assistant → entries + messages​

system → entries + system_events​

summary → entries + summaries​

queue-operation → entries + queue_operations​

file-history-snapshot → entries + file_snapshots + file_backup_entries​

CLI Interface​

New Commands​

Query Examples​

Implementation Details​

Files Modified​

Migration Strategy​

Consequences​

Positive Outcomes​

Negative Outcomes / Trade-offs​

Estimated Storage​

Related ADRs​

Validation Criteria​

Changelog​

v1.1.0 (2026-01-04)​

v1.0.0 (2025-12-23)​