Skip to main content

ADR-025: Comprehensive Entry Schema - Zero Data Loss Extraction

Status: Accepted Date: 2025-12-23 Updated: 2026-02-03 (Migrated to coditect-core per ADR-150) Deciders: Hal Casteel (Founder/CEO/CTO), CODITECT Core Team Technical Story: Capture ALL data from Claude Code JSONL files with zero loss, enabling complete agent session tracking and conversation reconstruction


Context and Problem Statement

The Data Loss Problem

The existing context extraction system (ADR-020, ADR-021) successfully captures user and assistant messages but:

  1. Ignores 4 other entry types - system, queue-operation, summary, file-history-snapshot entries are discarded
  2. Loses agent metadata - MoE expert/judge sessions lack agentId, parentUuid, isSidechain tracking
  3. Breaks conversation threading - UUID relationships not preserved, making thread reconstruction impossible
  4. Discards raw data - Original JSON structure lost, preventing future schema evolution

Discovery (Dec 23, 2025): Audit of 1,040 JSONL files revealed:

Entry TypeCount% of TotalPreviously Captured
assistant95,82152.2%✅ Yes
user57,29031.2%✅ Yes (partial)
queue-operation17,0739.3%❌ No
file-history-snapshot12,3426.7%❌ No
summary9880.5%❌ No
system7430.4%❌ No

Impact: 16.5% of all data was being discarded, including critical operational metadata.

Agent Session Tracking Gap

Agent sessions (MoE experts, judges, Task tool subagents) produce rich metadata:

{
"parentUuid": "b101bc78-e0c9-4824-b815-3e94f6cc3ba3",
"isSidechain": true,
"userType": "external",
"sessionId": "60cacdf6-403f-4350-8abc-42111016a762",
"agentId": "ab782dc",
"slug": "kind-chasing-hennessy",
"type": "user",
"message": { "role": "user", "content": "..." },
"uuid": "62a64845-97ca-4ee9-967b-930196b50567",
"timestamp": "2025-12-23T03:52:49.876Z"
}

Before this ADR: Only 14 of 94,000+ messages (0.01%) had agent context preserved.


Decision Drivers

Technical Requirements

  • R1: Zero data loss - preserve ALL fields from ALL entry types
  • R2: Raw JSON preservation - store original entry for future schema evolution
  • R3: Agent context tracking - full metadata for MoE/Task subagent sessions
  • R4: Conversation threading - uuid/parent_uuid relationships for thread reconstruction
  • R5: Type-specific optimization - specialized tables for each entry type's unique fields
  • R6: Backward compatibility - existing /cx and /cxq workflows unchanged

Integration Constraints

ADR-118 Compliance: All comprehensive entry tables belong in sessions.db (Tier 3 - Regenerable).

  • C1: Must extend existing sessions.db database (ADR-118 Tier 3)
  • C2: Must integrate with unified-message-extractor.py pipeline
  • C3: Must support incremental indexing (no full rebuilds required)
  • C4: Must work with existing FTS5 search infrastructure

Decision Outcome

Chosen Solution: Implement a comprehensive multi-table schema (DATABASE-SCHEMA-V2) with:

  1. Master entries table - ALL entry types with raw_json preserved
  2. Type-specific tables - Optimized for each entry type's unique fields
  3. Threading table - UUID relationships for conversation reconstruction
  4. Agent context - Full metadata for subagent session tracking

Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│ ENTRIES (master) - sessions.db (Tier 3) │
│ All JSONL entries with common fields + full JSON blob │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │
│ │ MESSAGES │ │ SUMMARIES │ │ FILE_SNAPSHOTS │ │
│ │ user/assistant│ │ conversation │ │ file backups │ │
│ └──────────────┘ └──────────────┘ └──────────────────────┘ │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │
│ │SYSTEM_EVENTS │ │QUEUE_OPERATIONS│ │ MESSAGE_THREADING │ │
│ │ compaction │ │ command queue │ │ uuid relationships│ │
│ └──────────────┘ └──────────────┘ └──────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘

Database Schema

Storage Location: All tables in sessions.db (Tier 3 - Regenerable per ADR-118)

1. ENTRIES (Master Table)

Stores ALL entries with common fields extracted for indexing.

CREATE TABLE entries (
-- Primary key
id INTEGER PRIMARY KEY AUTOINCREMENT,

-- Content hash for deduplication
hash TEXT UNIQUE NOT NULL,

-- Entry identification
entry_type TEXT NOT NULL, -- user, assistant, system, queue-operation, summary, file-history-snapshot
uuid TEXT, -- Entry's unique ID
parent_uuid TEXT, -- Parent entry UUID (threading)
logical_parent_uuid TEXT, -- Logical parent (for compaction)

-- Session context
session_id TEXT,
agent_id TEXT, -- NULL for main sessions
slug TEXT, -- Human-readable session name

-- Environment
cwd TEXT, -- Working directory
git_branch TEXT,
version TEXT, -- Claude Code version

-- Timestamps
timestamp TEXT,
extracted_at TEXT,

-- Flags
is_sidechain BOOLEAN DEFAULT FALSE,
is_meta BOOLEAN DEFAULT FALSE,
user_type TEXT, -- external, internal

-- Content (extracted for search)
content TEXT, -- Main text content
role TEXT, -- user, assistant, system

-- Full original JSON (preserves everything)
raw_json TEXT NOT NULL,

-- Source tracking
source_file TEXT,
source_line INTEGER
);

-- Indexes
CREATE INDEX idx_entries_type ON entries(entry_type);
CREATE INDEX idx_entries_uuid ON entries(uuid);
CREATE INDEX idx_entries_parent ON entries(parent_uuid);
CREATE INDEX idx_entries_session ON entries(session_id);
CREATE INDEX idx_entries_agent ON entries(agent_id);
CREATE INDEX idx_entries_timestamp ON entries(timestamp);
CREATE INDEX idx_entries_role ON entries(role);

-- Full-text search
CREATE VIRTUAL TABLE entries_fts USING fts5(
content,
role,
entry_type,
content='entries',
content_rowid='id'
);

2. SUMMARIES (Conversation Summaries)

Extracted from summary entries.

CREATE TABLE summaries (
id INTEGER PRIMARY KEY AUTOINCREMENT,
entry_id INTEGER REFERENCES entries(id) ON DELETE CASCADE,
summary_text TEXT NOT NULL,
leaf_uuid TEXT, -- Reference to conversation leaf
UNIQUE(leaf_uuid)
);

CREATE INDEX idx_summaries_leaf ON summaries(leaf_uuid);

CREATE VIRTUAL TABLE summaries_fts USING fts5(
summary_text,
content='summaries',
content_rowid='id'
);

3. SYSTEM_EVENTS (System Operations)

Extracted from system entries.

CREATE TABLE system_events (
id INTEGER PRIMARY KEY AUTOINCREMENT,
entry_id INTEGER REFERENCES entries(id) ON DELETE CASCADE,
subtype TEXT, -- compact_boundary, error, etc.
level TEXT, -- info, warning, error
content TEXT,
compact_trigger TEXT, -- auto, manual
compact_pre_tokens INTEGER,
cause TEXT,
error TEXT,
max_retries INTEGER,
retry_attempt INTEGER,
retry_in_ms INTEGER
);

CREATE INDEX idx_system_events_subtype ON system_events(subtype);
CREATE INDEX idx_system_events_level ON system_events(level);

4. QUEUE_OPERATIONS (Command Queue)

Extracted from queue-operation entries.

CREATE TABLE queue_operations (
id INTEGER PRIMARY KEY AUTOINCREMENT,
entry_id INTEGER REFERENCES entries(id) ON DELETE CASCADE,
operation TEXT NOT NULL, -- enqueue, dequeue, clear
content TEXT, -- Command content
timestamp TEXT
);

CREATE INDEX idx_queue_ops_operation ON queue_operations(operation);

5. FILE_SNAPSHOTS (File History)

Extracted from file-history-snapshot entries.

CREATE TABLE file_snapshots (
id INTEGER PRIMARY KEY AUTOINCREMENT,
entry_id INTEGER REFERENCES entries(id) ON DELETE CASCADE,
message_id TEXT, -- Links to message
snapshot_timestamp TEXT,
is_snapshot_update BOOLEAN DEFAULT FALSE,
snapshot_data TEXT -- JSON blob of trackedFileBackups
);

CREATE TABLE file_backup_entries (
id INTEGER PRIMARY KEY AUTOINCREMENT,
snapshot_id INTEGER REFERENCES file_snapshots(id) ON DELETE CASCADE,
file_path TEXT NOT NULL,
backup_file_name TEXT,
version INTEGER,
backup_time TEXT
);

CREATE INDEX idx_file_backups_path ON file_backup_entries(file_path);
CREATE INDEX idx_file_backups_snapshot ON file_backup_entries(snapshot_id);

6. MESSAGE_THREADING (Conversation Graph)

Extracted relationships for graph traversal.

CREATE TABLE message_threading (
id INTEGER PRIMARY KEY AUTOINCREMENT,
uuid TEXT UNIQUE NOT NULL,
parent_uuid TEXT,
logical_parent_uuid TEXT,
session_id TEXT,
agent_id TEXT,
entry_type TEXT,
timestamp TEXT
);

CREATE INDEX idx_threading_parent ON message_threading(parent_uuid);
CREATE INDEX idx_threading_logical ON message_threading(logical_parent_uuid);
CREATE INDEX idx_threading_session ON message_threading(session_id);
CREATE INDEX idx_threading_agent ON message_threading(agent_id);

7. ENHANCED EXPORT FIELDS (v4.0 - January 2026)

New columns added to messages table to capture enhanced metadata from export TXT files.

-- Migration: Add enhanced export fields (v4.0 unified-message-extractor)
ALTER TABLE messages ADD COLUMN tool_use TEXT; -- JSON blob of tool invocations
ALTER TABLE messages ADD COLUMN tool_result TEXT; -- JSON blob of tool results with status
ALTER TABLE messages ADD COLUMN export_metadata TEXT; -- JSON blob of export header metadata

-- Indexes for efficient querying
CREATE INDEX idx_messages_tool_status ON messages(json_extract(tool_result, '$.status'));
CREATE INDEX idx_messages_export_model ON messages(json_extract(export_metadata, '$.model'));

tool_use Column (JSON Structure)

Stores tool invocations from assistant messages.

{
"tool_name": "Edit",
"parameters": {
"file_path": "/path/to/file.py",
"old_string": "...",
"new_string": "..."
}
}

tool_result Column (JSON Structure)

Stores tool execution results with status tracking.

{
"tool_name": "Edit",
"status": "success", // success | error | interrupted | truncated
"diff": {
"lines_added": 15,
"lines_removed": 3,
"truncated_range": null // [start, end] if output truncated
},
"error_message": null // Present when status is "error"
}

Status Values:

StatusDescription
successTool completed normally
errorTool returned error message
interruptedTool was cancelled before completion
truncatedOutput was truncated (large files)

export_metadata Column (JSON Structure)

Stores metadata extracted from export file headers.

{
"version": "1.0.96",
"model": "claude-opus-4-5-20251101",
"tier": "anthropic_business",
"working_dir": "/Users/halcasteel/PROJECTS/coditect-rollout-master"
}

Field Mapping by Entry Type

user → entries + messages

Source Fieldentriesmessages
typeentry_type-
uuiduuid-
parentUuidparent_uuid-
sessionIdsession_id-
agentIdagent_id-
cwdcwd-
gitBranchgit_branch-
versionversion-
timestamptimestamp-
isSidechainis_sidechain-
isMetais_meta-
userTypeuser_type-
slugslug-
message.rolerolerole
message.contentcontentcontent
(full entry)raw_json-

assistant → entries + messages

Source Fieldentriesmessages
typeentry_type-
uuiduuid-
parentUuidparent_uuid-
requestId-request_id
message.id-message_id
message.model-model
message.usage.*-input_tokens, etc.
(full entry)raw_json-

system → entries + system_events

Source Fieldentriessystem_events
typeentry_type-
subtype-subtype
level-level
compactMetadata-compact_* fields
error-error

summary → entries + summaries

Source Fieldentriessummaries
typeentry_type-
summarycontentsummary_text
leafUuid-leaf_uuid

queue-operation → entries + queue_operations

Source Fieldentriesqueue_operations
typeentry_type-
operation-operation
contentcontentcontent

file-history-snapshot → entries + file_snapshots + file_backup_entries

Source Fieldentriesfile_snapshotsfile_backup_entries
typeentry_type--
messageId-message_id-
snapshot.trackedFileBackups-snapshot_data(expanded rows)

CLI Interface

New Commands

# Index all 6 entry types
python3 scripts/context-db.py --index-comprehensive

# Rebuild from scratch
python3 scripts/context-db.py --rebuild-comprehensive

# View statistics
python3 scripts/context-db.py --comprehensive-stats

# View message threading relationships
python3 scripts/context-db.py --threading

# Follow conversation thread from UUID
python3 scripts/context-db.py --thread-from <uuid>

# List file backup entries
python3 scripts/context-db.py --file-backups
python3 scripts/context-db.py --file-backups --file-path "README.md"

# Enhanced Export Analytics (v4.0 - January 2026)
python3 scripts/context-db.py --diff-stats # Code change statistics
python3 scripts/context-db.py --failed-tools # Failed/interrupted tool invocations
python3 scripts/context-db.py --export-models # Messages by Claude model

Query Examples

-- Find all agent messages
SELECT e.*, m.* FROM entries e
JOIN messages m ON m.entry_id = e.id
WHERE e.agent_id IS NOT NULL;

-- Get conversation thread
WITH RECURSIVE thread AS (
SELECT * FROM message_threading WHERE uuid = ?
UNION ALL
SELECT mt.* FROM message_threading mt
JOIN thread t ON mt.uuid = t.parent_uuid
)
SELECT * FROM thread;

-- Find all summaries for recall
SELECT s.summary_text, e.session_id, e.timestamp
FROM summaries s
JOIN entries e ON s.entry_id = e.id
ORDER BY e.timestamp DESC;

-- Get file history for a path
SELECT fbe.*, fs.snapshot_timestamp
FROM file_backup_entries fbe
JOIN file_snapshots fs ON fbe.snapshot_id = fs.id
WHERE fbe.file_path LIKE '%/README.md'
ORDER BY fs.snapshot_timestamp DESC;

-- Enhanced Export Analytics (v4.0)
-- Get code change statistics
SELECT
COUNT(*) as total_edits,
SUM(json_extract(tool_result, '$.diff.lines_added')) as lines_added,
SUM(json_extract(tool_result, '$.diff.lines_removed')) as lines_removed
FROM messages
WHERE tool_result IS NOT NULL
AND json_extract(tool_result, '$.tool_name') = 'Edit';

-- Find failed tool invocations
SELECT
json_extract(tool_result, '$.tool_name') as tool,
json_extract(tool_result, '$.status') as status,
json_extract(tool_result, '$.error_message') as error,
timestamp
FROM messages
WHERE json_extract(tool_result, '$.status') IN ('error', 'interrupted', 'truncated')
ORDER BY timestamp DESC;

-- Group messages by Claude model
SELECT
json_extract(export_metadata, '$.model') as model,
COUNT(*) as message_count
FROM messages
WHERE export_metadata IS NOT NULL
GROUP BY json_extract(export_metadata, '$.model')
ORDER BY message_count DESC;

Implementation Details

Files Modified

FileChanges
scripts/context-db.pyAdded 7 new tables, index_comprehensive_entries(), get_comprehensive_stats(), CLI options
scripts/unified-message-extractor.pyAdded create_comprehensive_entry(), extract_comprehensive()
internal/architecture/DATABASE-SCHEMA-V2.mdSchema design document

Migration Strategy

  1. No migration required - New tables are additive
  2. Existing data preserved - messages table unchanged
  3. Incremental indexing - Run --index-comprehensive to populate new tables
  4. Hash-based deduplication - Same entry won't be indexed twice

Consequences

Positive Outcomes

P1: Zero Data Loss

  • All 6 entry types captured (vs. 2 previously)
  • Raw JSON preserved for future schema evolution
  • 100% of JSONL content now queryable

P2: Agent Session Tracking

  • Full MoE expert/judge output recovery
  • Parent-child session relationships
  • Sidechain conversation detection

P3: Conversation Reconstruction

  • UUID threading enables graph traversal
  • Complete conversation history recovery
  • Temporal ordering via timestamps

P4: Operational Insights

  • System events (compaction, errors, retries) now visible
  • Command queue history for debugging
  • File backup tracking for recovery

Negative Outcomes / Trade-offs

N1: Increased Storage

  • ~50% more database space due to raw_json
  • Mitigated by: Deduplication still active

N2: Indexing Time

  • Full rebuild takes longer with 6 entry types
  • Mitigated by: Incremental indexing supported

N3: Schema Complexity

  • 7 new tables to understand
  • Mitigated by: Clear separation of concerns, comprehensive documentation

Estimated Storage

~180K entries × ~2KB avg = ~360MB entries table
Type-specific tables: ~50MB total
Indexes: ~100MB
Total: ~500MB (vs. ~250MB before)

  • ADR-020 - Context Extraction (/cx)
  • ADR-021 - Context Query System (/cxq)
  • ADR-023 - SQLite Scalability Analysis
  • ADR-118 - Four-Tier Database Architecture
  • scripts/unified-message-extractor.py - Extraction logic
  • scripts/context-db.py - Database operations
  • internal/architecture/DATABASE-SCHEMA-V2.md - Schema design

Validation Criteria

  • All 6 entry types extracted from JSONL files
  • Agent context (agent_id, parent_uuid, is_sidechain) preserved
  • UUID threading enables conversation reconstruction
  • Raw JSON stored for every entry
  • FTS5 search works on entries table
  • --comprehensive-stats shows correct counts
  • --threading displays UUID relationships
  • --thread-from reconstructs conversation

Status: ✅ ACCEPTED Decision Date: 2025-12-23 Migrated: 2026-02-03 (ADR-150) Implementation Status: COMPLETE Maintainer: CODITECT Core Team Review Date: 2026-03-23 (quarterly review)


Document Version: 1.1.0 Last Updated: 2026-02-03 Author: CODITECT Architecture Team Copyright: 2025-2026 AZ1.AI INC. All rights reserved.


Changelog

v1.1.0 (2026-01-04)

Enhanced Export Fields (v4.0 unified-message-extractor)

  • Added tool_use column to messages table for tool invocation tracking
  • Added tool_result column with status tracking (success/error/interrupted/truncated)
  • Added export_metadata column for export header metadata (version, model, tier)
  • Added indexes for efficient JSON field querying
  • Added new CLI commands: --diff-stats, --failed-tools, --export-models
  • Added SQL query examples for enhanced analytics

v1.0.0 (2025-12-23)

  • Initial comprehensive entry schema
  • All 6 JSONL entry types captured
  • Agent context tracking
  • Conversation threading