ADR-020: Context Extraction Command (/cx) - Anti-Forgetting Memory System

Status

ACCEPTED (2025-12-10, Updated 2026-02-03)

Context

Problem Statement

LLM sessions are ephemeral - valuable context, decisions, and learnings are lost when sessions end or context windows are exceeded. CODITECT requires a persistent memory system that:

  1. Captures session data from multiple LLM providers (Claude, Codex, Gemini, KIMI)
  2. Deduplicates messages to avoid redundant storage
  3. Extracts knowledge (decisions, patterns, error solutions)
  4. Indexes content for instant searchability
  5. Preserves provenance for traceability

Multi-LLM Challenge

Different LLM CLI tools store sessions in different formats and locations:

| LLM | Session Location | Format |
|-----|------------------|--------|
| Claude | `~/.claude/projects/<hash>/<uuid>.jsonl` | JSONL with tool_use entries |
| Codex | `~/.codex/history.jsonl` | JSONL with conversation turns |
| Gemini | `~/.gemini/tmp/<hash>/chats/session-*.json` | JSON with structured messages |
| KIMI | `~/.kimi/sessions/<hash>/<uuid>/wire.jsonl` | JSONL with protocol data |

A unified extraction system is needed to normalize these into a consistent format.
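
As a sketch, discovering each provider's session files can be done with glob patterns mirroring the table above. The `SESSION_GLOBS` mapping and `discover_sessions` helper are illustrative, not part of the actual extractor:

```python
from pathlib import Path

# Glob patterns per LLM source, mirroring the session locations table.
# Actual layouts may differ between CLI versions.
SESSION_GLOBS = {
    "claude": "~/.claude/projects/*/*.jsonl",
    "codex": "~/.codex/history.jsonl",
    "gemini": "~/.gemini/tmp/*/chats/session-*.json",
    "kimi": "~/.kimi/sessions/*/*/wire.jsonl",
}

def discover_sessions(llm: str) -> list[Path]:
    """Return all session files currently on disk for one LLM source."""
    expanded = Path(SESSION_GLOBS[llm]).expanduser()
    home = Path.home()
    # Path.glob needs a pattern relative to its anchor, so strip the home prefix.
    return sorted(home.glob(str(expanded.relative_to(home))))
```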

Decision

Implement the /cx command as a unified multi-LLM context extraction system with:

1. Unified Message Extractor

Single Python script (scripts/unified-message-extractor.py) that:

  • Auto-detects LLM source from file paths and content markers
  • Extracts messages into CODITECT Universal Session Format (CUSF)
  • Deduplicates using SHA-256 content hashing
  • Tags messages with llm_source and llm_model

2. Storage Architecture (ADR-118 Compliant)

```
┌─────────────────────────────────────────────────────────────────┐
│ /cx EXTRACTION PIPELINE                                         │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│ SOURCE FILES (Read-Only)                                        │
│                                                                 │
│ Claude: ~/.claude/projects/<hash>/<uuid>.jsonl                  │
│ Codex:  ~/.codex/history.jsonl                                  │
│ Gemini: ~/.gemini/tmp/<hash>/chats/session-*.json               │
│ KIMI:   ~/.kimi/sessions/<hash>/<uuid>/wire.jsonl               │
│                                                                 │
│ Exports: ~/PROJECTS/.coditect-data/sessions-export-pending-*/   │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│ UNIFIED MESSAGE STORE (JSONL Source)                            │
│                                                                 │
│ ~/PROJECTS/.coditect-data/context-storage/unified_messages.jsonl│
│                                                                 │
│ - Hash-based deduplication                                      │
│ - LLM source identification                                     │
│ - Full provenance tracking                                      │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│ ADR-118 FOUR-TIER DATABASES                                     │
│                                                                 │
│ TIER 2 (org.db) - Critical Knowledge:                           │
│ • decisions, skill_learnings, error_solutions                   │
│                                                                 │
│ TIER 3 (sessions.db) - Regenerable Session Data:                │
│ • messages, tool_analytics, token_economics                     │
│ • message_component_invocations, activity_associations          │
└─────────────────────────────────────────────────────────────────┘
```
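
To illustrate the idempotent write path into Tier 3, a minimal SQLite sketch follows. The column set here is hypothetical; the real sessions.db schema is defined by ADR-118 and documented per ADR-148:

```python
import sqlite3

# Hypothetical minimal table; the real sessions.db schema will differ.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS messages (
        hash       TEXT PRIMARY KEY,
        content    TEXT,
        llm_source TEXT
    )
""")

# INSERT OR IGNORE makes re-runs idempotent: a re-extracted message with
# the same content hash is silently skipped instead of duplicated.
for _ in range(2):
    conn.execute(
        "INSERT OR IGNORE INTO messages (hash, content, llm_source) VALUES (?, ?, ?)",
        ("abc123", "Example message", "claude"),
    )
conn.commit()
count = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
```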

3. Message Schema (CUSF Format)

```json
{
  "hash": "sha256...",
  "content": "Full message text...",
  "role": "assistant",
  "llm_source": "claude",
  "llm_model": "claude-opus-4-5",
  "provenance": {
    "source_type": "session",
    "source_file": "/path/to/file.jsonl",
    "source_line": 42,
    "session_id": "uuid-here",
    "checkpoint": null
  },
  "timestamps": {
    "occurred": "2026-02-03T12:00:00Z",
    "extracted_at": "2026-02-03T19:00:00Z"
  },
  "metadata": {
    "content_length": 1247,
    "has_code": true,
    "has_markdown": true
  }
}
```
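
A minimal sketch of assembling such a record in Python, assuming the field names shown above. The `has_code`/`has_markdown` heuristics here are illustrative guesses, not the extractor's actual classification logic:

```python
import hashlib
from datetime import datetime, timezone

def build_cusf_record(content: str, role: str,
                      llm_source: str, llm_model: str) -> dict:
    """Assemble a minimal CUSF record using the schema field names above."""
    return {
        # Hash of normalized content doubles as the deduplication key.
        "hash": hashlib.sha256(content.strip().encode("utf-8")).hexdigest(),
        "content": content,
        "role": role,
        "llm_source": llm_source,
        "llm_model": llm_model,
        "timestamps": {
            "extracted_at": datetime.now(timezone.utc).isoformat(),
        },
        "metadata": {
            "content_length": len(content),
            # Crude heuristics; the real extractor may classify differently.
            "has_code": "```" in content,
            "has_markdown": "```" in content or content.lstrip().startswith("#"),
        },
    }
```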

4. Parallel Post-Processing Pipeline

```
/cx Pipeline
│
├── Sequential (must be first):
│   ├── Message extraction from JSONL/export
│   ├── Deduplication (hash-based)
│   └── Analytics save to sessions.db (Tier 3)
│
└── Parallel (different tables/resources):
    ├── Knowledge Extraction → org.db (decisions, patterns, errors)
    ├── Trajectory Extraction → tool_analytics table
    ├── MCP Call Graph Reindex → functions/edges tables
    └── Incremental Classify → .md files only
```
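
The ordering above can be sketched with `concurrent.futures`: the sequential stages run first, then the independent stages fan out. The stage functions here are stand-ins for the real pipeline steps:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in stage functions; the real stages live in the /cx implementation.
def extract_messages():     return "extracted"
def deduplicate():          return "deduped"
def save_analytics():       return "saved"
def extract_knowledge():    return "knowledge"
def extract_trajectories(): return "trajectories"
def reindex_call_graph():   return "reindexed"
def classify_incremental(): return "classified"

def run_pipeline() -> list[str]:
    # Sequential stages: each depends on the previous stage's output.
    results = [extract_messages(), deduplicate(), save_analytics()]
    # Parallel stages: they write to disjoint tables/resources, so
    # overlapping them is safe.
    parallel = [extract_knowledge, extract_trajectories,
                reindex_call_graph, classify_incremental]
    with ThreadPoolExecutor(max_workers=len(parallel)) as pool:
        results += list(pool.map(lambda stage: stage(), parallel))
    return results
```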

5. Export Directories (ADR-114 Compliant)

| LLM | Pending Directory |
|-----|-------------------|
| Claude | `~/PROJECTS/.coditect-data/sessions-export-pending-anthropic/` |
| Codex | `~/PROJECTS/.coditect-data/sessions-export-pending-codex/` |
| Gemini | `~/PROJECTS/.coditect-data/sessions-export-pending-gemini/` |
| KIMI | `~/PROJECTS/.coditect-data/sessions-export-pending-kimi/` |

Implementation

Command Interface

```shell
# Process ALL LLMs (default)
/cx

# Process specific LLM only
/cx --llm claude
/cx --llm codex
/cx --llm gemini

# With semantic embeddings
/cx --with-embeddings

# Dry run
/cx --dry-run

# Skip auto-indexing
/cx --no-index

# Process single file
/cx FILE
```

LLM Auto-Detection

Detection priority:

  1. Path pattern: /.claude/ → claude, /.codex/ → codex, /.gemini/ → gemini, /.kimi/ → kimi
  2. File markers:
    • Claude: ASCII art banner, "Claude Code v", Model identifiers
    • Codex: "type": "codex", o1/o3 model references
    • Gemini: "type": "gemini", gemini-2.0 references
    • KIMI: k1.5/k2 model references

Deduplication Strategy

  • Hash Function: SHA-256 of normalized content
  • Normalization: Trim whitespace, normalize line endings
  • Conflict Resolution: First-seen wins (preserves original provenance)
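
The strategy above amounts to a small amount of code; this sketch implements normalize-then-hash with first-seen-wins (the message dicts here are simplified stand-ins for full CUSF records):

```python
import hashlib

def normalize(content: str) -> str:
    # Normalize line endings and trim surrounding whitespace before hashing,
    # so cosmetic differences do not defeat deduplication.
    return content.replace("\r\n", "\n").strip()

def deduplicate(messages: list[dict]) -> list[dict]:
    """First-seen wins: later duplicates are dropped, so the original
    message's provenance is the one preserved."""
    seen: dict[str, dict] = {}
    for msg in messages:
        h = hashlib.sha256(normalize(msg["content"]).encode("utf-8")).hexdigest()
        if h not in seen:
            seen[h] = {**msg, "hash": h}
    return list(seen.values())
```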

Consequences

Positive

  1. Unified memory across all LLM providers
  2. No data loss - all sessions captured with provenance
  3. Instant searchability via FTS5 indexing
  4. Knowledge preservation - decisions, patterns, errors extracted
  5. Multi-tenant ready - supports tenant/team/project isolation

Negative

  1. Storage growth - unified store can grow large (13+ GB)
  2. Processing time - large sessions take time to process
  3. Complexity - multiple LLM formats to maintain

Mitigations

  • Tier separation (ADR-118): Only org.db requires backup
  • Incremental processing: Only new messages extracted
  • Archive pipeline: Processed exports moved to archive

Integration Points

| System | Integration |
|--------|-------------|
| `/cxq` (ADR-021) | Query the extracted data |
| `/sx` | Interactive session export to pending directories |
| `/export` | Built-in Claude export to pending |
| Context Watcher | Auto-export on threshold (ADR-134) |
| Cloud Sync | Sync to PostgreSQL (ADR-053) |

Files

| File | Purpose |
|------|---------|
| `scripts/unified-message-extractor.py` | Main extraction script |
| `scripts/extractors/claude_extractor.py` | Claude-specific extraction |
| `scripts/extractors/codex_extractor.py` | Codex-specific extraction |
| `scripts/extractors/gemini_extractor.py` | Gemini-specific extraction |
| `scripts/extractors/kimi_extractor.py` | KIMI-specific extraction |
| `commands/cx.md` | Command documentation |

Related ADRs

  • ADR-021: Context Query System (/cxq)
  • ADR-118: Four-Tier Database Architecture
  • ADR-114: User Data Separation
  • ADR-122: Unified LLM Architecture
  • ADR-134: Unified Multi-LLM Watcher
  • ADR-148: Database Schema Documentation Standard

Changelog

| Date | Change |
|------|--------|
| 2025-12-10 | Initial version |
| 2026-01-28 | Added multi-LLM support (ADR-122) |
| 2026-02-03 | Documented as formal ADR (was referenced but not created) |

Track: J (Memory Intelligence) | Command: /cx | Script: scripts/unified-message-extractor.py