ADR-020: Context Extraction Command (/cx) - Anti-Forgetting Memory System

Status

ACCEPTED (2025-12-10, Updated 2026-02-03)

Context

Problem Statement

LLM sessions are ephemeral - valuable context, decisions, and learnings are lost when sessions end or context windows are exceeded. CODITECT requires a persistent memory system that:

  1. Captures session data from multiple LLM providers (Claude, Codex, Gemini, KIMI)
  2. Deduplicates messages to avoid redundant storage
  3. Extracts knowledge (decisions, patterns, error solutions)
  4. Indexes content for instant searchability
  5. Preserves provenance for traceability

Multi-LLM Challenge

Different LLM CLI tools store sessions in different formats and locations:

| LLM | Session Location | Format |
|-----|------------------|--------|
| Claude | `~/.claude/projects/<hash>/<uuid>.jsonl` | JSONL with tool_use entries |
| Codex | `~/.codex/history.jsonl` | JSONL with conversation turns |
| Gemini | `~/.gemini/tmp/<hash>/chats/session-*.json` | JSON with structured messages |
| KIMI | `~/.kimi/sessions/<hash>/<uuid>/wire.jsonl` | JSONL with protocol data |

A unified extraction system is needed to normalize these into a consistent format.
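
As a sketch, discovering each provider's session files can be done with glob patterns mirroring the table above. The `SESSION_GLOBS` mapping and `discover_sessions` helper are illustrative, not part of the actual extractor:

```python
from pathlib import Path

# Glob patterns per LLM source, mirroring the session locations table.
# Actual layouts may differ between CLI versions.
SESSION_GLOBS = {
    "claude": "~/.claude/projects/*/*.jsonl",
    "codex": "~/.codex/history.jsonl",
    "gemini": "~/.gemini/tmp/*/chats/session-*.json",
    "kimi": "~/.kimi/sessions/*/*/wire.jsonl",
}

def discover_sessions(llm: str) -> list[Path]:
    """Return all session files currently on disk for one LLM source."""
    expanded = Path(SESSION_GLOBS[llm]).expanduser()
    home = Path.home()
    # Path.glob needs a pattern relative to its anchor, so strip the home prefix.
    return sorted(home.glob(str(expanded.relative_to(home))))
```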

Decision

Implement the /cx command as a unified multi-LLM context extraction system with:

1. Unified Message Extractor

Single Python script (scripts/unified-message-extractor.py) that:

  • Auto-detects LLM source from file paths and content markers
  • Extracts messages into CODITECT Universal Session Format (CUSF)
  • Deduplicates using SHA-256 content hashing
  • Tags messages with llm_source and llm_model

2. Storage Architecture (ADR-118 Compliant)

```
┌─────────────────────────────────────────────────────────────────┐
│ /cx EXTRACTION PIPELINE                                         │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│ SOURCE FILES (Read-Only)                                        │
│                                                                 │
│ Claude: ~/.claude/projects/<hash>/<uuid>.jsonl                  │
│ Codex:  ~/.codex/history.jsonl                                  │
│ Gemini: ~/.gemini/tmp/<hash>/chats/session-*.json               │
│ KIMI:   ~/.kimi/sessions/<hash>/<uuid>/wire.jsonl               │
│                                                                 │
│ Exports: ~/PROJECTS/.coditect-data/sessions-export-pending-*/   │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│ UNIFIED MESSAGE STORE (JSONL Source)                            │
│                                                                 │
│ ~/PROJECTS/.coditect-data/context-storage/unified_messages.jsonl│
│                                                                 │
│ - Hash-based deduplication                                      │
│ - LLM source identification                                     │
│ - Full provenance tracking                                      │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│ ADR-118 FOUR-TIER DATABASES                                     │
│                                                                 │
│ TIER 2 (org.db) - Critical Knowledge:                           │
│ • decisions, skill_learnings, error_solutions                   │
│                                                                 │
│ TIER 3 (sessions.db) - Regenerable Session Data:                │
│ • messages, tool_analytics, token_economics                     │
│ • message_component_invocations, activity_associations          │
└─────────────────────────────────────────────────────────────────┘
```
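
To illustrate the idempotent write path into Tier 3, a minimal SQLite sketch follows. The column set here is hypothetical; the real sessions.db schema is defined by ADR-118 and documented per ADR-148:

```python
import sqlite3

# Hypothetical minimal table; the real sessions.db schema will differ.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS messages (
        hash       TEXT PRIMARY KEY,
        content    TEXT,
        llm_source TEXT
    )
""")

# INSERT OR IGNORE makes re-runs idempotent: a re-extracted message with
# the same content hash is silently skipped instead of duplicated.
for _ in range(2):
    conn.execute(
        "INSERT OR IGNORE INTO messages (hash, content, llm_source) VALUES (?, ?, ?)",
        ("abc123", "Example message", "claude"),
    )
conn.commit()
count = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
```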

3. Message Schema (CUSF Format)

```json
{
  "hash": "sha256...",
  "content": "Full message text...",
  "role": "assistant",
  "llm_source": "claude",
  "llm_model": "claude-opus-4-5",
  "provenance": {
    "source_type": "session",
    "source_file": "/path/to/file.jsonl",
    "source_line": 42,
    "session_id": "uuid-here",
    "checkpoint": null
  },
  "timestamps": {
    "occurred": "2026-02-03T12:00:00Z",
    "extracted_at": "2026-02-03T19:00:00Z"
  },
  "metadata": {
    "content_length": 1247,
    "has_code": true,
    "has_markdown": true
  }
}
```
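
A minimal sketch of assembling such a record in Python, assuming the field names shown above. The `has_code`/`has_markdown` heuristics here are illustrative guesses, not the extractor's actual classification logic:

```python
import hashlib
from datetime import datetime, timezone

def build_cusf_record(content: str, role: str,
                      llm_source: str, llm_model: str) -> dict:
    """Assemble a minimal CUSF record using the schema field names above."""
    return {
        # Hash of normalized content doubles as the deduplication key.
        "hash": hashlib.sha256(content.strip().encode("utf-8")).hexdigest(),
        "content": content,
        "role": role,
        "llm_source": llm_source,
        "llm_model": llm_model,
        "timestamps": {
            "extracted_at": datetime.now(timezone.utc).isoformat(),
        },
        "metadata": {
            "content_length": len(content),
            # Crude heuristics; the real extractor may classify differently.
            "has_code": "```" in content,
            "has_markdown": "```" in content or content.lstrip().startswith("#"),
        },
    }
```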

4. Parallel Post-Processing Pipeline

```
/cx Pipeline
│
├── Sequential (must be first):
│   ├── Message extraction from JSONL/export
│   ├── Deduplication (hash-based)
│   └── Analytics save to sessions.db (Tier 3)
│
└── Parallel (different tables/resources):
    ├── Knowledge Extraction → org.db (decisions, patterns, errors)
    ├── Trajectory Extraction → tool_analytics table
    ├── MCP Call Graph Reindex → functions/edges tables
    └── Incremental Classify → .md files only
```
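
The ordering above can be sketched with `concurrent.futures`: the sequential stages run first, then the independent stages fan out. The stage functions here are stand-ins for the real pipeline steps:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in stage functions; the real stages live in the /cx implementation.
def extract_messages():     return "extracted"
def deduplicate():          return "deduped"
def save_analytics():       return "saved"
def extract_knowledge():    return "knowledge"
def extract_trajectories(): return "trajectories"
def reindex_call_graph():   return "reindexed"
def classify_incremental(): return "classified"

def run_pipeline() -> list[str]:
    # Sequential stages: each depends on the previous stage's output.
    results = [extract_messages(), deduplicate(), save_analytics()]
    # Parallel stages: they write to disjoint tables/resources, so
    # overlapping them is safe.
    parallel = [extract_knowledge, extract_trajectories,
                reindex_call_graph, classify_incremental]
    with ThreadPoolExecutor(max_workers=len(parallel)) as pool:
        results += list(pool.map(lambda stage: stage(), parallel))
    return results
```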

5. Export Directories (ADR-114 Compliant)

| LLM | Pending Directory |
|-----|-------------------|
| Claude | `~/PROJECTS/.coditect-data/sessions-export-pending-anthropic/` |
| Codex | `~/PROJECTS/.coditect-data/sessions-export-pending-codex/` |
| Gemini | `~/PROJECTS/.coditect-data/sessions-export-pending-gemini/` |
| KIMI | `~/PROJECTS/.coditect-data/sessions-export-pending-kimi/` |

Implementation

Command Interface

```shell
# Process ALL LLMs (default)
/cx

# Process specific LLM only
/cx --llm claude
/cx --llm codex
/cx --llm gemini

# With semantic embeddings
/cx --with-embeddings

# Dry run
/cx --dry-run

# Skip auto-indexing
/cx --no-index

# Process single file
/cx FILE
```

LLM Auto-Detection

Detection priority:

  1. Path pattern: /.claude/ → claude, /.codex/ → codex, /.gemini/ → gemini, /.kimi/ → kimi
  2. File markers:
    • Claude: ASCII art banner, "Claude Code v", Model identifiers
    • Codex: "type": "codex", o1/o3 model references
    • Gemini: "type": "gemini", gemini-2.0 references
    • KIMI: k1.5/k2 model references

Deduplication Strategy

  • Hash Function: SHA-256 of normalized content
  • Normalization: Trim whitespace, normalize line endings
  • Conflict Resolution: First-seen wins (preserves original provenance)
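
The strategy above amounts to a small amount of code; this sketch implements normalize-then-hash with first-seen-wins (the message dicts here are simplified stand-ins for full CUSF records):

```python
import hashlib

def normalize(content: str) -> str:
    # Normalize line endings and trim surrounding whitespace before hashing,
    # so cosmetic differences do not defeat deduplication.
    return content.replace("\r\n", "\n").strip()

def deduplicate(messages: list[dict]) -> list[dict]:
    """First-seen wins: later duplicates are dropped, so the original
    message's provenance is the one preserved."""
    seen: dict[str, dict] = {}
    for msg in messages:
        h = hashlib.sha256(normalize(msg["content"]).encode("utf-8")).hexdigest()
        if h not in seen:
            seen[h] = {**msg, "hash": h}
    return list(seen.values())
```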

Consequences

Positive

  1. Unified memory across all LLM providers
  2. No data loss - all sessions captured with provenance
  3. Instant searchability via FTS5 indexing
  4. Knowledge preservation - decisions, patterns, errors extracted
  5. Multi-tenant ready - supports tenant/team/project isolation

Negative

  1. Storage growth - unified store can grow large (13+ GB)
  2. Processing time - large sessions take time to process
  3. Complexity - multiple LLM formats to maintain

Mitigations

  • Tier separation (ADR-118): Only org.db requires backup
  • Incremental processing: Only new messages extracted
  • Archive pipeline: Processed exports moved to archive

Integration Points

| System | Integration |
|--------|-------------|
| `/cxq` (ADR-021) | Query the extracted data |
| `/sx` | Interactive session export to pending directories |
| `/export` | Built-in Claude export to pending |
| Context Watcher | Auto-export on threshold (ADR-134) |
| Cloud Sync | Sync to PostgreSQL (ADR-053) |

Files

| File | Purpose |
|------|---------|
| `scripts/unified-message-extractor.py` | Main extraction script |
| `scripts/extractors/claude_extractor.py` | Claude-specific extraction |
| `scripts/extractors/codex_extractor.py` | Codex-specific extraction |
| `scripts/extractors/gemini_extractor.py` | Gemini-specific extraction |
| `scripts/extractors/kimi_extractor.py` | KIMI-specific extraction |
| `commands/cx.md` | Command documentation |

Related ADRs

  • ADR-021: Context Query System (/cxq)
  • ADR-118: Four-Tier Database Architecture
  • ADR-114: User Data Separation
  • ADR-122: Unified LLM Architecture
  • ADR-134: Unified Multi-LLM Watcher
  • ADR-148: Database Schema Documentation Standard

Changelog

| Date | Change |
|------|--------|
| 2025-12-10 | Initial version |
| 2026-01-28 | Added multi-LLM support (ADR-122) |
| 2026-02-03 | Documented as formal ADR (was referenced but not created) |

Track: J (Memory Intelligence) | Command: /cx | Script: scripts/unified-message-extractor.py