ADR-020: Context Extraction Command (/cx) - Anti-Forgetting Memory System
Status
ACCEPTED (2025-12-10, Updated 2026-02-03)
Context
Problem Statement
LLM sessions are ephemeral - valuable context, decisions, and learnings are lost when sessions end or context windows are exceeded. CODITECT requires a persistent memory system that:
- Captures session data from multiple LLM providers (Claude, Codex, Gemini, KIMI)
- Deduplicates messages to avoid redundant storage
- Extracts knowledge (decisions, patterns, error solutions)
- Indexes content for instant searchability
- Preserves provenance for traceability
Multi-LLM Challenge
Different LLM CLI tools store sessions in different formats and locations:
| LLM | Session Location | Format |
|---|---|---|
| Claude | ~/.claude/projects/<hash>/<uuid>.jsonl | JSONL with tool_use entries |
| Codex | ~/.codex/history.jsonl | JSONL with conversation turns |
| Gemini | ~/.gemini/tmp/<hash>/chats/session-*.json | JSON with structured messages |
| KIMI | ~/.kimi/sessions/<hash>/<uuid>/wire.jsonl | JSONL with protocol data |
A unified extraction system is needed to normalize these into a consistent format.
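As an illustration of the discovery step, the per-provider session locations in the table above can be expressed as glob patterns. The `SESSION_GLOBS` mapping and `discover_sessions` helper below are illustrative sketches, not the actual extractor's API:

```python
import glob
import os

# Glob patterns derived from the session-location table above.
SESSION_GLOBS = {
    "claude": "~/.claude/projects/*/*.jsonl",
    "codex": "~/.codex/history.jsonl",
    "gemini": "~/.gemini/tmp/*/chats/session-*.json",
    "kimi": "~/.kimi/sessions/*/*/wire.jsonl",
}

def discover_sessions(llm: str) -> list[str]:
    """Expand a provider's glob pattern into concrete session files."""
    pattern = os.path.expanduser(SESSION_GLOBS[llm])
    return sorted(glob.glob(pattern))
```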
Decision
Implement the /cx command as a unified multi-LLM context extraction system with:
1. Unified Message Extractor
A single Python script (`scripts/unified-message-extractor.py`) that:
- Auto-detects LLM source from file paths and content markers
- Extracts messages into CODITECT Universal Session Format (CUSF)
- Deduplicates using SHA-256 content hashing
- Tags messages with `llm_source` and `llm_model`
2. Storage Architecture (ADR-118 Compliant)
```
┌─────────────────────────────────────────────────────────────────┐
│                     /cx EXTRACTION PIPELINE                     │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│ SOURCE FILES (Read-Only)                                        │
│                                                                 │
│ Claude:  ~/.claude/projects/<hash>/<uuid>.jsonl                 │
│ Codex:   ~/.codex/history.jsonl                                 │
│ Gemini:  ~/.gemini/tmp/<hash>/chats/session-*.json              │
│ KIMI:    ~/.kimi/sessions/<hash>/<uuid>/wire.jsonl              │
│                                                                 │
│ Exports: ~/PROJECTS/.coditect-data/sessions-export-pending-*/   │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│ UNIFIED MESSAGE STORE (JSONL Source)                            │
│                                                                 │
│ ~/PROJECTS/.coditect-data/context-storage/unified_messages.jsonl│
│                                                                 │
│ - Hash-based deduplication                                      │
│ - LLM source identification                                     │
│ - Full provenance tracking                                      │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│ ADR-118 FOUR-TIER DATABASES                                     │
│                                                                 │
│ TIER 2 (org.db) - Critical Knowledge:                           │
│   • decisions, skill_learnings, error_solutions                 │
│                                                                 │
│ TIER 3 (sessions.db) - Regenerable Session Data:                │
│   • messages, tool_analytics, token_economics                   │
│   • message_component_invocations, activity_associations        │
└─────────────────────────────────────────────────────────────────┘
```
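The tier assignments in the diagram above can be sketched as a routing table. The `TIER_ROUTING` dict and `target_db` helper are hypothetical names, though the table-to-database mapping follows the diagram:

```python
# Sketch: routing extracted records to the ADR-118 tier databases.
# Table names come from the diagram; the routing dict is illustrative.
TIER_ROUTING = {
    "decisions": "org.db",           # Tier 2: critical knowledge
    "skill_learnings": "org.db",
    "error_solutions": "org.db",
    "messages": "sessions.db",       # Tier 3: regenerable session data
    "tool_analytics": "sessions.db",
    "token_economics": "sessions.db",
}

def target_db(table: str) -> str:
    """Return the tier database a table's rows should be written to."""
    try:
        return TIER_ROUTING[table]
    except KeyError:
        raise ValueError(f"no tier routing for table {table!r}")
```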
3. Message Schema (CUSF Format)
```json
{
  "hash": "sha256...",
  "content": "Full message text...",
  "role": "assistant",
  "llm_source": "claude",
  "llm_model": "claude-opus-4-5",
  "provenance": {
    "source_type": "session",
    "source_file": "/path/to/file.jsonl",
    "source_line": 42,
    "session_id": "uuid-here",
    "checkpoint": null
  },
  "timestamps": {
    "occurred": "2026-02-03T12:00:00Z",
    "extracted_at": "2026-02-03T19:00:00Z"
  },
  "metadata": {
    "content_length": 1247,
    "has_code": true,
    "has_markdown": true
  }
}
```
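A minimal sketch of constructing a record in this shape. Field names mirror the schema above; `make_cusf_record` itself and its `has_code`/`has_markdown` heuristics are assumptions:

```python
import hashlib
from datetime import datetime, timezone

def make_cusf_record(content: str, role: str, llm_source: str,
                     llm_model: str, source_file: str, source_line: int,
                     session_id: str, occurred: str) -> dict:
    """Build a CUSF message record matching the schema above (sketch)."""
    return {
        # SHA-256 of the content doubles as the dedup key
        "hash": hashlib.sha256(content.encode("utf-8")).hexdigest(),
        "content": content,
        "role": role,
        "llm_source": llm_source,
        "llm_model": llm_model,
        "provenance": {
            "source_type": "session",
            "source_file": source_file,
            "source_line": source_line,
            "session_id": session_id,
            "checkpoint": None,
        },
        "timestamps": {
            "occurred": occurred,
            "extracted_at": datetime.now(timezone.utc)
                .strftime("%Y-%m-%dT%H:%M:%SZ"),
        },
        "metadata": {
            "content_length": len(content),
            # Naive heuristics, assumed for illustration only
            "has_code": "```" in content,
            "has_markdown": any(m in content for m in ("#", "```", "- ")),
        },
    }
```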
4. Parallel Post-Processing Pipeline
```
/cx Pipeline
├── Sequential (must be first):
│   ├── Message extraction from JSONL/export
│   ├── Deduplication (hash-based)
│   └── Analytics save to sessions.db (Tier 3)
│
└── Parallel (different tables/resources):
    ├── Knowledge Extraction → org.db (decisions, patterns, errors)
    ├── Trajectory Extraction → tool_analytics table
    ├── MCP Call Graph Reindex → functions/edges tables
    └── Incremental Classify → .md files only
```
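The sequential-then-parallel ordering above can be sketched with `concurrent.futures`; the step callables here are placeholders, not the real pipeline stages:

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(extract, dedupe, save_analytics, parallel_steps):
    """Run the sequential phase in order, then fan out the
    independent post-processing steps (sketch)."""
    # Sequential phase: each stage feeds the next
    messages = extract()
    unique = dedupe(messages)
    save_analytics(unique)
    # Parallel phase: steps touch different tables/resources,
    # so they can run concurrently without contention
    with ThreadPoolExecutor(max_workers=len(parallel_steps)) as pool:
        futures = [pool.submit(step, unique) for step in parallel_steps]
        return [f.result() for f in futures]
```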
5. Export Directories (ADR-114 Compliant)
| LLM | Pending Directory |
|---|---|
| Claude | ~/PROJECTS/.coditect-data/sessions-export-pending-anthropic/ |
| Codex | ~/PROJECTS/.coditect-data/sessions-export-pending-codex/ |
| Gemini | ~/PROJECTS/.coditect-data/sessions-export-pending-gemini/ |
| KIMI | ~/PROJECTS/.coditect-data/sessions-export-pending-kimi/ |
Implementation
Command Interface
```shell
# Process ALL LLMs (default)
/cx

# Process specific LLM only
/cx --llm claude
/cx --llm codex
/cx --llm gemini

# With semantic embeddings
/cx --with-embeddings

# Dry run
/cx --dry-run

# Skip auto-indexing
/cx --no-index

# Process single file
/cx FILE
```
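The flag surface above could be modeled with `argparse` as below; flag names mirror the examples, while defaults and help strings are assumptions:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Sketch of the /cx flag surface; not the actual script's parser."""
    p = argparse.ArgumentParser(prog="cx")
    p.add_argument("file", nargs="?", help="process a single file")
    p.add_argument("--llm", choices=["claude", "codex", "gemini", "kimi"],
                   help="process a specific LLM only (default: all)")
    p.add_argument("--with-embeddings", action="store_true",
                   help="generate semantic embeddings")
    p.add_argument("--dry-run", action="store_true",
                   help="report what would be extracted without writing")
    p.add_argument("--no-index", action="store_true",
                   help="skip auto-indexing")
    return p
```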
LLM Auto-Detection
Detection priority:
1. Path pattern: `/.claude/` → claude, `/.codex/` → codex, `/.gemini/` → gemini, `/.kimi/` → kimi
2. File markers:
   - Claude: ASCII art banner, "Claude Code v", model identifiers
   - Codex: `"type": "codex"`, o1/o3 model references
   - Gemini: `"type": "gemini"`, gemini-2.0 references
   - KIMI: k1.5/k2 model references
Deduplication Strategy
- Hash Function: SHA-256 of normalized content
- Normalization: Trim whitespace, normalize line endings
- Conflict Resolution: First-seen wins (preserves original provenance)
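The strategy above can be sketched as: normalize, hash with SHA-256, keep the first occurrence. `normalized_hash` and `dedupe` are illustrative names:

```python
import hashlib

def normalized_hash(content: str) -> str:
    """SHA-256 of content with line endings normalized and
    surrounding whitespace trimmed."""
    normalized = content.replace("\r\n", "\n").strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def dedupe(messages: list[dict]) -> list[dict]:
    """Keep the first occurrence of each hash, preserving the
    original message's provenance."""
    seen: set[str] = set()
    unique = []
    for msg in messages:
        h = normalized_hash(msg["content"])
        if h not in seen:  # first-seen wins
            seen.add(h)
            unique.append(msg)
    return unique
```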
Consequences
Positive
- Unified memory across all LLM providers
- No data loss - all sessions captured with provenance
- Instant searchability via FTS5 indexing
- Knowledge preservation - decisions, patterns, errors extracted
- Multi-tenant ready - supports tenant/team/project isolation
Negative
- Storage growth - unified store can grow large (13+ GB)
- Processing time - large sessions take time to process
- Complexity - multiple LLM formats to maintain
Mitigations
- Tier separation (ADR-118): Only org.db requires backup
- Incremental processing: Only new messages extracted
- Archive pipeline: Processed exports moved to archive
Integration Points
| System | Integration |
|---|---|
| `/cxq` (ADR-021) | Query the extracted data |
| `/sx` | Interactive session export to pending directories |
| `/export` | Built-in Claude export to pending |
| Context Watcher | Auto-export on threshold (ADR-134) |
| Cloud Sync | Sync to PostgreSQL (ADR-053) |
Files
| File | Purpose |
|---|---|
| `scripts/unified-message-extractor.py` | Main extraction script |
| `scripts/extractors/claude_extractor.py` | Claude-specific extraction |
| `scripts/extractors/codex_extractor.py` | Codex-specific extraction |
| `scripts/extractors/gemini_extractor.py` | Gemini-specific extraction |
| `scripts/extractors/kimi_extractor.py` | KIMI-specific extraction |
| `commands/cx.md` | Command documentation |
Related
- ADR-021: Context Query System (`/cxq`)
- ADR-118: Four-Tier Database Architecture
- ADR-114: User Data Separation
- ADR-122: Unified LLM Architecture
- ADR-134: Unified Multi-LLM Watcher
- ADR-148: Database Schema Documentation Standard
Changelog
| Date | Change |
|---|---|
| 2025-12-10 | Initial version |
| 2026-01-28 | Added multi-LLM support (ADR-122) |
| 2026-02-03 | Documented as formal ADR (was referenced but not created) |
Track: J (Memory Intelligence)
Command: /cx
Script: scripts/unified-message-extractor.py