/cx - Context Extraction (Unified Multi-LLM Session Capture)
Extract and deduplicate messages from Claude, Codex, and Gemini sessions into a unified store with LLM source identification and automatic indexing for instant searchability. Supports ADR-122 multi-LLM architecture.
Usage
# Process ALL LLMs (Claude + Codex + Gemini) - DEFAULT
/cx
# Process specific LLM only
/cx --llm claude # Claude sessions only
/cx --llm codex # Codex sessions only
/cx --llm gemini # Gemini sessions only
# Project attribution (ADR-156)
/cx --project CUST-avivatec-fpa # Attribute to specific project
/cx # Auto-detects from CODITECT_PROJECT env var
# Also generate semantic embeddings for RAG search
/cx --with-embeddings
# Process single file (auto-detects type AND LLM source)
/cx FILE
# Only process large JSONL files (>10MB)
/cx --min-size 10
# Skip auto-indexing (extraction only)
/cx --no-index
# Skip call graph reindex (reindex is ON by default)
/cx --no-reindex
# Skip incremental document classification
/cx --no-classify
# Dry run (show what would be processed)
/cx --dry-run
# Keep export files in place (don't archive)
/cx --no-archive
# Verify file type detection
/cx FILE --verify
# Reindex specific path instead of default
/cx --reindex-path /path/to/code
# Show LLM source statistics
/cx --stats
# Standalone embedding backfill (J.16) — embeds ALL unembedded messages
/cx --embed-all
# Embedding backfill with custom batch size
/cx --embed-all --embedding-batch-size 500
System Prompt
⚠️ EXECUTION DIRECTIVE:
When the user invokes /cx with ANY arguments (or no arguments), you MUST:
- IMMEDIATELY execute the command - no questions, no explanations first
- ALWAYS show full output from the script execution
- ALWAYS provide summary with key metrics after execution completes
DO NOT:
- Say "I don't need to take action" - you ALWAYS execute when invoked
- Ask for confirmation - the user invoking the command IS the confirmation
- Skip execution even if it seems redundant - run it anyway
You are executing the unified message extractor for CODITECT session preservation.
Multi-LLM Architecture (ADR-122):
| LLM | Session Location | Pending Exports | llm_source |
|---|---|---|---|
| Claude | ~/.claude/projects/ | exports-pending/ | claude |
| Codex | ~/.codex/history.jsonl | sessions-export-pending-codex/ | codex |
| Gemini | ~/.gemini/sessions/ | sessions-export-pending-gemini/ | gemini |
Storage Locations (ADR-114 + ADR-122):
- Extracted messages: ~/PROJECTS/.coditect-data/context-storage/unified_messages.jsonl
- Claude exports pending: ~/PROJECTS/.coditect-data/context-storage/exports-pending/
- Codex exports pending: ~/PROJECTS/.coditect-data/sessions-export-pending-codex/
- Gemini exports pending: ~/PROJECTS/.coditect-data/sessions-export-pending-gemini/
- Export archive: ~/PROJECTS/.coditect-data/context-storage/exports-archive/
Export Pipeline:
Claude: session-exporter.py → exports-pending/ → /cx → exports-archive/
Codex: extract-codex-session.py → sessions-export-pending-codex/ → /cx → exports-archive/
Gemini: extract-gemini-session.py → sessions-export-pending-gemini/ → /cx → exports-archive/
Note: The context-storage/ directory and subdirectories are created automatically.
Default Behavior (no arguments):
- Processes ALL LLM sources (Claude, Codex, Gemini)
- Claude: JSONL session files from ~/.claude/projects/
- Codex: History from ~/.codex/history.jsonl
- Gemini: Sessions from ~/.gemini/sessions/
- Processes ALL export TXT/JSONL files from pending directories
- Archives export files after processing (moved to exports-archive/)
- Session files are READ-ONLY (never moved or modified)
- All unique messages extracted with llm_source and llm_model identification
- AUTO-INDEXES into SQLite (FTS5 full-text search)
- AUTO-EXTRACTS knowledge (decisions, patterns, error solutions)
- AUTO-EXTRACTS trajectory (tool calls with hash-based deduplication)
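The hash-based deduplication step can be pictured with a short sketch (illustrative only; `message_hash` and the exact fields hashed are assumptions, not the extractor's actual internals):

```python
import hashlib

def message_hash(role: str, content: str) -> str:
    """Stable identity for a message: SHA-256 over role + content."""
    return hashlib.sha256(f"{role}:{content}".encode("utf-8")).hexdigest()

def dedupe(messages, seen=None):
    """Keep only messages whose hash has not been seen before."""
    seen = set() if seen is None else seen
    unique = []
    for msg in messages:
        h = message_hash(msg["role"], msg["content"])
        if h not in seen:
            seen.add(h)
            unique.append({**msg, "hash": h})
    return unique

msgs = [
    {"role": "user", "content": "hello"},
    {"role": "user", "content": "hello"},   # exact duplicate: dropped
    {"role": "assistant", "content": "hi"},
]
unique = dedupe(msgs)  # 2 unique messages survive
```

Passing the same `seen` set across runs is what makes re-processing an already-extracted session a no-op.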
LLM Auto-Detection: The script auto-detects the LLM source by:
- Path pattern: /.claude/ → claude, /.codex/ → codex, /.gemini/ → gemini
- File markers:
  - Claude: ASCII art banner (▐▛███▜▌), "Claude Code v", model identifiers (Opus, Sonnet)
  - Codex: "type": "codex" in JSONL, o1/o3 model references
  - Gemini: "type": "gemini" in JSONL, gemini-2.0 model references
- Export metadata: LLM source embedded in export files
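The detection rules above amount to a simple cascade, roughly like this (a hypothetical `detect_llm_source`; the real script's heuristics may differ):

```python
def detect_llm_source(path: str, sample: str = "") -> str:
    """Path pattern first, then file markers; 'unknown' if nothing matches."""
    p = path.replace("\\", "/")
    for marker, source in (("/.claude/", "claude"),
                           ("/.codex/", "codex"),
                           ("/.gemini/", "gemini")):
        if marker in p:
            return source
    # Fall back to content markers when the path is inconclusive
    if '"type": "codex"' in sample:
        return "codex"
    if '"type": "gemini"' in sample:
        return "gemini"
    if "Claude Code v" in sample:
        return "claude"
    return "unknown"

result = detect_llm_source("/home/me/.codex/history.jsonl")  # "codex"
```

This is also what `--verify` exercises: a file matching no rule reports "unknown".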
Execute the unified message extractor (works from any directory in a CODITECT-enabled repo):
python3 "$(git rev-parse --show-toplevel)/.coditect/scripts/unified-message-extractor.py" $ARGS
Trajectory extraction runs automatically (H.5.6) - extracts tool_use entries from session files with hash-based deduplication into sessions.db tool_analytics table (ADR-118 Tier 3).
Options
LLM Selection (ADR-122)
| Option | Description |
|---|---|
| --llm LLM | Process specific LLM only: claude, codex, gemini, or all (default: all) |
| --stats | Show LLM source distribution statistics |
Project Attribution (ADR-156)
| Option | Description |
|---|---|
| --project PROJECT_ID | Attribute extracted messages to a specific project (e.g., CUST-avivatec-fpa) |
| --no-project | Disable project attribution (global scope) |
Resolution order: --project flag → CODITECT_PROJECT env var → discover_project() → None (global)
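That resolution order is a first-match-wins chain, roughly (sketch; `discover_project` stands in for the real repo-discovery helper):

```python
import os
from typing import Optional

def discover_project() -> Optional[str]:
    """Placeholder for repository-based discovery; None means global scope."""
    return None

def resolve_project(flag: Optional[str] = None) -> Optional[str]:
    """--project flag → CODITECT_PROJECT env var → discover_project() → None."""
    return flag or os.environ.get("CODITECT_PROJECT") or discover_project()
```

An explicit `--project` flag always wins, so one-off extractions can override a session-wide `CODITECT_PROJECT` export.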
File Processing
| Option | Description |
|---|---|
| FILE | Single file to process (auto-detects JSONL/TXT and LLM source) |
| --min-size MB | Minimum JSONL file size in MB (default: 0) |
| --archive-dir PATH | Custom archive directory (default: context-storage/exports-archive) |
| --no-archive | Keep export files in place after processing |
| --output PATH | Custom output directory (default: context-storage) |
| --merge | Merge existing legacy dedup stores into the unified store |
| --dry-run | Show what would be processed without making changes |
| --verify | Test file type and LLM detection without processing |
| --no-index | Skip auto-indexing (extraction only, no SQLite) |
| --with-embeddings | Also generate semantic embeddings for RAG search |
| --embed-all | Standalone embedding backfill: embed ALL unembedded messages with no timeout (J.16) |
| --embedding-batch-size N | Messages per batch for embedding generation (default: 1000) |
| --no-reindex | Skip call graph reindex (reindex is ON by default) |
| --reindex-path PATH | Specific path to reindex (default: ~/.coditect/scripts) |
Session Log Indexing (J.32)
| Option | Description |
|---|---|
| --no-session-logs | Skip session log indexing (default: enabled) |
| --session-logs-only | Only index session logs, skip everything else |
| --session-logs-root PATH | Custom session logs root directory |
Session Log Indexing Details (J.32):
- Discovers SESSION-LOG-*.md files under ~/.coditect-data/session-logs/projects/
- Extracts structured entries (timestamps, task IDs, authors, files modified)
- Stores in the session_log_entries table with an FTS5 index in sessions.db
- Cross-references in the messages table for unified /cxq search
- Incremental: mtime-based change detection skips unchanged files
- Project-scoped: extracts project_id from the directory path
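The mtime-based skip can be illustrated in a few lines (sketch; the indexer presumably persists its mtime map in sessions.db rather than holding it in memory):

```python
import os
import tempfile

def changed_files(paths, last_mtimes):
    """Return paths whose mtime differs from the recorded value (new or modified)."""
    changed = []
    for p in paths:
        mtime = os.path.getmtime(p)
        if last_mtimes.get(p) != mtime:
            changed.append(p)
            last_mtimes[p] = mtime
    return changed

with tempfile.TemporaryDirectory() as d:
    log = os.path.join(d, "SESSION-LOG-demo.md")
    with open(log, "w") as f:
        f.write("## 2025-12-10 entry\n")
    state = {}
    first_pass = changed_files([log], state)    # new file: indexed
    second_pass = changed_files([log], state)   # unchanged: skipped
```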
Project Operations (ADR-118 TIER 4, J.15.3-4)
| Option | Description |
|---|---|
| --register-project PATH | Register a project directory for indexing and search |
| --index-project PATH_OR_NAME | Index a registered project's source files (J.15.3) |
| --embed-project PATH_OR_NAME | Generate semantic embeddings for a project (J.15.4) |
| --reembed-all | Force re-embed of all files (use with --embed-project) |
| --list-projects | List all registered projects |
| --project-stats | Show projects.db statistics |
Project Indexing Details (J.15.3):
- J.15.3.1: File discovery with exclude patterns (.git, node_modules, venv, etc.)
- J.15.3.2: Content type detection (code, document, config, data, binary)
- J.15.3.3: Content hashing for change detection
- J.15.3.5: Incremental indexing (only changed files re-indexed)
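A sketch of the exclude-pattern and content-hash steps (the pattern list and helper names here are illustrative, not the script's actual configuration):

```python
import hashlib

# Assumed exclude list; the real indexer's patterns may be broader
EXCLUDED_DIRS = {".git", "node_modules", "venv", "__pycache__"}

def is_excluded(path: str) -> bool:
    """Skip files under any excluded directory component."""
    return any(part in EXCLUDED_DIRS for part in path.split("/"))

def content_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def needs_reindex(data: bytes, stored_hash) -> bool:
    """Re-index only when the content hash changed (or was never stored)."""
    return content_hash(data) != stored_hash
```

Hashing content rather than trusting mtimes alone means a touched-but-unchanged file is still skipped.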
Project Embedding Details (J.15.4):
- J.15.4.1: Content-type-specific chunking (code: function boundaries, docs: paragraphs)
- J.15.4.2: SentenceTransformer embeddings (all-MiniLM-L6-v2, 384 dimensions)
- J.15.4.3: Stored in project_embeddings table
- J.15.4.4: Hash-based invalidation (only re-embed changed files)
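For documents, paragraph-boundary chunking might look like this greedy packing sketch (the real chunker's limits and code-aware function-boundary splitting are more involved):

```python
def chunk_paragraphs(text: str, max_chars: int = 500):
    """Greedily pack whole paragraphs into chunks of at most max_chars."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)   # flush before the paragraph would overflow
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

doc = ("x" * 300 + "\n\n") * 3                 # three 300-char paragraphs
chunks = chunk_paragraphs(doc, max_chars=700)  # packs two, then one
```

Keeping whole paragraphs means each embedded chunk stays semantically coherent, at the cost of slightly uneven chunk sizes.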
Examples
Process Everything (Default)
/cx
Result: Processes all JSONL sessions + all exports, archives exports, auto-indexes to SQLite, extracts knowledge
Process Single Export
/cx 2025-12-10-EXPORT-session.txt
Result: Extracts messages, archives to exports-archive/
Only Large Sessions
/cx --min-size 10
Result: Processes JSONL >10MB + all exports
Preview First
/cx --dry-run
Result: Shows what would be processed without changes
Verify File Detection
/cx README.md --verify
Result: Reports "unknown" (not a recognized LLM session or export file)
Merge Legacy Stores
/cx --merge
Result: Imports from existing dedup_state stores into unified store
With Semantic Embeddings
/cx --with-embeddings
Result: Extract + index + embeddings for semantic/RAG search (slower)
Fast Extraction Only (No Indexing)
/cx --no-index
Result: Extract messages only, skip SQLite indexing (use when you'll index later)
Register, Index, and Embed a Project (J.15.3-4)
# Step 1: Register project
/cx --register-project ~/my-project
# Step 2: Index source files
/cx --index-project my-project
# Step 3: Generate semantic embeddings
/cx --embed-project my-project
# Force re-embed all files
/cx --embed-project my-project --reembed-all
# List all registered projects
/cx --list-projects
Result: Full project indexing with semantic embeddings for similarity search
Output Format
Unified messages are stored as JSONL with full provenance, LLM source identification, and project attribution (ADR-156):
{
"hash": "sha256...",
"content": "Full message text...",
"role": "assistant",
"llm_source": "claude",
"llm_model": "claude-opus-4-5",
"project_id": "CUST-avivatec-fpa",
"provenance": {
"source_type": "export",
"source_file": "/path/to/file.txt",
"source_line": 42,
"session_id": null,
"checkpoint": "2025-12-10-session"
},
"timestamps": {
"occurred": "2025-12-10T12:00:00Z",
"extracted_at": "2025-12-10T19:00:00Z"
},
"metadata": {
"content_length": 1247,
"has_code": true,
"has_markdown": true
}
}
LLM Source Values:
| llm_source | llm_model examples |
|---|---|
| claude | claude-opus-4-5, claude-sonnet-4, claude-haiku-4.5 |
| codex | o1-pro, o3, gpt-4o |
| gemini | gemini-2.0-flash, gemini-2.0-pro |
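The distribution that `--stats` reports can be computed straight from the JSONL store, along these lines (sketch; `llm_source_stats` is a hypothetical helper, not the script's API):

```python
import json
from collections import Counter

def llm_source_stats(jsonl_lines):
    """Count messages per llm_source and express each as a share of the total."""
    counts = Counter(json.loads(line)["llm_source"]
                     for line in jsonl_lines if line.strip())
    total = sum(counts.values()) or 1
    return {src: (n, round(100 * n / total)) for src, n in counts.items()}

store = [
    '{"llm_source": "claude"}',
    '{"llm_source": "claude"}',
    '{"llm_source": "codex"}',
    '{"llm_source": "gemini"}',
]
stats = llm_source_stats(store)  # claude: 2 (50%), codex: 1 (25%), gemini: 1 (25%)
```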
Workflow
/export # Export current session to TXT
/cx # Process all + auto-index (ONE COMMAND does everything!)
/cxq "search term" # Search immediately - database ready!
Simplified Pipeline:
Before: /export → /cx → /cxq --index → /cxq --extract → /cxq
Now: /export → /cx → /cxq (auto-indexing is automatic!)
Project-Scoped Workflow (ADR-156)
# Set project context for the session
export CODITECT_PROJECT=CUST-avivatec-fpa
# Or use /sx with project flag
/sx --llm claude --project CUST-avivatec-fpa
# Process with project attribution
/cx --project CUST-avivatec-fpa
# Query only this project's context
/cxq --decisions --project CUST-avivatec-fpa
Integration
Works with:
- /export - Create export files to capture
- /cxq - Query the SQLite database
- /trajectory - View execution trajectories (ADR-079)
Parallel Post-Processing Pipeline (v5.0.0):
/cx Pipeline
├── Sequential (must be first):
│ ├── Message extraction from JSONL/export
│ ├── Deduplication (hash-based)
│ └── Analytics save to sessions.db (Tier 3)
│
└── Parallel (different tables/resources):
├── Trajectory Extraction (H.5.6) → tool_analytics table
├── MCP Call Graph Reindex (H.5.5) → functions/edges tables
├── Incremental Classify (H.5.7) → .md files only
└── Session Log Indexing (J.32) → session_log_entries table
| Stage | Target | Runs In |
|---|---|---|
| Trajectory | tool_analytics table | Parallel |
| MCP Reindex | functions, edges tables | Parallel |
| Classify | .md files (no DB) | Parallel |
| Session Logs | session_log_entries + session_log_fts tables | Parallel |
Safe parallelization: WAL mode enabled, each stage writes to different tables/resources.
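The two ingredients that make this safe, WAL journaling and FTS5, can be exercised in isolation (minimal sketch; the table shape here is illustrative, not the actual sessions.db schema):

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "sessions.db")
conn = sqlite3.connect(db_path)

# WAL mode: readers proceed while a writer appends, so parallel stages
# that write to different tables don't block each other.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]

# FTS5 virtual table: full-text search over indexed content.
conn.execute("CREATE VIRTUAL TABLE session_log_fts USING fts5(content)")
conn.execute(
    "INSERT INTO session_log_fts (content) VALUES ('context extraction complete')"
)
conn.commit()

hits = conn.execute(
    "SELECT content FROM session_log_fts WHERE session_log_fts MATCH 'extraction'"
).fetchall()
conn.close()
```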
Required Summary Format
After execution completes, ALWAYS provide this summary:
✅ Context extraction complete!
Summary:
- LLMs processed: Claude, Codex, Gemini (or specific if --llm used)
- Processed: X JSONL session files + Y export files
- New messages extracted: N unique messages
- Total messages in store: T (was T-N)
- Auto-indexed: SQLite FTS5 database ready
- Knowledge extracted: Decisions, patterns, error solutions
LLM Source Breakdown:
| llm_source | Count | % |
|------------|-------|---|
| claude | X | X%|
| codex | Y | Y%|
| gemini | Z | Z%|
[Top files with new messages if any]
The context database is now fully updated and ready for querying with /cxq.
Success Output
When extraction completes successfully:
✅ COMMAND COMPLETE: /cx
- LLMs: Claude ✓ Codex ✓ Gemini ✓
- Processed: X JSONL session files + Y export files
- New messages extracted: N unique messages
- Total messages in store: T
- Auto-indexed: SQLite FTS5 database ready
LLM Source Distribution:
claude: X messages (X%)
codex: Y messages (Y%)
gemini: Z messages (Z%)
⚡ Parallel execution completed in X.Xs:
- Trajectory: Z new tool calls (duplicates skipped)
- MCP Reindex: N functions, M edges
- Classify: K changed files classified
- Session Logs: E entries from F files (P projects)
Completion Checklist
Before marking complete:
- All session files processed
- Export files archived
- Database indexed for search
- Summary displayed to user
Failure Indicators
This command has FAILED if:
- ❌ No session files found
- ❌ Write permission denied to context-storage/
- ❌ Indexing failed
- ❌ Archive directory not created
When NOT to Use
Do NOT use when:
- In the middle of active session (wait until end)
- Context storage is locked by another process
- Disk space critically low
Anti-Patterns (Avoid)
| Anti-Pattern | Problem | Solution |
|---|---|---|
| Running mid-session | Incomplete capture | Run at session end |
| Skipping --index | Search unavailable | Always auto-index (default) |
| Not checking output | Miss extraction errors | Verify message counts |
Principles
This command embodies:
- #1 Recycle, Extend, Re-Use - Preserves knowledge for reuse
- #3 Complete Execution - Full extraction pipeline
- #9 Based on Facts - Captures actual session data
Full Standard: CODITECT-STANDARD-AUTOMATION.md
Script: .coditect/scripts/unified-message-extractor.py (via git root)
Version: 4.0.0 (Project Attribution - ADR-156)
Last Updated: 2026-02-04
Related ADRs: ADR-020 (Context Extraction), ADR-114 (User Data Paths), ADR-122 (Unified LLM Architecture), ADR-156 (Project-Scoped Context)