
/cx - Context Extraction (Unified Multi-LLM Session Capture)

Extract and deduplicate messages from Claude, Codex, and Gemini sessions into a unified store with LLM source identification and automatic indexing for instant searchability. Supports ADR-122 multi-LLM architecture.

Usage

# Process ALL LLMs (Claude + Codex + Gemini) - DEFAULT
/cx

# Process specific LLM only
/cx --llm claude # Claude sessions only
/cx --llm codex # Codex sessions only
/cx --llm gemini # Gemini sessions only

# Project attribution (ADR-156)
/cx --project CUST-avivatec-fpa # Attribute to specific project
/cx # Auto-detects from CODITECT_PROJECT env var

# Also generate semantic embeddings for RAG search
/cx --with-embeddings

# Process single file (auto-detects type AND LLM source)
/cx FILE

# Only process large JSONL files (>10MB)
/cx --min-size 10

# Skip auto-indexing (extraction only)
/cx --no-index

# Skip call graph reindex (reindex is ON by default)
/cx --no-reindex

# Skip incremental document classification
/cx --no-classify

# Dry run (show what would be processed)
/cx --dry-run

# Keep export files in place (don't archive)
/cx --no-archive

# Verify file type detection
/cx FILE --verify

# Reindex specific path instead of default
/cx --reindex-path /path/to/code

# Show LLM source statistics
/cx --stats

# Standalone embedding backfill (J.16) — embeds ALL unembedded messages
/cx --embed-all

# Embedding backfill with custom batch size
/cx --embed-all --embedding-batch-size 500

System Prompt

⚠️ EXECUTION DIRECTIVE: When the user invokes /cx with ANY arguments (or no arguments), you MUST:

  1. IMMEDIATELY execute the command - no questions, no explanations first
  2. ALWAYS show full output from the script execution
  3. ALWAYS provide summary with key metrics after execution completes

DO NOT:

  • Say "I don't need to take action" - you ALWAYS execute when invoked
  • Ask for confirmation - the user invoking the command IS the confirmation
  • Skip execution even if it seems redundant - run it anyway

You are executing the unified message extractor for CODITECT session preservation.

Multi-LLM Architecture (ADR-122):

| LLM | Session Location | Pending Exports | llm_source |
|--------|------------------------|----------------------------------|--------|
| Claude | ~/.claude/projects/ | exports-pending/ | claude |
| Codex | ~/.codex/history.jsonl | sessions-export-pending-codex/ | codex |
| Gemini | ~/.gemini/sessions/ | sessions-export-pending-gemini/ | gemini |

Storage Locations (ADR-114 + ADR-122):

  • Extracted messages: ~/PROJECTS/.coditect-data/context-storage/unified_messages.jsonl
  • Claude exports pending: ~/PROJECTS/.coditect-data/context-storage/exports-pending/
  • Codex exports pending: ~/PROJECTS/.coditect-data/sessions-export-pending-codex/
  • Gemini exports pending: ~/PROJECTS/.coditect-data/sessions-export-pending-gemini/
  • Export archive: ~/PROJECTS/.coditect-data/context-storage/exports-archive/

Export Pipeline:

Claude: session-exporter.py       → exports-pending/               → /cx → exports-archive/
Codex:  extract-codex-session.py  → sessions-export-pending-codex/ → /cx → exports-archive/
Gemini: extract-gemini-session.py → sessions-export-pending-gemini/ → /cx → exports-archive/

Note: The context-storage/ directory and subdirectories are created automatically.

Default Behavior (no arguments):

  • Processes ALL LLM sources (Claude, Codex, Gemini)
  • Claude: JSONL session files from ~/.claude/projects/
  • Codex: History from ~/.codex/history.jsonl
  • Gemini: Sessions from ~/.gemini/sessions/
  • Processes ALL export TXT/JSONL files from pending directories
  • Archives export files after processing (moved to exports-archive/)
  • Session files are READ-ONLY (never moved or modified)
  • All unique messages extracted with llm_source and llm_model identification
  • AUTO-INDEXES into SQLite (FTS5 full-text search)
  • AUTO-EXTRACTS knowledge (decisions, patterns, error solutions)
  • AUTO-EXTRACTS trajectory (tool calls with hash-based deduplication)
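The dedup step above is hash-keyed. A minimal sketch of how content hashing could drive it; the real extractor's normalization rules are internal to unified-message-extractor.py, so the strip-and-prefix scheme here is an assumption:

```python
import hashlib
import json

def message_hash(content, role):
    # Assumed normalization: strip whitespace, prefix with role.
    return hashlib.sha256(f"{role}:{content.strip()}".encode("utf-8")).hexdigest()

def extract_unique(jsonl_lines, seen):
    """Yield only messages whose hash has not been seen before."""
    for line in jsonl_lines:
        msg = json.loads(line)
        h = message_hash(msg["content"], msg["role"])
        if h not in seen:
            seen.add(h)
            yield {**msg, "hash": h}
```

With this scheme, two copies of the same message (even with trailing whitespace differences) collapse to a single stored record.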

LLM Auto-Detection: The script auto-detects LLM source by:

  1. Path pattern: /.claude/ → claude, /.codex/ → codex, /.gemini/ → gemini
  2. File markers:
    • Claude: ASCII art banner (▐▛███▜▌), "Claude Code v", Model identifiers (Opus, Sonnet)
    • Codex: "type": "codex" in JSONL, o1/o3 model references
    • Gemini: "type": "gemini" in JSONL, gemini-2.0 model references
  3. Export metadata: LLM source embedded in export files
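The detection order above (path pattern first, then content markers) can be sketched as follows; the marker strings are only the ones listed here, not an exhaustive set, and the function name is hypothetical:

```python
def detect_llm_source(path, first_lines):
    """Heuristic LLM-source detection: path pattern first, then markers."""
    for marker, source in ((".claude/", "claude"),
                           (".codex/", "codex"),
                           (".gemini/", "gemini")):
        if f"/{marker}" in path:
            return source
    sample = "\n".join(first_lines)
    if "Claude Code v" in sample or "▐▛███▜▌" in sample:
        return "claude"
    if '"type": "codex"' in sample:
        return "codex"
    if '"type": "gemini"' in sample:
        return "gemini"
    return "unknown"
```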

Execute the unified message extractor (works from any directory in a CODITECT-enabled repo):

python3 "$(git rev-parse --show-toplevel)/.coditect/scripts/unified-message-extractor.py" $ARGS

Trajectory extraction runs automatically (H.5.6) - extracts tool_use entries from session files with hash-based deduplication into sessions.db tool_analytics table (ADR-118 Tier 3).
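A sketch of what hash-deduplicated tool_use extraction could look like; the JSONL entry shape and the tool_analytics record layout are assumptions:

```python
import hashlib
import json

def extract_tool_calls(jsonl_lines):
    """Collect tool_use blocks from session JSONL, deduped by content hash."""
    seen, calls = set(), []
    for line in jsonl_lines:
        entry = json.loads(line)
        for block in entry.get("content", []):
            if isinstance(block, dict) and block.get("type") == "tool_use":
                # Canonical JSON dump gives a stable dedup key.
                key = hashlib.sha256(
                    json.dumps(block, sort_keys=True).encode()
                ).hexdigest()
                if key not in seen:
                    seen.add(key)
                    calls.append({"hash": key, "tool": block.get("name")})
    return calls
```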

Options

LLM Selection (ADR-122)

| Option | Description |
|--------|-------------|
| --llm LLM | Process specific LLM only: claude, codex, gemini, or all (default: all) |
| --stats | Show LLM source distribution statistics |

Project Attribution (ADR-156)

| Option | Description |
|--------|-------------|
| --project PROJECT_ID | Attribute extracted messages to specific project (e.g., CUST-avivatec-fpa) |
| --no-project | Disable project attribution (global scope) |

Resolution order: --project flag → CODITECT_PROJECT env var → discover_project() → None (global)
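That resolution chain is simple to express; discover below is a stand-in for the real discover_project() helper, whose logic is not documented here:

```python
import os

def resolve_project(flag=None, discover=lambda: None):
    """Resolution order: --project flag, CODITECT_PROJECT env var,
    discover_project(), then None (global scope)."""
    if flag:
        return flag
    env = os.environ.get("CODITECT_PROJECT")
    if env:
        return env
    return discover()
```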

File Processing

| Option | Description |
|--------|-------------|
| FILE | Single file to process (auto-detects JSONL/TXT and LLM source) |
| --min-size MB | Minimum JSONL file size in MB (default: 0) |
| --archive-dir PATH | Custom archive directory (default: context-storage/exports-archive) |
| --no-archive | Keep export files in place after processing |
| --output PATH | Custom output directory (default: context-storage) |
| --merge | Merge existing legacy dedup stores into unified store |
| --dry-run | Show what would be processed without making changes |
| --verify | Test file type and LLM detection without processing |
| --no-index | Skip auto-indexing (extraction only, no SQLite) |
| --with-embeddings | Also generate semantic embeddings for RAG search |
| --embed-all | Standalone embedding backfill: embed ALL unembedded messages with no timeout (J.16) |
| --embedding-batch-size N | Messages per batch for embedding generation (default: 1000) |
| --no-reindex | Skip call graph reindex (reindex is ON by default) |
| --reindex-path PATH | Specific path to reindex (default: ~/.coditect/scripts) |

Session Log Indexing (J.32)

| Option | Description |
|--------|-------------|
| --no-session-logs | Skip session log indexing (default: enabled) |
| --session-logs-only | Only index session logs, skip everything else |
| --session-logs-root PATH | Custom session logs root directory |

Session Log Indexing Details (J.32):

  • Discovers SESSION-LOG-*.md files under ~/.coditect-data/session-logs/projects/
  • Extracts structured entries (timestamps, task IDs, authors, files modified)
  • Stores in session_log_entries table with FTS5 index in sessions.db
  • Cross-references in messages table for unified /cxq search
  • Incremental: mtime-based change detection skips unchanged files
  • Project-scoped: extracts project_id from directory path
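The mtime-based incremental check can be sketched like this, with a plain dict standing in for the per-file state that actually lives in sessions.db:

```python
import os

def needs_reindex(path, indexed_mtimes):
    """Return True only when a file's mtime differs from the stored value."""
    mtime = os.path.getmtime(path)
    if indexed_mtimes.get(path) == mtime:
        return False  # unchanged since last run; skip it
    indexed_mtimes[path] = mtime
    return True
```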

Project Operations (ADR-118 TIER 4, J.15.3-4)

| Option | Description |
|--------|-------------|
| --register-project PATH | Register a project directory for indexing and search |
| --index-project PATH_OR_NAME | Index a registered project's source files (J.15.3) |
| --embed-project PATH_OR_NAME | Generate semantic embeddings for a project (J.15.4) |
| --reembed-all | Force re-embed all files (use with --embed-project) |
| --list-projects | List all registered projects |
| --project-stats | Show projects.db statistics |

Project Indexing Details (J.15.3):

  • J.15.3.1: File discovery with exclude patterns (.git, node_modules, venv, etc.)
  • J.15.3.2: Content type detection (code, document, config, data, binary)
  • J.15.3.3: Content hashing for change detection
  • J.15.3.5: Incremental indexing (only changed files re-indexed)
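The exclude-pattern discovery step (J.15.3.1) can be sketched with os.walk; the exclude set below is a subset of whatever the real script filters:

```python
import os

EXCLUDE = {".git", "node_modules", "venv", "__pycache__"}

def discover_files(root):
    """Walk a project tree, pruning excluded directories in place."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Mutating dirnames in place stops os.walk from descending.
        dirnames[:] = [d for d in dirnames if d not in EXCLUDE]
        for name in filenames:
            found.append(os.path.join(dirpath, name))
    return found
```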

Project Embedding Details (J.15.4):

  • J.15.4.1: Content-type-specific chunking (code: function boundaries, docs: paragraphs)
  • J.15.4.2: SentenceTransformer embeddings (all-MiniLM-L6-v2, 384 dimensions)
  • J.15.4.3: Stored in project_embeddings table
  • J.15.4.4: Hash-based invalidation (only re-embed changed files)
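Paragraph-boundary chunking for documents (J.15.4.1) might look like the sketch below; the max_chars limit and the greedy packing strategy are assumptions, not the script's actual parameters:

```python
def chunk_document(text, max_chars=800):
    """Pack whole paragraphs into chunks, never splitting mid-paragraph."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be embedded (per J.15.4.2, with all-MiniLM-L6-v2) and stored alongside its content hash for invalidation.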

Examples

Process Everything (Default)

/cx

Result: Processes all JSONL sessions + all exports, archives exports, auto-indexes to SQLite, extracts knowledge

Process Single Export

/cx 2025-12-10-EXPORT-session.txt

Result: Extracts messages, archives to exports-archive/

Only Large Sessions

/cx --min-size 10

Result: Processes JSONL >10MB + all exports

Preview First

/cx --dry-run

Result: Shows what would be processed without changes

Verify File Detection

/cx README.md --verify

Result: Shows "unknown" (not a recognized LLM session or export file)

Merge Legacy Stores

/cx --merge

Result: Imports from existing dedup_state stores into unified store

With Semantic Embeddings

/cx --with-embeddings

Result: Extract + index + embeddings for semantic/RAG search (slower)

Fast Extraction Only (No Indexing)

/cx --no-index

Result: Extract messages only, skip SQLite indexing (use when you'll index later)

Register, Index, and Embed a Project (J.15.3-4)

# Step 1: Register project
/cx --register-project ~/my-project

# Step 2: Index source files
/cx --index-project my-project

# Step 3: Generate semantic embeddings
/cx --embed-project my-project

# Force re-embed all files
/cx --embed-project my-project --reembed-all

# List all registered projects
/cx --list-projects

Result: Full project indexing with semantic embeddings for similarity search

Output Format

Unified messages are stored as JSONL with full provenance, LLM source identification, and project attribution (ADR-156):

{
  "hash": "sha256...",
  "content": "Full message text...",
  "role": "assistant",
  "llm_source": "claude",
  "llm_model": "claude-opus-4-5",
  "project_id": "CUST-avivatec-fpa",
  "provenance": {
    "source_type": "export",
    "source_file": "/path/to/file.txt",
    "source_line": 42,
    "session_id": null,
    "checkpoint": "2025-12-10-session"
  },
  "timestamps": {
    "occurred": "2025-12-10T12:00:00Z",
    "extracted_at": "2025-12-10T19:00:00Z"
  },
  "metadata": {
    "content_length": 1247,
    "has_code": true,
    "has_markdown": true
  }
}

LLM Source Values:

| llm_source | llm_model examples |
|------------|--------------------|
| claude | claude-opus-4-5, claude-sonnet-4, claude-haiku-4.5 |
| codex | o1-pro, o3, gpt-4o |
| gemini | gemini-2.0-flash, gemini-2.0-pro |

Workflow

/export              # Export current session to TXT
/cx                  # Process all + auto-index (ONE COMMAND does everything!)
/cxq "search term"   # Search immediately - database ready!

Simplified Pipeline:

  • Before: /export → /cx → /cxq --index → /cxq --extract → /cxq
  • Now: /export → /cx → /cxq (auto-indexing is automatic!)

Project-Scoped Workflow (ADR-156)

# Set project context for the session
export CODITECT_PROJECT=CUST-avivatec-fpa

# Or use /sx with project flag
/sx --llm claude --project CUST-avivatec-fpa

# Process with project attribution
/cx --project CUST-avivatec-fpa

# Query only this project's context
/cxq --decisions --project CUST-avivatec-fpa

Integration

Works with:

  • /export - Create export files to capture
  • /cxq - Query the SQLite database
  • /trajectory - View execution trajectories (ADR-079)

Parallel Post-Processing Pipeline (v5.0.0):

/cx Pipeline
├── Sequential (must be first):
│   ├── Message extraction from JSONL/export
│   ├── Deduplication (hash-based)
│   └── Analytics save to sessions.db (Tier 3)
│
└── Parallel (different tables/resources):
    ├── Trajectory Extraction (H.5.6) → tool_analytics table
    ├── MCP Call Graph Reindex (H.5.5) → functions/edges tables
    ├── Incremental Classify (H.5.7) → .md files only
    └── Session Log Indexing (J.32) → session_log_entries table

| Stage | Target | Runs In |
|-------|--------|---------|
| Trajectory | tool_analytics table | Parallel |
| MCP Reindex | functions, edges tables | Parallel |
| Classify | .md files (no DB) | Parallel |
| Session Logs | session_log_entries + session_log_fts tables | Parallel |

Safe parallelization: WAL mode enabled, each stage writes to different tables/resources.
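The safe-parallelization claim rests on SQLite's WAL journaling, which lets readers proceed while one writer commits. A minimal sketch of how a stage might open the database; the helper name is hypothetical:

```python
import sqlite3

def open_wal(db_path):
    """Open a SQLite database with WAL journaling enabled,
    so parallel stages writing to different tables do not block reads."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")
    return conn
```

Note that WAL is a property of the database file, so any stage opening it afterward inherits the mode.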


Required Summary Format

After execution completes, ALWAYS provide this summary:

✅ Context extraction complete!

Summary:
- LLMs processed: Claude, Codex, Gemini (or specific if --llm used)
- Processed: X JSONL session files + Y export files
- New messages extracted: N unique messages
- Total messages in store: T (was T-N)
- Auto-indexed: SQLite FTS5 database ready
- Knowledge extracted: Decisions, patterns, error solutions

LLM Source Breakdown:
| llm_source | Count | % |
|------------|-------|---|
| claude | X | X%|
| codex | Y | Y%|
| gemini | Z | Z%|

[Top files with new messages if any]

The context database is now fully updated and ready for querying with /cxq.

Success Output

When extraction completes successfully:

✅ COMMAND COMPLETE: /cx
- LLMs: Claude ✓ Codex ✓ Gemini ✓
- Processed: X JSONL session files + Y export files
- New messages extracted: N unique messages
- Total messages in store: T
- Auto-indexed: SQLite FTS5 database ready

LLM Source Distribution:
claude: X messages (X%)
codex: Y messages (Y%)
gemini: Z messages (Z%)

⚡ Parallel execution completed in X.Xs:
- Trajectory: Z new tool calls (duplicates skipped)
- MCP Reindex: N functions, M edges
- Classify: K changed files classified
- Session Logs: E entries from F files (P projects)

Completion Checklist

Before marking complete:

  • All session files processed
  • Export files archived
  • Database indexed for search
  • Summary displayed to user

Failure Indicators

This command has FAILED if:

  • ❌ No session files found
  • ❌ Write permission denied to context-storage/
  • ❌ Indexing failed
  • ❌ Archive directory not created

When NOT to Use

Do NOT use when:

  • In the middle of active session (wait until end)
  • Context storage is locked by another process
  • Disk space critically low

Anti-Patterns (Avoid)

| Anti-Pattern | Problem | Solution |
|--------------|---------|----------|
| Running mid-session | Incomplete capture | Run at session end |
| Skipping indexing (--no-index) | Search unavailable | Keep auto-indexing (default) |
| Not checking output | Miss extraction errors | Verify message counts |

Principles

This command embodies:

  • #1 Recycle, Extend, Re-Use - Preserves knowledge for reuse
  • #3 Complete Execution - Full extraction pipeline
  • #9 Based on Facts - Captures actual session data

Full Standard: CODITECT-STANDARD-AUTOMATION.md


Script: .coditect/scripts/unified-message-extractor.py (via git root)
Version: 4.0.0 (Project Attribution - ADR-156)
Last Updated: 2026-02-04
Related ADRs: ADR-020 (Context Extraction), ADR-114 (User Data Paths), ADR-122 (Unified LLM Architecture), ADR-156 (Project-Scoped Context)