
/cx - Context Extraction (Unified Multi-LLM Session Capture)

Extract and deduplicate messages from Claude, Codex, and Gemini sessions into a unified store with LLM source identification and automatic indexing for instant searchability. Supports ADR-122 multi-LLM architecture.

Usage

# Process ALL LLMs (Claude + Codex + Gemini) - DEFAULT
/cx

# Process specific LLM only
/cx --llm claude # Claude sessions only
/cx --llm codex # Codex sessions only
/cx --llm gemini # Gemini sessions only

# Project attribution (ADR-156)
/cx --project CUST-avivatec-fpa # Attribute to specific project
/cx # Auto-detects from CODITECT_PROJECT env var

# Also generate semantic embeddings for RAG search
/cx --with-embeddings

# Process single file (auto-detects type AND LLM source)
/cx FILE

# Only process large JSONL files (>10MB)
/cx --min-size 10

# Skip auto-indexing (extraction only)
/cx --no-index

# Skip call graph reindex (reindex is ON by default)
/cx --no-reindex

# Skip incremental document classification
/cx --no-classify

# Dry run (show what would be processed)
/cx --dry-run

# Keep export files in place (don't archive)
/cx --no-archive

# Verify file type detection
/cx FILE --verify

# Reindex specific path instead of default
/cx --reindex-path /path/to/code

# Show LLM source statistics
/cx --stats

# Standalone embedding backfill (J.16) — embeds ALL unembedded messages
/cx --embed-all

# Embedding backfill with custom batch size
/cx --embed-all --embedding-batch-size 500

System Prompt

⚠️ EXECUTION DIRECTIVE: When the user invokes /cx with ANY arguments (or no arguments), you MUST:

  1. IMMEDIATELY execute the command - no questions, no explanations first
  2. ALWAYS show full output from the script execution
  3. ALWAYS provide summary with key metrics after execution completes

DO NOT:

  • Say "I don't need to take action" - you ALWAYS execute when invoked
  • Ask for confirmation - the user invoking the command IS the confirmation
  • Skip execution even if it seems redundant - run it anyway

You are executing the unified message extractor for CODITECT session preservation.

Multi-LLM Architecture (ADR-122):

| LLM | Session Location | Pending Exports | llm_source |
|--------|------------------------|----------------------------------|--------|
| Claude | ~/.claude/projects/ | exports-pending/ | claude |
| Codex | ~/.codex/history.jsonl | sessions-export-pending-codex/ | codex |
| Gemini | ~/.gemini/sessions/ | sessions-export-pending-gemini/ | gemini |

Storage Locations (ADR-114 + ADR-122):

  • Extracted messages: ~/PROJECTS/.coditect-data/context-storage/unified_messages.jsonl
  • Claude exports pending: ~/PROJECTS/.coditect-data/context-storage/exports-pending/
  • Codex exports pending: ~/PROJECTS/.coditect-data/sessions-export-pending-codex/
  • Gemini exports pending: ~/PROJECTS/.coditect-data/sessions-export-pending-gemini/
  • Export archive: ~/PROJECTS/.coditect-data/context-storage/exports-archive/

Export Pipeline:

Claude: session-exporter.py       → exports-pending/               → /cx → exports-archive/
Codex:  extract-codex-session.py  → sessions-export-pending-codex/ → /cx → exports-archive/
Gemini: extract-gemini-session.py → sessions-export-pending-gemini/ → /cx → exports-archive/

Note: The context-storage/ directory and subdirectories are created automatically.

Default Behavior (no arguments):

  • Processes ALL LLM sources (Claude, Codex, Gemini)
  • Claude: JSONL session files from ~/.claude/projects/
  • Codex: History from ~/.codex/history.jsonl
  • Gemini: Sessions from ~/.gemini/sessions/
  • Processes ALL export TXT/JSONL files from pending directories
  • Archives export files after processing (moved to exports-archive/)
  • Session files are READ-ONLY (never moved or modified)
  • All unique messages extracted with llm_source and llm_model identification
  • AUTO-INDEXES into SQLite (FTS5 full-text search)
  • AUTO-EXTRACTS knowledge (decisions, patterns, error solutions)
  • AUTO-EXTRACTS trajectory (tool calls with hash-based deduplication)
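The dedup step above is hash-keyed. A minimal sketch of how content hashing could drive it; the real extractor's normalization rules are internal to unified-message-extractor.py, so the strip-and-prefix scheme here is an assumption:

```python
import hashlib
import json

def message_hash(content, role):
    # Assumed normalization: strip whitespace, prefix with role.
    return hashlib.sha256(f"{role}:{content.strip()}".encode("utf-8")).hexdigest()

def extract_unique(jsonl_lines, seen):
    """Yield only messages whose hash has not been seen before."""
    for line in jsonl_lines:
        msg = json.loads(line)
        h = message_hash(msg["content"], msg["role"])
        if h not in seen:
            seen.add(h)
            yield {**msg, "hash": h}
```

With this scheme, two copies of the same message (even with trailing whitespace differences) collapse to a single stored record.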

LLM Auto-Detection: The script auto-detects LLM source by:

  1. Path pattern: /.claude/ → claude, /.codex/ → codex, /.gemini/ → gemini
  2. File markers:
    • Claude: ASCII art banner (▐▛███▜▌), "Claude Code v", Model identifiers (Opus, Sonnet)
    • Codex: "type": "codex" in JSONL, o1/o3 model references
    • Gemini: "type": "gemini" in JSONL, gemini-2.0 model references
  3. Export metadata: LLM source embedded in export files
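The detection order above (path pattern first, then content markers) can be sketched as follows; the marker strings are only the ones listed here, not an exhaustive set, and the function name is hypothetical:

```python
def detect_llm_source(path, first_lines):
    """Heuristic LLM-source detection: path pattern first, then markers."""
    for marker, source in ((".claude/", "claude"),
                           (".codex/", "codex"),
                           (".gemini/", "gemini")):
        if f"/{marker}" in path:
            return source
    sample = "\n".join(first_lines)
    if "Claude Code v" in sample or "▐▛███▜▌" in sample:
        return "claude"
    if '"type": "codex"' in sample:
        return "codex"
    if '"type": "gemini"' in sample:
        return "gemini"
    return "unknown"
```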

Execute the unified message extractor (works from any directory in a CODITECT-enabled repo):

python3 "$(git rev-parse --show-toplevel)/.coditect/scripts/unified-message-extractor.py" $ARGS

Trajectory extraction runs automatically (H.5.6) - extracts tool_use entries from session files with hash-based deduplication into sessions.db tool_analytics table (ADR-118 Tier 3).
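A sketch of what hash-deduplicated tool_use extraction could look like; the JSONL entry shape and the tool_analytics record layout are assumptions:

```python
import hashlib
import json

def extract_tool_calls(jsonl_lines):
    """Collect tool_use blocks from session JSONL, deduped by content hash."""
    seen, calls = set(), []
    for line in jsonl_lines:
        entry = json.loads(line)
        for block in entry.get("content", []):
            if isinstance(block, dict) and block.get("type") == "tool_use":
                # Canonical JSON dump gives a stable dedup key.
                key = hashlib.sha256(
                    json.dumps(block, sort_keys=True).encode()
                ).hexdigest()
                if key not in seen:
                    seen.add(key)
                    calls.append({"hash": key, "tool": block.get("name")})
    return calls
```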

Options

LLM Selection (ADR-122)

| Option | Description |
|--------|-------------|
| --llm LLM | Process specific LLM only: claude, codex, gemini, or all (default: all) |
| --stats | Show LLM source distribution statistics |

Project Attribution (ADR-156)

| Option | Description |
|--------|-------------|
| --project PROJECT_ID | Attribute extracted messages to specific project (e.g., CUST-avivatec-fpa) |
| --no-project | Disable project attribution (global scope) |

Resolution order: --project flag → CODITECT_PROJECT env var → discover_project() → None (global)
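That resolution chain is simple to express; discover below is a stand-in for the real discover_project() helper, whose logic is not documented here:

```python
import os

def resolve_project(flag=None, discover=lambda: None):
    """Resolution order: --project flag, CODITECT_PROJECT env var,
    discover_project(), then None (global scope)."""
    if flag:
        return flag
    env = os.environ.get("CODITECT_PROJECT")
    if env:
        return env
    return discover()
```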

File Processing

| Option | Description |
|--------|-------------|
| FILE | Single file to process (auto-detects JSONL/TXT and LLM source) |
| --min-size MB | Minimum JSONL file size in MB (default: 0) |
| --archive-dir PATH | Custom archive directory (default: context-storage/exports-archive) |
| --no-archive | Keep export files in place after processing |
| --output PATH | Custom output directory (default: context-storage) |
| --merge | Merge existing legacy dedup stores into unified store |
| --dry-run | Show what would be processed without making changes |
| --verify | Test file type and LLM detection without processing |
| --no-index | Skip auto-indexing (extraction only, no SQLite) |
| --with-embeddings | Also generate semantic embeddings for RAG search |
| --embed-all | Standalone embedding backfill: embed ALL unembedded messages with no timeout (J.16) |
| --embedding-batch-size N | Messages per batch for embedding generation (default: 1000) |
| --no-reindex | Skip call graph reindex (reindex is ON by default) |
| --reindex-path PATH | Specific path to reindex (default: ~/.coditect/scripts) |

Session Log Indexing (J.32)

| Option | Description |
|--------|-------------|
| --no-session-logs | Skip session log indexing (default: enabled) |
| --session-logs-only | Only index session logs, skip everything else |
| --session-logs-root PATH | Custom session logs root directory |

Session Log Indexing Details (J.32):

  • Discovers SESSION-LOG-*.md files under ~/.coditect-data/session-logs/projects/
  • Extracts structured entries (timestamps, task IDs, authors, files modified)
  • Stores in session_log_entries table with FTS5 index in sessions.db
  • Cross-references in messages table for unified /cxq search
  • Incremental: mtime-based change detection skips unchanged files
  • Project-scoped: extracts project_id from directory path
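The mtime-based incremental check can be sketched like this, with a plain dict standing in for the per-file state that actually lives in sessions.db:

```python
import os

def needs_reindex(path, indexed_mtimes):
    """Return True only when a file's mtime differs from the stored value."""
    mtime = os.path.getmtime(path)
    if indexed_mtimes.get(path) == mtime:
        return False  # unchanged since last run; skip it
    indexed_mtimes[path] = mtime
    return True
```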

Project Operations (ADR-118 TIER 4, J.15.3-4)

| Option | Description |
|--------|-------------|
| --register-project PATH | Register a project directory for indexing and search |
| --index-project PATH_OR_NAME | Index a registered project's source files (J.15.3) |
| --embed-project PATH_OR_NAME | Generate semantic embeddings for a project (J.15.4) |
| --reembed-all | Force re-embed all files (use with --embed-project) |
| --list-projects | List all registered projects |
| --project-stats | Show projects.db statistics |

Project Indexing Details (J.15.3):

  • J.15.3.1: File discovery with exclude patterns (.git, node_modules, venv, etc.)
  • J.15.3.2: Content type detection (code, document, config, data, binary)
  • J.15.3.3: Content hashing for change detection
  • J.15.3.5: Incremental indexing (only changed files re-indexed)
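The exclude-pattern discovery step (J.15.3.1) can be sketched with os.walk; the exclude set below is a subset of whatever the real script filters:

```python
import os

EXCLUDE = {".git", "node_modules", "venv", "__pycache__"}

def discover_files(root):
    """Walk a project tree, pruning excluded directories in place."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Mutating dirnames in place stops os.walk from descending.
        dirnames[:] = [d for d in dirnames if d not in EXCLUDE]
        for name in filenames:
            found.append(os.path.join(dirpath, name))
    return found
```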

Project Embedding Details (J.15.4):

  • J.15.4.1: Content-type-specific chunking (code: function boundaries, docs: paragraphs)
  • J.15.4.2: SentenceTransformer embeddings (all-MiniLM-L6-v2, 384 dimensions)
  • J.15.4.3: Stored in project_embeddings table
  • J.15.4.4: Hash-based invalidation (only re-embed changed files)
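Paragraph-boundary chunking for documents (J.15.4.1) might look like the sketch below; the max_chars limit and the greedy packing strategy are assumptions, not the script's actual parameters:

```python
def chunk_document(text, max_chars=800):
    """Pack whole paragraphs into chunks, never splitting mid-paragraph."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be embedded (per J.15.4.2, with all-MiniLM-L6-v2) and stored alongside its content hash for invalidation.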

Examples

Process Everything (Default)

/cx

Result: Processes all JSONL sessions + all exports, archives exports, auto-indexes to SQLite, extracts knowledge

Process Single Export

/cx 2025-12-10-EXPORT-session.txt

Result: Extracts messages, archives to exports-archive/

Only Large Sessions

/cx --min-size 10

Result: Processes JSONL >10MB + all exports

Preview First

/cx --dry-run

Result: Shows what would be processed without changes

Verify File Detection

/cx README.md --verify

Result: Shows "unknown" (not a recognized LLM session or export file)

Merge Legacy Stores

/cx --merge

Result: Imports from existing dedup_state stores into unified store

With Semantic Embeddings

/cx --with-embeddings

Result: Extract + index + embeddings for semantic/RAG search (slower)

Fast Extraction Only (No Indexing)

/cx --no-index

Result: Extract messages only, skip SQLite indexing (use when you'll index later)

Register, Index, and Embed a Project (J.15.3-4)

# Step 1: Register project
/cx --register-project ~/my-project

# Step 2: Index source files
/cx --index-project my-project

# Step 3: Generate semantic embeddings
/cx --embed-project my-project

# Force re-embed all files
/cx --embed-project my-project --reembed-all

# List all registered projects
/cx --list-projects

Result: Full project indexing with semantic embeddings for similarity search

Output Format

Unified messages are stored as JSONL with full provenance, LLM source identification, and project attribution (ADR-156):

{
  "hash": "sha256...",
  "content": "Full message text...",
  "role": "assistant",
  "llm_source": "claude",
  "llm_model": "claude-opus-4-5",
  "project_id": "CUST-avivatec-fpa",
  "provenance": {
    "source_type": "export",
    "source_file": "/path/to/file.txt",
    "source_line": 42,
    "session_id": null,
    "checkpoint": "2025-12-10-session"
  },
  "timestamps": {
    "occurred": "2025-12-10T12:00:00Z",
    "extracted_at": "2025-12-10T19:00:00Z"
  },
  "metadata": {
    "content_length": 1247,
    "has_code": true,
    "has_markdown": true
  }
}

LLM Source Values:

| llm_source | llm_model examples |
|------------|--------------------|
| claude | claude-opus-4-5, claude-sonnet-4, claude-haiku-4.5 |
| codex | o1-pro, o3, gpt-4o |
| gemini | gemini-2.0-flash, gemini-2.0-pro |

Workflow

/export              # Export current session to TXT
/cx                  # Process all + auto-index (ONE COMMAND does everything!)
/cxq "search term"   # Search immediately - database ready!

Simplified Pipeline:

  • Before: /export → /cx → /cxq --index → /cxq --extract → /cxq
  • Now: /export → /cx → /cxq (auto-indexing is automatic!)

Project-Scoped Workflow (ADR-156)

# Set project context for the session
export CODITECT_PROJECT=CUST-avivatec-fpa

# Or use /sx with project flag
/sx --llm claude --project CUST-avivatec-fpa

# Process with project attribution
/cx --project CUST-avivatec-fpa

# Query only this project's context
/cxq --decisions --project CUST-avivatec-fpa

Integration

Works with:

  • /export - Create export files to capture
  • /cxq - Query the SQLite database
  • /trajectory - View execution trajectories (ADR-079)

Parallel Post-Processing Pipeline (v5.0.0):

/cx Pipeline
├── Sequential (must be first):
│   ├── Message extraction from JSONL/export
│   ├── Deduplication (hash-based)
│   └── Analytics save to sessions.db (Tier 3)
│
└── Parallel (different tables/resources):
    ├── Trajectory Extraction (H.5.6) → tool_analytics table
    ├── MCP Call Graph Reindex (H.5.5) → functions/edges tables
    ├── Incremental Classify (H.5.7) → .md files only
    └── Session Log Indexing (J.32) → session_log_entries table

| Stage | Target | Runs In |
|-------|--------|---------|
| Trajectory | tool_analytics table | Parallel |
| MCP Reindex | functions, edges tables | Parallel |
| Classify | .md files (no DB) | Parallel |
| Session Logs | session_log_entries + session_log_fts tables | Parallel |

Safe parallelization: WAL mode enabled, each stage writes to different tables/resources.
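The safe-parallelization claim rests on SQLite's WAL journaling, which lets readers proceed while one writer commits. A minimal sketch of how a stage might open the database; the helper name is hypothetical:

```python
import sqlite3

def open_wal(db_path):
    """Open a SQLite database with WAL journaling enabled,
    so parallel stages writing to different tables do not block reads."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")
    return conn
```

Note that WAL is a property of the database file, so any stage opening it afterward inherits the mode.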


Required Summary Format

After execution completes, ALWAYS provide this summary:

✅ Context extraction complete!

Summary:
- LLMs processed: Claude, Codex, Gemini (or specific if --llm used)
- Processed: X JSONL session files + Y export files
- New messages extracted: N unique messages
- Total messages in store: T (was T-N)
- Auto-indexed: SQLite FTS5 database ready
- Knowledge extracted: Decisions, patterns, error solutions

LLM Source Breakdown:
| llm_source | Count | % |
|------------|-------|---|
| claude | X | X%|
| codex | Y | Y%|
| gemini | Z | Z%|

[Top files with new messages if any]

The context database is now fully updated and ready for querying with /cxq.

Success Output

When extraction completes successfully:

✅ COMMAND COMPLETE: /cx
- LLMs: Claude ✓ Codex ✓ Gemini ✓
- Processed: X JSONL session files + Y export files
- New messages extracted: N unique messages
- Total messages in store: T
- Auto-indexed: SQLite FTS5 database ready

LLM Source Distribution:
claude: X messages (X%)
codex: Y messages (Y%)
gemini: Z messages (Z%)

⚡ Parallel execution completed in X.Xs:
- Trajectory: Z new tool calls (duplicates skipped)
- MCP Reindex: N functions, M edges
- Classify: K changed files classified
- Session Logs: E entries from F files (P projects)

Completion Checklist

Before marking complete:

  • All session files processed
  • Export files archived
  • Database indexed for search
  • Summary displayed to user

Failure Indicators

This command has FAILED if:

  • ❌ No session files found
  • ❌ Write permission denied to context-storage/
  • ❌ Indexing failed
  • ❌ Archive directory not created

When NOT to Use

Do NOT use when:

  • In the middle of active session (wait until end)
  • Context storage is locked by another process
  • Disk space critically low

Anti-Patterns (Avoid)

| Anti-Pattern | Problem | Solution |
|--------------|---------|----------|
| Running mid-session | Incomplete capture | Run at session end |
| Skipping indexing (--no-index) | Search unavailable | Keep auto-indexing (default) |
| Not checking output | Miss extraction errors | Verify message counts |

Principles

This command embodies:

  • #1 Recycle, Extend, Re-Use - Preserves knowledge for reuse
  • #3 Complete Execution - Full extraction pipeline
  • #9 Based on Facts - Captures actual session data

Full Standard: CODITECT-STANDARD-AUTOMATION.md


Script: .coditect/scripts/unified-message-extractor.py (via git root)
Version: 4.0.0 (Project Attribution - ADR-156)
Last Updated: 2026-02-04
Related ADRs: ADR-020 (Context Extraction), ADR-114 (User Data Paths), ADR-122 (Unified LLM Architecture), ADR-156 (Project-Scoped Context)