
ADR-181: Incremental Context Extraction and Watcher Role Redefinition

Status

Accepted - 2026-02-12

Context

The Problem

/cx (unified-message-extractor.py) re-reads every session file on every run — ~3,383 JSONL files totaling 2.7 GB. While hash-based deduplication prevents duplicate messages in the store, the I/O cost of reading unchanged files is wasteful:

| Metric | Current Behavior |
| --- | --- |
| Files read per run | ~3,383 (all discovered) |
| Data read per run | ~2.7 GB |
| Typical run time | 60-120 seconds |
| New messages per run | Usually <100 (from active sessions) |
| Wasted reads | >95% of I/O is re-reading unchanged files |

Separately, the codi-watcher daemon (ADR-134) has been copying complete session files to cusf-archive/, producing 22 GB of redundant snapshots — every export is a full session dump at a point in time, with no deduplication. Lossless verification (2026-02-12) confirmed 100% of archive data already exists in sessions.db.

Root Causes

  1. No file-level tracking: /cx discovers files via find_all_llm_sessions() but has no record of which files it has already processed or their last-seen sizes.

  2. No seek-based extraction: Even if a file grew by 100 KB since last run, /cx reads the entire file from byte 0.

  3. Watcher copies entire files: export.rs:execute_builtin_export() runs std::fs::copy() — creating a full snapshot each time a session exceeds 75% context usage. A session that grows from 1 MB to 20 MB generates ~10 snapshots.

  4. Format mismatch: Watcher exports raw Claude JSONL (type: "user"), but the CUSF processor expects type: "message". Zero messages have ever been extracted from these watcher exports.

Existing Incremental Patterns

The codebase already has proven incremental patterns:

| Pattern | Source | Technique |
| --- | --- | --- |
| Session log indexer (J.32) | unified-message-extractor.py:697-920 | mtime-based skip |
| Project indexer (J.15.3) | project_indexer.py | SHA-256 content hash + mtime + UPSERT + prune |
| Component indexer | component-indexer.py | File stat + hash comparison |

Decision

1. Add processed_session_files Table to sessions.db

Track every session file /cx has processed, with byte offsets for seek-based resume:

CREATE TABLE IF NOT EXISTS processed_session_files (
    file_path TEXT PRIMARY KEY,
    file_size INTEGER NOT NULL,
    file_mtime REAL NOT NULL,
    last_processed_offset INTEGER NOT NULL DEFAULT 0,
    messages_extracted INTEGER NOT NULL DEFAULT 0,
    llm_source TEXT NOT NULL,
    first_seen_at TEXT NOT NULL,
    last_processed_at TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'active'
);
CREATE INDEX idx_psf_llm ON processed_session_files(llm_source);
CREATE INDEX idx_psf_status ON processed_session_files(status);

This table lives in sessions.db (Tier 3 — regenerable per ADR-118). If lost, /cx --force-full rebuilds it.
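A minimal sketch of what the auto-migration plus per-file bookkeeping could look like, assuming Python's sqlite3 and the schema above (`record_file` is an illustrative helper name, not the script's actual API; the sketch adds IF NOT EXISTS to the index statements so re-running is safe):

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS processed_session_files (
    file_path TEXT PRIMARY KEY,
    file_size INTEGER NOT NULL,
    file_mtime REAL NOT NULL,
    last_processed_offset INTEGER NOT NULL DEFAULT 0,
    messages_extracted INTEGER NOT NULL DEFAULT 0,
    llm_source TEXT NOT NULL,
    first_seen_at TEXT NOT NULL,
    last_processed_at TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'active'
);
CREATE INDEX IF NOT EXISTS idx_psf_llm ON processed_session_files(llm_source);
CREATE INDEX IF NOT EXISTS idx_psf_status ON processed_session_files(status);
"""

def record_file(conn, path, size, mtime, offset, n_msgs, source, now):
    """Record a NEW file or advance a GROWN one in a single UPSERT."""
    conn.execute(
        """INSERT INTO processed_session_files
           (file_path, file_size, file_mtime, last_processed_offset,
            messages_extracted, llm_source, first_seen_at, last_processed_at)
           VALUES (?, ?, ?, ?, ?, ?, ?, ?)
           ON CONFLICT(file_path) DO UPDATE SET
             file_size = excluded.file_size,
             file_mtime = excluded.file_mtime,
             last_processed_offset = excluded.last_processed_offset,
             messages_extracted = messages_extracted + excluded.messages_extracted,
             last_processed_at = excluded.last_processed_at""",
        (path, size, mtime, offset, n_msgs, source, now, now),
    )

conn = sqlite3.connect(":memory:")  # sessions.db in practice
conn.executescript(SCHEMA)
# First run sees the file as NEW; second run sees it as GROWN.
record_file(conn, "a.jsonl", 1024, 1.0, 1024, 5, "claude", "2026-02-12T00:00:00")
record_file(conn, "a.jsonl", 2048, 2.0, 2048, 3, "claude", "2026-02-12T01:00:00")
size, offset, msgs = conn.execute(
    "SELECT file_size, last_processed_offset, messages_extracted "
    "FROM processed_session_files WHERE file_path = 'a.jsonl'").fetchone()
print(size, offset, msgs)  # 2048 2048 8
```

Note that `first_seen_at` is deliberately left out of the DO UPDATE clause, so it keeps the timestamp from the file's first appearance.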

2. Implement 4-Phase Incremental Processing

Replace the current "read everything" loop with a classification-first approach:

Phase 1: DISCOVERY — stat() all session files (no reads, fast)

Phase 2: CLASSIFICATION — Compare each file against processed_session_files:

| Classification | Condition | Action |
| --- | --- | --- |
| NEW | file_path not in table | Full extraction from byte 0 |
| GROWN | file_size > stored size | Seek to last_processed_offset, extract new bytes only |
| UNCHANGED | file_size == stored AND file_mtime == stored | Skip entirely |
| SHRUNK | file_size < stored size | Log warning, full re-extract (anomaly) |
| DELETED | In table but not on disk | Mark status = 'deleted' |
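The table above maps directly to a single comparison per file. A minimal sketch, assuming `stored` is the (file_size, file_mtime) pair from processed_session_files or None for never-seen paths (`classify_file` is an illustrative name, not the script's actual API):

```python
import os

def classify_file(path, stored):
    """Classify one session file against its stored (size, mtime) record."""
    if not os.path.exists(path):
        # Tracked paths that vanished from disk are marked, never pruned.
        return "DELETED"
    st = os.stat(path)
    if stored is None:
        return "NEW"
    size, mtime = stored
    if st.st_size > size:
        return "GROWN"
    if st.st_size < size:
        return "SHRUNK"  # anomaly: append-only files should never shrink
    if st.st_mtime == mtime:
        return "UNCHANGED"
    # Same size but different mtime is not covered by the table above; this
    # sketch conservatively treats it like SHRUNK and forces a re-extract.
    return "SHRUNK"
```

Note that only `os.stat()` is involved — no file contents are read during classification.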

Phase 3: EXTRACTION — Process only NEW and GROWN files

Phase 4: POST-PROCESSING — Unchanged (trajectory, call graph, classify, session logs)

3. Seek-Based Append Optimization

Key insight: Claude Code JSONL session files are append-only. They never truncate, compact, or rewrite existing content. A file that was 5 MB last run and is now 8 MB has exactly 3 MB of new content appended at the end.

import json

def extract_incremental(file_path, last_offset):
    with open(file_path, 'rb') as f:
        f.seek(last_offset)
        new_data = f.read()
    # Parse only complete JSONL lines after the seek point; a partial
    # trailing line (still being written) is left for the next run.
    complete, _, _partial = new_data.rpartition(b'\n')
    new_messages = [json.loads(line) for line in complete.splitlines() if line.strip()]
    new_offset = last_offset + len(complete) + 1 if complete else last_offset
    return new_messages, new_offset

This avoids re-parsing megabytes of already-extracted content.
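A self-contained toy run showing the resume behavior across two simulated /cx invocations (the `extract_incremental` here is a minimal reimplementation for the demo, not the script's actual function):

```python
import json
import os
import tempfile

def extract_incremental(file_path, last_offset):
    """Minimal seek-based extractor: parse only bytes appended since last_offset."""
    with open(file_path, 'rb') as f:
        f.seek(last_offset)
        new_data = f.read()
    complete, _, _partial = new_data.rpartition(b'\n')
    msgs = [json.loads(line) for line in complete.splitlines() if line.strip()]
    new_offset = last_offset + len(complete) + 1 if complete else last_offset
    return msgs, new_offset

# Simulate an append-only session file processed across two runs.
fd, path = tempfile.mkstemp(suffix=".jsonl")
os.close(fd)
with open(path, "ab") as f:
    f.write(b'{"type": "message", "n": 1}\n')
run1, offset = extract_incremental(path, 0)       # first run: read from byte 0
with open(path, "ab") as f:
    f.write(b'{"type": "message", "n": 2}\n')
run2, offset = extract_incremental(path, offset)  # second run: only new bytes
print(len(run1), len(run2), run2[0]["n"])  # 1 1 2
os.remove(path)
```

The second call never touches the bytes extracted by the first — exactly the property the append-only guarantee buys.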

4. Redefine Watcher Role

  • Remove: Session file copying for Claude (std::fs::copy() in export.rs)
  • Keep: Context usage monitoring, session detection, token economics, alerting
  • Keep: Kimi/Codex/Gemini CUSF export pipeline (their format works correctly)

The watcher remains valuable for real-time monitoring — it just stops creating redundant file copies that /cx already handles more efficiently via direct reads.

5. Add Override Flags

  • --force-full: Ignore processed_session_files table, process all files from byte 0. Use for rebuilds or after corruption.
  • --incremental-stats: Show file classification breakdown (NEW/GROWN/UNCHANGED/SHRUNK/DELETED counts).

Consequences

Positive

| Impact | Before | After |
| --- | --- | --- |
| Files read per /cx run | ~3,383 | ~10-50 (new/grown only) |
| Data read per run | ~2.7 GB | ~50-200 MB |
| Run time | 60-120s | 5-15s |
| Archive growth | ~22 GB/week | 0 (disabled) |
| Disk waste | 22 GB cusf-archive | 0 (cleaned) |
  • Hash-based message dedup remains as a safety layer (belt and suspenders)
  • --force-full provides escape hatch for any edge cases
  • Pattern reuses proven J.15.3 and J.32 approaches

Negative

  • Additional table to maintain in sessions.db (low cost — auto-migrated, regenerable)
  • Seek-based extraction adds code complexity (~200 lines)
  • Must handle SHRUNK edge case (file replaced or corrupted)

Neutral

  • Watcher continues running for monitoring/alerting — no process change
  • Export pipeline for Kimi/Codex/Gemini unchanged
  • All existing /cx flags preserved (--llm, --no-index, --dry-run, etc.)

Implementation

Phase 1: Core Incremental Engine (J.33.2)

Modify unified-message-extractor.py:

  1. Add ensure_processed_session_files_table() — auto-migrate on first run
  2. Add FileClassifier class:
    • classify_files(discovered_files) → dict of {path: classification}
    • Uses a single SELECT * FROM processed_session_files plus in-memory comparison
  3. Modify find_all_llm_sessions() to return (path, stat_result) tuples
  4. Modify batch processing loop:
    • UNCHANGED → skip (log at debug level)
    • NEW → full extract, INSERT into processed_session_files
    • GROWN → seek extract, UPDATE processed_session_files
    • SHRUNK → warn + full re-extract, UPDATE
    • DELETED → UPDATE status='deleted'
  5. Add --force-full and --incremental-stats flags
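The batch-loop dispatch in step 4 can be sketched as a small pure function (handler names and the dict-of-callables shape are illustrative; the real loop lives inside unified-message-extractor.py):

```python
def process_classified(files, handlers):
    """Dispatch classified files to per-class handlers; return counts.

    files: dict of {path: classification} from the classification phase.
    handlers: dict mapping 'NEW'/'GROWN'/'SHRUNK'/'DELETED' to callables.
    """
    stats = {c: 0 for c in ("NEW", "GROWN", "UNCHANGED", "SHRUNK", "DELETED")}
    for path, cls in files.items():
        stats[cls] += 1
        if cls == "UNCHANGED":
            continue  # skip entirely: no read, no table update
        handlers[cls](path)
    return stats

# Record which handler fired for which path, without touching any real files.
calls = []
handlers = {c: (lambda p, c=c: calls.append((c, p)))
            for c in ("NEW", "GROWN", "SHRUNK", "DELETED")}
stats = process_classified(
    {"a.jsonl": "NEW", "b.jsonl": "UNCHANGED", "c.jsonl": "GROWN"}, handlers)
print(stats["UNCHANGED"], len(calls))  # 1 2
```

The returned counts are also what --incremental-stats would report.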

Phase 2: Watcher Role Change (J.33.3)

Modify tools/context-watcher/src/export.rs:

  • Replace std::fs::copy() for Claude sessions with no-op or notification
  • Keep monitoring, alerting, and session detection logic unchanged

Phase 3: Archive Cleanup (J.33.5)

  • Manifest already exists: cusf-archive-manifest-2026-02-12.json
  • Lossless verification complete: 100% redundant (0 unique entries)
  • Clean cusf-archive/ (22 GB, 992 files)
  • Clean sessions-export-pending-anthropic/ (47 MB, 3 files)

Alternatives Considered

A. Fix CUSF Processor to Handle Raw JSONL Format

Rejected. Would create duplicate extraction paths — /cx already reads the same files directly. The watcher's per-snapshot copies also waste disk without dedup.

B. Full Content Hashing Per File (Like J.15.3)

Partially adopted. We use file_size + mtime as a cheap proxy for content change detection (same approach as J.32 session log indexer). Full SHA-256 hashing of multi-megabyte JSONL files would be slower than the stat-based approach and unnecessary given the append-only guarantee.

C. inotify/FSEvents File Watching

Deferred. OS-level file change notifications would be more efficient than polling with stat(), but adds platform-specific complexity. The stat-based approach is fast enough (stat 3,000 files < 100ms) and cross-platform. Can be added later if needed.

D. Separate Tracking Database

Rejected. Adding another .db file would increase backup complexity and violate the ADR-118 principle of minimizing database count. The processed_session_files table fits naturally in sessions.db (Tier 3, regenerable).

References

  • Assessment: internal/analysis/context-watcher/context-watcher-cusf-pipeline-assessment-2026-02-12.md
  • Plan: /Users/halcasteel/.claude/plans/velvet-wishing-tarjan.md
  • TRACK: J.33 in TRACK-J-MEMORY-INTELLIGENCE.md
  • Script: ~/.coditect/scripts/unified-message-extractor.py (v5.8.0)
  • Watcher: tools/context-watcher/src/export.rs