
ADR-181: Incremental Context Extraction and Watcher Role Redefinition

Status

Accepted - 2026-02-12

Context

The Problem

/cx (unified-message-extractor.py) re-reads every session file on every run — ~3,383 JSONL files totaling 2.7 GB. While hash-based deduplication prevents duplicate messages in the store, the I/O cost of reading unchanged files is wasteful:

| Metric | Current Behavior |
| --- | --- |
| Files read per run | ~3,383 (all discovered) |
| Data read per run | ~2.7 GB |
| Typical run time | 60-120 seconds |
| New messages per run | Usually <100 (from active sessions) |
| Wasted reads | >95% of I/O is re-reading unchanged files |

Separately, the codi-watcher daemon (ADR-134) has been copying complete session files to cusf-archive/, producing 22 GB of redundant snapshots — every export is a full session dump at a point in time, with no deduplication. Lossless verification (2026-02-12) confirmed 100% of archive data already exists in sessions.db.

Root Causes

  1. No file-level tracking: /cx discovers files via find_all_llm_sessions() but has no record of which files it has already processed or their last-seen sizes.

  2. No seek-based extraction: Even if a file grew by 100 KB since last run, /cx reads the entire file from byte 0.

  3. Watcher copies entire files: export.rs:execute_builtin_export() runs std::fs::copy() — creating a full snapshot each time a session exceeds 75% context usage. A session that grows from 1 MB to 20 MB generates ~10 snapshots.

  4. Format mismatch: Watcher exports raw Claude JSONL (type: "user"), but the CUSF processor expects type: "message". Zero messages have ever been extracted from these watcher exports.

Existing Incremental Patterns

The codebase already has proven incremental patterns:

| Pattern | Source | Technique |
| --- | --- | --- |
| Session log indexer (J.32) | unified-message-extractor.py:697-920 | mtime-based skip |
| Project indexer (J.15.3) | project_indexer.py | SHA-256 content hash + mtime + UPSERT + prune |
| Component indexer | component-indexer.py | File stat + hash comparison |

Decision

1. Add processed_session_files Table to sessions.db

Track every session file /cx has processed, with byte offsets for seek-based resume:

CREATE TABLE IF NOT EXISTS processed_session_files (
    file_path TEXT PRIMARY KEY,
    file_size INTEGER NOT NULL,
    file_mtime REAL NOT NULL,
    last_processed_offset INTEGER NOT NULL DEFAULT 0,
    messages_extracted INTEGER NOT NULL DEFAULT 0,
    llm_source TEXT NOT NULL,
    first_seen_at TEXT NOT NULL,
    last_processed_at TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'active'
);
CREATE INDEX idx_psf_llm ON processed_session_files(llm_source);
CREATE INDEX idx_psf_status ON processed_session_files(status);

This table lives in sessions.db (Tier 3 — regenerable per ADR-118). If lost, /cx --force-full rebuilds it.
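A minimal sketch of what the auto-migration plus per-file bookkeeping could look like, assuming Python's sqlite3 and the schema above (`record_file` is an illustrative helper name, not the script's actual API; the sketch adds IF NOT EXISTS to the index statements so re-running is safe):

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS processed_session_files (
    file_path TEXT PRIMARY KEY,
    file_size INTEGER NOT NULL,
    file_mtime REAL NOT NULL,
    last_processed_offset INTEGER NOT NULL DEFAULT 0,
    messages_extracted INTEGER NOT NULL DEFAULT 0,
    llm_source TEXT NOT NULL,
    first_seen_at TEXT NOT NULL,
    last_processed_at TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'active'
);
CREATE INDEX IF NOT EXISTS idx_psf_llm ON processed_session_files(llm_source);
CREATE INDEX IF NOT EXISTS idx_psf_status ON processed_session_files(status);
"""

def record_file(conn, path, size, mtime, offset, n_msgs, source, now):
    """Record a NEW file or advance a GROWN one in a single UPSERT."""
    conn.execute(
        """INSERT INTO processed_session_files
           (file_path, file_size, file_mtime, last_processed_offset,
            messages_extracted, llm_source, first_seen_at, last_processed_at)
           VALUES (?, ?, ?, ?, ?, ?, ?, ?)
           ON CONFLICT(file_path) DO UPDATE SET
             file_size = excluded.file_size,
             file_mtime = excluded.file_mtime,
             last_processed_offset = excluded.last_processed_offset,
             messages_extracted = messages_extracted + excluded.messages_extracted,
             last_processed_at = excluded.last_processed_at""",
        (path, size, mtime, offset, n_msgs, source, now, now),
    )

conn = sqlite3.connect(":memory:")  # sessions.db in practice
conn.executescript(SCHEMA)
# First run sees the file as NEW; second run sees it as GROWN.
record_file(conn, "a.jsonl", 1024, 1.0, 1024, 5, "claude", "2026-02-12T00:00:00")
record_file(conn, "a.jsonl", 2048, 2.0, 2048, 3, "claude", "2026-02-12T01:00:00")
size, offset, msgs = conn.execute(
    "SELECT file_size, last_processed_offset, messages_extracted "
    "FROM processed_session_files WHERE file_path = 'a.jsonl'").fetchone()
print(size, offset, msgs)  # 2048 2048 8
```

Note that `first_seen_at` is deliberately left out of the DO UPDATE clause, so it keeps the timestamp from the file's first appearance.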

2. Implement 4-Phase Incremental Processing

Replace the current "read everything" loop with a classification-first approach:

Phase 1: DISCOVERY — stat() all session files (no reads, fast)

Phase 2: CLASSIFICATION — Compare each file against processed_session_files:

| Classification | Condition | Action |
| --- | --- | --- |
| NEW | file_path not in table | Full extraction from byte 0 |
| GROWN | file_size > stored size | Seek to last_processed_offset, extract new bytes only |
| UNCHANGED | file_size == stored AND file_mtime == stored | Skip entirely |
| SHRUNK | file_size < stored size | Log warning, full re-extract (anomaly) |
| DELETED | In table but not on disk | Mark status = 'deleted' |
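The table above maps directly to a single comparison per file. A minimal sketch, assuming `stored` is the (file_size, file_mtime) pair from processed_session_files or None for never-seen paths (`classify_file` is an illustrative name, not the script's actual API):

```python
import os

def classify_file(path, stored):
    """Classify one session file against its stored (size, mtime) record."""
    if not os.path.exists(path):
        # Tracked paths that vanished from disk are marked, never pruned.
        return "DELETED"
    st = os.stat(path)
    if stored is None:
        return "NEW"
    size, mtime = stored
    if st.st_size > size:
        return "GROWN"
    if st.st_size < size:
        return "SHRUNK"  # anomaly: append-only files should never shrink
    if st.st_mtime == mtime:
        return "UNCHANGED"
    # Same size but different mtime is not covered by the table above; this
    # sketch conservatively treats it like SHRUNK and forces a re-extract.
    return "SHRUNK"
```

Note that only `os.stat()` is involved — no file contents are read during classification.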

Phase 3: EXTRACTION — Process only NEW and GROWN files

Phase 4: POST-PROCESSING — Unchanged (trajectory, call graph, classify, session logs)

3. Seek-Based Append Optimization

Key insight: Claude Code JSONL session files are append-only. They never truncate, compact, or rewrite existing content. A file that was 5 MB last run and is now 8 MB has exactly 3 MB of new content appended at the end.

import json

def extract_incremental(file_path, last_offset):
    with open(file_path, 'rb') as f:
        f.seek(last_offset)
        new_data = f.read()
    # Parse only complete JSONL lines after the seek point; a partial
    # trailing line (still being written) is left for the next run.
    complete, _, _partial = new_data.rpartition(b'\n')
    new_messages = [json.loads(line) for line in complete.splitlines() if line.strip()]
    new_offset = last_offset + len(complete) + 1 if complete else last_offset
    return new_messages, new_offset

This avoids re-parsing megabytes of already-extracted content.
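A self-contained toy run showing the resume behavior across two simulated /cx invocations (the `extract_incremental` here is a minimal reimplementation for the demo, not the script's actual function):

```python
import json
import os
import tempfile

def extract_incremental(file_path, last_offset):
    """Minimal seek-based extractor: parse only bytes appended since last_offset."""
    with open(file_path, 'rb') as f:
        f.seek(last_offset)
        new_data = f.read()
    complete, _, _partial = new_data.rpartition(b'\n')
    msgs = [json.loads(line) for line in complete.splitlines() if line.strip()]
    new_offset = last_offset + len(complete) + 1 if complete else last_offset
    return msgs, new_offset

# Simulate an append-only session file processed across two runs.
fd, path = tempfile.mkstemp(suffix=".jsonl")
os.close(fd)
with open(path, "ab") as f:
    f.write(b'{"type": "message", "n": 1}\n')
run1, offset = extract_incremental(path, 0)       # first run: read from byte 0
with open(path, "ab") as f:
    f.write(b'{"type": "message", "n": 2}\n')
run2, offset = extract_incremental(path, offset)  # second run: only new bytes
print(len(run1), len(run2), run2[0]["n"])  # 1 1 2
os.remove(path)
```

The second call never touches the bytes extracted by the first — exactly the property the append-only guarantee buys.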

4. Redefine Watcher Role

  • Remove: Session file copying for Claude (std::fs::copy() in export.rs)
  • Keep: Context usage monitoring, session detection, token economics, alerting
  • Keep: Kimi/Codex/Gemini CUSF export pipeline (their format works correctly)

The watcher remains valuable for real-time monitoring — it just stops creating redundant file copies that /cx already handles more efficiently via direct reads.

5. Add Override Flags

  • --force-full: Ignore processed_session_files table, process all files from byte 0. Use for rebuilds or after corruption.
  • --incremental-stats: Show file classification breakdown (NEW/GROWN/UNCHANGED/SHRUNK/DELETED counts).

Consequences

Positive

| Impact | Before | After |
| --- | --- | --- |
| Files read per /cx run | ~3,383 | ~10-50 (new/grown only) |
| Data read per run | ~2.7 GB | ~50-200 MB |
| Run time | 60-120s | 5-15s |
| Archive growth | ~22 GB/week | 0 (disabled) |
| Disk waste | 22 GB cusf-archive | 0 (cleaned) |
  • Hash-based message dedup remains as a safety layer (belt and suspenders)
  • --force-full provides escape hatch for any edge cases
  • Pattern reuses proven J.15.3 and J.32 approaches

Negative

  • Additional table to maintain in sessions.db (low cost — auto-migrated, regenerable)
  • Seek-based extraction adds code complexity (~200 lines)
  • Must handle SHRUNK edge case (file replaced or corrupted)

Neutral

  • Watcher continues running for monitoring/alerting — no process change
  • Export pipeline for Kimi/Codex/Gemini unchanged
  • All existing /cx flags preserved (--llm, --no-index, --dry-run, etc.)

Implementation

Phase 1: Core Incremental Engine (J.33.2)

Modify unified-message-extractor.py:

  1. Add ensure_processed_session_files_table() — auto-migrate on first run
  2. Add FileClassifier class:
    • classify_files(discovered_files) → dict of {path: classification}
    • Uses a single SELECT * FROM processed_session_files plus in-memory comparison
  3. Modify find_all_llm_sessions() to return (path, stat_result) tuples
  4. Modify batch processing loop:
    • UNCHANGED → skip (log at debug level)
    • NEW → full extract, INSERT into processed_session_files
    • GROWN → seek extract, UPDATE processed_session_files
    • SHRUNK → warn + full re-extract, UPDATE
    • DELETED → UPDATE status='deleted'
  5. Add --force-full and --incremental-stats flags
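The batch-loop dispatch in step 4 can be sketched as a small pure function (handler names and the dict-of-callables shape are illustrative; the real loop lives inside unified-message-extractor.py):

```python
def process_classified(files, handlers):
    """Dispatch classified files to per-class handlers; return counts.

    files: dict of {path: classification} from the classification phase.
    handlers: dict mapping 'NEW'/'GROWN'/'SHRUNK'/'DELETED' to callables.
    """
    stats = {c: 0 for c in ("NEW", "GROWN", "UNCHANGED", "SHRUNK", "DELETED")}
    for path, cls in files.items():
        stats[cls] += 1
        if cls == "UNCHANGED":
            continue  # skip entirely: no read, no table update
        handlers[cls](path)
    return stats

# Record which handler fired for which path, without touching any real files.
calls = []
handlers = {c: (lambda p, c=c: calls.append((c, p)))
            for c in ("NEW", "GROWN", "SHRUNK", "DELETED")}
stats = process_classified(
    {"a.jsonl": "NEW", "b.jsonl": "UNCHANGED", "c.jsonl": "GROWN"}, handlers)
print(stats["UNCHANGED"], len(calls))  # 1 2
```

The returned counts are also what --incremental-stats would report.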

Phase 2: Watcher Role Change (J.33.3)

Modify tools/context-watcher/src/export.rs:

  • Replace std::fs::copy() for Claude sessions with no-op or notification
  • Keep monitoring, alerting, and session detection logic unchanged

Phase 3: Archive Cleanup (J.33.5)

  • Manifest already exists: cusf-archive-manifest-2026-02-12.json
  • Lossless verification complete: 100% redundant (0 unique entries)
  • Clean cusf-archive/ (22 GB, 992 files)
  • Clean sessions-export-pending-anthropic/ (47 MB, 3 files)

Alternatives Considered

A. Fix CUSF Processor to Handle Raw JSONL Format

Rejected. Would create duplicate extraction paths — /cx already reads the same files directly. The watcher's per-snapshot copies also waste disk without dedup.

B. Full Content Hashing Per File (Like J.15.3)

Partially adopted. We use file_size + mtime as a cheap proxy for content change detection (same approach as J.32 session log indexer). Full SHA-256 hashing of multi-megabyte JSONL files would be slower than the stat-based approach and unnecessary given the append-only guarantee.

C. inotify/FSEvents File Watching

Deferred. OS-level file change notifications would be more efficient than polling with stat(), but adds platform-specific complexity. The stat-based approach is fast enough (stat 3,000 files < 100ms) and cross-platform. Can be added later if needed.

D. Separate Tracking Database

Rejected. Adding another .db file would increase backup complexity and violate the ADR-118 principle of minimizing database count. The processed_session_files table fits naturally in sessions.db (Tier 3, regenerable).

References

  • Assessment: internal/analysis/context-watcher/context-watcher-cusf-pipeline-assessment-2026-02-12.md
  • Plan: /Users/halcasteel/.claude/plans/velvet-wishing-tarjan.md
  • TRACK: J.33 in TRACK-J-MEMORY-INTELLIGENCE.md
  • Script: ~/.coditect/scripts/unified-message-extractor.py (v5.8.0)
  • Watcher: tools/context-watcher/src/export.rs