# ADR-181: Incremental Context Extraction and Watcher Role Redefinition

## Status

Accepted - 2026-02-12

## Context

### The Problem

`/cx` (`unified-message-extractor.py`) re-reads every session file on every run: ~3,383 JSONL files totaling 2.7 GB. While hash-based deduplication prevents duplicate messages in the store, the I/O cost of reading unchanged files is wasteful:
| Metric | Current Behavior |
|---|---|
| Files read per run | ~3,383 (all discovered) |
| Data read per run | ~2.7 GB |
| Typical run time | 60-120 seconds |
| New messages per run | Usually <100 (from active sessions) |
| Wasted reads | >95% of I/O is re-reading unchanged files |
Separately, the codi-watcher daemon (ADR-134) has been copying complete session files to `cusf-archive/`, producing 22 GB of redundant snapshots: every export is a full session dump at a point in time, with no deduplication. Lossless verification (2026-02-12) confirmed that 100% of the archive data already exists in `sessions.db`.
### Root Causes

- **No file-level tracking:** `/cx` discovers files via `find_all_llm_sessions()` but has no record of which files it has already processed or their last-seen sizes.
- **No seek-based extraction:** Even if a file grew by only 100 KB since the last run, `/cx` reads the entire file from byte 0.
- **Watcher copies entire files:** `export.rs:execute_builtin_export()` runs `std::fs::copy()`, creating a full snapshot each time a session exceeds 75% context usage. A session that grows from 1 MB to 20 MB generates ~10 snapshots.
- **Format mismatch:** The watcher exports raw Claude JSONL (`type: "user"`), but the CUSF processor expects `type: "message"`. Zero messages have ever been extracted from these watcher exports.
### Existing Incremental Patterns

The codebase already has proven incremental patterns:
| Pattern | Source | Technique |
|---|---|---|
| Session log indexer (J.32) | unified-message-extractor.py:697-920 | mtime-based skip |
| Project indexer (J.15.3) | project_indexer.py | SHA-256 content hash + mtime + UPSERT + prune |
| Component indexer | component-indexer.py | File stat + hash comparison |
## Decision

### 1. Add `processed_session_files` Table to `sessions.db`

Track every session file `/cx` has processed, with byte offsets for seek-based resume:
```sql
CREATE TABLE IF NOT EXISTS processed_session_files (
    file_path             TEXT PRIMARY KEY,
    file_size             INTEGER NOT NULL,
    file_mtime            REAL NOT NULL,
    last_processed_offset INTEGER NOT NULL DEFAULT 0,
    messages_extracted    INTEGER NOT NULL DEFAULT 0,
    llm_source            TEXT NOT NULL,
    first_seen_at         TEXT NOT NULL,
    last_processed_at     TEXT NOT NULL,
    status                TEXT NOT NULL DEFAULT 'active'
);
CREATE INDEX idx_psf_llm ON processed_session_files(llm_source);
CREATE INDEX idx_psf_status ON processed_session_files(status);
```
This table lives in `sessions.db` (Tier 3, regenerable per ADR-118). If lost, `/cx --force-full` rebuilds it.
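Since the table is regenerable and auto-migrated, the migration helper can be a single idempotent script execution. A minimal sketch, assuming a plain `sqlite3` connection (the `IF NOT EXISTS` on the index statements is added here so re-running is always safe):

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS processed_session_files (
    file_path             TEXT PRIMARY KEY,
    file_size             INTEGER NOT NULL,
    file_mtime            REAL NOT NULL,
    last_processed_offset INTEGER NOT NULL DEFAULT 0,
    messages_extracted    INTEGER NOT NULL DEFAULT 0,
    llm_source            TEXT NOT NULL,
    first_seen_at         TEXT NOT NULL,
    last_processed_at     TEXT NOT NULL,
    status                TEXT NOT NULL DEFAULT 'active'
);
CREATE INDEX IF NOT EXISTS idx_psf_llm ON processed_session_files(llm_source);
CREATE INDEX IF NOT EXISTS idx_psf_status ON processed_session_files(status);
"""

def ensure_processed_session_files_table(conn: sqlite3.Connection) -> None:
    # Idempotent: safe to call on every /cx invocation
    conn.executescript(SCHEMA)
    conn.commit()
```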
### 2. Implement 4-Phase Incremental Processing

Replace the current "read everything" loop with a classification-first approach:

**Phase 1: DISCOVERY.** `stat()` all session files (no reads, fast).

**Phase 2: CLASSIFICATION.** Compare each file against `processed_session_files`:
| Classification | Condition | Action |
|---|---|---|
| NEW | file_path not in table | Full extraction from byte 0 |
| GROWN | file_size > stored size | Seek to last_processed_offset, extract new bytes only |
| UNCHANGED | file_size == stored AND file_mtime == stored | Skip entirely |
| SHRUNK | file_size < stored size | Log warning, full re-extract (anomaly) |
| DELETED | In table but not on disk | Mark status = 'deleted' |
**Phase 3: EXTRACTION.** Process only NEW and GROWN files.

**Phase 4: POST-PROCESSING.** Unchanged (trajectory, call graph, classify, session logs).
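The Phase 2 comparison reduces to a pure function over stat results and stored rows. A sketch under assumed input shapes (the exact dict layouts and the handling of a same-size-but-newer-mtime file are illustrative, not the final implementation):

```python
def classify_files(discovered, tracked):
    """Classify session files against stored state.

    discovered: {path: stat result exposing .st_size and .st_mtime}
    tracked:    {path: (file_size, file_mtime)} loaded from processed_session_files
    """
    result = {}
    for path, st in discovered.items():
        if path not in tracked:
            result[path] = "NEW"
            continue
        size, mtime = tracked[path]
        if st.st_size > size:
            result[path] = "GROWN"
        elif st.st_size < size:
            result[path] = "SHRUNK"
        elif st.st_mtime == mtime:
            result[path] = "UNCHANGED"
        else:
            # Same size but newer mtime: treat cautiously as GROWN so it is re-checked
            result[path] = "GROWN"
    # Files in the table but no longer on disk
    for path in tracked:
        if path not in discovered:
            result[path] = "DELETED"
    return result
```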
### 3. Seek-Based Append Optimization

Key insight: Claude Code JSONL session files are append-only. They never truncate, compact, or rewrite existing content. A file that was 5 MB last run and is now 8 MB has exactly 3 MB of new content appended at the end.
```python
import json

def extract_incremental(file_path, last_offset):
    with open(file_path, 'rb') as f:
        f.seek(last_offset)
        new_data = f.read()
    # Parse only complete JSONL lines after the seek point;
    # a trailing partial line is left for the next run
    end = new_data.rfind(b'\n') + 1
    new_messages = [json.loads(line) for line in new_data[:end].splitlines() if line.strip()]
    return new_messages, last_offset + end
```
This avoids re-parsing megabytes of already-extracted content.
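Putting the seek and the bookkeeping together for a GROWN file might look like the following. This is a sketch: `conn` is assumed to be an open `sqlite3` connection to `sessions.db`, and the single-statement UPDATE (plus the hypothetical helper name `process_grown_file`) is an assumed implementation detail:

```python
import json
import sqlite3
from datetime import datetime, timezone

def process_grown_file(conn, file_path, last_offset):
    """Hypothetical GROWN handler: seek, parse complete lines, record the new offset."""
    with open(file_path, "rb") as f:
        f.seek(last_offset)
        new_data = f.read()
    end = new_data.rfind(b"\n") + 1  # ignore a trailing partial line
    entries = [json.loads(ln) for ln in new_data[:end].splitlines() if ln.strip()]
    conn.execute(
        "UPDATE processed_session_files "
        "SET last_processed_offset = ?, last_processed_at = ? WHERE file_path = ?",
        (last_offset + end, datetime.now(timezone.utc).isoformat(), file_path),
    )
    return entries
```

On the next run, the stored `last_processed_offset` is fed straight back in, so an unchanged tail costs one `stat()` and no reads.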
### 4. Redefine Watcher Role

- **Remove:** Session file copying for Claude (`std::fs::copy()` in `export.rs`)
- **Keep:** Context usage monitoring, session detection, token economics, alerting
- **Keep:** Kimi/Codex/Gemini CUSF export pipeline (their format works correctly)

The watcher remains valuable for real-time monitoring; it just stops creating redundant file copies that `/cx` already handles more efficiently via direct reads.
### 5. Add Override Flags

- `--force-full`: Ignore the `processed_session_files` table and process all files from byte 0. Use for rebuilds or after corruption.
- `--incremental-stats`: Show the file classification breakdown (NEW/GROWN/UNCHANGED/SHRUNK/DELETED counts).
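Wired into the CLI, the two flags might look like this `argparse` sketch (the flag names match the decision above; the parser setup around them is illustrative, and the existing `/cx` flags would sit alongside):

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(prog="cx")
    parser.add_argument("--force-full", action="store_true",
                        help="Ignore processed_session_files; process all files from byte 0")
    parser.add_argument("--incremental-stats", action="store_true",
                        help="Show NEW/GROWN/UNCHANGED/SHRUNK/DELETED counts")
    return parser
```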
## Consequences

### Positive
| Impact | Before | After |
|---|---|---|
| Files read per /cx run | ~3,383 | ~10-50 (new/grown only) |
| Data read per run | ~2.7 GB | ~50-200 MB |
| Run time | 60-120s | 5-15s |
| Archive growth | ~22 GB/week | 0 (disabled) |
| Disk waste | 22 GB cusf-archive | 0 (cleaned) |
- Hash-based message dedup remains as a safety layer (belt and suspenders)
- `--force-full` provides an escape hatch for any edge cases
- Pattern reuses the proven J.15.3 and J.32 approaches
### Negative
- Additional table to maintain in sessions.db (low cost — auto-migrated, regenerable)
- Seek-based extraction adds code complexity (~200 lines)
- Must handle SHRUNK edge case (file replaced or corrupted)
### Neutral

- Watcher continues running for monitoring/alerting; no process change
- Export pipeline for Kimi/Codex/Gemini unchanged
- All existing `/cx` flags preserved (`--llm`, `--no-index`, `--dry-run`, etc.)
## Implementation

### Phase 1: Core Incremental Engine (J.33.2)

Modify `unified-message-extractor.py`:

- Add `ensure_processed_session_files_table()`: auto-migrate on first run
- Add a `FileClassifier` class: `classify_files(discovered_files)` returns a dict of `{path: classification}`
  - Uses a single `SELECT * FROM processed_session_files` plus in-memory comparison
- Modify `find_all_llm_sessions()` to return `(path, stat_result)` tuples
- Modify the batch processing loop:
  - UNCHANGED → skip (log at debug level)
  - NEW → full extract, INSERT into `processed_session_files`
  - GROWN → seek extract, UPDATE `processed_session_files`
  - SHRUNK → warn + full re-extract, UPDATE
  - DELETED → UPDATE `status = 'deleted'`
- Add `--force-full` and `--incremental-stats` flags
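The modified batch loop reduces to a dispatch over classifications. A sketch, where the handler registry shape is an assumption (the real loop also carries offsets and DB writes):

```python
def dispatch(classified, handlers):
    """Route each file to its handler; UNCHANGED files are skipped outright.

    classified: {path: 'NEW'|'GROWN'|'UNCHANGED'|'SHRUNK'|'DELETED'}
    handlers:   {classification: callable(path)}  (hypothetical shape)
    """
    processed = []
    for path, cls in classified.items():
        if cls == "UNCHANGED":
            continue  # the common case: no I/O at all
        handlers[cls](path)
        processed.append(path)
    return processed
```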
### Phase 2: Watcher Role Change (J.33.3)

Modify `tools/context-watcher/src/export.rs`:

- Replace `std::fs::copy()` for Claude sessions with a no-op or notification
- Keep monitoring, alerting, and session detection logic unchanged
### Phase 3: Archive Cleanup (J.33.5)

- Manifest already exists: `cusf-archive-manifest-2026-02-12.json`
- Lossless verification complete: 100% redundant (0 unique entries)
- Clean `cusf-archive/` (22 GB, 992 files)
- Clean `sessions-export-pending-anthropic/` (47 MB, 3 files)
## Alternatives Considered

### A. Fix CUSF Processor to Handle Raw JSONL Format

Rejected. This would create duplicate extraction paths: `/cx` already reads the same files directly. The watcher's per-snapshot copies also waste disk without dedup.
### B. Full Content Hashing Per File (Like J.15.3)

Partially adopted. We use `file_size` + `file_mtime` as a cheap proxy for content-change detection (the same approach as the J.32 session log indexer). Full SHA-256 hashing of multi-megabyte JSONL files would be slower than the stat-based approach and is unnecessary given the append-only guarantee.
### C. inotify/FSEvents File Watching

Deferred. OS-level file change notifications would be more efficient than polling with `stat()`, but they add platform-specific complexity. The stat-based approach is fast enough (stat of 3,000 files takes under 100 ms) and cross-platform. It can be added later if needed.
### D. Separate Tracking Database

Rejected. Adding another `.db` file would increase backup complexity and violate the ADR-118 principle of minimizing database count. The `processed_session_files` table fits naturally in `sessions.db` (Tier 3, regenerable).
## References

- Assessment: `internal/analysis/context-watcher/context-watcher-cusf-pipeline-assessment-2026-02-12.md`
- Plan: `/Users/halcasteel/.claude/plans/velvet-wishing-tarjan.md`
- TRACK: J.33 in `TRACK-J-MEMORY-INTELLIGENCE.md`
- Script: `~/.coditect/scripts/unified-message-extractor.py` (v5.8.0)
- Watcher: `tools/context-watcher/src/export.rs`