JSONL Session Processor with Deduplication and Watermarking
Purpose:
Process large Claude Code JSONL session files (>10MB) with:
- Read-only access (original files never modified)
- Streaming chunk processing (memory efficient)
- SHA-256 deduplication against global hash pool
- Watermark tracking for resume capability
- Complete statistics and provenance tracking
Safety Guarantees:
- Source files are READ-ONLY (no modifications ever)
- All processing uses streaming (no full file load)
- Watermarks enable resume after failures
- All output goes to MEMORY-CONTEXT/dedup_state/
Output Structure:
MEMORY-CONTEXT/dedup_state/
├── global_hashes.json       (global dedup hash pool)
├── unique_messages.jsonl    (append-only unique messages)
├── session_watermarks.json  (resume tracking)
└── processing_logs/         (detailed execution logs)
Author: Claude + AZ1.AI
License: MIT
File: jsonl_session_processor.py
Classes
SessionWatermark
Track processing progress for resume capability
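A minimal sketch of what such a watermark record might hold. The field names here (session_file, last_line, completed) are assumptions for illustration, not the actual class definition:

```python
from dataclasses import dataclass

@dataclass
class SessionWatermark:
    """Hypothetical watermark record: where processing last stopped."""
    session_file: str       # path to the source JSONL file (assumed field)
    last_line: int = 0      # last fully processed line number (assumed field)
    completed: bool = False

# On restart, resume reading from last_line + 1 instead of line 0.
wm = SessionWatermark("session.jsonl", last_line=1500)
```

Persisting one such record per session (e.g. into session_watermarks.json) is what lets a crashed run pick up where it left off.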
ProcessingStats
Processing statistics for a session
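A sketch of the kind of counters such a stats object could carry; the specific fields and the dedup_ratio helper are assumptions, not the real API:

```python
from dataclasses import dataclass

@dataclass
class ProcessingStats:
    """Hypothetical per-session counters."""
    lines_read: int = 0
    unique_messages: int = 0
    duplicates_skipped: int = 0
    parse_errors: int = 0

    @property
    def dedup_ratio(self) -> float:
        # Fraction of read lines that were duplicates.
        return self.duplicates_skipped / self.lines_read if self.lines_read else 0.0
```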
JSONLSessionProcessor
Process JSONL session files with deduplication and watermarking.
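The SHA-256 deduplication core could look like the following sketch. Hashing a canonicalized form of each JSON line (sorted keys, fixed separators) means two messages with the same content but different key order still collide; whether the real processor canonicalizes this way is an assumption:

```python
import hashlib
import json

def message_hash(line: str) -> str:
    # Hash the canonicalized JSON so key order doesn't affect identity.
    obj = json.loads(line)
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

seen: set[str] = set()  # in practice loaded from global_hashes.json

def is_duplicate(line: str) -> bool:
    h = message_hash(line)
    if h in seen:
        return True
    seen.add(h)
    return False
```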
Functions
find_large_sessions(projects_dir, min_size_mb)
Find all JSONL session files larger than threshold.
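A minimal sketch of this scan, assuming it walks the directory recursively and compares file size against the megabyte threshold:

```python
from pathlib import Path

def find_large_sessions(projects_dir: str, min_size_mb: float = 10.0):
    """Yield .jsonl files under projects_dir larger than min_size_mb."""
    threshold = int(min_size_mb * 1024 * 1024)
    for path in sorted(Path(projects_dir).rglob("*.jsonl")):
        if path.stat().st_size > threshold:
            yield path
```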
process_session(session_file, chunk_size, resume)
Process a single JSONL session file.
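The streaming pass with watermark-based resume might look like this sketch. The file is opened read-only and iterated line by line (never fully loaded); lines at or below the stored watermark are skipped, and the watermark advances once per chunk. The parameter shapes here are assumptions:

```python
import hashlib

def process_session(session_file, chunk_size=1000, resume=True,
                    seen_hashes=None, watermarks=None):
    """Hypothetical streaming pass: skip lines already covered by the
    watermark, then dedup the rest by SHA-256."""
    seen = seen_hashes if seen_hashes is not None else set()
    marks = watermarks if watermarks is not None else {}
    start = marks.get(session_file, 0) if resume else 0
    unique = []
    lineno = start
    with open(session_file, "r", encoding="utf-8") as fh:  # read-only access
        for lineno, line in enumerate(fh, start=1):
            if lineno <= start:
                continue  # already processed in an earlier run
            h = hashlib.sha256(line.strip().encode("utf-8")).hexdigest()
            if h not in seen:
                seen.add(h)
                unique.append(line)
            if lineno % chunk_size == 0:
                marks[session_file] = lineno  # persist watermark per chunk
    marks[session_file] = lineno
    return unique
```

In the real tool the watermark dictionary would be flushed to session_watermarks.json at each chunk boundary, so a crash loses at most one chunk of progress.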
batch_process_sessions(session_files, chunk_size)
Process multiple session files in batch.
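A self-contained sketch of the batch driver's key property, assuming one hash pool is shared across all files so duplicates are caught across sessions, not just within one:

```python
import hashlib

def batch_process_sessions(session_files, chunk_size=1000):
    """Hypothetical batch driver sharing one global hash pool."""
    seen = set()
    totals = {}
    for path in session_files:
        unique = 0
        with open(path, "r", encoding="utf-8") as fh:
            for line in fh:
                h = hashlib.sha256(line.strip().encode("utf-8")).hexdigest()
                if h not in seen:
                    seen.add(h)
                    unique += 1
        totals[path] = unique
    return totals
```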
Usage
python jsonl_session_processor.py