JSONL Session Processor with Deduplication and Watermarking

Purpose:

Process large Claude Code JSONL session files (>10MB) with:

  • Read-only access (original files never modified)
  • Streaming chunk processing (memory efficient)
  • SHA-256 deduplication against global hash pool
  • Watermark tracking for resume capability
  • Complete statistics and provenance tracking
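The SHA-256 deduplication step can be sketched as follows. This is an illustrative example only, not the module's actual implementation: the `message_hash` and `dedup` names, and the choice to canonicalize with sorted JSON keys, are assumptions.

```python
import hashlib
import json

def message_hash(message: dict) -> str:
    # Hash a canonical JSON form so key order does not affect the digest
    # (hypothetical canonicalization; the module's scheme is not shown here).
    canonical = json.dumps(message, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def dedup(lines, seen: set):
    # Yield only messages whose hash is absent from the global pool,
    # updating the pool as we go.
    for line in lines:
        msg = json.loads(line)
        h = message_hash(msg)
        if h not in seen:
            seen.add(h)
            yield msg

seen: set = set()
lines = [
    '{"role": "user", "text": "hi"}',
    '{"text": "hi", "role": "user"}',  # same content, different key order
    '{"role": "assistant", "text": "hello"}',
]
unique = list(dedup(lines, seen))  # the second line is dropped as a duplicate
```

Canonicalizing before hashing means reordered-but-identical messages collapse to one entry in the pool.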

Safety Guarantees:

  1. Source files are READ-ONLY (no modifications ever)
  2. All processing uses streaming (no full file load)
  3. Watermarks enable resume after failures
  4. All output goes to MEMORY-CONTEXT/dedup_state/

Output Structure:

MEMORY-CONTEXT/dedup_state/
├── global_hashes.json        (global dedup hash pool)
├── unique_messages.jsonl     (append-only unique messages)
├── session_watermarks.json   (resume tracking)
└── processing_logs/          (detailed execution logs)

Author: Claude + AZ1.AI
License: MIT

File: jsonl_session_processor.py

Classes

SessionWatermark

Track processing progress for resume capability.
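A minimal sketch of what such a watermark might hold, assuming it records a byte offset and line count and round-trips through JSON for `session_watermarks.json`. The field names here are hypothetical; the actual class may differ.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class SessionWatermark:
    # Illustrative fields only; the module's real fields are not shown here.
    session_path: str
    byte_offset: int = 0      # resume point within the source file
    lines_processed: int = 0
    last_hash: str = ""       # hash of the last message consumed

    def to_json(self) -> str:
        # Serialize for persistence in session_watermarks.json.
        return json.dumps(asdict(self))

wm = SessionWatermark("session-abc.jsonl", byte_offset=4096, lines_processed=12)
restored = SessionWatermark(**json.loads(wm.to_json()))  # round-trip
```

Persisting the byte offset after each chunk is what lets a crashed run resume without re-reading the whole file.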

ProcessingStats

Processing statistics for a session.
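A sketch of per-session counters, assuming the stats track totals and duplicates; the field names and the derived `dedup_ratio` property are illustrative, not the module's actual API.

```python
from dataclasses import dataclass

@dataclass
class ProcessingStats:
    # Hypothetical counters; the real class may track more fields.
    total_messages: int = 0
    unique_messages: int = 0
    duplicates: int = 0

    @property
    def dedup_ratio(self) -> float:
        # Fraction of messages dropped as duplicates (0.0 for empty input).
        return self.duplicates / self.total_messages if self.total_messages else 0.0

stats = ProcessingStats(total_messages=100, unique_messages=75, duplicates=25)
```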

JSONLSessionProcessor

Process JSONL session files with deduplication and watermarking.

Functions

find_large_sessions(projects_dir, min_size_mb)

Find all JSONL session files larger than threshold.
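One plausible implementation of this signature, assuming a recursive glob over `projects_dir` with a size check via `stat()`; this is a sketch, not the module's code.

```python
from pathlib import Path

def find_large_sessions(projects_dir: str, min_size_mb: float = 10.0) -> list:
    # Recursively collect .jsonl files above the size threshold,
    # sorted for deterministic batch ordering.
    threshold = min_size_mb * 1024 * 1024
    return sorted(
        p for p in Path(projects_dir).rglob("*.jsonl")
        if p.stat().st_size > threshold
    )
```

The files are only stat'd here, never opened, so the read-only guarantee holds trivially at this stage.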

process_session(session_file, chunk_size, resume)

Process a single JSONL session file.
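The streaming-with-watermark idea behind this function can be sketched as below. Assumptions are labeled: the `start_offset` parameter and the `(chunk, offset)` yield shape are illustrative, and the real function also persists watermarks and writes unique messages, which this sketch omits.

```python
import json

def process_session(session_file: str, chunk_size: int = 1000, start_offset: int = 0):
    # Stream a JSONL file from a byte offset, yielding parsed messages in
    # fixed-size chunks. Binary mode keeps byte offsets exact for resume.
    chunk, offset = [], start_offset
    with open(session_file, "rb") as f:  # read-only: source is never modified
        f.seek(start_offset)
        for raw in f:
            offset += len(raw)
            line = raw.decode("utf-8").strip()
            if not line:
                continue
            chunk.append(json.loads(line))
            if len(chunk) >= chunk_size:
                yield chunk, offset  # offset doubles as the resume watermark
                chunk = []
    if chunk:
        yield chunk, offset
```

Because each yielded offset points just past the last consumed line, a resumed run can seek straight to it and see no message twice.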

batch_process_sessions(session_files, chunk_size)

Process multiple session files in batch.
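The key property of batch mode is that all sessions share one hash pool, so duplicates are caught across files as well as within them. A self-contained sketch under that assumption (the return shape and hashing of raw lines are illustrative; watermarking and logging are elided):

```python
import hashlib

def batch_process_sessions(session_files, chunk_size: int = 1000) -> dict:
    # One shared SHA-256 pool across every session in the batch.
    seen: set = set()
    stats = {"total": 0, "unique": 0}
    for path in session_files:
        with open(path, "r", encoding="utf-8") as f:  # read-only
            for line in f:
                line = line.strip()
                if not line:
                    continue
                stats["total"] += 1
                h = hashlib.sha256(line.encode("utf-8")).hexdigest()
                if h not in seen:
                    seen.add(h)
                    stats["unique"] += 1
    return stats
```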

Usage

python jsonl_session_processor.py