Global Message-Level Deduplicator
A simple deduplication system that cares only about unique message content. There is no session tracking: just one global pool of every unique message ever seen.
Key principle: If we've seen this exact message content before, it's a duplicate. Period. No complex session detection needed.
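That principle can be sketched in a few lines. This is a minimal illustration, not the module's actual implementation: the SHA-256 keying and the `seen` set are assumptions about how a global content pool might work.

```python
import hashlib

seen: set[str] = set()  # one global pool, no per-session tracking

def content_hash(text: str) -> str:
    # Key each message by a hash of its exact content.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def is_new(message: str) -> bool:
    h = content_hash(message)
    if h in seen:
        return False  # exact content seen before -> duplicate, period
    seen.add(h)
    return True
```

The same message content is new exactly once, regardless of which session or export it arrives in.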
Usage:

    dedup = MessageDeduplicator(storage_dir='dedup_state')

    # Process any export - no session ID needed!
    new_messages, stats = dedup.process_export(export_data)

    # Optional: Link to checkpoint for organization
    new_messages, stats = dedup.process_export(
        export_data,
        checkpoint_id="week1-day2"  # For your reference only
    )
Author: Claude + AZ1.AI
License: MIT
File: message_deduplicator.py
Classes
DeduplicationError
Base exception for deduplication errors.
StorageError
Raised when storage operations fail.
ParseError
Raised when export parsing fails.
MessageDeduplicator
Global message-level deduplication.
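The exception hierarchy above can be sketched as follows, assuming the two specific errors subclass the documented base (the conventional layout; the module's actual class bodies may differ):

```python
class DeduplicationError(Exception):
    """Base exception for deduplication errors."""

class StorageError(DeduplicationError):
    """Raised when storage operations fail."""

class ParseError(DeduplicationError):
    """Raised when export parsing fails."""
```

Callers can catch `DeduplicationError` to handle both storage and parsing failures with one handler, or catch the specific subclasses to treat them differently.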
Functions
parse_claude_export_file(filepath)
Parse Claude Code conversation export file.
process_export(export_data, checkpoint_id, dry_run)
Process export and return only new unique messages.
get_statistics()
Get global deduplication statistics.
get_checkpoint_messages(checkpoint_id)
Get message hashes for a specific checkpoint.
get_all_checkpoints()
Get a list of all tracked checkpoint IDs.
reindex(backup)
Rebuild all indices from the unique_messages.jsonl source file.
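Since `unique_messages.jsonl` is the source of truth, rebuilding an index amounts to replaying that file. A hedged sketch, assuming one JSON object per line with a `"hash"` field (the field name and record shape are assumptions, not confirmed by the module):

```python
import json
from pathlib import Path

def rebuild_index(source: Path) -> set[str]:
    # Re-derive the in-memory hash index from the JSONL source file:
    # one JSON object per line, each assumed to carry a "hash" field.
    index: set[str] = set()
    with source.open() as f:
        for line in f:
            line = line.strip()
            if not line:  # tolerate blank lines
                continue
            index.add(json.loads(line)["hash"])
    return index
```

Because the rebuild is a pure function of the JSONL file, a corrupted or stale index can always be discarded and regenerated, which is why `reindex(backup)` can offer to back up the old state first.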
Usage
python message_deduplicator.py