Global Message-Level Deduplicator

A simple deduplication system that considers only message content. It does no session tracking and keeps a single global pool of every unique message it has seen.

Key principle: a message whose exact content has been seen before is a duplicate, regardless of which session or export it came from. No session detection is needed.
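The key principle can be illustrated with a small sketch. This is not the module's code: the helper names `content_hash` and `filter_new`, the choice of SHA-256, and the canonical-JSON serialization are all assumptions made for illustration.

```python
import hashlib
import json


def content_hash(message: dict) -> str:
    """Hypothetical helper: hash a message by its canonical JSON form."""
    # sort_keys makes the digest independent of key order in the dict.
    canonical = json.dumps(
        message, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def filter_new(messages: list, seen: set) -> list:
    """Return messages whose hash is not yet in `seen`, recording each one."""
    new = []
    for msg in messages:
        digest = content_hash(msg)
        if digest not in seen:
            seen.add(digest)  # the single global pool of known hashes
            new.append(msg)
    return new
```

Because `seen` is one shared set, a message repeated in a later export is dropped no matter which export first contained it.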

Usage:

dedup = MessageDeduplicator(storage_dir='dedup_state')

# Process any export - no session ID needed!
new_messages, stats = dedup.process_export(export_data)

# Optional: Link to checkpoint for organization
new_messages, stats = dedup.process_export(
    export_data,
    checkpoint_id="week1-day2"  # For your reference only
)

Author: Claude + AZ1.AI

License: MIT

File: message_deduplicator.py

Classes

DeduplicationError

Base exception for deduplication errors.

StorageError

Raised when storage operations fail.

ParseError

Raised when export parsing fails.
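The descriptions suggest that StorageError and ParseError specialize DeduplicationError, although this page does not state the inheritance. Assuming they do, a caller could handle both failure modes by catching the base class. The classes below are local stand-ins, and `parse_export_text` and `safe_parse` are hypothetical helpers, not part of the module.

```python
import json


class DeduplicationError(Exception):
    """Stand-in for the module's base exception."""


class StorageError(DeduplicationError):
    """Stand-in: a storage operation failed (inheritance is assumed)."""


class ParseError(DeduplicationError):
    """Stand-in: export parsing failed (inheritance is assumed)."""


def parse_export_text(text: str) -> dict:
    """Hypothetical parser that wraps JSON errors in ParseError."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ParseError(f"invalid export: {exc}") from exc
    if not isinstance(data, dict):
        raise ParseError("export must be a JSON object")
    return data


def safe_parse(text: str):
    """Catch the base class so either subclass is handled in one place."""
    try:
        return parse_export_text(text)
    except DeduplicationError:
        return None
```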

MessageDeduplicator

Global message-level deduplication.

Functions

parse_claude_export_file(filepath)

Parse a Claude Code conversation export file.

process_export(export_data, checkpoint_id, dry_run)

Process export and return only new unique messages.

get_statistics()

Get global deduplication statistics.

get_checkpoint_messages(checkpoint_id)

Get message hashes for a specific checkpoint.

get_all_checkpoints()

Get a list of all tracked checkpoint IDs.

reindex(backup)

Rebuild all indices from the unique_messages.jsonl source file.
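A rough sketch of what rebuilding indices from a JSONL source can look like follows. Only the file name unique_messages.jsonl comes from this page. The record fields (`hash`, `checkpoint_id`), the `index.json` output file, the function name `rebuild_indices`, and the interpretation of `backup` as "copy the old index before overwriting it" are assumptions for illustration.

```python
import json
import shutil
from pathlib import Path


def rebuild_indices(storage_dir: Path, backup: bool = True) -> dict:
    """Hypothetical reindex: derive all indices from the JSONL source."""
    source = storage_dir / "unique_messages.jsonl"
    index_path = storage_dir / "index.json"  # assumed index file name

    # Assumed meaning of `backup`: keep a copy of the old index file.
    if backup and index_path.exists():
        shutil.copy2(index_path, index_path.with_name(index_path.name + ".bak"))

    hashes = []
    checkpoints = {}
    with source.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines in the JSONL file
            record = json.loads(line)
            digest = record["hash"]  # assumed field name
            hashes.append(digest)
            cp = record.get("checkpoint_id")  # assumed field name
            if cp is not None:
                checkpoints.setdefault(cp, []).append(digest)

    index = {"hashes": hashes, "checkpoints": checkpoints}
    index_path.write_text(json.dumps(index, indent=2), encoding="utf-8")
    return index
```

Treating the JSONL file as the single source of truth means a corrupted index can always be regenerated, at the cost of a full scan of the file.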

Usage

python message_deduplicator.py