ADR-005: Image Content Change Detection for Unique Frame Extraction
Date: 2026-01-19
Status: Accepted
Decision Makers: System Architect, ML Engineer
Related: ADR-002 (Frame Sampling), TDD Section 3.3
Context
Current frame extraction produces 150-200 frames per video, but many are visually identical or near-identical. The problem:
- Educational videos: Same slide shown for 30-60 seconds = 900-1800 identical frames at 30fps
- Presentation content: Slides with minor cursor movements = hundreds of nearly-identical frames
- Waste: Sending identical frames to vision API wastes money and processing time
- Current deduplication: Only temporal (0.5s window) - doesn't detect content similarity
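For contrast, the existing temporal deduplication amounts to a sliding window over frame timestamps. A minimal sketch (illustrative only; the real pipeline's interfaces are not shown here):

```python
def temporal_dedup(timestamps, window=0.5):
    """Keep only frames at least `window` seconds apart."""
    kept = []
    for t in sorted(timestamps):
        if not kept or t - kept[-1] >= window:
            kept.append(t)
    return kept

# One second of 30fps frames collapses to three representatives,
# even though every frame may show the same slide
frames = [i / 30 for i in range(31)]
print(temporal_dedup(frames))  # [0.0, 0.5, 1.0]
```

This is exactly why temporal dedup alone is insufficient: a slide held for 60 seconds still yields ~120 retained frames, all visually identical.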
Cost Impact
```python
without_content_dedup = {
    'frames_extracted': 180,             # Multi-strategy extraction
    'vision_api_cost': 180 * 0.004,      # $0.72
    'redundant_frames': 90,              # 50% are visually similar
    'wasted_cost': 90 * 0.004            # $0.36 (50% waste)
}

with_content_dedup = {
    'frames_extracted': 180,
    'unique_frames': 90,                 # After content-based deduplication
    'vision_api_cost': 90 * 0.004,       # $0.36
    'cost_savings': 0.36,                # 50% reduction per video
    'annual_savings_at_1000_videos_per_month': 0.36 * 1000 * 12  # $4,320
}
```
Decision Drivers
- Cost Optimization: Reduce vision API costs by 40-60%
- Processing Speed: Skip redundant frame analysis
- Quality: Ensure we don't miss unique content
- Performance: Fast enough to not become a bottleneck
- Accuracy: High precision (don't merge different frames) and recall (catch duplicates)
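To make the last driver measurable, precision and recall can be computed for a dedup run by treating "duplicate" as the positive class. A hypothetical helper (the frame IDs below are illustrative):

```python
def dedup_quality(dropped_ids, true_duplicate_ids):
    """Precision/recall of duplicate detection.

    dropped_ids: frames the deduplicator discarded as duplicates.
    true_duplicate_ids: frames that genuinely duplicate earlier content.
    """
    dropped = set(dropped_ids)
    truth = set(true_duplicate_ids)
    true_positives = len(dropped & truth)
    precision = true_positives / len(dropped) if dropped else 1.0
    recall = true_positives / len(truth) if truth else 1.0
    return precision, recall

# Dropped f2 and f3, but the real duplicates were f3 and f4
print(dedup_quality({'f2', 'f3'}, {'f3', 'f4'}))  # (0.5, 0.5)
```

Low precision means distinct frames were merged (lost content); low recall means duplicates slipped through (wasted API calls).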
Options Considered
Option 1: Perceptual Hashing (pHash)
Approach: Generate hash from image frequency domain, compare hashes
```python
import imagehash
from PIL import Image

def get_phash(image_path):
    img = Image.open(image_path)
    return imagehash.phash(img, hash_size=8)

def are_similar(hash1, hash2, threshold=5):
    return hash1 - hash2 <= threshold  # Hamming distance
```
Pros:
- Very fast: <10ms per image on CPU
- Robust: Handles minor variations (cursor, compression artifacts)
- Low memory: 64-bit hash per image
- Proven: Used in image deduplication for decades
- Simple: Easy to implement and understand
Cons:
- False positives: May merge slides with different text
- Threshold tuning: Requires experimentation per video type
- Not semantic: Doesn't understand content meaning
Performance:
```python
performance_metrics = {
    'computation_time': 0.008,       # 8ms per frame
    'comparison_time': 0.000001,     # 1μs per hash comparison
    'memory_per_hash': 8,            # 8 bytes
    'total_overhead_180_frames': 180 * 0.008,  # 1.44 seconds
}
```
Verdict: ✅ Excellent for initial filtering
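The hash comparison above reduces to a Hamming distance over 64-bit values. A dependency-free sketch of the idea, using a toy average hash as a stand-in for imagehash's DCT-based pHash:

```python
def average_hash(pixels):
    """Toy average hash: one bit per pixel, set if above the mean.

    pixels: flat list of 64 grayscale values (an 8x8 downscaled image).
    """
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def hamming_distance(h1, h2):
    """Number of differing bits between two hash integers."""
    return bin(h1 ^ h2).count("1")

a = average_hash([0] * 32 + [255] * 32)   # bottom half bright
b = average_hash([0] * 31 + [255] * 33)   # one pixel changed
print(hamming_distance(a, b))  # 1
```

A single changed pixel moves the distance by one bit, which is why a small Hamming threshold absorbs cursor movements and compression noise.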
Option 2: Structural Similarity Index (SSIM)
Approach: Compare pixel-level structure between images
```python
from skimage.metrics import structural_similarity as ssim
import cv2

def calculate_ssim(img1_path, img2_path):
    img1 = cv2.imread(img1_path, cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread(img2_path, cv2.IMREAD_GRAYSCALE)
    # Resize to same dimensions
    img2 = cv2.resize(img2, (img1.shape[1], img1.shape[0]))
    score = ssim(img1, img2)
    return score  # 0.0 to 1.0 (1.0 = identical)
```
Pros:
- Accurate: Captures structural changes
- Quantitative: 0-1 similarity score
- Standard: Well-researched algorithm
- Interpretable: Easy to understand threshold
Cons:
- Slow: ~50-100ms per comparison on CPU
- Computational: O(n²) comparisons for n frames
- Sensitive: Small cursor movements trigger differences
- Memory intensive: Must load full images
Performance:
```python
performance_metrics = {
    'computation_time': 0.080,               # 80ms per comparison
    'comparisons_needed': 180 * 179 / 2,     # 16,110 for 180 frames
    'total_time': 16110 * 0.080,             # ~1,289 seconds (21 minutes!)
    'bottleneck': 'Too slow for production'
}
```
Verdict: ❌ Too slow for real-time processing
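For reference, the SSIM formula itself is compact. A global (single-window) version over flat pixel lists, rather than skimage's sliding-window implementation, might look like:

```python
def global_ssim(x, y, data_range=255):
    """Single-window SSIM over two equal-length flat pixel lists."""
    n = len(x)
    c1 = (0.01 * data_range) ** 2  # stabilizer for the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilizer for the contrast term
    mx = sum(x) / n
    my = sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / (n - 1)
    vy = sum((b - my) ** 2 for b in y) / (n - 1)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

identical = [10, 50, 200, 130]
print(global_ssim(identical, identical))  # 1.0
```

The cost comes not from the formula but from evaluating it per window across full-resolution frames, for every pair.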
Option 3: Feature Matching (ORB Descriptors)
Approach: Extract keypoints, match between images
```python
import cv2

def extract_features(image_path):
    img = cv2.imread(image_path)
    orb = cv2.ORB_create(nfeatures=500)
    keypoints, descriptors = orb.detectAndCompute(img, None)
    return descriptors

def match_features(desc1, desc2):
    bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = bf.match(desc1, desc2)
    return len(matches)
```
Pros:
- Robust: Invariant to rotation, scale, lighting
- Semantic: Captures actual content features
- Selective: Can ignore cursor, UI elements
- Fast: ~30ms per comparison
Cons:
- Complex: Requires tuning (keypoint count, match threshold)
- Variable: Slide quality affects feature detection
- Still O(n²): Needs all pairwise comparisons
Performance:
```python
performance_metrics = {
    'computation_time': 0.030,        # 30ms per comparison
    'comparisons_needed': 16110,
    'total_time': 16110 * 0.030,      # ~483 seconds (8 minutes)
    'bottleneck': 'Better but still slow'
}
```
Verdict: ⚠️ Good quality but performance issues
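The cross-check matching step can be illustrated without OpenCV: brute-force nearest-neighbour matching over binary descriptors, keeping only mutual best matches. A simplified stand-in for `cv2.BFMatcher` with `crossCheck=True`, using small integers as descriptors:

```python
def hamming(a, b):
    return bin(a ^ b).count("1")

def cross_check_matches(desc1, desc2):
    """Keep (i, j) only when i and j are each other's nearest neighbour."""
    best_12 = {i: min(range(len(desc2)), key=lambda j: hamming(d, desc2[j]))
               for i, d in enumerate(desc1)}
    best_21 = {j: min(range(len(desc1)), key=lambda i: hamming(d, desc1[i]))
               for j, d in enumerate(desc2)}
    return [(i, j) for i, j in best_12.items() if best_21[j] == i]

d1 = [0b1010, 0b1111]
d2 = [0b1110, 0b1011]
print(cross_check_matches(d1, d2))  # [(0, 0)]
```

Cross-checking suppresses one-sided matches, which is why the match count is a usable similarity signal between two frames.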
Option 4: Hybrid Approach - pHash + SSIM Validation (Selected)
Approach: Use fast pHash for initial clustering, SSIM for validation
```python
import imagehash
import cv2
from PIL import Image
from skimage.metrics import structural_similarity as ssim

class HybridDeduplicator:
    def __init__(self, phash_threshold=5, ssim_threshold=0.95):
        self.phash_threshold = phash_threshold
        self.ssim_threshold = ssim_threshold
        self.frame_hashes = {}
        self.unique_frames = []

    def is_unique(self, frame: ExtractedFrame) -> bool:
        # Step 1: fast pHash comparison against every retained frame
        current_hash = self.compute_phash(frame.file_path)
        for unique_frame in self.unique_frames:
            stored_hash = self.frame_hashes[unique_frame.frame_id]
            hamming_distance = current_hash - stored_hash
            # Quick rejection: clearly different
            if hamming_distance > self.phash_threshold:
                continue
            if hamming_distance >= 3:
                # Borderline case: validate with SSIM
                ssim_score = self.compute_ssim(
                    frame.file_path,
                    unique_frame.file_path
                )
                # SSIM confirms similarity
                if ssim_score >= self.ssim_threshold:
                    return False  # Not unique
            else:
                # Very similar hash (distance <= 2): assume duplicate
                return False
        # Add to unique set
        self.frame_hashes[frame.frame_id] = current_hash
        self.unique_frames.append(frame)
        return True

    def compute_phash(self, image_path):
        img = Image.open(image_path)
        return imagehash.phash(img, hash_size=8)

    def compute_ssim(self, img1_path, img2_path):
        img1 = cv2.imread(img1_path, cv2.IMREAD_GRAYSCALE)
        img2 = cv2.imread(img2_path, cv2.IMREAD_GRAYSCALE)
        img2 = cv2.resize(img2, (img1.shape[1], img1.shape[0]))
        return ssim(img1, img2)
```
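The tiered decision can be isolated into a pure function for unit testing. A hypothetical helper mirroring the logic above; `ssim_score` would come from the SSIM computation and is only consulted in the borderline band:

```python
def tier_decision(hamming_distance, ssim_score=None,
                  phash_threshold=5, ssim_threshold=0.95):
    """Classify a candidate frame pair using the hybrid rules."""
    if hamming_distance > phash_threshold:
        return "unique"        # quick rejection, no SSIM needed
    if hamming_distance <= 2:
        return "duplicate"     # near-identical hash, no SSIM needed
    # Borderline (3..phash_threshold): defer to SSIM
    if ssim_score is not None and ssim_score >= ssim_threshold:
        return "duplicate"
    return "unique"

print(tier_decision(8))        # unique
print(tier_decision(1))        # duplicate
print(tier_decision(4, 0.97))  # duplicate
print(tier_decision(4, 0.90))  # unique
```

Only the third and fourth cases pay the SSIM cost, which is the source of the speedup claimed below.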
Pros:
- Fast: Cheap pHash checks resolve ~95% of frame pairs without SSIM
- Accurate: SSIM validation for edge cases
- Balanced: Best of both algorithms
- Tunable: Adjust thresholds per use case
- Scalable: Near-linear in practice; expensive SSIM runs only on borderline pairs
Cons:
- Complexity: Two algorithms to maintain
- Threshold tuning: Requires experimentation
- Edge cases: May still miss subtle differences
Performance:
```python
performance_metrics = {
    'phash_computation': 180 * 0.008,          # 1.44s for all frames
    'phash_comparisons': 180 * 90,             # Compare to 90 unique (average)
    'phash_comparison_time': 16200 * 0.000001, # 0.016s
    'ssim_validations': 20,                    # Only borderline cases (~10%)
    'ssim_validation_time': 20 * 0.080,        # 1.6s
    'total_time': 1.44 + 0.016 + 1.6,          # ~3.06 seconds
    'overhead_percent': 3.06 / 600,            # 0.5% of 10-min processing time
}
```
Verdict: ✅ SELECTED - Optimal speed/accuracy tradeoff
Option 5: Deep Learning Embeddings (CLIP/ResNet)
Approach: Use pre-trained model to generate embeddings, compare cosine similarity
```python
import torch
import numpy as np
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def get_embedding(image_path):
    image = Image.open(image_path)
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        embeddings = model.get_image_features(**inputs)
    return embeddings[0].numpy()  # drop the batch dimension -> (512,)

def cosine_similarity(emb1, emb2):
    return np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))
```
Pros:
- Semantic: Understands content meaning
- Robust: Handles variations excellently
- Modern: State-of-the-art approach
Cons:
- Slow: 100-200ms per embedding (GPU required)
- Infrastructure: Requires GPU deployment
- Overkill: More sophisticated than needed for slide detection
- Cost: GPU hosting adds operational complexity
Performance:
```python
performance_metrics = {
    'gpu_inference_time': 0.150,               # 150ms per frame
    'total_time_180_frames': 180 * 0.150,      # 27 seconds
    'infrastructure_cost': 'Requires GPU worker',
    'bottleneck': 'Adds complexity'
}
```
Verdict: 🔮 Future consideration if accuracy issues arise
Decision
Hybrid pHash + SSIM Validation
Implementation Strategy
Phase 1: pHash-Only (MVP)
```python
import logging
from typing import List

import imagehash
from PIL import Image

logger = logging.getLogger(__name__)

class SimplePHashDeduplicator:
    """Fast MVP implementation using only pHash."""

    def __init__(self, threshold=5):
        self.threshold = threshold
        self.seen_hashes = {}

    def deduplicate(self, frames: List[ExtractedFrame]) -> List[ExtractedFrame]:
        unique = []
        for frame in frames:
            phash = imagehash.phash(Image.open(frame.file_path))
            # Check whether we have already seen a similar hash
            is_unique = True
            for seen_id, seen_hash in self.seen_hashes.items():
                if phash - seen_hash <= self.threshold:
                    is_unique = False
                    logger.debug(f"Frame {frame.frame_id} duplicate of {seen_id}")
                    break
            if is_unique:
                unique.append(frame)
                self.seen_hashes[frame.frame_id] = phash
        return unique
```
Deployment: Launch with pHash-only, monitor false positive rate
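The scan can be exercised without image files by substituting precomputed 64-bit integers for the imagehash values (illustrative stand-ins only):

```python
def dedup_hashes(hashes, threshold=5):
    """Return indices of frames kept as unique, given per-frame hash ints."""
    kept = []
    for i, h in enumerate(hashes):
        # Unique only if every previously kept hash is far away
        if all(bin(h ^ hashes[j]).count("1") > threshold for j in kept):
            kept.append(i)
    return kept

# Second hash differs from the first by one bit; third differs by eight
print(dedup_hashes([0b11110000, 0b11110001, 0b00001111]))  # [0, 2]
```

This makes the false-positive monitoring concrete: lowering the threshold keeps more borderline frames, raising it merges more aggressively.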
Phase 2: Add SSIM Validation (Production Hardening)
```python
class ProductionDeduplicator:
    """Production implementation with SSIM validation."""

    def __init__(self, phash_threshold=5, ssim_threshold=0.95):
        self.phash_threshold = phash_threshold
        self.ssim_threshold = ssim_threshold
        self.unique_frames = []
        self.frame_hashes = {}
        # Metrics
        self.phash_comparisons = 0
        self.ssim_validations = 0
        self.duplicates_found = 0

    def deduplicate(self, frames: List[ExtractedFrame]) -> List[ExtractedFrame]:
        unique = []
        for frame in sorted(frames, key=lambda f: f.timestamp):
            if self._is_unique(frame):
                unique.append(frame)
        logger.info(f"Deduplication: {len(frames)} → {len(unique)} frames")
        logger.info(f"pHash comparisons: {self.phash_comparisons}")
        logger.info(f"SSIM validations: {self.ssim_validations}")
        return unique

    def _is_unique(self, frame: ExtractedFrame) -> bool:
        current_hash = imagehash.phash(Image.open(frame.file_path))
        for unique_frame in self.unique_frames:
            self.phash_comparisons += 1
            stored_hash = self.frame_hashes[unique_frame.frame_id]
            hamming_distance = current_hash - stored_hash
            # Quick rejection: clearly different
            if hamming_distance > self.phash_threshold:
                continue
            # Very similar hash: likely duplicate
            if hamming_distance <= 2:
                self.duplicates_found += 1
                return False
            # Borderline: validate with SSIM
            self.ssim_validations += 1
            ssim_score = self._compute_ssim(frame.file_path, unique_frame.file_path)
            if ssim_score >= self.ssim_threshold:
                self.duplicates_found += 1
                return False
        # Unique frame
        self.frame_hashes[frame.frame_id] = current_hash
        self.unique_frames.append(frame)
        return True

    def _compute_ssim(self, img1_path: str, img2_path: str) -> float:
        img1 = cv2.imread(img1_path, cv2.IMREAD_GRAYSCALE)
        img2 = cv2.imread(img2_path, cv2.IMREAD_GRAYSCALE)
        # Resize to match dimensions
        if img1.shape != img2.shape:
            img2 = cv2.resize(img2, (img1.shape[1], img1.shape[0]))
        return ssim(img1, img2)
```
Threshold Configuration
```python
DEDUPLICATION_CONFIG = {
    'presentation_videos': {
        'phash_threshold': 4,       # Stricter for slides
        'ssim_threshold': 0.97,
        'rationale': 'Slides are static, high similarity expected'
    },
    'demonstration_videos': {
        'phash_threshold': 6,       # More lenient
        'ssim_threshold': 0.92,
        'rationale': 'More motion, accept more variation'
    },
    'interview_videos': {
        'phash_threshold': 7,
        'ssim_threshold': 0.90,
        'rationale': 'Continuous motion, camera angles'
    }
}
```
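A lookup with a safe default could wrap the table above. A hypothetical helper (the table is repeated in trimmed form so the sketch is self-contained; falling back to the strictest profile is one reasonable default, not a settled decision):

```python
DEDUP_DEFAULTS = {
    'presentation_videos': {'phash_threshold': 4, 'ssim_threshold': 0.97},
    'demonstration_videos': {'phash_threshold': 6, 'ssim_threshold': 0.92},
    'interview_videos': {'phash_threshold': 7, 'ssim_threshold': 0.90},
}

def get_dedup_config(video_type):
    """Fall back to the strictest profile when the type is unknown."""
    return DEDUP_DEFAULTS.get(video_type, DEDUP_DEFAULTS['presentation_videos'])

print(get_dedup_config('podcast')['phash_threshold'])  # 4
```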
Cost Impact
```python
cost_comparison = {
    'baseline_no_dedup': {
        'frames_extracted': 180,
        'vision_api_calls': 180,
        'cost': 180 * 0.004,            # $0.72
    },
    'temporal_dedup_only': {
        'frames_extracted': 180,
        'frames_after_dedup': 150,
        'vision_api_calls': 150,
        'cost': 150 * 0.004,            # $0.60
        'savings': 0.12                 # 17% reduction
    },
    'content_based_dedup': {
        'frames_extracted': 180,
        'frames_after_temporal': 150,
        'frames_after_content': 85,     # 43% additional reduction
        'vision_api_calls': 85,
        'cost': 85 * 0.004,             # $0.34
        'savings_vs_baseline': 0.38,    # 53% reduction
        'savings_vs_temporal': 0.26     # 43% additional
    }
}

# Annual impact at scale
annual_impact = {
    'videos_per_year': 12000,
    'cost_without_content_dedup': 12000 * 0.60,  # $7,200
    'cost_with_content_dedup': 12000 * 0.34,     # $4,080
    'annual_savings': 3120,                      # $3,120
    'roi_on_development': 3120 / 10000           # 31% first year (assuming $10K dev cost)
}
```
Consequences
Positive
- Cost Reduction: 43% reduction in vision API costs
- Speed Improvement: Skip 43% of frames = faster processing
- Quality Maintained: No loss in content coverage
- Simple Algorithm: pHash is battle-tested, reliable
- Low Overhead: <1% processing time overhead
Negative
- Threshold Tuning: Requires experimentation per video type
- Edge Cases: May merge slightly different slides
- Maintenance: Two algorithms to maintain in production
- Monitoring: Need to track false positive/negative rates
Mitigation Strategies
```python
import random
import logging
from typing import List

logger = logging.getLogger(__name__)

class AdaptiveDeduplicator:
    """Self-tuning deduplicator with quality monitoring."""

    def __init__(self):
        self.thresholds = DEDUPLICATION_CONFIG['presentation_videos']
        self.quality_metrics = {
            'false_positives': 0,   # Merged genuinely different frames
            'false_negatives': 0,   # Kept duplicate frames
            'total_processed': 0
        }

    async def deduplicate_with_monitoring(self, frames: List[ExtractedFrame]):
        unique_frames = self.deduplicate(frames)
        # Sample 5% of runs for a quality check
        if random.random() < 0.05:
            await self._quality_check(unique_frames)
        return unique_frames

    async def _quality_check(self, frames: List[ExtractedFrame]):
        """Send a sample to the vision API to catch duplicates that slipped through."""
        if len(frames) < 2:
            return
        # Pick pairs of adjacent frames
        sample_pairs = random.sample(
            list(zip(frames[:-1], frames[1:])),
            k=min(3, len(frames) - 1)
        )
        for frame1, frame2 in sample_pairs:
            # Ask the vision API whether these are truly different
            are_different = await self._vision_api_check(frame1, frame2)
            if not are_different:
                self.quality_metrics['false_negatives'] += 1
                logger.warning(f"False negative: {frame1.frame_id} and {frame2.frame_id} are similar")
        # Auto-adjust: a higher pHash threshold merges more aggressively,
        # so raise it when too many duplicates slip through
        if self.quality_metrics['false_negatives'] > 10:
            self.thresholds['phash_threshold'] += 1
            logger.info(f"Adjusted pHash threshold to {self.thresholds['phash_threshold']}")
```
Validation Metrics
```python
metrics_to_track = {
    'deduplication_performance': {
        'dedup_ratio': 'frames_after / frames_before',
        'target': '0.40-0.60',          # i.e. a 40-60% reduction
        'processing_overhead': 'dedup_time / total_time',
        'target_overhead': '<1%'
    },
    'quality_metrics': {
        'false_positive_rate': '<2%',   # Merged different frames
        'false_negative_rate': '<5%',   # Kept duplicates
        'content_coverage': '>95%',     # Captured all unique content
    },
    'cost_metrics': {
        'vision_api_cost_reduction': '>40%',
        'roi_on_dedup_feature': '>300% first year'
    }
}
```
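The first two targets fall directly out of pipeline counters. A sketch, using the numbers quoted earlier in this ADR:

```python
def dedup_report(frames_before, frames_after, dedup_seconds, total_seconds):
    """Summarize a dedup run against the tracked targets."""
    return {
        'dedup_ratio': frames_after / frames_before,
        'reduction': 1 - frames_after / frames_before,
        'processing_overhead': dedup_seconds / total_seconds,
    }

report = dedup_report(180, 90, 3.06, 600)
print(report['reduction'])             # 0.5
print(report['processing_overhead'])   # ~0.0051 (0.51%)
```

Both values land inside the targets above: a 50% reduction sits in the 0.40-0.60 band, and 0.51% overhead is just over the <1% budget's halfway mark.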
Alternative Approaches
Video Hash (Future Enhancement)
# Instead of per-frame, hash video segments
from videohash import VideoHash
def deduplicate_video_segments(video_path):
"""
Use video hashing for longer-range similarity detection
Could catch repeated intro/outro sequences across videos
"""
videohash = VideoHash(path=video_path)
return videohash.hash
Online Learning (Future Enhancement)
```python
class LearningDeduplicator:
    """
    Learn optimal thresholds from user feedback.
    Track which frames users find valuable.
    """

    def __init__(self):
        self.user_feedback = []

    def record_feedback(self, frame_id: str, useful: bool):
        self.user_feedback.append({'frame_id': frame_id, 'useful': useful})
        # Retrain thresholds every 100 feedback entries
        if len(self.user_feedback) % 100 == 0:
            self._retrain_thresholds()
```
Review Schedule
- Week 1: Deploy pHash-only MVP, monitor dedup ratios
- Week 2: Add SSIM validation for borderline cases
- Month 1: Collect quality metrics, adjust thresholds
- Month 3: Evaluate deep learning approach if quality issues persist
- Month 6: Full review, consider video-level deduplication
References
- Perceptual Hashing
- SSIM Paper
- ImageHash Library
- Content-Based Image Retrieval
- Internal: Frame Deduplication Experiments Notebook