ADR-005: Image Content Change Detection for Unique Frame Extraction

Date: 2026-01-19
Status: Accepted
Decision Makers: System Architect, ML Engineer
Related: ADR-002 (Frame Sampling), TDD Section 3.3


Context

Current frame extraction produces 150-200 frames per video, but many are visually identical or near-identical. The problem:

  • Educational videos: Same slide shown for 30-60 seconds = 900-1800 identical frames at 30fps
  • Presentation content: Slides with minor cursor movements = hundreds of nearly-identical frames
  • Waste: Sending identical frames to vision API wastes money and processing time
  • Current deduplication: temporal only (0.5s window), which does not detect content similarity
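For contrast, the existing temporal-only pass amounts to a minimum-gap filter. This is an illustrative sketch, not the production code; the exact window logic is assumed from the 0.5s figure above:

```python
def temporal_dedup(timestamps, window=0.5):
    """Keep only frames at least `window` seconds after the last kept one.

    Illustrative stand-in for the current temporal deduplication. It never
    looks at pixel content, which is the gap this ADR addresses.
    """
    kept, last = [], None
    for t in sorted(timestamps):
        if last is None or t - last >= window:
            kept.append(t)
            last = t
    return kept
```

A slide held on screen for 30 seconds still yields roughly 60 frames at this window, all visually identical.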

Cost Impact

without_content_dedup = {
    'frames_extracted': 180,        # Multi-strategy extraction
    'vision_api_cost': 180 * 0.004, # $0.72
    'redundant_frames': 90,         # 50% are visually similar
    'wasted_cost': 90 * 0.004       # $0.36 (50% waste)
}

with_content_dedup = {
    'frames_extracted': 180,
    'unique_frames': 90,            # After content-based deduplication
    'vision_api_cost': 90 * 0.004,  # $0.36
    'cost_savings': 0.36,           # 50% reduction
    'annual_savings_at_1000_videos_per_month': 0.36 * 1000 * 12  # $4,320
}

Decision Drivers

  1. Cost Optimization: Reduce vision API costs by 40-60%
  2. Processing Speed: Skip redundant frame analysis
  3. Quality: Ensure we don't miss unique content
  4. Performance: Fast enough to not become a bottleneck
  5. Accuracy: High precision (don't merge different frames) and recall (catch duplicates)

Options Considered

Option 1: Perceptual Hashing (pHash)

Approach: Generate hash from image frequency domain, compare hashes

import imagehash
from PIL import Image

def get_phash(image_path):
    img = Image.open(image_path)
    return imagehash.phash(img, hash_size=8)

def are_similar(hash1, hash2, threshold=5):
    return hash1 - hash2 <= threshold  # Hamming distance
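The `hash1 - hash2` subtraction above is the Hamming distance over the 64 hash bits. A library-free equivalent makes the threshold concrete:

```python
def hamming_distance(h1: int, h2: int) -> int:
    """Count differing bits between two 64-bit hashes; this is the same
    quantity the imagehash library computes for `hash1 - hash2`."""
    return bin(h1 ^ h2).count("1")

# A threshold of 5 means two frames may differ in at most 5 of 64 bits
# (~8%) before being treated as distinct.
```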

Pros:

  • Very fast: <10ms per image on CPU
  • Robust: Handles minor variations (cursor, compression artifacts)
  • Low memory: 64-bit hash per image
  • Proven: Used in image deduplication for decades
  • Simple: Easy to implement and understand

Cons:

  • False positives: May merge slides with different text
  • Threshold tuning: Requires experimentation per video type
  • Not semantic: Doesn't understand content meaning

Performance:

performance_metrics = {
    'computation_time': 0.008,    # 8ms per frame
    'comparison_time': 0.000001,  # 1μs per hash comparison
    'memory_per_hash': 8,         # 8 bytes
    'total_overhead_180_frames': 180 * 0.008,  # 1.44 seconds
}

Verdict: ✅ Excellent for initial filtering


Option 2: Structural Similarity Index (SSIM)

Approach: Compare pixel-level structure between images

from skimage.metrics import structural_similarity as ssim
import cv2

def calculate_ssim(img1_path, img2_path):
    img1 = cv2.imread(img1_path, cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread(img2_path, cv2.IMREAD_GRAYSCALE)

    # Resize to same dimensions
    img2 = cv2.resize(img2, (img1.shape[1], img1.shape[0]))

    score = ssim(img1, img2)
    return score  # 0.0 to 1.0 (1.0 = identical)
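SSIM offers no cheap index for pruning candidates, so deduplicating n frames requires comparing every pair. The comparison count grows quadratically:

```python
def pairwise_comparisons(n: int) -> int:
    """All-pairs comparison count: n * (n - 1) / 2."""
    return n * (n - 1) // 2

# 180 extracted frames require 16,110 SSIM comparisons
```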

Pros:

  • Accurate: Captures structural changes
  • Quantitative: 0-1 similarity score
  • Standard: Well-researched algorithm
  • Interpretable: Easy to understand threshold

Cons:

  • Slow: ~50-100ms per comparison on CPU
  • Computational: O(n²) comparisons for n frames
  • Sensitive: Small cursor movements trigger differences
  • Memory intensive: Must load full images

Performance:

performance_metrics = {
    'computation_time': 0.080,            # 80ms per comparison
    'comparisons_needed': 180 * 179 / 2,  # 16,110 for 180 frames
    'total_time': 16110 * 0.080,          # ~1,289 seconds (21 minutes!)
    'bottleneck': 'Too slow for production'
}

Verdict: ❌ Too slow for real-time processing


Option 3: Feature Matching (ORB Descriptors)

Approach: Extract keypoints, match between images

import cv2

def extract_features(image_path):
    img = cv2.imread(image_path)
    orb = cv2.ORB_create(nfeatures=500)
    keypoints, descriptors = orb.detectAndCompute(img, None)
    return descriptors

def match_features(desc1, desc2):
    bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = bf.match(desc1, desc2)
    return len(matches)

Pros:

  • Robust: Invariant to rotation, scale, lighting
  • Semantic: Captures actual content features
  • Selective: Can ignore cursor, UI elements
  • Fast: ~30ms per comparison

Cons:

  • Complex: Requires tuning (keypoint count, match threshold)
  • Variable: Slide quality affects feature detection
  • Still O(n²): Needs all pairwise comparisons

Performance:

performance_metrics = {
    'computation_time': 0.030,    # 30ms per comparison
    'comparisons_needed': 16110,
    'total_time': 16110 * 0.030,  # ~483 seconds (8 minutes)
    'bottleneck': 'Better but still slow'
}

Verdict: ⚠️ Good quality but performance issues


Option 4: Hybrid Approach - pHash + SSIM Validation (Selected)

Approach: Use fast pHash for initial clustering, SSIM for validation

import cv2
import imagehash
from PIL import Image
from skimage.metrics import structural_similarity as ssim

class HybridDeduplicator:
    def __init__(self, phash_threshold=5, ssim_threshold=0.95):
        self.phash_threshold = phash_threshold
        self.ssim_threshold = ssim_threshold
        self.frame_hashes = {}
        self.unique_frames = []

    def is_unique(self, frame: ExtractedFrame) -> bool:
        # Step 1: Fast pHash comparison
        current_hash = self.compute_phash(frame.file_path)

        for unique_frame in self.unique_frames:
            stored_hash = self.frame_hashes[unique_frame.frame_id]
            hamming_distance = current_hash - stored_hash

            # Quick rejection: clearly different
            if hamming_distance > self.phash_threshold:
                continue

            # Borderline case: validate with SSIM
            if 3 <= hamming_distance <= self.phash_threshold:
                ssim_score = self.compute_ssim(
                    frame.file_path,
                    unique_frame.file_path
                )

                # SSIM confirms similarity
                if ssim_score >= self.ssim_threshold:
                    return False  # Not unique
            else:
                # Very similar hash (distance < 3), assume duplicate
                return False

        # Add to unique set
        self.frame_hashes[frame.frame_id] = current_hash
        self.unique_frames.append(frame)
        return True

    def compute_phash(self, image_path):
        img = Image.open(image_path)
        return imagehash.phash(img, hash_size=8)

    def compute_ssim(self, img1_path, img2_path):
        img1 = cv2.imread(img1_path, cv2.IMREAD_GRAYSCALE)
        img2 = cv2.imread(img2_path, cv2.IMREAD_GRAYSCALE)
        img2 = cv2.resize(img2, (img1.shape[1], img1.shape[0]))
        return ssim(img1, img2)

Pros:

  • Fast: pHash eliminates 95% of comparisons
  • Accurate: SSIM validation for edge cases
  • Balanced: Best of both algorithms
  • Tunable: Adjust thresholds per use case
  • Scalable: O(n) with pHash, selective O(n²) for validation

Cons:

  • Complexity: Two algorithms to maintain
  • Threshold tuning: Requires experimentation
  • Edge cases: May still miss subtle differences

Performance:

performance_metrics = {
    'phash_computation': 180 * 0.008,           # 1.44s for all frames
    'phash_comparisons': 180 * 90,              # Compare to ~90 unique frames (average)
    'phash_comparison_time': 16200 * 0.000001,  # 0.016s
    'ssim_validations': 20,                     # Only borderline cases (~10%)
    'ssim_validation_time': 20 * 0.080,         # 1.6s
    'total_time': 1.44 + 0.016 + 1.6,           # ~3.06 seconds
    'overhead_percent': 3.06 / 600,             # 0.5% of 10-min processing time
}

Verdict: ✅ SELECTED - Optimal speed/accuracy tradeoff


Option 5: Deep Learning Embeddings (CLIP/ResNet)

Approach: Use pre-trained model to generate embeddings, compare cosine similarity

import numpy as np
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def get_embedding(image_path):
    image = Image.open(image_path)
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        embeddings = model.get_image_features(**inputs)
    return embeddings[0].numpy()  # 1D vector

def cosine_similarity(emb1, emb2):
    return np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))

Pros:

  • Semantic: Understands content meaning
  • Robust: Handles variations excellently
  • Modern: State-of-the-art approach

Cons:

  • Slow: 100-200ms per embedding (GPU required)
  • Infrastructure: Requires GPU deployment
  • Overkill: More sophisticated than needed for slide detection
  • Cost: GPU hosting adds operational complexity

Performance:

performance_metrics = {
    'gpu_inference_time': 0.150,              # 150ms per frame
    'total_time_180_frames': 180 * 0.150,     # 27 seconds
    'infrastructure_cost': 'Requires GPU worker',
    'bottleneck': 'Adds complexity'
}

Verdict: 🔮 Future consideration if accuracy issues arise

Decision

Hybrid pHash + SSIM Validation

Implementation Strategy

Phase 1: pHash-Only (MVP)

from typing import List

import imagehash
from PIL import Image

class SimplePHashDeduplicator:
    """Fast MVP implementation using only pHash"""

    def __init__(self, threshold=5):
        self.threshold = threshold
        self.seen_hashes = {}

    def deduplicate(self, frames: List[ExtractedFrame]) -> List[ExtractedFrame]:
        unique = []

        for frame in frames:
            phash = imagehash.phash(Image.open(frame.file_path))

            # Check if we've seen a similar hash
            is_unique = True
            for seen_id, seen_hash in self.seen_hashes.items():
                if phash - seen_hash <= self.threshold:
                    is_unique = False
                    logger.debug(f"Frame {frame.frame_id} duplicate of {seen_id}")
                    break

            if is_unique:
                unique.append(frame)
                self.seen_hashes[frame.frame_id] = phash

        return unique

Deployment: Launch with pHash-only, monitor false positive rate

Phase 2: Add SSIM Validation (Production Hardening)

from typing import List

import cv2
import imagehash
from PIL import Image
from skimage.metrics import structural_similarity as ssim

class ProductionDeduplicator:
    """Production implementation with SSIM validation"""

    def __init__(self, phash_threshold=5, ssim_threshold=0.95):
        self.phash_threshold = phash_threshold
        self.ssim_threshold = ssim_threshold
        self.unique_frames = []
        self.frame_hashes = {}

        # Metrics
        self.phash_comparisons = 0
        self.ssim_validations = 0
        self.duplicates_found = 0

    def deduplicate(self, frames: List[ExtractedFrame]) -> List[ExtractedFrame]:
        unique = []

        for frame in sorted(frames, key=lambda f: f.timestamp):
            if self._is_unique(frame):
                unique.append(frame)

        logger.info(f"Deduplication: {len(frames)} → {len(unique)} frames")
        logger.info(f"pHash comparisons: {self.phash_comparisons}")
        logger.info(f"SSIM validations: {self.ssim_validations}")

        return unique

    def _is_unique(self, frame: ExtractedFrame) -> bool:
        current_hash = imagehash.phash(Image.open(frame.file_path))

        for unique_frame in self.unique_frames:
            self.phash_comparisons += 1
            stored_hash = self.frame_hashes[unique_frame.frame_id]
            hamming_distance = current_hash - stored_hash

            # Quick rejection: clearly different
            if hamming_distance > self.phash_threshold:
                continue

            # Very similar hash: likely duplicate
            if hamming_distance <= 2:
                self.duplicates_found += 1
                return False

            # Borderline: validate with SSIM
            self.ssim_validations += 1
            ssim_score = self._compute_ssim(frame.file_path, unique_frame.file_path)

            if ssim_score >= self.ssim_threshold:
                self.duplicates_found += 1
                return False

        # Unique frame
        self.frame_hashes[frame.frame_id] = current_hash
        self.unique_frames.append(frame)
        return True

    def _compute_ssim(self, img1_path: str, img2_path: str) -> float:
        img1 = cv2.imread(img1_path, cv2.IMREAD_GRAYSCALE)
        img2 = cv2.imread(img2_path, cv2.IMREAD_GRAYSCALE)

        # Resize to match dimensions
        if img1.shape != img2.shape:
            img2 = cv2.resize(img2, (img1.shape[1], img1.shape[0]))

        return ssim(img1, img2)

Threshold Configuration

DEDUPLICATION_CONFIG = {
    'presentation_videos': {
        'phash_threshold': 4,   # Stricter for slides
        'ssim_threshold': 0.97,
        'rationale': 'Slides are static, high similarity expected'
    },
    'demonstration_videos': {
        'phash_threshold': 6,   # More lenient
        'ssim_threshold': 0.92,
        'rationale': 'More motion, accept more variation'
    },
    'interview_videos': {
        'phash_threshold': 7,
        'ssim_threshold': 0.90,
        'rationale': 'Continuous motion, camera angles'
    }
}
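Selecting a profile can be a plain lookup with a conservative fallback. This is a sketch: `get_dedup_config` and the fallback choice are illustrative, not part of the design above, and the config here is a trimmed copy so the example is self-contained:

```python
# Trimmed copy of the config above, for a self-contained example
DEDUPLICATION_CONFIG = {
    'presentation_videos': {'phash_threshold': 4, 'ssim_threshold': 0.97},
    'interview_videos': {'phash_threshold': 7, 'ssim_threshold': 0.90},
}

def get_dedup_config(video_type: str) -> dict:
    """Fall back to the strictest (presentation) profile for unknown types,
    so unfamiliar content is never over-merged."""
    key = f"{video_type}_videos"
    return DEDUPLICATION_CONFIG.get(key, DEDUPLICATION_CONFIG['presentation_videos'])
```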

Cost Impact

cost_comparison = {
    'baseline_no_dedup': {
        'frames_extracted': 180,
        'vision_api_calls': 180,
        'cost': 180 * 0.004,         # $0.72
    },
    'temporal_dedup_only': {
        'frames_extracted': 180,
        'frames_after_dedup': 150,
        'vision_api_calls': 150,
        'cost': 150 * 0.004,         # $0.60
        'savings': 0.12              # 17% reduction
    },
    'content_based_dedup': {
        'frames_extracted': 180,
        'frames_after_temporal': 150,
        'frames_after_content': 85,  # 43% additional reduction
        'vision_api_calls': 85,
        'cost': 85 * 0.004,          # $0.34
        'savings_vs_baseline': 0.38, # 53% reduction
        'savings_vs_temporal': 0.26  # 43% additional
    }
}

# Annual impact at scale
annual_impact = {
    'videos_per_year': 12000,
    'cost_without_content_dedup': 12000 * 0.60,  # $7,200
    'cost_with_content_dedup': 12000 * 0.34,     # $4,080
    'annual_savings': 3120,                      # $3,120
    'roi_on_development': 3120 / 10000           # 31% first year (assuming $10K dev cost)
}

Consequences

Positive

  1. Cost Reduction: 43% reduction in vision API costs
  2. Speed Improvement: Skip 43% of frames = faster processing
  3. Quality Maintained: No loss in content coverage
  4. Simple Algorithm: pHash is battle-tested, reliable
  5. Low Overhead: <1% processing time overhead

Negative

  1. Threshold Tuning: Requires experimentation per video type
  2. Edge Cases: May merge slightly different slides
  3. Maintenance: Two algorithms to maintain in production
  4. Monitoring: Need to track false positive/negative rates

Mitigation Strategies

import random
from typing import List

class AdaptiveDeduplicator(ProductionDeduplicator):
    """Self-tuning deduplicator with quality monitoring.

    Inherits deduplicate() from ProductionDeduplicator above.
    """

    def __init__(self):
        super().__init__()
        # Copy so threshold adjustments don't mutate the shared config
        self.thresholds = dict(DEDUPLICATION_CONFIG['presentation_videos'])
        self.quality_metrics = {
            'false_positives': 0,  # Merged different frames
            'false_negatives': 0,  # Kept duplicate frames
            'total_processed': 0
        }

    async def deduplicate_with_monitoring(self, frames: List[ExtractedFrame]):
        unique_frames = self.deduplicate(frames)

        # Sample 5% of runs for a quality check
        if random.random() < 0.05:
            await self._quality_check(unique_frames)

        return unique_frames

    async def _quality_check(self, frames: List[ExtractedFrame]):
        """Send a sample to the vision API to check for kept duplicates"""
        if len(frames) < 2:
            return

        # Pick adjacent frames
        sample_pairs = random.sample(
            list(zip(frames[:-1], frames[1:])),
            k=min(3, len(frames) - 1)
        )

        for frame1, frame2 in sample_pairs:
            # Ask the vision API whether these are truly different
            are_different = await self._vision_api_check(frame1, frame2)

            if not are_different:
                self.quality_metrics['false_negatives'] += 1
                logger.warning(f"False negative: {frame1.frame_id} and {frame2.frame_id} are similar")

        # Auto-adjust: raise the threshold so near-duplicates merge more aggressively
        if self.quality_metrics['false_negatives'] > 10:
            self.thresholds['phash_threshold'] += 1
            logger.info(f"Adjusted pHash threshold to {self.thresholds['phash_threshold']}")

Validation Metrics

metrics_to_track = {
    'deduplication_performance': {
        'dedup_ratio': 'frames_after / frames_before',
        'target': '0.40-0.60',          # i.e. a 40-60% reduction
        'processing_overhead': 'dedup_time / total_time',
        'target_overhead': '<1%'
    },
    'quality_metrics': {
        'false_positive_rate': '<2%',   # Merged different frames
        'false_negative_rate': '<5%',   # Kept duplicates
        'content_coverage': '>95%',     # Captured all unique content
    },
    'cost_metrics': {
        'vision_api_cost_reduction': '>40%',
        'roi_on_dedup_feature': '>30% first year'
    }
}
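The dedup ratio reduces to simple arithmetic; it is worth pinning down, since "ratio kept" and "reduction" are easy to mix up when reading the target. A small sketch (function name is illustrative):

```python
def dedup_metrics(frames_before: int, frames_after: int) -> dict:
    """Report both readings: the fraction of frames kept and the
    fraction removed (the 'reduction' the targets refer to)."""
    kept = frames_after / frames_before
    return {'kept_ratio': kept, 'reduction': 1 - kept}

# 180 -> 85 frames: ~0.47 kept, ~53% reduction (inside the 40-60% target)
```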

Alternative Approaches

Video Hash (Future Enhancement)

# Instead of per-frame, hash video segments
from videohash import VideoHash

def deduplicate_video_segments(video_path):
"""
Use video hashing for longer-range similarity detection
Could catch repeated intro/outro sequences across videos
"""
videohash = VideoHash(path=video_path)
return videohash.hash

Online Learning (Future Enhancement)

class LearningDeduplicator:
    """
    Learn optimal thresholds from user feedback.
    Track which frames users find valuable.
    """

    def __init__(self):
        self.user_feedback = []

    def record_feedback(self, frame_id: str, useful: bool):
        self.user_feedback.append({'frame_id': frame_id, 'useful': useful})

        # Retrain thresholds every 100 feedback events
        if len(self.user_feedback) % 100 == 0:
            self._retrain_thresholds()

Review Schedule

  • Week 1: Deploy pHash-only MVP, monitor dedup ratios
  • Week 2: Add SSIM validation for borderline cases
  • Month 1: Collect quality metrics, adjust thresholds
  • Month 3: Evaluate deep learning approach if quality issues persist
  • Month 6: Full review, consider video-level deduplication

References