ADR-005: Image Content Change Detection for Unique Frame Extraction
Date: 2026-01-19
Status: Accepted
Decision Makers: System Architect, ML Engineer
Related: ADR-002 (Frame Sampling), TDD Section 3.3
Context
Current frame extraction produces 150-200 frames per video, but many are visually identical or near-identical. The problem:
- Educational videos: Same slide shown for 30-60 seconds = 900-1800 identical frames at 30fps
- Presentation content: Slides with minor cursor movements = hundreds of nearly-identical frames
- Waste: Sending identical frames to vision API wastes money and processing time
- Current deduplication: Only temporal (0.5s window) - doesn't detect content similarity
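For contrast, the existing temporal deduplication amounts to a sliding window over frame timestamps. A minimal sketch (illustrative only; the real pipeline's interfaces are not shown here):

```python
def temporal_dedup(timestamps, window=0.5):
    """Keep only frames at least `window` seconds apart."""
    kept = []
    for t in sorted(timestamps):
        if not kept or t - kept[-1] >= window:
            kept.append(t)
    return kept

# One second of 30fps frames collapses to three representatives,
# even though every frame may show the same slide
frames = [i / 30 for i in range(31)]
print(temporal_dedup(frames))  # [0.0, 0.5, 1.0]
```

This is exactly why temporal dedup alone is insufficient: a slide held for 60 seconds still yields ~120 retained frames, all visually identical.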
Cost Impact
```python
without_content_dedup = {
    'frames_extracted': 180,             # Multi-strategy extraction
    'vision_api_cost': 180 * 0.004,      # $0.72
    'redundant_frames': 90,              # 50% are visually similar
    'wasted_cost': 90 * 0.004            # $0.36 (50% waste)
}

with_content_dedup = {
    'frames_extracted': 180,
    'unique_frames': 90,                 # After content-based deduplication
    'vision_api_cost': 90 * 0.004,       # $0.36
    'cost_savings': 0.36,                # 50% reduction per video
    'annual_savings_at_1000_videos_per_month': 0.36 * 1000 * 12  # $4,320
}
```
Decision Drivers
- Cost Optimization: Reduce vision API costs by 40-60%
- Processing Speed: Skip redundant frame analysis
- Quality: Ensure we don't miss unique content
- Performance: Fast enough to not become a bottleneck
- Accuracy: High precision (don't merge different frames) and recall (catch duplicates)
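To make the last driver measurable, precision and recall can be computed for a dedup run by treating "duplicate" as the positive class. A hypothetical helper (the frame IDs below are illustrative):

```python
def dedup_quality(dropped_ids, true_duplicate_ids):
    """Precision/recall of duplicate detection.

    dropped_ids: frames the deduplicator discarded as duplicates.
    true_duplicate_ids: frames that genuinely duplicate earlier content.
    """
    dropped = set(dropped_ids)
    truth = set(true_duplicate_ids)
    true_positives = len(dropped & truth)
    precision = true_positives / len(dropped) if dropped else 1.0
    recall = true_positives / len(truth) if truth else 1.0
    return precision, recall

# Dropped f2 and f3, but the real duplicates were f3 and f4
print(dedup_quality({'f2', 'f3'}, {'f3', 'f4'}))  # (0.5, 0.5)
```

Low precision means distinct frames were merged (lost content); low recall means duplicates slipped through (wasted API calls).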
Options Considered
Option 1: Perceptual Hashing (pHash)
Approach: Generate hash from image frequency domain, compare hashes
```python
import imagehash
from PIL import Image

def get_phash(image_path):
    img = Image.open(image_path)
    return imagehash.phash(img, hash_size=8)

def are_similar(hash1, hash2, threshold=5):
    return hash1 - hash2 <= threshold  # Hamming distance
```
Pros:
- Very fast: <10ms per image on CPU
- Robust: Handles minor variations (cursor, compression artifacts)
- Low memory: 64-bit hash per image
- Proven: Used in image deduplication for decades
- Simple: Easy to implement and understand
Cons:
- False positives: May merge slides with different text
- Threshold tuning: Requires experimentation per video type
- Not semantic: Doesn't understand content meaning
Performance:
```python
performance_metrics = {
    'computation_time': 0.008,       # 8ms per frame
    'comparison_time': 0.000001,     # 1μs per hash comparison
    'memory_per_hash': 8,            # 8 bytes
    'total_overhead_180_frames': 180 * 0.008,  # 1.44 seconds
}
```
Verdict: ✅ Excellent for initial filtering
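The hash comparison above reduces to a Hamming distance over 64-bit values. A dependency-free sketch of the idea, using a toy average hash as a stand-in for imagehash's DCT-based pHash:

```python
def average_hash(pixels):
    """Toy average hash: one bit per pixel, set if above the mean.

    pixels: flat list of 64 grayscale values (an 8x8 downscaled image).
    """
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def hamming_distance(h1, h2):
    """Number of differing bits between two hash integers."""
    return bin(h1 ^ h2).count("1")

a = average_hash([0] * 32 + [255] * 32)   # bottom half bright
b = average_hash([0] * 31 + [255] * 33)   # one pixel changed
print(hamming_distance(a, b))  # 1
```

A single changed pixel moves the distance by one bit, which is why a small Hamming threshold absorbs cursor movements and compression noise.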
Option 2: Structural Similarity Index (SSIM)
Approach: Compare pixel-level structure between images
```python
from skimage.metrics import structural_similarity as ssim
import cv2

def calculate_ssim(img1_path, img2_path):
    img1 = cv2.imread(img1_path, cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread(img2_path, cv2.IMREAD_GRAYSCALE)
    # Resize to same dimensions
    img2 = cv2.resize(img2, (img1.shape[1], img1.shape[0]))
    score = ssim(img1, img2)
    return score  # 0.0 to 1.0 (1.0 = identical)
```
Pros:
- Accurate: Captures structural changes
- Quantitative: 0-1 similarity score
- Standard: Well-researched algorithm
- Interpretable: Easy to understand threshold
Cons:
- Slow: ~50-100ms per comparison on CPU
- Computational: O(n²) comparisons for n frames
- Sensitive: Small cursor movements trigger differences
- Memory intensive: Must load full images
Performance:
```python
performance_metrics = {
    'computation_time': 0.080,               # 80ms per comparison
    'comparisons_needed': 180 * 179 / 2,     # 16,110 for 180 frames
    'total_time': 16110 * 0.080,             # ~1,289 seconds (21 minutes!)
    'bottleneck': 'Too slow for production'
}
```
Verdict: ❌ Too slow for real-time processing
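For reference, the SSIM formula itself is compact. A global (single-window) version over flat pixel lists, rather than skimage's sliding-window implementation, might look like:

```python
def global_ssim(x, y, data_range=255):
    """Single-window SSIM over two equal-length flat pixel lists."""
    n = len(x)
    c1 = (0.01 * data_range) ** 2  # stabilizer for the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilizer for the contrast term
    mx = sum(x) / n
    my = sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / (n - 1)
    vy = sum((b - my) ** 2 for b in y) / (n - 1)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

identical = [10, 50, 200, 130]
print(global_ssim(identical, identical))  # 1.0
```

The cost comes not from the formula but from evaluating it per window across full-resolution frames, for every pair.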
Option 3: Feature Matching (ORB Descriptors)
Approach: Extract keypoints, match between images
```python
import cv2

def extract_features(image_path):
    img = cv2.imread(image_path)
    orb = cv2.ORB_create(nfeatures=500)
    keypoints, descriptors = orb.detectAndCompute(img, None)
    return descriptors

def match_features(desc1, desc2):
    bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = bf.match(desc1, desc2)
    return len(matches)
```
Pros:
- Robust: Invariant to rotation, scale, lighting
- Semantic: Captures actual content features
- Selective: Can ignore cursor, UI elements
- Fast: ~30ms per comparison
Cons:
- Complex: Requires tuning (keypoint count, match threshold)
- Variable: Slide quality affects feature detection
- Still O(n²): Needs all pairwise comparisons
Performance:
```python
performance_metrics = {
    'computation_time': 0.030,        # 30ms per comparison
    'comparisons_needed': 16110,
    'total_time': 16110 * 0.030,      # ~483 seconds (8 minutes)
    'bottleneck': 'Better but still slow'
}
```
Verdict: ⚠️ Good quality but performance issues
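The cross-check matching step can be illustrated without OpenCV: brute-force nearest-neighbour matching over binary descriptors, keeping only mutual best matches. A simplified stand-in for `cv2.BFMatcher` with `crossCheck=True`, using small integers as descriptors:

```python
def hamming(a, b):
    return bin(a ^ b).count("1")

def cross_check_matches(desc1, desc2):
    """Keep (i, j) only when i and j are each other's nearest neighbour."""
    best_12 = {i: min(range(len(desc2)), key=lambda j: hamming(d, desc2[j]))
               for i, d in enumerate(desc1)}
    best_21 = {j: min(range(len(desc1)), key=lambda i: hamming(d, desc1[i]))
               for j, d in enumerate(desc2)}
    return [(i, j) for i, j in best_12.items() if best_21[j] == i]

d1 = [0b1010, 0b1111]
d2 = [0b1110, 0b1011]
print(cross_check_matches(d1, d2))  # [(0, 0)]
```

Cross-checking suppresses one-sided matches, which is why the match count is a usable similarity signal between two frames.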
Option 4: Hybrid Approach - pHash + SSIM Validation (Selected)
Approach: Use fast pHash for initial clustering, SSIM for validation
```python
import imagehash
import cv2
from PIL import Image
from skimage.metrics import structural_similarity as ssim

class HybridDeduplicator:
    def __init__(self, phash_threshold=5, ssim_threshold=0.95):
        self.phash_threshold = phash_threshold
        self.ssim_threshold = ssim_threshold
        self.frame_hashes = {}
        self.unique_frames = []

    def is_unique(self, frame: ExtractedFrame) -> bool:
        # Step 1: fast pHash comparison against every retained frame
        current_hash = self.compute_phash(frame.file_path)
        for unique_frame in self.unique_frames:
            stored_hash = self.frame_hashes[unique_frame.frame_id]
            hamming_distance = current_hash - stored_hash
            # Quick rejection: clearly different
            if hamming_distance > self.phash_threshold:
                continue
            if hamming_distance >= 3:
                # Borderline case: validate with SSIM
                ssim_score = self.compute_ssim(
                    frame.file_path,
                    unique_frame.file_path
                )
                # SSIM confirms similarity
                if ssim_score >= self.ssim_threshold:
                    return False  # Not unique
            else:
                # Very similar hash (distance <= 2): assume duplicate
                return False
        # Add to unique set
        self.frame_hashes[frame.frame_id] = current_hash
        self.unique_frames.append(frame)
        return True

    def compute_phash(self, image_path):
        img = Image.open(image_path)
        return imagehash.phash(img, hash_size=8)

    def compute_ssim(self, img1_path, img2_path):
        img1 = cv2.imread(img1_path, cv2.IMREAD_GRAYSCALE)
        img2 = cv2.imread(img2_path, cv2.IMREAD_GRAYSCALE)
        img2 = cv2.resize(img2, (img1.shape[1], img1.shape[0]))
        return ssim(img1, img2)
```
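The tiered decision can be isolated into a pure function for unit testing. A hypothetical helper mirroring the logic above; `ssim_score` would come from the SSIM computation and is only consulted in the borderline band:

```python
def tier_decision(hamming_distance, ssim_score=None,
                  phash_threshold=5, ssim_threshold=0.95):
    """Classify a candidate frame pair using the hybrid rules."""
    if hamming_distance > phash_threshold:
        return "unique"        # quick rejection, no SSIM needed
    if hamming_distance <= 2:
        return "duplicate"     # near-identical hash, no SSIM needed
    # Borderline (3..phash_threshold): defer to SSIM
    if ssim_score is not None and ssim_score >= ssim_threshold:
        return "duplicate"
    return "unique"

print(tier_decision(8))        # unique
print(tier_decision(1))        # duplicate
print(tier_decision(4, 0.97))  # duplicate
print(tier_decision(4, 0.90))  # unique
```

Only the third and fourth cases pay the SSIM cost, which is the source of the speedup claimed below.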
Pros:
- Fast: Cheap pHash checks resolve ~95% of frame pairs without SSIM
- Accurate: SSIM validation for edge cases
- Balanced: Best of both algorithms
- Tunable: Adjust thresholds per use case
- Scalable: Near-linear in practice; expensive SSIM runs only on borderline pairs
Cons:
- Complexity: Two algorithms to maintain
- Threshold tuning: Requires experimentation
- Edge cases: May still miss subtle differences
Performance:
```python
performance_metrics = {
    'phash_computation': 180 * 0.008,          # 1.44s for all frames
    'phash_comparisons': 180 * 90,             # Compare to 90 unique (average)
    'phash_comparison_time': 16200 * 0.000001, # 0.016s
    'ssim_validations': 20,                    # Only borderline cases (~10%)
    'ssim_validation_time': 20 * 0.080,        # 1.6s
    'total_time': 1.44 + 0.016 + 1.6,          # ~3.06 seconds
    'overhead_percent': 3.06 / 600,            # 0.5% of 10-min processing time
}
```
Verdict: ✅ SELECTED - Optimal speed/accuracy tradeoff
Option 5: Deep Learning Embeddings (CLIP/ResNet)
Approach: Use pre-trained model to generate embeddings, compare cosine similarity
```python
import torch
import numpy as np
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def get_embedding(image_path):
    image = Image.open(image_path)
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        embeddings = model.get_image_features(**inputs)
    return embeddings[0].numpy()  # drop the batch dimension -> (512,)

def cosine_similarity(emb1, emb2):
    return np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))
```
Pros:
- Semantic: Understands content meaning
- Robust: Handles variations excellently
- Modern: State-of-the-art approach
Cons:
- Slow: 100-200ms per embedding (GPU required)
- Infrastructure: Requires GPU deployment
- Overkill: More sophisticated than needed for slide detection
- Cost: GPU hosting adds operational complexity
Performance:
```python
performance_metrics = {
    'gpu_inference_time': 0.150,               # 150ms per frame
    'total_time_180_frames': 180 * 0.150,      # 27 seconds
    'infrastructure_cost': 'Requires GPU worker',
    'bottleneck': 'Adds complexity'
}
```
Verdict: 🔮 Future consideration if accuracy issues arise
Decision
Hybrid pHash + SSIM Validation
Implementation Strategy
Phase 1: pHash-Only (MVP)
```python
import logging
from typing import List

import imagehash
from PIL import Image

logger = logging.getLogger(__name__)

class SimplePHashDeduplicator:
    """Fast MVP implementation using only pHash."""

    def __init__(self, threshold=5):
        self.threshold = threshold
        self.seen_hashes = {}

    def deduplicate(self, frames: List[ExtractedFrame]) -> List[ExtractedFrame]:
        unique = []
        for frame in frames:
            phash = imagehash.phash(Image.open(frame.file_path))
            # Check whether we have already seen a similar hash
            is_unique = True
            for seen_id, seen_hash in self.seen_hashes.items():
                if phash - seen_hash <= self.threshold:
                    is_unique = False
                    logger.debug(f"Frame {frame.frame_id} duplicate of {seen_id}")
                    break
            if is_unique:
                unique.append(frame)
                self.seen_hashes[frame.frame_id] = phash
        return unique
```
Deployment: Launch with pHash-only, monitor false positive rate
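The scan can be exercised without image files by substituting precomputed 64-bit integers for the imagehash values (illustrative stand-ins only):

```python
def dedup_hashes(hashes, threshold=5):
    """Return indices of frames kept as unique, given per-frame hash ints."""
    kept = []
    for i, h in enumerate(hashes):
        # Unique only if every previously kept hash is far away
        if all(bin(h ^ hashes[j]).count("1") > threshold for j in kept):
            kept.append(i)
    return kept

# Second hash differs from the first by one bit; third differs by eight
print(dedup_hashes([0b11110000, 0b11110001, 0b00001111]))  # [0, 2]
```

This makes the false-positive monitoring concrete: lowering the threshold keeps more borderline frames, raising it merges more aggressively.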
Phase 2: Add SSIM Validation (Production Hardening)
```python
class ProductionDeduplicator:
    """Production implementation with SSIM validation."""

    def __init__(self, phash_threshold=5, ssim_threshold=0.95):
        self.phash_threshold = phash_threshold
        self.ssim_threshold = ssim_threshold
        self.unique_frames = []
        self.frame_hashes = {}
        # Metrics
        self.phash_comparisons = 0
        self.ssim_validations = 0
        self.duplicates_found = 0

    def deduplicate(self, frames: List[ExtractedFrame]) -> List[ExtractedFrame]:
        unique = []
        for frame in sorted(frames, key=lambda f: f.timestamp):
            if self._is_unique(frame):
                unique.append(frame)
        logger.info(f"Deduplication: {len(frames)} → {len(unique)} frames")
        logger.info(f"pHash comparisons: {self.phash_comparisons}")
        logger.info(f"SSIM validations: {self.ssim_validations}")
        return unique

    def _is_unique(self, frame: ExtractedFrame) -> bool:
        current_hash = imagehash.phash(Image.open(frame.file_path))
        for unique_frame in self.unique_frames:
            self.phash_comparisons += 1
            stored_hash = self.frame_hashes[unique_frame.frame_id]
            hamming_distance = current_hash - stored_hash
            # Quick rejection: clearly different
            if hamming_distance > self.phash_threshold:
                continue
            # Very similar hash: likely duplicate
            if hamming_distance <= 2:
                self.duplicates_found += 1
                return False
            # Borderline: validate with SSIM
            self.ssim_validations += 1
            ssim_score = self._compute_ssim(frame.file_path, unique_frame.file_path)
            if ssim_score >= self.ssim_threshold:
                self.duplicates_found += 1
                return False
        # Unique frame
        self.frame_hashes[frame.frame_id] = current_hash
        self.unique_frames.append(frame)
        return True

    def _compute_ssim(self, img1_path: str, img2_path: str) -> float:
        img1 = cv2.imread(img1_path, cv2.IMREAD_GRAYSCALE)
        img2 = cv2.imread(img2_path, cv2.IMREAD_GRAYSCALE)
        # Resize to match dimensions
        if img1.shape != img2.shape:
            img2 = cv2.resize(img2, (img1.shape[1], img1.shape[0]))
        return ssim(img1, img2)
```
Threshold Configuration
```python
DEDUPLICATION_CONFIG = {
    'presentation_videos': {
        'phash_threshold': 4,       # Stricter for slides
        'ssim_threshold': 0.97,
        'rationale': 'Slides are static, high similarity expected'
    },
    'demonstration_videos': {
        'phash_threshold': 6,       # More lenient
        'ssim_threshold': 0.92,
        'rationale': 'More motion, accept more variation'
    },
    'interview_videos': {
        'phash_threshold': 7,
        'ssim_threshold': 0.90,
        'rationale': 'Continuous motion, camera angles'
    }
}
```
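A lookup with a safe default could wrap the table above. A hypothetical helper (the table is repeated in trimmed form so the sketch is self-contained; falling back to the strictest profile is one reasonable default, not a settled decision):

```python
DEDUP_DEFAULTS = {
    'presentation_videos': {'phash_threshold': 4, 'ssim_threshold': 0.97},
    'demonstration_videos': {'phash_threshold': 6, 'ssim_threshold': 0.92},
    'interview_videos': {'phash_threshold': 7, 'ssim_threshold': 0.90},
}

def get_dedup_config(video_type):
    """Fall back to the strictest profile when the type is unknown."""
    return DEDUP_DEFAULTS.get(video_type, DEDUP_DEFAULTS['presentation_videos'])

print(get_dedup_config('podcast')['phash_threshold'])  # 4
```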
Cost Impact
```python
cost_comparison = {
    'baseline_no_dedup': {
        'frames_extracted': 180,
        'vision_api_calls': 180,
        'cost': 180 * 0.004,            # $0.72
    },
    'temporal_dedup_only': {
        'frames_extracted': 180,
        'frames_after_dedup': 150,
        'vision_api_calls': 150,
        'cost': 150 * 0.004,            # $0.60
        'savings': 0.12                 # 17% reduction
    },
    'content_based_dedup': {
        'frames_extracted': 180,
        'frames_after_temporal': 150,
        'frames_after_content': 85,     # 43% additional reduction
        'vision_api_calls': 85,
        'cost': 85 * 0.004,             # $0.34
        'savings_vs_baseline': 0.38,    # 53% reduction
        'savings_vs_temporal': 0.26     # 43% additional
    }
}

# Annual impact at scale
annual_impact = {
    'videos_per_year': 12000,
    'cost_without_content_dedup': 12000 * 0.60,  # $7,200
    'cost_with_content_dedup': 12000 * 0.34,     # $4,080
    'annual_savings': 3120,                      # $3,120
    'roi_on_development': 3120 / 10000           # 31% first year (assuming $10K dev cost)
}
```
Consequences
Positive
- Cost Reduction: 43% reduction in vision API costs
- Speed Improvement: Skip 43% of frames = faster processing
- Quality Maintained: No loss in content coverage
- Simple Algorithm: pHash is battle-tested, reliable
- Low Overhead: <1% processing time overhead
Negative
- Threshold Tuning: Requires experimentation per video type
- Edge Cases: May merge slightly different slides
- Maintenance: Two algorithms to maintain in production
- Monitoring: Need to track false positive/negative rates
Mitigation Strategies
```python
import random
import logging
from typing import List

logger = logging.getLogger(__name__)

class AdaptiveDeduplicator:
    """Self-tuning deduplicator with quality monitoring."""

    def __init__(self):
        self.thresholds = DEDUPLICATION_CONFIG['presentation_videos']
        self.quality_metrics = {
            'false_positives': 0,   # Merged genuinely different frames
            'false_negatives': 0,   # Kept duplicate frames
            'total_processed': 0
        }

    async def deduplicate_with_monitoring(self, frames: List[ExtractedFrame]):
        unique_frames = self.deduplicate(frames)
        # Sample 5% of runs for a quality check
        if random.random() < 0.05:
            await self._quality_check(unique_frames)
        return unique_frames

    async def _quality_check(self, frames: List[ExtractedFrame]):
        """Send a sample to the vision API to catch duplicates that slipped through."""
        if len(frames) < 2:
            return
        # Pick pairs of adjacent frames
        sample_pairs = random.sample(
            list(zip(frames[:-1], frames[1:])),
            k=min(3, len(frames) - 1)
        )
        for frame1, frame2 in sample_pairs:
            # Ask the vision API whether these are truly different
            are_different = await self._vision_api_check(frame1, frame2)
            if not are_different:
                self.quality_metrics['false_negatives'] += 1
                logger.warning(f"False negative: {frame1.frame_id} and {frame2.frame_id} are similar")
        # Auto-adjust: a higher pHash threshold merges more aggressively,
        # so raise it when too many duplicates slip through
        if self.quality_metrics['false_negatives'] > 10:
            self.thresholds['phash_threshold'] += 1
            logger.info(f"Adjusted pHash threshold to {self.thresholds['phash_threshold']}")
```
Validation Metrics
```python
metrics_to_track = {
    'deduplication_performance': {
        'dedup_ratio': 'frames_after / frames_before',
        'target': '0.40-0.60',          # i.e. a 40-60% reduction
        'processing_overhead': 'dedup_time / total_time',
        'target_overhead': '<1%'
    },
    'quality_metrics': {
        'false_positive_rate': '<2%',   # Merged different frames
        'false_negative_rate': '<5%',   # Kept duplicates
        'content_coverage': '>95%',     # Captured all unique content
    },
    'cost_metrics': {
        'vision_api_cost_reduction': '>40%',
        'roi_on_dedup_feature': '>300% first year'
    }
}
```
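The first two targets fall directly out of pipeline counters. A sketch, using the numbers quoted earlier in this ADR:

```python
def dedup_report(frames_before, frames_after, dedup_seconds, total_seconds):
    """Summarize a dedup run against the tracked targets."""
    return {
        'dedup_ratio': frames_after / frames_before,
        'reduction': 1 - frames_after / frames_before,
        'processing_overhead': dedup_seconds / total_seconds,
    }

report = dedup_report(180, 90, 3.06, 600)
print(report['reduction'])             # 0.5
print(report['processing_overhead'])   # ~0.0051 (0.51%)
```

Both values land inside the targets above: a 50% reduction sits in the 0.40-0.60 band, and 0.51% overhead is just over the <1% budget's halfway mark.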
Alternative Approaches
Video Hash (Future Enhancement)
# Instead of per-frame, hash video segments
from videohash import VideoHash
def deduplicate_video_segments(video_path):
"""
Use video hashing for longer-range similarity detection
Could catch repeated intro/outro sequences across videos
"""
videohash = VideoHash(path=video_path)
return videohash.hash
Online Learning (Future Enhancement)
```python
class LearningDeduplicator:
    """
    Learn optimal thresholds from user feedback.
    Track which frames users find valuable.
    """

    def __init__(self):
        self.user_feedback = []

    def record_feedback(self, frame_id: str, useful: bool):
        self.user_feedback.append({'frame_id': frame_id, 'useful': useful})
        # Retrain thresholds every 100 feedback entries
        if len(self.user_feedback) % 100 == 0:
            self._retrain_thresholds()
```
Review Schedule
- Week 1: Deploy pHash-only MVP, monitor dedup ratios
- Week 2: Add SSIM validation for borderline cases
- Month 1: Collect quality metrics, adjust thresholds
- Month 3: Evaluate deep learning approach if quality issues persist
- Month 6: Full review, consider video-level deduplication
References
- Perceptual Hashing
- SSIM Paper
- ImageHash Library
- Content-Based Image Retrieval
- Internal: Frame Deduplication Experiments Notebook