ADR-002: Multi-Strategy Frame Sampling Approach

Date: 2026-01-19
Status: Accepted
Decision Makers: System Architect, ML Engineer
Related: ADR-001 (Vision Model Choice)


Context

Video analysis requires extracting representative frames for vision model processing. Key challenges:

  1. Cost Constraint: Each frame costs $0.004-$0.008 (API fees)
  2. Redundancy Problem: Adjacent frames in video are highly similar
  3. Content Diversity: Educational videos have slides, diagrams, scenes, transitions
  4. Completeness: Must capture all important visual information
  5. Processing Time: More frames = longer processing

Problem Statement

Given a 60-minute educational video at 30fps (108,000 frames):

  • Naive sampling (1 fps) → 3,600 frames → $14.40 → unacceptable cost
  • Fixed interval (5s) → 720 frames → $2.88 → still too high
  • Need: Extract 100-200 frames that capture all significant content

Decision Drivers

  1. Capture Slide Content: Educational videos have static slides for 5-30 seconds
  2. Detect Transitions: Scene changes indicate new content
  3. Minimize Redundancy: Adjacent frames with <5% difference are redundant
  4. Maximize Information Density: Prioritize frames with text/diagrams
  5. Cost Efficiency: Target 120-180 frames per hour of video

Options Considered

Option 1: Fixed-Interval Sampling

Approach: Extract every N seconds

frames_per_video = duration_seconds / interval_seconds
# 60-min video, 5s interval → 720 frames

Pros:

  • Simple to implement
  • Predictable frame count
  • Uniform temporal coverage

Cons:

  • High redundancy: Samples many similar frames during static slides
  • Misses transitions: May skip important scene changes
  • Wasteful: Processes many low-information frames
  • Cost: 720 frames × $0.004 = $2.88 per video

Verdict: ❌ Too expensive and redundant


Option 2: Scene Change Detection Only

Approach: Extract frames only at detected scene transitions

# FFmpeg select filter: keep only frames whose scene-change score
# exceeds the threshold (0.4 works well for most educational content)
ffmpeg -i input.mp4 -vf "select='gt(scene,0.4)'" -vsync vfr frames/%04d.jpg
# Typical educational video: 30-80 scene changes

Pros:

  • Low redundancy: Only captures transitions
  • Cost-efficient: 50-80 frames per video
  • Fast: FFmpeg implements efficiently
  • Good for dynamic content: Captures all scene changes

Cons:

  • Misses slide content: Static slides may have no scene changes
  • Under-samples presentations: Educational videos have long static periods
  • Incomplete: May miss important textual content

Verdict: ❌ Insufficient for educational content


Option 3: Content-Aware Sampling (Slide Detection)

Approach: Detect static content periods and sample once per stable period

# Algorithm:
# 1. Track frame-to-frame similarity
# 2. When similarity > 95% for N frames, extract one frame
# 3. Reset on significant change
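The three steps above can be sketched as follows. This is an illustrative implementation, not production code: `sample_stable_periods` is a hypothetical helper, `frames` is assumed to be a sequence of grayscale frames as NumPy arrays, and the per-pixel tolerance of 10 intensity levels is an arbitrary choice.

```python
import numpy as np

def sample_stable_periods(frames, similarity_threshold=0.95, min_stable_frames=3):
    """Emit one frame index per stable (static) period.

    Similarity between adjacent frames is measured as the fraction of
    pixels whose intensity changed by less than a small tolerance.
    """
    sampled = []
    stable_run = 0
    for i in range(1, len(frames)):
        diff = np.abs(frames[i].astype(int) - frames[i - 1].astype(int))
        similarity = np.mean(diff < 10)  # fraction of near-identical pixels
        if similarity >= similarity_threshold:
            stable_run += 1
            if stable_run == min_stable_frames:
                sampled.append(i)  # exactly one frame per stable period
        else:
            stable_run = 0  # significant change: reset the run
    return sampled
```

On a synthetic clip of two static slides, this yields one index per slide once each has been stable for `min_stable_frames` comparisons.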

Pros:

  • Excellent for slides: One frame per slide (5-30s static content)
  • Low redundancy: Only samples distinct content
  • Complete capture: Gets all slide content

Cons:

  • Misses transitions: May skip scene changes between slides
  • Computational cost: Requires frame comparison
  • Tuning required: Similarity threshold varies by content

Verdict: ✅ Good for presentations, but incomplete alone


Option 4: Text Density Sampling

Approach: Use edge detection to identify frames with high text density

# Algorithm:
# 1. Apply Canny edge detection
# 2. Calculate edge pixel ratio
# 3. Sample frames with ratio > threshold (5%)
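A minimal sketch of the edge-ratio test, using a simple intensity-gradient threshold as a stand-in for Canny so the example stays dependency-free (a production version would use an actual Canny detector); the gradient threshold of 30 is an assumed value.

```python
import numpy as np

def edge_density(gray, edge_threshold=30):
    """Approximate edge-pixel ratio for a grayscale frame.

    A pixel counts as an edge when the magnitude of its intensity
    gradient exceeds edge_threshold.
    """
    gy, gx = np.gradient(gray.astype(float))
    magnitude = np.hypot(gx, gy)
    return float(np.mean(magnitude > edge_threshold))

def is_text_dense(gray, ratio_threshold=0.05):
    # Sample the frame when more than 5% of pixels sit on edges
    return edge_density(gray) > ratio_threshold
```

A uniform frame scores 0, while a frame with sharp vertical boundaries (as text and diagrams produce) clears the 5% threshold.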

Pros:

  • Targets informative content: Text indicates important information
  • Works for diagrams: Charts/diagrams have high edge density
  • Fast: Edge detection is computationally cheap

Cons:

  • False positives: Textured scenes trigger false detection
  • Misses visual-only content: Photographs, demonstrations
  • Overlap with slide detection: Similar frames captured

Verdict: ✅ Good complement to other strategies


Option 5: Multi-Strategy Hybrid (Selected)

Approach: Combine multiple strategies, then deduplicate

strategies = [
    'scene_change',     # Capture transitions
    'slide_detection',  # Capture static slides
    'fixed_interval',   # Ensure temporal coverage
    'text_density',     # Prioritize informative frames
]

# Extract frames using all strategies
all_frames = []
for strategy in strategies:
    frames = extract_with_strategy(strategy)
    all_frames.extend(frames)

# Deduplicate: remove frames within 0.5s of each other
final_frames = deduplicate(all_frames, tolerance=0.5)

Pros:

  • Comprehensive: Captures all important content types
  • Redundancy control: Deduplication removes overlap
  • Flexible: Can adjust strategy weights per video type
  • Robust: Multiple strategies compensate for each other's weaknesses

Cons:

  • Complexity: More code to maintain
  • Computational cost: Multiple passes over video
  • Tuning: Requires parameter optimization

Verdict: ✅ SELECTED - Best balance of completeness and efficiency

Decision

Multi-Strategy Hybrid Sampling

Configuration

SAMPLING_CONFIG = {
    'strategies': [
        {
            'name': 'scene_change',
            'enabled': True,
            'threshold': 0.4,
            'weight': 1.0,
        },
        {
            'name': 'fixed_interval',
            'enabled': True,
            'interval_seconds': 5.0,
            'weight': 0.5,              # Lower priority
        },
        {
            'name': 'slide_detection',
            'enabled': True,
            'stability_duration': 2.0,  # 2s of stable content
            'weight': 1.5,              # Higher priority
        },
        {
            'name': 'text_density',
            'enabled': True,
            'edge_threshold': 0.05,     # 5% edge pixels
            'sample_interval': 2.0,     # Check every 2s
            'weight': 1.2,
        },
    ],
    'deduplication': {
        'enabled': True,
        'temporal_tolerance': 0.5,      # 0.5s window
        'similarity_threshold': 0.95,   # 95% similar = duplicate
    },
    'limits': {
        'max_frames_per_video': 500,
        'min_frames_per_minute': 1,
        'max_frames_per_minute': 5,
    },
}
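One way the `limits` block might be enforced is to clamp the deduplicated count into the per-minute band, then apply the absolute per-video cap. `clamp_frame_count` is a hypothetical helper, not part of the implemented system:

```python
def clamp_frame_count(n_frames: int, duration_minutes: float, limits: dict) -> int:
    """Apply the 'limits' block of SAMPLING_CONFIG to a candidate frame count."""
    low = int(limits['min_frames_per_minute'] * duration_minutes)
    high = int(limits['max_frames_per_minute'] * duration_minutes)
    # First clamp to the per-minute band, then to the per-video cap
    clamped = max(low, min(n_frames, high))
    return min(clamped, limits['max_frames_per_video'])
```

For a 60-minute video this maps a raw count of 1,000 down to 300 (the 5/min ceiling), leaves 150 untouched, and lifts an under-sampled 10 up to 60 (the 1/min floor).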

Expected Frame Counts

# 60-minute educational video analysis
frame_extraction_estimates = {
    'scene_change': 40,        # ~40 scene transitions
    'fixed_interval_5s': 720,  # Every 5 seconds (before weighting)
    'slide_detection': 60,     # ~60 distinct slides
    'text_density': 180,       # ~3 per minute

    'raw_total': 1000,         # Sum before dedup
    'after_dedup': 150,        # ~85% reduction
    'final_count': 150,        # Target achieved
}

cost_analysis = {
    'frames': 150,
    'cost_per_frame': 0.004,   # USD per frame
    'total_cost': 0.60,        # 150 × $0.004
    'target': 1.50,            # Per-video budget
    'margin': '60% under budget',
}

Consequences

Positive

  1. Complete Coverage: All strategy types compensate for each other's blind spots
  2. Cost Efficient: 150 frames vs 720 (79% reduction)
  3. Quality: Captures all slides, transitions, and key moments
  4. Flexibility: Can disable strategies for specific video types
  5. Robust: System continues if one strategy fails

Negative

  1. Complexity: More code paths to test and maintain
  2. Computational Cost: Multiple passes increase processing time by ~30%
  3. Parameter Tuning: Requires experimentation to optimize per-domain
  4. Storage: Temporary storage increases (all raw frames before dedup)

Mitigation Strategies

  1. Parallel Processing: Run strategies concurrently where possible
  2. Smart Ordering: Run fast strategies first (scene change) before slow (slide detection)
  3. Early Exit: Apply frame limit during extraction, not just after
  4. Profiling: Instrument each strategy to identify bottlenecks

Implementation Notes

Frame Deduplication Algorithm

from typing import List

def deduplicate_frames(frames: List[Frame], tolerance: float = 0.5) -> List[Frame]:
    """
    Remove frames within temporal tolerance.

    Priority order:
    1. slide_detection > text_density > scene_change > fixed_interval
    2. Earlier timestamp preferred within tolerance window
    """
    if not frames:
        return frames

    # Sort by priority (descending), then timestamp, so that
    # higher-priority frames claim their tolerance window first
    priority_map = {
        'slide_detection': 3,
        'text_density': 2,
        'scene_change': 1,
        'fixed_interval': 0,
    }

    frames.sort(key=lambda f: (
        -priority_map.get(f.extraction_method, 0),
        f.timestamp,
    ))

    deduplicated = [frames[0]]

    for frame in frames[1:]:
        # Check temporal proximity to any kept frame
        too_close = any(
            abs(frame.timestamp - kept.timestamp) < tolerance
            for kept in deduplicated
        )

        if not too_close:
            deduplicated.append(frame)

    # Restore chronological order for downstream processing
    deduplicated.sort(key=lambda f: f.timestamp)
    return deduplicated
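A quick demonstration of the priority behavior, assuming a minimal `Frame` dataclass (hypothetical; the real project type presumably carries more fields). A condensed, self-contained variant of the algorithm is inlined here, processing higher-priority methods first as the docstring's priority order specifies:

```python
from dataclasses import dataclass

@dataclass
class Frame:
    timestamp: float
    extraction_method: str

PRIORITY = {'slide_detection': 3, 'text_density': 2, 'scene_change': 1, 'fixed_interval': 0}

def deduplicate_frames(frames, tolerance=0.5):
    # Higher-priority methods claim their tolerance window first,
    # then survivors are returned in chronological order
    ordered = sorted(frames, key=lambda f: (-PRIORITY.get(f.extraction_method, 0), f.timestamp))
    kept = []
    for f in ordered:
        if all(abs(f.timestamp - k.timestamp) >= tolerance for k in kept):
            kept.append(f)
    return sorted(kept, key=lambda f: f.timestamp)

# Two frames 0.2s apart: the slide_detection frame beats fixed_interval
frames = [
    Frame(10.0, 'fixed_interval'),
    Frame(10.2, 'slide_detection'),
    Frame(30.0, 'scene_change'),
]
result = deduplicate_frames(frames)
```

Here the `fixed_interval` frame at 10.0s falls inside the 0.5s window of the kept `slide_detection` frame at 10.2s and is dropped, even though it is earlier.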

Performance Optimization

# Parallel strategy execution
import asyncio
import logging

logger = logging.getLogger(__name__)

async def extract_all_strategies(video_path, config):
    tasks = [
        extract_scene_changes(video_path, config),
        extract_fixed_interval(video_path, config),
        extract_slide_content(video_path, config),
        extract_text_dense(video_path, config),
    ]

    results = await asyncio.gather(*tasks, return_exceptions=True)

    # Merge results, filter exceptions
    all_frames = []
    for result in results:
        if isinstance(result, Exception):
            logger.error(f"Strategy failed: {result}")
        else:
            all_frames.extend(result)

    return deduplicate_frames(all_frames)

Validation Metrics

Track these metrics to validate the decision:

metrics_to_track = {
    'extraction_performance': {
        'frames_extracted_per_strategy': Counter(),
        'frames_after_deduplication': int,
        'deduplication_rate': float,       # (raw - final) / raw
    },
    'cost_metrics': {
        'frames_per_video': float,
        'cost_per_video': float,
        'cost_per_minute': float,
    },
    'quality_metrics': {
        'slide_capture_rate': float,       # % of slides captured
        'transition_capture_rate': float,  # % of transitions captured
        'redundancy_score': float,         # Avg similarity between frames
    },
    'performance': {
        'extraction_time_seconds': float,
        'time_per_strategy': Dict[str, float],
    },
}

Alternative Approaches for Future

Adaptive Sampling

# Future enhancement: ML-based frame importance scoring
class AdaptiveSampler:
    def __init__(self):
        self.importance_model = load_pretrained_model()

    def sample(self, video_path):
        # Use lightweight ML model to score frame importance
        # Extract only top-K frames
        # Could reduce frames by additional 30-50%
        pass

Content-Type Detection

# Detect video type and adjust strategy weights
video_types = {
    'lecture_presentation': {
        'slide_detection': 2.0,
        'scene_change': 0.5,
        'fixed_interval': 0.3,
    },
    'demonstration': {
        'scene_change': 2.0,
        'slide_detection': 0.5,
        'fixed_interval': 1.0,
    },
    'interview': {
        'scene_change': 1.5,
        'fixed_interval': 1.0,
        'slide_detection': 0.0,
    },
}
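Applying such a weight table to `SAMPLING_CONFIG` could look like the following sketch. `apply_type_weights` is a hypothetical helper; the convention that a weight of 0.0 disables a strategy is an assumption:

```python
import copy

def apply_type_weights(config: dict, type_weights: dict) -> dict:
    """Return a copy of the sampling config with per-type weights applied.

    Strategies absent from the weight table keep their defaults;
    a weight of 0.0 disables the strategy entirely.
    """
    adjusted = copy.deepcopy(config)  # never mutate the shared default config
    for strategy in adjusted['strategies']:
        w = type_weights.get(strategy['name'])
        if w is not None:
            strategy['weight'] = w
            strategy['enabled'] = w > 0.0
    return adjusted
```

For an `interview` video this would raise `scene_change` to 1.5 and switch `slide_detection` off without touching the global defaults.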

Review Schedule

  • 1 month: Evaluate frame counts and costs against actuals
  • 3 months: Assess quality metrics (slide capture rate)
  • 6 months: Consider ML-based adaptive sampling

References