ADR-002: Multi-Strategy Frame Sampling Approach
Date: 2026-01-19
Status: Accepted
Decision Makers: System Architect, ML Engineer
Related: ADR-001 (Vision Model Choice)
Context
Video analysis requires extracting representative frames for vision model processing. Key challenges:
- Cost Constraint: Each frame costs $0.004-$0.008 (API fees)
- Redundancy Problem: Adjacent frames in video are highly similar
- Content Diversity: Educational videos have slides, diagrams, scenes, transitions
- Completeness: Must capture all important visual information
- Processing Time: More frames = longer processing
Problem Statement
Given a 60-minute educational video at 30fps (108,000 frames):
- Naive sampling (1 fps) → 3,600 frames → $14.40 → unacceptable cost
- Fixed interval (5s) → 720 frames → $2.88 → still too high
- Need: Extract 100-200 frames that capture all significant content
Decision Drivers
- Capture Slide Content: Educational videos have static slides for 5-30 seconds
- Detect Transitions: Scene changes indicate new content
- Minimize Redundancy: Adjacent frames with <5% difference are redundant
- Maximize Information Density: Prioritize frames with text/diagrams
- Cost Efficiency: Target 120-180 frames per hour of video
Options Considered
Option 1: Fixed-Interval Sampling
Approach: Extract every N seconds
frames_per_video = duration_seconds / interval_seconds
# 60-min video, 5s interval → 720 frames
Pros:
- Simple to implement
- Predictable frame count
- Uniform temporal coverage
Cons:
- High redundancy: Samples many similar frames during static slides
- Misses transitions: May skip important scene changes
- Wasteful: Processes many low-information frames
- Cost: 720 frames × $0.004 = $2.88 per video
Verdict: ❌ Too expensive and redundant
Option 2: Scene Change Detection Only
Approach: Extract frames only at detected scene transitions
# FFmpeg select filter
scene_threshold = 0.4
# Typical educational video: 30-80 scene changes
Pros:
- Low redundancy: Only captures transitions
- Cost-efficient: 50-80 frames per video
- Fast: FFmpeg implements it efficiently
- Good for dynamic content: Captures all scene changes
Cons:
- Misses slide content: Static slides may have no scene changes
- Under-samples presentations: Educational videos have long static periods
- Incomplete: May miss important textual content
Verdict: ❌ Insufficient for educational content
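For illustration, the scene-change idea can be sketched in pure Python. The production path would use FFmpeg's `select` filter with `gt(scene,0.4)`; the mean-absolute-difference metric and the `detect_scene_changes` helper below are simplified stand-ins, not the FFmpeg scene score.

```python
import numpy as np

def detect_scene_changes(frames, threshold=0.4):
    """Return indices of frames whose normalized difference from the
    previous frame exceeds `threshold` (0..1 scale)."""
    changes = []
    for i in range(1, len(frames)):
        diff = np.mean(np.abs(frames[i].astype(float)
                              - frames[i - 1].astype(float))) / 255.0
        if diff > threshold:
            changes.append(i)
    return changes

# Two black frames, then a white frame: one scene change at index 2.
frames = [np.zeros((4, 4)), np.zeros((4, 4)), np.full((4, 4), 255)]
print(detect_scene_changes(frames))  # [2]
```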
Option 3: Content-Aware Sampling (Slide Detection)
Approach: Detect static content periods and sample once per stable period
# Algorithm:
# 1. Track frame-to-frame similarity
# 2. When similarity > 95% for N frames, extract one frame
# 3. Reset on significant change
Pros:
- Excellent for slides: One frame per slide (5-30s static content)
- Low redundancy: Only samples distinct content
- Complete capture: Gets all slide content
Cons:
- Misses transitions: May skip scene changes between slides
- Computational cost: Requires frame comparison
- Tuning required: Similarity threshold varies by content
Verdict: ✅ Good for presentations, but incomplete alone
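The three algorithm steps above can be sketched as a single pass over decoded frames. `detect_stable_frames` and its parameters are illustrative assumptions, not the codebase's API; a real implementation would compare downsampled grayscale frames for speed.

```python
import numpy as np

def detect_stable_frames(frames, similarity=0.95, min_stable=3):
    """Emit one frame index per stable run: once frame-to-frame
    similarity stays >= `similarity` for `min_stable` frames, pick the
    first frame of the run, then reset on a significant change."""
    picks = []
    run_start = 0
    emitted = False
    for i in range(1, len(frames)):
        diff = np.mean(np.abs(frames[i].astype(float)
                              - frames[i - 1].astype(float))) / 255.0
        if 1.0 - diff >= similarity:
            if not emitted and i - run_start + 1 >= min_stable:
                picks.append(run_start)
                emitted = True
        else:
            run_start = i
            emitted = False
    return picks

# Two "slides": five identical dark frames, then five identical bright ones.
frames = [np.zeros((4, 4))] * 5 + [np.full((4, 4), 255)] * 5
print(detect_stable_frames(frames))  # [0, 5]
```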
Option 4: Text Density Sampling
Approach: Use edge detection to identify frames with high text density
# Algorithm:
# 1. Apply Canny edge detection
# 2. Calculate edge pixel ratio
# 3. Sample frames with ratio > threshold (5%)
Pros:
- Targets informative content: Text indicates important information
- Works for diagrams: Charts/diagrams have high edge density
- Fast: Edge detection is computationally cheap
Cons:
- False positives: Textured scenes trigger false detection
- Misses visual-only content: Photographs, demonstrations
- Overlap with slide detection: Similar frames captured
Verdict: ✅ Good complement to other strategies
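A minimal sketch of the edge-ratio check, using a plain intensity-gradient threshold as a cheap stand-in for Canny (a real implementation would call `cv2.Canny`); `edge_density` and `is_text_dense` are hypothetical helpers.

```python
import numpy as np

def edge_density(frame, grad_threshold=32):
    """Fraction of pixels whose horizontal or vertical intensity
    gradient exceeds grad_threshold -- a cheap proxy for Canny edges."""
    f = frame.astype(float)
    gy = np.abs(np.diff(f, axis=0, prepend=f[:1]))      # vertical gradient
    gx = np.abs(np.diff(f, axis=1, prepend=f[:, :1]))   # horizontal gradient
    edges = (gx > grad_threshold) | (gy > grad_threshold)
    return edges.mean()

def is_text_dense(frame, edge_threshold=0.05):
    """Apply the 5% edge-pixel ratio from the algorithm above."""
    return edge_density(frame) > edge_threshold

# A flat frame has no edges; vertical stripes (text-like) have many.
flat = np.zeros((8, 8))
stripes = np.tile([0, 255], (8, 4))
print(is_text_dense(flat), is_text_dense(stripes))  # False True
```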
Option 5: Multi-Strategy Hybrid (Selected)
Approach: Combine multiple strategies, then deduplicate
strategies = [
    'scene_change',      # Capture transitions
    'slide_detection',   # Capture static slides
    'fixed_interval',    # Ensure temporal coverage
    'text_density'       # Prioritize informative frames
]

# Extract frames using all strategies
all_frames = []
for strategy in strategies:
    frames = extract_with_strategy(strategy)
    all_frames.extend(frames)

# Deduplicate: remove frames within 0.5s of each other
final_frames = deduplicate(all_frames, tolerance=0.5)
Pros:
- Comprehensive: Captures all important content types
- Redundancy control: Deduplication removes overlap
- Flexible: Can adjust strategy weights per video type
- Robust: Multiple strategies compensate for each other's weaknesses
Cons:
- Complexity: More code to maintain
- Computational cost: Multiple passes over video
- Tuning: Requires parameter optimization
Verdict: ✅ SELECTED - Best balance of completeness and efficiency
Decision
Multi-Strategy Hybrid Sampling
Configuration
SAMPLING_CONFIG = {
    'strategies': [
        {
            'name': 'scene_change',
            'enabled': True,
            'threshold': 0.4,
            'weight': 1.0
        },
        {
            'name': 'fixed_interval',
            'enabled': True,
            'interval_seconds': 5.0,
            'weight': 0.5  # Lower priority
        },
        {
            'name': 'slide_detection',
            'enabled': True,
            'stability_duration': 2.0,  # 2s of stable content
            'weight': 1.5  # Higher priority
        },
        {
            'name': 'text_density',
            'enabled': True,
            'edge_threshold': 0.05,  # 5% edge pixels
            'sample_interval': 2.0,  # Check every 2s
            'weight': 1.2
        }
    ],
    'deduplication': {
        'enabled': True,
        'temporal_tolerance': 0.5,    # 0.5s window
        'similarity_threshold': 0.95  # 95% similar = duplicate
    },
    'limits': {
        'max_frames_per_video': 500,
        'min_frames_per_minute': 1,
        'max_frames_per_minute': 5
    }
}
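The `limits` block can be enforced after deduplication with a small helper. `enforce_limits` is a hypothetical sketch: when over the cap it keeps an evenly spaced subset to preserve temporal coverage; backfilling toward `min_frames_per_minute` (e.g. from fixed-interval candidates) is omitted here.

```python
def enforce_limits(frames, duration_minutes, limits):
    """Clamp a timestamp-sorted frame list to the configured bounds."""
    max_total = min(limits['max_frames_per_video'],
                    int(limits['max_frames_per_minute'] * duration_minutes))
    if len(frames) <= max_total:
        return frames
    # Over the cap: keep an evenly spaced subset.
    step = len(frames) / max_total
    return [frames[int(i * step)] for i in range(max_total)]

limits = {'max_frames_per_video': 500,
          'min_frames_per_minute': 1,
          'max_frames_per_minute': 5}
raw = list(range(1000))  # stand-in for 1,000 raw candidate frames
capped = enforce_limits(raw, duration_minutes=60, limits=limits)
print(len(capped))  # 300 (5 frames/min * 60 min < 500 cap)
```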
Expected Frame Counts
# 60-minute educational video analysis
frame_extraction_estimates = {
    'scene_change': 40,        # ~40 scene transitions
    'fixed_interval_5s': 720,  # Every 5 seconds (before weight)
    'slide_detection': 60,     # ~60 distinct slides
    'text_density': 180,       # ~3 per minute
    'raw_total': 1000,         # Sum before dedup
    'after_dedup': 150,        # ~85% reduction
    'final_count': 150         # Target achieved
}

cost_analysis = {
    'frames': 150,
    'cost_per_frame': 0.004,
    'total_cost': 0.60,
    'target': 1.50,
    'margin': '60% under budget'
}
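The cost figures above can be reproduced in a few lines:

```python
frames = 150
cost_per_frame = 0.004            # API fee per frame (lower bound)
total = frames * cost_per_frame   # $0.60 per video
budget = 1.50
margin = (budget - total) / budget
print(f"${total:.2f} ({margin:.0%} under budget)")  # $0.60 (60% under budget)
```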
Consequences
Positive
- Complete Coverage: All strategy types compensate for each other's blind spots
- Cost Efficient: 150 frames vs 720 (79% reduction)
- Quality: Captures all slides, transitions, and key moments
- Flexibility: Can disable strategies for specific video types
- Robust: System continues if one strategy fails
Negative
- Complexity: More code paths to test and maintain
- Computational Cost: Multiple passes increase processing time by ~30%
- Parameter Tuning: Requires experimentation to optimize per-domain
- Storage: Temporary storage increases (all raw frames before dedup)
Mitigation Strategies
- Parallel Processing: Run strategies concurrently where possible
- Smart Ordering: Run fast strategies first (scene change) before slow (slide detection)
- Early Exit: Apply frame limit during extraction, not just after
- Profiling: Instrument each strategy to identify bottlenecks
Implementation Notes
Frame Deduplication Algorithm
from typing import List

def deduplicate_frames(frames: List[Frame], tolerance: float = 0.5) -> List[Frame]:
    """
    Remove frames within temporal tolerance.

    Priority order:
    1. slide_detection > text_density > scene_change > fixed_interval
    2. Earlier timestamp preferred within tolerance window
    """
    if not frames:
        return frames

    # Sort by priority (descending), then timestamp, so that within a
    # tolerance window the highest-priority frame is the one kept.
    priority_map = {
        'slide_detection': 3,
        'text_density': 2,
        'scene_change': 1,
        'fixed_interval': 0
    }
    frames.sort(key=lambda f: (
        -priority_map.get(f.extraction_method, 0),
        f.timestamp
    ))

    deduplicated = [frames[0]]
    for frame in frames[1:]:
        # Check temporal proximity to every frame already kept
        too_close = any(
            abs(frame.timestamp - kept.timestamp) < tolerance
            for kept in deduplicated
        )
        if not too_close:
            deduplicated.append(frame)

    # Restore chronological order for downstream processing
    deduplicated.sort(key=lambda f: f.timestamp)
    return deduplicated
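A self-contained demo of the intended deduplication behavior. The `Frame` dataclass here is a stand-in for the production type, and the function body is condensed from the algorithm above: priority-first ordering means the slide-detection frame wins over a fixed-interval frame landing in the same 0.5s window.

```python
from dataclasses import dataclass

@dataclass
class Frame:
    timestamp: float
    extraction_method: str

PRIORITY = {'slide_detection': 3, 'text_density': 2,
            'scene_change': 1, 'fixed_interval': 0}

def deduplicate_frames(frames, tolerance=0.5):
    # Highest-priority strategies first; earlier timestamps break ties.
    ordered = sorted(frames, key=lambda f: (
        -PRIORITY.get(f.extraction_method, 0), f.timestamp))
    kept = []
    for frame in ordered:
        if all(abs(frame.timestamp - k.timestamp) >= tolerance for k in kept):
            kept.append(frame)
    return sorted(kept, key=lambda f: f.timestamp)

frames = [
    Frame(10.0, 'fixed_interval'),
    Frame(10.2, 'slide_detection'),  # wins: within 0.5s of 10.0, higher priority
    Frame(12.0, 'scene_change'),
]
print([f.extraction_method for f in deduplicate_frames(frames)])
# ['slide_detection', 'scene_change']
```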
Performance Optimization
# Parallel strategy execution
import asyncio
import logging

logger = logging.getLogger(__name__)

# Parallel strategy execution
async def extract_all_strategies(video_path, config):
    tasks = [
        extract_scene_changes(video_path, config),
        extract_fixed_interval(video_path, config),
        extract_slide_content(video_path, config),
        extract_text_dense(video_path, config)
    ]
    results = await asyncio.gather(*tasks, return_exceptions=True)

    # Merge results, skipping failed strategies
    all_frames = []
    for result in results:
        if isinstance(result, Exception):
            logger.error(f"Strategy failed: {result}")
        else:
            all_frames.extend(result)
    return deduplicate_frames(all_frames)
Validation Metrics
Track these metrics to validate the decision:
metrics_to_track = {
    'extraction_performance': {
        'frames_extracted_per_strategy': Counter(),
        'frames_after_deduplication': int,
        'deduplication_rate': float,  # (raw - final) / raw
    },
    'cost_metrics': {
        'frames_per_video': float,
        'cost_per_video': float,
        'cost_per_minute': float
    },
    'quality_metrics': {
        'slide_capture_rate': float,       # % of slides captured
        'transition_capture_rate': float,  # % of transitions captured
        'redundancy_score': float,         # Avg similarity between frames
    },
    'performance': {
        'extraction_time_seconds': float,
        'time_per_strategy': Dict[str, float]
    }
}
Alternative Approaches for Future
Adaptive Sampling
# Future enhancement: ML-based frame importance scoring
class AdaptiveSampler:
    def __init__(self):
        self.importance_model = load_pretrained_model()

    def sample(self, video_path):
        # Use a lightweight ML model to score frame importance,
        # then extract only the top-K frames.
        # Could reduce frame count by an additional 30-50%.
        pass
Content-Type Detection
# Detect video type and adjust strategy weights
video_types = {
    'lecture_presentation': {
        'slide_detection': 2.0,
        'scene_change': 0.5,
        'fixed_interval': 0.3
    },
    'demonstration': {
        'scene_change': 2.0,
        'slide_detection': 0.5,
        'fixed_interval': 1.0
    },
    'interview': {
        'scene_change': 1.5,
        'fixed_interval': 1.0,
        'slide_detection': 0.0
    }
}
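Applying these per-type weights to the sampling config could look like the sketch below. `apply_video_type` is a hypothetical helper (not in the codebase); it assumes a weight of 0.0 disables a strategy outright, as in the interview profile.

```python
import copy

def apply_video_type(config, video_types, detected_type):
    """Overlay per-type weights onto a deep copy of the sampling config.
    A weight of 0.0 disables the strategy entirely."""
    cfg = copy.deepcopy(config)
    overrides = video_types.get(detected_type, {})
    for strategy in cfg['strategies']:
        if strategy['name'] in overrides:
            strategy['weight'] = overrides[strategy['name']]
            strategy['enabled'] = strategy['weight'] > 0.0
    return cfg

base = {'strategies': [
    {'name': 'slide_detection', 'enabled': True, 'weight': 1.5},
    {'name': 'scene_change', 'enabled': True, 'weight': 1.0},
]}
video_types = {'interview': {'scene_change': 1.5,
                             'fixed_interval': 1.0,
                             'slide_detection': 0.0}}
tuned = apply_video_type(base, video_types, 'interview')
print(tuned['strategies'][0])  # slide_detection disabled, weight 0.0
```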
Review Schedule
- 1 month: Evaluate frame counts and costs against actuals
- 3 months: Assess quality metrics (slide capture rate)
- 6 months: Consider ML-based adaptive sampling
References
- FFmpeg Scene Detection: https://ffmpeg.org/ffmpeg-filters.html#select_002c-aselect
- OpenCV Optical Flow (Lucas-Kanade) Tutorial: https://docs.opencv.org/4.x/d7/d8b/tutorial_py_lucas_kanade.html
- "Keyframe Extraction for Video Summarization" - IEEE 2020
- Internal: Frame Sampling Experiments Notebook