ADR-002: Multi-Strategy Frame Sampling Approach

Date: 2026-01-19
Status: Accepted
Decision Makers: System Architect, ML Engineer
Related: ADR-001 (Vision Model Choice)


Context

Video analysis requires extracting representative frames for vision model processing. Key challenges:

  1. Cost Constraint: Each frame costs $0.004-$0.008 (API fees)
  2. Redundancy Problem: Adjacent frames in video are highly similar
  3. Content Diversity: Educational videos have slides, diagrams, scenes, transitions
  4. Completeness: Must capture all important visual information
  5. Processing Time: More frames = longer processing

Problem Statement

Given a 60-minute educational video at 30fps (108,000 frames):

  • Naive sampling (1 fps) → 3,600 frames → $14.40 → unacceptable cost
  • Fixed interval (5s) → 720 frames → $2.88 → still too high
  • Need: Extract 100-200 frames that capture all significant content

Decision Drivers

  1. Capture Slide Content: Educational videos have static slides for 5-30 seconds
  2. Detect Transitions: Scene changes indicate new content
  3. Minimize Redundancy: Adjacent frames with <5% difference are redundant
  4. Maximize Information Density: Prioritize frames with text/diagrams
  5. Cost Efficiency: Target 120-180 frames per hour of video

Options Considered

Option 1: Fixed-Interval Sampling

Approach: Extract every N seconds

frames_per_video = duration_seconds / interval_seconds
# 60-min video, 5s interval → 720 frames

Pros:

  • Simple to implement
  • Predictable frame count
  • Uniform temporal coverage

Cons:

  • High redundancy: Samples many similar frames during static slides
  • Misses transitions: May skip important scene changes
  • Wasteful: Processes many low-information frames
  • Cost: 720 frames × $0.004 = $2.88 per video

Verdict: ❌ Too expensive and redundant


Option 2: Scene Change Detection Only

Approach: Extract frames only at detected scene transitions

# FFmpeg select filter: keep only frames whose scene-change score
# exceeds the threshold (0.4 works well for most educational content)
ffmpeg -i input.mp4 -vf "select='gt(scene,0.4)'" -vsync vfr frames/%04d.jpg
# Typical educational video: 30-80 scene changes

Pros:

  • Low redundancy: Only captures transitions
  • Cost-efficient: 50-80 frames per video
  • Fast: FFmpeg implements efficiently
  • Good for dynamic content: Captures all scene changes

Cons:

  • Misses slide content: Static slides may have no scene changes
  • Under-samples presentations: Educational videos have long static periods
  • Incomplete: May miss important textual content

Verdict: ❌ Insufficient for educational content


Option 3: Content-Aware Sampling (Slide Detection)

Approach: Detect static content periods and sample once per stable period

# Algorithm:
# 1. Track frame-to-frame similarity
# 2. When similarity > 95% for N frames, extract one frame
# 3. Reset on significant change
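The three steps above can be sketched as follows. This is an illustrative implementation, not production code: `sample_stable_periods` is a hypothetical helper, `frames` is assumed to be a sequence of grayscale frames as NumPy arrays, and the per-pixel tolerance of 10 intensity levels is an arbitrary choice.

```python
import numpy as np

def sample_stable_periods(frames, similarity_threshold=0.95, min_stable_frames=3):
    """Emit one frame index per stable (static) period.

    Similarity between adjacent frames is measured as the fraction of
    pixels whose intensity changed by less than a small tolerance.
    """
    sampled = []
    stable_run = 0
    for i in range(1, len(frames)):
        diff = np.abs(frames[i].astype(int) - frames[i - 1].astype(int))
        similarity = np.mean(diff < 10)  # fraction of near-identical pixels
        if similarity >= similarity_threshold:
            stable_run += 1
            if stable_run == min_stable_frames:
                sampled.append(i)  # exactly one frame per stable period
        else:
            stable_run = 0  # significant change: reset the run
    return sampled
```

On a synthetic clip of two static slides, this yields one index per slide once each has been stable for `min_stable_frames` comparisons.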

Pros:

  • Excellent for slides: One frame per slide (5-30s static content)
  • Low redundancy: Only samples distinct content
  • Complete capture: Gets all slide content

Cons:

  • Misses transitions: May skip scene changes between slides
  • Computational cost: Requires frame comparison
  • Tuning required: Similarity threshold varies by content

Verdict: ✅ Good for presentations, but incomplete alone


Option 4: Text Density Sampling

Approach: Use edge detection to identify frames with high text density

# Algorithm:
# 1. Apply Canny edge detection
# 2. Calculate edge pixel ratio
# 3. Sample frames with ratio > threshold (5%)
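A minimal sketch of the edge-ratio test, using a simple intensity-gradient threshold as a stand-in for Canny so the example stays dependency-free (a production version would use an actual Canny detector); the gradient threshold of 30 is an assumed value.

```python
import numpy as np

def edge_density(gray, edge_threshold=30):
    """Approximate edge-pixel ratio for a grayscale frame.

    A pixel counts as an edge when the magnitude of its intensity
    gradient exceeds edge_threshold.
    """
    gy, gx = np.gradient(gray.astype(float))
    magnitude = np.hypot(gx, gy)
    return float(np.mean(magnitude > edge_threshold))

def is_text_dense(gray, ratio_threshold=0.05):
    # Sample the frame when more than 5% of pixels sit on edges
    return edge_density(gray) > ratio_threshold
```

A uniform frame scores 0, while a frame with sharp vertical boundaries (as text and diagrams produce) clears the 5% threshold.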

Pros:

  • Targets informative content: Text indicates important information
  • Works for diagrams: Charts/diagrams have high edge density
  • Fast: Edge detection is computationally cheap

Cons:

  • False positives: Textured scenes trigger false detection
  • Misses visual-only content: Photographs, demonstrations
  • Overlap with slide detection: Similar frames captured

Verdict: ✅ Good complement to other strategies


Option 5: Multi-Strategy Hybrid (Selected)

Approach: Combine multiple strategies, then deduplicate

strategies = [
    'scene_change',     # Capture transitions
    'slide_detection',  # Capture static slides
    'fixed_interval',   # Ensure temporal coverage
    'text_density',     # Prioritize informative frames
]

# Extract frames using all strategies
all_frames = []
for strategy in strategies:
    frames = extract_with_strategy(strategy)
    all_frames.extend(frames)

# Deduplicate: remove frames within 0.5s of each other
final_frames = deduplicate(all_frames, tolerance=0.5)

Pros:

  • Comprehensive: Captures all important content types
  • Redundancy control: Deduplication removes overlap
  • Flexible: Can adjust strategy weights per video type
  • Robust: Multiple strategies compensate for each other's weaknesses

Cons:

  • Complexity: More code to maintain
  • Computational cost: Multiple passes over video
  • Tuning: Requires parameter optimization

Verdict: ✅ SELECTED - Best balance of completeness and efficiency

Decision

Multi-Strategy Hybrid Sampling

Configuration

SAMPLING_CONFIG = {
    'strategies': [
        {
            'name': 'scene_change',
            'enabled': True,
            'threshold': 0.4,
            'weight': 1.0,
        },
        {
            'name': 'fixed_interval',
            'enabled': True,
            'interval_seconds': 5.0,
            'weight': 0.5,              # Lower priority
        },
        {
            'name': 'slide_detection',
            'enabled': True,
            'stability_duration': 2.0,  # 2s of stable content
            'weight': 1.5,              # Higher priority
        },
        {
            'name': 'text_density',
            'enabled': True,
            'edge_threshold': 0.05,     # 5% edge pixels
            'sample_interval': 2.0,     # Check every 2s
            'weight': 1.2,
        },
    ],
    'deduplication': {
        'enabled': True,
        'temporal_tolerance': 0.5,      # 0.5s window
        'similarity_threshold': 0.95,   # 95% similar = duplicate
    },
    'limits': {
        'max_frames_per_video': 500,
        'min_frames_per_minute': 1,
        'max_frames_per_minute': 5,
    },
}
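One way the `limits` block might be enforced is to clamp the deduplicated count into the per-minute band, then apply the absolute per-video cap. `clamp_frame_count` is a hypothetical helper, not part of the implemented system:

```python
def clamp_frame_count(n_frames: int, duration_minutes: float, limits: dict) -> int:
    """Apply the 'limits' block of SAMPLING_CONFIG to a candidate frame count."""
    low = int(limits['min_frames_per_minute'] * duration_minutes)
    high = int(limits['max_frames_per_minute'] * duration_minutes)
    # First clamp to the per-minute band, then to the per-video cap
    clamped = max(low, min(n_frames, high))
    return min(clamped, limits['max_frames_per_video'])
```

For a 60-minute video this maps a raw count of 1,000 down to 300 (the 5/min ceiling), leaves 150 untouched, and lifts an under-sampled 10 up to 60 (the 1/min floor).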

Expected Frame Counts

# 60-minute educational video analysis
frame_extraction_estimates = {
    'scene_change': 40,        # ~40 scene transitions
    'fixed_interval_5s': 720,  # Every 5 seconds (before weighting)
    'slide_detection': 60,     # ~60 distinct slides
    'text_density': 180,       # ~3 per minute

    'raw_total': 1000,         # Sum before dedup
    'after_dedup': 150,        # ~85% reduction
    'final_count': 150,        # Target achieved
}

cost_analysis = {
    'frames': 150,
    'cost_per_frame': 0.004,   # USD per frame
    'total_cost': 0.60,        # 150 × $0.004
    'target': 1.50,            # Per-video budget
    'margin': '60% under budget',
}

Consequences

Positive

  1. Complete Coverage: All strategy types compensate for each other's blind spots
  2. Cost Efficient: 150 frames vs 720 (79% reduction)
  3. Quality: Captures all slides, transitions, and key moments
  4. Flexibility: Can disable strategies for specific video types
  5. Robust: System continues if one strategy fails

Negative

  1. Complexity: More code paths to test and maintain
  2. Computational Cost: Multiple passes increase processing time by ~30%
  3. Parameter Tuning: Requires experimentation to optimize per-domain
  4. Storage: Temporary storage increases (all raw frames before dedup)

Mitigation Strategies

  1. Parallel Processing: Run strategies concurrently where possible
  2. Smart Ordering: Run fast strategies first (scene change) before slow (slide detection)
  3. Early Exit: Apply frame limit during extraction, not just after
  4. Profiling: Instrument each strategy to identify bottlenecks

Implementation Notes

Frame Deduplication Algorithm

from typing import List

def deduplicate_frames(frames: List[Frame], tolerance: float = 0.5) -> List[Frame]:
    """
    Remove frames within temporal tolerance.

    Priority order:
    1. slide_detection > text_density > scene_change > fixed_interval
    2. Earlier timestamp preferred within tolerance window
    """
    if not frames:
        return frames

    # Sort by priority (descending), then timestamp, so that
    # higher-priority frames claim their tolerance window first
    priority_map = {
        'slide_detection': 3,
        'text_density': 2,
        'scene_change': 1,
        'fixed_interval': 0,
    }

    frames.sort(key=lambda f: (
        -priority_map.get(f.extraction_method, 0),
        f.timestamp,
    ))

    deduplicated = [frames[0]]

    for frame in frames[1:]:
        # Check temporal proximity to any kept frame
        too_close = any(
            abs(frame.timestamp - kept.timestamp) < tolerance
            for kept in deduplicated
        )

        if not too_close:
            deduplicated.append(frame)

    # Restore chronological order for downstream processing
    deduplicated.sort(key=lambda f: f.timestamp)
    return deduplicated
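A quick demonstration of the priority behavior, assuming a minimal `Frame` dataclass (hypothetical; the real project type presumably carries more fields). A condensed, self-contained variant of the algorithm is inlined here, processing higher-priority methods first as the docstring's priority order specifies:

```python
from dataclasses import dataclass

@dataclass
class Frame:
    timestamp: float
    extraction_method: str

PRIORITY = {'slide_detection': 3, 'text_density': 2, 'scene_change': 1, 'fixed_interval': 0}

def deduplicate_frames(frames, tolerance=0.5):
    # Higher-priority methods claim their tolerance window first,
    # then survivors are returned in chronological order
    ordered = sorted(frames, key=lambda f: (-PRIORITY.get(f.extraction_method, 0), f.timestamp))
    kept = []
    for f in ordered:
        if all(abs(f.timestamp - k.timestamp) >= tolerance for k in kept):
            kept.append(f)
    return sorted(kept, key=lambda f: f.timestamp)

# Two frames 0.2s apart: the slide_detection frame beats fixed_interval
frames = [
    Frame(10.0, 'fixed_interval'),
    Frame(10.2, 'slide_detection'),
    Frame(30.0, 'scene_change'),
]
result = deduplicate_frames(frames)
```

Here the `fixed_interval` frame at 10.0s falls inside the 0.5s window of the kept `slide_detection` frame at 10.2s and is dropped, even though it is earlier.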

Performance Optimization

# Parallel strategy execution
import asyncio
import logging

logger = logging.getLogger(__name__)

async def extract_all_strategies(video_path, config):
    tasks = [
        extract_scene_changes(video_path, config),
        extract_fixed_interval(video_path, config),
        extract_slide_content(video_path, config),
        extract_text_dense(video_path, config),
    ]

    results = await asyncio.gather(*tasks, return_exceptions=True)

    # Merge results, filter exceptions
    all_frames = []
    for result in results:
        if isinstance(result, Exception):
            logger.error(f"Strategy failed: {result}")
        else:
            all_frames.extend(result)

    return deduplicate_frames(all_frames)

Validation Metrics

Track these metrics to validate the decision:

metrics_to_track = {
    'extraction_performance': {
        'frames_extracted_per_strategy': Counter(),
        'frames_after_deduplication': int,
        'deduplication_rate': float,       # (raw - final) / raw
    },
    'cost_metrics': {
        'frames_per_video': float,
        'cost_per_video': float,
        'cost_per_minute': float,
    },
    'quality_metrics': {
        'slide_capture_rate': float,       # % of slides captured
        'transition_capture_rate': float,  # % of transitions captured
        'redundancy_score': float,         # Avg similarity between frames
    },
    'performance': {
        'extraction_time_seconds': float,
        'time_per_strategy': Dict[str, float],
    },
}

Alternative Approaches for Future

Adaptive Sampling

# Future enhancement: ML-based frame importance scoring
class AdaptiveSampler:
    def __init__(self):
        self.importance_model = load_pretrained_model()

    def sample(self, video_path):
        # Use lightweight ML model to score frame importance
        # Extract only top-K frames
        # Could reduce frames by additional 30-50%
        pass

Content-Type Detection

# Detect video type and adjust strategy weights
video_types = {
    'lecture_presentation': {
        'slide_detection': 2.0,
        'scene_change': 0.5,
        'fixed_interval': 0.3,
    },
    'demonstration': {
        'scene_change': 2.0,
        'slide_detection': 0.5,
        'fixed_interval': 1.0,
    },
    'interview': {
        'scene_change': 1.5,
        'fixed_interval': 1.0,
        'slide_detection': 0.0,
    },
}
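Applying such a weight table to `SAMPLING_CONFIG` could look like the following sketch. `apply_type_weights` is a hypothetical helper; the convention that a weight of 0.0 disables a strategy is an assumption:

```python
import copy

def apply_type_weights(config: dict, type_weights: dict) -> dict:
    """Return a copy of the sampling config with per-type weights applied.

    Strategies absent from the weight table keep their defaults;
    a weight of 0.0 disables the strategy entirely.
    """
    adjusted = copy.deepcopy(config)  # never mutate the shared default config
    for strategy in adjusted['strategies']:
        w = type_weights.get(strategy['name'])
        if w is not None:
            strategy['weight'] = w
            strategy['enabled'] = w > 0.0
    return adjusted
```

For an `interview` video this would raise `scene_change` to 1.5 and switch `slide_detection` off without touching the global defaults.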

Review Schedule

  • 1 month: Evaluate frame counts and costs against actuals
  • 3 months: Assess quality metrics (slide capture rate)
  • 6 months: Consider ML-based adaptive sampling

References