ADR-001: Choice of Vision-Language Model for Frame Analysis

Date: 2026-01-19
Status: Proposed
Decision Makers: System Architect
Stakeholders: Engineering, Operations, Finance


Context

The video analysis pipeline requires a vision-language model (VLM) to analyze extracted frames for content type classification, OCR, and semantic understanding. Key requirements:

  1. Multimodal Understanding: Process images together with text prompts
  2. Text Extraction: OCR capability for slides/diagrams
  3. Content Classification: Identify slides, diagrams, people, scenes
  4. Cost Efficiency: Process 100-500 frames per video
  5. Quality: High accuracy for educational/technical content
  6. API Reliability: Low latency, high availability

Decision Drivers

Quality Requirements

  • OCR Accuracy: >95% for printed text in slides
  • Content Classification: >90% accuracy for slide vs. scene
  • Semantic Understanding: Ability to summarize complex diagrams

Cost Constraints

  • Target: <$1.50 per video (100-200 frames)
  • Maximum: $5.00 per video

Performance Requirements

  • Latency: <5s per frame analysis
  • Throughput: Process 60-minute video in <10 minutes
  • Batch Support: Multiple frames per API call
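
A back-of-envelope check shows why batch support is listed as a requirement: at the 5 s/frame latency ceiling, analyzing a 150-frame video one request at a time already exceeds the 10-minute budget. The batch size and per-batch latency below are illustrative assumptions, not measured figures.

```python
import math

def serial_minutes(frames: int, sec_per_frame: float) -> float:
    """Wall-clock minutes if frames are analyzed one request at a time."""
    return frames * sec_per_frame / 60

def batched_minutes(frames: int, frames_per_batch: int, sec_per_batch: float) -> float:
    """Wall-clock minutes if frames are packed into multi-image requests."""
    return math.ceil(frames / frames_per_batch) * sec_per_batch / 60

print(serial_minutes(150, 5.0))        # 12.5 min -- over the 10-minute budget
print(batched_minutes(150, 10, 20.0))  # 5.0 min, assuming 20 s per 10-image batch
```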

Options Considered

Option 1: GPT-4 Vision (OpenAI)

Pros:

  • Excellent semantic understanding
  • Strong OCR capability
  • Good API reliability (99.9% uptime)
  • Batch support (up to 10 images per request)

Cons:

  • Cost: $0.00765 per image (1080p) = $0.77 per 100 frames
  • Higher cost than alternatives
  • Rate limits: 500 requests/day (free tier)

Evaluation:

cost_per_video = {
    'frames': 150,
    'cost_per_frame': 0.00765,
    'total': 150 * 0.00765,  # $1.15
}

# Quality: 9/10 for educational content
# Speed: 8/10 (good batching)
# Cost: 6/10 (higher end)

Option 2: Claude Sonnet 4.5 with Vision (Anthropic)

Pros:

  • Excellent reasoning: Best for complex diagrams
  • Strong OCR (competitive with GPT-4V)
  • Cost: $0.004 per image = $0.40 per 100 frames
  • 200K context window (excellent for batch processing)
  • Prompt caching for repeated instructions

Cons:

  • Slightly slower than GPT-4V
  • Rate limits: 50 requests/minute

Evaluation:

cost_per_video = {
    'frames': 150,
    'cost_per_frame': 0.004,
    'total': 150 * 0.004,                 # $0.60
    'with_prompt_caching': 150 * 0.0032,  # $0.48 (20% savings)
}

# Quality: 9.5/10 for technical content
# Speed: 7/10 (good but not fastest)
# Cost: 9/10 (best value)
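
The prompt-caching pro above can be sketched as a request payload: the shared instructions carry a `cache_control` block so they are billed at the cached rate on subsequent requests. The model id, prompt text, and helper function are hypothetical placeholders; verify the field names against Anthropic's current Messages API documentation before relying on them.

```python
def build_frame_request(frame_b64_images: list[str], system_prompt: str) -> dict:
    """Build a Messages API payload that caches the shared system prompt."""
    content = [
        {"type": "image",
         "source": {"type": "base64", "media_type": "image/jpeg", "data": img}}
        for img in frame_b64_images
    ]
    content.append({"type": "text", "text": "Classify each frame and extract any text."})
    return {
        "model": "claude-sonnet-4-5",  # placeholder model id
        "max_tokens": 4096,
        # cache_control marks the static instructions for reuse across requests
        "system": [{"type": "text", "text": system_prompt,
                    "cache_control": {"type": "ephemeral"}}],
        "messages": [{"role": "user", "content": content}],
    }

req = build_frame_request(["<b64-frame-1>", "<b64-frame-2>"], "You analyze video frames.")
print(len(req["messages"][0]["content"]))  # 3: two images + one instruction
```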

Option 3: Google Gemini 1.5 Vision

Pros:

  • Lowest cost: $0.002 per image = $0.20 per 100 frames
  • Fast inference
  • Multimodal native (video + audio directly)
  • 1M token context window

Cons:

  • OCR quality slightly lower than Claude/GPT-4V
  • Less proven for complex technical diagrams
  • API stability concerns (newer service)

Evaluation:

cost_per_video = {
    'frames': 150,
    'cost_per_frame': 0.002,
    'total': 150 * 0.002,  # $0.30
}

# Quality: 8/10 (good but not best)
# Speed: 9/10 (fastest)
# Cost: 10/10 (cheapest)

Option 4: Open Source (LLaVA, BLIP-2, Qwen-VL)

Pros:

  • Zero API cost (self-hosted)
  • No rate limits
  • Full control over deployment
  • Privacy (no data leaves infrastructure)

Cons:

  • Quality: 6-7/10 (significantly lower than commercial)
  • Infrastructure cost (GPU required)
  • Maintenance overhead
  • Slower inference without optimization

Evaluation:

cost_per_video = {
    'api_cost': 0.00,
    'gpu_cost_per_hour': 1.50,  # AWS g4dn.xlarge
    'processing_time_minutes': 15,
    'total': 1.50 * (15 / 60),  # $0.375
}

# Quality: 6.5/10 (acceptable for simple content)
# Speed: 5/10 (depends on hardware)
# Cost: 7/10 (free API but infrastructure cost)

Decision

Selected: Claude Sonnet 4.5 with Vision (Option 2)

Rationale

  1. Best Quality-Cost Ratio:

    • 9.5/10 quality at $0.48-$0.60 per video (with caching)
    • Better reasoning for complex technical content than GPT-4V
    • ~48% cheaper than GPT-4V ($0.60 vs. $1.15 per 150-frame video)
  2. Technical Advantages:

    • 200K context window → batch 40-50 frames per request
    • Prompt caching → 20% cost reduction for repeated prompt instructions
    • Strong performance on diagrams, charts, code snippets
  3. Operational Benefits:

    • Anthropic's focus on safety/reliability
    • Good API documentation and support
    • Reasonable rate limits (50 req/min × ~50 frames/request ≈ 2,500 frames/min)
  4. Flexibility:

    • Can fall back to GPT-4V for difficult cases
    • Can add Gemini as secondary model for cost optimization
    • Open source models remain viable for pre-filtering

Cost Analysis

# Average 60-minute video
baseline_scenario = {
    'frames_extracted': 150,
    'claude_cost': 150 * 0.004,           # $0.60
    'whisper_api_cost': 60 * 0.006,       # $0.36
    'synthesis_cost': 50000 * 0.000003,   # $0.15
    'total': 0.60 + 0.36 + 0.15,          # $1.11
}

# With optimizations
optimized_scenario = {
    'frames_extracted': 120,              # Better sampling
    'claude_cost_cached': 120 * 0.0032,   # $0.38
    'whisper_local': 0.00,                # Self-hosted
    'synthesis_cost': 50000 * 0.000003,   # $0.15
    'total': 0.38 + 0.15,                 # $0.53
}

Consequences

Positive

  • Quality: Excellent analysis of technical content
  • Cost-Effective: Meets $1.50 target comfortably
  • Scalability: Good rate limits for production use
  • Maintainability: Simple API integration, no infrastructure

Negative

  • Vendor Lock-in: Moderate (can switch to GPT-4V easily)
  • Rate Limits: Need queueing for high-throughput scenarios
  • API Dependency: Service outages affect pipeline

Mitigation Strategies

  1. Circuit Breaker Pattern: Fall back to GPT-4V on Claude failures
  2. Local Model Pre-filtering: Use LLaVA to identify frames worth analyzing
  3. Prompt Caching: Aggressive use to reduce costs 20%
  4. Batch Optimization: Pack maximum frames per request (40-50)
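
Mitigation 4 (batch packing) can be sketched as a simple chunking helper; the 50-frame cap is the upper bound assumed in this ADR, not a documented API limit.

```python
from typing import List

MAX_FRAMES_PER_REQUEST = 50  # assumed upper bound from this ADR

def pack_batches(frames: List[str], batch_size: int = MAX_FRAMES_PER_REQUEST) -> List[List[str]]:
    """Split a frame list into consecutive batches no larger than batch_size."""
    return [frames[i:i + batch_size] for i in range(0, len(frames), batch_size)]

batches = pack_batches([f"frame_{i}" for i in range(120)])
print([len(b) for b in batches])  # [50, 50, 20]
```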

Implementation Notes

class VisionAnalyzer:
    def __init__(self, config):
        self.primary_client = Anthropic()   # Claude Sonnet 4.5
        self.fallback_client = OpenAI()     # GPT-4V fallback
        self.circuit_breaker = CircuitBreaker(threshold=5)

    async def analyze_frames(self, frames: List[Frame]):
        try:
            return await self.circuit_breaker.call(
                self._analyze_with_claude, frames
            )
        except CircuitBreakerOpen:
            logger.warning("Claude circuit open, using GPT-4V fallback")
            return await self._analyze_with_gpt4v(frames)

Review Schedule

  • 3 months: Evaluate cost actuals vs. projections
  • 6 months: Assess quality metrics and consider alternatives
  • 12 months: Full review including newly released models
