ADR-001: Choice of Vision-Language Model for Frame Analysis
Date: 2026-01-19
Status: Proposed
Decision Makers: System Architect
Stakeholders: Engineering, Operations, Finance
Context
The video analysis pipeline requires a vision-language model (VLM) to analyze extracted frames for content type classification, OCR, and semantic understanding. Key requirements:
- Multimodal Understanding: Process images together with text prompts
- Text Extraction: OCR capability for slides/diagrams
- Content Classification: Identify slides, diagrams, people, scenes
- Cost Efficiency: Process 100-500 frames per video
- Quality: High accuracy for educational/technical content
- API Reliability: Low latency, high availability
Decision Drivers
Quality Requirements
- OCR Accuracy: >95% for printed text in slides
- Content Classification: >90% accuracy for slide vs. scene
- Semantic Understanding: Ability to summarize complex diagrams
Cost Constraints
- Target: <$1.50 per video (100-200 frames)
- Maximum: $5.00 per video
Performance Requirements
- Latency: <5s per frame analysis
- Throughput: Process 60-minute video in <10 minutes
- Batch Support: Multiple frames per API call
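The latency and throughput targets interact: at 5 s per call with no batching, 150 frames would take 12.5 minutes and miss the 10-minute target, so batch support is load-bearing. A quick sanity check (assuming calls run serially and each call consumes the full 5 s latency budget regardless of batch size):

```python
import math

def processing_minutes(frames: int, latency_s_per_call: float, batch_size: int) -> float:
    """Wall-clock minutes for serial, batched frame analysis."""
    calls = math.ceil(frames / batch_size)
    return calls * latency_s_per_call / 60

print(processing_minutes(150, 5, 1))   # unbatched: 12.5 min, misses the 10-min target
print(processing_minutes(150, 5, 10))  # 10-frame batches: 1.25 min
```
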
Options Considered
Option 1: GPT-4 Vision (OpenAI)
Pros:
- Excellent semantic understanding
- Strong OCR capability
- Good API reliability (99.9% uptime)
- Batch support (up to 10 images per request)
Cons:
- Cost: $0.00765 per image (1080p) = $0.77 per 100 frames
- Higher cost than alternatives
- Rate limits: 500 requests/day (free tier)
Evaluation:
cost_per_video = {
'frames': 150,
'cost_per_frame': 0.00765,
'total': 150 * 0.00765 # $1.15
}
# Quality: 9/10 for educational content
# Speed: 8/10 (good batching)
# Cost: 6/10 (higher end)
Option 2: Claude Sonnet 4.5 with Vision (Anthropic)
Pros:
- Excellent reasoning: Best for complex diagrams
- Strong OCR (competitive with GPT-4V)
- Cost: $0.004 per image = $0.40 per 100 frames
- 200K context window (excellent for batch processing)
- Prompt caching for repeated instructions
Cons:
- Slightly slower than GPT-4V
- Rate limits: 50 requests/minute
Evaluation:
cost_per_video = {
'frames': 150,
'cost_per_frame': 0.004,
'total': 150 * 0.004, # $0.60
'with_prompt_caching': 150 * 0.0032 # $0.48 (20% savings)
}
# Quality: 9.5/10 for technical content
# Speed: 7/10 (good but not fastest)
# Cost: 9/10 (best value)
Option 3: Google Gemini 1.5 Vision
Pros:
- Lowest cost: $0.002 per image = $0.20 per 100 frames
- Fast inference
- Multimodal native (video + audio directly)
- 1M token context window
Cons:
- OCR quality slightly lower than Claude/GPT-4V
- Less proven for complex technical diagrams
- API stability concerns (newer service)
Evaluation:
cost_per_video = {
'frames': 150,
'cost_per_frame': 0.002,
'total': 150 * 0.002 # $0.30
}
# Quality: 8/10 (good but not best)
# Speed: 9/10 (fastest)
# Cost: 10/10 (cheapest)
Option 4: Open Source (LLaVA, BLIP-2, Qwen-VL)
Pros:
- Zero API cost (self-hosted)
- No rate limits
- Full control over deployment
- Privacy (no data leaves infrastructure)
Cons:
- Quality: 6-7/10 (significantly lower than commercial)
- Infrastructure cost (GPU required)
- Maintenance overhead
- Slower inference without optimization
Evaluation:
cost_per_video = {
'api_cost': 0.00,
'gpu_cost_per_hour': 1.50, # AWS g4dn.xlarge
'processing_time_minutes': 15,
'total': 1.50 * (15/60) # $0.375
}
# Quality: 6.5/10 (acceptable for simple content)
# Speed: 5/10 (depends on hardware)
# Cost: 7/10 (free API but infrastructure cost)
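The per-option scores above can be rolled into a single ranking. The weights below are illustrative assumptions (quality weighted highest, matching the decision drivers), not figures from the evaluations:

```python
# Scores taken from the evaluations above (each out of 10).
SCORES = {
    "gpt4v":  {"quality": 9.0, "speed": 8, "cost": 6},
    "claude": {"quality": 9.5, "speed": 7, "cost": 9},
    "gemini": {"quality": 8.0, "speed": 9, "cost": 10},
    "oss":    {"quality": 6.5, "speed": 5, "cost": 7},
}

# Assumed weights: quality matters most for educational/technical content.
WEIGHTS = {"quality": 0.5, "cost": 0.3, "speed": 0.2}

def weighted_score(scores: dict, weights: dict) -> float:
    return sum(scores[k] * w for k, w in weights.items())

ranked = sorted(SCORES, key=lambda m: weighted_score(SCORES[m], WEIGHTS), reverse=True)
print(ranked)  # Claude edges out Gemini under these weights
```

Under these (assumed) weights Claude scores 8.85 vs. Gemini's 8.8, so the ranking is sensitive to how heavily quality is weighted.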
Decision
Selected: Claude Sonnet 4.5 with Vision (Option 2)
Rationale
- Best Quality-Cost Ratio:
  - 9.5/10 quality at $0.48-$0.60 per video (with caching)
  - Better reasoning for complex technical content than GPT-4V
  - Roughly 33% lower total per-video pipeline cost than GPT-4V (~48% cheaper per frame: $0.004 vs. $0.00765)
- Technical Advantages:
  - 200K context window → batch 40-50 frames per request
  - Prompt caching → 20% cost reduction for repeated prompts
  - Strong performance on diagrams, charts, code snippets
- Operational Benefits:
  - Anthropic's focus on safety/reliability
  - Good API documentation and support
  - Reasonable rate limits (50 req/min; at 50-frame batches, ≈2,500 frames/min)
- Flexibility:
  - Can fall back to GPT-4V for difficult cases
  - Can add Gemini as a secondary model for cost optimization
  - Open-source models remain viable for pre-filtering
Cost Analysis
# Average 60-minute video
baseline_scenario = {
'frames_extracted': 150,
'claude_cost': 150 * 0.004, # $0.60
'whisper_api_cost': 60 * 0.006, # $0.36
'synthesis_cost': 50000 * 0.000003, # $0.15
'total': 0.60 + 0.36 + 0.15 # $1.11
}
# With optimizations
optimized_scenario = {
'frames_extracted': 120, # Better sampling
'claude_cost_cached': 120 * 0.0032, # $0.38
'whisper_local': 0.00, # Self-hosted
'synthesis_cost': 50000 * 0.000003, # $0.15
'total': 0.38 + 0.15 # $0.53
}
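The two scenarios differ only in frame count, per-frame rate, and whether Whisper runs locally, so they collapse into one parameterized model. The rates are the estimates used above; `video_cost` is a hypothetical helper for this ADR, not pipeline code:

```python
def video_cost(frames: int, frame_cost: float,
               transcript_minutes: int = 60, whisper_per_min: float = 0.006,
               synthesis_tokens: int = 50_000, synthesis_per_token: float = 0.000003,
               local_whisper: bool = False) -> float:
    """Per-video cost in dollars, mirroring the scenarios above."""
    vision = frames * frame_cost
    whisper = 0.0 if local_whisper else transcript_minutes * whisper_per_min
    synthesis = synthesis_tokens * synthesis_per_token
    return round(vision + whisper + synthesis, 2)

print(video_cost(150, 0.004))                       # baseline: $1.11
print(video_cost(120, 0.0032, local_whisper=True))  # optimized: $0.53
```
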
Consequences
Positive
- Quality: Excellent analysis of technical content
- Cost-Effective: Meets $1.50 target comfortably
- Scalability: Good rate limits for production use
- Maintainability: Simple API integration, no infrastructure
Negative
- Vendor Lock-in: Moderate (can switch to GPT-4V easily)
- Rate Limits: Need queueing for high-throughput scenarios
- API Dependency: Service outages affect pipeline
Mitigation Strategies
- Circuit Breaker Pattern: Fall back to GPT-4V on Claude failures
- Local Model Pre-filtering: Use LLaVA to identify frames worth analyzing
- Prompt Caching: Aggressive use to reduce costs 20%
- Batch Optimization: Pack maximum frames per request (40-50)
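The batch-packing step itself is small; a minimal sketch of splitting an ordered frame list into request-sized chunks (50 matches the upper bound assumed above):

```python
def batch_frames(frames: list, max_per_request: int = 50) -> list:
    """Split frames into consecutive batches of at most max_per_request."""
    return [frames[i:i + max_per_request] for i in range(0, len(frames), max_per_request)]

# 150 extracted frames -> 3 full batches of 50, i.e. 3 API calls instead of 150.
print(len(batch_frames(list(range(150)))))
```
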
Implementation Notes
from typing import List

from anthropic import Anthropic
from openai import OpenAI

# Frame, CircuitBreaker, CircuitBreakerOpen, and logger are defined elsewhere in the pipeline.

class VisionAnalyzer:
    def __init__(self, config):
        self.primary_client = Anthropic()  # Claude Sonnet 4.5
        self.fallback_client = OpenAI()    # GPT-4V fallback
        self.circuit_breaker = CircuitBreaker(threshold=5)

    async def analyze_frames(self, frames: List[Frame]):
        try:
            return await self.circuit_breaker.call(
                self._analyze_with_claude, frames
            )
        except CircuitBreakerOpen:
            logger.warning("Claude circuit open, using GPT-4V fallback")
            return await self._analyze_with_gpt4v(frames)
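The CircuitBreaker referenced above is not shown; a minimal sketch, assuming it counts consecutive failures and opens at the threshold (no half-open recovery timer), could look like:

```python
import asyncio

class CircuitBreakerOpen(Exception):
    """Raised when the breaker is open and calls are short-circuited."""

class CircuitBreaker:
    """Minimal breaker: opens after `threshold` consecutive failures."""

    def __init__(self, threshold: int = 5):
        self.threshold = threshold
        self.failures = 0

    async def call(self, fn, *args, **kwargs):
        if self.failures >= self.threshold:
            raise CircuitBreakerOpen("circuit open")
        try:
            result = await fn(*args, **kwargs)
        except Exception:
            self.failures += 1  # count the failure, then re-raise to the caller
            raise
        else:
            self.failures = 0   # any success closes the circuit again
            return result
```

A production version would also need a reset timeout (half-open state) so the circuit can probe Claude again after an outage.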
Review Schedule
- 3 months: Evaluate cost actuals vs. projections
- 6 months: Assess quality metrics and consider alternatives
- 12 months: Full review including new models (GPT-5, Claude 4, etc.)
References
- Anthropic Claude Vision Announcement
- OpenAI GPT-4V Pricing
- Gemini Vision Benchmarks
- Internal: Cost Analysis Spreadsheet v2.3