ADR-001: Choice of Vision-Language Model for Frame Analysis

Date: 2026-01-19
Status: Proposed
Decision Makers: System Architect
Stakeholders: Engineering, Operations, Finance


Context

The video analysis pipeline requires a vision-language model (VLM) to analyze extracted frames for content type classification, OCR, and semantic understanding. Key requirements:

  1. Multimodal Understanding: Process images together with text prompts
  2. Text Extraction: OCR capability for slides/diagrams
  3. Content Classification: Identify slides, diagrams, people, scenes
  4. Cost Efficiency: Process 100-500 frames per video
  5. Quality: High accuracy for educational/technical content
  6. API Reliability: Low latency, high availability

Decision Drivers

Quality Requirements

  • OCR Accuracy: >95% for printed text in slides
  • Content Classification: >90% accuracy for slide vs. scene
  • Semantic Understanding: Ability to summarize complex diagrams

Cost Constraints

  • Target: <$1.50 per video (100-200 frames)
  • Maximum: $5.00 per video

Performance Requirements

  • Latency: <5s per frame analysis
  • Throughput: Process 60-minute video in <10 minutes
  • Batch Support: Multiple frames per API call
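
A back-of-envelope check shows why batch support is listed as a requirement: at the 5 s/frame latency ceiling, analyzing a 150-frame video one request at a time already exceeds the 10-minute budget. The batch size and per-batch latency below are illustrative assumptions, not measured figures.

```python
import math

def serial_minutes(frames: int, sec_per_frame: float) -> float:
    """Wall-clock minutes if frames are analyzed one request at a time."""
    return frames * sec_per_frame / 60

def batched_minutes(frames: int, frames_per_batch: int, sec_per_batch: float) -> float:
    """Wall-clock minutes if frames are packed into multi-image requests."""
    return math.ceil(frames / frames_per_batch) * sec_per_batch / 60

print(serial_minutes(150, 5.0))        # 12.5 min -- over the 10-minute budget
print(batched_minutes(150, 10, 20.0))  # 5.0 min, assuming 20 s per 10-image batch
```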

Options Considered

Option 1: GPT-4 Vision (OpenAI)

Pros:

  • Excellent semantic understanding
  • Strong OCR capability
  • Good API reliability (99.9% uptime)
  • Batch support (up to 10 images per request)

Cons:

  • Cost: $0.00765 per image (1080p) = $0.77 per 100 frames
  • Higher cost than alternatives
  • Rate limits: 500 requests/day (free tier)

Evaluation:

cost_per_video = {
    'frames': 150,
    'cost_per_frame': 0.00765,
    'total': 150 * 0.00765,  # $1.15
}

# Quality: 9/10 for educational content
# Speed: 8/10 (good batching)
# Cost: 6/10 (higher end)

Option 2: Claude Sonnet 4.5 with Vision (Anthropic)

Pros:

  • Excellent reasoning: Best for complex diagrams
  • Strong OCR (competitive with GPT-4V)
  • Cost: $0.004 per image = $0.40 per 100 frames
  • 200K context window (excellent for batch processing)
  • Prompt caching for repeated instructions

Cons:

  • Slightly slower than GPT-4V
  • Rate limits: 50 requests/minute

Evaluation:

cost_per_video = {
    'frames': 150,
    'cost_per_frame': 0.004,
    'total': 150 * 0.004,                 # $0.60
    'with_prompt_caching': 150 * 0.0032,  # $0.48 (20% savings)
}

# Quality: 9.5/10 for technical content
# Speed: 7/10 (good but not fastest)
# Cost: 9/10 (best value)
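
The prompt-caching pro above can be sketched as a request payload: the shared instructions carry a `cache_control` block so they are billed at the cached rate on subsequent requests. The model id, prompt text, and helper function are hypothetical placeholders; verify the field names against Anthropic's current Messages API documentation before relying on them.

```python
def build_frame_request(frame_b64_images: list[str], system_prompt: str) -> dict:
    """Build a Messages API payload that caches the shared system prompt."""
    content = [
        {"type": "image",
         "source": {"type": "base64", "media_type": "image/jpeg", "data": img}}
        for img in frame_b64_images
    ]
    content.append({"type": "text", "text": "Classify each frame and extract any text."})
    return {
        "model": "claude-sonnet-4-5",  # placeholder model id
        "max_tokens": 4096,
        # cache_control marks the static instructions for reuse across requests
        "system": [{"type": "text", "text": system_prompt,
                    "cache_control": {"type": "ephemeral"}}],
        "messages": [{"role": "user", "content": content}],
    }

req = build_frame_request(["<b64-frame-1>", "<b64-frame-2>"], "You analyze video frames.")
print(len(req["messages"][0]["content"]))  # 3: two images + one instruction
```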

Option 3: Google Gemini 1.5 Vision

Pros:

  • Lowest cost: $0.002 per image = $0.20 per 100 frames
  • Fast inference
  • Multimodal native (video + audio directly)
  • 1M token context window

Cons:

  • OCR quality slightly lower than Claude/GPT-4V
  • Less proven for complex technical diagrams
  • API stability concerns (newer service)

Evaluation:

cost_per_video = {
    'frames': 150,
    'cost_per_frame': 0.002,
    'total': 150 * 0.002,  # $0.30
}

# Quality: 8/10 (good but not best)
# Speed: 9/10 (fastest)
# Cost: 10/10 (cheapest)

Option 4: Open Source (LLaVA, BLIP-2, Qwen-VL)

Pros:

  • Zero API cost (self-hosted)
  • No rate limits
  • Full control over deployment
  • Privacy (no data leaves infrastructure)

Cons:

  • Quality: 6-7/10 (significantly lower than commercial)
  • Infrastructure cost (GPU required)
  • Maintenance overhead
  • Slower inference without optimization

Evaluation:

cost_per_video = {
    'api_cost': 0.00,
    'gpu_cost_per_hour': 1.50,  # AWS g4dn.xlarge
    'processing_time_minutes': 15,
    'total': 1.50 * (15 / 60),  # $0.375
}

# Quality: 6.5/10 (acceptable for simple content)
# Speed: 5/10 (depends on hardware)
# Cost: 7/10 (free API but infrastructure cost)

Decision

Selected: Claude Sonnet 4.5 with Vision (Option 2)

Rationale

  1. Best Quality-Cost Ratio:

    • 9.5/10 quality at $0.48-$0.60 per video (with caching)
    • Better reasoning for complex technical content than GPT-4V
    • ~48% cheaper than GPT-4V ($0.60 vs. $1.15 per 150-frame video)
  2. Technical Advantages:

    • 200K context window → batch 40-50 frames per request
    • Prompt caching → 20% cost reduction for repeated prompt instructions
    • Strong performance on diagrams, charts, code snippets
  3. Operational Benefits:

    • Anthropic's focus on safety/reliability
    • Good API documentation and support
    • Reasonable rate limits (50 req/min × ~50 frames/request ≈ 2,500 frames/min)
  4. Flexibility:

    • Can fall back to GPT-4V for difficult cases
    • Can add Gemini as secondary model for cost optimization
    • Open source models remain viable for pre-filtering

Cost Analysis

# Average 60-minute video
baseline_scenario = {
    'frames_extracted': 150,
    'claude_cost': 150 * 0.004,           # $0.60
    'whisper_api_cost': 60 * 0.006,       # $0.36
    'synthesis_cost': 50000 * 0.000003,   # $0.15
    'total': 0.60 + 0.36 + 0.15,          # $1.11
}

# With optimizations
optimized_scenario = {
    'frames_extracted': 120,              # Better sampling
    'claude_cost_cached': 120 * 0.0032,   # $0.38
    'whisper_local': 0.00,                # Self-hosted
    'synthesis_cost': 50000 * 0.000003,   # $0.15
    'total': 0.38 + 0.15,                 # $0.53
}

Consequences

Positive

  • Quality: Excellent analysis of technical content
  • Cost-Effective: Meets $1.50 target comfortably
  • Scalability: Good rate limits for production use
  • Maintainability: Simple API integration, no infrastructure

Negative

  • Vendor Lock-in: Moderate (can switch to GPT-4V easily)
  • Rate Limits: Need queueing for high-throughput scenarios
  • API Dependency: Service outages affect pipeline

Mitigation Strategies

  1. Circuit Breaker Pattern: Fall back to GPT-4V on Claude failures
  2. Local Model Pre-filtering: Use LLaVA to identify frames worth analyzing
  3. Prompt Caching: Aggressive use to reduce costs 20%
  4. Batch Optimization: Pack maximum frames per request (40-50)
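
Mitigation 4 (batch packing) can be sketched as a simple chunking helper; the 50-frame cap is the upper bound assumed in this ADR, not a documented API limit.

```python
from typing import List

MAX_FRAMES_PER_REQUEST = 50  # assumed upper bound from this ADR

def pack_batches(frames: List[str], batch_size: int = MAX_FRAMES_PER_REQUEST) -> List[List[str]]:
    """Split a frame list into consecutive batches no larger than batch_size."""
    return [frames[i:i + batch_size] for i in range(0, len(frames), batch_size)]

batches = pack_batches([f"frame_{i}" for i in range(120)])
print([len(b) for b in batches])  # [50, 50, 20]
```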

Implementation Notes

class VisionAnalyzer:
    def __init__(self, config):
        self.primary_client = Anthropic()   # Claude Sonnet 4.5
        self.fallback_client = OpenAI()     # GPT-4V fallback
        self.circuit_breaker = CircuitBreaker(threshold=5)

    async def analyze_frames(self, frames: List[Frame]):
        try:
            return await self.circuit_breaker.call(
                self._analyze_with_claude, frames
            )
        except CircuitBreakerOpen:
            logger.warning("Claude circuit open, using GPT-4V fallback")
            return await self._analyze_with_gpt4v(frames)

Review Schedule

  • 3 months: Evaluate cost actuals vs. projections
  • 6 months: Assess quality metrics and consider alternatives
  • 12 months: Full review including newly released models
