ADR 004: Use GPT-4V/Claude for Vision Analysis
Status
Accepted
Context
After deduplication, unique frames need analysis to extract visual content:
- Text detection (OCR)
- UI component identification
- Diagram understanding
- Context extraction
Decision
Use GPT-4V and Claude 3.5 Sonnet Vision for frame analysis.
Consequences
Positive
- Excellent text recognition (built-in OCR)
- Understands diagrams, charts, architecture
- Context-aware descriptions
- Structured output possible
- Multi-modal reasoning
Negative
- Cost (~$0.005-0.015 per image)
- Rate limits (GPT-4V: 100 RPM, Claude: variable)
- Requires API key
- Hallucination possible for ambiguous content
Implementation Strategy
```python
async def analyze_frame(frame_path, context=""):
    # Try Claude first (better accuracy/cost ratio)
    try:
        return await analyze_with_claude(frame_path, context)
    except RateLimitError:
        # Fall back to GPT-4V when Claude is rate-limited
        return await analyze_with_gpt4v(frame_path, context)
```
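The fallback routine depends on the provider SDKs for `RateLimitError` and the two analyzer functions. A self-contained sketch of the same pattern, with those pieces stubbed out purely for illustration:

```python
import asyncio

class RateLimitError(Exception):
    """Stand-in for the provider SDK's rate-limit exception."""

async def analyze_with_claude(frame_path, context=""):
    # Stub: simulate Claude being rate-limited
    raise RateLimitError("429 Too Many Requests")

async def analyze_with_gpt4v(frame_path, context=""):
    # Stub: pretend GPT-4V returned an analysis payload
    return {"provider": "gpt-4v", "frame": frame_path, "context": context}

async def analyze_frame(frame_path, context=""):
    # Claude first, GPT-4V only on rate-limit
    try:
        return await analyze_with_claude(frame_path, context)
    except RateLimitError:
        return await analyze_with_gpt4v(frame_path, context)

result = asyncio.run(analyze_frame("frame_0042.png", "Kubernetes tutorial"))
print(result["provider"])  # gpt-4v, because the Claude stub was rate-limited
```

In production the stubs would wrap the real API clients; the control flow stays the same.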
Prompt Template
```
Analyze this video frame. The video is about: {context}

Provide:
1. Brief description of what's shown
2. Any text visible (transcribe accurately)
3. UI components, diagrams, or code visible
4. Key information or data presented
5. How this relates to the overall topic

Be concise but thorough.
```
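Filling the template is a plain `str.format` call; `build_prompt` below is an illustrative helper, not part of any SDK:

```python
PROMPT_TEMPLATE = """Analyze this video frame. The video is about: {context}

Provide:
1. Brief description of what's shown
2. Any text visible (transcribe accurately)
3. UI components, diagrams, or code visible
4. Key information or data presented
5. How this relates to the overall topic

Be concise but thorough."""

def build_prompt(context: str) -> str:
    # Keep the prompt identical across providers so cached results stay comparable
    return PROMPT_TEMPLATE.format(context=context)

print(build_prompt("a Docker networking walkthrough").splitlines()[0])
```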
Alternatives Considered
| Alternative | Pros | Cons |
|---|---|---|
| Tesseract OCR | Free, local | Poor on complex layouts |
| Google Vision API | Good OCR | No semantic understanding |
| Azure Computer Vision | Enterprise features | Cost, complexity |
| Local LLaVA | No API cost | Requires GPU, lower accuracy |
| EasyOCR | Free | Limited to text only |
Cost Estimation
For 100 unique frames:
- GPT-4V: ~$0.50-1.50
- Claude 3.5 Sonnet: ~$0.30-0.80
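The per-frame figures above can be turned into a quick estimator. The rates here are the rough ranges from this ADR, not published pricing:

```python
# Rough per-image cost ranges in USD, derived from the estimates above
RATES = {
    "gpt-4v": (0.005, 0.015),
    "claude-3.5-sonnet": (0.003, 0.008),
}

def estimate_cost(n_frames: int, model: str) -> tuple[float, float]:
    # Returns (low, high) total cost, rounded to cents
    low, high = RATES[model]
    return (round(n_frames * low, 2), round(n_frames * high, 2))

print(estimate_cost(100, "gpt-4v"))             # (0.5, 1.5)
print(estimate_cost(100, "claude-3.5-sonnet"))  # (0.3, 0.8)
```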
Notes
Batch frames when possible. Cache results by perceptual hash to avoid re-analysis.