ADR 004: Use GPT-4V/Claude for Vision Analysis

Status

Accepted

Context

After deduplication, unique frames need analysis to extract visual content:

  • Text detection (OCR)
  • UI component identification
  • Diagram understanding
  • Context extraction

Decision

Use GPT-4V and Claude 3.5 Sonnet Vision for frame analysis.

Consequences

Positive

  • Excellent text recognition (built-in OCR)
  • Understands diagrams, charts, architecture
  • Context-aware descriptions
  • Structured output possible
  • Multi-modal reasoning

Negative

  • Cost (~$0.005-0.015 per image)
  • Rate limits (GPT-4V: 100 RPM, Claude: variable)
  • Requires API key
  • Hallucination possible for ambiguous content

Implementation Strategy

async def analyze_frame(frame_path, context=""):
    # Try Claude first (better accuracy/cost ratio)
    try:
        return await analyze_with_claude(frame_path, context)
    except RateLimitError:
        # Fall back to GPT-4V
        return await analyze_with_gpt4v(frame_path, context)
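The fallback strategy can be exercised end-to-end with stub providers. Everything here except analyze_frame itself is a placeholder: RateLimitError and the two analyze_with_* stubs stand in for real Anthropic/OpenAI SDK calls, and the stub Claude path always simulates a rate-limit rejection so the fallback fires.

```python
import asyncio

class RateLimitError(Exception):
    """Stand-in for a provider's 429 rate-limit error."""

async def analyze_with_claude(frame_path, context):
    # Stub: a real implementation would send the image to the
    # Anthropic Messages API. Here we simulate hitting a rate limit.
    raise RateLimitError("claude: 429")

async def analyze_with_gpt4v(frame_path, context):
    # Stub: a real implementation would call the OpenAI API with
    # the frame attached as image content.
    return {"provider": "gpt-4v", "frame": frame_path, "context": context}

async def analyze_frame(frame_path, context=""):
    # Claude first (better accuracy/cost ratio), GPT-4V on rate limit.
    try:
        return await analyze_with_claude(frame_path, context)
    except RateLimitError:
        return await analyze_with_gpt4v(frame_path, context)

result = asyncio.run(analyze_frame("frame_0001.png", "deployment tutorial"))
```

Only RateLimitError triggers the fallback; other provider errors (auth failures, timeouts) propagate to the caller, which is usually what you want so they are not silently retried on a second paid API.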

Prompt Template

Analyze this video frame. The video is about: {context}

Provide:
1. Brief description of what's shown
2. Any text visible (transcribe accurately)
3. UI components, diagrams, or code visible
4. Key information or data presented
5. How this relates to the overall topic

Be concise but thorough.
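The template above is filled per-frame with the video-level context. A minimal sketch (the topic string is an invented example):

```python
PROMPT_TEMPLATE = """Analyze this video frame. The video is about: {context}

Provide:
1. Brief description of what's shown
2. Any text visible (transcribe accurately)
3. UI components, diagrams, or code visible
4. Key information or data presented
5. How this relates to the overall topic

Be concise but thorough."""

# Context comes from video metadata or an earlier transcript pass.
prompt = PROMPT_TEMPLATE.format(context="a PostgreSQL indexing tutorial")
```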

Alternatives Considered

Alternative             Pros                  Cons
Tesseract OCR           Free, local           Poor on complex layouts
Google Vision API       Good OCR              No semantic understanding
Azure Computer Vision   Enterprise features   Cost, complexity
Local LLaVA             No API cost           Requires GPU, lower accuracy
EasyOCR                 Free                  Limited to text only

Cost Estimation

For 100 unique frames:

  • GPT-4V: ~$0.50-1.50
  • Claude 3.5 Sonnet: ~$0.30-0.80
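These ranges follow directly from the per-image cost above. A small helper makes the arithmetic explicit (the default rates are the GPT-4V figures from this ADR):

```python
def estimate_cost(n_frames, per_image_low=0.005, per_image_high=0.015):
    """Return (low, high) USD estimate for analyzing n_frames images."""
    return (n_frames * per_image_low, n_frames * per_image_high)

low, high = estimate_cost(100)  # GPT-4V rates: (0.50, 1.50)
```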

Notes

Batch frames when possible. Cache results by perceptual hash to avoid re-analysis.
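Caching by perceptual hash can be sketched as below. The hash key would come from the deduplication stage (e.g. an imagehash-style pHash string); the analyzer passed in is whichever provider call succeeds. The fake_analyze stub and call counter exist only to show that visually identical frames, even from different file paths, trigger a single API call.

```python
import asyncio

_cache = {}            # perceptual hash -> analysis result
_calls = {"n": 0}      # instrumentation for the demo only

async def fake_analyze(frame_path, context):
    # Stub analyzer standing in for analyze_frame.
    _calls["n"] += 1
    return {"desc": "stub result"}

async def analyze_cached(phash_key, frame_path, context, analyze=fake_analyze):
    # Keying on the perceptual hash (not the path) means near-duplicate
    # frames that survived dedup with identical hashes are analyzed once.
    if phash_key not in _cache:
        _cache[phash_key] = await analyze(frame_path, context)
    return _cache[phash_key]

async def main():
    await analyze_cached("f0e1d2c3", "frame_a.png", "ctx")
    await analyze_cached("f0e1d2c3", "frame_b.png", "ctx")  # cache hit
    return _calls["n"]

n_calls = asyncio.run(main())
```

In production the dict would be a persistent store (SQLite, Redis) so re-runs over the same video skip paid analysis entirely.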