ADR 004: Use GPT-4V/Claude for Vision Analysis
Status
Accepted
Context
After deduplication, unique frames need analysis to extract visual content:
- Text detection (OCR)
- UI component identification
- Diagram understanding
- Context extraction
Decision
Use GPT-4V and Claude 3.5 Sonnet Vision for frame analysis.
Consequences
Positive
- Excellent text recognition (built-in OCR)
- Understands diagrams, charts, architecture
- Context-aware descriptions
- Structured output possible
- Multi-modal reasoning
Negative
- Cost (~$0.005-0.015 per image)
- Rate limits (GPT-4V: 100 RPM, Claude: variable)
- Requires API key
- Hallucination possible for ambiguous content
Implementation Strategy
```python
async def analyze_frame(frame_path, context=""):
    # Try Claude first (better accuracy/cost ratio)
    try:
        return await analyze_with_claude(frame_path, context)
    except RateLimitError:
        # Fall back to GPT-4V when Claude is rate-limited
        return await analyze_with_gpt4v(frame_path, context)
```
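The fallback routine depends on the provider SDKs for `RateLimitError` and the two analyzer functions. A self-contained sketch of the same pattern, with those pieces stubbed out purely for illustration:

```python
import asyncio

class RateLimitError(Exception):
    """Stand-in for the provider SDK's rate-limit exception."""

async def analyze_with_claude(frame_path, context=""):
    # Stub: simulate Claude being rate-limited
    raise RateLimitError("429 Too Many Requests")

async def analyze_with_gpt4v(frame_path, context=""):
    # Stub: pretend GPT-4V returned an analysis payload
    return {"provider": "gpt-4v", "frame": frame_path, "context": context}

async def analyze_frame(frame_path, context=""):
    # Claude first, GPT-4V only on rate-limit
    try:
        return await analyze_with_claude(frame_path, context)
    except RateLimitError:
        return await analyze_with_gpt4v(frame_path, context)

result = asyncio.run(analyze_frame("frame_0042.png", "Kubernetes tutorial"))
print(result["provider"])  # gpt-4v, because the Claude stub was rate-limited
```

In production the stubs would wrap the real API clients; the control flow stays the same.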
Prompt Template
```
Analyze this video frame. The video is about: {context}

Provide:
1. Brief description of what's shown
2. Any text visible (transcribe accurately)
3. UI components, diagrams, or code visible
4. Key information or data presented
5. How this relates to the overall topic

Be concise but thorough.
```
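Filling the template is a plain `str.format` call; `build_prompt` below is an illustrative helper, not part of any SDK:

```python
PROMPT_TEMPLATE = """Analyze this video frame. The video is about: {context}

Provide:
1. Brief description of what's shown
2. Any text visible (transcribe accurately)
3. UI components, diagrams, or code visible
4. Key information or data presented
5. How this relates to the overall topic

Be concise but thorough."""

def build_prompt(context: str) -> str:
    # Keep the prompt identical across providers so cached results stay comparable
    return PROMPT_TEMPLATE.format(context=context)

print(build_prompt("a Docker networking walkthrough").splitlines()[0])
```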
Alternatives Considered
| Alternative | Pros | Cons |
|---|---|---|
| Tesseract OCR | Free, local | Poor on complex layouts |
| Google Vision API | Good OCR | No semantic understanding |
| Azure Computer Vision | Enterprise features | Cost, complexity |
| Local LLaVA | No API cost | Requires GPU, lower accuracy |
| EasyOCR | Free | Limited to text only |
Cost Estimation
For 100 unique frames:
- GPT-4V: ~$0.50-1.50
- Claude 3.5 Sonnet: ~$0.30-0.80
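The per-frame figures above can be turned into a quick estimator. The rates here are the rough ranges from this ADR, not published pricing:

```python
# Rough per-image cost ranges in USD, derived from the estimates above
RATES = {
    "gpt-4v": (0.005, 0.015),
    "claude-3.5-sonnet": (0.003, 0.008),
}

def estimate_cost(n_frames: int, model: str) -> tuple[float, float]:
    # Returns (low, high) total cost, rounded to cents
    low, high = RATES[model]
    return (round(n_frames * low, 2), round(n_frames * high, 2))

print(estimate_cost(100, "gpt-4v"))             # (0.5, 1.5)
print(estimate_cost(100, "claude-3.5-sonnet"))  # (0.3, 0.8)
```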
Notes
Batch frames when possible. Cache results by perceptual hash to avoid re-analysis.