# Video-to-Knowledge Pipeline

A comprehensive system for converting video content into structured knowledge artifacts, including SDD, TDD, ADR, C4 diagrams, interactive presentation decks, and more.
## Features
- Multi-Source Ingestion: Download from YouTube, Vimeo, or process local videos
- Audio Intelligence: Extract and transcribe audio with Whisper (local or API)
- Visual Analysis: Sample frames, deduplicate, and analyze with Kimi or Claude Vision
- Content Synthesis: Merge transcript and visual analysis
- Artifact Generation: Auto-generate:
  - Executive Summary
  - Introduction & Outline
  - Software Design Document (SDD)
  - Technical Design Document (TDD)
  - Architecture Decision Records (ADR)
  - C4 Model diagrams (Context, Container, Component)
  - Glossary of Terms
  - Document Inventory
- Interactive Presentation Decks: HTML/Markdown/Reveal.js decks with synchronized transcript and images
- Interactive CLI: Guided setup with model management
- Configurable Whisper Models: tiny, base, small, medium, large-v3 with auto-selection
## Architecture

## 🚀 Quick Start

### One-Command Setup
```bash
# Clone or navigate to project
cd video_analysis_pipeline

# Run interactive setup (recommended)
python build.py

# Or use Make
make install
```
### Manual Installation

See `BUILD_README.md` for detailed installation options.
### Requirements
- Python 3.9+
- FFmpeg
- API Key (Kimi or Claude)
Install FFmpeg:

```bash
# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt install ffmpeg
```
### Interactive Mode (Recommended)

```bash
# Launch interactive menu
python -m src.pipeline

# Or explicitly
python -m src.pipeline interactive
```
The interactive menu provides:
- 🎬 Process new video - Full guided setup
- ⚡ Quick process - Use saved defaults
- 📦 Manage Whisper models - Download/view models
- ⚙️ Configuration - View/reset settings
## Quick Usage

```bash
# Process a YouTube video
python -m src.pipeline process "https://youtube.com/watch?v=..."

# Process local video
python -m src.pipeline process "./my_video.mp4"

# Custom output directory
python -m src.pipeline process "URL" --output ./my_output

# Adjust frame extraction rate
python -m src.pipeline process "URL" --fps 1.0

# Skip artifact generation
python -m src.pipeline process "URL" --no-artifacts

# Select specific Whisper model
python -m src.pipeline process "URL" --whisper-model medium

# Prioritize transcription speed over accuracy
python -m src.pipeline process "URL" --priority speed

# Use Kimi for vision analysis (instead of Claude)
python -m src.pipeline process "URL" --vision kimi

# Generate presentation deck
python -m src.pipeline process "URL" --deck --deck-format html

# Open generated deck in browser
python -m src.pipeline deck outputs/deck/presentation.html
```
## Model Management

```bash
# List available Whisper models
python -m src.pipeline models

# Download specific model
python -m src.pipeline models --download small

# Download all models (~4.5GB)
python -m src.pipeline models --download-all

# Interactive model manager
python -m src.pipeline models --interactive
```
## Pipeline Stages

### 1. Video Download (`downloader.py`)

Uses yt-dlp to download videos from platforms such as YouTube and Vimeo.
### 2. Audio Processing (`audio_processor.py`)
- Extracts audio to MP3 using FFmpeg
- Transcribes with OpenAI Whisper
- Produces timestamped transcript segments
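As a rough sketch of the extraction step, the FFmpeg invocation looks like the following (the helper name is illustrative and the exact flags in `audio_processor.py` may differ):

```python
def ffmpeg_extract_cmd(video_path: str, audio_path: str, bitrate: str = "192k") -> list[str]:
    """Build an FFmpeg command that drops the video stream (-vn)
    and encodes the audio track to MP3 at the configured bitrate."""
    return [
        "ffmpeg", "-i", video_path,
        "-vn",                    # no video stream in the output
        "-acodec", "libmp3lame",  # MP3 encoder
        "-b:a", bitrate,          # matches the audio_bitrate config option
        "-y",                     # overwrite existing output
        audio_path,
    ]
```

The resulting MP3 is then passed to Whisper, which returns the timestamped segments used downstream.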
### 3. Frame Processing (`frame_processor.py`)
- Extracts frames at configurable rate (default: 0.5 fps)
- Deduplicates using perceptual hashing (pHash)
- Keeps only visually unique frames
### 4. Vision Analysis (`vision_analyzer.py`)

- Analyzes unique frames with Kimi or Claude Vision
- Extracts text, diagrams, UI elements
- Generates structured descriptions
### 5. Content Synthesis (`content_synthesizer.py`)
- Aligns transcript segments with frame timestamps
- Creates semantic chunks
- Merges audio and visual context
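The alignment step can be sketched as follows (a minimal illustration, not the actual `content_synthesizer.py` code; the function name and tuple layout are assumptions):

```python
def align_frames_to_segments(segments, frame_times):
    """Pair each frame timestamp with the transcript segment whose
    [start, end) window covers it (None when the frame falls in a gap).

    segments: sorted, non-overlapping (start_sec, end_sec, text) tuples
    frame_times: sorted frame timestamps in seconds
    """
    aligned = []
    for t in frame_times:
        text = next((s[2] for s in segments if s[0] <= t < s[1]), None)
        aligned.append((t, text))
    return aligned
```

Each (frame, segment) pair then feeds the chunking step, so a chunk carries both what was said and what was on screen at that moment.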
### 6. Artifact Generation (`artifact_generator.py`)
Generates comprehensive documentation:
| Artifact | Purpose |
|---|---|
| Executive Summary | High-level overview for decision makers |
| SDD | System design and architecture |
| TDD | Technical implementation details |
| ADR | Architecture decision records |
| C4 Diagrams | Visual architecture at multiple levels |
| Glossary | Technical terms and definitions |
| Inventory | Document index and navigation |
## Configuration

Edit `PipelineConfig` in `src/models.py` or pass options via CLI:
```python
PipelineConfig(
    download_quality="best",
    audio_bitrate="192k",
    frame_rate=0.5,        # frames per second
    dedup_threshold=10,    # pHash Hamming distance
    chunk_size=4000,       # tokens per chunk
    web_search_enabled=True,
    # Whisper transcription settings
    transcription_backend="local",  # "local" or "openai-api"
    whisper_model="auto",           # "tiny", "base", "small", "medium", "large-v3", "auto"
    whisper_priority="balanced",    # "speed", "quality", "balanced"
)
```
## Whisper Model Selection
| Model | RAM | Relative Speed | Accuracy | Best For |
|---|---|---|---|---|
| tiny | ~0.5GB | 32x | 58% | Quick tests, limited RAM |
| base | ~1GB | 16x | 68% | Default, balanced |
| small | ~2GB | 6x | 76% | Better accuracy |
| medium | ~5GB | 2x | 82% | High quality |
| large-v3 | ~10GB | 1x | 87% | Best quality |
Auto-selection logic:
- < 1 GB RAM → tiny
- 1–2 GB RAM → base
- 2–5 GB RAM → small
- 5–10 GB RAM → medium or large-v3 (by priority)
- > 10 GB RAM → large-v3
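The auto-selection rule can be sketched as a small function (thresholds follow the table above; the function name is illustrative, not the pipeline's actual API):

```python
def pick_whisper_model(free_ram_gb: float, priority: str = "balanced") -> str:
    """Map available RAM (and the configured priority) to a Whisper model size."""
    if free_ram_gb < 1:
        return "tiny"
    if free_ram_gb < 2:
        return "base"
    if free_ram_gb < 5:
        return "small"
    if free_ram_gb < 10:
        # 5-10 GB: priority decides between quality and a smaller footprint
        return "large-v3" if priority == "quality" else "medium"
    return "large-v3"
```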
Environment variables:

```bash
export WHISPER_MODEL_SIZE=medium  # Force specific model
export WHISPER_PRIORITY=quality   # Force priority
export WHISPER_DEVICE=cuda        # Force GPU
```
## Presentation Decks
The pipeline generates interactive HTML presentation decks that synchronize transcript with extracted video frames.
### Features
- Synchronized Playback: Auto-advances slides based on transcript timing
- Dual View: Left side shows video frame, right side shows transcript
- Manual Navigation: Arrow keys or buttons to navigate
- Speed Control: 1x, 1.5x, 2x playback speeds
- Progress Tracking: Visual progress bar
### Deck Formats

| Format | File | Description |
|---|---|---|
| HTML | `presentation.html` | Full-featured interactive deck (default) |
| Markdown | `presentation.md` | Simple Markdown slides |
| Reveal.js | `presentation_reveal.html` | Reveal.js slideshow |
### Using the HTML Deck

```bash
# Generate deck
python -m src.pipeline process "video.mp4" --deck

# Open in browser
python -m src.pipeline deck outputs/deck/presentation.html

# Or open manually
open outputs/deck/presentation.html
```
Keyboard Shortcuts:

- `→` or `Space` - Next slide
- `←` - Previous slide
- `Home` - First slide
- `End` - Last slide
## Project Structure

```
video_analysis_pipeline/
├── docs/
│   ├── sdd/                     # Software Design Document
│   ├── tdd/                     # Technical Design Document
│   ├── adr/                     # Architecture Decision Records
│   └── c4/                      # C4 Model diagrams
├── src/
│   ├── __init__.py
│   ├── models.py                # Pydantic data models
│   ├── downloader.py            # Video download (yt-dlp)
│   ├── audio_processor.py       # Audio extraction & transcription
│   ├── frame_processor.py       # Frame extraction & deduplication
│   ├── vision_analyzer.py       # Image analysis (Claude Vision)
│   ├── content_synthesizer.py   # Content merging & chunking
│   ├── artifact_generator.py    # Document generation
│   ├── web_search.py            # Knowledge expansion
│   └── pipeline.py              # Main orchestrator
├── outputs/
│   ├── videos/                  # Downloaded videos
│   ├── audio/                   # Extracted audio
│   ├── frames/                  # Extracted & deduplicated frames
│   └── artifacts/               # Generated documentation
├── requirements.txt
└── README.md
```
## API Keys

| Service | Key | Purpose | Required |
|---|---|---|---|
| OpenAI | `OPENAI_API_KEY` | Whisper API transcription | Optional (local Whisper available) |
| Anthropic | `ANTHROPIC_API_KEY` | Vision analysis, artifact generation | Optional (Kimi alternative) |
| Kimi/Moonshot | `KIMI_API_KEY` | Vision analysis | Recommended |
| Serper | `SERPER_API_KEY` | Web search expansion | Optional |
### Kimi Vision (Recommended)

Kimi (Moonshot AI) provides excellent vision capabilities for analyzing video frames:

```bash
export KIMI_API_KEY="your-kimi-api-key"
# or
export MOONSHOT_API_KEY="your-kimi-api-key"
```
The pipeline will auto-detect and use Kimi if available, falling back to Claude.
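The detection order described above amounts to something like this (a hedged sketch; the function name is illustrative and the real logic lives in `vision_analyzer.py`):

```python
import os

def pick_vision_backend() -> str:
    """Prefer Kimi when a Kimi/Moonshot key is present; otherwise
    fall back to Claude via ANTHROPIC_API_KEY."""
    if os.getenv("KIMI_API_KEY") or os.getenv("MOONSHOT_API_KEY"):
        return "kimi"
    if os.getenv("ANTHROPIC_API_KEY"):
        return "claude"
    raise RuntimeError("Set KIMI_API_KEY (or MOONSHOT_API_KEY) or ANTHROPIC_API_KEY")
```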
## Algorithms

### Perceptual Deduplication

Uses pHash (perceptual hash) to detect visually similar frames:

1. Compute a 64-bit hash for each frame
2. Compare consecutive frames by Hamming distance
3. Keep the frame if distance > threshold (content changed)
4. Discard it if distance ≤ threshold (visually similar)

Default threshold: 10 (tunable by content type)
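The steps above can be sketched in pure Python, operating on precomputed 64-bit hash values (e.g. from a library such as `imagehash`; hash computation itself is omitted, and these helper names are illustrative):

```python
def hamming(h1: int, h2: int) -> int:
    """Number of differing bits between two 64-bit pHash values."""
    return bin(h1 ^ h2).count("1")

def dedup(hashes: list[int], threshold: int = 10) -> list[int]:
    """Return indices of frames to keep: a frame survives only when its
    hash differs from the previous frame's by more than `threshold` bits."""
    kept = [0] if hashes else []  # always keep the first frame
    for i in range(1, len(hashes)):
        if hamming(hashes[i - 1], hashes[i]) > threshold:
            kept.append(i)
    return kept
```

A lower threshold keeps more frames (e.g. slide decks with subtle changes); a higher one is more aggressive for talking-head footage.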
## Outputs

Generated artifacts are saved as Markdown files in `outputs/artifacts/`:

```
outputs/artifacts/
├── 00_document_inventory.md
├── 01_executive_summary.md
├── 02_introduction.md
├── 03_outline.md
├── 04_software_design_document.md
├── 05_technical_design_document.md
├── 06_architecture_decisions.md
├── 07_c4_system_context.md
├── 08_c4_container_level.md
├── 09_c4_component_level.md
└── 10_glossary.md
```
## Development

```bash
# Run tests
pytest tests/

# Type checking
mypy src/

# Linting
ruff check src/
```
## License

MIT

## Credits

- yt-dlp: Video downloading
- OpenAI Whisper: Speech-to-text
- Anthropic Claude: Vision and text generation
- FFmpeg: Audio/video processing
## 🔑 API Key Options

### Option 1: OpenAI Whisper API (Current Default)
Pros:
- Fast (optimized cloud infrastructure)
- No local GPU needed
- Always up-to-date models
Cons:
- Requires API key
- Costs money (~$0.006/minute)
- Requires internet
Setup:

```bash
export OPENAI_API_KEY="sk-..."
pip install -r requirements.txt
```
### Option 2: Local Whisper (No API Key!)
Pros:
- ✅ No API key needed
- ✅ Free (no usage costs)
- ✅ Works offline
- ✅ Privacy (audio stays local)
Cons:
- Slower (depends on your hardware)
- Downloads ~1-2GB model files
- Requires more RAM/CPU (or GPU)
Setup:

```bash
# Use local requirements
pip install -r requirements-local.txt

# Replace audio_processor.py
mv src/audio_processor.py src/audio_processor_api.py
mv src/audio_processor_local.py src/audio_processor.py

# No API key needed for transcription!
export ANTHROPIC_API_KEY="sk-ant-..."  # Still needed for vision
```
Model Sizes:
| Model | Parameters | VRAM | Speed | Accuracy |
|---|---|---|---|---|
| tiny | 39M | ~1GB | Fastest | Basic |
| base | 74M | ~1GB | Fast | Good |
| small | 244M | ~2GB | Medium | Better |
| medium | 769M | ~5GB | Slow | High |
| large-v3 | 1550M | ~10GB | Slowest | Best |
Edit `src/audio_processor.py` to change the model:

```python
def __init__(self, model_size: str = "base"):  # Change to "small" or "medium"
```
### Comparison
| Feature | API Whisper | Local Whisper |
|---|---|---|
| API Key | Yes | No |
| Cost | ~$0.006/min | Free |
| Internet | Required | Not needed |
| Speed | Very fast | Depends on HW |
| Privacy | Audio sent to OpenAI | Stays local |
| Setup | Simple | Downloads models |
| Hardware | Any | CPU/GPU intensive |
### Recommended Setup
For development/testing:
- Use OpenAI API (faster iteration)
For production/bulk processing:
- Use Local Whisper (no ongoing costs)
For privacy-sensitive content:
- Use Local Whisper (keeps data local)