# Video-to-Knowledge Pipeline

A comprehensive system for converting video content into structured knowledge artifacts, including SDD, TDD, ADR, C4 diagrams, interactive presentation decks, and more.
## Features
- Multi-Source Ingestion: Download from YouTube, Vimeo, or process local videos
- Audio Intelligence: Extract and transcribe audio with Whisper (local or API)
- Visual Analysis: Sample frames, deduplicate, and analyze with Kimi or Claude Vision
- Content Synthesis: Merge transcript and visual analysis
- Artifact Generation: Auto-generate:
  - Executive Summary
  - Introduction & Outline
  - Software Design Document (SDD)
  - Technical Design Document (TDD)
  - Architecture Decision Records (ADR)
  - C4 Model diagrams (Context, Container, Component)
  - Glossary of Terms
  - Document Inventory
- Interactive Presentation Decks: HTML/Markdown/Reveal.js decks with synchronized transcript and images
- Interactive CLI: Guided setup with model management
- Configurable Whisper Models: tiny, base, small, medium, large-v3 with auto-selection
## Architecture

## 🚀 Quick Start

### One-Command Setup
```bash
# Clone or navigate to project
cd video_analysis_pipeline

# Run interactive setup (recommended)
python build.py

# Or use Make
make install
```
### Manual Installation

See `BUILD_README.md` for detailed installation options.
### Requirements
- Python 3.9+
- FFmpeg
- API Key (Kimi or Claude)
Install FFmpeg:

```bash
# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt install ffmpeg
```
### Interactive Mode (Recommended)

```bash
# Launch interactive menu
python -m src.pipeline

# Or explicitly
python -m src.pipeline interactive
```
The interactive menu provides:
- 🎬 Process new video - Full guided setup
- ⚡ Quick process - Use saved defaults
- 📦 Manage Whisper models - Download/view models
- ⚙️ Configuration - View/reset settings
## Quick Usage

```bash
# Process a YouTube video
python -m src.pipeline process "https://youtube.com/watch?v=..."

# Process local video
python -m src.pipeline process "./my_video.mp4"

# Custom output directory
python -m src.pipeline process "URL" --output ./my_output

# Adjust frame extraction rate
python -m src.pipeline process "URL" --fps 1.0

# Skip artifact generation
python -m src.pipeline process "URL" --no-artifacts

# Select specific Whisper model
python -m src.pipeline process "URL" --whisper-model medium

# Prioritize transcription speed over accuracy
python -m src.pipeline process "URL" --priority speed

# Use Kimi for vision analysis (instead of Claude)
python -m src.pipeline process "URL" --vision kimi

# Generate presentation deck
python -m src.pipeline process "URL" --deck --deck-format html

# Open generated deck in browser
python -m src.pipeline deck outputs/deck/presentation.html
```
## Model Management

```bash
# List available Whisper models
python -m src.pipeline models

# Download specific model
python -m src.pipeline models --download small

# Download all models (~4.5GB)
python -m src.pipeline models --download-all

# Interactive model manager
python -m src.pipeline models --interactive
```
## Pipeline Stages

### 1. Video Download (`downloader.py`)

Uses yt-dlp to download videos from platforms such as YouTube and Vimeo.
### 2. Audio Processing (`audio_processor.py`)
- Extracts audio to MP3 using FFmpeg
- Transcribes with OpenAI Whisper
- Produces timestamped transcript segments
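As a rough sketch of the extraction step, the FFmpeg invocation looks like the following (the helper name is illustrative and the exact flags in `audio_processor.py` may differ):

```python
def ffmpeg_extract_cmd(video_path: str, audio_path: str, bitrate: str = "192k") -> list[str]:
    """Build an FFmpeg command that drops the video stream (-vn)
    and encodes the audio track to MP3 at the configured bitrate."""
    return [
        "ffmpeg", "-i", video_path,
        "-vn",                    # no video stream in the output
        "-acodec", "libmp3lame",  # MP3 encoder
        "-b:a", bitrate,          # matches the audio_bitrate config option
        "-y",                     # overwrite existing output
        audio_path,
    ]
```

The resulting MP3 is then passed to Whisper, which returns the timestamped segments used downstream.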
### 3. Frame Processing (`frame_processor.py`)
- Extracts frames at configurable rate (default: 0.5 fps)
- Deduplicates using perceptual hashing (pHash)
- Keeps only visually unique frames
### 4. Vision Analysis (`vision_analyzer.py`)

- Analyzes unique frames with Kimi or Claude Vision
- Extracts text, diagrams, UI elements
- Generates structured descriptions
### 5. Content Synthesis (`content_synthesizer.py`)
- Aligns transcript segments with frame timestamps
- Creates semantic chunks
- Merges audio and visual context
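The alignment step can be sketched as follows (a minimal illustration, not the actual `content_synthesizer.py` code; the function name and tuple layout are assumptions):

```python
def align_frames_to_segments(segments, frame_times):
    """Pair each frame timestamp with the transcript segment whose
    [start, end) window covers it (None when the frame falls in a gap).

    segments: sorted, non-overlapping (start_sec, end_sec, text) tuples
    frame_times: sorted frame timestamps in seconds
    """
    aligned = []
    for t in frame_times:
        text = next((s[2] for s in segments if s[0] <= t < s[1]), None)
        aligned.append((t, text))
    return aligned
```

Each (frame, segment) pair then feeds the chunking step, so a chunk carries both what was said and what was on screen at that moment.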
### 6. Artifact Generation (`artifact_generator.py`)
Generates comprehensive documentation:
| Artifact | Purpose |
|---|---|
| Executive Summary | High-level overview for decision makers |
| SDD | System design and architecture |
| TDD | Technical implementation details |
| ADR | Architecture decision records |
| C4 Diagrams | Visual architecture at multiple levels |
| Glossary | Technical terms and definitions |
| Inventory | Document index and navigation |
## Configuration

Edit `PipelineConfig` in `src/models.py` or pass options via CLI:
```python
PipelineConfig(
    download_quality="best",
    audio_bitrate="192k",
    frame_rate=0.5,        # frames per second
    dedup_threshold=10,    # pHash Hamming distance
    chunk_size=4000,       # tokens per chunk
    web_search_enabled=True,
    # Whisper transcription settings
    transcription_backend="local",  # "local" or "openai-api"
    whisper_model="auto",           # "tiny", "base", "small", "medium", "large-v3", "auto"
    whisper_priority="balanced",    # "speed", "quality", "balanced"
)
```
## Whisper Model Selection
| Model | RAM | Relative Speed | Accuracy | Best For |
|---|---|---|---|---|
| tiny | ~0.5GB | 32x | 58% | Quick tests, limited RAM |
| base | ~1GB | 16x | 68% | Default, balanced |
| small | ~2GB | 6x | 76% | Better accuracy |
| medium | ~5GB | 2x | 82% | High quality |
| large-v3 | ~10GB | 1x | 87% | Best quality |
Auto-selection logic:
- < 1 GB RAM → tiny
- 1–2 GB RAM → base
- 2–5 GB RAM → small
- 5–10 GB RAM → medium or large-v3 (by priority)
- > 10 GB RAM → large-v3
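The auto-selection rule can be sketched as a small function (thresholds follow the table above; the function name is illustrative, not the pipeline's actual API):

```python
def pick_whisper_model(free_ram_gb: float, priority: str = "balanced") -> str:
    """Map available RAM (and the configured priority) to a Whisper model size."""
    if free_ram_gb < 1:
        return "tiny"
    if free_ram_gb < 2:
        return "base"
    if free_ram_gb < 5:
        return "small"
    if free_ram_gb < 10:
        # 5-10 GB: priority decides between quality and a smaller footprint
        return "large-v3" if priority == "quality" else "medium"
    return "large-v3"
```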
Environment variables:

```bash
export WHISPER_MODEL_SIZE=medium  # Force specific model
export WHISPER_PRIORITY=quality   # Force priority
export WHISPER_DEVICE=cuda        # Force GPU
```
## Presentation Decks
The pipeline generates interactive HTML presentation decks that synchronize transcript with extracted video frames.
### Features
- Synchronized Playback: Auto-advances slides based on transcript timing
- Dual View: Left side shows video frame, right side shows transcript
- Manual Navigation: Arrow keys or buttons to navigate
- Speed Control: 1x, 1.5x, 2x playback speeds
- Progress Tracking: Visual progress bar
### Deck Formats

| Format | File | Description |
|---|---|---|
| HTML | `presentation.html` | Full-featured interactive deck (default) |
| Markdown | `presentation.md` | Simple Markdown slides |
| Reveal.js | `presentation_reveal.html` | Reveal.js slideshow |
### Using the HTML Deck

```bash
# Generate deck
python -m src.pipeline process "video.mp4" --deck

# Open in browser
python -m src.pipeline deck outputs/deck/presentation.html

# Or open manually
open outputs/deck/presentation.html
```
Keyboard Shortcuts:

- `→` or `Space` - Next slide
- `←` - Previous slide
- `Home` - First slide
- `End` - Last slide
## Project Structure

```
video_analysis_pipeline/
├── docs/
│   ├── sdd/                     # Software Design Document
│   ├── tdd/                     # Technical Design Document
│   ├── adr/                     # Architecture Decision Records
│   └── c4/                      # C4 Model diagrams
├── src/
│   ├── __init__.py
│   ├── models.py                # Pydantic data models
│   ├── downloader.py            # Video download (yt-dlp)
│   ├── audio_processor.py       # Audio extraction & transcription
│   ├── frame_processor.py       # Frame extraction & deduplication
│   ├── vision_analyzer.py       # Image analysis (Claude Vision)
│   ├── content_synthesizer.py   # Content merging & chunking
│   ├── artifact_generator.py    # Document generation
│   ├── web_search.py            # Knowledge expansion
│   └── pipeline.py              # Main orchestrator
├── outputs/
│   ├── videos/                  # Downloaded videos
│   ├── audio/                   # Extracted audio
│   ├── frames/                  # Extracted & deduplicated frames
│   └── artifacts/               # Generated documentation
├── requirements.txt
└── README.md
```
## API Keys

| Service | Key | Purpose | Required |
|---|---|---|---|
| OpenAI | `OPENAI_API_KEY` | Whisper API transcription | Optional (local Whisper available) |
| Anthropic | `ANTHROPIC_API_KEY` | Vision analysis, artifact generation | Optional (Kimi alternative) |
| Kimi/Moonshot | `KIMI_API_KEY` | Vision analysis | Recommended |
| Serper | `SERPER_API_KEY` | Web search expansion | Optional |
### Kimi Vision (Recommended)

Kimi (Moonshot AI) provides excellent vision capabilities for analyzing video frames:

```bash
export KIMI_API_KEY="your-kimi-api-key"
# or
export MOONSHOT_API_KEY="your-kimi-api-key"
```
The pipeline will auto-detect and use Kimi if available, falling back to Claude.
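The detection order described above amounts to something like this (a hedged sketch; the function name is illustrative and the real logic lives in `vision_analyzer.py`):

```python
import os

def pick_vision_backend() -> str:
    """Prefer Kimi when a Kimi/Moonshot key is present; otherwise
    fall back to Claude via ANTHROPIC_API_KEY."""
    if os.getenv("KIMI_API_KEY") or os.getenv("MOONSHOT_API_KEY"):
        return "kimi"
    if os.getenv("ANTHROPIC_API_KEY"):
        return "claude"
    raise RuntimeError("Set KIMI_API_KEY (or MOONSHOT_API_KEY) or ANTHROPIC_API_KEY")
```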
## Algorithms

### Perceptual Deduplication

Uses pHash (perceptual hash) to detect visually similar frames:

1. Compute a 64-bit hash for each frame
2. Compare consecutive frames by Hamming distance
3. Keep the frame if distance > threshold (content changed)
4. Discard it if distance ≤ threshold (visually similar)

Default threshold: 10 (tunable by content type)
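The steps above can be sketched in pure Python, operating on precomputed 64-bit hash values (e.g. from a library such as `imagehash`; hash computation itself is omitted, and these helper names are illustrative):

```python
def hamming(h1: int, h2: int) -> int:
    """Number of differing bits between two 64-bit pHash values."""
    return bin(h1 ^ h2).count("1")

def dedup(hashes: list[int], threshold: int = 10) -> list[int]:
    """Return indices of frames to keep: a frame survives only when its
    hash differs from the previous frame's by more than `threshold` bits."""
    kept = [0] if hashes else []  # always keep the first frame
    for i in range(1, len(hashes)):
        if hamming(hashes[i - 1], hashes[i]) > threshold:
            kept.append(i)
    return kept
```

A lower threshold keeps more frames (e.g. slide decks with subtle changes); a higher one is more aggressive for talking-head footage.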
## Outputs

Generated artifacts are saved as Markdown files in `outputs/artifacts/`:

```
outputs/artifacts/
├── 00_document_inventory.md
├── 01_executive_summary.md
├── 02_introduction.md
├── 03_outline.md
├── 04_software_design_document.md
├── 05_technical_design_document.md
├── 06_architecture_decisions.md
├── 07_c4_system_context.md
├── 08_c4_container_level.md
├── 09_c4_component_level.md
└── 10_glossary.md
```
## Development

```bash
# Run tests
pytest tests/

# Type checking
mypy src/

# Linting
ruff check src/
```
## License

MIT

## Credits

- yt-dlp: Video downloading
- OpenAI Whisper: Speech-to-text
- Anthropic Claude: Vision and text generation
- FFmpeg: Audio/video processing
## 🔑 API Key Options

### Option 1: OpenAI Whisper API (Current Default)
Pros:
- Fast (optimized cloud infrastructure)
- No local GPU needed
- Always up-to-date models
Cons:
- Requires API key
- Costs money (~$0.006/minute)
- Requires internet
Setup:

```bash
export OPENAI_API_KEY="sk-..."
pip install -r requirements.txt
```
### Option 2: Local Whisper (No API Key!)
Pros:
- ✅ No API key needed
- ✅ Free (no usage costs)
- ✅ Works offline
- ✅ Privacy (audio stays local)
Cons:
- Slower (depends on your hardware)
- Downloads ~1-2GB model files
- Requires more RAM/CPU (or GPU)
Setup:

```bash
# Use local requirements
pip install -r requirements-local.txt

# Replace audio_processor.py
mv src/audio_processor.py src/audio_processor_api.py
mv src/audio_processor_local.py src/audio_processor.py

# No API key needed for transcription!
export ANTHROPIC_API_KEY="sk-ant-..."  # Still needed for vision
```
Model Sizes:
| Model | Parameters | VRAM | Speed | Accuracy |
|---|---|---|---|---|
| tiny | 39M | ~1GB | Fastest | Basic |
| base | 74M | ~1GB | Fast | Good |
| small | 244M | ~2GB | Medium | Better |
| medium | 769M | ~5GB | Slow | High |
| large-v3 | 1550M | ~10GB | Slowest | Best |
Edit `src/audio_processor.py` to change the model:

```python
def __init__(self, model_size: str = "base"):  # Change to "small" or "medium"
```
### Comparison
| Feature | API Whisper | Local Whisper |
|---|---|---|
| API Key | Yes | No |
| Cost | ~$0.006/min | Free |
| Internet | Required | Not needed |
| Speed | Very fast | Depends on HW |
| Privacy | Audio sent to OpenAI | Stays local |
| Setup | Simple | Downloads models |
| Hardware | Any | CPU/GPU intensive |
### Recommended Setup
For development/testing:
- Use OpenAI API (faster iteration)
For production/bulk processing:
- Use Local Whisper (no ongoing costs)
For privacy-sensitive content:
- Use Local Whisper (keeps data local)