Video-to-Knowledge Pipeline

A comprehensive system for converting video content into structured knowledge artifacts including SDD, TDD, ADR, C4 diagrams, interactive presentation decks, and more.

Features

  • Multi-Source Ingestion: Download from YouTube, Vimeo, or process local videos
  • Audio Intelligence: Extract and transcribe audio with Whisper (local or API)
  • Visual Analysis: Sample frames, deduplicate, and analyze with Kimi or Claude Vision
  • Content Synthesis: Merge transcript and visual analysis
  • Artifact Generation: Auto-generate:
    • Executive Summary
    • Introduction & Outline
    • Software Design Document (SDD)
    • Technical Design Document (TDD)
    • Architecture Decision Records (ADR)
    • C4 Model diagrams (Context, Container, Component)
    • Glossary of Terms
    • Document Inventory
  • Interactive Presentation Decks: HTML/Markdown/Reveal.js decks with synchronized transcript and images
  • Interactive CLI: Guided setup with model management
  • Configurable Whisper Models: tiny, base, small, medium, large-v3 with auto-selection

Architecture

🚀 Quick Start

One-Command Setup

# Clone or navigate to project
cd video_analysis_pipeline

# Run interactive setup (recommended)
python build.py

# Or use Make
make install

Manual Installation

See BUILD_README.md for detailed installation options.

Requirements

  • Python 3.9+
  • FFmpeg
  • API Key (Kimi or Claude)

Install FFmpeg:

# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt install ffmpeg

Interactive Mode

# Launch interactive menu
python -m src.pipeline

# Or explicitly
python -m src.pipeline interactive

The interactive menu provides:

  • 🎬 Process new video - Full guided setup
  • Quick process - Use saved defaults
  • 📦 Manage Whisper models - Download/view models
  • ⚙️ Configuration - View/reset settings

Quick Usage

# Process a YouTube video
python -m src.pipeline process "https://youtube.com/watch?v=..."

# Process local video
python -m src.pipeline process "./my_video.mp4"

# Custom output directory
python -m src.pipeline process "URL" --output ./my_output

# Adjust frame extraction rate
python -m src.pipeline process "URL" --fps 1.0

# Skip artifact generation
python -m src.pipeline process "URL" --no-artifacts

# Select specific Whisper model
python -m src.pipeline process "URL" --whisper-model medium

# Prioritize transcription speed over accuracy
python -m src.pipeline process "URL" --priority speed

# Use Kimi for vision analysis (instead of Claude)
python -m src.pipeline process "URL" --vision kimi

# Generate presentation deck
python -m src.pipeline process "URL" --deck --deck-format html

# Open generated deck in browser
python -m src.pipeline deck outputs/deck/presentation.html

Model Management

# List available Whisper models
python -m src.pipeline models

# Download specific model
python -m src.pipeline models --download small

# Download all models (~4.5GB)
python -m src.pipeline models --download-all

# Interactive model manager
python -m src.pipeline models --interactive

Pipeline Stages

1. Video Download (downloader.py)

Uses yt-dlp to download videos from platforms such as YouTube and Vimeo.

2. Audio Processing (audio_processor.py)

  • Extracts audio to MP3 using FFmpeg
  • Transcribes with OpenAI Whisper
  • Produces timestamped transcript segments
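The timestamped segments can be modeled with a structure like this (an illustrative sketch; the pipeline's actual Pydantic models live in src/models.py and may differ):

```python
from dataclasses import dataclass

@dataclass
class TranscriptSegment:
    start: float  # offset from video start, in seconds
    end: float
    text: str

# Example: two consecutive segments from a transcript
segments = [
    TranscriptSegment(0.0, 4.2, "Welcome to the architecture overview."),
    TranscriptSegment(4.2, 9.8, "We'll start with the system context."),
]
```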

3. Frame Processing (frame_processor.py)

  • Extracts frames at configurable rate (default: 0.5 fps)
  • Deduplicates using perceptual hashing (pHash)
  • Keeps only visually unique frames

4. Vision Analysis (vision_analyzer.py)

  • Analyzes unique frames with Claude Vision
  • Extracts text, diagrams, UI elements
  • Generates structured descriptions

5. Content Synthesis (content_synthesizer.py)

  • Aligns transcript segments with frame timestamps
  • Creates semantic chunks
  • Merges audio and visual context
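The alignment step can be sketched as follows. This is a minimal example (function and field names are illustrative, not the pipeline's actual API) that pairs each deduplicated frame with the transcript segments covering its timestamp:

```python
def align_frames_to_segments(frames, segments):
    """Pair each frame timestamp with the transcript text covering it.

    frames:   list of (timestamp_seconds, frame_path) tuples
    segments: list of (start, end, text) tuples, sorted by start time
    """
    aligned = []
    for ts, path in frames:
        # Collect text from every segment whose interval contains this frame
        text = " ".join(t for start, end, t in segments if start <= ts < end)
        aligned.append({"frame": path, "time": ts, "text": text})
    return aligned

frames = [(2.0, "frame_0002.jpg"), (7.5, "frame_0007.jpg")]
segments = [(0.0, 5.0, "Intro to the system."), (5.0, 10.0, "Container view.")]
print(align_frames_to_segments(frames, segments))
```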

6. Artifact Generation (artifact_generator.py)

Generates comprehensive documentation:

| Artifact | Purpose |
| --- | --- |
| Executive Summary | High-level overview for decision makers |
| SDD | System design and architecture |
| TDD | Technical implementation details |
| ADR | Architecture decision records |
| C4 Diagrams | Visual architecture at multiple levels |
| Glossary | Technical terms and definitions |
| Inventory | Document index and navigation |

Configuration

Edit PipelineConfig in src/models.py or pass options via CLI:

PipelineConfig(
    download_quality="best",
    audio_bitrate="192k",
    frame_rate=0.5,       # frames per second
    dedup_threshold=10,   # pHash Hamming distance
    chunk_size=4000,      # tokens per chunk
    web_search_enabled=True,

    # Whisper transcription settings
    transcription_backend="local",  # "local" or "openai-api"
    whisper_model="auto",           # "tiny", "base", "small", "medium", "large-v3", "auto"
    whisper_priority="balanced",    # "speed", "quality", "balanced"
)

Whisper Model Selection

| Model | RAM | Speed | Accuracy | Best For |
| --- | --- | --- | --- | --- |
| tiny | ~0.5GB | 32x | 58% | Quick tests, limited RAM |
| base | ~1GB | 16x | 68% | Default, balanced |
| small | ~2GB | 6x | 76% | Better accuracy |
| medium | ~5GB | 2x | 82% | High quality |
| large-v3 | ~10GB | 1x | 87% | Best quality |

Auto-selection logic:

  • < 1GB RAM → tiny
  • 1-2GB RAM → base
  • 2-5GB RAM → small
  • 5-10GB RAM → medium or large-v3 (by priority)
  • > 10GB RAM → large-v3
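The selection rule above can be expressed as a small function (a sketch of the logic, not the pipeline's exact implementation):

```python
def select_whisper_model(ram_gb: float, priority: str = "balanced") -> str:
    """Pick a Whisper model size from available RAM, per the rules above."""
    if ram_gb < 1:
        return "tiny"
    if ram_gb < 2:
        return "base"
    if ram_gb < 5:
        return "small"
    if ram_gb < 10:
        # 5-10GB: priority decides between medium and large-v3
        return "large-v3" if priority == "quality" else "medium"
    return "large-v3"

print(select_whisper_model(8, priority="quality"))  # large-v3
```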

Environment variables:

export WHISPER_MODEL_SIZE=medium   # Force specific model
export WHISPER_PRIORITY=quality    # Force priority
export WHISPER_DEVICE=cuda         # Force GPU

Presentation Decks

The pipeline generates interactive HTML presentation decks that synchronize transcript with extracted video frames.

Features

  • Synchronized Playback: Auto-advances slides based on transcript timing
  • Dual View: Left side shows video frame, right side shows transcript
  • Manual Navigation: Arrow keys or buttons to navigate
  • Speed Control: 1x, 1.5x, 2x playback speeds
  • Progress Tracking: Visual progress bar
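Synchronized playback boils down to mapping elapsed time to a slide index. A minimal sketch of that lookup (the deck itself does this in JavaScript; names here are illustrative):

```python
import bisect

def slide_for_time(slide_start_times, elapsed_seconds):
    """Return the index of the slide active at `elapsed_seconds`.

    slide_start_times: sorted list of slide start offsets, in seconds.
    """
    # bisect_right finds the first start time after `elapsed_seconds`;
    # the active slide is the one just before that position.
    return max(0, bisect.bisect_right(slide_start_times, elapsed_seconds) - 1)

starts = [0.0, 12.5, 30.0, 47.2]
print(slide_for_time(starts, 15.0))  # slide index 1 is active at 15s
```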

Deck Formats

| Format | File | Description |
| --- | --- | --- |
| HTML | presentation.html | Full-featured interactive deck (default) |
| Markdown | presentation.md | Simple markdown slides |
| Reveal.js | presentation_reveal.html | Reveal.js slideshow |

Using the HTML Deck

# Generate deck
python -m src.pipeline process "video.mp4" --deck

# Open in browser
python -m src.pipeline deck outputs/deck/presentation.html

# Or open manually
open outputs/deck/presentation.html

Keyboard Shortcuts:

  • → or Space - Next slide
  • ← - Previous slide
  • Home - First slide
  • End - Last slide

Project Structure

video_analysis_pipeline/
├── docs/
│   ├── sdd/                     # Software Design Document
│   ├── tdd/                     # Technical Design Document
│   ├── adr/                     # Architecture Decision Records
│   └── c4/                      # C4 Model diagrams
├── src/
│   ├── __init__.py
│   ├── models.py                # Pydantic data models
│   ├── downloader.py            # Video download (yt-dlp)
│   ├── audio_processor.py       # Audio extraction & transcription
│   ├── frame_processor.py       # Frame extraction & deduplication
│   ├── vision_analyzer.py       # Image analysis (Claude Vision)
│   ├── content_synthesizer.py   # Content merging & chunking
│   ├── artifact_generator.py    # Document generation
│   ├── web_search.py            # Knowledge expansion
│   └── pipeline.py              # Main orchestrator
├── outputs/
│   ├── videos/                  # Downloaded videos
│   ├── audio/                   # Extracted audio
│   ├── frames/                  # Extracted & deduplicated frames
│   └── artifacts/               # Generated documentation
├── requirements.txt
└── README.md

API Keys

| Service | Key | Purpose | Required |
| --- | --- | --- | --- |
| OpenAI | OPENAI_API_KEY | Whisper API transcription | Optional (local Whisper available) |
| Anthropic | ANTHROPIC_API_KEY | Vision analysis, artifact generation | Optional (Kimi alternative) |
| Kimi/Moonshot | KIMI_API_KEY | Vision analysis | Recommended |
| Serper | SERPER_API_KEY | Web search expansion | Optional |

Kimi (Moonshot AI) provides excellent vision capabilities for analyzing video frames:

export KIMI_API_KEY="your-kimi-api-key"
# or
export MOONSHOT_API_KEY="your-kimi-api-key"

The pipeline will auto-detect and use Kimi if available, falling back to Claude.
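The auto-detection described above amounts to checking environment variables in priority order, roughly like this (a sketch of the behavior, not the pipeline's actual code):

```python
import os

def pick_vision_backend(env=None):
    """Prefer Kimi if a key is present, otherwise fall back to Claude."""
    if env is None:
        env = os.environ
    if env.get("KIMI_API_KEY") or env.get("MOONSHOT_API_KEY"):
        return "kimi"
    if env.get("ANTHROPIC_API_KEY"):
        return "claude"
    raise RuntimeError("No vision API key found (set KIMI_API_KEY or ANTHROPIC_API_KEY)")

print(pick_vision_backend({"KIMI_API_KEY": "k-..."}))  # kimi
```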

Algorithms

Perceptual Deduplication

Uses pHash (perceptual hash) to detect visually similar frames:

  1. Compute 64-bit hash for each frame
  2. Compare consecutive frames with Hamming distance
  3. Keep frame if distance > threshold (content changed)
  4. Discard if distance ≤ threshold (visually similar)

Default threshold: 10 (tunable by content type)
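The steps above can be sketched in a few lines. Here the 64-bit hashes are plain integers (in the pipeline they come from a perceptual-hash library); only the keep/discard decision is shown:

```python
def dedup_by_hamming(hashes, threshold=10):
    """Return indices of frames to keep, comparing each frame's hash
    against the last kept frame's hash."""
    if not hashes:
        return []
    kept = [0]  # always keep the first frame
    for i in range(1, len(hashes)):
        # Hamming distance = number of differing bits in the XOR
        distance = bin(hashes[i] ^ hashes[kept[-1]]).count("1")
        if distance > threshold:  # content changed enough: keep it
            kept.append(i)
    return kept

# Frames 0 and 1 are near-duplicates (1 bit apart); frame 2 differs in 16 bits.
hashes = [0b1111000011110000, 0b1111000011110001, 0b0000111100001111]
print(dedup_by_hamming(hashes, threshold=10))  # [0, 2]
```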

Outputs

Generated artifacts are saved as Markdown files in outputs/artifacts/:

outputs/artifacts/
├── 00_document_inventory.md
├── 01_executive_summary.md
├── 02_introduction.md
├── 03_outline.md
├── 04_software_design_document.md
├── 05_technical_design_document.md
├── 06_architecture_decisions.md
├── 07_c4_system_context.md
├── 08_c4_container_level.md
├── 09_c4_component_level.md
└── 10_glossary.md

Development

# Run tests
pytest tests/

# Type checking
mypy src/

# Linting
ruff check src/

License

MIT

Credits

  • yt-dlp: Video downloading
  • OpenAI Whisper: Speech-to-text
  • Anthropic Claude: Vision and text generation
  • FFmpeg: Audio/video processing

🔑 API Key Options

Option 1: OpenAI Whisper API (Current Default)

Pros:

  • Fast (optimized cloud infrastructure)
  • No local GPU needed
  • Always up-to-date models

Cons:

  • Requires API key
  • Costs money (~$0.006/minute)
  • Requires internet

Setup:

export OPENAI_API_KEY="sk-..."
pip install -r requirements.txt

Option 2: Local Whisper (No API Key!)

Pros:

  • ✅ No API key needed
  • ✅ Free (no usage costs)
  • ✅ Works offline
  • ✅ Privacy (audio stays local)

Cons:

  • Slower (depends on your hardware)
  • Downloads ~1-2GB model files
  • Requires more RAM/CPU (or GPU)

Setup:

# Use local requirements
pip install -r requirements-local.txt

# Replace audio_processor.py
mv src/audio_processor.py src/audio_processor_api.py
mv src/audio_processor_local.py src/audio_processor.py

# No API key needed for transcription!
export ANTHROPIC_API_KEY="sk-ant-..." # Still needed for vision

Model Sizes:

| Model | Size | VRAM | Speed | Accuracy |
| --- | --- | --- | --- | --- |
| tiny | 39M | ~1GB | Fastest | Basic |
| base | 74M | ~1GB | Fast | Good |
| small | 244M | ~2GB | Medium | Better |
| medium | 769M | ~5GB | Slow | High |
| large-v3 | 1550M | ~10GB | Slowest | Best |

Edit src/audio_processor.py to change model:

def __init__(self, model_size: str = "base"):  # Change to "small" or "medium"

Comparison

| Feature | API Whisper | Local Whisper |
| --- | --- | --- |
| API Key | Yes | No |
| Cost | ~$0.006/min | Free |
| Internet | Required | Not needed |
| Speed | Very fast | Depends on HW |
| Privacy | Audio sent to OpenAI | Stays local |
| Setup | Simple | Downloads models |
| Hardware | Any | CPU/GPU intensive |

For development/testing:

  • Use OpenAI API (faster iteration)

For production/bulk processing:

  • Use Local Whisper (no ongoing costs)

For privacy-sensitive content:

  • Use Local Whisper (keeps data local)