# Unified Multimodal Pipeline

Complete video analysis pipeline using local models (LM Studio) with cloud fallback.
## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                   UNIFIED VIDEO PIPELINE                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │    VISION    │  │    AUDIO     │  │  ARTIFACTS   │       │
│  ├──────────────┤  ├──────────────┤  ├──────────────┤       │
│  │ Qwen3-VL-8B  │  │  AudioGemma  │  │  Gemma3-12B  │       │
│  │   GLM-4.6V   │  │   (local)    │  │   (local)    │       │
│  │ (LM Studio)  │  │              │  │              │       │
│  └──────────────┘  └──────────────┘  └──────────────┘       │
│         │                 │                 │               │
│         └─────────────────┼─────────────────┘               │
│                           ▼                                 │
│                ┌─────────────────────┐                      │
│                │  Content Synthesis  │                      │
│                └─────────────────────┘                      │
│                           │                                 │
│                           ▼                                 │
│                ┌─────────────────────┐                      │
│                │    11 Artifacts     │                      │
│                │ (SDD, TDD, C4, etc) │                      │
│                └─────────────────────┘                      │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
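The three streams in the diagram run independently and are merged by the synthesis step. A minimal sketch of that shape is below; the class and function names are illustrative, not the actual ones in `src/pipeline_unified.py`.

```python
from dataclasses import dataclass


@dataclass
class StreamResult:
    """Output of one analysis stream: "vision", "audio", or "artifacts"."""
    stream: str
    summary: str


def synthesize(results: list[StreamResult]) -> str:
    """Merge per-stream summaries into one context for artifact generation."""
    return "\n\n".join(f"## {r.stream.upper()}\n{r.summary}" for r in results)
```

Because each stream produces an independent `StreamResult`, the vision, audio, and artifact models can run concurrently and synthesis only needs their text summaries.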
## Providers

### Local (LM Studio)

| Type | Model | Size | Status |
|---|---|---|---|
| Vision | Qwen3-VL-8B | 8B | ✅ Ready |
| Vision | GLM-4.6V-Flash | 9B | ✅ Ready |
| Audio | AudioGemma | ? | ⏳ Downloading |
| Text | Gemma3-12B | 12B | ✅ Ready |

### Cloud (Fallback)

| Type | Model | Status |
|---|---|---|
| Vision | Claude 3 Haiku | ✅ Available |
| Audio | Whisper (local) | ✅ Ready |
| Text | Claude 3 Haiku | ✅ Available |
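The local-first selection implied by these tables can be sketched as follows. The endpoint is LM Studio's default OpenAI-compatible address (`localhost:1234`) and may differ on your machine; `pick_provider` is a hypothetical helper, not the pipeline's actual API.

```python
from urllib.request import urlopen
from urllib.error import URLError

LMSTUDIO_URL = "http://localhost:1234/v1/models"  # LM Studio default


def lmstudio_running(url: str = LMSTUDIO_URL, timeout: float = 1.0) -> bool:
    """Return True if the LM Studio server answers on its models endpoint."""
    try:
        with urlopen(url, timeout=timeout):
            return True
    except (URLError, OSError):
        return False


def pick_provider(kind: str, local_ok: bool) -> str:
    """Prefer local providers; fall back to the cloud column above."""
    local = {"vision": "lmstudio", "audio": "audiogemma", "text": "lmstudio"}
    cloud = {"vision": "claude", "audio": "whisper", "text": "claude"}
    return local[kind] if local_ok else cloud[kind]
```

Probing the server once at startup (rather than per request) keeps the fallback decision cheap and consistent across all three streams.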
## Quick Start

### 1. List Available Providers

```bash
./vp list-models
```

### 2. Process a Video with Local Models

```bash
# Use best local models (auto-select)
./vp process video.mp4 --local

# Use a specific vision model
./vp process video.mp4 --vision qwen
./vp process video.mp4 --vision glm

# YouTube URL
./vp process https://youtube.com/watch?v=xxx --local
```

### 3. Mix Local and Cloud

```bash
# Local vision, cloud artifacts
./vp process video.mp4 --vision qwen --artifacts claude

# Cloud processing
./vp process video.mp4 --cloud
```
## Files Created

### New Provider Modules

| File | Purpose |
|---|---|
| `src/lmstudio_provider.py` | LM Studio integration (vision + text) |
| `src/audiogemma_provider.py` | AudioGemma audio analysis |
| `src/pipeline_unified.py` | Unified multimodal pipeline |

### CLI and Tests

| File | Purpose |
|---|---|
| `unified_cli.py` | Standalone CLI for the unified pipeline |
| `test_unified.py` | Provider availability test |
| `test_lmstudio_vision.py` | Vision model test |
| `test_audiogemma.py` | AudioGemma test |

### Updated

| File | Changes |
|---|---|
| `vp` | Added `--local`, `--vision`, `--artifacts` flags |
| `src/kimi_vision.py` | Fixed model reference |
## Configuration

### Environment Variables

```bash
# Control provider selection
export USE_LOCAL_MODELS=true   # Prefer local models
export VISION_MODEL=qwen       # qwen or glm

# API keys (for cloud fallback)
export ANTHROPIC_API_KEY="..."  # Claude
export OPENAI_API_KEY="..."     # OpenAI
export KIMI_API_KEY="..."       # Moonshot
```
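A sketch of how these variables might be read at startup; the actual provider code may parse them differently, and `provider_defaults` is a hypothetical helper.

```python
import os


def provider_defaults() -> dict:
    """Derive default providers from USE_LOCAL_MODELS and VISION_MODEL."""
    use_local = os.environ.get("USE_LOCAL_MODELS", "false").lower() == "true"
    vision_model = os.environ.get("VISION_MODEL", "qwen")
    if vision_model not in ("qwen", "glm"):
        raise ValueError("VISION_MODEL must be 'qwen' or 'glm'")
    return {
        "vision_provider": "lmstudio" if use_local else "claude",
        "lmstudio_vision_model": vision_model,
    }
```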
### Pipeline Config

```python
from pathlib import Path

from src.pipeline_unified import UnifiedPipelineConfig

config = UnifiedPipelineConfig(
    vision_provider="lmstudio",    # or "claude", "kimi"
    audio_provider="audiogemma",   # or "whisper"
    artifact_provider="lmstudio",  # or "claude"
    lmstudio_vision_model="qwen",  # or "glm"
    output_dir=Path("outputs"),
)
```
## Features by Provider

### Qwen3-VL-8B (Vision)

- ✅ High-quality image understanding
- ✅ Text recognition in images
- ✅ UI/component detection
- ✅ Fast inference

### GLM-4.6V-Flash (Vision)

- ✅ Optimized for speed
- ✅ Good image comprehension
- ✅ Efficient for batch processing

### AudioGemma (Audio)

- 🔄 Downloading...
- Will support:
  - Audio content understanding
  - Sentiment analysis
  - Speaker identification
  - Direct audio summarization

### Gemma3-12B (Text)

- ✅ Document generation
- ✅ Technical writing
- ✅ Architecture descriptions
## Output Artifacts

All pipelines generate 11 artifacts:

1. **Document Inventory** - Index of all artifacts
2. **Executive Summary** - High-level overview
3. **Introduction** - Context and purpose
4. **Content Outline** - Hierarchical structure
5. **Software Design Document (SDD)** - Architecture
6. **Technical Design Document (TDD)** - API specs, data models
7. **Architecture Decision Records (ADR)** - Key decisions
8. **C4 System Context** - Level 1 diagram
9. **C4 Container Level** - Level 2 diagram
10. **C4 Component Level** - Level 3 diagram
11. **Glossary** - Terms and definitions
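One way the 11 artifact titles above could map to stable on-disk filenames. The numbering-and-slug scheme here is an illustration, not necessarily what the pipeline writes to `outputs/`.

```python
import re

ARTIFACTS = [
    "Document Inventory", "Executive Summary", "Introduction",
    "Content Outline", "Software Design Document (SDD)",
    "Technical Design Document (TDD)",
    "Architecture Decision Records (ADR)", "C4 System Context",
    "C4 Container Level", "C4 Component Level", "Glossary",
]


def slug(title: str) -> str:
    """Lowercase the title, drop parentheticals, join words with underscores."""
    title = re.sub(r"\s*\([^)]*\)", "", title)
    return re.sub(r"[^a-z0-9]+", "_", title.lower()).strip("_")


# e.g. "05_software_design_document.md"
filenames = [f"{i:02d}_{slug(t)}.md" for i, t in enumerate(ARTIFACTS, 1)]
```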
## Troubleshooting

### LM Studio Not Running

Start the LM Studio server:

1. Open the LM Studio app
2. Go to the Developer tab
3. Click "Start Server"

### Model Not Found

```bash
# Check available models
./vp list-models
# Load the model in LM Studio if it is missing
```

### AudioGemma Not Ready

The download is still in progress as long as `.part` files are present:

```bash
ls ~/.lmstudio/models/mradermacher/Audiogemma-3N-finetune-GGUF/
```
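The same check can be scripted: the download is complete once the model directory exists and contains no `.part` files. `download_complete` is a hypothetical helper, and the path should match the `ls` command above on your machine.

```python
from pathlib import Path


def download_complete(model_dir: Path) -> bool:
    """True when the directory exists and holds no partial downloads."""
    return model_dir.is_dir() and not any(model_dir.glob("*.part"))
```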
## Performance Comparison

| Provider | Vision Speed | Artifact Speed | Quality |
|---|---|---|---|
| LM Studio (local) | Fast | Fast | Good |
| Claude (cloud) | Medium | Medium | Very good |

**Recommendation:** use local models for speed and privacy; use cloud models for maximum quality.
## Next Steps

1. **Wait for the AudioGemma download** - audio analysis becomes available once it completes
2. **Test with different vision models** - try both Qwen and GLM
3. **Compare quality** - local vs. cloud artifacts
4. **Batch processing** - process multiple videos
## All Commands

```bash
# List providers
./vp list-models

# Process with local models
./vp process video.mp4 --local
./vp process video.mp4 --vision qwen
./vp process video.mp4 --vision glm

# Process with cloud
./vp process video.mp4 --cloud

# Mix providers
./vp process video.mp4 --vision qwen --artifacts claude

# Custom output
./vp process video.mp4 --output ./results --fps 1.0

# Skip artifacts
./vp process video.mp4 --no-artifacts

# Help
./vp help
```