# Unified Multimodal Pipeline

Complete video analysis pipeline using local models (LM Studio) with cloud fallback.
## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                   UNIFIED VIDEO PIPELINE                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │    VISION    │  │    AUDIO     │  │  ARTIFACTS   │       │
│  ├──────────────┤  ├──────────────┤  ├──────────────┤       │
│  │ Qwen3-VL-8B  │  │  AudioGemma  │  │  Gemma3-12B  │       │
│  │   GLM-4.6V   │  │   (local)    │  │   (local)    │       │
│  │ (LM Studio)  │  │              │  │              │       │
│  └──────────────┘  └──────────────┘  └──────────────┘       │
│         │                 │                 │               │
│         └─────────────────┼─────────────────┘               │
│                           ▼                                 │
│                ┌─────────────────────┐                      │
│                │  Content Synthesis  │                      │
│                └─────────────────────┘                      │
│                           │                                 │
│                           ▼                                 │
│                ┌─────────────────────┐                      │
│                │    11 Artifacts     │                      │
│                │ (SDD, TDD, C4, etc) │                      │
│                └─────────────────────┘                      │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
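The three streams in the diagram run independently and are merged by the synthesis step. A minimal sketch of that shape is below; the class and function names are illustrative, not the actual ones in `src/pipeline_unified.py`.

```python
from dataclasses import dataclass


@dataclass
class StreamResult:
    """Output of one analysis stream: "vision", "audio", or "artifacts"."""
    stream: str
    summary: str


def synthesize(results: list[StreamResult]) -> str:
    """Merge per-stream summaries into one context for artifact generation."""
    return "\n\n".join(f"## {r.stream.upper()}\n{r.summary}" for r in results)
```

Because each stream produces an independent `StreamResult`, the vision, audio, and artifact models can run concurrently and synthesis only needs their text summaries.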
## Providers

### Local (LM Studio)

| Type | Model | Size | Status |
|---|---|---|---|
| Vision | Qwen3-VL-8B | 8B | ✅ Ready |
| Vision | GLM-4.6V-Flash | 9B | ✅ Ready |
| Audio | AudioGemma | ? | ⏳ Downloading |
| Text | Gemma3-12B | 12B | ✅ Ready |

### Cloud (Fallback)

| Type | Model | Status |
|---|---|---|
| Vision | Claude 3 Haiku | ✅ Available |
| Audio | Whisper (local) | ✅ Ready |
| Text | Claude 3 Haiku | ✅ Available |
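The local-first selection implied by these tables can be sketched as follows. The endpoint is LM Studio's default OpenAI-compatible address (`localhost:1234`) and may differ on your machine; `pick_provider` is a hypothetical helper, not the pipeline's actual API.

```python
from urllib.request import urlopen
from urllib.error import URLError

LMSTUDIO_URL = "http://localhost:1234/v1/models"  # LM Studio default


def lmstudio_running(url: str = LMSTUDIO_URL, timeout: float = 1.0) -> bool:
    """Return True if the LM Studio server answers on its models endpoint."""
    try:
        with urlopen(url, timeout=timeout):
            return True
    except (URLError, OSError):
        return False


def pick_provider(kind: str, local_ok: bool) -> str:
    """Prefer local providers; fall back to the cloud column above."""
    local = {"vision": "lmstudio", "audio": "audiogemma", "text": "lmstudio"}
    cloud = {"vision": "claude", "audio": "whisper", "text": "claude"}
    return local[kind] if local_ok else cloud[kind]
```

Probing the server once at startup (rather than per request) keeps the fallback decision cheap and consistent across all three streams.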
## Quick Start

### 1. List Available Providers

```bash
./vp list-models
```

### 2. Process a Video with Local Models

```bash
# Use best local models (auto-select)
./vp process video.mp4 --local

# Use a specific vision model
./vp process video.mp4 --vision qwen
./vp process video.mp4 --vision glm

# YouTube URL
./vp process https://youtube.com/watch?v=xxx --local
```

### 3. Mix Local and Cloud

```bash
# Local vision, cloud artifacts
./vp process video.mp4 --vision qwen --artifacts claude

# Cloud processing
./vp process video.mp4 --cloud
```
## Files Created

### New Provider Modules

| File | Purpose |
|---|---|
| `src/lmstudio_provider.py` | LM Studio integration (vision + text) |
| `src/audiogemma_provider.py` | AudioGemma audio analysis |
| `src/pipeline_unified.py` | Unified multimodal pipeline |

### CLI and Tests

| File | Purpose |
|---|---|
| `unified_cli.py` | Standalone CLI for the unified pipeline |
| `test_unified.py` | Provider availability test |
| `test_lmstudio_vision.py` | Vision model test |
| `test_audiogemma.py` | AudioGemma test |

### Updated

| File | Changes |
|---|---|
| `vp` | Added `--local`, `--vision`, `--artifacts` flags |
| `src/kimi_vision.py` | Fixed model reference |
## Configuration

### Environment Variables

```bash
# Control provider selection
export USE_LOCAL_MODELS=true   # Prefer local models
export VISION_MODEL=qwen       # qwen or glm

# API keys (for cloud fallback)
export ANTHROPIC_API_KEY="..."  # Claude
export OPENAI_API_KEY="..."     # OpenAI
export KIMI_API_KEY="..."       # Moonshot
```
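A sketch of how these variables might be read at startup; the actual provider code may parse them differently, and `provider_defaults` is a hypothetical helper.

```python
import os


def provider_defaults() -> dict:
    """Derive default providers from USE_LOCAL_MODELS and VISION_MODEL."""
    use_local = os.environ.get("USE_LOCAL_MODELS", "false").lower() == "true"
    vision_model = os.environ.get("VISION_MODEL", "qwen")
    if vision_model not in ("qwen", "glm"):
        raise ValueError("VISION_MODEL must be 'qwen' or 'glm'")
    return {
        "vision_provider": "lmstudio" if use_local else "claude",
        "lmstudio_vision_model": vision_model,
    }
```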
### Pipeline Config

```python
from pathlib import Path

from src.pipeline_unified import UnifiedPipelineConfig

config = UnifiedPipelineConfig(
    vision_provider="lmstudio",    # or "claude", "kimi"
    audio_provider="audiogemma",   # or "whisper"
    artifact_provider="lmstudio",  # or "claude"
    lmstudio_vision_model="qwen",  # or "glm"
    output_dir=Path("outputs"),
)
```
## Features by Provider

### Qwen3-VL-8B (Vision)

- ✅ High-quality image understanding
- ✅ Text recognition in images
- ✅ UI/component detection
- ✅ Fast inference

### GLM-4.6V-Flash (Vision)

- ✅ Optimized for speed
- ✅ Good image comprehension
- ✅ Efficient for batch processing

### AudioGemma (Audio)

- 🔄 Downloading...
- Will support:
  - Audio content understanding
  - Sentiment analysis
  - Speaker identification
  - Direct audio summarization

### Gemma3-12B (Text)

- ✅ Document generation
- ✅ Technical writing
- ✅ Architecture descriptions
## Output Artifacts

All pipelines generate 11 artifacts:

1. **Document Inventory** - Index of all artifacts
2. **Executive Summary** - High-level overview
3. **Introduction** - Context and purpose
4. **Content Outline** - Hierarchical structure
5. **Software Design Document (SDD)** - Architecture
6. **Technical Design Document (TDD)** - API specs, data models
7. **Architecture Decision Records (ADR)** - Key decisions
8. **C4 System Context** - Level 1 diagram
9. **C4 Container Level** - Level 2 diagram
10. **C4 Component Level** - Level 3 diagram
11. **Glossary** - Terms and definitions
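One way the 11 artifact titles above could map to stable on-disk filenames. The numbering-and-slug scheme here is an illustration, not necessarily what the pipeline writes to `outputs/`.

```python
import re

ARTIFACTS = [
    "Document Inventory", "Executive Summary", "Introduction",
    "Content Outline", "Software Design Document (SDD)",
    "Technical Design Document (TDD)",
    "Architecture Decision Records (ADR)", "C4 System Context",
    "C4 Container Level", "C4 Component Level", "Glossary",
]


def slug(title: str) -> str:
    """Lowercase the title, drop parentheticals, join words with underscores."""
    title = re.sub(r"\s*\([^)]*\)", "", title)
    return re.sub(r"[^a-z0-9]+", "_", title.lower()).strip("_")


# e.g. "05_software_design_document.md"
filenames = [f"{i:02d}_{slug(t)}.md" for i, t in enumerate(ARTIFACTS, 1)]
```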
## Troubleshooting

### LM Studio Not Running

Start the LM Studio server:

1. Open the LM Studio app
2. Go to the Developer tab
3. Click "Start Server"

### Model Not Found

```bash
# Check available models
./vp list-models
# Load the model in LM Studio if it is missing
```

### AudioGemma Not Ready

The download is still in progress as long as `.part` files are present:

```bash
ls ~/.lmstudio/models/mradermacher/Audiogemma-3N-finetune-GGUF/
```
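The same check can be scripted: the download is complete once the model directory exists and contains no `.part` files. `download_complete` is a hypothetical helper, and the path should match the `ls` command above on your machine.

```python
from pathlib import Path


def download_complete(model_dir: Path) -> bool:
    """True when the directory exists and holds no partial downloads."""
    return model_dir.is_dir() and not any(model_dir.glob("*.part"))
```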
## Performance Comparison

| Provider | Vision Speed | Artifact Speed | Quality |
|---|---|---|---|
| LM Studio (local) | Fast | Fast | Good |
| Claude (cloud) | Medium | Medium | Very good |

**Recommendation:** use local models for speed and privacy; use cloud models for maximum quality.
## Next Steps

1. **Wait for the AudioGemma download** - audio analysis becomes available once it completes
2. **Test with different vision models** - try both Qwen and GLM
3. **Compare quality** - local vs. cloud artifacts
4. **Batch processing** - process multiple videos
## All Commands

```bash
# List providers
./vp list-models

# Process with local models
./vp process video.mp4 --local
./vp process video.mp4 --vision qwen
./vp process video.mp4 --vision glm

# Process with cloud
./vp process video.mp4 --cloud

# Mix providers
./vp process video.mp4 --vision qwen --artifacts claude

# Custom output
./vp process video.mp4 --output ./results --fps 1.0

# Skip artifacts
./vp process video.mp4 --no-artifacts

# Help
./vp help
```