Unified Multimodal Pipeline

Complete video analysis pipeline using local models (LM Studio) with cloud fallback.

Architecture

┌─────────────────────────────────────────────────────────────┐
│                   UNIFIED VIDEO PIPELINE                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │    VISION    │  │    AUDIO     │  │  ARTIFACTS   │       │
│  ├──────────────┤  ├──────────────┤  ├──────────────┤       │
│  │ Qwen3-VL-8B  │  │  AudioGemma  │  │  Gemma3-12B  │       │
│  │  GLM-4.6V    │  │   (local)    │  │   (local)    │       │
│  │ (LM Studio)  │  │              │  │              │       │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘       │
│         │                 │                 │               │
│         └─────────────────┼─────────────────┘               │
│                           ▼                                 │
│                ┌─────────────────────┐                      │
│                │  Content Synthesis  │                      │
│                └─────────────────────┘                      │
│                           │                                 │
│                           ▼                                 │
│                ┌─────────────────────┐                      │
│                │    11 Artifacts     │                      │
│                │ (SDD, TDD, C4, etc) │                      │
│                └─────────────────────┘                      │
│                                                             │
└─────────────────────────────────────────────────────────────┘
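The diagram's fan-in (three analyzers feeding one synthesis step) can be sketched as follows. This is illustrative only: the real pipeline hands synthesis to a text model rather than concatenating strings.

```python
def synthesize(vision_notes: str, audio_notes: str, text_notes: str) -> str:
    """Merge per-modality findings into one document; empty modalities are skipped."""
    sections = {"Vision": vision_notes, "Audio": audio_notes, "Text": text_notes}
    return "\n\n".join(
        f"## {name}\n{notes}" for name, notes in sections.items() if notes
    )
```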

Providers

Local (LM Studio)

| Type   | Model          | Size | Status         |
|--------|----------------|------|----------------|
| Vision | Qwen3-VL-8B    | 8B   | ✅ Ready       |
| Vision | GLM-4.6V-Flash | 9B   | ✅ Ready       |
| Audio  | AudioGemma     | ?    | ⏳ Downloading |
| Text   | Gemma3-12B     | 12B  | ✅ Ready       |

Cloud (Fallback)

| Type   | Model           | Status       |
|--------|-----------------|--------------|
| Vision | Claude 3 Haiku  | ✅ Available |
| Audio  | Whisper (local) | ✅ Ready     |
| Text   | Claude 3 Haiku  | ✅ Available |
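Fallback is driven by whether the LM Studio server is reachable. A minimal sketch of that check (the function names are illustrative; LM Studio listens on port 1234 by default):

```python
import socket

def lmstudio_available(host: str = "localhost", port: int = 1234,
                       timeout: float = 1.0) -> bool:
    """True if the LM Studio server accepts TCP connections."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def pick_vision_provider(prefer_local: bool = True) -> str:
    """Prefer local LM Studio; otherwise fall back to the cloud provider."""
    if prefer_local and lmstudio_available():
        return "lmstudio"
    return "claude"
```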

Quick Start

1. List Available Providers

```bash
./vp list-models
```

2. Process Video with Local Models

```bash
# Use best local models (auto-select)
./vp process video.mp4 --local

# Use a specific vision model
./vp process video.mp4 --vision qwen
./vp process video.mp4 --vision glm

# YouTube URL
./vp process https://youtube.com/watch?v=xxx --local
```

3. Mix Local and Cloud

```bash
# Local vision, cloud artifacts
./vp process video.mp4 --vision qwen --artifacts claude

# Cloud processing
./vp process video.mp4 --cloud
```

Files Created

New Provider Modules

| File                         | Purpose                               |
|------------------------------|---------------------------------------|
| `src/lmstudio_provider.py`   | LM Studio integration (Vision + Text) |
| `src/audiogemma_provider.py` | AudioGemma audio analysis             |
| `src/pipeline_unified.py`    | Unified multimodal pipeline           |
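LM Studio exposes an OpenAI-compatible HTTP API (at `http://localhost:1234/v1` by default), so `src/lmstudio_provider.py` can be little more than a chat-completions call with a base64-encoded frame. A sketch of building that request; the model identifier is an assumption:

```python
import base64

LMSTUDIO_CHAT_URL = "http://localhost:1234/v1/chat/completions"  # LM Studio default

def vision_request(image_bytes: bytes, prompt: str,
                   model: str = "qwen3-vl-8b") -> dict:
    """Build an OpenAI-style chat payload embedding one frame as a data URL."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    }
```

POST this payload (as JSON) to `LMSTUDIO_CHAT_URL`; no API key is required for a local server.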

CLI and Tests

| File                      | Purpose                             |
|---------------------------|-------------------------------------|
| `unified_cli.py`          | Standalone CLI for unified pipeline |
| `test_unified.py`         | Provider availability test          |
| `test_lmstudio_vision.py` | Vision model test                   |
| `test_audiogemma.py`      | AudioGemma test                     |

Updated

| File                 | Changes                                          |
|----------------------|--------------------------------------------------|
| `vp`                 | Added `--local`, `--vision`, `--artifacts` flags |
| `src/kimi_vision.py` | Fixed model reference                            |

Configuration

Environment Variables

```bash
# Control provider selection
export USE_LOCAL_MODELS=true    # Prefer local models
export VISION_MODEL=qwen        # qwen or glm

# API keys (for cloud fallback)
export ANTHROPIC_API_KEY="..."  # Claude
export OPENAI_API_KEY="..."     # OpenAI
export KIMI_API_KEY="..."       # Moonshot
```
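Parsing these variables into provider choices might look like the following sketch (the pipeline's actual parsing may differ):

```python
import os

def providers_from_env() -> dict:
    """Derive provider settings from USE_LOCAL_MODELS and VISION_MODEL."""
    use_local = os.getenv("USE_LOCAL_MODELS", "false").lower() in ("1", "true", "yes")
    vision_model = os.getenv("VISION_MODEL", "qwen")
    if vision_model not in ("qwen", "glm"):
        raise ValueError(f"VISION_MODEL must be 'qwen' or 'glm', got {vision_model!r}")
    return {
        "vision_provider": "lmstudio" if use_local else "claude",
        "lmstudio_vision_model": vision_model,
    }
```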

Pipeline Config

```python
from pathlib import Path

from src.pipeline_unified import UnifiedPipelineConfig

config = UnifiedPipelineConfig(
    vision_provider="lmstudio",    # or "claude", "kimi"
    audio_provider="audiogemma",   # or "whisper"
    artifact_provider="lmstudio",  # or "claude"
    lmstudio_vision_model="qwen",  # or "glm"
    output_dir=Path("outputs"),
)
```

Features by Provider

Qwen3-VL-8B (Vision)

  • ✅ High-quality image understanding
  • ✅ Text recognition in images
  • ✅ UI/component detection
  • ✅ Fast inference

GLM-4.6V-Flash (Vision)

  • ✅ Optimized for speed
  • ✅ Good image comprehension
  • ✅ Efficient for batch processing

AudioGemma (Audio)

  • 🔄 Downloading...
  • Will support:
    • Audio content understanding
    • Sentiment analysis
    • Speaker identification
    • Direct audio summarization

Gemma3-12B (Text)

  • ✅ Document generation
  • ✅ Technical writing
  • ✅ Architecture descriptions

Output Artifacts

All pipelines generate 11 artifacts:

  1. Document Inventory - Index of all artifacts
  2. Executive Summary - High-level overview
  3. Introduction - Context and purpose
  4. Content Outline - Hierarchical structure
  5. Software Design Document (SDD) - Architecture
  6. Technical Design Document (TDD) - API specs, data models
  7. Architecture Decision Records (ADR) - Key decisions
  8. C4 System Context - Level 1 diagram
  9. C4 Container Level - Level 2 diagram
  10. C4 Component Level - Level 3 diagram
  11. Glossary - Terms and definitions
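Since the Document Inventory is an index of the other ten artifacts, generating it reduces to a loop over their titles. A sketch (the slugged file names are assumptions, not the pipeline's actual naming):

```python
# Titles from the artifact list above; the derived file names are hypothetical.
ARTIFACT_TITLES = [
    "Executive Summary", "Introduction", "Content Outline",
    "Software Design Document (SDD)", "Technical Design Document (TDD)",
    "Architecture Decision Records (ADR)", "C4 System Context",
    "C4 Container Level", "C4 Component Level", "Glossary",
]

def slug(title: str) -> str:
    """Lowercase a title and replace non-alphanumerics with underscores."""
    return "".join(c if c.isalnum() else "_" for c in title.lower()).strip("_")

def build_inventory(titles=ARTIFACT_TITLES) -> str:
    """Render the Document Inventory as a markdown index of the other artifacts."""
    lines = ["# Document Inventory", ""]
    lines += [f"{i}. [{t}]({slug(t)}.md)" for i, t in enumerate(titles, 1)]
    return "\n".join(lines)
```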

Troubleshooting

LM Studio Not Running

```bash
# Start the LM Studio server:
# 1. Open the LM Studio app
# 2. Go to the Developer tab
# 3. Click "Start Server"
```

Model Not Found

```bash
# Check available models
./vp list-models

# If the model is missing, load it in LM Studio
```

AudioGemma Not Ready

```bash
# Wait for the download to complete
# (in-progress downloads show up as .part files)
ls ~/.lmstudio/models/mradermacher/Audiogemma-3N-finetune-GGUF/
```

Performance Comparison

| Provider        | Vision Speed | Artifact Speed | Quality   |
|-----------------|--------------|----------------|-----------|
| LM Studio Local | Fast         | Fast           | Good      |
| Claude Cloud    | Medium       | Medium         | Very Good |

Recommendation: Use local models for speed and privacy, cloud for maximum quality.

Next Steps

  1. Wait for AudioGemma download - Then audio analysis will be available
  2. Test with different vision models - Try both Qwen and GLM
  3. Compare quality - Local vs cloud artifacts
  4. Batch processing - Process multiple videos

All Commands

```bash
# List providers
./vp list-models

# Process with local models
./vp process video.mp4 --local
./vp process video.mp4 --vision qwen
./vp process video.mp4 --vision glm

# Process with cloud
./vp process video.mp4 --cloud

# Mix providers
./vp process video.mp4 --vision qwen --artifacts claude

# Custom output directory and frame rate
./vp process video.mp4 --output ./results --fps 1.0

# Skip artifact generation
./vp process video.mp4 --no-artifacts

# Help
./vp help
```