ADR-001: Speech-to-Text Model Selection and Deployment Strategy
Status: Accepted
Date: 2025-11-07
Deciders: Technical Team
Context: Local ASR for YouTube-to-text processing
Context and Problem Statement
CODITECT Audio2Text requires a speech-to-text engine (automatic speech recognition, ASR) capable of accurately transcribing audio extracted from YouTube videos. The solution must prioritize:
- Privacy: No data should leave the user's machine
- Accuracy: High-quality transcription across various content types
- Performance: Reasonable processing times on consumer hardware
- Cost: No ongoing cloud service fees
- Extensibility: Support for multiple languages and future enhancements
The key decision is which ASR technology to use and how to deploy it.
Decision Drivers
- Privacy and data sovereignty requirements
- Transcription accuracy across diverse content (technical talks, lectures, podcasts)
- Hardware resource constraints (CPU/GPU/RAM availability)
- Processing latency and throughput requirements
- Multi-language support
- Open-source licensing and community support
- Maintenance and update burden
Considered Options
Option 1: OpenAI Whisper (Local Deployment)
Pros:
- State-of-the-art accuracy (WER 14-23% for small models)
- Multiple model sizes for different hardware profiles
- Excellent multi-language support (90+ languages)
- Open-source (MIT license)
- Active community and ongoing development
- Runs entirely locally (CPU or GPU)
- Handles noisy and accented speech well
- Includes automatic language detection
Cons:
- Slower than real-time on CPU (2-10x playback time)
- Requires 1-10GB RAM/VRAM depending on model
- Initial model download required
- No streaming/real-time support in base implementation
Option 2: Cloud-based ASR (Google Speech-to-Text, AssemblyAI, etc.)
Pros:
- Very high accuracy
- Faster processing with cloud resources
- No local resource requirements
- Professional support
Cons:
- Privacy concerns (data leaves local machine)
- Ongoing costs per usage
- Internet dependency
- API rate limits
- Vendor lock-in
Option 3: Mozilla DeepSpeech
Pros:
- Open-source
- Lightweight models available
- Local deployment
- Retrainable
Cons:
- Lower accuracy than Whisper (WER 25-35%)
- Project deprecated in 2021
- Limited active development
- Fewer languages supported
Option 4: Vosk
Pros:
- Very lightweight (~50MB models)
- Fast processing
- Good for embedded systems
- Multiple language models
Cons:
- Lower accuracy than Whisper (WER 20-30%)
- Limited to smaller vocabulary models
- Less robust for noisy audio
Decision Outcome
Chosen option: OpenAI Whisper with local deployment and flexible model selection
Rationale
Whisper provides the best balance of:
- Privacy: Fully local processing with no external API calls
- Accuracy: Best-in-class open-source performance (16% WER for base model)
- Flexibility: Multiple model sizes allow users to trade accuracy for speed
- Language Support: 90+ languages with automatic detection
- Community: Active development and extensive community support
- Cost: One-time download, no ongoing fees
- Future-proof: Ongoing improvements from OpenAI and community
Implementation Strategy
Model Selection Matrix
| Hardware Profile | Default Model | VRAM Required | Expected Speed | Use Case |
|---|---|---|---|---|
| CPU-only (low-end) | tiny | ~1GB RAM | ~10x real-time | Quick drafts |
| CPU-only (mid-range) | base | ~1GB RAM | ~7x real-time | Balanced |
| CPU-only (high-end) | small | ~2GB RAM | ~4x real-time | Quality focus |
| GPU (4GB VRAM) | small | ~2GB VRAM | ~20x real-time | Fast processing |
| GPU (8GB+ VRAM) | medium | ~5GB VRAM | ~30x real-time | High quality |
Deployment Configuration
```python
import torch
import psutil

# Auto-detect the best Whisper model based on available resources
def select_optimal_model() -> str:
    if torch.cuda.is_available():
        vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
        if vram_gb >= 8:
            return "medium"
        elif vram_gb >= 4:
            return "small"
        else:
            return "base"
    else:
        # CPU fallback: choose a model the available RAM can comfortably hold
        ram_gb = psutil.virtual_memory().total / 1e9
        if ram_gb >= 16:
            return "small"
        else:
            return "base"
```
Fallback Strategy
- Primary: Use user-selected or auto-detected model
- On OOM Error: Automatically retry with next smaller model
- Final Fallback: Use "tiny" model as last resort
- User Override: Allow manual model selection in settings
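The fallback ladder above can be sketched as follows. `transcribe_fn` is a hypothetical callable standing in for the actual Whisper invocation; the OOM check matches the "out of memory" text that PyTorch includes in its `RuntimeError` messages:

```python
# Ordered largest-to-smallest; on an OOM error we step down this ladder.
MODEL_LADDER = ["medium", "small", "base", "tiny"]

def transcribe_with_fallback(audio_path, start_model, transcribe_fn):
    """Try transcribe_fn(audio_path, model); on an out-of-memory
    RuntimeError, retry with the next smaller model, down to 'tiny'."""
    start = MODEL_LADDER.index(start_model)
    for model in MODEL_LADDER[start:]:
        try:
            return model, transcribe_fn(audio_path, model)
        except RuntimeError as err:
            if "out of memory" not in str(err).lower():
                raise  # not an OOM: surface the real error
    raise RuntimeError("even the 'tiny' model ran out of memory")
```

A user-selected model simply becomes `start_model`, so the manual override and the automatic fallback share one code path.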
Consequences
Positive
- User Control: Complete data privacy and ownership
- No Ongoing Costs: One-time setup, unlimited usage
- Offline Capable: Works without internet after initial setup
- Scalable: Can process unlimited videos without API limits
- Customizable: Can fine-tune models for specific domains
- Transparent: Open-source model weights and code
Negative
- Initial Setup: Users must download models (75MB - 3GB)
- Resource Requirements: Requires adequate RAM/VRAM
- Processing Time: Slower than cloud solutions on low-end hardware
- Storage: Models require 0.5-10GB disk space
Neutral
- Hardware Dependency: Performance varies based on user hardware
- Maintenance: Requires occasional model updates for improvements
Mitigations and Risks
Risk: Out of Memory Errors
Mitigation: Automatic fallback to smaller models, clear hardware requirements documentation
Risk: Slow Processing on Low-End Hardware
Mitigation: Pre-calculate estimated processing time, recommend model based on hardware profile, optional "fast mode" with tiny model
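A minimal sketch of the pre-calculation: the multipliers below are illustrative placeholders (seconds of processing per second of audio) and would be calibrated against the benchmark table for the user's actual hardware:

```python
import datetime

# Illustrative processing-time multipliers; calibrate per hardware profile.
MULTIPLIERS = {
    ("cpu", "tiny"): 0.9,
    ("cpu", "base"): 1.2,
    ("cpu", "small"): 1.5,
    ("gpu", "base"): 0.15,
    ("gpu", "small"): 0.2,
}

def estimate_processing_time(audio_seconds, device, model):
    """Return a rough wall-clock estimate shown to the user before a job."""
    factor = MULTIPLIERS[(device, model)]
    return datetime.timedelta(seconds=round(audio_seconds * factor))
```

Showing this estimate up front lets users switch to "fast mode" (the tiny model) before committing to a long job.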
Risk: Model Download Size
Mitigation: Progressive model downloads, only download selected model, compress models
Risk: Poor Accuracy on Specific Content
Mitigation: Allow model switching per job, support for custom fine-tuned models in future
Alternative Configurations
Hybrid Approach (Future Consideration)
- Local-first: Use Whisper by default
- Optional Cloud: Allow users to opt-in to cloud providers for specific jobs
- Comparison Mode: Run both local and cloud for quality benchmarking
Whisper.cpp Alternative
- Consideration: Use whisper.cpp (C++ port) for better performance
- Status: Monitor for stability and feature parity
- Timeline: Evaluate for v2.0
Compliance and Privacy
Data Processing
- All audio and transcription data remains on user's local machine
- No telemetry or analytics sent to external servers
- Optional anonymous usage statistics (opt-in only)
Model Provenance
- Models sourced directly from OpenAI's official repository
- Checksum validation for model integrity
- Clear licensing (MIT) for all components
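The checksum validation step can be implemented with the standard library; the expected digest would come from the published model manifest (function names here are illustrative):

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a downloaded model file and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_model(path, expected_digest):
    """Reject a model file whose digest does not match the published one."""
    actual = sha256_of(path)
    if actual != expected_digest:
        raise ValueError(f"checksum mismatch for {path}: got {actual}")
    return True
```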
Performance Benchmarks
Based on testing with a 30-minute YouTube video (technical lecture):
| Model | Hardware | Processing Time | Accuracy (subjective) | Memory Usage |
|---|---|---|---|---|
| tiny | CPU (i5) | 25 minutes | Good for clear speech | 1.2GB |
| base | CPU (i5) | 35 minutes | Very good | 1.5GB |
| small | CPU (i5) | 45 minutes | Excellent | 2.3GB |
| base | GPU (RTX 3060) | 4 minutes | Very good | 2.0GB VRAM |
| small | GPU (RTX 3060) | 6 minutes | Excellent | 3.5GB VRAM |
Review Schedule
This ADR should be reviewed:
- Quarterly: Check for new Whisper releases or alternatives
- On User Feedback: If accuracy or performance issues reported
- Hardware Evolution: When new acceleration options emerge (e.g., WebGPU)
References
Related ADRs
- ADR-002: Audio Download Strategy (yt-dlp vs alternatives)
- ADR-003: Output Format Support (SRT, VTT, JSON)
- ADR-004: Deployment Architecture (Docker vs native)
Signed Off: 2025-11-07
Next Review: 2026-02-07