
ADR-001: Speech-to-Text Model Selection and Deployment Strategy

Status: Accepted
Date: 2025-11-07
Deciders: Technical Team
Context: Local ASR for YouTube-to-text processing


Context and Problem Statement

CODITECT Audio2Text requires a speech-to-text engine (automatic speech recognition, ASR) capable of accurately transcribing audio extracted from YouTube videos. The solution must prioritize:

  1. Privacy: No data should leave the user's machine
  2. Accuracy: High-quality transcription across various content types
  3. Performance: Reasonable processing times on consumer hardware
  4. Cost: No ongoing cloud service fees
  5. Extensibility: Support for multiple languages and future enhancements

The key decision is which ASR technology to use and how to deploy it.


Decision Drivers

  • Privacy and data sovereignty requirements
  • Transcription accuracy across diverse content (technical talks, lectures, podcasts)
  • Hardware resource constraints (CPU/GPU/RAM availability)
  • Processing latency and throughput requirements
  • Multi-language support
  • Open-source licensing and community support
  • Maintenance and update burden

Considered Options

Option 1: OpenAI Whisper (Local Deployment)

Pros:

  • State-of-the-art open-source accuracy (word error rate of roughly 14-23% for the smaller model sizes)
  • Multiple model sizes for different hardware profiles
  • Excellent multi-language support (90+ languages)
  • Open-source (MIT license)
  • Active community and ongoing development
  • Runs entirely locally (CPU or GPU)
  • Handles noisy and accented speech well
  • Includes automatic language detection

Cons:

  • Slower than real-time on CPU (2-10x playback time)
  • Requires 1-10GB RAM/VRAM depending on model
  • Initial model download required
  • No streaming/real-time support in base implementation

Option 2: Cloud-based ASR (Google Speech-to-Text, AssemblyAI, etc.)

Pros:

  • Very high accuracy
  • Faster processing with cloud resources
  • No local resource requirements
  • Professional support

Cons:

  • Privacy concerns (data leaves local machine)
  • Ongoing costs per usage
  • Internet dependency
  • API rate limits
  • Vendor lock-in

Option 3: Mozilla DeepSpeech

Pros:

  • Open-source
  • Lightweight models available
  • Local deployment
  • Retrainable

Cons:

  • Lower accuracy than Whisper (WER 25-35%)
  • Project deprecated in 2021
  • Limited active development
  • Fewer languages supported

Option 4: Vosk

Pros:

  • Very lightweight (~50MB models)
  • Fast processing
  • Good for embedded systems
  • Multiple language models

Cons:

  • Lower accuracy than Whisper (WER 20-30%)
  • Limited to smaller vocabulary models
  • Less robust for noisy audio

Decision Outcome

Chosen option: OpenAI Whisper with local deployment and flexible model selection

Rationale

Whisper provides the best balance of:

  1. Privacy: Fully local processing with no external API calls
  2. Accuracy: Best-in-class open-source performance (16% WER for base model)
  3. Flexibility: Multiple model sizes allow users to trade accuracy for speed
  4. Language Support: 90+ languages with automatic detection
  5. Community: Active development and extensive community support
  6. Cost: One-time download, no ongoing fees
  7. Future-proof: Ongoing improvements from OpenAI and community

Implementation Strategy

Model Selection Matrix

| Hardware Profile | Default Model | Memory Required | Expected Speed | Use Case |
| --- | --- | --- | --- | --- |
| CPU-only (low-end) | tiny | ~1GB RAM | ~10x real-time | Quick drafts |
| CPU-only (mid-range) | base | ~1GB RAM | ~7x real-time | Balanced |
| CPU-only (high-end) | small | ~2GB RAM | ~4x real-time | Quality focus |
| GPU (4GB VRAM) | small | ~2GB VRAM | ~20x real-time | Fast processing |
| GPU (8GB+ VRAM) | medium | ~5GB VRAM | ~30x real-time | High quality |

Deployment Configuration

```python
# Auto-detect the best Whisper model for the available hardware
import psutil
import torch

def select_optimal_model() -> str:
    """Pick the largest Whisper model that fits the available VRAM/RAM."""
    if torch.cuda.is_available():
        vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
        if vram_gb >= 8:
            return "medium"
        elif vram_gb >= 4:
            return "small"
        return "base"
    # CPU fallback: choose based on total system RAM
    ram_gb = psutil.virtual_memory().total / 1e9
    return "small" if ram_gb >= 16 else "base"
```

Fallback Strategy

  1. Primary: Use user-selected or auto-detected model
  2. On OOM Error: Automatically retry with next smaller model
  3. Final Fallback: Use "tiny" model as last resort
  4. User Override: Allow manual model selection in settings
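The fallback ladder above can be sketched as a retry loop. This is an illustrative sketch, not the project's actual code: `transcribe_fn` is a hypothetical stand-in for whatever per-model transcription call the application uses.

```python
# Sketch of the OOM fallback ladder: try the chosen model, then step down.
# `transcribe_fn` is a placeholder for the real per-model transcription call.
MODEL_LADDER = ["medium", "small", "base", "tiny"]

def transcribe_with_fallback(audio_path, preferred_model, transcribe_fn):
    """Try the preferred model; on MemoryError, retry with smaller models."""
    start = MODEL_LADDER.index(preferred_model)
    for model in MODEL_LADDER[start:]:
        try:
            return model, transcribe_fn(audio_path, model)
        except MemoryError:
            continue  # step down to the next smaller model
    raise RuntimeError("Even the 'tiny' model could not run")
```

A user-selected model simply changes the starting rung of the ladder; the manual override in settings maps to `preferred_model`.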

Consequences

Positive

  • User Control: Complete data privacy and ownership
  • No Ongoing Costs: One-time setup, unlimited usage
  • Offline Capable: Works without internet after initial setup
  • Scalable: Can process unlimited videos without API limits
  • Customizable: Can fine-tune models for specific domains
  • Transparent: Open-source model weights and code

Negative

  • Initial Setup: Users must download models (75MB - 3GB)
  • Resource Requirements: Requires adequate RAM/VRAM
  • Processing Time: Slower than cloud solutions on low-end hardware
  • Storage: Models require 0.5-10GB disk space

Neutral

  • Hardware Dependency: Performance varies based on user hardware
  • Maintenance: Requires occasional model updates for improvements

Mitigations and Risks

Risk: Out of Memory Errors

Mitigation: Automatic fallback to smaller models, clear hardware requirements documentation

Risk: Slow Processing on Low-End Hardware

Mitigation: Pre-calculate estimated processing time, recommend model based on hardware profile, optional "fast mode" with tiny model
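The pre-calculated estimate could be a simple lookup of audio duration multiplied by a per-model slowdown factor. The factors below are illustrative placeholders, not measured values; a real implementation would derive them from benchmarks on the user's hardware profile.

```python
# Rough processing-time estimate: audio duration x model/hardware slowdown.
# The factors here are illustrative placeholders, not measured values.
SLOWDOWN = {  # processing_time / audio_duration
    ("tiny", "cpu"): 0.8,
    ("base", "cpu"): 1.2,
    ("small", "cpu"): 1.5,
    ("base", "gpu"): 0.15,
    ("small", "gpu"): 0.2,
}

def estimate_processing_minutes(audio_minutes, model, device):
    """Return an estimated processing time in minutes for a given setup."""
    return audio_minutes * SLOWDOWN[(model, device)]
```

Showing this estimate before a job starts lets users opt into "fast mode" (the tiny model) when the projected wait is too long.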

Risk: Model Download Size

Mitigation: Progressive model downloads, only download selected model, compress models

Risk: Poor Accuracy on Specific Content

Mitigation: Allow model switching per job, support for custom fine-tuned models in future


Alternative Configurations

Hybrid Approach (Future Consideration)

  • Local-first: Use Whisper by default
  • Optional Cloud: Allow users to opt-in to cloud providers for specific jobs
  • Comparison Mode: Run both local and cloud for quality benchmarking

Whisper.cpp Alternative

  • Consideration: Use whisper.cpp (C++ port) for better performance
  • Status: Monitor for stability and feature parity
  • Timeline: Evaluate for v2.0

Compliance and Privacy

Data Processing

  • All audio and transcription data remains on user's local machine
  • No telemetry or analytics sent to external servers
  • Optional anonymous usage statistics (opt-in only)

Model Provenance

  • Models sourced directly from OpenAI's official repository
  • Checksum validation for model integrity
  • Clear licensing (MIT) for all components

Performance Benchmarks

Based on testing with 30-minute YouTube video (technical lecture):

| Model | Hardware | Processing Time | Accuracy (subjective) | Memory Usage |
| --- | --- | --- | --- | --- |
| tiny | CPU (i5) | 25 minutes | Good for clear speech | 1.2GB |
| base | CPU (i5) | 35 minutes | Very good | 1.5GB |
| small | CPU (i5) | 45 minutes | Excellent | 2.3GB |
| base | GPU (RTX 3060) | 4 minutes | Very good | 2.0GB VRAM |
| small | GPU (RTX 3060) | 6 minutes | Excellent | 3.5GB VRAM |
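These rows are easier to compare as real-time factors (audio duration divided by processing time). For example, the 30-minute lecture processed in 4 minutes on GPU corresponds to a 7.5x real-time factor. A small helper makes the arithmetic explicit:

```python
# Real-time factor: minutes of audio processed per minute of wall-clock time.
def realtime_factor(audio_minutes, processing_minutes):
    """Values above 1.0 mean faster than playback; below 1.0, slower."""
    return round(audio_minutes / processing_minutes, 2)

# 30-minute video transcribed in 4 minutes -> 7.5x real-time
print(realtime_factor(30, 4))
```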

Review Schedule

This ADR should be reviewed:

  • Quarterly: Check for new Whisper releases or alternatives
  • On User Feedback: If accuracy or performance issues reported
  • Hardware Evolution: When new acceleration options emerge (e.g., WebGPU)

References


  • ADR-002: Audio Download Strategy (yt-dlp vs alternatives)
  • ADR-003: Output Format Support (SRT, VTT, JSON)
  • ADR-004: Deployment Architecture (Docker vs native)

Signed Off: 2025-11-07
Next Review: 2026-02-07