ADR-001: Speech-to-Text Model Selection and Deployment Strategy
Status: Accepted
Date: 2025-11-07
Deciders: Technical Team
Context: Local ASR for YouTube-to-text processing
Context and Problem Statement
CODITECT Audio2Text requires a speech-to-text engine (automatic speech recognition, ASR) capable of accurately transcribing audio extracted from YouTube videos. The solution must prioritize:
- Privacy: No data should leave the user's machine
- Accuracy: High-quality transcription across various content types
- Performance: Reasonable processing times on consumer hardware
- Cost: No ongoing cloud service fees
- Extensibility: Support for multiple languages and future enhancements
The key decision is which ASR technology to use and how to deploy it.
Decision Drivers
- Privacy and data sovereignty requirements
- Transcription accuracy across diverse content (technical talks, lectures, podcasts)
- Hardware resource constraints (CPU/GPU/RAM availability)
- Processing latency and throughput requirements
- Multi-language support
- Open-source licensing and community support
- Maintenance and update burden
Considered Options
Option 1: OpenAI Whisper (Local Deployment)
Pros:
- State-of-the-art accuracy (WER 14-23% for small models)
- Multiple model sizes for different hardware profiles
- Excellent multi-language support (90+ languages)
- Open-source (MIT license)
- Active community and ongoing development
- Runs entirely locally (CPU or GPU)
- Handles noisy and accented speech well
- Includes automatic language detection
Cons:
- Slower than real-time on CPU (2-10x playback time)
- Requires 1-10GB RAM/VRAM depending on model
- Initial model download required
- No streaming/real-time support in base implementation
Option 2: Cloud-based ASR (Google Speech-to-Text, AssemblyAI, etc.)
Pros:
- Very high accuracy
- Faster processing with cloud resources
- No local resource requirements
- Professional support
Cons:
- Privacy concerns (data leaves local machine)
- Ongoing costs per usage
- Internet dependency
- API rate limits
- Vendor lock-in
Option 3: Mozilla DeepSpeech
Pros:
- Open-source
- Lightweight models available
- Local deployment
- Retrainable
Cons:
- Lower accuracy than Whisper (WER 25-35%)
- Project deprecated in 2021
- Limited active development
- Fewer languages supported
Option 4: Vosk
Pros:
- Very lightweight (~50MB models)
- Fast processing
- Good for embedded systems
- Multiple language models
Cons:
- Lower accuracy than Whisper (WER 20-30%)
- Limited to smaller vocabulary models
- Less robust for noisy audio
Decision Outcome
Chosen option: OpenAI Whisper with local deployment and flexible model selection
Rationale
Whisper provides the best balance of:
- Privacy: Fully local processing with no external API calls
- Accuracy: Best-in-class open-source performance (16% WER for base model)
- Flexibility: Multiple model sizes allow users to trade accuracy for speed
- Language Support: 90+ languages with automatic detection
- Community: Active development and extensive community support
- Cost: One-time download, no ongoing fees
- Future-proof: Ongoing improvements from OpenAI and community
Implementation Strategy
Model Selection Matrix
| Hardware Profile | Default Model | VRAM Required | Expected Speed | Use Case |
|---|---|---|---|---|
| CPU-only (low-end) | tiny | ~1GB RAM | ~10x real-time | Quick drafts |
| CPU-only (mid-range) | base | ~1GB RAM | ~7x real-time | Balanced |
| CPU-only (high-end) | small | ~2GB RAM | ~4x real-time | Quality focus |
| GPU (4GB VRAM) | small | ~2GB VRAM | ~20x real-time | Fast processing |
| GPU (8GB+ VRAM) | medium | ~5GB VRAM | ~30x real-time | High quality |
Deployment Configuration
```python
import torch
import psutil

# Auto-detect the best Whisper model based on available resources
def select_optimal_model() -> str:
    if torch.cuda.is_available():
        vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
        if vram_gb >= 8:
            return "medium"
        elif vram_gb >= 4:
            return "small"
        else:
            return "base"
    else:
        # CPU fallback: choose a model the available RAM can comfortably hold
        ram_gb = psutil.virtual_memory().total / 1e9
        if ram_gb >= 16:
            return "small"
        else:
            return "base"
```
Fallback Strategy
- Primary: Use user-selected or auto-detected model
- On OOM Error: Automatically retry with next smaller model
- Final Fallback: Use "tiny" model as last resort
- User Override: Allow manual model selection in settings
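The fallback ladder above can be sketched as follows. `transcribe_fn` is a hypothetical callable standing in for the actual Whisper invocation; the OOM check matches the "out of memory" text that PyTorch includes in its `RuntimeError` messages:

```python
# Ordered largest-to-smallest; on an OOM error we step down this ladder.
MODEL_LADDER = ["medium", "small", "base", "tiny"]

def transcribe_with_fallback(audio_path, start_model, transcribe_fn):
    """Try transcribe_fn(audio_path, model); on an out-of-memory
    RuntimeError, retry with the next smaller model, down to 'tiny'."""
    start = MODEL_LADDER.index(start_model)
    for model in MODEL_LADDER[start:]:
        try:
            return model, transcribe_fn(audio_path, model)
        except RuntimeError as err:
            if "out of memory" not in str(err).lower():
                raise  # not an OOM: surface the real error
    raise RuntimeError("even the 'tiny' model ran out of memory")
```

A user-selected model simply becomes `start_model`, so the manual override and the automatic fallback share one code path.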
Consequences
Positive
- User Control: Complete data privacy and ownership
- No Ongoing Costs: One-time setup, unlimited usage
- Offline Capable: Works without internet after initial setup
- Scalable: Can process unlimited videos without API limits
- Customizable: Can fine-tune models for specific domains
- Transparent: Open-source model weights and code
Negative
- Initial Setup: Users must download models (75MB - 3GB)
- Resource Requirements: Requires adequate RAM/VRAM
- Processing Time: Slower than cloud solutions on low-end hardware
- Storage: Models require 0.5-10GB disk space
Neutral
- Hardware Dependency: Performance varies based on user hardware
- Maintenance: Requires occasional model updates for improvements
Mitigations and Risks
Risk: Out of Memory Errors
Mitigation: Automatic fallback to smaller models, clear hardware requirements documentation
Risk: Slow Processing on Low-End Hardware
Mitigation: Pre-calculate estimated processing time, recommend model based on hardware profile, optional "fast mode" with tiny model
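A minimal sketch of the pre-calculation: the multipliers below are illustrative placeholders (seconds of processing per second of audio) and would be calibrated against the benchmark table for the user's actual hardware:

```python
import datetime

# Illustrative processing-time multipliers; calibrate per hardware profile.
MULTIPLIERS = {
    ("cpu", "tiny"): 0.9,
    ("cpu", "base"): 1.2,
    ("cpu", "small"): 1.5,
    ("gpu", "base"): 0.15,
    ("gpu", "small"): 0.2,
}

def estimate_processing_time(audio_seconds, device, model):
    """Return a rough wall-clock estimate shown to the user before a job."""
    factor = MULTIPLIERS[(device, model)]
    return datetime.timedelta(seconds=round(audio_seconds * factor))
```

Showing this estimate up front lets users switch to "fast mode" (the tiny model) before committing to a long job.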
Risk: Model Download Size
Mitigation: Progressive model downloads, only download selected model, compress models
Risk: Poor Accuracy on Specific Content
Mitigation: Allow model switching per job, support for custom fine-tuned models in future
Alternative Configurations
Hybrid Approach (Future Consideration)
- Local-first: Use Whisper by default
- Optional Cloud: Allow users to opt-in to cloud providers for specific jobs
- Comparison Mode: Run both local and cloud for quality benchmarking
Whisper.cpp Alternative
- Consideration: Use whisper.cpp (C++ port) for better performance
- Status: Monitor for stability and feature parity
- Timeline: Evaluate for v2.0
Compliance and Privacy
Data Processing
- All audio and transcription data remains on user's local machine
- No telemetry or analytics sent to external servers
- Optional anonymous usage statistics (opt-in only)
Model Provenance
- Models sourced directly from OpenAI's official repository
- Checksum validation for model integrity
- Clear licensing (MIT) for all components
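The checksum validation step can be implemented with the standard library; the expected digest would come from the published model manifest (function names here are illustrative):

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a downloaded model file and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_model(path, expected_digest):
    """Reject a model file whose digest does not match the published one."""
    actual = sha256_of(path)
    if actual != expected_digest:
        raise ValueError(f"checksum mismatch for {path}: got {actual}")
    return True
```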
Performance Benchmarks
Based on testing with a 30-minute YouTube video (technical lecture):
| Model | Hardware | Processing Time | Accuracy (subjective) | Memory Usage |
|---|---|---|---|---|
| tiny | CPU (i5) | 25 minutes | Good for clear speech | 1.2GB |
| base | CPU (i5) | 35 minutes | Very good | 1.5GB |
| small | CPU (i5) | 45 minutes | Excellent | 2.3GB |
| base | GPU (RTX 3060) | 4 minutes | Very good | 2.0GB VRAM |
| small | GPU (RTX 3060) | 6 minutes | Excellent | 3.5GB VRAM |
Review Schedule
This ADR should be reviewed:
- Quarterly: Check for new Whisper releases or alternatives
- On User Feedback: If accuracy or performance issues reported
- Hardware Evolution: When new acceleration options emerge (e.g., WebGPU)
References
Related ADRs
- ADR-002: Audio Download Strategy (yt-dlp vs alternatives)
- ADR-003: Output Format Support (SRT, VTT, JSON)
- ADR-004: Deployment Architecture (Docker vs native)
Signed Off: 2025-11-07
Next Review: 2026-02-07