ADR 007: Configurable Whisper Model Sizes

Status

Accepted

Context

The pipeline supports local Whisper transcription, but different use cases require different trade-offs between speed and accuracy. We need to support multiple model sizes (tiny, base, small, medium, large-v3) with automatic optimization based on hardware and requirements.

Decision

Support configurable Whisper model sizes with automatic optimization selection.

Model Options

| Model    | Size  | VRAM  | RAM    | Speed        | Accuracy (WER) | Use Case     |
|----------|-------|-------|--------|--------------|----------------|--------------|
| tiny     | 39M   | ~1GB  | ~500MB | 32x realtime | 58.0           | Quick tests  |
| base     | 74M   | ~1GB  | ~1GB   | 16x realtime | 48.0           | Balanced     |
| small    | 244M  | ~2GB  | ~2GB   | 6x realtime  | 38.0           | Good quality |
| medium   | 769M  | ~5GB  | ~5GB   | 2x realtime  | 32.0           | High quality |
| large-v3 | 1550M | ~10GB | ~10GB  | 1x realtime  | 25.0           | Best quality |
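
The table's approximate memory figures can be encoded as a lookup for the auto-selection logic. The names below (`MODEL_RAM_GB`, `fits_in_memory`) are illustrative, not part of the accepted decision:

```python
# Approximate RAM needed per model, in GB (from the table above).
MODEL_RAM_GB = {
    "tiny": 0.5,
    "base": 1.0,
    "small": 2.0,
    "medium": 5.0,
    "large-v3": 10.0,
}

def fits_in_memory(model_size: str, available_gb: float) -> bool:
    """Return True if the model's approximate RAM footprint fits."""
    return MODEL_RAM_GB[model_size] <= available_gb
```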

Configuration

from typing import Literal

class TranscriptionConfig:
    model_size: Literal["tiny", "base", "small", "medium", "large-v3", "auto"]
    device: Literal["cpu", "cuda", "auto"]
    compute_type: Literal["int8", "float16", "float32"]

Auto-Selection Logic

def select_optimal_model(available_memory_gb: float, priority: str) -> str:
    """
    Select the optimal model size for the available hardware and priority.

    Args:
        available_memory_gb: Available system memory in GB
        priority: "speed" | "quality" | "balanced"

    Returns:
        Model size string
    """
    if priority == "speed":
        return "base" if available_memory_gb >= 1 else "tiny"
    elif priority == "quality":
        if available_memory_gb >= 10:
            return "large-v3"
        if available_memory_gb >= 5:
            return "medium"
        if available_memory_gb >= 2:
            return "small"
        return "base"
    else:  # balanced
        if available_memory_gb >= 2:
            return "small"
        if available_memory_gb >= 1:
            return "base"
        return "tiny"
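
The configuration also exposes device: "auto". A companion sketch for resolving it, assuming torch is installed for the CUDA check (with a CPU fallback when it is not):

```python
def select_device(requested: str = "auto") -> str:
    """Resolve device="auto" to a concrete device string.

    Falls back to CPU when torch is absent or no GPU is visible.
    torch is an assumed optional dependency, not mandated by this ADR.
    """
    if requested != "auto":
        return requested
    try:
        import torch
        return "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:
        return "cpu"
```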

Consequences

Positive

  • Users can optimize for their hardware
  • Supports resource-constrained environments
  • Enables quality-first processing when needed
  • Backward compatible (defaults to "base")

Negative

  • Requires model download on first run
  • Different models produce slightly different outputs
  • More configuration complexity

Implementation

import whisper  # openai-whisper

class TranscriptionEngine:
    def __init__(
        self,
        model_size: str = "auto",
        device: str = "auto",
        priority: str = "balanced",
    ):
        if model_size == "auto":
            import psutil
            # Convert available bytes to GB before consulting the selection logic.
            memory = psutil.virtual_memory().available / (1024**3)
            model_size = select_optimal_model(memory, priority)

        # load_model accepts an explicit device; None lets whisper pick one.
        self.model = whisper.load_model(
            model_size,
            device=None if device == "auto" else device,
        )
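
The compute_type option from the configuration is not wired up above. One plausible default policy (an assumption, not part of the accepted decision) is int8 on CPU and float16 on GPU:

```python
def default_compute_type(device: str) -> str:
    """Pick a sensible precision for the resolved device.

    int8 keeps CPU inference fast and small; float16 suits CUDA GPUs.
    This policy is a sketch, not prescribed by the ADR.
    """
    return "float16" if device == "cuda" else "int8"
```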

Notes

  • Models download automatically from HuggingFace on first use
  • Cached in ~/.cache/whisper/
  • Can be overridden via an environment variable, e.g. WHISPER_MODEL_SIZE=medium
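
The environment override from the last note could be read like this (a sketch; only the WHISPER_MODEL_SIZE variable name comes from the notes above, `resolve_model_size` is a hypothetical helper):

```python
import os

def resolve_model_size(configured: str = "auto") -> str:
    """Let the WHISPER_MODEL_SIZE environment variable override the config."""
    return os.environ.get("WHISPER_MODEL_SIZE", configured)
```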