
ADR-004: Audio Transcription Strategy - Local vs API

Date: 2026-01-19
Status: Accepted
Decision Makers: System Architect, DevOps Engineer
Related: SDD Section 2.2.2, Cost Analysis Section 9.1


Context

Audio transcription is a critical component of the video analysis pipeline. Requirements:

  1. Accuracy: >95% word accuracy for technical/educational content
  2. Word-Level Timing: Required for audio-visual correlation
  3. Language Support: Primary: English, Secondary: Spanish, French, Mandarin
  4. Cost: Process 1000+ hours/month
  5. Latency: <5 minutes for 60-minute video
  6. Privacy: No sensitive data leakage

Use Case Profile

```python
monthly_volume = {
    'videos_processed': 1000,
    'avg_duration_minutes': 45,
    'total_audio_hours': 750,
    'languages': {
        'english': 0.80,
        'spanish': 0.10,
        'other': 0.10
    }
}
```

Decision Drivers

  1. Cost Efficiency: $0.36/hour (API) vs. no per-use fee (local), offset by infrastructure cost
  2. Quality: SOTA models provide best accuracy
  3. Deployment Simplicity: API = zero infrastructure, Local = GPU management
  4. Scalability: Can we handle growth?
  5. Privacy: User content security
  6. Flexibility: Ability to switch models

Options Considered

Option 1: OpenAI Whisper API

Approach: Use OpenAI's hosted Whisper API

```python
from openai import OpenAI

client = OpenAI()

with open("video.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word"]
    )
```

Pros:

  • Zero infrastructure: No GPUs to manage
  • High quality: Whisper Large V3 equivalent
  • Word timestamps: Native support
  • Fast: <1 minute for 60-minute video
  • No maintenance: Always latest model
  • Scalability: Effectively unlimited (subject to rate limits)

Cons:

  • Cost: $0.006/minute = $0.36 per hour
  • Privacy: Audio data sent to OpenAI
  • Vendor lock-in: Dependent on OpenAI
  • Rate limits: 50 requests/minute
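
Because of the 50 requests/minute cap, callers typically wrap the API call in retry logic with exponential backoff. A minimal sketch (the retry parameters and the stand-in `flaky_transcribe` call are illustrative, not from this ADR):

```python
import random
import time

def with_backoff(func, max_retries=3, base_delay=1.0):
    """Call func(), retrying failures with exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise
            # Wait base_delay * 2^attempt seconds, plus jitter to spread retries.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))

# Demo with a stand-in call that fails twice, then succeeds.
calls = {"n": 0}
def flaky_transcribe():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429: rate limited")
    return "transcript"

result = with_backoff(flaky_transcribe, base_delay=0.01)
print(result)  # transcript
```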

Cost Analysis:

```python
monthly_cost_api = {
    'audio_hours': 750,
    'cost_per_hour': 0.36,
    'total': 750 * 0.36,    # $270/month
    'cost_per_video': 0.27  # 45-min avg
}
```

Verdict: ✅ Simple, reliable, but ongoing cost


Option 2: Self-Hosted OpenAI Whisper (Local GPU)

Approach: Run Whisper Large V3 on own infrastructure

```python
import whisper

model = whisper.load_model("large-v3")  # 2.87GB model
result = model.transcribe(
    "audio.wav",
    word_timestamps=True,
    language="en"
)
```
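
For audio-visual correlation, the word-level timings need to be pulled out of the result. A small sketch, assuming the openai-whisper result layout (a `"segments"` list whose entries carry a `"words"` list when `word_timestamps=True`); the stubbed result below is illustrative:

```python
def flatten_words(result):
    """Flatten a whisper transcription result into (word, start, end) tuples."""
    words = []
    for segment in result.get("segments", []):
        for w in segment.get("words", []):
            words.append((w["word"].strip(), w["start"], w["end"]))
    return words

# Example with a stubbed result:
stub = {"segments": [{"words": [
    {"word": " hello", "start": 0.0, "end": 0.4},
    {"word": " world", "start": 0.4, "end": 0.9},
]}]}
print(flatten_words(stub))  # [('hello', 0.0, 0.4), ('world', 0.4, 0.9)]
```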

Pros:

  • Zero API cost: Pay only for infrastructure
  • Privacy: Data never leaves infrastructure
  • No rate limits: Process unlimited videos
  • Model control: Can fine-tune or swap models
  • Flexibility: Full control over parameters

Cons:

  • Infrastructure cost: GPU required
  • Maintenance overhead: Model updates, GPU management, scaling
  • Slower: 5-10 minutes for 60-minute video (RTX 4090)
  • Complexity: DevOps burden for GPU infrastructure

Cost Analysis:

```python
# AWS g4dn.xlarge (NVIDIA T4 GPU)
monthly_cost_self_hosted = {
    'instance_type': 'g4dn.xlarge',
    'hourly_rate': 0.526,
    'hours_per_month': 730,       # 24/7
    'compute_cost': 0.526 * 730,  # $384/month
    'storage_cost': 50,           # 500GB EBS
    'total': 384 + 50,            # $434/month

    # Assumes 24/7 availability for on-demand processing
    # More economical if processing in batches with downtime
}

# Alternative: Spot instances (70% discount)
monthly_cost_spot = {
    'hourly_rate': 0.158,         # 70% off
    'hours_per_month': 730,
    'compute_cost': 0.158 * 730,  # $115/month
    'interruption_risk': 'low for g4dn',
    'total': 115 + 50             # $165/month
}
```

Verdict: ⚠️ Cost-effective at scale, but infrastructure burden


Option 3: Alternative APIs (AssemblyAI, Deepgram)

AssemblyAI:

  • Cost: $0.01/minute = $0.60/hour (67% more expensive)
  • Better speaker diarization
  • Sentiment analysis included

Deepgram:

  • Cost: $0.0125/minute = $0.75/hour (108% more expensive)
  • Lowest latency (real-time capable)
  • Good for streaming

Verdict: ❌ More expensive than OpenAI, not compelling advantages


Option 4: Hybrid Approach (Selected)

Approach: Start with API, transition to self-hosted at scale

Phase 1 (Months 1-3): OpenAI Whisper API

  • Use API during MVP/pilot
  • Validate quality and demand
  • Zero infrastructure burden
  • Monthly cost: $200-300

Phase 2 (Month 4+): Self-Hosted Whisper

  • Migrate to self-hosted when volume > 500 hours/month
  • Use spot instances for cost efficiency
  • Keep API as fallback for burst capacity

```python
import logging
from pathlib import Path

import openai
import whisper

logger = logging.getLogger(__name__)

class TranscriptionService:
    def __init__(self, config):
        self.mode = config.transcription_mode  # 'api' or 'local'
        self.fallback_enabled = config.enable_fallback

        if self.mode == 'local':
            self.local_model = whisper.load_model("large-v3")
            if self.fallback_enabled:
                self.api_client = openai.OpenAI()

    async def transcribe(self, audio_path: Path):
        if self.mode == 'local':
            try:
                return await self._transcribe_local(audio_path)
            except Exception as e:
                if self.fallback_enabled:
                    logger.warning(f"Local failed, falling back to API: {e}")
                    return await self._transcribe_api(audio_path)
                raise
        else:
            return await self._transcribe_api(audio_path)
```

Pros:

  • Start fast: No infrastructure for MVP
  • Scale efficiently: Transition when economical
  • Flexibility: Can switch back if needed
  • Risk mitigation: API fallback for local failures

Cons:

  • Migration effort: Need to build self-hosted eventually
  • Two codepaths: Must maintain both API and local
  • Transition planning: When to switch?

Verdict: ✅ SELECTED - Best balance of simplicity and long-term cost

Decision

Hybrid Approach: API → Self-Hosted

Implementation Strategy

Phase 1: OpenAI Whisper API (Launch → Month 3)

```yaml
# config.yaml
transcription:
  mode: api
  provider: openai
  model: whisper-1
  enable_fallback: false

api_config:
  max_retries: 3
  timeout_seconds: 300
  rate_limit_rpm: 50
```

Migration Trigger: When monthly cost exceeds $200 for 3 consecutive months
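
The migration trigger above is easy to express as a check over recent billing data. A minimal sketch (function name and inputs are illustrative):

```python
def should_migrate(monthly_api_costs, threshold=200.0, consecutive=3):
    """Return True if API spend exceeded `threshold` dollars for the last
    `consecutive` months -- the ADR's Phase 2 migration trigger."""
    if len(monthly_api_costs) < consecutive:
        return False
    return all(cost > threshold for cost in monthly_api_costs[-consecutive:])

print(should_migrate([72, 180, 210, 230, 250]))  # True: last 3 months all above $200
print(should_migrate([72, 180, 195, 230, 250]))  # False: $195 breaks the streak
```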

Phase 2: Self-Hosted Migration (Month 4+)

```hcl
# Infrastructure as Code (Terraform)
resource "aws_instance" "whisper_worker" {
  instance_type = "g4dn.xlarge"
  ami           = "deep-learning-ami"

  user_data = <<-EOF
    #!/bin/bash
    docker run -d \
      --gpus all \
      -v /data:/data \
      whisper-worker:latest
  EOF

  tags = {
    Name        = "whisper-transcription-worker"
    AutoScaling = "spot-instances"
  }
}
```

Configuration:

```yaml
# config.yaml (Phase 2)
transcription:
  mode: local
  model: large-v3
  enable_fallback: true  # Keep API for burst capacity

local_config:
  model_path: /models/whisper-large-v3
  batch_size: 16
  fp16: true  # Use mixed precision for 2x speedup

fallback_config:
  provider: openai
  trigger_conditions:
    - local_queue_depth > 100
    - local_gpu_utilization > 95%
    - local_error_rate > 0.10
```
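
The fallback trigger conditions above translate directly into a predicate the worker can evaluate per request. A minimal sketch (function name is illustrative; utilization is taken as a 0-1 fraction rather than the config's percentage form):

```python
def should_fallback(queue_depth, gpu_utilization, error_rate):
    """Fall back to the API when any one trigger condition is met."""
    return (
        queue_depth > 100          # local_queue_depth > 100
        or gpu_utilization > 0.95  # local_gpu_utilization > 95%
        or error_rate > 0.10       # local_error_rate > 0.10
    )

print(should_fallback(queue_depth=20, gpu_utilization=0.60, error_rate=0.02))   # False
print(should_fallback(queue_depth=150, gpu_utilization=0.60, error_rate=0.02))  # True
```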

Cost Projection

```python
cost_over_time = {
    'month_1': {
        'volume_hours': 200,
        'approach': 'api',
        'cost': 72  # 200 * 0.36
    },
    'month_3': {
        'volume_hours': 500,
        'approach': 'api',
        'cost': 180  # 500 * 0.36
    },
    'month_6': {
        'volume_hours': 750,
        'approach': 'self_hosted_spot',
        'infrastructure_cost': 165,
        'api_fallback_cost': 18,  # 5% overflow
        'total': 183,
        'savings_vs_api': 270 - 183  # $87/month saved
    },
    'month_12': {
        'volume_hours': 1200,
        'approach': 'self_hosted_spot',
        'infrastructure_cost': 165,
        'api_fallback_cost': 22,
        'total': 187,
        'savings_vs_api': 432 - 187  # $245/month saved
    }
}

# Break-even analysis
breakeven = {
    'self_hosted_monthly_cost': 165,  # Spot instances
    'api_cost_per_hour': 0.36,
    'breakeven_hours': 165 / 0.36,    # ~458 hours/month
    'breakeven_videos': 458 / 0.75,   # ~610 videos/month (45-min avg)
}
```
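
The break-even figures can be packaged as a small helper so the comparison updates automatically if instance pricing changes. A sketch (function names are illustrative):

```python
def breakeven_hours(self_hosted_monthly, api_cost_per_hour):
    """Monthly audio hours above which self-hosting beats the API on cost."""
    return self_hosted_monthly / api_cost_per_hour

def cheaper_approach(volume_hours, self_hosted_monthly=165.0, api_cost_per_hour=0.36):
    """Pick the cheaper approach for a given monthly volume."""
    api_cost = volume_hours * api_cost_per_hour
    return "self_hosted" if api_cost > self_hosted_monthly else "api"

print(round(breakeven_hours(165.0, 0.36)))  # ~458 hours/month
print(cheaper_approach(200))  # 'api': $72 in API fees < $165 infrastructure
print(cheaper_approach(750))  # 'self_hosted': $270 in API fees > $165
```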

Consequences

Positive

  1. Low initial investment: Start with zero infrastructure
  2. Predictable scaling: Clear migration trigger at 500 hours/month
  3. Cost optimization: 57% savings at scale
  4. Risk mitigation: API fallback prevents downtime
  5. Flexibility: Can adjust strategy based on actual usage

Negative

  1. Technical debt: Must build self-hosted infrastructure eventually
  2. Maintenance burden: GPU management, model updates
  3. Two codepaths: Must maintain both API and local
  4. Migration complexity: Coordinating cutover
  5. DevOps overhead: Monitoring, scaling, incident response

Mitigation Strategies

```python
from pathlib import Path
from typing import Protocol

# Abstraction layer for easy switching
class TranscriptionProvider(Protocol):
    async def transcribe(self, audio_path: Path) -> TranscriptResult:
        ...

class OpenAIProvider:
    async def transcribe(self, audio_path: Path):
        # OpenAI API implementation
        pass

class LocalWhisperProvider:
    async def transcribe(self, audio_path: Path):
        # Local Whisper implementation
        pass

# Factory pattern
def get_transcription_provider(config) -> TranscriptionProvider:
    if config.mode == 'api':
        return OpenAIProvider(config.api_config)
    elif config.mode == 'local':
        return LocalWhisperProvider(config.local_config)
    else:
        raise ValueError(f"Unknown mode: {config.mode}")
```

Because `TranscriptionProvider` is a `Protocol`, the concrete providers satisfy it structurally and do not need to inherit from it.

Quality Assurance

```python
# Regular quality checks
quality_metrics = {
    'wer': float,  # Word Error Rate
    'cer': float,  # Character Error Rate
    'rtf': float,  # Real-Time Factor (processing time / audio duration)

    # Compare API vs local
    'consistency_score': float,       # Agreement between providers
    'confidence_scores': List[float]  # Per-segment confidence
}

async def validate_transcription_quality():
    # Sample 1% of transcriptions
    # Run through both API and local
    # Compare results, flag discrepancies
    pass
```
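
The WER metric above is the word-level Levenshtein distance normalized by reference length. A minimal self-contained sketch (not the project's actual implementation):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed via edit distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat", "the cat sat"))  # 0.0
print(word_error_rate("the cat sat", "the cat sit"))  # one substitution out of 3 words
```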

Validation Metrics

```python
metrics_to_track = {
    'cost': {
        'api_spend_monthly': float,
        'infrastructure_spend_monthly': float,
        'cost_per_hour': float,
        'cost_per_video': float
    },
    'quality': {
        'word_error_rate': float,
        'confidence_scores_avg': float,
        'language_accuracy': Dict[str, float]
    },
    'performance': {
        'transcription_latency_p50': float,
        'transcription_latency_p99': float,
        'real_time_factor': float,
        'throughput_hours_per_day': float
    },
    'reliability': {
        'api_success_rate': float,
        'local_success_rate': float,
        'fallback_trigger_rate': float
    }
}
```

Alternative Future Options

Option: Faster Whisper (ONNX/TensorRT)

```python
# Use optimized Whisper implementation
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
# 4x faster than standard Whisper
# Same accuracy
```

Future consideration: Migrate to Faster Whisper in Phase 2 for 4x speedup

Option: Fine-Tuned Domain Models

```python
# Fine-tune Whisper on domain-specific content
# Potential for 10-20% WER improvement in technical domains
```

Future consideration: Fine-tune for educational/technical jargon (Month 6+)

Option: Streaming Transcription

```python
# Real-time transcription for live video analysis
# Requires Deepgram or similar streaming-capable service
```

Future consideration: If real-time analysis becomes a requirement

Review Schedule

  • Month 1: Track API costs and quality metrics
  • Month 3: Evaluate migration trigger (volume vs cost)
  • Month 6: If migrated, assess self-hosted performance and cost savings
  • Month 12: Full review, consider fine-tuning or alternative models

References