
ADR-004: Audio Transcription Strategy - Local vs API

Date: 2026-01-19
Status: Accepted
Decision Makers: System Architect, DevOps Engineer
Related: SDD Section 2.2.2, Cost Analysis Section 9.1


Context

Audio transcription is a critical component of the video analysis pipeline. Requirements:

  1. Accuracy: >95% word accuracy for technical/educational content
  2. Word-Level Timing: Required for audio-visual correlation
  3. Language Support: Primary: English, Secondary: Spanish, French, Mandarin
  4. Cost: Process 1000+ hours/month
  5. Latency: <5 minutes for 60-minute video
  6. Privacy: No sensitive data leakage

Use Case Profile

```python
monthly_volume = {
    'videos_processed': 1000,
    'avg_duration_minutes': 45,
    'total_audio_hours': 750,
    'languages': {
        'english': 0.80,
        'spanish': 0.10,
        'other': 0.10
    }
}
```

Decision Drivers

  1. Cost Efficiency: $0.36/hour (API) vs. no per-use fee (local), offset by infrastructure cost
  2. Quality: SOTA models provide best accuracy
  3. Deployment Simplicity: API = zero infrastructure, Local = GPU management
  4. Scalability: Can we handle growth?
  5. Privacy: User content security
  6. Flexibility: Ability to switch models

Options Considered

Option 1: OpenAI Whisper API

Approach: Use OpenAI's hosted Whisper API

```python
from openai import OpenAI

client = OpenAI()

with open("video.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word"]
    )
```

Pros:

  • Zero infrastructure: No GPUs to manage
  • High quality: Whisper Large V3 equivalent
  • Word timestamps: Native support
  • Fast: <1 minute for 60-minute video
  • No maintenance: Always latest model
  • Scalability: Effectively unlimited (subject to rate limits)

Cons:

  • Cost: $0.006/minute = $0.36 per hour
  • Privacy: Audio data sent to OpenAI
  • Vendor lock-in: Dependent on OpenAI
  • Rate limits: 50 requests/minute
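
Because of the 50 requests/minute cap, callers typically wrap the API call in retry logic with exponential backoff. A minimal sketch (the retry parameters and the stand-in `flaky_transcribe` call are illustrative, not from this ADR):

```python
import random
import time

def with_backoff(func, max_retries=3, base_delay=1.0):
    """Call func(), retrying failures with exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise
            # Wait base_delay * 2^attempt seconds, plus jitter to spread retries.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))

# Demo with a stand-in call that fails twice, then succeeds.
calls = {"n": 0}
def flaky_transcribe():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429: rate limited")
    return "transcript"

result = with_backoff(flaky_transcribe, base_delay=0.01)
print(result)  # transcript
```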

Cost Analysis:

```python
monthly_cost_api = {
    'audio_hours': 750,
    'cost_per_hour': 0.36,
    'total': 750 * 0.36,    # $270/month
    'cost_per_video': 0.27  # 45-min avg
}
```

Verdict: ✅ Simple, reliable, but ongoing cost


Option 2: Self-Hosted OpenAI Whisper (Local GPU)

Approach: Run Whisper Large V3 on own infrastructure

```python
import whisper

model = whisper.load_model("large-v3")  # 2.87GB model
result = model.transcribe(
    "audio.wav",
    word_timestamps=True,
    language="en"
)
```
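
For audio-visual correlation, the word-level timings need to be pulled out of the result. A small sketch, assuming the openai-whisper result layout (a `"segments"` list whose entries carry a `"words"` list when `word_timestamps=True`); the stubbed result below is illustrative:

```python
def flatten_words(result):
    """Flatten a whisper transcription result into (word, start, end) tuples."""
    words = []
    for segment in result.get("segments", []):
        for w in segment.get("words", []):
            words.append((w["word"].strip(), w["start"], w["end"]))
    return words

# Example with a stubbed result:
stub = {"segments": [{"words": [
    {"word": " hello", "start": 0.0, "end": 0.4},
    {"word": " world", "start": 0.4, "end": 0.9},
]}]}
print(flatten_words(stub))  # [('hello', 0.0, 0.4), ('world', 0.4, 0.9)]
```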

Pros:

  • Zero API cost: Pay only for infrastructure
  • Privacy: Data never leaves infrastructure
  • No rate limits: Process unlimited videos
  • Model control: Can fine-tune or swap models
  • Flexibility: Full control over parameters

Cons:

  • Infrastructure cost: GPU required
  • Maintenance overhead: Model updates, GPU management, scaling
  • Slower: 5-10 minutes for 60-minute video (RTX 4090)
  • Complexity: DevOps burden for GPU infrastructure

Cost Analysis:

```python
# AWS g4dn.xlarge (NVIDIA T4 GPU)
monthly_cost_self_hosted = {
    'instance_type': 'g4dn.xlarge',
    'hourly_rate': 0.526,
    'hours_per_month': 730,       # 24/7
    'compute_cost': 0.526 * 730,  # $384/month
    'storage_cost': 50,           # 500GB EBS
    'total': 384 + 50,            # $434/month

    # Assumes 24/7 availability for on-demand processing
    # More economical if processing in batches with downtime
}

# Alternative: Spot instances (70% discount)
monthly_cost_spot = {
    'hourly_rate': 0.158,         # 70% off
    'hours_per_month': 730,
    'compute_cost': 0.158 * 730,  # $115/month
    'interruption_risk': 'low for g4dn',
    'total': 115 + 50             # $165/month
}
```

Verdict: ⚠️ Cost-effective at scale, but infrastructure burden


Option 3: Alternative APIs (AssemblyAI, Deepgram)

AssemblyAI:

  • Cost: $0.01/minute = $0.60/hour (67% more expensive)
  • Better speaker diarization
  • Sentiment analysis included

Deepgram:

  • Cost: $0.0125/minute = $0.75/hour (108% more expensive)
  • Lowest latency (real-time capable)
  • Good for streaming

Verdict: ❌ More expensive than OpenAI, not compelling advantages


Option 4: Hybrid Approach (Selected)

Approach: Start with API, transition to self-hosted at scale

Phase 1 (Months 1-3): OpenAI Whisper API

  • Use API during MVP/pilot
  • Validate quality and demand
  • Zero infrastructure burden
  • Monthly cost: $200-300

Phase 2 (Month 4+): Self-Hosted Whisper

  • Migrate to self-hosted when volume > 500 hours/month
  • Use spot instances for cost efficiency
  • Keep API as fallback for burst capacity

```python
import logging
from pathlib import Path

import openai
import whisper

logger = logging.getLogger(__name__)

class TranscriptionService:
    def __init__(self, config):
        self.mode = config.transcription_mode  # 'api' or 'local'
        self.fallback_enabled = config.enable_fallback

        if self.mode == 'local':
            self.local_model = whisper.load_model("large-v3")
            if self.fallback_enabled:
                self.api_client = openai.OpenAI()

    async def transcribe(self, audio_path: Path):
        if self.mode == 'local':
            try:
                return await self._transcribe_local(audio_path)
            except Exception as e:
                if self.fallback_enabled:
                    logger.warning(f"Local failed, falling back to API: {e}")
                    return await self._transcribe_api(audio_path)
                raise
        else:
            return await self._transcribe_api(audio_path)
```

Pros:

  • Start fast: No infrastructure for MVP
  • Scale efficiently: Transition when economical
  • Flexibility: Can switch back if needed
  • Risk mitigation: API fallback for local failures

Cons:

  • Migration effort: Need to build self-hosted eventually
  • Two codepaths: Must maintain both API and local
  • Transition planning: When to switch?

Verdict: ✅ SELECTED - Best balance of simplicity and long-term cost

Decision

Hybrid Approach: API → Self-Hosted

Implementation Strategy

Phase 1: OpenAI Whisper API (Launch → Month 3)

```yaml
# config.yaml
transcription:
  mode: api
  provider: openai
  model: whisper-1
  enable_fallback: false

api_config:
  max_retries: 3
  timeout_seconds: 300
  rate_limit_rpm: 50
```

Migration Trigger: When monthly cost exceeds $200 for 3 consecutive months
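
The migration trigger above is easy to express as a check over recent billing data. A minimal sketch (function name and inputs are illustrative):

```python
def should_migrate(monthly_api_costs, threshold=200.0, consecutive=3):
    """Return True if API spend exceeded `threshold` dollars for the last
    `consecutive` months -- the ADR's Phase 2 migration trigger."""
    if len(monthly_api_costs) < consecutive:
        return False
    return all(cost > threshold for cost in monthly_api_costs[-consecutive:])

print(should_migrate([72, 180, 210, 230, 250]))  # True: last 3 months all above $200
print(should_migrate([72, 180, 195, 230, 250]))  # False: $195 breaks the streak
```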

Phase 2: Self-Hosted Migration (Month 4+)

```hcl
# Infrastructure as Code (Terraform)
resource "aws_instance" "whisper_worker" {
  instance_type = "g4dn.xlarge"
  ami           = "deep-learning-ami"

  user_data = <<-EOF
    #!/bin/bash
    docker run -d \
      --gpus all \
      -v /data:/data \
      whisper-worker:latest
  EOF

  tags = {
    Name        = "whisper-transcription-worker"
    AutoScaling = "spot-instances"
  }
}
```

Configuration:

```yaml
# config.yaml (Phase 2)
transcription:
  mode: local
  model: large-v3
  enable_fallback: true  # Keep API for burst capacity

local_config:
  model_path: /models/whisper-large-v3
  batch_size: 16
  fp16: true  # Use mixed precision for 2x speedup

fallback_config:
  provider: openai
  trigger_conditions:
    - local_queue_depth > 100
    - local_gpu_utilization > 95%
    - local_error_rate > 0.10
```
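
The fallback trigger conditions above translate directly into a predicate the worker can evaluate per request. A minimal sketch (function name is illustrative; utilization is taken as a 0-1 fraction rather than the config's percentage form):

```python
def should_fallback(queue_depth, gpu_utilization, error_rate):
    """Fall back to the API when any one trigger condition is met."""
    return (
        queue_depth > 100          # local_queue_depth > 100
        or gpu_utilization > 0.95  # local_gpu_utilization > 95%
        or error_rate > 0.10       # local_error_rate > 0.10
    )

print(should_fallback(queue_depth=20, gpu_utilization=0.60, error_rate=0.02))   # False
print(should_fallback(queue_depth=150, gpu_utilization=0.60, error_rate=0.02))  # True
```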

Cost Projection

```python
cost_over_time = {
    'month_1': {
        'volume_hours': 200,
        'approach': 'api',
        'cost': 72  # 200 * 0.36
    },
    'month_3': {
        'volume_hours': 500,
        'approach': 'api',
        'cost': 180  # 500 * 0.36
    },
    'month_6': {
        'volume_hours': 750,
        'approach': 'self_hosted_spot',
        'infrastructure_cost': 165,
        'api_fallback_cost': 18,  # 5% overflow
        'total': 183,
        'savings_vs_api': 270 - 183  # $87/month saved
    },
    'month_12': {
        'volume_hours': 1200,
        'approach': 'self_hosted_spot',
        'infrastructure_cost': 165,
        'api_fallback_cost': 22,
        'total': 187,
        'savings_vs_api': 432 - 187  # $245/month saved
    }
}

# Break-even analysis
breakeven = {
    'self_hosted_monthly_cost': 165,  # Spot instances
    'api_cost_per_hour': 0.36,
    'breakeven_hours': 165 / 0.36,    # ~458 hours/month
    'breakeven_videos': 458 / 0.75,   # ~610 videos/month (45-min avg)
}
```
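
The break-even figures can be packaged as a small helper so the comparison updates automatically if instance pricing changes. A sketch (function names are illustrative):

```python
def breakeven_hours(self_hosted_monthly, api_cost_per_hour):
    """Monthly audio hours above which self-hosting beats the API on cost."""
    return self_hosted_monthly / api_cost_per_hour

def cheaper_approach(volume_hours, self_hosted_monthly=165.0, api_cost_per_hour=0.36):
    """Pick the cheaper approach for a given monthly volume."""
    api_cost = volume_hours * api_cost_per_hour
    return "self_hosted" if api_cost > self_hosted_monthly else "api"

print(round(breakeven_hours(165.0, 0.36)))  # ~458 hours/month
print(cheaper_approach(200))  # 'api': $72 in API fees < $165 infrastructure
print(cheaper_approach(750))  # 'self_hosted': $270 in API fees > $165
```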

Consequences

Positive

  1. Low initial investment: Start with zero infrastructure
  2. Predictable scaling: Clear migration trigger at 500 hours/month
  3. Cost optimization: 57% savings at scale
  4. Risk mitigation: API fallback prevents downtime
  5. Flexibility: Can adjust strategy based on actual usage

Negative

  1. Technical debt: Must build self-hosted infrastructure eventually
  2. Maintenance burden: GPU management, model updates
  3. Two codepaths: Must maintain both API and local
  4. Migration complexity: Coordinating cutover
  5. DevOps overhead: Monitoring, scaling, incident response

Mitigation Strategies

```python
from pathlib import Path
from typing import Protocol

# Abstraction layer for easy switching
class TranscriptionProvider(Protocol):
    async def transcribe(self, audio_path: Path) -> TranscriptResult:
        ...

class OpenAIProvider:
    async def transcribe(self, audio_path: Path):
        # OpenAI API implementation
        pass

class LocalWhisperProvider:
    async def transcribe(self, audio_path: Path):
        # Local Whisper implementation
        pass

# Factory pattern
def get_transcription_provider(config) -> TranscriptionProvider:
    if config.mode == 'api':
        return OpenAIProvider(config.api_config)
    elif config.mode == 'local':
        return LocalWhisperProvider(config.local_config)
    else:
        raise ValueError(f"Unknown mode: {config.mode}")
```

Because `TranscriptionProvider` is a `Protocol`, the concrete providers satisfy it structurally and do not need to inherit from it.

Quality Assurance

```python
# Regular quality checks
quality_metrics = {
    'wer': float,  # Word Error Rate
    'cer': float,  # Character Error Rate
    'rtf': float,  # Real-Time Factor (processing time / audio duration)

    # Compare API vs local
    'consistency_score': float,       # Agreement between providers
    'confidence_scores': List[float]  # Per-segment confidence
}

async def validate_transcription_quality():
    # Sample 1% of transcriptions
    # Run through both API and local
    # Compare results, flag discrepancies
    pass
```
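
The WER metric above is the word-level Levenshtein distance normalized by reference length. A minimal self-contained sketch (not the project's actual implementation):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed via edit distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat", "the cat sat"))  # 0.0
print(word_error_rate("the cat sat", "the cat sit"))  # one substitution out of 3 words
```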

Validation Metrics

```python
metrics_to_track = {
    'cost': {
        'api_spend_monthly': float,
        'infrastructure_spend_monthly': float,
        'cost_per_hour': float,
        'cost_per_video': float
    },
    'quality': {
        'word_error_rate': float,
        'confidence_scores_avg': float,
        'language_accuracy': Dict[str, float]
    },
    'performance': {
        'transcription_latency_p50': float,
        'transcription_latency_p99': float,
        'real_time_factor': float,
        'throughput_hours_per_day': float
    },
    'reliability': {
        'api_success_rate': float,
        'local_success_rate': float,
        'fallback_trigger_rate': float
    }
}
```

Alternative Future Options

Option: Faster Whisper (ONNX/TensorRT)

```python
# Use optimized Whisper implementation
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
# 4x faster than standard Whisper
# Same accuracy
```

Future consideration: Migrate to Faster Whisper in Phase 2 for 4x speedup

Option: Fine-Tuned Domain Models

```python
# Fine-tune Whisper on domain-specific content
# Potential for 10-20% WER improvement in technical domains
```

Future consideration: Fine-tune for educational/technical jargon (Month 6+)

Option: Streaming Transcription

```python
# Real-time transcription for live video analysis
# Requires Deepgram or similar streaming-capable service
```

Future consideration: If real-time analysis becomes a requirement

Review Schedule

  • Month 1: Track API costs and quality metrics
  • Month 3: Evaluate migration trigger (volume vs cost)
  • Month 6: If migrated, assess self-hosted performance and cost savings
  • Month 12: Full review, consider fine-tuning or alternative models

References