ADR-004: Audio Transcription Strategy - Local vs API
Date: 2026-01-19
Status: Accepted
Decision Makers: System Architect, DevOps Engineer
Related: SDD Section 2.2.2, Cost Analysis Section 9.1
Context
Audio transcription is a critical component of the video analysis pipeline. Requirements:
- Accuracy: >95% word accuracy for technical/educational content
- Word-Level Timing: Required for audio-visual correlation
- Language Support: Primary: English, Secondary: Spanish, French, Mandarin
- Cost: Process 1000+ hours/month
- Latency: <5 minutes for 60-minute video
- Privacy: No sensitive data leakage
Use Case Profile
monthly_volume = {
    'videos_processed': 1000,
    'avg_duration_minutes': 45,
    'total_audio_hours': 750,
    'languages': {
        'english': 0.80,
        'spanish': 0.10,
        'other': 0.10
    }
}
Decision Drivers
- Cost Efficiency: $0.36/hour (API) vs. no per-use cost (local), offset by infrastructure spend
- Quality: SOTA models provide best accuracy
- Deployment Simplicity: API = zero infrastructure, Local = GPU management
- Scalability: Can we handle growth?
- Privacy: User content security
- Flexibility: Ability to switch models
Options Considered
Option 1: OpenAI Whisper API
Approach: Use OpenAI's hosted Whisper API
from openai import OpenAI

client = OpenAI()
with open("video.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word"]
    )
Pros:
- Zero infrastructure: No GPUs to manage
- High quality: Whisper Large V3 equivalent
- Word timestamps: Native support
- Fast: <1 minute for 60-minute video
- No maintenance: Always latest model
- Scalability: Effectively unlimited (subject to rate limits)
Cons:
- Cost: $0.006/minute = $0.36 per hour
- Privacy: Audio data sent to OpenAI
- Vendor lock-in: Dependent on OpenAI
- Rate limits: 50 requests/minute
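At 50 requests/minute, naive batch submission will hit the rate limit quickly. A minimal retry sketch with exponential backoff and jitter (the helper name and retry parameters are illustrative, not part of the OpenAI SDK; in practice you would catch `openai.RateLimitError` rather than a bare `Exception`):

```python
import random
import time


def with_backoff(fn, max_retries=3, base_delay=2.0):
    """Call fn, retrying on failure with exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:  # in practice: openai.RateLimitError
            if attempt == max_retries:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)


# Usage: with_backoff(lambda: client.audio.transcriptions.create(...))
```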
Cost Analysis:
monthly_cost_api = {
    'audio_hours': 750,
    'cost_per_hour': 0.36,
    'total': 750 * 0.36,  # $270/month
    'cost_per_video': 0.27  # 45-min avg
}
Verdict: ✅ Simple, reliable, but ongoing cost
Option 2: Self-Hosted OpenAI Whisper (Local GPU)
Approach: Run Whisper Large V3 on own infrastructure
import whisper

model = whisper.load_model("large-v3")  # 2.87GB model
result = model.transcribe(
    "audio.wav",
    word_timestamps=True,
    language="en"
)
Pros:
- Zero API cost: Pay only for infrastructure
- Privacy: Data never leaves infrastructure
- No rate limits: Process unlimited videos
- Model control: Can fine-tune or swap models
- Flexibility: Full control over parameters
Cons:
- Infrastructure cost: GPU required
- Maintenance overhead: Model updates, GPU management, scaling
- Slower: 5-10 minutes for 60-minute video (RTX 4090)
- Complexity: DevOps burden for GPU infrastructure
Cost Analysis:
# AWS g4dn.xlarge (NVIDIA T4 GPU)
monthly_cost_self_hosted = {
    'instance_type': 'g4dn.xlarge',
    'hourly_rate': 0.526,
    'hours_per_month': 730,  # 24/7
    'compute_cost': 0.526 * 730,  # $384/month
    'storage_cost': 50,  # 500GB EBS
    'total': 384 + 50  # $434/month
    # Assumes 24/7 availability for on-demand processing
    # More economical if processing in batches with downtime
}

# Alternative: Spot instances (70% discount)
monthly_cost_spot = {
    'hourly_rate': 0.158,  # 70% off
    'hours_per_month': 730,
    'compute_cost': 0.158 * 730,  # $115/month
    'interruption_risk': 'low for g4dn',
    'total': 115 + 50  # $165/month
}
Verdict: ⚠️ Cost-effective at scale, but infrastructure burden
Option 3: Alternative APIs (AssemblyAI, Deepgram)
AssemblyAI:
- Cost: $0.01/minute = $0.60/hour (67% more expensive)
- Better speaker diarization
- Sentiment analysis included
Deepgram:
- Cost: $0.0125/minute = $0.75/hour (108% more expensive)
- Lowest latency (real-time capable)
- Good for streaming
Verdict: ❌ More expensive than OpenAI, with no compelling advantage for this use case
Option 4: Hybrid Approach (Selected)
Approach: Start with API, transition to self-hosted at scale
Phase 1 (Months 1-3): OpenAI Whisper API
- Use API during MVP/pilot
- Validate quality and demand
- Zero infrastructure burden
- Monthly cost: $200-300
Phase 2 (Month 4+): Self-Hosted Whisper
- Migrate to self-hosted when volume > 500 hours/month
- Use spot instances for cost efficiency
- Keep API as fallback for burst capacity
import logging
from pathlib import Path

import openai
import whisper

logger = logging.getLogger(__name__)


class TranscriptionService:
    def __init__(self, config):
        self.mode = config.transcription_mode  # 'api' or 'local'
        self.fallback_enabled = config.enable_fallback
        if self.mode == 'local':
            self.local_model = whisper.load_model("large-v3")
            if self.fallback_enabled:
                self.api_client = openai.OpenAI()

    async def transcribe(self, audio_path: Path):
        if self.mode == 'local':
            try:
                return await self._transcribe_local(audio_path)
            except Exception as e:
                if self.fallback_enabled:
                    logger.warning(f"Local failed, falling back to API: {e}")
                    return await self._transcribe_api(audio_path)
                raise
        else:
            return await self._transcribe_api(audio_path)
Pros:
- Start fast: No infrastructure for MVP
- Scale efficiently: Transition when economical
- Flexibility: Can switch back if needed
- Risk mitigation: API fallback for local failures
Cons:
- Migration effort: Need to build self-hosted eventually
- Two codepaths: Must maintain both API and local
- Transition planning: When to switch?
Verdict: ✅ SELECTED - Best balance of simplicity and long-term cost
Decision
Hybrid Approach: API → Self-Hosted
Implementation Strategy
Phase 1: OpenAI Whisper API (Launch → Month 3)
# config.yaml
transcription:
  mode: api
  provider: openai
  model: whisper-1
  enable_fallback: false
  api_config:
    max_retries: 3
    timeout_seconds: 300
    rate_limit_rpm: 50
Migration Trigger: When monthly cost exceeds $200 for 3 consecutive months
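The migration trigger can be evaluated mechanically from the monthly billing history. A minimal sketch (the function name and cost-history format are assumptions for illustration):

```python
def should_migrate(monthly_costs, threshold=200.0, consecutive_months=3):
    """Return True once API spend has exceeded the threshold
    for the last N consecutive months."""
    if len(monthly_costs) < consecutive_months:
        return False
    return all(cost > threshold for cost in monthly_costs[-consecutive_months:])


# Example: spend ramps past $200 for three straight months
print(should_migrate([72, 120, 180, 210, 230, 250]))  # True
print(should_migrate([72, 120, 180, 210]))            # False
```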
Phase 2: Self-Hosted Migration (Month 4+)
# Infrastructure as Code (Terraform)
resource "aws_instance" "whisper_worker" {
  instance_type = "g4dn.xlarge"
  ami           = "deep-learning-ami"
  user_data = <<-EOF
    #!/bin/bash
    docker run -d \
      --gpus all \
      -v /data:/data \
      whisper-worker:latest
  EOF
  tags = {
    Name        = "whisper-transcription-worker"
    AutoScaling = "spot-instances"
  }
}
Configuration:
# config.yaml (Phase 2)
transcription:
  mode: local
  model: large-v3
  enable_fallback: true  # Keep API for burst capacity
  local_config:
    model_path: /models/whisper-large-v3
    batch_size: 16
    fp16: true  # Use mixed precision for 2x speedup
  fallback_config:
    provider: openai
    trigger_conditions:
      - local_queue_depth > 100
      - local_gpu_utilization > 95%
      - local_error_rate > 0.10
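The fallback trigger conditions can be checked against live metrics before each job is routed. A hedged sketch of that routing decision (the `Metrics` structure and threshold defaults mirror the config above but are assumptions):

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    queue_depth: int
    gpu_utilization: float  # percent
    error_rate: float       # fraction of recent jobs that failed


def should_fall_back(m: Metrics,
                     max_queue: int = 100,
                     max_gpu: float = 95.0,
                     max_errors: float = 0.10) -> bool:
    """Route the next job to the API if any local-capacity threshold is breached."""
    return (m.queue_depth > max_queue
            or m.gpu_utilization > max_gpu
            or m.error_rate > max_errors)
```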
Cost Projection
cost_over_time = {
    'month_1': {
        'volume_hours': 200,
        'approach': 'api',
        'cost': 72  # 200 * 0.36
    },
    'month_3': {
        'volume_hours': 500,
        'approach': 'api',
        'cost': 180  # 500 * 0.36
    },
    'month_6': {
        'volume_hours': 750,
        'approach': 'self_hosted_spot',
        'infrastructure_cost': 165,
        'api_fallback_cost': 18,  # ~5% overflow
        'total': 183,
        'savings_vs_api': 270 - 183  # $87/month saved
    },
    'month_12': {
        'volume_hours': 1200,
        'approach': 'self_hosted_spot',
        'infrastructure_cost': 165,
        'api_fallback_cost': 22,
        'total': 187,
        'savings_vs_api': 432 - 187  # $245/month saved
    }
}

# Break-even analysis
breakeven = {
    'self_hosted_monthly_cost': 165,  # Spot instances
    'api_cost_per_hour': 0.36,
    'breakeven_hours': 165 / 0.36,  # ~458 hours/month
    'breakeven_videos': 458 / 0.75  # ~611 videos/month (45-min avg)
}
Consequences
Positive
- Low initial investment: Start with zero infrastructure
- Predictable scaling: Clear migration trigger at 500 hours/month
- Cost optimization: 57% savings at scale
- Risk mitigation: API fallback prevents downtime
- Flexibility: Can adjust strategy based on actual usage
Negative
- Technical debt: Must build self-hosted infrastructure eventually
- Maintenance burden: GPU management, model updates
- Two codepaths: Must maintain both API and local
- Migration complexity: Coordinating cutover
- DevOps overhead: Monitoring, scaling, incident response
Mitigation Strategies
# Abstraction layer for easy switching
from pathlib import Path
from typing import Protocol


class TranscriptionProvider(Protocol):
    async def transcribe(self, audio_path: Path) -> TranscriptResult:
        ...


class OpenAIProvider(TranscriptionProvider):
    async def transcribe(self, audio_path: Path):
        # OpenAI API implementation
        ...


class LocalWhisperProvider(TranscriptionProvider):
    async def transcribe(self, audio_path: Path):
        # Local Whisper implementation
        ...


# Factory pattern
def get_transcription_provider(config) -> TranscriptionProvider:
    if config.mode == 'api':
        return OpenAIProvider(config.api_config)
    elif config.mode == 'local':
        return LocalWhisperProvider(config.local_config)
    else:
        raise ValueError(f"Unknown mode: {config.mode}")
Quality Assurance
# Regular quality checks
quality_metrics = {
    'wer': float,  # Word Error Rate
    'cer': float,  # Character Error Rate
    'rtf': float,  # Real-Time Factor (processing time / audio duration)
    # Compare API vs local
    'consistency_score': float,  # Agreement between providers
    'confidence_scores': List[float]  # Per-segment confidence
}

async def validate_transcription_quality():
    # Sample 1% of transcriptions
    # Run through both API and local
    # Compare results, flag discrepancies
    pass
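Word Error Rate, the core metric above, is the word-level Levenshtein distance divided by the reference length. A minimal sketch for spot checks (production use would typically rely on a library such as `jiwer`, and this normalizes nothing, e.g. casing or punctuation):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```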
Validation Metrics
metrics_to_track = {
    'cost': {
        'api_spend_monthly': float,
        'infrastructure_spend_monthly': float,
        'cost_per_hour': float,
        'cost_per_video': float
    },
    'quality': {
        'word_error_rate': float,
        'confidence_scores_avg': float,
        'language_accuracy': Dict[str, float]
    },
    'performance': {
        'transcription_latency_p50': float,
        'transcription_latency_p99': float,
        'real_time_factor': float,
        'throughput_hours_per_day': float
    },
    'reliability': {
        'api_success_rate': float,
        'local_success_rate': float,
        'fallback_trigger_rate': float
    }
}
Alternative Future Options
Option: Faster Whisper (ONNX/TensorRT)
# Use optimized Whisper implementation
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
# Up to 4x faster than standard Whisper with comparable accuracy
Future consideration: Migrate to Faster Whisper in Phase 2 for 4x speedup
Option: Fine-Tuned Domain Models
# Fine-tune Whisper on domain-specific content
# Potential for 10-20% WER improvement in technical domains
Future consideration: Fine-tune for educational/technical jargon (Month 6+)
Option: Streaming Transcription
# Real-time transcription for live video analysis
# Requires Deepgram or similar streaming-capable service
Future consideration: If real-time analysis becomes a requirement
Review Schedule
- Month 1: Track API costs and quality metrics
- Month 3: Evaluate migration trigger (volume vs cost)
- Month 6: If migrated, assess self-hosted performance and cost savings
- Month 12: Full review, consider fine-tuning or alternative models
References
- OpenAI Whisper Paper
- Faster Whisper GitHub
- OpenAI Whisper API Pricing
- AWS GPU Instance Pricing
- Internal: Transcription Cost Analysis Spreadsheet