# ADR 002: Use OpenAI Whisper for Transcription
## Status
Accepted
## Context
Audio transcription is critical for the pipeline. Requirements:
- High accuracy (>95% for clear audio)
- Timestamp alignment
- Multi-language support
- Cost efficiency
## Decision
Use the OpenAI Whisper API for transcription. The API exposes the hosted Whisper model under the identifier `whisper-1`.
## Consequences
### Positive
- State-of-the-art accuracy
- Supports 99 languages
- Returns word-level timestamps
- Cost-effective ($0.006/minute)
- Handles various audio qualities
- Speaker diarization available (with extensions)
### Negative
- Requires internet connection
- API rate limits (50 RPM)
- Audio file size limits (25MB)
- Potential latency for long files
## Alternatives Considered
| Alternative | Pros | Cons |
|---|---|---|
| Google Speech-to-Text | Enterprise features | Higher cost |
| AWS Transcribe | AWS integration | Lower accuracy |
| Local Whisper | No API dependency | Requires GPU, slower |
| AssemblyAI | Good accuracy | Higher cost |
| Deepgram | Fast, accurate | Higher cost |
## Implementation

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open(audio_path, "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",  # the API's model identifier for hosted Whisper
        file=audio_file,
        response_format="verbose_json",  # required for timestamp granularities
        timestamp_granularities=["word", "segment"],
    )
```
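With `response_format="verbose_json"` and word-level granularity, the response includes a `words` array of `{word, start, end}` entries. A minimal helper for flattening these into tuples for downstream alignment (the `sample` payload below is a hand-built illustration of the verbose_json shape, not real API output):

```python
def extract_words(verbose: dict) -> list[tuple[str, float, float]]:
    """Return (word, start_sec, end_sec) triples from a verbose_json payload."""
    return [(w["word"], w["start"], w["end"]) for w in verbose.get("words", [])]

# Hand-built payload in the verbose_json shape (illustrative only):
sample = {
    "text": "Hello world",
    "words": [
        {"word": "Hello", "start": 0.0, "end": 0.42},
        {"word": "world", "start": 0.42, "end": 0.89},
    ],
}
print(extract_words(sample))  # → [('Hello', 0.0, 0.42), ('world', 0.42, 0.89)]
```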
## Notes
For files larger than 25 MB, split them with FFmpeg before uploading. Implement retry logic with exponential backoff to handle rate limits.
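The two notes above can be sketched as follows. `ffmpeg_split_cmd` and `with_backoff` are illustrative names, not part of any library; the FFmpeg invocation uses the standard `segment` muxer with stream copy so no re-encoding occurs:

```python
import random
import subprocess
import time


def ffmpeg_split_cmd(src: str, out_pattern: str = "chunk_%03d.mp3",
                     segment_seconds: int = 600) -> list[str]:
    """Build an FFmpeg command that cuts `src` into fixed-length chunks
    without re-encoding (stream copy keeps it fast and lossless)."""
    return ["ffmpeg", "-i", src, "-f", "segment",
            "-segment_time", str(segment_seconds), "-c", "copy", out_pattern]


def with_backoff(fn, retries: int = 5, base_delay: float = 1.0):
    """Call `fn`, retrying with exponential backoff plus jitter on failure."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:  # in practice, catch openai.RateLimitError
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt + random.random()))


# Usage sketch (assumes a `transcribe` wrapper around the API call):
# subprocess.run(ffmpeg_split_cmd("talk.mp3"), check=True)
# transcript = with_backoff(lambda: transcribe("chunk_000.mp3"))
```

Chunk length can be tuned so each segment stays safely under the 25 MB limit at the source bitrate; ten-minute chunks are a reasonable default for compressed speech audio.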