ADR 002: Use OpenAI Whisper for Transcription

Status

Accepted

Context

Audio transcription is critical for the pipeline. Requirements:

  • High accuracy (>95% for clear audio)
  • Timestamp alignment
  • Multi-language support
  • Cost efficiency

Decision

Use OpenAI Whisper API (whisper-large-v3) for transcription.

Consequences

Positive

  • State-of-the-art accuracy
  • Supports 99 languages
  • Returns word-level timestamps
  • Cost-effective ($0.006/minute)
  • Handles various audio qualities
  • Speaker diarization available (with extensions)

Negative

  • Requires internet connection
  • API rate limits (50 RPM)
  • Audio file size limits (25MB)
  • Potential latency for long files

Alternatives Considered

Alternative | Pros | Cons
Google Speech-to-Text | Enterprise features | Higher cost
AWS Transcribe | AWS integration | Lower accuracy
Local Whisper (self-hosted) | No API dependency | Requires GPU, slower
AssemblyAI | Good accuracy | Higher cost
Deepgram | Fast, accurate | Higher cost

Implementation

from openai import OpenAI

client = OpenAI()

# Note: the hosted OpenAI API exposes Whisper under the model id "whisper-1";
# verbose_json plus timestamp_granularities returns word- and segment-level timestamps.
with open(audio_path, "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word", "segment"],
    )
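
A minimal sketch of consuming the result, assuming the verbose_json response object exposes words entries with word/start/end fields and segments entries with start/end/text (offsets in seconds):

# Word-level timestamps: useful for precise alignment or karaoke-style highlighting.
for word in transcript.words or []:
    print(f"{word.start:6.2f}s -> {word.end:6.2f}s  {word.word}")

# Segment-level timestamps: useful for captions or paragraph alignment.
for segment in transcript.segments or []:
    print(f"[{segment.start:.1f}s] {segment.text}")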

Notes

For files larger than 25 MB, split them with FFmpeg before sending to the API (see the sketch below). Implement retry logic with exponential backoff to handle rate limits.
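
A hedged sketch of both steps, assuming ffmpeg is on PATH; the helper names (split_audio, transcribe_with_backoff) and the chunk length are illustrative choices, not part of this ADR:

import subprocess
import time
from pathlib import Path

from openai import OpenAI, RateLimitError

client = OpenAI()

def split_audio(audio_path: str, chunk_seconds: int = 600) -> list[Path]:
    """Split a long audio file into fixed-length chunks with FFmpeg (stream copy, no re-encode)."""
    out_pattern = Path(audio_path).with_suffix("").as_posix() + "_%03d.mp3"
    subprocess.run(
        ["ffmpeg", "-i", audio_path, "-f", "segment",
         "-segment_time", str(chunk_seconds), "-c", "copy", out_pattern],
        check=True,
    )
    return sorted(Path(audio_path).parent.glob(Path(out_pattern).name.replace("%03d", "*")))

def transcribe_with_backoff(path: Path, max_retries: int = 5):
    """Call the transcription endpoint, backing off exponentially when rate-limited."""
    for attempt in range(max_retries):
        try:
            with open(path, "rb") as f:
                return client.audio.transcriptions.create(
                    model="whisper-1",
                    file=f,
                    response_format="verbose_json",
                )
        except RateLimitError:
            # Wait 1s, 2s, 4s, 8s, ... between attempts.
            time.sleep(2 ** attempt)
    raise RuntimeError(f"Gave up transcribing {path} after {max_retries} attempts")

# Usage: split once, then transcribe each chunk with retries.
# chunks = split_audio("episode.mp3")
# transcripts = [transcribe_with_backoff(chunk) for chunk in chunks]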