User Guide

CODITECT Audio2Text

Welcome to CODITECT Audio2Text! This guide will help you get started with transcribing YouTube videos to text.


Getting Started

System Requirements

  • Operating System: Linux (Ubuntu 20.04+ recommended)
  • RAM: 4GB minimum, 8GB+ recommended
  • Storage: 10GB free space
  • Internet: Required for downloading videos and models

Installation

Quick Install

# 1. Install system dependencies
sudo apt update
sudo apt install python3 python3-pip ffmpeg -y

# 2. Install CODITECT Audio2Text
git clone <repository-url>
cd coditect-audio2text

# 3. Setup and install
make setup-dev

Docker Install

# 1. Clone repository
git clone <repository-url>
cd coditect-audio2text

# 2. Start with Docker
make docker-build
make docker-up
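After installing, you can sanity-check the external dependencies before the first run. A minimal sketch; the Python version floor used here is an assumption, so check the project's own requirements:

```python
import shutil
import sys

def check_dependencies() -> list[str]:
    """Return a list of human-readable warnings about missing prerequisites."""
    issues = []
    # ffmpeg is required for audio extraction and format conversion
    if shutil.which("ffmpeg") is None:
        issues.append("ffmpeg not found on PATH (install with: sudo apt install ffmpeg)")
    # Assumed minimum Python version; verify against the project's setup files
    if sys.version_info < (3, 8):
        issues.append(f"Python {sys.version_info.major}.{sys.version_info.minor} may be too old")
    return issues

for problem in check_dependencies():
    print("WARNING:", problem)
```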

Using the Web Interface

Step 1: Access the Application

Open your web browser and navigate to the address where the application is running.

Step 2: Submit a Transcription Job

  1. Enter YouTube URL

    • Copy the URL of the YouTube video you want to transcribe
    • Paste it into the "YouTube URL" field
    • Example: https://www.youtube.com/watch?v=dQw4w9WgXcQ
  2. Select Model Size

    • Tiny: Fastest, lowest accuracy (~10x real-time)
    • Base: Balanced speed and accuracy (~7x real-time) [Recommended]
    • Small: Better accuracy, slower (~4x real-time)
    • Medium: High accuracy, requires GPU (~2x real-time)
    • Large: Best accuracy, requires powerful GPU
  3. Click "Start Transcription"

    • The job will be created and added to the queue
    • You'll receive a Job ID to track progress

Step 3: Monitor Progress

  1. Navigate to the "Jobs" page
  2. Find your job in the list
  3. Watch the progress bar for status updates:
    • Pending: Waiting to start
    • Downloading: Downloading audio from YouTube
    • Processing: Converting audio format
    • Transcribing: Converting speech to text
    • Completed: Ready to download
    • Failed: Error occurred (check error message)

Step 4: Download Results

Once completed, download your transcription in available formats:

  • TXT: Plain text transcript
  • SRT: Subtitle file with timestamps
  • VTT: WebVTT format for web players
  • JSON: Structured data with segments

Command Line Interface (CLI)

Basic Usage

# Transcribe a YouTube video
python scripts/transcribe.py \
    --url "https://youtube.com/watch?v=..." \
    --model base \
    --output-formats txt srt

# Transcribe a local audio file
python scripts/transcribe.py \
    --file /path/to/audio.mp3 \
    --model small \
    --language en

CLI Options

--url URL              YouTube video URL
--file FILE            Local audio file path
--model MODEL          Whisper model (tiny/base/small/medium/large)
--language LANG        Language code (e.g., en, es, fr)
--output-formats FMT   Output formats (txt, srt, vtt, json)
--output-dir DIR       Output directory
--device DEVICE        Processing device (cpu/cuda)

Examples

Transcribe a lecture in Spanish:

python scripts/transcribe.py \
    --url "https://youtube.com/watch?v=..." \
    --model base \
    --language es \
    --output-formats txt srt

Batch process multiple videos:

# Create a file with URLs (urls.txt)
# One URL per line

python scripts/batch_transcribe.py \
    --urls-file urls.txt \
    --model base
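If your copy does not ship batch_transcribe.py, the same loop can be sketched around the single-video script shown above. This is an illustrative wrapper, not part of the tool, and its error handling is deliberately minimal:

```python
import subprocess
from pathlib import Path

def batch_transcribe(urls_file: str, model: str = "base") -> list[tuple[str, bool]]:
    """Run scripts/transcribe.py once per URL in urls_file; return (url, ok) pairs."""
    results = []
    for url in Path(urls_file).read_text().splitlines():
        url = url.strip()
        if not url or url.startswith("#"):
            continue  # skip blank lines and comments
        proc = subprocess.run(
            ["python", "scripts/transcribe.py", "--url", url, "--model", model]
        )
        results.append((url, proc.returncode == 0))
    return results
```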

High-quality transcription with GPU:

python scripts/transcribe.py \
    --url "https://youtube.com/watch?v=..." \
    --model medium \
    --device cuda

Choosing the Right Model

Model Comparison

Model    Speed      Accuracy    VRAM     Best For
Tiny     Fastest    Good        ~1GB     Quick drafts, previews
Base     Fast       Very Good   ~1GB     General use, balanced
Small    Moderate   Excellent   ~2GB     High quality, CPU-friendly
Medium   Slow       Superior    ~5GB     Professional work, GPU
Large    Slowest    Best        ~10GB    Critical accuracy, GPU

Recommendations

  • Clear audio, quick results: Use Base or Tiny
  • Technical content, lectures: Use Small or Medium
  • Noisy audio, accents: Use Medium or Large
  • Multiple languages: Use Small or higher
  • Real-time subtitles: Use Tiny with GPU
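The recommendations above can be folded into a small helper. The VRAM thresholds mirror the comparison table; the function itself is illustrative and not part of the tool:

```python
def pick_model(vram_gb: float = 0.0, noisy_audio: bool = False,
               multilingual: bool = False) -> str:
    """Suggest a Whisper model size based on the guide's recommendations."""
    if noisy_audio:
        # Noisy audio or strong accents: Medium (~5GB VRAM) or Large (~10GB)
        if vram_gb >= 10:
            return "large"
        if vram_gb >= 5:
            return "medium"
        return "small"  # best CPU-friendly fallback
    if multilingual:
        # Multiple languages: Small or higher
        return "small" if vram_gb < 5 else "medium"
    return "base"  # balanced default for clear audio and quick results
```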

Understanding Output Formats

TXT (Plain Text)

  • Complete transcript as plain text
  • No timestamps
  • Easy to read and edit
  • Best for: Reading, copying text

SRT (SubRip Subtitles)

  • Industry-standard subtitle format
  • Includes timestamps
  • Supported by most video players
  • Best for: Adding subtitles to videos
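SRT timestamps use the HH:MM:SS,mmm form. If you post-process transcripts yourself, a small conversion helper covers the format's timestamp syntax:

```python
def srt_timestamp(seconds: float) -> str:
    """Format a second offset as an SRT timestamp, e.g. 83.5 -> '00:01:23,500'."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)   # milliseconds per hour
    m, ms = divmod(ms, 60_000)      # milliseconds per minute
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"
```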

VTT (WebVTT)

  • Web-based subtitle format
  • Similar to SRT with web features
  • Supports styling and metadata
  • Best for: Web video players, HTML5

JSON (Structured Data)

  • Complete data including segments
  • Programmable and parseable
  • Includes confidence scores
  • Best for: Further processing, analysis
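The JSON output lends itself to further processing with a few lines of Python. The field names below (segments, start, end) follow Whisper's usual output and are assumptions; check an actual output file before relying on them:

```python
import json

def load_segments(path: str) -> list[dict]:
    """Load a JSON transcript and return its segment list (field names assumed)."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    return data.get("segments", [])

def total_speech_seconds(segments: list[dict]) -> float:
    """Sum the duration covered by all segments."""
    return sum(seg["end"] - seg["start"] for seg in segments)
```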

Tips and Best Practices

Getting Better Accuracy

  1. Choose the right model: Larger models generally give better accuracy
  2. Specify the language: Auto-detection works, but setting the language manually is more reliable
  3. Clean audio: Less background noise means better results
  4. Clear speech: A single speaker with clear pronunciation works best

Optimizing Performance

  1. Use GPU: NVIDIA GPU with CUDA significantly speeds up processing
  2. Close other apps: Free up RAM for better performance
  3. Local files: Processing local files is faster than downloading
  4. Batch processing: Process multiple videos overnight

Privacy and Security

  1. Local processing: All transcription happens on your machine
  2. No data sent out: Your audio and transcripts stay local
  3. Clean up: Delete cached files regularly to save space
  4. Secure URLs: Only process videos you have permission to transcribe
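The clean-up advice can be automated with a short script. The cache directory path is a placeholder for your installation, and the default retention mirrors the CACHE_RETENTION_DAYS setting shown under Advanced Features:

```python
import time
from pathlib import Path

def prune_cache(cache_dir: str, retention_days: int = 7) -> list[str]:
    """Delete files in cache_dir older than retention_days; return removed paths."""
    cutoff = time.time() - retention_days * 86_400  # seconds per day
    removed = []
    for path in Path(cache_dir).glob("**/*"):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(str(path))
    return removed
```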

Troubleshooting

Common Issues

Problem: "Download failed"

  • Solution: Check if the YouTube URL is valid and accessible
  • Try downloading the video manually first to verify it works
  • Some videos may be geo-restricted or private

Problem: "Out of memory"

  • Solution: Use a smaller model (Tiny or Base)
  • Close other applications to free up RAM
  • Enable swap if you have limited RAM

Problem: "Transcription is very slow"

  • Solution: Use a GPU if available (e.g., set DEVICE=cuda or pass --device cuda)
  • Use a smaller model for faster results
  • Check CPU usage - close background apps

Problem: "Poor transcription quality"

  • Solution: Try a larger model (Medium or Large)
  • Specify the language manually instead of auto-detect
  • Check if audio quality is good (clear speech, low noise)

Problem: "Model download fails"

  • Solution: Check internet connection
  • Retry the download
  • Download model manually and place in models/ directory

Advanced Features

Custom Model Location

# Set custom model directory
export WHISPER_CACHE_DIR=/path/to/models

Adjust Cache Settings

Edit .env file:

CACHE_RETENTION_DAYS=7
ENABLE_CACHE=True

API Integration

Use the REST API programmatically:

import requests

# Create transcription job
response = requests.post('http://localhost:8000/api/transcribe', json={
    'url': 'https://youtube.com/watch?v=...',
    'model': 'base',
    'output_formats': ['txt', 'srt'],
})

job_id = response.json()['job_id']

# Check status
status = requests.get(f'http://localhost:8000/api/transcribe/{job_id}')
print(status.json())
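Building on the snippet above, a simple polling loop waits until the job finishes. The status values mirror the web interface ("Completed" / "Failed"), but the exact field names in the API response are assumptions:

```python
import time
import requests

API = "http://localhost:8000/api/transcribe"

def wait_for_job(job_id: str, poll_seconds: float = 5.0, timeout: float = 3600.0) -> dict:
    """Poll the status endpoint until the job completes or fails."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        payload = requests.get(f"{API}/{job_id}").json()
        status = payload.get("status", "").lower()  # field name assumed
        if status in ("completed", "failed"):
            return payload
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
```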

Frequently Asked Questions

Q: Is this free to use? A: Yes, CODITECT Audio2Text is open-source and free.

Q: What languages are supported? A: Whisper supports 90+ languages including English, Spanish, French, German, Chinese, Japanese, and many more.

Q: Can I transcribe my own audio files? A: Yes, you can upload MP3, WAV, M4A, and other common audio formats.

Q: How accurate is the transcription? A: Accuracy ranges from 75-95% depending on model size and audio quality. Base model achieves ~84% accuracy on clear speech.

Q: How long does it take? A: Processing time varies by model:

  • Tiny: ~10x real-time (30min video = 3min processing)
  • Base: ~7x real-time (30min video = 4min processing)
  • Small: ~4x real-time (30min video = 7min processing)

With a GPU, these times are significantly faster.
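The arithmetic behind these estimates is simply the video length divided by the model's real-time factor:

```python
def estimated_minutes(video_minutes: float, speed_factor: float) -> float:
    """Estimate processing time from a model's real-time speed factor (e.g. Base ~7x)."""
    return video_minutes / speed_factor

# A 30-minute video with the Tiny model (~10x real-time) takes about 3 minutes.
```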

Q: Can I translate to English? A: Yes, set task='translate' to translate any language to English.

Q: Is my data safe? A: Yes, all processing happens locally on your machine. Nothing is sent to external servers.


Getting Help

  • Documentation: Check the /docs directory
  • API Docs: http://localhost:8000/docs
  • GitHub Issues: Report bugs or request features
  • Community: Join discussions and ask questions

Next Steps

  • Explore the API Documentation for integration options
  • Read the Developer Guide if you want to contribute
  • Check out Architecture Decision Records to understand design choices

Last Updated: 2025-11-07