CODITECT Audio2Text User Guide
Welcome to CODITECT Audio2Text! This guide will help you get started with transcribing YouTube videos to text.
Getting Started
System Requirements
- Operating System: Linux (Ubuntu 20.04+ recommended)
- RAM: 4GB minimum, 8GB+ recommended
- Storage: 10GB free space
- Internet: Required for downloading videos and models
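Before installing, you can sanity-check the basics from Python. The `check_requirements` helper below is illustrative only (the Python 3.8 floor is an assumption, not a documented minimum):

```python
import shutil
import sys

def check_requirements():
    """Return a dict of requirement name -> bool for the basics above."""
    return {
        "python3": sys.version_info >= (3, 8),   # assumed minimum version
        "ffmpeg": shutil.which("ffmpeg") is not None,
    }
```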
Installation
Quick Install
```bash
# 1. Install system dependencies
sudo apt update
sudo apt install python3 python3-pip ffmpeg -y

# 2. Install CODITECT Audio2Text
git clone <repository-url>
cd coditect-audio2text

# 3. Setup and install
make setup-dev
```
Docker Install (Recommended for Production)
```bash
# 1. Clone repository
git clone <repository-url>
cd coditect-audio2text

# 2. Start with Docker
make docker-build
make docker-up
```
Using the Web Interface
Step 1: Access the Application
Open your web browser and navigate to:
- Local: http://localhost:3000
- Docker: http://localhost:3000
Step 2: Submit a Transcription Job
1. Enter the YouTube URL
   - Copy the URL of the YouTube video you want to transcribe
   - Paste it into the "YouTube URL" field
   - Example: https://www.youtube.com/watch?v=dQw4w9WgXcQ
2. Select a model size
   - Tiny: Fastest, lowest accuracy (~10x real-time)
   - Base: Balanced speed and accuracy (~7x real-time) [Recommended]
   - Small: Better accuracy, slower (~4x real-time)
   - Medium: High accuracy, requires GPU (~2x real-time)
   - Large: Best accuracy, requires a powerful GPU
3. Click "Start Transcription"
   - The job will be created and added to the queue
   - You'll receive a Job ID to track progress
Step 3: Monitor Progress
- Navigate to the "Jobs" page
- Find your job in the list
- Watch the progress bar for status updates:
  - Pending: Waiting to start
  - Downloading: Downloading audio from YouTube
  - Processing: Converting audio format
  - Transcribing: Converting speech to text
  - Completed: Ready to download
  - Failed: Error occurred (check error message)
Step 4: Download Results
Once completed, download your transcription in available formats:
- TXT: Plain text transcript
- SRT: Subtitle file with timestamps
- VTT: WebVTT format for web players
- JSON: Structured data with segments
Command Line Interface (CLI)
Basic Usage
```bash
# Transcribe a YouTube video
python scripts/transcribe.py \
  --url "https://youtube.com/watch?v=..." \
  --model base \
  --output-formats txt srt

# Transcribe a local audio file
python scripts/transcribe.py \
  --file /path/to/audio.mp3 \
  --model small \
  --language en
```
CLI Options
```
--url URL              YouTube video URL
--file FILE            Local audio file path
--model MODEL          Whisper model (tiny/base/small/medium/large)
--language LANG        Language code (e.g., en, es, fr)
--output-formats FMT   Output formats (txt, srt, vtt, json)
--output-dir DIR       Output directory
--device DEVICE        Processing device (cpu/cuda)
```
Examples
Transcribe a lecture in Spanish:
```bash
python scripts/transcribe.py \
  --url "https://youtube.com/watch?v=..." \
  --model base \
  --language es \
  --output-formats txt srt
```
Batch process multiple videos:
```bash
# Create a file with URLs (urls.txt), one URL per line
python scripts/batch_transcribe.py \
  --urls-file urls.txt \
  --model base
```
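If `scripts/batch_transcribe.py` is not available, the same batch workflow can be approximated with a small driver that reads `urls.txt` and shells out to the single-video script. This is an illustrative sketch, not the project's actual batch implementation; the `read_urls` and `run_batch` names are made up here:

```python
import subprocess
from pathlib import Path

def read_urls(urls_file):
    """Parse a urls.txt file: one URL per line, blank lines ignored."""
    return [line.strip()
            for line in Path(urls_file).read_text().splitlines()
            if line.strip()]

def run_batch(urls_file, model="base"):
    """Run scripts/transcribe.py once per URL; stops on the first failure."""
    for url in read_urls(urls_file):
        subprocess.run(
            ["python", "scripts/transcribe.py", "--url", url, "--model", model],
            check=True,
        )
```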
High-quality transcription with GPU:
```bash
python scripts/transcribe.py \
  --url "https://youtube.com/watch?v=..." \
  --model medium \
  --device cuda
```
Choosing the Right Model
Model Comparison
| Model | Speed | Accuracy | VRAM | Best For |
|---|---|---|---|---|
| Tiny | Fastest | Good | ~1GB | Quick drafts, previews |
| Base | Fast | Very Good | ~1GB | General use, balanced |
| Small | Moderate | Excellent | ~2GB | High quality, CPU-friendly |
| Medium | Slow | Superior | ~5GB | Professional work, GPU |
| Large | Slowest | Best | ~10GB | Critical accuracy, GPU |
Recommendations
- Clear audio, quick results: Use Base or Tiny
- Technical content, lectures: Use Small or Medium
- Noisy audio, accents: Use Medium or Large
- Multiple languages: Use Small or higher
- Real-time subtitles: Use Tiny with GPU
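One way to apply these recommendations programmatically is to pick the most accurate model that fits your memory budget. A minimal sketch using the approximate VRAM figures from the comparison table (`largest_model_for` is a hypothetical helper, not part of the CLI):

```python
# Approximate VRAM figures from the model comparison table above.
MODEL_VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}

def largest_model_for(vram_gb):
    """Return the most accurate model that fits the given VRAM budget, or None."""
    order = ["tiny", "base", "small", "medium", "large"]
    fitting = [m for m in order if MODEL_VRAM_GB[m] <= vram_gb]
    return fitting[-1] if fitting else None
```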
Understanding Output Formats
TXT (Plain Text)
- Complete transcript as plain text
- No timestamps
- Easy to read and edit
- Best for: Reading, copying text
SRT (SubRip Subtitles)
- Industry-standard subtitle format
- Includes timestamps
- Supported by most video players
- Best for: Adding subtitles to videos
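For reference, each SRT cue is a sequence number, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing line, and the text. A small sketch of that formatting (helper names are illustrative):

```python
def srt_timestamp(seconds):
    """Format seconds as the HH:MM:SS,mmm timestamp SRT uses."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def srt_cue(index, start, end, text):
    """Build one SRT cue block from segment timing and text."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
```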
VTT (WebVTT)
- Web-based subtitle format
- Similar to SRT with web features
- Supports styling and metadata
- Best for: Web video players, HTML5
JSON (Structured Data)
- Complete data including segments
- Programmable and parseable
- Includes confidence scores
- Best for: Further processing, analysis
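If you post-process the JSON output, a few lines of Python can flatten the segments back into plain text. The `segments`/`text` field names here are assumptions about the JSON schema; adjust them to match the actual output:

```python
import json

def segments_to_text(transcript):
    """Join per-segment text from a parsed transcript dict (schema assumed)."""
    return " ".join(seg["text"].strip() for seg in transcript.get("segments", []))

# Typical usage with a downloaded JSON file:
# with open("transcript.json") as f:
#     print(segments_to_text(json.load(f)))
```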
Tips and Best Practices
Getting Better Accuracy
- Choose the right model: Larger models = better accuracy
- Specify the language: auto-detection works, but setting the language manually is more reliable
- Clean audio: Less background noise = better results
- Clear speech: Single speaker, clear pronunciation works best
Optimizing Performance
- Use GPU: NVIDIA GPU with CUDA significantly speeds up processing
- Close other apps: Free up RAM for better performance
- Local files: Processing local files is faster than downloading
- Batch processing: Process multiple videos overnight
Privacy and Security
- Local processing: All transcription happens on your machine
- No data sent out: Your audio and transcripts stay local
- Clean up: Delete cached files regularly to save space
- Secure URLs: Only process videos you have permission to transcribe
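The clean-up advice can be automated. A hedged sketch that removes cached files older than a retention window (the cache directory location is an assumption; the retention parameter mirrors `CACHE_RETENTION_DAYS`):

```python
import time
from pathlib import Path

def clean_cache(cache_dir, retention_days=7):
    """Delete cached files older than retention_days; return removed file names."""
    cutoff = time.time() - retention_days * 86400
    removed = []
    for path in Path(cache_dir).glob("*"):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return removed
```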
Troubleshooting
Common Issues
Problem: "Download failed"
- Solution: Check if the YouTube URL is valid and accessible
- Try downloading the video manually first to verify it works
- Some videos may be geo-restricted or private
Problem: "Out of memory"
- Solution: Use a smaller model (Tiny or Base)
- Close other applications to free up RAM
- Enable swap if you have limited RAM
Problem: "Transcription is very slow"
- Solution: Use GPU if available (set DEVICE=cuda)
- Use a smaller model for faster results
- Check CPU usage - close background apps
Problem: "Poor transcription quality"
- Solution: Try a larger model (Medium or Large)
- Specify the language manually instead of auto-detect
- Check if audio quality is good (clear speech, low noise)
Problem: "Model download fails"
- Solution: Check internet connection
- Retry the download
- Download model manually and place in models/ directory
Advanced Features
Custom Model Location
```bash
# Set custom model directory
export WHISPER_CACHE_DIR=/path/to/models
```
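Your own code can honour the same variable. A small sketch (the `~/.cache/whisper` fallback matches openai-whisper's default download location, but verify it for your setup):

```python
import os
from pathlib import Path

def model_dir():
    """Resolve the model directory, honouring WHISPER_CACHE_DIR if set."""
    default = Path.home() / ".cache" / "whisper"
    return Path(os.environ.get("WHISPER_CACHE_DIR", default))
```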
Adjust Cache Settings
Edit the `.env` file:
```
CACHE_RETENTION_DAYS=7
ENABLE_CACHE=True
```
API Integration
Use the REST API programmatically:
```python
import requests

# Create transcription job
response = requests.post('http://localhost:8000/api/transcribe', json={
    'url': 'https://youtube.com/watch?v=...',
    'model': 'base',
    'output_formats': ['txt', 'srt'],
})
job_id = response.json()['job_id']

# Check status
status = requests.get(f'http://localhost:8000/api/transcribe/{job_id}')
print(status.json())
```
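In practice you will usually poll the status endpoint until the job finishes. A generic sketch that works with any status-fetching callable (the `status` field values mirror the job states listed earlier; the helper name is made up):

```python
import time

def wait_for_completion(get_status, poll_seconds=5, timeout=3600):
    """Poll get_status() until the job completes or fails, or timeout expires.

    get_status() should return a dict with a 'status' key, e.g. the JSON body
    from GET /api/transcribe/{job_id} (the field name is an assumption).
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = get_status()
        if status.get("status") in ("completed", "failed"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("job did not finish in time")
```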
Frequently Asked Questions
Q: Is this free to use? A: Yes, CODITECT Audio2Text is open-source and free.
Q: What languages are supported? A: Whisper supports 90+ languages including English, Spanish, French, German, Chinese, Japanese, and many more.
Q: Can I transcribe my own audio files? A: Yes, you can upload MP3, WAV, M4A, and other common audio formats.
Q: How accurate is the transcription? A: Accuracy ranges from 75-95% depending on model size and audio quality. Base model achieves ~84% accuracy on clear speech.
Q: How long does it take? A: Processing time varies by model:
- Tiny: ~10x real-time (30min video = 3min processing)
- Base: ~7x real-time (30min video = 4min processing)
- Small: ~4x real-time (30min video = 7min processing)

With a GPU, these times are much faster.
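These estimates are simple division: processing time is roughly the video length divided by the model's speed factor. For example:

```python
def processing_minutes(video_minutes, speed_factor):
    """Estimate processing time for a model running at speed_factor x real-time."""
    return video_minutes / speed_factor

print(processing_minutes(30, 10))  # Tiny on a 30-minute video -> 3.0 minutes
```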
Q: Can I translate to English? A: Yes, set task='translate' to translate any language to English.
Q: Is my data safe? A: Yes, all processing happens locally on your machine. Nothing is sent to external servers.
Getting Help
- Documentation: Check the `/docs` directory
- API Docs: http://localhost:8000/docs
- GitHub Issues: Report bugs or request features
- Community: Join discussions and ask questions
Next Steps
- Explore the API Documentation for integration options
- Read the Developer Guide if you want to contribute
- Check out Architecture Decision Records to understand design choices
Last Updated: 2025-11-07