CODITECT Audio2Text User Guide
Welcome to CODITECT Audio2Text! This guide will help you get started with transcribing YouTube videos to text.
Getting Started
System Requirements
- Operating System: Linux (Ubuntu 20.04+ recommended)
- RAM: 4GB minimum, 8GB+ recommended
- Storage: 10GB free space
- Internet: Required for downloading videos and models
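Before installing, you can sanity-check the basics from Python. The `check_requirements` helper below is illustrative only (the Python 3.8 floor is an assumption, not a documented minimum):

```python
import shutil
import sys

def check_requirements():
    """Return a dict of requirement name -> bool for the basics above."""
    return {
        "python3": sys.version_info >= (3, 8),   # assumed minimum version
        "ffmpeg": shutil.which("ffmpeg") is not None,
    }
```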
Installation
Quick Install
```bash
# 1. Install system dependencies
sudo apt update
sudo apt install python3 python3-pip ffmpeg -y

# 2. Install CODITECT Audio2Text
git clone <repository-url>
cd coditect-audio2text

# 3. Setup and install
make setup-dev
```
Docker Install (Recommended for Production)
```bash
# 1. Clone repository
git clone <repository-url>
cd coditect-audio2text

# 2. Start with Docker
make docker-build
make docker-up
```
Using the Web Interface
Step 1: Access the Application
Open your web browser and navigate to:
- Local: http://localhost:3000
- Docker: http://localhost:3000
Step 2: Submit a Transcription Job
1. Enter the YouTube URL
   - Copy the URL of the YouTube video you want to transcribe
   - Paste it into the "YouTube URL" field
   - Example: https://www.youtube.com/watch?v=dQw4w9WgXcQ
2. Select a model size
   - Tiny: Fastest, lowest accuracy (~10x real-time)
   - Base: Balanced speed and accuracy (~7x real-time) [Recommended]
   - Small: Better accuracy, slower (~4x real-time)
   - Medium: High accuracy, requires GPU (~2x real-time)
   - Large: Best accuracy, requires a powerful GPU
3. Click "Start Transcription"
   - The job will be created and added to the queue
   - You'll receive a Job ID to track progress
Step 3: Monitor Progress
- Navigate to the "Jobs" page
- Find your job in the list
- Watch the progress bar for status updates:
  - Pending: Waiting to start
  - Downloading: Downloading audio from YouTube
  - Processing: Converting audio format
  - Transcribing: Converting speech to text
  - Completed: Ready to download
  - Failed: Error occurred (check error message)
Step 4: Download Results
Once completed, download your transcription in available formats:
- TXT: Plain text transcript
- SRT: Subtitle file with timestamps
- VTT: WebVTT format for web players
- JSON: Structured data with segments
Command Line Interface (CLI)
Basic Usage
```bash
# Transcribe a YouTube video
python scripts/transcribe.py \
  --url "https://youtube.com/watch?v=..." \
  --model base \
  --output-formats txt srt

# Transcribe a local audio file
python scripts/transcribe.py \
  --file /path/to/audio.mp3 \
  --model small \
  --language en
```
CLI Options
```
--url URL              YouTube video URL
--file FILE            Local audio file path
--model MODEL          Whisper model (tiny/base/small/medium/large)
--language LANG        Language code (e.g., en, es, fr)
--output-formats FMT   Output formats (txt, srt, vtt, json)
--output-dir DIR       Output directory
--device DEVICE        Processing device (cpu/cuda)
```
Examples
Transcribe a lecture in Spanish:
```bash
python scripts/transcribe.py \
  --url "https://youtube.com/watch?v=..." \
  --model base \
  --language es \
  --output-formats txt srt
```
Batch process multiple videos:
```bash
# Create a file with URLs (urls.txt), one URL per line
python scripts/batch_transcribe.py \
  --urls-file urls.txt \
  --model base
```
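If `scripts/batch_transcribe.py` is not available, the same batch workflow can be approximated with a small driver that reads `urls.txt` and shells out to the single-video script. This is an illustrative sketch, not the project's actual batch implementation; the `read_urls` and `run_batch` names are made up here:

```python
import subprocess
from pathlib import Path

def read_urls(urls_file):
    """Parse a urls.txt file: one URL per line, blank lines ignored."""
    return [line.strip()
            for line in Path(urls_file).read_text().splitlines()
            if line.strip()]

def run_batch(urls_file, model="base"):
    """Run scripts/transcribe.py once per URL; stops on the first failure."""
    for url in read_urls(urls_file):
        subprocess.run(
            ["python", "scripts/transcribe.py", "--url", url, "--model", model],
            check=True,
        )
```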
High-quality transcription with GPU:
```bash
python scripts/transcribe.py \
  --url "https://youtube.com/watch?v=..." \
  --model medium \
  --device cuda
```
Choosing the Right Model
Model Comparison
| Model | Speed | Accuracy | VRAM | Best For |
|---|---|---|---|---|
| Tiny | Fastest | Good | ~1GB | Quick drafts, previews |
| Base | Fast | Very Good | ~1GB | General use, balanced |
| Small | Moderate | Excellent | ~2GB | High quality, CPU-friendly |
| Medium | Slow | Superior | ~5GB | Professional work, GPU |
| Large | Slowest | Best | ~10GB | Critical accuracy, GPU |
Recommendations
- Clear audio, quick results: Use Base or Tiny
- Technical content, lectures: Use Small or Medium
- Noisy audio, accents: Use Medium or Large
- Multiple languages: Use Small or higher
- Real-time subtitles: Use Tiny with GPU
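One way to apply these recommendations programmatically is to pick the most accurate model that fits your memory budget. A minimal sketch using the approximate VRAM figures from the comparison table (`largest_model_for` is a hypothetical helper, not part of the CLI):

```python
# Approximate VRAM figures from the model comparison table above.
MODEL_VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}

def largest_model_for(vram_gb):
    """Return the most accurate model that fits the given VRAM budget, or None."""
    order = ["tiny", "base", "small", "medium", "large"]
    fitting = [m for m in order if MODEL_VRAM_GB[m] <= vram_gb]
    return fitting[-1] if fitting else None
```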
Understanding Output Formats
TXT (Plain Text)
- Complete transcript as plain text
- No timestamps
- Easy to read and edit
- Best for: Reading, copying text
SRT (SubRip Subtitles)
- Industry-standard subtitle format
- Includes timestamps
- Supported by most video players
- Best for: Adding subtitles to videos
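For reference, each SRT cue is a sequence number, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing line, and the text. A small sketch of that formatting (helper names are illustrative):

```python
def srt_timestamp(seconds):
    """Format seconds as the HH:MM:SS,mmm timestamp SRT uses."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def srt_cue(index, start, end, text):
    """Build one SRT cue block from segment timing and text."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
```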
VTT (WebVTT)
- Web-based subtitle format
- Similar to SRT with web features
- Supports styling and metadata
- Best for: Web video players, HTML5
JSON (Structured Data)
- Complete data including segments
- Programmable and parseable
- Includes confidence scores
- Best for: Further processing, analysis
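If you post-process the JSON output, a few lines of Python can flatten the segments back into plain text. The `segments`/`text` field names here are assumptions about the JSON schema; adjust them to match the actual output:

```python
import json

def segments_to_text(transcript):
    """Join per-segment text from a parsed transcript dict (schema assumed)."""
    return " ".join(seg["text"].strip() for seg in transcript.get("segments", []))

# Typical usage with a downloaded JSON file:
# with open("transcript.json") as f:
#     print(segments_to_text(json.load(f)))
```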
Tips and Best Practices
Getting Better Accuracy
- Choose the right model: Larger models = better accuracy
- Specify the language: auto-detection works, but setting the language manually is more reliable
- Clean audio: Less background noise = better results
- Clear speech: Single speaker, clear pronunciation works best
Optimizing Performance
- Use GPU: NVIDIA GPU with CUDA significantly speeds up processing
- Close other apps: Free up RAM for better performance
- Local files: Processing local files is faster than downloading
- Batch processing: Process multiple videos overnight
Privacy and Security
- Local processing: All transcription happens on your machine
- No data sent out: Your audio and transcripts stay local
- Clean up: Delete cached files regularly to save space
- Secure URLs: Only process videos you have permission to transcribe
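The clean-up advice can be automated. A hedged sketch that removes cached files older than a retention window (the cache directory location is an assumption; the retention parameter mirrors `CACHE_RETENTION_DAYS`):

```python
import time
from pathlib import Path

def clean_cache(cache_dir, retention_days=7):
    """Delete cached files older than retention_days; return removed file names."""
    cutoff = time.time() - retention_days * 86400
    removed = []
    for path in Path(cache_dir).glob("*"):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return removed
```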
Troubleshooting
Common Issues
Problem: "Download failed"
- Solution: Check if the YouTube URL is valid and accessible
- Try downloading the video manually first to verify it works
- Some videos may be geo-restricted or private
Problem: "Out of memory"
- Solution: Use a smaller model (Tiny or Base)
- Close other applications to free up RAM
- Enable swap if you have limited RAM
Problem: "Transcription is very slow"
- Solution: Use GPU if available (set DEVICE=cuda)
- Use a smaller model for faster results
- Check CPU usage - close background apps
Problem: "Poor transcription quality"
- Solution: Try a larger model (Medium or Large)
- Specify the language manually instead of auto-detect
- Check if audio quality is good (clear speech, low noise)
Problem: "Model download fails"
- Solution: Check internet connection
- Retry the download
- Download model manually and place in models/ directory
Advanced Features
Custom Model Location
```bash
# Set custom model directory
export WHISPER_CACHE_DIR=/path/to/models
```
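Your own code can honour the same variable. A small sketch (the `~/.cache/whisper` fallback matches openai-whisper's default download location, but verify it for your setup):

```python
import os
from pathlib import Path

def model_dir():
    """Resolve the model directory, honouring WHISPER_CACHE_DIR if set."""
    default = Path.home() / ".cache" / "whisper"
    return Path(os.environ.get("WHISPER_CACHE_DIR", default))
```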
Adjust Cache Settings
Edit the `.env` file:
```
CACHE_RETENTION_DAYS=7
ENABLE_CACHE=True
```
API Integration
Use the REST API programmatically:
```python
import requests

# Create transcription job
response = requests.post('http://localhost:8000/api/transcribe', json={
    'url': 'https://youtube.com/watch?v=...',
    'model': 'base',
    'output_formats': ['txt', 'srt'],
})
job_id = response.json()['job_id']

# Check status
status = requests.get(f'http://localhost:8000/api/transcribe/{job_id}')
print(status.json())
```
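In practice you will usually poll the status endpoint until the job finishes. A generic sketch that works with any status-fetching callable (the `status` field values mirror the job states listed earlier; the helper name is made up):

```python
import time

def wait_for_completion(get_status, poll_seconds=5, timeout=3600):
    """Poll get_status() until the job completes or fails, or timeout expires.

    get_status() should return a dict with a 'status' key, e.g. the JSON body
    from GET /api/transcribe/{job_id} (the field name is an assumption).
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = get_status()
        if status.get("status") in ("completed", "failed"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("job did not finish in time")
```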
Frequently Asked Questions
Q: Is this free to use? A: Yes, CODITECT Audio2Text is open-source and free.
Q: What languages are supported? A: Whisper supports 90+ languages including English, Spanish, French, German, Chinese, Japanese, and many more.
Q: Can I transcribe my own audio files? A: Yes, you can upload MP3, WAV, M4A, and other common audio formats.
Q: How accurate is the transcription? A: Accuracy ranges from 75-95% depending on model size and audio quality. Base model achieves ~84% accuracy on clear speech.
Q: How long does it take? A: Processing time varies by model:
- Tiny: ~10x real-time (30min video = 3min processing)
- Base: ~7x real-time (30min video = 4min processing)
- Small: ~4x real-time (30min video = 7min processing)

With a GPU, these times are much faster.
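These estimates are simple division: processing time is roughly the video length divided by the model's speed factor. For example:

```python
def processing_minutes(video_minutes, speed_factor):
    """Estimate processing time for a model running at speed_factor x real-time."""
    return video_minutes / speed_factor

print(processing_minutes(30, 10))  # Tiny on a 30-minute video -> 3.0 minutes
```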
Q: Can I translate to English? A: Yes, set task='translate' to translate any language to English.
Q: Is my data safe? A: Yes, all processing happens locally on your machine. Nothing is sent to external servers.
Getting Help
- Documentation: Check the `/docs` directory
- API Docs: http://localhost:8000/docs
- GitHub Issues: Report bugs or request features
- Community: Join discussions and ask questions
Next Steps
- Explore the API Documentation for integration options
- Read the Developer Guide if you want to contribute
- Check out Architecture Decision Records to understand design choices
Last Updated: 2025-11-07