System Design Document (SDD)
CODITECT Audio2Text
Version: 1.0 Date: 2025-11-07 Status: Draft
1. Purpose & Scope
1.1 Purpose
Create a local, Linux-native application that downloads YouTube audio/video content, converts it to text using open-source Whisper-based ASR (Automatic Speech Recognition), and outputs textual files in multiple formats (TXT, SRT, JSON, VTT).
1.2 Scope
In Scope:
- YouTube video/audio download and extraction
- Local audio transcription using OpenAI Whisper
- Batch processing of multiple URLs
- Web-based UI for user interaction
- REST API for programmatic access
- Support for multiple output formats
- Error handling and retry logic
- Resource management and model selection
Out of Scope:
- Cloud-based transcription services
- Real-time streaming transcription
- Video processing beyond audio extraction
- Mobile applications (initial release)
1.3 Target Audience
- Content creators needing video transcriptions
- Researchers processing educational content
- Developers requiring automated transcription workflows
- Privacy-conscious users preferring local processing
2. System Architecture
2.1 High-Level Architecture
┌─────────────────────────────────────────────────────────────┐
│ Client Layer │
│ ┌──────────────────┐ ┌────────────────────┐ │
│ │ Web Frontend │ │ CLI Interface │ │
│ │ (React/Vue) │ │ (argparse) │ │
│ └────────┬─────────┘ └─────────┬──────────┘ │
└───────────┼──────────────────────────────────┼─────────────┘
│ │
│ HTTP/REST API │
▼ ▼
┌─────────────────────────────────────────────────────────────┐
│ Application Layer │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ FastAPI Backend Service │ │
│ │ ┌────────────┐ ┌─────────────┐ ┌──────────────┐ │ │
│ │ │ API │ │ Job Queue │ │ WebSocket │ │ │
│ │ │ Endpoints │ │ Manager │ │ Handler │ │ │
│ │ └────────────┘ └─────────────┘ └──────────────┘ │ │
│ └──────────────────────────────────────────────────────┘ │
└───────────────────────────┬─────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Core Library Layer │
│ ┌──────────────┐ ┌──────────────┐ ┌─────────────────┐ │
│ │ Download │ │ Transcription│ │ Processing │ │
│ │ Service │ │ Service │ │ Service │ │
│ │ (yt-dlp) │ │ (Whisper) │ │ (ffmpeg) │ │
│ └──────────────┘ └──────────────┘ └─────────────────┘ │
└───────────────────────────┬─────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Data/Storage Layer │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌───────────┐ │
│ │ Input │ │ Cache │ │ Output │ │ Models │ │
│ │ Queue │ │ Storage │ │ Files │ │ Storage │ │
│ └──────────┘ └──────────┘ └──────────┘ └───────────┘ │
└─────────────────────────────────────────────────────────────┘
2.2 Component Overview
2.2.1 Frontend Component
- Purpose: User-facing web interface
- Technology: React with Material-UI
- Responsibilities:
- URL input and validation
- Job submission and monitoring
- Progress visualization
- Result download
- Settings management
2.2.2 Backend API Component
- Purpose: REST API and business logic orchestration
- Technology: FastAPI (Python)
- Responsibilities:
- Request validation
- Job queue management
- Task orchestration
- Progress reporting via WebSocket
- Error handling and logging
2.2.3 Core Processing Components
Download Service:
- YouTube URL validation
- Audio/video extraction using yt-dlp
- Format conversion coordination
- Metadata extraction
Transcription Service:
- Whisper model management
- Audio-to-text conversion
- Multiple format output generation
- Timestamp alignment
Processing Service:
- Audio format conversion (ffmpeg)
- Audio preprocessing (noise reduction, normalization)
- File cleanup and management
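As one illustration of the Transcription Service's output-format responsibility, here is a minimal sketch of SRT rendering from Whisper-style segments (dicts with `start`, `end`, and `text` keys, as the openai-whisper library produces); the `srt_timestamp` and `segments_to_srt` names are illustrative, not part of the design:

```python
def srt_timestamp(seconds: float) -> str:
    """Convert a float offset in seconds to SRT's 'HH:MM:SS,mmm' form."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Render a list of Whisper-style segments as an SRT document."""
    lines = []
    for i, seg in enumerate(segments, start=1):
        lines.append(str(i))
        lines.append(f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}")
        lines.append(seg["text"].strip())
        lines.append("")  # blank line terminates each cue
    return "\n".join(lines)
```

The same segment list can feed the TXT, VTT, and JSON formatters, so timestamp alignment is done once per job.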
3. Data Flow
3.1 Primary Workflow
1. User Input (URL) → Frontend
2. Frontend → POST /api/transcribe → Backend
3. Backend → Job Queue → Task ID returned
4. Backend → Download Service → yt-dlp
5. Download Service → Audio File → Cache Storage
6. Backend → Processing Service → Audio normalization
7. Backend → Transcription Service → Whisper
8. Transcription Service → Text Output → Output Storage
9. Backend → WebSocket → Progress updates → Frontend
10. User → Download result files → Frontend
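Steps 2-3 (job submission and task-ID return) can be sketched as a minimal in-memory job registry; all names here (`JobQueue`, `JobStatus`, `submit`) are illustrative stand-ins for the FastAPI backend's actual components, which would also persist state and drive the download/transcribe pipeline:

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum

class JobStatus(str, Enum):
    QUEUED = "queued"
    DOWNLOADING = "downloading"
    TRANSCRIBING = "transcribing"
    DONE = "done"
    FAILED = "failed"

@dataclass
class Job:
    url: str
    status: JobStatus = JobStatus.QUEUED
    task_id: str = field(default_factory=lambda: uuid.uuid4().hex)

class JobQueue:
    """Tracks submitted jobs by task ID (in-memory; a real backend would persist)."""

    def __init__(self):
        self._jobs: dict[str, Job] = {}

    def submit(self, url: str) -> str:
        """Register a URL for processing and return its task ID (step 3)."""
        job = Job(url=url)
        self._jobs[job.task_id] = job
        return job.task_id

    def get(self, task_id: str) -> Job:
        return self._jobs[task_id]
```

A `POST /api/transcribe` handler would call `submit()` and return the task ID, which the WebSocket handler then uses to route progress updates (step 9).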
3.2 Batch Processing Workflow
1. User Input (Multiple URLs/Files) → Frontend
2. Frontend → POST /api/batch-transcribe → Backend
3. Backend → Create batch job → Multiple task IDs
4. For each URL:
- Download → Process → Transcribe
- Update individual task status
5. Backend → Aggregate results
6. Frontend → Display batch results dashboard
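Step 5 (result aggregation) might look like the following sketch; `aggregate_batch` and the status strings are assumptions for illustration, not a defined interface:

```python
from collections import Counter

def aggregate_batch(results):
    """Summarize a batch from (task_id, status) pairs for the results dashboard."""
    counts = Counter(status for _, status in results)
    done, failed = counts.get("done", 0), counts.get("failed", 0)
    return {
        "total": len(results),
        "succeeded": done,
        "failed": failed,
        "pending": len(results) - done - failed,
    }
```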
4. Resource Management
4.1 Model Selection Strategy
| Hardware Profile | Recommended Model | Memory Required (RAM/VRAM) | Processing Speed |
|---|---|---|---|
| Low-end CPU | tiny | ~1GB | ~10x real-time |
| Mid-range CPU | base | ~1GB | ~7x real-time |
| High-end CPU | small | ~2GB | ~4x real-time |
| GPU (4GB) | base/small | ~2GB | ~20x real-time |
| GPU (8GB+) | medium/large | ~5GB | ~30x real-time |
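The table above can be encoded as a simple selection helper; the thresholds and the `cpu_tier` labels are one illustrative reading of the table, not a specified API:

```python
def select_model(has_gpu: bool, vram_gb: float = 0.0, cpu_tier: str = "mid") -> str:
    """Pick a Whisper model size following the hardware profile table."""
    if has_gpu:
        # 8GB+ VRAM supports medium/large; smaller GPUs stay on base/small.
        return "medium" if vram_gb >= 8 else "base"
    # CPU-only profiles: tiny for low-end, base for mid-range, small for high-end.
    return {"low": "tiny", "mid": "base", "high": "small"}.get(cpu_tier, "base")
```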
4.2 Cache Management
- Audio Cache: 7-day retention, LRU eviction
- Model Cache: Persistent, user-configurable location
- Result Cache: 30-day retention, configurable
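A retention-based eviction pass for the audio cache might look like this sketch; it uses file modification time as an age proxy (a strict LRU would track access time), and `evict_expired` is an illustrative name:

```python
import time
from pathlib import Path

def evict_expired(cache_dir: str, retention_days: int = 7) -> list[str]:
    """Delete cached files older than the retention window; return removed paths."""
    cutoff = time.time() - retention_days * 86400
    removed = []
    for p in Path(cache_dir).glob("*"):
        if p.is_file() and p.stat().st_mtime < cutoff:
            p.unlink()
            removed.append(str(p))
    return removed
```

The same routine serves the result cache with `retention_days=30`.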
4.3 Concurrent Processing
- Maximum parallel transcription jobs: Based on available CPU/GPU
- Queue-based task management with priority levels
- Resource pool management for Whisper model instances
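The parallelism cap could be enforced with a semaphore, as in this hedged asyncio sketch; `run_with_limit` and the default limit are illustrative:

```python
import asyncio

async def run_with_limit(task_factories, max_parallel: int = 2):
    """Run coroutine factories with at most max_parallel jobs in flight."""
    sem = asyncio.Semaphore(max_parallel)

    async def guarded(factory):
        async with sem:  # blocks here when max_parallel jobs are running
            return await factory()

    return await asyncio.gather(*(guarded(f) for f in task_factories))
```

In practice `max_parallel` would come from the resource-pool manager, sized by available CPU cores or loaded Whisper model instances.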
5. Error Handling
5.1 Failure Categories
| Failure Type | Detection Method | Recovery Strategy | User Notification |
|---|---|---|---|
| Download failure | yt-dlp exit code | Retry 3x with backoff | Error message with URL |
| Format error | File validation | Fallback format conversion | Format compatibility warning |
| Transcription failure | Whisper exception | Retry with smaller model | Suggest model downgrade |
| Resource exhaustion | Memory monitoring | Queue job, wait for resources | Estimated wait time |
| Invalid URL | Regex validation | Immediate rejection | URL format guidance |
5.2 Retry Logic
```python
def retry_strategy(attempt: int) -> int:
    """Exponential backoff: 2^attempt seconds, capped at 5 minutes."""
    return min(2 ** attempt, 300)
```
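A wrapper applying this backoff to a retryable operation (e.g. a yt-dlp download) might look like the following sketch; `with_retries` and its `max_attempts` default are assumptions, with `retry_strategy` repeated so the example is self-contained:

```python
import time

def retry_strategy(attempt: int) -> int:
    """Exponential backoff: 2^attempt seconds, capped at 5 minutes."""
    return min(2 ** attempt, 300)

def with_retries(operation, max_attempts: int = 3, sleep=time.sleep):
    """Call operation(); on failure, wait per retry_strategy and try again."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # exhausted: surface the error to the failure handler
            sleep(retry_strategy(attempt))
```

Injecting `sleep` keeps the wrapper testable without real delays.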
5.3 Logging
- Levels: DEBUG, INFO, WARNING, ERROR, CRITICAL
- Outputs: File rotation (daily), console, optional syslog
- Retention: 30 days
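The rotation and retention policy above maps directly onto the standard library's `TimedRotatingFileHandler`; the function name and defaults here are illustrative:

```python
import logging
from logging.handlers import TimedRotatingFileHandler

def configure_logging(log_path: str = "audio2text.log", level=logging.INFO):
    """Daily file rotation with 30-day retention, plus console output."""
    file_handler = TimedRotatingFileHandler(
        log_path, when="midnight", backupCount=30
    )
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        handlers=[file_handler, logging.StreamHandler()],
        force=True,  # replace any prior configuration
    )
```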
6. Security Considerations
6.1 Input Validation
- URL whitelist/blacklist support
- Maximum file size limits (default: 2GB)
- File type validation
- Command injection prevention in yt-dlp calls
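URL validation and injection prevention can be combined by allow-listing hosts and video-ID shapes before the URL ever reaches a subprocess; the host list and regex here are a sketch, not a complete ruleset, and the validated URL should still be passed to yt-dlp as a separate argv element with `shell=False`:

```python
import re
from urllib.parse import parse_qs, urlparse

ALLOWED_HOSTS = {"www.youtube.com", "youtube.com", "m.youtube.com", "youtu.be"}
VIDEO_ID_RE = re.compile(r"^[A-Za-z0-9_-]{11}$")  # standard 11-char YouTube ID

def validate_youtube_url(url: str) -> bool:
    """Accept only well-formed YouTube watch/short URLs; reject everything else."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or parsed.hostname not in ALLOWED_HOSTS:
        return False
    if parsed.hostname == "youtu.be":
        return bool(VIDEO_ID_RE.match(parsed.path.lstrip("/")))
    if parsed.path == "/watch":
        vid = parse_qs(parsed.query).get("v", [""])[0]
        return bool(VIDEO_ID_RE.match(vid))
    return False
```

Because the ID pattern excludes shell metacharacters entirely, nothing that survives validation can alter the yt-dlp command line.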
6.2 Resource Protection
- Rate limiting on API endpoints
- Maximum concurrent jobs per user
- Disk quota management
- Memory usage monitoring
6.3 Privacy
- No external API calls for transcription
- Optional telemetry (opt-in only)
- Local-only data storage
- Secure file cleanup
7. Extensibility & Future Enhancements
7.1 Plugin Architecture
- Custom pre-processing plugins
- Custom post-processing (summarization, keyword extraction)
- Custom output formatters
- Model adapter interface for alternative ASR engines
7.2 Planned Features
- Real-time transcription for live streams
- Speaker diarization
- Multi-language auto-detection
- Translation capabilities
- Integration with note-taking apps
- Docker containerization
- REST API authentication
8. Dependencies
8.1 System Dependencies
- Python 3.8+
- FFmpeg 4.4+
- GPU drivers (optional, for CUDA support)
8.2 Python Dependencies
- openai-whisper
- yt-dlp
- fastapi
- uvicorn
- pydantic
- python-multipart
- ffmpeg-python
8.3 Frontend Dependencies
- React 18+
- Material-UI
- Axios
- Socket.io-client
9. Deployment Architecture
9.1 Local Deployment
- Single-user installation via pip/npm
- Systemd service for backend
- Nginx reverse proxy for production
9.2 Multi-User Deployment
- Docker Compose setup
- Shared model storage
- PostgreSQL for job tracking
- Redis for queue management
10. Performance Targets
| Metric | Target | Measurement Method |
|---|---|---|
| API Response Time | < 200ms | 95th percentile |
| Download Start Time | < 5s | From job submission |
| Transcription Throughput | 1x-10x real-time | Based on model |
| UI Load Time | < 2s | First contentful paint |
| Maximum Concurrent Jobs | 10 (configurable) | System limit |
11. Monitoring & Observability
11.1 Metrics
- Job success/failure rates
- Processing times per model
- Queue depth and wait times
- Resource utilization (CPU, GPU, memory, disk)
- API endpoint latencies
11.2 Health Checks
- /health endpoint for service status
- Model availability checks
- Disk space monitoring
- External dependency checks (ffmpeg, yt-dlp)
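These checks could back the /health endpoint with a stdlib-only probe; `health_status` and its threshold default are illustrative:

```python
import shutil

def health_status(output_dir: str = ".", min_free_gb: float = 1.0) -> dict:
    """Aggregate service health: external tools on PATH and free disk space."""
    free_gb = shutil.disk_usage(output_dir).free / 1e9
    checks = {
        "ffmpeg": shutil.which("ffmpeg") is not None,
        "yt_dlp": shutil.which("yt-dlp") is not None,
        "disk": free_gb >= min_free_gb,
    }
    return {"ok": all(checks.values()), "checks": checks, "free_gb": round(free_gb, 2)}
```

A FastAPI route would return this dict directly, with a 503 status when `ok` is false.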
12. Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2025-11-07 | System | Initial draft |