System Design Document (SDD)
CODITECT Audio2Text
Version: 1.0 Date: 2025-11-07 Status: Draft
1. Purpose & Scope
1.1 Purpose
Create a local, Linux-native application that downloads YouTube audio/video content, converts it to text using open-source Whisper-based ASR (Automatic Speech Recognition), and outputs textual files in multiple formats (TXT, SRT, JSON, VTT).
1.2 Scope
In Scope:
- YouTube video/audio download and extraction
- Local audio transcription using OpenAI Whisper
- Batch processing of multiple URLs
- Web-based UI for user interaction
- REST API for programmatic access
- Support for multiple output formats
- Error handling and retry logic
- Resource management and model selection
Out of Scope:
- Cloud-based transcription services
- Real-time streaming transcription
- Video processing beyond audio extraction
- Mobile applications (initial release)
1.3 Target Audience
- Content creators needing video transcriptions
- Researchers processing educational content
- Developers requiring automated transcription workflows
- Privacy-conscious users preferring local processing
2. System Architecture
2.1 High-Level Architecture
┌─────────────────────────────────────────────────────────────┐
│ Client Layer │
│ ┌──────────────────┐ ┌────────────────────┐ │
│ │ Web Frontend │ │ CLI Interface │ │
│ │ (React/Vue) │ │ (argparse) │ │
│ └────────┬─────────┘ └─────────┬──────────┘ │
└───────────┼──────────────────────────────────┼─────────────┘
│ │
│ HTTP/REST API │
▼ ▼
┌─────────────────────────────────────────────────────────────┐
│ Application Layer │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ FastAPI Backend Service │ │
│ │ ┌────────────┐ ┌─────────────┐ ┌──────────────┐ │ │
│ │ │ API │ │ Job Queue │ │ WebSocket │ │ │
│ │ │ Endpoints │ │ Manager │ │ Handler │ │ │
│ │ └────────────┘ └─────────────┘ └──────────────┘ │ │
│ └──────────────────────────────────────────────────────┘ │
└───────────────────────────┬─────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Core Library Layer │
│ ┌──────────────┐ ┌──────────────┐ ┌─────────────────┐ │
│ │ Download │ │ Transcription│ │ Processing │ │
│ │ Service │ │ Service │ │ Service │ │
│ │ (yt-dlp) │ │ (Whisper) │ │ (ffmpeg) │ │
│ └──────────────┘ └──────────────┘ └─────────────────┘ │
└───────────────────────────┬─────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Data/Storage Layer │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌───────────┐ │
│ │ Input │ │ Cache │ │ Output │ │ Models │ │
│ │ Queue │ │ Storage │ │ Files │ │ Storage │ │
│ └──────────┘ └──────────┘ └──────────┘ └───────────┘ │
└─────────────────────────────────────────────────────────────┘
2.2 Component Overview
2.2.1 Frontend Component
- Purpose: User-facing web interface
- Technology: React with Material-UI
- Responsibilities:
- URL input and validation
- Job submission and monitoring
- Progress visualization
- Result download
- Settings management
2.2.2 Backend API Component
- Purpose: REST API and business logic orchestration
- Technology: FastAPI (Python)
- Responsibilities:
- Request validation
- Job queue management
- Task orchestration
- Progress reporting via WebSocket
- Error handling and logging
2.2.3 Core Processing Components
Download Service:
- YouTube URL validation
- Audio/video extraction using yt-dlp
- Format conversion coordination
- Metadata extraction
Transcription Service:
- Whisper model management
- Audio-to-text conversion
- Multiple format output generation
- Timestamp alignment
Processing Service:
- Audio format conversion (ffmpeg)
- Audio preprocessing (noise reduction, normalization)
- File cleanup and management
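As one illustration of the Transcription Service's output-format responsibility, here is a minimal sketch of SRT rendering from Whisper-style segments (dicts with `start`, `end`, and `text` keys, as the openai-whisper library produces); the `srt_timestamp` and `segments_to_srt` names are illustrative, not part of the design:

```python
def srt_timestamp(seconds: float) -> str:
    """Convert a float offset in seconds to SRT's 'HH:MM:SS,mmm' form."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Render a list of Whisper-style segments as an SRT document."""
    lines = []
    for i, seg in enumerate(segments, start=1):
        lines.append(str(i))
        lines.append(f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}")
        lines.append(seg["text"].strip())
        lines.append("")  # blank line terminates each cue
    return "\n".join(lines)
```

The same segment list can feed the TXT, VTT, and JSON formatters, so timestamp alignment is done once per job.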
3. Data Flow
3.1 Primary Workflow
1. User Input (URL) → Frontend
2. Frontend → POST /api/transcribe → Backend
3. Backend → Job Queue → Task ID returned
4. Backend → Download Service → yt-dlp
5. Download Service → Audio File → Cache Storage
6. Backend → Processing Service → Audio normalization
7. Backend → Transcription Service → Whisper
8. Transcription Service → Text Output → Output Storage
9. Backend → WebSocket → Progress updates → Frontend
10. User → Download result files → Frontend
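Steps 2-3 (job submission and task-ID return) can be sketched as a minimal in-memory job registry; all names here (`JobQueue`, `JobStatus`, `submit`) are illustrative stand-ins for the FastAPI backend's actual components, which would also persist state and drive the download/transcribe pipeline:

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum

class JobStatus(str, Enum):
    QUEUED = "queued"
    DOWNLOADING = "downloading"
    TRANSCRIBING = "transcribing"
    DONE = "done"
    FAILED = "failed"

@dataclass
class Job:
    url: str
    status: JobStatus = JobStatus.QUEUED
    task_id: str = field(default_factory=lambda: uuid.uuid4().hex)

class JobQueue:
    """Tracks submitted jobs by task ID (in-memory; a real backend would persist)."""

    def __init__(self):
        self._jobs: dict[str, Job] = {}

    def submit(self, url: str) -> str:
        """Register a URL for processing and return its task ID (step 3)."""
        job = Job(url=url)
        self._jobs[job.task_id] = job
        return job.task_id

    def get(self, task_id: str) -> Job:
        return self._jobs[task_id]
```

A `POST /api/transcribe` handler would call `submit()` and return the task ID, which the WebSocket handler then uses to route progress updates (step 9).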
3.2 Batch Processing Workflow
1. User Input (Multiple URLs/Files) → Frontend
2. Frontend → POST /api/batch-transcribe → Backend
3. Backend → Create batch job → Multiple task IDs
4. For each URL:
- Download → Process → Transcribe
- Update individual task status
5. Backend → Aggregate results
6. Frontend → Display batch results dashboard
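Step 5 (result aggregation) might look like the following sketch; `aggregate_batch` and the status strings are assumptions for illustration, not a defined interface:

```python
from collections import Counter

def aggregate_batch(results):
    """Summarize a batch from (task_id, status) pairs for the results dashboard."""
    counts = Counter(status for _, status in results)
    done, failed = counts.get("done", 0), counts.get("failed", 0)
    return {
        "total": len(results),
        "succeeded": done,
        "failed": failed,
        "pending": len(results) - done - failed,
    }
```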
4. Resource Management
4.1 Model Selection Strategy
| Hardware Profile | Recommended Model | Memory Required (RAM/VRAM) | Processing Speed |
|---|---|---|---|
| Low-end CPU | tiny | ~1GB | ~10x real-time |
| Mid-range CPU | base | ~1GB | ~7x real-time |
| High-end CPU | small | ~2GB | ~4x real-time |
| GPU (4GB) | base/small | ~2GB | ~20x real-time |
| GPU (8GB+) | medium/large | ~5GB | ~30x real-time |
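The table above can be encoded as a simple selection helper; the thresholds and the `cpu_tier` labels are one illustrative reading of the table, not a specified API:

```python
def select_model(has_gpu: bool, vram_gb: float = 0.0, cpu_tier: str = "mid") -> str:
    """Pick a Whisper model size following the hardware profile table."""
    if has_gpu:
        # 8GB+ VRAM supports medium/large; smaller GPUs stay on base/small.
        return "medium" if vram_gb >= 8 else "base"
    # CPU-only profiles: tiny for low-end, base for mid-range, small for high-end.
    return {"low": "tiny", "mid": "base", "high": "small"}.get(cpu_tier, "base")
```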
4.2 Cache Management
- Audio Cache: 7-day retention, LRU eviction
- Model Cache: Persistent, user-configurable location
- Result Cache: 30-day retention, configurable
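A retention-based eviction pass for the audio cache might look like this sketch; it uses file modification time as an age proxy (a strict LRU would track access time), and `evict_expired` is an illustrative name:

```python
import time
from pathlib import Path

def evict_expired(cache_dir: str, retention_days: int = 7) -> list[str]:
    """Delete cached files older than the retention window; return removed paths."""
    cutoff = time.time() - retention_days * 86400
    removed = []
    for p in Path(cache_dir).glob("*"):
        if p.is_file() and p.stat().st_mtime < cutoff:
            p.unlink()
            removed.append(str(p))
    return removed
```

The same routine serves the result cache with `retention_days=30`.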
4.3 Concurrent Processing
- Maximum parallel transcription jobs: Based on available CPU/GPU
- Queue-based task management with priority levels
- Resource pool management for Whisper model instances
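The parallelism cap could be enforced with a semaphore, as in this hedged asyncio sketch; `run_with_limit` and the default limit are illustrative:

```python
import asyncio

async def run_with_limit(task_factories, max_parallel: int = 2):
    """Run coroutine factories with at most max_parallel jobs in flight."""
    sem = asyncio.Semaphore(max_parallel)

    async def guarded(factory):
        async with sem:  # blocks here when max_parallel jobs are running
            return await factory()

    return await asyncio.gather(*(guarded(f) for f in task_factories))
```

In practice `max_parallel` would come from the resource-pool manager, sized by available CPU cores or loaded Whisper model instances.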
5. Error Handling
5.1 Failure Categories
| Failure Type | Detection Method | Recovery Strategy | User Notification |
|---|---|---|---|
| Download failure | yt-dlp exit code | Retry 3x with backoff | Error message with URL |
| Format error | File validation | Fallback format conversion | Format compatibility warning |
| Transcription failure | Whisper exception | Retry with smaller model | Suggest model downgrade |
| Resource exhaustion | Memory monitoring | Queue job, wait for resources | Estimated wait time |
| Invalid URL | Regex validation | Immediate rejection | URL format guidance |
5.2 Retry Logic
```python
def retry_strategy(attempt: int) -> int:
    """Exponential backoff: 2^attempt seconds, capped at 5 minutes."""
    return min(2 ** attempt, 300)
```
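A wrapper applying this backoff to a retryable operation (e.g. a yt-dlp download) might look like the following sketch; `with_retries` and its `max_attempts` default are assumptions, with `retry_strategy` repeated so the example is self-contained:

```python
import time

def retry_strategy(attempt: int) -> int:
    """Exponential backoff: 2^attempt seconds, capped at 5 minutes."""
    return min(2 ** attempt, 300)

def with_retries(operation, max_attempts: int = 3, sleep=time.sleep):
    """Call operation(); on failure, wait per retry_strategy and try again."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # exhausted: surface the error to the failure handler
            sleep(retry_strategy(attempt))
```

Injecting `sleep` keeps the wrapper testable without real delays.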
5.3 Logging
- Levels: DEBUG, INFO, WARNING, ERROR, CRITICAL
- Outputs: File rotation (daily), console, optional syslog
- Retention: 30 days
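The rotation and retention policy above maps directly onto the standard library's `TimedRotatingFileHandler`; the function name and defaults here are illustrative:

```python
import logging
from logging.handlers import TimedRotatingFileHandler

def configure_logging(log_path: str = "audio2text.log", level=logging.INFO):
    """Daily file rotation with 30-day retention, plus console output."""
    file_handler = TimedRotatingFileHandler(
        log_path, when="midnight", backupCount=30
    )
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        handlers=[file_handler, logging.StreamHandler()],
        force=True,  # replace any prior configuration
    )
```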
6. Security Considerations
6.1 Input Validation
- URL whitelist/blacklist support
- Maximum file size limits (default: 2GB)
- File type validation
- Command injection prevention in yt-dlp calls
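URL validation and injection prevention can be combined by allow-listing hosts and video-ID shapes before the URL ever reaches a subprocess; the host list and regex here are a sketch, not a complete ruleset, and the validated URL should still be passed to yt-dlp as a separate argv element with `shell=False`:

```python
import re
from urllib.parse import parse_qs, urlparse

ALLOWED_HOSTS = {"www.youtube.com", "youtube.com", "m.youtube.com", "youtu.be"}
VIDEO_ID_RE = re.compile(r"^[A-Za-z0-9_-]{11}$")  # standard 11-char YouTube ID

def validate_youtube_url(url: str) -> bool:
    """Accept only well-formed YouTube watch/short URLs; reject everything else."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or parsed.hostname not in ALLOWED_HOSTS:
        return False
    if parsed.hostname == "youtu.be":
        return bool(VIDEO_ID_RE.match(parsed.path.lstrip("/")))
    if parsed.path == "/watch":
        vid = parse_qs(parsed.query).get("v", [""])[0]
        return bool(VIDEO_ID_RE.match(vid))
    return False
```

Because the ID pattern excludes shell metacharacters entirely, nothing that survives validation can alter the yt-dlp command line.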
6.2 Resource Protection
- Rate limiting on API endpoints
- Maximum concurrent jobs per user
- Disk quota management
- Memory usage monitoring
6.3 Privacy
- No external API calls for transcription
- Optional telemetry (opt-in only)
- Local-only data storage
- Secure file cleanup
7. Extensibility & Future Enhancements
7.1 Plugin Architecture
- Custom pre-processing plugins
- Custom post-processing (summarization, keyword extraction)
- Custom output formatters
- Model adapter interface for alternative ASR engines
7.2 Planned Features
- Real-time transcription for live streams
- Speaker diarization
- Multi-language auto-detection
- Translation capabilities
- Integration with note-taking apps
- Docker containerization
- REST API authentication
8. Dependencies
8.1 System Dependencies
- Python 3.8+
- FFmpeg 4.4+
- GPU drivers (optional, for CUDA support)
8.2 Python Dependencies
- openai-whisper
- yt-dlp
- fastapi
- uvicorn
- pydantic
- python-multipart
- ffmpeg-python
8.3 Frontend Dependencies
- React 18+
- Material-UI
- Axios
- Socket.io-client
9. Deployment Architecture
9.1 Local Deployment
- Single-user installation via pip/npm
- Systemd service for backend
- Nginx reverse proxy for production
9.2 Multi-User Deployment
- Docker Compose setup
- Shared model storage
- PostgreSQL for job tracking
- Redis for queue management
10. Performance Targets
| Metric | Target | Measurement Method |
|---|---|---|
| API Response Time | < 200ms | 95th percentile |
| Download Start Time | < 5s | From job submission |
| Transcription Throughput | 1x-10x real-time | Based on model |
| UI Load Time | < 2s | First contentful paint |
| Maximum Concurrent Jobs | 10 (configurable) | System limit |
11. Monitoring & Observability
11.1 Metrics
- Job success/failure rates
- Processing times per model
- Queue depth and wait times
- Resource utilization (CPU, GPU, memory, disk)
- API endpoint latencies
11.2 Health Checks
- /health endpoint for service status
- Model availability checks
- Disk space monitoring
- External dependency checks (ffmpeg, yt-dlp)
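These checks could back the /health endpoint with a stdlib-only probe; `health_status` and its threshold default are illustrative:

```python
import shutil

def health_status(output_dir: str = ".", min_free_gb: float = 1.0) -> dict:
    """Aggregate service health: external tools on PATH and free disk space."""
    free_gb = shutil.disk_usage(output_dir).free / 1e9
    checks = {
        "ffmpeg": shutil.which("ffmpeg") is not None,
        "yt_dlp": shutil.which("yt-dlp") is not None,
        "disk": free_gb >= min_free_gb,
    }
    return {"ok": all(checks.values()), "checks": checks, "free_gb": round(free_gb, 2)}
```

A FastAPI route would return this dict directly, with a 503 status when `ok` is false.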
12. Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2025-11-07 | System | Initial draft |