System Design Document (SDD)

CODITECT Audio2Text

Version: 1.0 Date: 2025-11-07 Status: Draft


1. Purpose & Scope

1.1 Purpose

Create a local, Linux-native application that downloads YouTube audio/video content, converts it to text using open-source Whisper-based ASR (Automatic Speech Recognition), and outputs textual files in multiple formats (TXT, SRT, JSON, VTT).

1.2 Scope

  • In Scope:

    • YouTube video/audio download and extraction
    • Local audio transcription using OpenAI Whisper
    • Batch processing of multiple URLs
    • Web-based UI for user interaction
    • REST API for programmatic access
    • Support for multiple output formats
    • Error handling and retry logic
    • Resource management and model selection
  • Out of Scope:

    • Cloud-based transcription services
    • Real-time streaming transcription
    • Video processing beyond audio extraction
    • Mobile applications (initial release)

1.3 Target Audience

  • Content creators needing video transcriptions
  • Researchers processing educational content
  • Developers requiring automated transcription workflows
  • Privacy-conscious users preferring local processing

2. System Architecture

2.1 High-Level Architecture

┌─────────────────────────────────────────────────────────────┐
│                        Client Layer                         │
│   ┌──────────────────┐          ┌────────────────────┐      │
│   │   Web Frontend   │          │   CLI Interface    │      │
│   │   (React/Vue)    │          │    (argparse)      │      │
│   └────────┬─────────┘          └─────────┬──────────┘      │
└────────────┼──────────────────────────────┼─────────────────┘
             │        HTTP/REST API         │
             ▼                              ▼
┌─────────────────────────────────────────────────────────────┐
│                      Application Layer                      │
│  ┌──────────────────────────────────────────────────────┐   │
│  │               FastAPI Backend Service                │   │
│  │  ┌────────────┐  ┌─────────────┐  ┌──────────────┐   │   │
│  │  │    API     │  │  Job Queue  │  │  WebSocket   │   │   │
│  │  │ Endpoints  │  │   Manager   │  │   Handler    │   │   │
│  │  └────────────┘  └─────────────┘  └──────────────┘   │   │
│  └──────────────────────────────────────────────────────┘   │
└───────────────────────────┬─────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                     Core Library Layer                      │
│  ┌──────────────┐  ┌──────────────┐  ┌─────────────────┐    │
│  │   Download   │  │ Transcription│  │   Processing    │    │
│  │   Service    │  │   Service    │  │    Service      │    │
│  │   (yt-dlp)   │  │  (Whisper)   │  │    (ffmpeg)     │    │
│  └──────────────┘  └──────────────┘  └─────────────────┘    │
└───────────────────────────┬─────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                     Data/Storage Layer                      │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌───────────┐    │
│  │  Input   │  │  Cache   │  │  Output  │  │  Models   │    │
│  │  Queue   │  │ Storage  │  │  Files   │  │  Storage  │    │
│  └──────────┘  └──────────┘  └──────────┘  └───────────┘    │
└─────────────────────────────────────────────────────────────┘

2.2 Component Overview

2.2.1 Frontend Component

  • Purpose: User-facing web interface
  • Technology: React with Material-UI
  • Responsibilities:
    • URL input and validation
    • Job submission and monitoring
    • Progress visualization
    • Result download
    • Settings management

2.2.2 Backend API Component

  • Purpose: REST API and business logic orchestration
  • Technology: FastAPI (Python)
  • Responsibilities:
    • Request validation
    • Job queue management
    • Task orchestration
    • Progress reporting via WebSocket
    • Error handling and logging

2.2.3 Core Processing Components

Download Service:

  • YouTube URL validation
  • Audio/video extraction using yt-dlp
  • Format conversion coordination
  • Metadata extraction

Transcription Service:

  • Whisper model management
  • Audio-to-text conversion
  • Multiple format output generation
  • Timestamp alignment

Processing Service:

  • Audio format conversion (ffmpeg)
  • Audio preprocessing (noise reduction, normalization)
  • File cleanup and management
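All three core services wrap command-line tools. A minimal sketch of how the pipeline might assemble those invocations (the helper names are illustrative; the exact flag set should be verified against the installed yt-dlp and whisper versions):

```python
import shlex

def build_download_cmd(url, out_template):
    """yt-dlp: extract best audio and convert it to WAV for Whisper."""
    return ["yt-dlp", "-x", "--audio-format", "wav", "-o", out_template, url]

def build_transcribe_cmd(wav_path, model="base"):
    """whisper CLI: emit every supported output format (txt/srt/vtt/json/tsv)."""
    return ["whisper", wav_path, "--model", model, "--output_format", "all"]

def as_shell(cmd):
    """For logging only -- jobs are executed with subprocess list args."""
    return shlex.join(cmd)
```

Keeping commands as argument lists (never interpolated shell strings) is also what makes the command-injection prevention in section 6.1 tractable.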

3. Data Flow

3.1 Primary Workflow

1. User Input (URL) → Frontend
2. Frontend → POST /api/transcribe → Backend
3. Backend → Job Queue → Task ID returned
4. Backend → Download Service → yt-dlp
5. Download Service → Audio File → Cache Storage
6. Backend → Processing Service → Audio normalization
7. Backend → Transcription Service → Whisper
8. Transcription Service → Text Output → Output Storage
9. Backend → WebSocket → Progress updates → Frontend
10. User → Download result files → Frontend
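The queueing side of steps 2-3, and the status transitions the later steps drive, can be sketched framework-free (the class and stage names are illustrative, not the actual API):

```python
import uuid
from dataclasses import dataclass, field

# Stages mirror steps 3-8 of the workflow above
STAGES = ["queued", "downloading", "processing", "transcribing", "done"]

@dataclass
class Job:
    url: str
    id: str = field(default_factory=lambda: uuid.uuid4().hex)
    status: str = "queued"

class JobQueue:
    """In-memory stand-in for the backend's job queue manager."""

    def __init__(self):
        self.jobs = {}

    def submit(self, url):
        job = Job(url)
        self.jobs[job.id] = job
        return job.id  # returned to the client as the task ID (step 3)

    def advance(self, job_id):
        job = self.jobs[job_id]
        job.status = STAGES[STAGES.index(job.status) + 1]
        return job.status  # pushed to the frontend over WebSocket (step 9)
```

In the real service the queue would be backed by persistent storage and the stage transitions driven by the download, processing, and transcription services.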

3.2 Batch Processing Workflow

1. User Input (Multiple URLs/Files) → Frontend
2. Frontend → POST /api/batch-transcribe → Backend
3. Backend → Create batch job → Multiple task IDs
4. For each URL:
- Download → Process → Transcribe
- Update individual task status
5. Backend → Aggregate results
6. Frontend → Display batch results dashboard
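The fan-out in steps 3-5 reduces to creating one task record per URL and aggregating their statuses for the dashboard (a sketch; the task-ID scheme is an assumption):

```python
from collections import Counter

def create_batch(urls):
    """Fan a batch request out into per-URL task records (steps 3-4)."""
    return {f"task-{i}": {"url": u, "status": "queued"}
            for i, u in enumerate(urls)}

def batch_summary(tasks):
    """Aggregate per-task status for the results dashboard (steps 5-6)."""
    return dict(Counter(t["status"] for t in tasks.values()))
```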

4. Resource Management

4.1 Model Selection Strategy

Hardware Profile   Recommended Model   Memory Required   Processing Speed
Low-end CPU        tiny                ~1GB              ~10x real-time
Mid-range CPU      base                ~1GB              ~7x real-time
High-end CPU       small               ~2GB              ~4x real-time
GPU (4GB)          base/small          ~2GB              ~20x real-time
GPU (8GB+)         medium/large        ~5GB              ~30x real-time
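The table above can be encoded as a small selection helper (a sketch; the function name and tier labels are assumptions, and where the table offers two candidates the smaller, safer model is chosen):

```python
def recommend_model(has_gpu, vram_gb=0.0, cpu_tier="mid"):
    """Map a hardware profile to a Whisper model size per the table above."""
    if has_gpu:
        # GPU (8GB+) -> medium/large; GPU (4GB) -> base/small
        return "medium" if vram_gb >= 8 else "base"
    # CPU tiers: low -> tiny, mid -> base, high -> small
    return {"low": "tiny", "mid": "base", "high": "small"}[cpu_tier]
```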

4.2 Cache Management

  • Audio Cache: 7-day retention, LRU eviction
  • Model Cache: Persistent, user-configurable location
  • Result Cache: 30-day retention, configurable
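The 7-day LRU eviction for the audio cache might look like the following sketch, which uses file access time as the recency signal (an assumption; on filesystems mounted with noatime the service would need to record last-use timestamps itself):

```python
import os
import time

RETENTION_SECONDS = 7 * 24 * 3600  # 7-day audio-cache retention

def evict_stale(cache_dir, now=None):
    """Delete cached audio files not accessed within the retention window."""
    now = now if now is not None else time.time()
    removed = []
    for entry in os.scandir(cache_dir):
        # st_atime approximates "least recently used" for this sketch
        if entry.is_file() and now - entry.stat().st_atime > RETENTION_SECONDS:
            os.remove(entry.path)
            removed.append(entry.name)
    return removed
```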

4.3 Concurrent Processing

  • Maximum parallel transcription jobs: Based on available CPU/GPU
  • Queue-based task management with priority levels
  • Resource pool management for Whisper model instances
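One way to bound parallel transcription is a semaphore sized to the hardware, which also serves as the resource pool for model instances (a sketch; names and the slot count are illustrative):

```python
import threading

MAX_PARALLEL_JOBS = 2  # sized from available CPU cores / GPU memory
_model_slots = threading.BoundedSemaphore(MAX_PARALLEL_JOBS)

def run_transcription(job_id, results):
    """Blocks until a Whisper model slot is free, then runs the job."""
    with _model_slots:
        # load model / transcribe here; elided in this sketch
        results.append(job_id)
```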

5. Error Handling

5.1 Failure Categories

Failure Type           Detection Method    Recovery Strategy              User Notification
Download failure       yt-dlp exit code    Retry 3x with backoff          Error message with URL
Format error           File validation     Fallback format conversion     Format compatibility warning
Transcription failure  Whisper exception   Retry with smaller model       Suggest model downgrade
Resource exhaustion    Memory monitoring   Queue job, wait for resources  Estimated wait time
Invalid URL            Regex validation    Immediate rejection            URL format guidance

5.2 Retry Logic

def retry_strategy(attempt: int) -> int:
    """Exponential backoff: 2^attempt seconds"""
    return min(2 ** attempt, 300)  # Max 5 minutes
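The strategy above can be wrapped in a small retry driver (a sketch; `with_retries` is an illustrative name, and the injectable `sleep` exists only to make the backoff testable):

```python
import time

def retry_strategy(attempt: int) -> int:
    """Exponential backoff: 2^attempt seconds"""
    return min(2 ** attempt, 300)  # Max 5 minutes

def with_retries(task, max_attempts=3, sleep=time.sleep):
    """Run `task`, sleeping per retry_strategy between failed attempts."""
    for attempt in range(max_attempts):
        try:
            return task()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(retry_strategy(attempt))
```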

5.3 Logging

  • Levels: DEBUG, INFO, WARNING, ERROR, CRITICAL
  • Outputs: File rotation (daily), console, optional syslog
  • Retention: 30 days
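The daily rotation and 30-day retention above map directly onto the standard library's rotating handler; a minimal setup sketch (the logger name and format string are assumptions):

```python
import logging
from logging.handlers import TimedRotatingFileHandler

def setup_logging(log_path, level=logging.INFO):
    """Daily-rotating file log with 30-day retention, plus console output."""
    logger = logging.getLogger("audio2text")
    logger.setLevel(level)
    logger.handlers.clear()  # keep re-configuration idempotent
    fmt = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    file_handler = TimedRotatingFileHandler(
        log_path, when="midnight", backupCount=30  # 30 days of daily files
    )
    file_handler.setFormatter(fmt)
    console = logging.StreamHandler()
    console.setFormatter(fmt)
    logger.addHandler(file_handler)
    logger.addHandler(console)
    return logger
```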

6. Security Considerations

6.1 Input Validation

  • URL whitelist/blacklist support
  • Maximum file size limits (default: 2GB)
  • File type validation
  • Command injection prevention in yt-dlp calls
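A sketch of the URL validation gate (the pattern below only admits canonical YouTube watch URLs and is an assumption about the intended whitelist; real yt-dlp accepts many more forms). Rejecting anything else before it reaches yt-dlp, and passing validated URLs as subprocess list arguments rather than shell strings, covers the injection concern above:

```python
import re

# 11-character video IDs, watch?v= or youtu.be short links only
YOUTUBE_URL = re.compile(
    r"^https?://(www\.)?(youtube\.com/watch\?v=[\w-]{11}|youtu\.be/[\w-]{11})"
)
MAX_FILE_BYTES = 2 * 1024 ** 3  # 2GB default size limit

def validate_url(url):
    """True only for URLs matching the YouTube whitelist pattern."""
    return bool(YOUTUBE_URL.match(url))
```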

6.2 Resource Protection

  • Rate limiting on API endpoints
  • Maximum concurrent jobs per user
  • Disk quota management
  • Memory usage monitoring

6.3 Privacy

  • No external API calls for transcription
  • Optional telemetry (opt-in only)
  • Local-only data storage
  • Secure file cleanup

7. Extensibility & Future Enhancements

7.1 Plugin Architecture

  • Custom pre-processing plugins
  • Custom post-processing (summarization, keyword extraction)
  • Custom output formatters
  • Model adapter interface for alternative ASR engines

7.2 Planned Features

  • Real-time transcription for live streams
  • Speaker diarization
  • Multi-language auto-detection
  • Translation capabilities
  • Integration with note-taking apps
  • Docker containerization
  • REST API authentication

8. Dependencies

8.1 System Dependencies

  • Python 3.8+
  • FFmpeg 4.4+
  • GPU drivers (optional, for CUDA support)

8.2 Python Dependencies

  • openai-whisper
  • yt-dlp
  • fastapi
  • uvicorn
  • pydantic
  • python-multipart
  • ffmpeg-python

8.3 Frontend Dependencies

  • React 18+
  • Material-UI
  • Axios
  • Socket.io-client

9. Deployment Architecture

9.1 Local Deployment

  • Single-user installation via pip/npm
  • Systemd service for backend
  • Nginx reverse proxy for production

9.2 Multi-User Deployment

  • Docker Compose setup
  • Shared model storage
  • PostgreSQL for job tracking
  • Redis for queue management

10. Performance Targets

Metric                    Target             Measurement Method
API Response Time         < 200ms            95th percentile
Download Start Time       < 5s               From job submission
Transcription Throughput  1x-10x real-time   Based on model
UI Load Time              < 2s               First contentful paint
Maximum Concurrent Jobs   10 (configurable)  System limit

11. Monitoring & Observability

11.1 Metrics

  • Job success/failure rates
  • Processing times per model
  • Queue depth and wait times
  • Resource utilization (CPU, GPU, memory, disk)
  • API endpoint latencies

11.2 Health Checks

  • /health endpoint for service status
  • Model availability checks
  • Disk space monitoring
  • External dependency checks (ffmpeg, yt-dlp)
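The checks above can be aggregated into the payload the /health endpoint returns; a stdlib-only sketch (field names and the 1GB disk threshold are assumptions):

```python
import shutil

def health_report(work_dir="."):
    """Snapshot of service health for the /health endpoint."""
    free_gb = shutil.disk_usage(work_dir).free / 1024 ** 3
    return {
        # external dependency checks: are the binaries on PATH?
        "ffmpeg": shutil.which("ffmpeg") is not None,
        "yt_dlp": shutil.which("yt-dlp") is not None,
        # disk space monitoring
        "disk_free_gb": round(free_gb, 1),
        "status": "ok" if free_gb > 1 else "degraded",
    }
```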

12. Revision History

Version   Date         Author   Changes
1.0       2025-11-07   System   Initial draft