Technical Design Document: UDOM Pipeline

Version: 1.0 | Date: 2026-02-09
Classification: Architecture — Technical Design
Subsystem: UDOM Pipeline Integration into CODITECT

1. APIs & Extension Points

1.1 UDOM Pipeline API (Internal — Agent-Facing)

POST   /api/v1/udom/extract          # Submit single paper for extraction
POST   /api/v1/udom/batch             # Submit batch extraction job
GET    /api/v1/udom/batch/:id/status   # Batch job status
GET    /api/v1/udom/documents/:arxiv_id  # Retrieve UDOM document
POST   /api/v1/udom/search            # Semantic search across UDOM store
GET    /api/v1/udom/corpora            # List tenant's corpora
GET    /api/v1/udom/quality/:corpus    # Quality report for corpus
DELETE /api/v1/udom/documents/:arxiv_id  # Remove document (soft delete)
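
As a minimal sketch of how an agent might call the single-paper endpoint, the snippet below builds (but does not send) the HTTP request using only Python's standard library. The host name, the bearer-token header, and the helper name `build_extract_request` are assumptions for illustration; only the route comes from the list above.

```python
import json
import urllib.request

BASE_URL = "https://coditect.example.internal"  # assumption: placeholder host


def build_extract_request(arxiv_id: str, token: str) -> urllib.request.Request:
    """Build a POST /api/v1/udom/extract request without sending it."""
    body = json.dumps({"arxiv_id": arxiv_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/api/v1/udom/extract",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumption: bearer auth
        },
    )


req = build_extract_request("2003.05991", token="dummy-token")
print(req.get_method(), req.full_url)
```

Sending the request would be a call to `urllib.request.urlopen(req)`; it is left out here so the sketch runs without network access.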

1.2 Extraction Request Schema

interface ExtractionRequest {
  arxiv_id: string;                          // e.g., "2003.05991"
  sources?: ("docling" | "ar5iv" | "latex")[]; // Default: all three
  quality_threshold?: number;                 // Default: 0.85
  max_retries?: number;                       // Default: 2
  corpus?: string;                            // Target corpus name
  priority?: "low" | "normal" | "high";       // Queue priority
  metadata?: Record<string, string>;          // Custom metadata
}

interface BatchRequest {
  arxiv_ids: string[];
  max_concurrent?: number;                    // Default: 5
  quality_threshold?: number;
  corpus?: string;
  callback_url?: string;                      // Webhook on completion
}
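
A server-side handler would presumably apply the documented defaults and reject malformed input before queueing work. The dataclass below is a hedged Python mirror of `ExtractionRequest`; the validation rules (threshold within 0 to 1, non-negative retries) and the `"normal"` priority default are assumptions, since the schema does not state them.

```python
from dataclasses import dataclass, field
from typing import Optional

VALID_SOURCES = ("docling", "ar5iv", "latex")
VALID_PRIORITIES = ("low", "normal", "high")


@dataclass
class ExtractionRequest:
    arxiv_id: str
    sources: tuple = VALID_SOURCES       # Default: all three
    quality_threshold: float = 0.85      # Default from the schema
    max_retries: int = 2                 # Default from the schema
    corpus: Optional[str] = None
    priority: str = "normal"             # assumption: no default is documented
    metadata: dict = field(default_factory=dict)

    def __post_init__(self):
        if not self.arxiv_id:
            raise ValueError("arxiv_id is required")
        unknown = set(self.sources) - set(VALID_SOURCES)
        if unknown:
            raise ValueError(f"unknown sources: {sorted(unknown)}")
        if not 0.0 <= self.quality_threshold <= 1.0:  # assumption: scores lie in [0, 1]
            raise ValueError("quality_threshold must be between 0 and 1")
        if self.max_retries < 0:
            raise ValueError("max_retries must be non-negative")
        if self.priority not in VALID_PRIORITIES:
            raise ValueError(f"unknown priority: {self.priority}")


request = ExtractionRequest(arxiv_id="2003.05991", corpus="nerf-survey")
print(request.sources, request.quality_threshold)
```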

1.3 Extension Points

Extension Point	Interface	Use Case
Source Adapter	`SourceAdapter` (Python ABC)	Add new publishers (PubMed, IEEE, Springer)
Fusion Strategy	`FusionStrategy` (Python ABC)	Custom fusion rules per domain (e.g., medical papers prioritize different sources)
Quality Dimension	`QualityDimension` (Python ABC)	Add domain-specific quality checks (e.g., "clinical trial table format")
Post-Processor	`PostProcessor` (Python ABC)	Transform UDOM output (e.g., generate summary, extract key findings)
KaTeX Macro Pack	JSON dictionary	Add domain-specific LaTeX macros for Navigator rendering
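
The table names the extension interfaces but not their methods. The sketch below shows one plausible shape for `SourceAdapter`; the method names `can_handle` and `extract`, the `name` attribute, and the `PubMedAdapter` stub are all hypothetical and only illustrate how a new publisher could plug in.

```python
from abc import ABC, abstractmethod


class SourceAdapter(ABC):
    """Hypothetical shape of the SourceAdapter extension point."""

    name: str  # identifier recorded as the component's source

    @abstractmethod
    def can_handle(self, document_id: str) -> bool:
        """Return True if this adapter recognises the identifier."""

    @abstractmethod
    def extract(self, document_id: str) -> list[dict]:
        """Return extracted components as plain dictionaries."""


class PubMedAdapter(SourceAdapter):
    name = "pubmed"

    def can_handle(self, document_id: str) -> bool:
        return document_id.startswith("PMC")

    def extract(self, document_id: str) -> list[dict]:
        # Placeholder: a real adapter would fetch and parse the article here.
        return [{
            "type": "heading",
            "content": f"Stub for {document_id}",
            "source": self.name,
            "confidence": 0.5,
            "position": 0,
        }]


adapter = PubMedAdapter()
print(adapter.can_handle("PMC1234567"), adapter.extract("PMC1234567")[0]["source"])
```

A new adapter of this kind would presumably also require extending the `source` enum in the UDOMComponent schema (4.2), which currently lists only `docling`, `ar5iv`, `latex`, and `fused`.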

2. Configuration Surfaces

2.1 Pipeline Configuration

# udom-pipeline.yaml
extraction:
  docling:
    enabled: true
    timeout_seconds: 30
    max_memory_mb: 2048
    version: ">=2.0,<3.0"
  ar5iv:
    enabled: true
    timeout_seconds: 15
    rate_limit_rps: 5
    cache_ttl_hours: 720      # 30 days — papers don't change
    fallback_on_failure: true
  latex:
    enabled: true
    timeout_seconds: 45
    pandoc_args: ["--wrap=none", "--from=latex+raw_tex"]
    custom_macros_file: "macros.json"

fusion:
  strategy: "confidence_weighted"  # or "source_priority" or "voting"
  confidence_weights:
    equation_display: { latex: 0.95, ar5iv: 0.90, docling: 0.70 }
    equation_inline: { ar5iv: 0.95, latex: 0.90, docling: 0.65 }
    table: { ar5iv: 0.92, docling: 0.80, latex: 0.75 }
    heading: { docling: 0.95, ar5iv: 0.85, latex: 0.80 }
    paragraph: { docling: 0.90, ar5iv: 0.85, latex: 0.80 }

quality:
  threshold: 0.85              # Grade A minimum
  max_retries: 2
  dimension_weights:
    structure: 0.12
    tables: 0.12
    math: 0.15
    citations: 0.10
    images: 0.08
    content_density: 0.10
    latex_residual: 0.13
    heading_hierarchy: 0.10
    bibliography: 0.10

batch:
  max_concurrent: 10
  checkpoint_on_failure: true
  audit_trail: true

store:
  connection_pool_size: 20
  jsonb_gin_index: true
  full_text_search: true
  retention_days: null          # null = infinite retention
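
To make the `confidence_weights` table concrete, the sketch below picks, for one component type, the available source with the highest configured weight. This is a simplification under stated assumptions: the real `confidence_weighted` strategy may also factor in per-component extraction confidence, and the function name `pick_source` is hypothetical.

```python
# Weights copied from the fusion.confidence_weights section above.
CONFIDENCE_WEIGHTS = {
    "equation_display": {"latex": 0.95, "ar5iv": 0.90, "docling": 0.70},
    "equation_inline": {"ar5iv": 0.95, "latex": 0.90, "docling": 0.65},
    "table": {"ar5iv": 0.92, "docling": 0.80, "latex": 0.75},
    "heading": {"docling": 0.95, "ar5iv": 0.85, "latex": 0.80},
    "paragraph": {"docling": 0.90, "ar5iv": 0.85, "latex": 0.80},
}


def pick_source(component_type: str, candidates: dict[str, str]) -> tuple[str, str]:
    """Return (source, content) for the available source with the highest weight."""
    weights = CONFIDENCE_WEIGHTS[component_type]
    available = [source for source in candidates if source in weights]
    if not available:
        raise ValueError(f"no weighted source available for {component_type}")
    best = max(available, key=lambda source: weights[source])
    return best, candidates[best]


# ar5iv failed for this paper, so the table falls back to docling (0.80 > 0.75).
print(pick_source("table", {"docling": "| a | b |", "latex": "\\begin{tabular}"}))
```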

2.2 Tenant Overrides

Tenants can override specific configuration values:

# tenant-overrides/tenant-pharma-123.yaml
quality:
  threshold: 0.90              # Higher threshold for FDA-regulated workflows
  
extraction:
  ar5iv:
    rate_limit_rps: 2          # Conservative rate limiting

fusion:
  confidence_weights:
    table:
      ar5iv: 0.95             # Pharma prioritizes table accuracy
      docling: 0.85

custom:
  macros_file: "pharma-macros.json"  # Domain-specific LaTeX macros
  post_processors: ["extract_clinical_data", "flag_adverse_events"]
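
One way to apply such overrides is a recursive merge in which tenant values win and unspecified keys fall through to the pipeline defaults. The sketch below assumes that merge semantics (the document does not specify them) and uses a hypothetical helper name, `deep_merge`.

```python
import copy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where override values replace base values, key by key."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


base = {
    "quality": {"threshold": 0.85, "max_retries": 2},
    "fusion": {"confidence_weights": {
        "table": {"ar5iv": 0.92, "docling": 0.80, "latex": 0.75},
    }},
}
tenant = {
    "quality": {"threshold": 0.90},
    "fusion": {"confidence_weights": {"table": {"ar5iv": 0.95, "docling": 0.85}}},
}

effective = deep_merge(base, tenant)
print(effective["quality"])
print(effective["fusion"]["confidence_weights"]["table"])
```

Under this scheme the pharma tenant keeps the default `latex` table weight of 0.75 and the default `max_retries` of 2, because its override file does not mention them.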

3. Packaging & Deployment

3.1 Container Images

Image	Base	Size	Contents
`udom-orchestrator`	`python:3.11-slim`	~450MB	Orchestration logic, NATS client, PostgreSQL client
`udom-worker-docling`	`python:3.11-slim`	~1.2GB	Docling engine + model files
`udom-worker-ar5iv`	`python:3.11-slim`	~200MB	BeautifulSoup, lxml, HTTP client
`udom-worker-latex`	`python:3.11-slim` + pandoc	~350MB	Pandoc, tarfile handling, macro expansion
`udom-navigator`	`nginx:alpine`	~30MB	Static HTML/JS/CSS

3.2 Helm Chart Values

# values.yaml
replicaCount:
  orchestrator: 1
  docling: 3
  ar5iv: 2
  latex: 2
  scorer: 1

resources:
  docling:
    requests: { cpu: "1000m", memory: "2Gi" }
    limits: { cpu: "2000m", memory: "4Gi" }
  ar5iv:
    requests: { cpu: "250m", memory: "256Mi" }
    limits: { cpu: "500m", memory: "512Mi" }
  latex:
    requests: { cpu: "500m", memory: "512Mi" }
    limits: { cpu: "1000m", memory: "1Gi" }

nats:
  url: "nats://nats:4222"
  subjects:
    extract_docling: "udom.extract.docling"
    extract_ar5iv: "udom.extract.ar5iv"
    extract_latex: "udom.extract.latex"
    fusion: "udom.fusion"
    quality: "udom.quality"
    audit: "udom.audit"

postgresql:
  host: "postgres"
  database: "udom"
  pool_size: 20
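
The per-source subjects suggest that the orchestrator fans one extraction request out to each enabled worker. The sketch below only prepares the (subject, payload) pairs; the JSON payload shape and the function name `fan_out` are assumptions, and the actual publish call of a NATS client is deliberately omitted.

```python
import json

# Subjects copied from the nats.subjects section above.
SUBJECTS = {
    "docling": "udom.extract.docling",
    "ar5iv": "udom.extract.ar5iv",
    "latex": "udom.extract.latex",
}


def fan_out(arxiv_id: str, sources=("docling", "ar5iv", "latex")) -> list[tuple[str, bytes]]:
    """Return one (subject, payload) pair per requested source."""
    payload = json.dumps({"arxiv_id": arxiv_id}).encode("utf-8")  # assumption: payload shape
    return [(SUBJECTS[source], payload) for source in sources]


for subject, payload in fan_out("2003.05991"):
    print(subject, payload)
```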

4. Data Model

4.1 Core Tables

-- UDOM Documents
CREATE TABLE udom_documents (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id       UUID NOT NULL,
    arxiv_id        TEXT NOT NULL,
    title           TEXT,
    authors         JSONB,
    abstract        TEXT,
    components      JSONB NOT NULL,
    quality_score   JSONB NOT NULL,
    source_stats    JSONB,
    corpus          TEXT DEFAULT 'default',
    version         INTEGER DEFAULT 1,
    created_at      TIMESTAMPTZ DEFAULT NOW(),
    updated_at      TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE (tenant_id, arxiv_id, version)
);

-- Batch Runs
CREATE TABLE udom_batch_runs (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id       UUID NOT NULL,
    status          TEXT CHECK (status IN ('pending', 'running', 'completed', 'failed')),
    paper_count     INTEGER,
    completed_count INTEGER DEFAULT 0,
    failed_count    INTEGER DEFAULT 0,
    configuration   JSONB,
    report          JSONB,
    started_at      TIMESTAMPTZ,
    completed_at    TIMESTAMPTZ,
    created_at      TIMESTAMPTZ DEFAULT NOW()
);

-- Audit Trail (append-only)
CREATE TABLE udom_audit_events (
    id              BIGSERIAL PRIMARY KEY,
    tenant_id       UUID NOT NULL,
    event_type      TEXT NOT NULL,
    arxiv_id        TEXT,
    batch_id        UUID,
    payload         JSONB NOT NULL,
    content_hash    TEXT,          -- SHA-256 of component content
    created_at      TIMESTAMPTZ DEFAULT NOW()
);

-- Corpora (named collections)
CREATE TABLE udom_corpora (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id       UUID NOT NULL,
    name            TEXT NOT NULL,
    description     TEXT,
    paper_count     INTEGER DEFAULT 0,
    quality_stats   JSONB,
    created_at      TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE (tenant_id, name)
);
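
The `content_hash` column stores a SHA-256 of component content, but the schema does not say how the content is serialized before hashing. The sketch below assumes a canonical JSON encoding (sorted keys, compact separators) so that the same components always hash to the same value; the helper names and the payload fields are hypothetical.

```python
import hashlib
import json


def content_hash(components: list[dict]) -> str:
    """SHA-256 hex digest over a canonical JSON encoding of the components."""
    canonical = json.dumps(
        components, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_audit_event(tenant_id: str, event_type: str, arxiv_id: str,
                      components: list[dict]) -> dict:
    """Assemble a row for udom_audit_events (id and created_at come from the database)."""
    return {
        "tenant_id": tenant_id,
        "event_type": event_type,
        "arxiv_id": arxiv_id,
        "batch_id": None,
        "payload": {"component_count": len(components)},  # assumption: payload contents
        "content_hash": content_hash(components),
    }


event = build_audit_event(
    "00000000-0000-0000-0000-000000000001", "document.stored", "2003.05991",
    [{"type": "heading", "content": "Introduction", "position": 0}],
)
print(event["content_hash"])
```

Sorting keys matters here: without it, two semantically identical components could produce different digests and make the audit trail harder to verify.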

4.2 UDOMComponent JSON Schema

{
  "type": "object",
  "required": ["type", "content", "source", "confidence", "position"],
  "properties": {
    "type": {
      "type": "string",
      "enum": ["heading", "paragraph", "equation", "figure", "table", 
               "citation", "code", "list", "abstract", "bibliography",
               "caption", "footnote", "algorithm", "theorem", "proof",
               "definition", "example", "remark", "appendix", "acknowledgment",
               "author_info", "metadata", "reference", "supplementary", "unknown"]
    },
    "content": { "type": "string" },
    "source": { "type": "string", "enum": ["docling", "ar5iv", "latex", "fused"] },
    "confidence": { "type": "number", "minimum": 0, "maximum": 1 },
    "position": { "type": "integer", "minimum": 0 },
    "metadata": {
      "type": "object",
      "properties": {
        "level": { "type": "integer" },
        "display": { "type": "string", "enum": ["inline", "block"] },
        "caption": { "type": "string" },
        "alt_sources": {
          "type": "array",
          "items": { "type": "string" }
        }
      }
    }
  }
}
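
The required fields and value ranges in the schema can be checked without a JSON Schema library. The sketch below is a hand-written validator covering only the top-level required properties (the optional `metadata` object is not checked); the function name `validate_component` is hypothetical.

```python
COMPONENT_TYPES = frozenset({
    "heading", "paragraph", "equation", "figure", "table",
    "citation", "code", "list", "abstract", "bibliography",
    "caption", "footnote", "algorithm", "theorem", "proof",
    "definition", "example", "remark", "appendix", "acknowledgment",
    "author_info", "metadata", "reference", "supplementary", "unknown",
})
SOURCES = frozenset({"docling", "ar5iv", "latex", "fused"})
REQUIRED = ("type", "content", "source", "confidence", "position")


def validate_component(component: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the component is valid."""
    errors = [f"missing required field: {key}" for key in REQUIRED if key not in component]
    if errors:
        return errors
    if not isinstance(component["type"], str) or component["type"] not in COMPONENT_TYPES:
        errors.append("type is not a known component type")
    if not isinstance(component["content"], str):
        errors.append("content must be a string")
    if not isinstance(component["source"], str) or component["source"] not in SOURCES:
        errors.append("source is not a known extraction source")
    confidence = component["confidence"]
    if (isinstance(confidence, bool) or not isinstance(confidence, (int, float))
            or not 0 <= confidence <= 1):
        errors.append("confidence must be a number between 0 and 1")
    position = component["position"]
    if isinstance(position, bool) or not isinstance(position, int) or position < 0:
        errors.append("position must be a non-negative integer")
    return errors


good = {"type": "equation", "content": "E = mc^2", "source": "latex",
        "confidence": 0.95, "position": 12}
print(validate_component(good))
print(validate_component({**good, "confidence": 1.4}))
```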

5. Security Integration

5.1 Data Flow Security

Data Flow	Security Control	Implementation
arXiv PDF download	TLS 1.3	HTTPS only, certificate pinning
ar5iv HTML fetch	TLS 1.3	HTTPS only, response validation
LaTeX source download	TLS 1.3	HTTPS only, content-type verification
NATS messaging	mTLS	Worker-to-NATS mutual TLS authentication
PostgreSQL connections	TLS + SCRAM-SHA-256	Encrypted connections, SCRAM-SHA-256 password authentication
Navigator access	OIDC + RBAC	Tenant-scoped authentication

5.2 Content Security

Threat	Mitigation
Malicious PDF	Docling sandboxed extraction; no JavaScript execution in PDF parser
LaTeX injection	Pandoc runs in restricted mode; no `\input` from external URLs
XSS in Navigator	KaTeX renders to safe HTML; CSP headers on Navigator
SSRF via paper URLs	Allowlist: `arxiv.org`, `ar5iv.labs.arxiv.org` only; no arbitrary URL fetching
PHI in papers	Pre-extraction PHI scanner for regulated tenants (configurable)

5.3 Access Control

RBAC Roles:
├── udom:admin      — Full CRUD on all corpora, configuration, batch management
├── udom:researcher — Read corpora, submit extraction requests, search
├── udom:reviewer   — Read corpora, view quality reports, approve checkpoints
└── udom:agent      — Programmatic read access to UDOM store (agent service account)
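
The role descriptions above can be read as a role-to-permission mapping. The sketch below encodes one such reading; the permission names (for example `extract:submit`) are hypothetical, and treating `udom:admin` as a superset of the other roles is an assumption.

```python
# Hypothetical permission names derived from the role descriptions above.
ROLE_PERMISSIONS = {
    "udom:admin": {
        "corpus:read", "corpus:write", "corpus:delete", "config:write",
        "batch:manage", "extract:submit", "search", "quality:read",
        "checkpoint:approve",
    },
    "udom:researcher": {"corpus:read", "extract:submit", "search"},
    "udom:reviewer": {"corpus:read", "quality:read", "checkpoint:approve"},
    "udom:agent": {"corpus:read"},
}


def is_allowed(roles: list[str], permission: str) -> bool:
    """Return True if any of the caller's roles grants the permission."""
    return any(permission in ROLE_PERMISSIONS.get(role, set()) for role in roles)


print(is_allowed(["udom:researcher"], "extract:submit"))
print(is_allowed(["udom:agent"], "extract:submit"))
```

Unknown roles grant nothing, so a misconfigured token fails closed. Tenant scoping is a separate check and is not modeled here.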

6. Example Interfaces

6.1 Python — Core Types

from dataclasses import dataclass, field
from enum import Enum
from datetime import datetime
from uuid import UUID

class ComponentType(str, Enum):
    HEADING = "heading"
    PARAGRAPH = "paragraph"
    EQUATION = "equation"
    FIGURE = "figure"
    TABLE = "table"
    CITATION = "citation"
    CODE = "code"
    LIST = "list"
    ABSTRACT = "abstract"
    BIBLIOGRAPHY = "bibliography"
    CAPTION = "caption"
    FOOTNOTE = "footnote"
    ALGORITHM = "algorithm"
    THEOREM = "theorem"
    PROOF = "proof"
    DEFINITION = "definition"
    EXAMPLE = "example"
    REMARK = "remark"
    APPENDIX = "appendix"
    ACKNOWLEDGMENT = "acknowledgment"
    AUTHOR_INFO = "author_info"
    METADATA = "metadata"
    REFERENCE = "reference"
    SUPPLEMENTARY = "supplementary"
    UNKNOWN = "unknown"

class ExtractionSource(str, Enum):
    DOCLING = "docling"
    AR5IV = "ar5iv"
    LATEX = "latex"
    FUSED = "fused"

class QualityGrade(str, Enum):
    A = "A"  # >= 0.85
    B = "B"  # >= 0.70
    C = "C"  # < 0.70

@dataclass(frozen=True)
class UDOMComponent:
    type: ComponentType
    content: str
    source: ExtractionSource
    confidence: float
    position: int
    metadata: dict = field(default_factory=dict)

@dataclass
class QualityScore:
    structure: float
    tables: float
    math: float
    citations: float
    images: float
    content_density: float
    latex_residual: float
    heading_hierarchy: float
    bibliography: float
    overall: float = field(init=False)        # Derived in __post_init__
    grade: QualityGrade = field(init=False)   # Derived in __post_init__

    def __post_init__(self):
        # Weights mirror quality.dimension_weights in udom-pipeline.yaml and sum to 1.0.
        weights = [0.12, 0.12, 0.15, 0.10, 0.08, 0.10, 0.13, 0.10, 0.10]
        scores = [self.structure, self.tables, self.math, self.citations,
                  self.images, self.content_density, self.latex_residual,
                  self.heading_hierarchy, self.bibliography]
        self.overall = sum(w * s for w, s in zip(weights, scores))
        self.grade = (QualityGrade.A if self.overall >= 0.85
                      else QualityGrade.B if self.overall >= 0.70
                      else QualityGrade.C)

@dataclass
class UDOMDocument:
    id: UUID
    tenant_id: UUID
    arxiv_id: str
    title: str
    components: list[UDOMComponent]
    quality_score: QualityScore
    source_stats: dict
    corpus: str = "default"
    version: int = 1
    created_at: datetime = field(default_factory=lambda: datetime.now().astimezone())  # timezone-aware; utcnow() is naive and deprecated since Python 3.12

6.2 TypeScript — Agent Tool Types

// UDOM types for CODITECT agent integration

export enum ComponentType {
  Heading = "heading",
  Paragraph = "paragraph",
  Equation = "equation",
  Figure = "figure",
  Table = "table",
  Citation = "citation",
  Code = "code",
  List = "list",
  Abstract = "abstract",
  Bibliography = "bibliography",
  // ... 15 more
}

export enum ExtractionSource {
  Docling = "docling",
  Ar5iv = "ar5iv",
  Latex = "latex",
  Fused = "fused",
}

export interface UDOMComponent {
  readonly type: ComponentType;
  readonly content: string;
  readonly source: ExtractionSource;
  readonly confidence: number;
  readonly position: number;
  readonly metadata: Record<string, unknown>;
}

export interface QualityScore {
  readonly structure: number;
  readonly tables: number;
  readonly math: number;
  readonly citations: number;
  readonly images: number;
  readonly contentDensity: number;
  readonly latexResidual: number;
  readonly headingHierarchy: number;
  readonly bibliography: number;
  readonly overall: number;
  readonly grade: "A" | "B" | "C";
}

export interface UDOMDocument {
  readonly id: string;
  readonly tenantId: string;
  readonly arxivId: string;
  readonly title: string;
  readonly components: ReadonlyArray<UDOMComponent>;
  readonly qualityScore: QualityScore;
  readonly sourceStats: {
    readonly docling: number;
    readonly ar5iv: number;
    readonly latex: number;
    readonly fused: number;
    readonly elapsedSeconds: number;
  };
  readonly corpus: string;
  readonly version: number;
  readonly createdAt: string;
}

// Agent tool parameter types
export interface SearchParams {
  query: string;
  tenantId: string;
  corpus?: string;
  componentType?: ComponentType;
  limit?: number;
  minConfidence?: number;
}

export interface CrossPaperComparisonParams {
  metric: string;
  paperIds: string[];
  tenantId: string;
}

export interface ComparisonResult {
  metric: string;
  papers: Array<{
    arxivId: string;
    title: string;
    relevantTables: UDOMComponent[];
    relevantEquations: UDOMComponent[];
  }>;
}

7. Performance Characteristics

7.1 Extraction Latency

Source	P50	P95	P99	Bottleneck
Docling (PDF)	5.2s	8.1s	12.3s	CPU (PDF parsing)
ar5iv (HTML)	2.1s	3.5s	5.8s	Network (HTTP fetch)
LaTeX (source)	6.8s	11.2s	18.5s	CPU (pandoc) + Network (tar download)
Fusion	0.3s	0.5s	0.8s	CPU (component matching)
Quality scoring	0.1s	0.2s	0.3s	CPU (score computation)
Total (parallel)	8.4s	14.2s	22.1s	Network + CPU

7.2 Storage Requirements

Metric	Value
Average UDOM JSON per paper	~150 KB
Average markdown per paper	~45 KB
Average components per paper	300+
218-paper corpus total	~42 MB JSONB + ~10 MB markdown
10,000-paper corpus estimate	~1.9 GB JSONB + ~450 MB markdown
PostgreSQL with GIN indexes	~3× raw data size
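
The 10,000-paper row is consistent with scaling the measured 218-paper corpus linearly. The sketch below reproduces that arithmetic; linear growth with paper count is an assumption, and the function name `estimate_storage_mb` is hypothetical.

```python
# Measured totals from the 218-paper corpus row above.
MEASURED_PAPERS = 218
MEASURED_JSONB_MB = 42
MEASURED_MARKDOWN_MB = 10


def estimate_storage_mb(paper_count: int) -> dict:
    """Scale the measured corpus totals linearly to a different corpus size."""
    scale = paper_count / MEASURED_PAPERS
    return {
        "jsonb_mb": MEASURED_JSONB_MB * scale,
        "markdown_mb": MEASURED_MARKDOWN_MB * scale,
    }


estimate = estimate_storage_mb(10_000)
print(f"JSONB: {estimate['jsonb_mb'] / 1000:.1f} GB")
print(f"Markdown: {estimate['markdown_mb']:.0f} MB")
```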

7.3 Query Performance (PostgreSQL)

Query Type	Target Latency	Index Used
Get document by arxiv_id	< 5ms	B-tree on (tenant_id, arxiv_id)
Full-text search across corpus	< 100ms	GIN on tsvector
Component type filter	< 50ms	GIN on JSONB components
Quality score range query	< 30ms	B-tree expression index on the numeric score extracted from quality_score (GIN serves containment, not range comparisons)
Cross-paper comparison (10 papers)	< 200ms	B-tree + GIN

7.4 Token Efficiency

Consumption Mode	Tokens per Paper	Cost (Sonnet)
Raw PDF text to agent	~15,000 tokens	~$0.045
UDOM markdown to agent	~8,000 tokens	~$0.024
UDOM targeted query (equations only)	~1,500 tokens	~$0.005
Savings: UDOM vs. raw PDF	~47% reduction	~47% cost reduction
Savings: UDOM targeted vs. raw PDF	~90% reduction	~90% cost reduction

This token efficiency compounds across agent interactions — a research agent querying 20 papers via targeted UDOM search uses ~30,000 tokens versus ~300,000 tokens for raw PDF ingestion.
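
The figures in the table and the 20-paper comparison follow from simple arithmetic. The sketch below reproduces them; the price of $3.00 per million input tokens is inferred from the table itself ($0.045 for 15,000 tokens) rather than taken from a price list, and the names used are hypothetical.

```python
PRICE_PER_MILLION_TOKENS_USD = 3.00  # inferred from the table: $0.045 / 15,000 tokens

# Tokens per paper, copied from the table above.
TOKENS_PER_PAPER = {
    "raw_pdf": 15_000,
    "udom_markdown": 8_000,
    "udom_targeted": 1_500,
}


def estimate(mode: str, papers: int) -> tuple[int, float]:
    """Return (total tokens, cost in USD) for reading `papers` papers in a given mode."""
    tokens = TOKENS_PER_PAPER[mode] * papers
    return tokens, tokens * PRICE_PER_MILLION_TOKENS_USD / 1_000_000


def reduction_vs_raw(mode: str) -> float:
    """Fractional token reduction relative to raw PDF text."""
    return 1 - TOKENS_PER_PAPER[mode] / TOKENS_PER_PAPER["raw_pdf"]


for mode in TOKENS_PER_PAPER:
    tokens, cost = estimate(mode, papers=20)
    print(f"{mode}: {tokens} tokens, ${cost:.2f}, {reduction_vs_raw(mode):.0%} reduction")
```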

