
Technical Design Document: UDOM Pipeline

Version: 1.0 | Date: 2026-02-09
Classification: Architecture — Technical Design
Subsystem: UDOM Pipeline Integration into CODITECT


1. APIs & Extension Points

1.1 UDOM Pipeline API (Internal — Agent-Facing)

POST   /api/v1/udom/extract               # Submit single paper for extraction
POST   /api/v1/udom/batch                 # Submit batch extraction job
GET    /api/v1/udom/batch/:id/status      # Batch job status
GET    /api/v1/udom/documents/:arxiv_id   # Retrieve UDOM document
POST   /api/v1/udom/search                # Semantic search across UDOM store
GET    /api/v1/udom/corpora               # List tenant's corpora
GET    /api/v1/udom/quality/:corpus       # Quality report for corpus
DELETE /api/v1/udom/documents/:arxiv_id   # Remove document (soft delete)

1.2 Extraction Request Schema

interface ExtractionRequest {
  arxiv_id: string;                             // e.g., "2003.05991"
  sources?: ("docling" | "ar5iv" | "latex")[];  // Default: all three
  quality_threshold?: number;                   // Default: 0.85
  max_retries?: number;                         // Default: 2
  corpus?: string;                              // Target corpus name
  priority?: "low" | "normal" | "high";         // Queue priority
  metadata?: Record<string, string>;            // Custom metadata
}

interface BatchRequest {
  arxiv_ids: string[];
  max_concurrent?: number;                      // Default: 5
  quality_threshold?: number;
  corpus?: string;
  callback_url?: string;                        // Webhook on completion
}
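
A minimal sketch of how a Python client might assemble an ExtractionRequest body using the defaults documented in the schema comments. The helper name `build_extraction_request`, the 0 to 1 range check on `quality_threshold`, and the choice to omit unset optional fields are assumptions made for illustration; they are not part of the documented API.

```python
import json

ALLOWED_SOURCES = ("docling", "ar5iv", "latex")
ALLOWED_PRIORITIES = ("low", "normal", "high")


def build_extraction_request(arxiv_id, sources=None, quality_threshold=0.85,
                             max_retries=2, corpus=None, priority=None,
                             metadata=None):
    """Return a JSON-serializable dict shaped like ExtractionRequest.

    Hypothetical helper: defaults mirror the schema comments.
    """
    sources = list(sources) if sources is not None else list(ALLOWED_SOURCES)
    unknown = [s for s in sources if s not in ALLOWED_SOURCES]
    if unknown:
        raise ValueError(f"unknown sources: {unknown}")
    # Assumed range; the schema only states the default of 0.85.
    if not 0.0 <= quality_threshold <= 1.0:
        raise ValueError("quality_threshold must be between 0 and 1")
    if priority is not None and priority not in ALLOWED_PRIORITIES:
        raise ValueError(f"unknown priority: {priority}")

    body = {
        "arxiv_id": arxiv_id,
        "sources": sources,
        "quality_threshold": quality_threshold,
        "max_retries": max_retries,
    }
    # Optional fields are left out rather than sent as null (an assumption).
    if corpus is not None:
        body["corpus"] = corpus
    if priority is not None:
        body["priority"] = priority
    if metadata:
        body["metadata"] = dict(metadata)
    return body


request_body = build_extraction_request("2003.05991", corpus="default", priority="high")
print(json.dumps(request_body, indent=2))
```

The resulting dict would be sent as the JSON body of POST /api/v1/udom/extract.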

1.3 Extension Points

| Extension Point | Interface | Use Case |
| --- | --- | --- |
| Source Adapter | SourceAdapter (Python ABC) | Add new publishers (PubMed, IEEE, Springer) |
| Fusion Strategy | FusionStrategy (Python ABC) | Custom fusion rules per domain (e.g., medical papers prioritize different sources) |
| Quality Dimension | QualityDimension (Python ABC) | Add domain-specific quality checks (e.g., "clinical trial table format") |
| Post-Processor | PostProcessor (Python ABC) | Transform UDOM output (e.g., generate summary, extract key findings) |
| KaTeX Macro Pack | JSON dictionary | Add domain-specific LaTeX macros for Navigator rendering |
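
The document names the SourceAdapter extension point but does not give its method signatures. The sketch below shows one shape such a Python ABC could take. The methods `can_handle` and `extract`, the `name` attribute, and the `PubMedAdapter` class with its "PMC" prefix check are all hypothetical.

```python
from abc import ABC, abstractmethod


class SourceAdapter(ABC):
    """Hypothetical shape of the SourceAdapter extension point."""

    name: str = "unnamed"

    @abstractmethod
    def can_handle(self, paper_id: str) -> bool:
        """Return True if this adapter knows how to fetch the given paper."""

    @abstractmethod
    def extract(self, paper_id: str) -> list[dict]:
        """Return extracted components as dicts in UDOMComponent shape."""


class PubMedAdapter(SourceAdapter):
    """Illustrative adapter; returns a placeholder instead of fetching."""

    name = "pubmed"

    def can_handle(self, paper_id: str) -> bool:
        # Assumed identifier convention, for illustration only.
        return paper_id.startswith("PMC")

    def extract(self, paper_id: str) -> list[dict]:
        return [{
            "type": "heading",
            "content": f"Placeholder title for {paper_id}",
            "source": self.name,
            "confidence": 0.5,
            "position": 0,
        }]


adapter = PubMedAdapter()
paper_id = "PMC0000000"
components = adapter.extract(paper_id) if adapter.can_handle(paper_id) else []
print(components)
```

A new adapter would also need its source name added to the `source` enum in the UDOMComponent schema, which currently allows only docling, ar5iv, latex, and fused.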

2. Configuration Surfaces

2.1 Pipeline Configuration

# udom-pipeline.yaml
extraction:
  docling:
    enabled: true
    timeout_seconds: 30
    max_memory_mb: 2048
    version: ">=2.0,<3.0"
  ar5iv:
    enabled: true
    timeout_seconds: 15
    rate_limit_rps: 5
    cache_ttl_hours: 720  # 30 days; papers don't change
    fallback_on_failure: true
  latex:
    enabled: true
    timeout_seconds: 45
    pandoc_args: ["--wrap=none", "--from=latex+raw_tex"]
    custom_macros_file: "macros.json"

fusion:
  strategy: "confidence_weighted"  # or "source_priority" or "voting"
  confidence_weights:
    equation_display: { latex: 0.95, ar5iv: 0.90, docling: 0.70 }
    equation_inline: { ar5iv: 0.95, latex: 0.90, docling: 0.65 }
    table: { ar5iv: 0.92, docling: 0.80, latex: 0.75 }
    heading: { docling: 0.95, ar5iv: 0.85, latex: 0.80 }
    paragraph: { docling: 0.90, ar5iv: 0.85, latex: 0.80 }

quality:
  threshold: 0.85  # Grade A minimum
  max_retries: 2
  dimension_weights:
    structure: 0.12
    tables: 0.12
    math: 0.15
    citations: 0.10
    images: 0.08
    content_density: 0.10
    latex_residual: 0.13
    heading_hierarchy: 0.10
    bibliography: 0.10

batch:
  max_concurrent: 10
  checkpoint_on_failure: true
  audit_trail: true

store:
  connection_pool_size: 20
  jsonb_gin_index: true
  full_text_search: true
  retention_days: null  # null = infinite retention
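
The configuration names a "confidence_weighted" fusion strategy without spelling out the selection rule. One plausible reading, sketched below, multiplies each candidate's extractor confidence by the configured per-type source weight and keeps the highest-scoring candidate. The function `pick_candidate`, the `fused_score` field, and the sample candidates are assumptions for illustration.

```python
# Weights copied from fusion.confidence_weights in udom-pipeline.yaml.
CONFIDENCE_WEIGHTS = {
    "equation_display": {"latex": 0.95, "ar5iv": 0.90, "docling": 0.70},
    "equation_inline": {"ar5iv": 0.95, "latex": 0.90, "docling": 0.65},
    "table": {"ar5iv": 0.92, "docling": 0.80, "latex": 0.75},
    "heading": {"docling": 0.95, "ar5iv": 0.85, "latex": 0.80},
    "paragraph": {"docling": 0.90, "ar5iv": 0.85, "latex": 0.80},
}


def pick_candidate(kind, candidates):
    """Pick the candidate with the highest weight * confidence product.

    kind: a key of CONFIDENCE_WEIGHTS.
    candidates: dicts with "source", "confidence", and "content" keys.
    Assumed rule; the source document does not define the formula.
    """
    weights = CONFIDENCE_WEIGHTS[kind]

    def score(candidate):
        # Sources without a configured weight score zero.
        return weights.get(candidate["source"], 0.0) * candidate["confidence"]

    best = max(candidates, key=score)
    return {**best, "fused_score": round(score(best), 4)}


# Hypothetical candidates for the same table from the three extractors.
table_candidates = [
    {"source": "docling", "confidence": 0.95, "content": "table via docling"},
    {"source": "ar5iv", "confidence": 0.80, "content": "table via ar5iv"},
    {"source": "latex", "confidence": 0.90, "content": "table via latex"},
]
chosen = pick_candidate("table", table_candidates)
print(chosen["source"], chosen["fused_score"])
```

Under this rule a lower-weighted source can still win when its extractor reports markedly higher confidence, as docling does in the example (0.80 × 0.95 = 0.76 against 0.92 × 0.80 = 0.736 for ar5iv).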

2.2 Tenant Overrides

Tenants can override specific configuration values:

# tenant-overrides/tenant-pharma-123.yaml
quality:
  threshold: 0.90  # Higher threshold for FDA-regulated workflows

extraction:
  ar5iv:
    rate_limit_rps: 2  # Conservative rate limiting

fusion:
  confidence_weights:
    table:
      ar5iv: 0.95  # Pharma prioritizes table accuracy
      docling: 0.85

custom:
  macros_file: "pharma-macros.json"  # Domain-specific LaTeX macros
  post_processors: ["extract_clinical_data", "flag_adverse_events"]
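
The document does not specify how overrides are combined with the base configuration. A recursive merge in which tenant values win and unlisted keys fall through to the base is one reasonable interpretation, sketched here. The function `deep_merge` and the trimmed-down dictionaries are illustrative assumptions.

```python
import copy


def deep_merge(base, override):
    """Return a new dict where override wins and nested dicts merge recursively."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Excerpts of udom-pipeline.yaml and the pharma tenant override, as dicts.
base_config = {
    "quality": {"threshold": 0.85, "max_retries": 2},
    "extraction": {"ar5iv": {"rate_limit_rps": 5, "timeout_seconds": 15}},
    "fusion": {"confidence_weights": {
        "table": {"ar5iv": 0.92, "docling": 0.80, "latex": 0.75},
    }},
}
tenant_override = {
    "quality": {"threshold": 0.90},
    "extraction": {"ar5iv": {"rate_limit_rps": 2}},
    "fusion": {"confidence_weights": {"table": {"ar5iv": 0.95, "docling": 0.85}}},
}

effective = deep_merge(base_config, tenant_override)
print(effective["quality"])
print(effective["fusion"]["confidence_weights"]["table"])
```

With this merge, the tenant keeps the base `latex` table weight of 0.75 and the base `max_retries` of 2 because the override file does not mention them.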

3. Packaging & Deployment

3.1 Container Images

| Image | Base | Size | Contents |
| --- | --- | --- | --- |
| udom-orchestrator | python:3.11-slim | ~450MB | Orchestration logic, NATS client, PostgreSQL client |
| udom-worker-docling | python:3.11-slim | ~1.2GB | Docling engine + model files |
| udom-worker-ar5iv | python:3.11-slim | ~200MB | BeautifulSoup, lxml, HTTP client |
| udom-worker-latex | python:3.11-slim + pandoc | ~350MB | Pandoc, tarfile handling, macro expansion |
| udom-navigator | nginx:alpine | ~30MB | Static HTML/JS/CSS |

3.2 Helm Chart Values

# values.yaml
replicaCount:
  orchestrator: 1
  docling: 3
  ar5iv: 2
  latex: 2
  scorer: 1

resources:
  docling:
    requests: { cpu: "1000m", memory: "2Gi" }
    limits: { cpu: "2000m", memory: "4Gi" }
  ar5iv:
    requests: { cpu: "250m", memory: "256Mi" }
    limits: { cpu: "500m", memory: "512Mi" }
  latex:
    requests: { cpu: "500m", memory: "512Mi" }
    limits: { cpu: "1000m", memory: "1Gi" }

nats:
  url: "nats://nats:4222"
  subjects:
    extract_docling: "udom.extract.docling"
    extract_ar5iv: "udom.extract.ar5iv"
    extract_latex: "udom.extract.latex"
    fusion: "udom.fusion"
    quality: "udom.quality"
    audit: "udom.audit"

postgresql:
  host: "postgres"
  database: "udom"
  pool_size: 20

4. Data Model

4.1 Core Tables

-- UDOM Documents
CREATE TABLE udom_documents (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id UUID NOT NULL,
    arxiv_id TEXT NOT NULL,
    title TEXT,
    authors JSONB,
    abstract TEXT,
    components JSONB NOT NULL,
    quality_score JSONB NOT NULL,
    source_stats JSONB,
    corpus TEXT DEFAULT 'default',
    version INTEGER DEFAULT 1,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE (tenant_id, arxiv_id, version)
);

-- Batch Runs
CREATE TABLE udom_batch_runs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id UUID NOT NULL,
    status TEXT CHECK (status IN ('pending', 'running', 'completed', 'failed')),
    paper_count INTEGER,
    completed_count INTEGER DEFAULT 0,
    failed_count INTEGER DEFAULT 0,
    configuration JSONB,
    report JSONB,
    started_at TIMESTAMPTZ,
    completed_at TIMESTAMPTZ,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Audit Trail (append-only)
CREATE TABLE udom_audit_events (
    id BIGSERIAL PRIMARY KEY,
    tenant_id UUID NOT NULL,
    event_type TEXT NOT NULL,
    arxiv_id TEXT,
    batch_id UUID,
    payload JSONB NOT NULL,
    content_hash TEXT,  -- SHA-256 of component content
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Corpora (named collections)
CREATE TABLE udom_corpora (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id UUID NOT NULL,
    name TEXT NOT NULL,
    description TEXT,
    paper_count INTEGER DEFAULT 0,
    quality_stats JSONB,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE (tenant_id, name)
);
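
The `content_hash` column of `udom_audit_events` is described as a SHA-256 of component content. The sketch below shows how a worker might compute that hash and assemble a row for the table. The helper names, the `component.stored` event type, the payload contents, and the placeholder tenant UUID are hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def content_hash(content: str) -> str:
    """Hex-encoded SHA-256 of the UTF-8 bytes of a component's content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def build_audit_event(tenant_id, event_type, arxiv_id, content, batch_id=None):
    """Build a dict whose keys match the udom_audit_events columns.

    The id column is omitted because BIGSERIAL assigns it on insert.
    """
    return {
        "tenant_id": tenant_id,
        "event_type": event_type,
        "arxiv_id": arxiv_id,
        "batch_id": batch_id,
        "payload": {"content_length": len(content)},  # illustrative payload
        "content_hash": content_hash(content),
        "created_at": datetime.now(timezone.utc).isoformat(),
    }


event = build_audit_event(
    tenant_id="00000000-0000-0000-0000-000000000000",  # placeholder UUID
    event_type="component.stored",                     # hypothetical event name
    arxiv_id="2003.05991",
    content="abc",
)
print(event["content_hash"])
```

Storing the hash alongside an append-only event makes it possible to detect later modification of a component by recomputing the digest and comparing.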

4.2 UDOMComponent JSON Schema

{
  "type": "object",
  "required": ["type", "content", "source", "confidence", "position"],
  "properties": {
    "type": {
      "type": "string",
      "enum": ["heading", "paragraph", "equation", "figure", "table",
               "citation", "code", "list", "abstract", "bibliography",
               "caption", "footnote", "algorithm", "theorem", "proof",
               "definition", "example", "remark", "appendix", "acknowledgment",
               "author_info", "metadata", "reference", "supplementary", "unknown"]
    },
    "content": { "type": "string" },
    "source": { "type": "string", "enum": ["docling", "ar5iv", "latex", "fused"] },
    "confidence": { "type": "number", "minimum": 0, "maximum": 1 },
    "position": { "type": "integer", "minimum": 0 },
    "metadata": {
      "type": "object",
      "properties": {
        "level": { "type": "integer" },
        "display": { "type": "string", "enum": ["inline", "block"] },
        "caption": { "type": "string" },
        "alt_sources": {
          "type": "array",
          "items": { "type": "string" }
        }
      }
    }
  }
}

5. Security Integration

5.1 Data Flow Security

| Data Flow | Security Control | Implementation |
| --- | --- | --- |
| arXiv PDF download | TLS 1.3 | HTTPS only, certificate pinning |
| ar5iv HTML fetch | TLS 1.3 | HTTPS only, response validation |
| LaTeX source download | TLS 1.3 | HTTPS only, content-type verification |
| NATS messaging | mTLS | Worker-to-NATS mutual TLS authentication |
| PostgreSQL connections | TLS + SCRAM-SHA-256 | Encrypted connections, password hashing |
| Navigator access | OIDC + RBAC | Tenant-scoped authentication |

5.2 Content Security

| Threat | Mitigation |
| --- | --- |
| Malicious PDF | Docling sandboxed extraction; no JavaScript execution in PDF parser |
| LaTeX injection | Pandoc runs in restricted mode; no \input from external URLs |
| XSS in Navigator | KaTeX renders to safe HTML; CSP headers on Navigator |
| SSRF via paper URLs | Allowlist: arxiv.org, ar5iv.labs.arxiv.org only; no arbitrary URL fetching |
| PHI in papers | Pre-extraction PHI scanner for regulated tenants (configurable) |
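
The SSRF mitigation can be illustrated with a small host allowlist check. This is a sketch under stated assumptions: the function name `is_allowed_url`, the HTTPS-only rule (taken from the data flow controls), and exact host matching with no subdomain wildcarding are choices made for the example, not a description of the actual fetcher.

```python
from urllib.parse import urlsplit

# Hosts named in the SSRF mitigation row.
ALLOWED_HOSTS = frozenset({"arxiv.org", "ar5iv.labs.arxiv.org"})


def is_allowed_url(url: str) -> bool:
    """Accept only HTTPS URLs whose host exactly matches the allowlist."""
    try:
        parts = urlsplit(url)
    except ValueError:
        # Malformed URLs (for example, a broken IPv6 literal) are rejected.
        return False
    if parts.scheme != "https":
        return False
    host = parts.hostname  # lowercased; excludes userinfo and port
    return host is not None and host in ALLOWED_HOSTS


for candidate in (
    "https://arxiv.org/pdf/2003.05991",
    "https://ar5iv.labs.arxiv.org/html/2003.05991",
    "http://arxiv.org/pdf/2003.05991",
    "https://arxiv.org.evil.example/pdf/2003.05991",
    "https://arxiv.org@evil.example/pdf/2003.05991",
):
    print(candidate, is_allowed_url(candidate))
```

Comparing the parsed hostname rather than searching the raw string is what defeats the last two lookalike URLs. A production check would also need to validate the destination after any HTTP redirect.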

5.3 Access Control

RBAC Roles:
├── udom:admin — Full CRUD on all corpora, configuration, batch management
├── udom:researcher — Read corpora, submit extraction requests, search
├── udom:reviewer — Read corpora, view quality reports, approve checkpoints
└── udom:agent — Programmatic read access to UDOM store (agent service account)
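
As a rough illustration of how these roles could gate operations, the sketch below maps each role to a permission set and checks a caller's roles against a required permission. The permission strings (such as `corpus:read` and `checkpoint:approve`) and the function `is_allowed` are invented for the example; only the role names and their described capabilities come from the list above.

```python
# Permission names are hypothetical; they paraphrase the role descriptions.
ROLE_PERMISSIONS = {
    "udom:admin": frozenset({
        "corpus:read", "corpus:write", "corpus:delete",
        "config:write", "batch:manage",
    }),
    "udom:researcher": frozenset({"corpus:read", "extract:submit", "search:query"}),
    "udom:reviewer": frozenset({"corpus:read", "quality:read", "checkpoint:approve"}),
    "udom:agent": frozenset({"corpus:read"}),
}


def is_allowed(roles, permission):
    """Return True if any of the caller's roles grants the permission."""
    return any(
        permission in ROLE_PERMISSIONS.get(role, frozenset())
        for role in roles
    )


print(is_allowed(["udom:agent"], "corpus:read"))
print(is_allowed(["udom:agent"], "extract:submit"))
print(is_allowed(["udom:researcher", "udom:reviewer"], "checkpoint:approve"))
```

Unknown roles grant nothing, so a typo in a role name fails closed.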

6. Example Interfaces

6.1 Python — Core Types

from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from uuid import UUID


class ComponentType(str, Enum):
    HEADING = "heading"
    PARAGRAPH = "paragraph"
    EQUATION = "equation"
    FIGURE = "figure"
    TABLE = "table"
    CITATION = "citation"
    CODE = "code"
    LIST = "list"
    ABSTRACT = "abstract"
    BIBLIOGRAPHY = "bibliography"
    CAPTION = "caption"
    FOOTNOTE = "footnote"
    ALGORITHM = "algorithm"
    THEOREM = "theorem"
    PROOF = "proof"
    DEFINITION = "definition"
    EXAMPLE = "example"
    REMARK = "remark"
    APPENDIX = "appendix"
    ACKNOWLEDGMENT = "acknowledgment"
    AUTHOR_INFO = "author_info"
    METADATA = "metadata"
    REFERENCE = "reference"
    SUPPLEMENTARY = "supplementary"
    UNKNOWN = "unknown"


class ExtractionSource(str, Enum):
    DOCLING = "docling"
    AR5IV = "ar5iv"
    LATEX = "latex"
    FUSED = "fused"


class QualityGrade(str, Enum):
    A = "A"  # >= 0.85
    B = "B"  # >= 0.70
    C = "C"  # < 0.70


@dataclass(frozen=True)
class UDOMComponent:
    type: ComponentType
    content: str
    source: ExtractionSource
    confidence: float
    position: int
    metadata: dict = field(default_factory=dict)


@dataclass
class QualityScore:
    structure: float
    tables: float
    math: float
    citations: float
    images: float
    content_density: float
    latex_residual: float
    heading_hierarchy: float
    bibliography: float
    # Derived in __post_init__, so not accepted as constructor arguments.
    overall: float = field(init=False, default=0.0)
    grade: QualityGrade = field(init=False, default=QualityGrade.C)

    def __post_init__(self):
        # Same order as the fields above; matches quality.dimension_weights
        # in udom-pipeline.yaml and sums to 1.0.
        weights = [0.12, 0.12, 0.15, 0.10, 0.08, 0.10, 0.13, 0.10, 0.10]
        scores = [self.structure, self.tables, self.math, self.citations,
                  self.images, self.content_density, self.latex_residual,
                  self.heading_hierarchy, self.bibliography]
        self.overall = sum(w * s for w, s in zip(weights, scores))
        self.grade = (QualityGrade.A if self.overall >= 0.85
                      else QualityGrade.B if self.overall >= 0.70
                      else QualityGrade.C)


@dataclass
class UDOMDocument:
    id: UUID
    tenant_id: UUID
    arxiv_id: str
    title: str
    components: list[UDOMComponent]
    quality_score: QualityScore
    source_stats: dict
    corpus: str = "default"
    version: int = 1
    # Timezone-aware UTC, matching the TIMESTAMPTZ column; datetime.utcnow()
    # returns a naive value and is deprecated since Python 3.12.
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
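
A worked example of the weighted quality score may help. The sketch below restates the computation as a standalone function keyed by dimension name, using the weights from the pipeline configuration and the grade cutoffs from QualityGrade. The function names and the sample per-dimension scores are made up for illustration.

```python
# Weights from quality.dimension_weights in udom-pipeline.yaml.
DIMENSION_WEIGHTS = {
    "structure": 0.12, "tables": 0.12, "math": 0.15, "citations": 0.10,
    "images": 0.08, "content_density": 0.10, "latex_residual": 0.13,
    "heading_hierarchy": 0.10, "bibliography": 0.10,
}


def overall_score(scores):
    """Weighted sum of per-dimension scores, each expected in [0, 1]."""
    return sum(weight * scores[name] for name, weight in DIMENSION_WEIGHTS.items())


def grade_for(overall):
    """Grade cutoffs: A at 0.85 and above, B at 0.70 and above, otherwise C."""
    if overall >= 0.85:
        return "A"
    if overall >= 0.70:
        return "B"
    return "C"


# Hypothetical per-dimension scores for one paper.
sample = {
    "structure": 0.90, "tables": 0.80, "math": 0.95, "citations": 0.85,
    "images": 0.70, "content_density": 0.90, "latex_residual": 0.90,
    "heading_hierarchy": 0.85, "bibliography": 0.80,
}
overall = overall_score(sample)
print(round(overall, 4), grade_for(overall))

# Lowering only the tables score by 0.10 costs 0.12 * 0.10 = 0.012 overall.
weaker = dict(sample, tables=0.70)
overall_weaker = overall_score(weaker)
print(round(overall_weaker, 4), grade_for(overall_weaker))
```

The first paper scores 0.8595 and earns grade A; the second scores 0.8475 and drops to B, which shows how a single weak dimension can push a borderline paper below the 0.85 threshold. Keying weights by name rather than by list position avoids silent misalignment if a dimension is added or reordered.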

6.2 TypeScript — Agent Tool Types

// UDOM types for CODITECT agent integration

export enum ComponentType {
  Heading = "heading",
  Paragraph = "paragraph",
  Equation = "equation",
  Figure = "figure",
  Table = "table",
  Citation = "citation",
  Code = "code",
  List = "list",
  Abstract = "abstract",
  Bibliography = "bibliography",
  // ... 15 more
}

export enum ExtractionSource {
  Docling = "docling",
  Ar5iv = "ar5iv",
  Latex = "latex",
  Fused = "fused",
}

export interface UDOMComponent {
  readonly type: ComponentType;
  readonly content: string;
  readonly source: ExtractionSource;
  readonly confidence: number;
  readonly position: number;
  readonly metadata: Record<string, unknown>;
}

export interface QualityScore {
  readonly structure: number;
  readonly tables: number;
  readonly math: number;
  readonly citations: number;
  readonly images: number;
  readonly contentDensity: number;
  readonly latexResidual: number;
  readonly headingHierarchy: number;
  readonly bibliography: number;
  readonly overall: number;
  readonly grade: "A" | "B" | "C";
}

export interface UDOMDocument {
  readonly id: string;
  readonly tenantId: string;
  readonly arxivId: string;
  readonly title: string;
  readonly components: ReadonlyArray<UDOMComponent>;
  readonly qualityScore: QualityScore;
  readonly sourceStats: {
    readonly docling: number;
    readonly ar5iv: number;
    readonly latex: number;
    readonly fused: number;
    readonly elapsedSeconds: number;
  };
  readonly corpus: string;
  readonly version: number;
  readonly createdAt: string;
}

// Agent tool parameter types
export interface SearchParams {
  query: string;
  tenantId: string;
  corpus?: string;
  componentType?: ComponentType;
  limit?: number;
  minConfidence?: number;
}

export interface CrossPaperComparisonParams {
  metric: string;
  paperIds: string[];
  tenantId: string;
}

export interface ComparisonResult {
  metric: string;
  papers: Array<{
    arxivId: string;
    title: string;
    relevantTables: UDOMComponent[];
    relevantEquations: UDOMComponent[];
  }>;
}

7. Performance Characteristics

7.1 Extraction Latency

| Source | P50 | P95 | P99 | Bottleneck |
| --- | --- | --- | --- | --- |
| Docling (PDF) | 5.2s | 8.1s | 12.3s | CPU (PDF parsing) |
| ar5iv (HTML) | 2.1s | 3.5s | 5.8s | Network (HTTP fetch) |
| LaTeX (source) | 6.8s | 11.2s | 18.5s | CPU (pandoc) + Network (tar download) |
| Fusion | 0.3s | 0.5s | 0.8s | CPU (component matching) |
| Quality scoring | 0.1s | 0.2s | 0.3s | CPU (score computation) |
| Total (parallel) | 8.4s | 14.2s | 22.1s | Network + CPU |
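
The "Total (parallel)" row is well below the sum of the three source latencies because the extractors run concurrently, so wall-clock time is dominated by the slowest source rather than the sum. The sketch below shows that pattern with asyncio, applying the per-source timeouts from udom-pipeline.yaml. The extractor is simulated with `asyncio.sleep` (P50 values scaled down 1000×), and the function names are assumptions.

```python
import asyncio

# Per-source timeouts in seconds, from extraction.*.timeout_seconds.
TIMEOUTS = {"docling": 30, "ar5iv": 15, "latex": 45}


async def fake_extract(source, seconds):
    """Stand-in for a real extractor: waits, then reports success."""
    await asyncio.sleep(seconds)
    return f"{source}: ok"


async def extract_all(jobs, timeouts):
    """Run all extraction coroutines concurrently.

    jobs maps source name to a coroutine. A source that fails or times out
    yields its exception object instead of aborting the other sources.
    """
    names = list(jobs)
    guarded = [asyncio.wait_for(jobs[name], timeout=timeouts[name]) for name in names]
    results = await asyncio.gather(*guarded, return_exceptions=True)
    return dict(zip(names, results))


async def main():
    jobs = {
        "docling": fake_extract("docling", 0.0052),
        "ar5iv": fake_extract("ar5iv", 0.0021),
        "latex": fake_extract("latex", 0.0068),
    }
    return await extract_all(jobs, TIMEOUTS)


results = asyncio.run(main())
print(results)
```

Returning exceptions as values lets fusion proceed with whichever sources succeeded, which is consistent with the `fallback_on_failure` setting in the configuration, though the actual orchestrator logic is not described in this document.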

7.2 Storage Requirements

| Metric | Value |
| --- | --- |
| Average UDOM JSON per paper | ~150 KB |
| Average markdown per paper | ~45 KB |
| Average components per paper | 300+ |
| 218-paper corpus total | ~42 MB JSONB + ~10 MB markdown |
| 10,000-paper corpus estimate | ~1.9 GB JSONB + ~450 MB markdown |
| PostgreSQL with GIN indexes | ~3× raw data size |
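
The 10,000-paper estimate can be reproduced by scaling the measured 218-paper corpus linearly, as in the sketch below. Linear scaling and applying the ~3× index multiplier to the JSONB figure only are assumptions. Note that the corpus total implies roughly 193 KB of JSONB per paper, somewhat above the ~150 KB average JSON size; the document does not explain the difference.

```python
MEASURED_PAPERS = 218
MEASURED_JSONB_MB = 42
MEASURED_MARKDOWN_MB = 10
INDEX_MULTIPLIER = 3  # "~3x raw data size" with GIN indexes


def scale_mb(measured_mb, measured_papers, target_papers):
    """Linear extrapolation of storage from a measured corpus."""
    return measured_mb / measured_papers * target_papers


target = 10_000
jsonb_mb = scale_mb(MEASURED_JSONB_MB, MEASURED_PAPERS, target)
markdown_mb = scale_mb(MEASURED_MARKDOWN_MB, MEASURED_PAPERS, target)
indexed_jsonb_mb = jsonb_mb * INDEX_MULTIPLIER

print(f"JSONB:          {jsonb_mb / 1000:.2f} GB")
print(f"Markdown:       {markdown_mb:.0f} MB")
print(f"JSONB + GIN:    {indexed_jsonb_mb / 1000:.2f} GB")
print(f"JSONB per paper: {MEASURED_JSONB_MB * 1000 / MEASURED_PAPERS:.0f} KB")
```

This gives about 1.93 GB of JSONB and about 459 MB of markdown, in line with the table's rounded estimates, and on the order of 5.8 GB once GIN indexes are included.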

7.3 Query Performance (PostgreSQL)

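| Query Type | Target Latency | Index Used |
| --- | --- | --- |
| Get document by arxiv_id | < 5ms | B-tree on (tenant_id, arxiv_id) |
| Full-text search across corpus | < 100ms | GIN on tsvector |
| Component type filter | < 50ms | GIN on JSONB components |
| Quality score range query | < 30ms | B-tree expression index on the overall score in quality_score (JSONB GIN indexes do not accelerate range comparisons) |
| Cross-paper comparison (10 papers) | < 200ms | B-tree + GIN |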

7.4 Token Efficiency

| Consumption Mode | Tokens per Paper | Cost (Sonnet) |
| --- | --- | --- |
| Raw PDF text to agent | ~15,000 tokens | ~$0.045 |
| UDOM markdown to agent | ~8,000 tokens | ~$0.024 |
| UDOM targeted query (equations only) | ~1,500 tokens | ~$0.005 |
| Savings: UDOM vs. raw PDF | ~47% reduction | ~47% cost reduction |
| Savings: UDOM targeted vs. raw PDF | ~90% reduction | ~90% cost reduction |

This token efficiency compounds across agent interactions: a research agent querying 20 papers via targeted UDOM search uses ~30,000 tokens (20 × 1,500), versus ~300,000 tokens (20 × 15,000) for raw PDF ingestion.
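
The savings figures follow from simple arithmetic, reproduced in the sketch below. The per-token price is not stated in the document; it is inferred here from the table's first row ($0.045 for 15,000 tokens, about $3 per million tokens) and should be treated as an assumption.

```python
# Tokens per paper for each consumption mode, from the table.
TOKENS_PER_PAPER = {
    "raw_pdf": 15_000,
    "udom_markdown": 8_000,
    "udom_targeted": 1_500,
}

# Inferred from the table's raw PDF row; an assumption, not a quoted price.
PRICE_PER_TOKEN_USD = 0.045 / 15_000


def cost_usd(tokens):
    return tokens * PRICE_PER_TOKEN_USD


def reduction(baseline_tokens, new_tokens):
    """Fractional token reduction relative to the baseline."""
    return 1 - new_tokens / baseline_tokens


raw = TOKENS_PER_PAPER["raw_pdf"]
markdown_saving = reduction(raw, TOKENS_PER_PAPER["udom_markdown"])
targeted_saving = reduction(raw, TOKENS_PER_PAPER["udom_targeted"])

papers = 20
targeted_total = papers * TOKENS_PER_PAPER["udom_targeted"]
raw_total = papers * raw

print(f"markdown vs raw: {markdown_saving:.0%}")
print(f"targeted vs raw: {targeted_saving:.0%}")
print(f"20 papers: {targeted_total:,} vs {raw_total:,} tokens")
print(f"20 papers: ${cost_usd(targeted_total):.2f} vs ${cost_usd(raw_total):.2f}")
```

At the inferred price, the 20-paper scenario costs about $0.09 with targeted queries versus about $0.90 with raw PDF text.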


Technical Design Document covers: APIs, configuration, deployment, data model, security, interfaces, and performance characteristics.