Technical Design Document: UDOM Pipeline
Version: 1.0 | Date: 2026-02-09
Classification: Architecture — Technical Design
Subsystem: UDOM Pipeline Integration into CODITECT
1. APIs & Extension Points
1.1 UDOM Pipeline API (Internal — Agent-Facing)
POST /api/v1/udom/extract # Submit single paper for extraction
POST /api/v1/udom/batch # Submit batch extraction job
GET /api/v1/udom/batch/:id/status # Batch job status
GET /api/v1/udom/documents/:arxiv_id # Retrieve UDOM document
POST /api/v1/udom/search # Semantic search across UDOM store
GET /api/v1/udom/corpora # List tenant's corpora
GET /api/v1/udom/quality/:corpus # Quality report for corpus
DELETE /api/v1/udom/documents/:arxiv_id # Remove document (soft delete)
1.2 Extraction Request Schema
interface ExtractionRequest {
  arxiv_id: string;                             // e.g., "2003.05991"
  sources?: ("docling" | "ar5iv" | "latex")[];  // Default: all three
  quality_threshold?: number;                   // Default: 0.85
  max_retries?: number;                         // Default: 2
  corpus?: string;                              // Target corpus name
  priority?: "low" | "normal" | "high";         // Queue priority
  metadata?: Record<string, string>;            // Custom metadata
}

interface BatchRequest {
  arxiv_ids: string[];
  max_concurrent?: number;                      // Default: 5
  quality_threshold?: number;
  corpus?: string;
  callback_url?: string;                        // Webhook on completion
}
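A client can apply the documented defaults before calling POST /api/v1/udom/extract. The following is a minimal Python sketch of that step, not the pipeline's actual client. The class layout and the `to_payload` helper are hypothetical, and the default priority of "normal" is an assumption, since the schema lists the allowed values but no default.

```python
from dataclasses import dataclass, field, asdict
from typing import Optional

VALID_SOURCES = ("docling", "ar5iv", "latex")
VALID_PRIORITIES = ("low", "normal", "high")


@dataclass
class ExtractionRequest:
    arxiv_id: str
    sources: tuple = VALID_SOURCES        # documented default: all three
    quality_threshold: float = 0.85       # documented default
    max_retries: int = 2                  # documented default
    corpus: Optional[str] = None
    priority: str = "normal"              # assumption: no default is documented
    metadata: dict = field(default_factory=dict)

    def __post_init__(self):
        unknown = set(self.sources) - set(VALID_SOURCES)
        if unknown:
            raise ValueError(f"unknown sources: {sorted(unknown)}")
        if not 0.0 <= self.quality_threshold <= 1.0:
            raise ValueError("quality_threshold must be within [0, 1]")
        if self.max_retries < 0:
            raise ValueError("max_retries must be non-negative")
        if self.priority not in VALID_PRIORITIES:
            raise ValueError(f"unknown priority: {self.priority}")

    def to_payload(self) -> dict:
        """Return a JSON-serializable body, omitting unset optional fields."""
        payload = asdict(self)
        payload["sources"] = list(self.sources)
        return {key: value for key, value in payload.items() if value is not None}
```

Validating on the client side keeps obviously malformed requests out of the queue; the server would still need its own checks.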
1.3 Extension Points
| Extension Point | Interface | Use Case |
|---|---|---|
| Source Adapter | SourceAdapter (Python ABC) | Add new publishers (PubMed, IEEE, Springer) |
| Fusion Strategy | FusionStrategy (Python ABC) | Custom fusion rules per domain (e.g., medical papers prioritize different sources) |
| Quality Dimension | QualityDimension (Python ABC) | Add domain-specific quality checks (e.g., "clinical trial table format") |
| Post-Processor | PostProcessor (Python ABC) | Transform UDOM output (e.g., generate summary, extract key findings) |
| KaTeX Macro Pack | JSON dictionary | Add domain-specific LaTeX macros for Navigator rendering |
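The table names the extension interfaces but not their methods. As an illustration only, a Source Adapter could look like the sketch below. The method names `can_handle` and `extract`, the `REGISTRY` dictionary, and the `register` and `adapters_for` helpers are all hypothetical, and the PubMed adapter returns placeholder data instead of fetching anything.

```python
from abc import ABC, abstractmethod


class SourceAdapter(ABC):
    """Hypothetical shape of the SourceAdapter ABC; the real interface may differ."""

    name: str

    @abstractmethod
    def can_handle(self, paper_id: str) -> bool:
        """Return True if this adapter knows how to fetch the given identifier."""

    @abstractmethod
    def extract(self, paper_id: str) -> list[dict]:
        """Return extracted components as plain dictionaries."""


class PubMedAdapter(SourceAdapter):
    """Illustrative adapter; a real one would download and parse the paper."""

    name = "pubmed"

    def can_handle(self, paper_id: str) -> bool:
        return paper_id.startswith("PMC")

    def extract(self, paper_id: str) -> list[dict]:
        # Placeholder output so the sketch stays self-contained.
        return [{
            "type": "heading",
            "content": f"Placeholder title for {paper_id}",
            "source": self.name,
            "confidence": 0.5,
            "position": 0,
        }]


REGISTRY: dict[str, SourceAdapter] = {}


def register(adapter: SourceAdapter) -> None:
    REGISTRY[adapter.name] = adapter


def adapters_for(paper_id: str) -> list[SourceAdapter]:
    """Return every registered adapter that accepts the identifier."""
    return [adapter for adapter in REGISTRY.values() if adapter.can_handle(paper_id)]
```

Note that the UDOMComponent schema currently restricts `source` to docling, ar5iv, latex, and fused, so a new adapter name such as "pubmed" would presumably require extending that enum as well.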
2. Configuration Surfaces
2.1 Pipeline Configuration
# udom-pipeline.yaml
extraction:
  docling:
    enabled: true
    timeout_seconds: 30
    max_memory_mb: 2048
    version: ">=2.0,<3.0"
  ar5iv:
    enabled: true
    timeout_seconds: 15
    rate_limit_rps: 5
    cache_ttl_hours: 720  # 30 days; papers don't change
    fallback_on_failure: true
  latex:
    enabled: true
    timeout_seconds: 45
    pandoc_args: ["--wrap=none", "--from=latex+raw_tex"]
    custom_macros_file: "macros.json"
fusion:
  strategy: "confidence_weighted"  # or "source_priority" or "voting"
  confidence_weights:
    equation_display: { latex: 0.95, ar5iv: 0.90, docling: 0.70 }
    equation_inline: { ar5iv: 0.95, latex: 0.90, docling: 0.65 }
    table: { ar5iv: 0.92, docling: 0.80, latex: 0.75 }
    heading: { docling: 0.95, ar5iv: 0.85, latex: 0.80 }
    paragraph: { docling: 0.90, ar5iv: 0.85, latex: 0.80 }
quality:
  threshold: 0.85  # Grade A minimum
  max_retries: 2
  dimension_weights:
    structure: 0.12
    tables: 0.12
    math: 0.15
    citations: 0.10
    images: 0.08
    content_density: 0.10
    latex_residual: 0.13
    heading_hierarchy: 0.10
    bibliography: 0.10
batch:
  max_concurrent: 10
  checkpoint_on_failure: true
  audit_trail: true
store:
  connection_pool_size: 20
  jsonb_gin_index: true
  full_text_search: true
  retention_days: null  # null = infinite retention
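The configuration does not spell out how the "confidence_weighted" strategy uses these weights. One plausible reading, sketched here as an assumption, is that for each component kind the pipeline keeps the candidate from the highest-weighted source that actually produced output. The `pick_source` function and the shape of `candidates` are hypothetical.

```python
# Weights copied from fusion.confidence_weights in udom-pipeline.yaml.
CONFIDENCE_WEIGHTS = {
    "equation_display": {"latex": 0.95, "ar5iv": 0.90, "docling": 0.70},
    "equation_inline": {"ar5iv": 0.95, "latex": 0.90, "docling": 0.65},
    "table": {"ar5iv": 0.92, "docling": 0.80, "latex": 0.75},
    "heading": {"docling": 0.95, "ar5iv": 0.85, "latex": 0.80},
    "paragraph": {"docling": 0.90, "ar5iv": 0.85, "latex": 0.80},
}


def pick_source(component_kind, candidates, weights=CONFIDENCE_WEIGHTS):
    """Choose the best available candidate for one component.

    `candidates` maps a source name to its extracted content, or to None
    when that source failed or produced nothing for this component.
    Returns (source, content, weight), or None if no source delivered.
    """
    kind_weights = weights[component_kind]
    available = {
        source: content
        for source, content in candidates.items()
        if content is not None and source in kind_weights
    }
    if not available:
        return None
    best = max(available, key=lambda source: kind_weights[source])
    return best, available[best], kind_weights[best]
```

With these weights, a table that ar5iv failed to extract falls back to the Docling candidate (0.80) ahead of the LaTeX one (0.75).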
2.2 Tenant Overrides
Tenants can override specific configuration values:
# tenant-overrides/tenant-pharma-123.yaml
quality:
  threshold: 0.90  # Higher threshold for FDA-regulated workflows
extraction:
  ar5iv:
    rate_limit_rps: 2  # Conservative rate limiting
fusion:
  confidence_weights:
    table:
      ar5iv: 0.95  # Pharma prioritizes table accuracy
      docling: 0.85
custom:
  macros_file: "pharma-macros.json"  # Domain-specific LaTeX macros
  post_processors: ["extract_clinical_data", "flag_adverse_events"]
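Because the override file lists only the values that change, the effective configuration has to be a recursive merge of the tenant file over the base file. The sketch below shows one way to do that on already-parsed dictionaries; the `deep_merge` helper is hypothetical and the merge semantics (nested mappings merge, everything else is replaced) are an assumption.

```python
import copy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict in which `override` wins, merging nested mappings."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists replace the base value outright.
            merged[key] = copy.deepcopy(value)
    return merged


base = {
    "quality": {"threshold": 0.85, "max_retries": 2},
    "fusion": {"confidence_weights": {
        "table": {"ar5iv": 0.92, "docling": 0.80, "latex": 0.75},
    }},
}
tenant = {
    "quality": {"threshold": 0.90},
    "fusion": {"confidence_weights": {
        "table": {"ar5iv": 0.95, "docling": 0.85},
    }},
}
effective = deep_merge(base, tenant)
```

Under these semantics the pharma tenant raises the threshold and two table weights while inheriting `max_retries: 2` and the LaTeX table weight of 0.75 from the base file.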
3. Packaging & Deployment
3.1 Container Images
| Image | Base | Size | Contents |
|---|---|---|---|
| udom-orchestrator | python:3.11-slim | ~450MB | Orchestration logic, NATS client, PostgreSQL client |
| udom-worker-docling | python:3.11-slim | ~1.2GB | Docling engine + model files |
| udom-worker-ar5iv | python:3.11-slim | ~200MB | BeautifulSoup, lxml, HTTP client |
| udom-worker-latex | python:3.11-slim + pandoc | ~350MB | Pandoc, tarfile handling, macro expansion |
| udom-navigator | nginx:alpine | ~30MB | Static HTML/JS/CSS |
3.2 Helm Chart Values
# values.yaml
replicaCount:
  orchestrator: 1
  docling: 3
  ar5iv: 2
  latex: 2
  scorer: 1
resources:
  docling:
    requests: { cpu: "1000m", memory: "2Gi" }
    limits: { cpu: "2000m", memory: "4Gi" }
  ar5iv:
    requests: { cpu: "250m", memory: "256Mi" }
    limits: { cpu: "500m", memory: "512Mi" }
  latex:
    requests: { cpu: "500m", memory: "512Mi" }
    limits: { cpu: "1000m", memory: "1Gi" }
nats:
  url: "nats://nats:4222"
  subjects:
    extract_docling: "udom.extract.docling"
    extract_ar5iv: "udom.extract.ar5iv"
    extract_latex: "udom.extract.latex"
    fusion: "udom.fusion"
    quality: "udom.quality"
    audit: "udom.audit"
postgresql:
  host: "postgres"
  database: "udom"
  pool_size: 20
4. Data Model
4.1 Core Tables
-- UDOM Documents
CREATE TABLE udom_documents (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id UUID NOT NULL,
    arxiv_id TEXT NOT NULL,
    title TEXT,
    authors JSONB,
    abstract TEXT,
    components JSONB NOT NULL,
    quality_score JSONB NOT NULL,
    source_stats JSONB,
    corpus TEXT DEFAULT 'default',
    version INTEGER DEFAULT 1,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE (tenant_id, arxiv_id, version)
);

-- Batch Runs
CREATE TABLE udom_batch_runs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id UUID NOT NULL,
    status TEXT CHECK (status IN ('pending', 'running', 'completed', 'failed')),
    paper_count INTEGER,
    completed_count INTEGER DEFAULT 0,
    failed_count INTEGER DEFAULT 0,
    configuration JSONB,
    report JSONB,
    started_at TIMESTAMPTZ,
    completed_at TIMESTAMPTZ,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Audit Trail (append-only)
CREATE TABLE udom_audit_events (
    id BIGSERIAL PRIMARY KEY,
    tenant_id UUID NOT NULL,
    event_type TEXT NOT NULL,
    arxiv_id TEXT,
    batch_id UUID,
    payload JSONB NOT NULL,
    content_hash TEXT,  -- SHA-256 of component content
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Corpora (named collections)
CREATE TABLE udom_corpora (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id UUID NOT NULL,
    name TEXT NOT NULL,
    description TEXT,
    paper_count INTEGER DEFAULT 0,
    quality_stats JSONB,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE (tenant_id, name)
);
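The audit table stores `content_hash` as a SHA-256 of component content, but the schema does not say how the content is serialized before hashing. The sketch below assumes one possible canonical form, a compact JSON array of the component `content` strings in position order. The `content_hash` function and that canonicalization are assumptions, not the documented behavior.

```python
import hashlib
import json


def content_hash(components: list[dict]) -> str:
    """Return a hex SHA-256 digest over component content in position order."""
    ordered = sorted(components, key=lambda component: component["position"])
    canonical = json.dumps(
        [component["content"] for component in ordered],
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Whatever canonical form is chosen, it has to be stable across workers and versions; otherwise identical content would produce different hashes and the audit trail could not be used to detect changes.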
4.2 UDOMComponent JSON Schema
{
  "type": "object",
  "required": ["type", "content", "source", "confidence", "position"],
  "properties": {
    "type": {
      "type": "string",
      "enum": ["heading", "paragraph", "equation", "figure", "table",
               "citation", "code", "list", "abstract", "bibliography",
               "caption", "footnote", "algorithm", "theorem", "proof",
               "definition", "example", "remark", "appendix", "acknowledgment",
               "author_info", "metadata", "reference", "supplementary", "unknown"]
    },
    "content": { "type": "string" },
    "source": { "type": "string", "enum": ["docling", "ar5iv", "latex", "fused"] },
    "confidence": { "type": "number", "minimum": 0, "maximum": 1 },
    "position": { "type": "integer", "minimum": 0 },
    "metadata": {
      "type": "object",
      "properties": {
        "level": { "type": "integer" },
        "display": { "type": "string", "enum": ["inline", "block"] },
        "caption": { "type": "string" },
        "alt_sources": {
          "type": "array",
          "items": { "type": "string" }
        }
      }
    }
  }
}
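To make the constraints in this schema concrete, the following standard-library sketch checks the five required fields by hand. It is not a full JSON Schema validator (the optional `metadata` object is not checked), and the `validate_component` function is a hypothetical helper.

```python
COMPONENT_TYPES = frozenset({
    "heading", "paragraph", "equation", "figure", "table",
    "citation", "code", "list", "abstract", "bibliography",
    "caption", "footnote", "algorithm", "theorem", "proof",
    "definition", "example", "remark", "appendix", "acknowledgment",
    "author_info", "metadata", "reference", "supplementary", "unknown",
})
SOURCES = frozenset({"docling", "ar5iv", "latex", "fused"})
REQUIRED = ("type", "content", "source", "confidence", "position")


def validate_component(component: dict) -> list[str]:
    """Return a list of human-readable violations; empty means valid."""
    errors = [f"missing required field: {name}" for name in REQUIRED if name not in component]
    if errors:
        return errors

    kind = component["type"]
    if not isinstance(kind, str) or kind not in COMPONENT_TYPES:
        errors.append(f"type not in enum: {kind!r}")
    if not isinstance(component["content"], str):
        errors.append("content must be a string")
    source = component["source"]
    if not isinstance(source, str) or source not in SOURCES:
        errors.append(f"source not in enum: {source!r}")

    confidence = component["confidence"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0 <= confidence <= 1:
        errors.append("confidence must be a number within [0, 1]")

    position = component["position"]
    if not isinstance(position, int) or isinstance(position, bool) or position < 0:
        errors.append("position must be a non-negative integer")
    return errors
```

Returning all violations at once, instead of raising on the first, makes the result easier to attach to a quality report.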
5. Security Integration
5.1 Data Flow Security
| Data Flow | Security Control | Implementation |
|---|---|---|
| arXiv PDF download | TLS 1.3 | HTTPS only, certificate pinning |
| ar5iv HTML fetch | TLS 1.3 | HTTPS only, response validation |
| LaTeX source download | TLS 1.3 | HTTPS only, content-type verification |
| NATS messaging | mTLS | Worker-to-NATS mutual TLS authentication |
| PostgreSQL connections | TLS + SCRAM-SHA-256 | Encrypted connections, salted challenge-response password authentication |
| Navigator access | OIDC + RBAC | Tenant-scoped authentication |
5.2 Content Security
| Threat | Mitigation |
|---|---|
| Malicious PDF | Docling sandboxed extraction; no JavaScript execution in PDF parser |
| LaTeX injection | Pandoc runs in restricted mode; no \input from external URLs |
| XSS in Navigator | KaTeX renders to safe HTML; CSP headers on Navigator |
| SSRF via paper URLs | Allowlist: arxiv.org, ar5iv.labs.arxiv.org only; no arbitrary URL fetching |
| PHI in papers | Pre-extraction PHI scanner for regulated tenants (configurable) |
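The SSRF mitigation in the table above comes down to an exact-match host check before any fetch. Here is a minimal sketch using only the standard library. The `is_allowed_url` helper is hypothetical, and restricting the scheme to HTTPS and the port to the default is an assumption drawn from the "HTTPS only" rows in the data flow table.

```python
from urllib.parse import urlsplit

# Hosts copied from the allowlist in the threat table.
ALLOWED_HOSTS = frozenset({"arxiv.org", "ar5iv.labs.arxiv.org"})


def is_allowed_url(url: str) -> bool:
    """Accept only HTTPS URLs whose host exactly matches the allowlist."""
    try:
        parts = urlsplit(url)
        port = parts.port  # raises ValueError on a malformed port
    except ValueError:
        return False
    return (
        parts.scheme == "https"
        and parts.hostname in ALLOWED_HOSTS
        and port in (None, 443)
    )
```

Exact matching on the parsed hostname, not a substring or suffix test on the raw string, is what rejects lookalikes such as `arxiv.org.evil.example` or `arxiv.org@evil.example`. A fetcher that follows redirects would also need to apply the same check to every redirect target.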
5.3 Access Control
RBAC Roles:
├── udom:admin — Full CRUD on all corpora, configuration, batch management
├── udom:researcher — Read corpora, submit extraction requests, search
├── udom:reviewer — Read corpora, view quality reports, approve checkpoints
└── udom:agent — Programmatic read access to UDOM store (agent service account)
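The role descriptions above can be read as a role-to-permission mapping. The sketch below encodes that reading; the permission strings (for example `corpus:read` and `checkpoint:approve`) and the `is_allowed` helper are hypothetical names chosen for illustration, and only capabilities stated in the role list are granted.

```python
ROLE_PERMISSIONS = {
    # Full CRUD on all corpora, configuration, batch management
    "udom:admin": frozenset({
        "corpus:create", "corpus:read", "corpus:update", "corpus:delete",
        "config:write", "batch:manage",
    }),
    # Read corpora, submit extraction requests, search
    "udom:researcher": frozenset({"corpus:read", "extract:submit", "search"}),
    # Read corpora, view quality reports, approve checkpoints
    "udom:reviewer": frozenset({"corpus:read", "quality:read", "checkpoint:approve"}),
    # Programmatic read access to the UDOM store
    "udom:agent": frozenset({"corpus:read"}),
}


def is_allowed(roles, permission: str) -> bool:
    """Return True if any of the caller's roles grants the permission."""
    return any(permission in ROLE_PERMISSIONS.get(role, frozenset()) for role in roles)
```

Unknown roles grant nothing, so a typo in a role name fails closed. Tenant scoping would be enforced separately, since these roles say nothing about which tenant's corpora are visible.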
6. Example Interfaces
6.1 Python — Core Types
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional
from datetime import datetime, timezone
from uuid import UUID


class ComponentType(str, Enum):
    HEADING = "heading"
    PARAGRAPH = "paragraph"
    EQUATION = "equation"
    FIGURE = "figure"
    TABLE = "table"
    CITATION = "citation"
    CODE = "code"
    LIST = "list"
    ABSTRACT = "abstract"
    BIBLIOGRAPHY = "bibliography"
    CAPTION = "caption"
    FOOTNOTE = "footnote"
    ALGORITHM = "algorithm"
    THEOREM = "theorem"
    PROOF = "proof"
    DEFINITION = "definition"
    EXAMPLE = "example"
    REMARK = "remark"
    APPENDIX = "appendix"
    ACKNOWLEDGMENT = "acknowledgment"
    AUTHOR_INFO = "author_info"
    METADATA = "metadata"
    REFERENCE = "reference"
    SUPPLEMENTARY = "supplementary"
    UNKNOWN = "unknown"


class ExtractionSource(str, Enum):
    DOCLING = "docling"
    AR5IV = "ar5iv"
    LATEX = "latex"
    FUSED = "fused"


class QualityGrade(str, Enum):
    A = "A"  # >= 0.85
    B = "B"  # >= 0.70
    C = "C"  # < 0.70


@dataclass(frozen=True)
class UDOMComponent:
    type: ComponentType
    content: str
    source: ExtractionSource
    confidence: float
    position: int
    metadata: dict = field(default_factory=dict)


@dataclass
class QualityScore:
    structure: float
    tables: float
    math: float
    citations: float
    images: float
    content_density: float
    latex_residual: float
    heading_hierarchy: float
    bibliography: float
    overall: float = 0.0
    grade: QualityGrade = QualityGrade.C

    def __post_init__(self):
        # Weights mirror quality.dimension_weights in udom-pipeline.yaml and sum to 1.0.
        # overall and grade are always recomputed, so values passed in are overwritten.
        weights = [0.12, 0.12, 0.15, 0.10, 0.08, 0.10, 0.13, 0.10, 0.10]
        scores = [self.structure, self.tables, self.math, self.citations,
                  self.images, self.content_density, self.latex_residual,
                  self.heading_hierarchy, self.bibliography]
        self.overall = sum(w * s for w, s in zip(weights, scores))
        self.grade = (QualityGrade.A if self.overall >= 0.85
                      else QualityGrade.B if self.overall >= 0.70
                      else QualityGrade.C)


@dataclass
class UDOMDocument:
    id: UUID
    tenant_id: UUID
    arxiv_id: str
    title: str
    components: list[UDOMComponent]
    quality_score: QualityScore
    source_stats: dict
    corpus: str = "default"
    version: int = 1
    # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12.
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
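The weighted sum behind the overall quality score is easy to check by hand. The standalone sketch below repeats the calculation with the weights from the pipeline configuration passed in as a dictionary; the `overall_score` and `grade` helpers are hypothetical, and the example scores are made up for illustration.

```python
# Weights copied from quality.dimension_weights in udom-pipeline.yaml.
DIMENSION_WEIGHTS = {
    "structure": 0.12, "tables": 0.12, "math": 0.15, "citations": 0.10,
    "images": 0.08, "content_density": 0.10, "latex_residual": 0.13,
    "heading_hierarchy": 0.10, "bibliography": 0.10,
}


def overall_score(scores: dict, weights: dict = DIMENSION_WEIGHTS) -> float:
    """Weighted sum of per-dimension scores; every dimension must be present."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    return sum(weights[name] * scores[name] for name in weights)


def grade(overall: float) -> str:
    """Map an overall score to the documented grade bands."""
    if overall >= 0.85:
        return "A"
    if overall >= 0.70:
        return "B"
    return "C"


# Illustrative inputs: a uniformly good paper, and the same paper with weak math.
uniform = {name: 0.9 for name in DIMENSION_WEIGHTS}
weak_math = dict(uniform, math=0.5)
```

Because math carries the largest weight (0.15), dropping that one dimension from 0.9 to 0.5 lowers the overall score by 0.15 × 0.4 = 0.06, to about 0.84, which moves the paper from Grade A to Grade B. Taking the weights as a parameter would also let tenant overrides change them without editing the class.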
6.2 TypeScript — Agent Tool Types
// UDOM types for CODITECT agent integration
export enum ComponentType {
  Heading = "heading",
  Paragraph = "paragraph",
  Equation = "equation",
  Figure = "figure",
  Table = "table",
  Citation = "citation",
  Code = "code",
  List = "list",
  Abstract = "abstract",
  Bibliography = "bibliography",
  // ... 15 more
}

export enum ExtractionSource {
  Docling = "docling",
  Ar5iv = "ar5iv",
  Latex = "latex",
  Fused = "fused",
}

export interface UDOMComponent {
  readonly type: ComponentType;
  readonly content: string;
  readonly source: ExtractionSource;
  readonly confidence: number;
  readonly position: number;
  readonly metadata: Record<string, unknown>;
}

export interface QualityScore {
  readonly structure: number;
  readonly tables: number;
  readonly math: number;
  readonly citations: number;
  readonly images: number;
  readonly contentDensity: number;
  readonly latexResidual: number;
  readonly headingHierarchy: number;
  readonly bibliography: number;
  readonly overall: number;
  readonly grade: "A" | "B" | "C";
}

export interface UDOMDocument {
  readonly id: string;
  readonly tenantId: string;
  readonly arxivId: string;
  readonly title: string;
  readonly components: ReadonlyArray<UDOMComponent>;
  readonly qualityScore: QualityScore;
  readonly sourceStats: {
    readonly docling: number;
    readonly ar5iv: number;
    readonly latex: number;
    readonly fused: number;
    readonly elapsedSeconds: number;
  };
  readonly corpus: string;
  readonly version: number;
  readonly createdAt: string;
}

// Agent tool parameter types
export interface SearchParams {
  query: string;
  tenantId: string;
  corpus?: string;
  componentType?: ComponentType;
  limit?: number;
  minConfidence?: number;
}

export interface CrossPaperComparisonParams {
  metric: string;
  paperIds: string[];
  tenantId: string;
}

export interface ComparisonResult {
  metric: string;
  papers: Array<{
    arxivId: string;
    title: string;
    relevantTables: UDOMComponent[];
    relevantEquations: UDOMComponent[];
  }>;
}
7. Performance Characteristics
7.1 Extraction Latency
| Source | P50 | P95 | P99 | Bottleneck |
|---|---|---|---|---|
| Docling (PDF) | 5.2s | 8.1s | 12.3s | CPU (PDF parsing) |
| ar5iv (HTML) | 2.1s | 3.5s | 5.8s | Network (HTTP fetch) |
| LaTeX (source) | 6.8s | 11.2s | 18.5s | CPU (pandoc) + Network (tar download) |
| Fusion | 0.3s | 0.5s | 0.8s | CPU (component matching) |
| Quality scoring | 0.1s | 0.2s | 0.3s | CPU (score computation) |
| Total (parallel) | 8.4s | 14.2s | 22.1s | Network + CPU |
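The "Total (parallel)" row implies that the three extractors run concurrently, so end-to-end latency tracks the slowest source plus fusion and scoring, not the sum of all three. A minimal asyncio sketch of that fan-out follows. The `run_sources` function and the extractor callables are hypothetical; the timeout values come from `timeout_seconds` in the pipeline configuration.

```python
import asyncio

# Per-source timeouts in seconds, copied from udom-pipeline.yaml.
TIMEOUTS = {"docling": 30, "ar5iv": 15, "latex": 45}


async def run_sources(extractors: dict, timeouts: dict = TIMEOUTS) -> dict:
    """Run all extractors concurrently and return {source: result or None}.

    `extractors` maps a source name to a zero-argument coroutine function.
    A source that exceeds its timeout yields None instead of failing the run.
    """

    async def guarded(name, extractor):
        try:
            return name, await asyncio.wait_for(extractor(), timeout=timeouts[name])
        except asyncio.TimeoutError:
            return name, None

    pairs = await asyncio.gather(
        *(guarded(name, extractor) for name, extractor in extractors.items())
    )
    return dict(pairs)
```

Mapping a timeout to None lets the fusion step proceed with whichever sources did respond. This sketch handles timeouts only; other extractor exceptions would still propagate.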
7.2 Storage Requirements
| Metric | Value |
|---|---|
| Average UDOM JSON per paper | ~150 KB |
| Average markdown per paper | ~45 KB |
| Average components per paper | 300+ |
| 218-paper corpus total | ~42 MB JSONB + ~10 MB markdown |
| 10,000-paper corpus estimate | ~1.9 GB JSONB + ~450 MB markdown |
| PostgreSQL with GIN indexes | ~3× raw data size |
7.3 Query Performance (PostgreSQL)
| Query Type | Target Latency | Index Used |
|---|---|---|
| Get document by arxiv_id | < 5ms | B-tree on (tenant_id, arxiv_id) |
| Full-text search across corpus | < 100ms | GIN on tsvector |
| Component type filter | < 50ms | GIN on JSONB components |
| Quality score range query | < 30ms | B-tree expression index on the overall score in quality_score (GIN on JSONB does not serve range predicates) |
| Cross-paper comparison (10 papers) | < 200ms | B-tree + GIN |
7.4 Token Efficiency
| Consumption Mode | Tokens per Paper | Cost (Sonnet) |
|---|---|---|
| Raw PDF text to agent | ~15,000 tokens | ~$0.045 |
| UDOM markdown to agent | ~8,000 tokens | ~$0.024 |
| UDOM targeted query (equations only) | ~1,500 tokens | ~$0.005 |
| Savings: UDOM vs. raw PDF | ~47% reduction | ~47% cost reduction |
| Savings: UDOM targeted vs. raw PDF | ~90% reduction | ~90% cost reduction |
This token efficiency compounds across agent interactions: a research agent querying 20 papers via targeted UDOM search uses ~30,000 tokens (20 × ~1,500), versus ~300,000 tokens (20 × ~15,000) for raw PDF ingestion.
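The percentages and totals in this section follow directly from the per-paper token counts. The short sketch below reproduces the arithmetic. The price of $3.00 per million tokens is not stated in the document; it is inferred from the table ($0.045 for ~15,000 tokens) and should be treated as an assumption, as should the helper names.

```python
# Per-paper token counts from the token efficiency table.
TOKENS_PER_PAPER = {
    "raw_pdf": 15_000,
    "udom_markdown": 8_000,
    "udom_targeted": 1_500,
}
# Assumption: implied by the table ($0.045 / 15,000 tokens), not stated directly.
USD_PER_MILLION_TOKENS = 3.00


def reduction(baseline: int, alternative: int) -> float:
    """Fractional token reduction of `alternative` relative to `baseline`."""
    return 1 - alternative / baseline


def cost_usd(tokens: int) -> float:
    return tokens * USD_PER_MILLION_TOKENS / 1_000_000


def corpus_tokens(mode: str, papers: int) -> int:
    return TOKENS_PER_PAPER[mode] * papers


markdown_saving = reduction(TOKENS_PER_PAPER["raw_pdf"], TOKENS_PER_PAPER["udom_markdown"])
targeted_saving = reduction(TOKENS_PER_PAPER["raw_pdf"], TOKENS_PER_PAPER["udom_targeted"])
```

Here `markdown_saving` is about 0.467 (the ~47% row) and `targeted_saving` is 0.90, while `corpus_tokens("udom_targeted", 20)` and `corpus_tokens("raw_pdf", 20)` give the 30,000 and 300,000 token totals for the 20-paper scenario.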
Technical Design Document covers: APIs, configuration, deployment, data model, security, interfaces, and performance characteristics.