System Design Document: UDOM Pipeline

Version: 1.0 | Date: 2026-02-09
Classification: Architecture — System Design
Subsystem: UDOM (Universal Document Object Model) Pipeline within CODITECT Platform


1. Context Diagram

The UDOM Pipeline operates as a subsystem within CODITECT, mediating between external paper sources (arXiv, publishers) and CODITECT's agent layer. It transforms unstructured scientific documents into typed, queryable knowledge components.

                     ┌─────────────────────────────────────────┐
                     │            CODITECT Platform            │
                     │                                         │
┌──────────┐         │   ┌──────────────────────────────────┐  │
│  arXiv   │───PDF───┼──►│                                  │  │
│  Server  │──LaTeX──┼──►│     UDOM Pipeline Subsystem      │  │
└──────────┘         │   │                                  │  │
                     │   │    [Extraction] → [Fusion] →     │  │
┌──────────┐         │   │     [Quality] → [Store]          │  │
│  ar5iv   │──HTML───┼──►│                                  │  │
│  Server  │         │   └────────────────┬─────────────────┘  │
└──────────┘         │                    │                    │
                     │                    ▼                    │
┌──────────┐         │   ┌──────────────────────────────────┐  │
│Publisher │───API───┼──►│     UDOM Store (PostgreSQL)      │  │
│  APIs    │         │   └────────────────┬─────────────────┘  │
└──────────┘         │                    │                    │
                     │                    ▼                    │
                     │   ┌──────────────────────────────────┐  │
                     │   │       CODITECT Agent Layer       │  │
                     │   │      (Research, Compliance,      │  │
                     │   │        Synthesis agents)         │  │
                     │   └──────────────────────────────────┘  │
                     │                                         │
                     │   ┌──────────────────────────────────┐  │
                     │   │  UDOM Navigator (viewer.html)    │◄─┼── Human Users
                     │   └──────────────────────────────────┘  │
                     └─────────────────────────────────────────┘

Actors:

  • arXiv Server — provides PDF, LaTeX source, and metadata via API
  • ar5iv Server — provides LaTeXML-rendered HTML with preserved math/tables
  • Publisher APIs — PubMed, IEEE, Springer, Elsevier (future adapters)
  • CODITECT Agents — consume UDOM components for research synthesis
  • Human Users — browse results via UDOM Navigator, review quality reports

2. Component Breakdown

2.1 Extraction Layer

Three independent, stateless workers operating in parallel:

| Worker         | Input                | Output          | Performance                   | Strength                                      |
|----------------|----------------------|-----------------|-------------------------------|-----------------------------------------------|
| Docling Worker | PDF binary           | UDOM components | ~5–7s/paper (62× pymupdf4llm) | Document structure, paragraphs, headings      |
| ar5iv Worker   | HTML page            | UDOM components | ~2–3s/paper                   | Math (alttext), tables, inline formatting     |
| LaTeX Worker   | .tex source + pandoc | UDOM components | ~5–12s/paper                  | Display math, macros, citations, bibliography |

Each worker produces typed UDOMComponent objects with source, confidence, and position metadata.
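As an illustration, such a component might look like the following Python sketch. The class shape and field names beyond source, confidence, and position are assumptions for this document, not the actual CODITECT schema.

```python
# Illustrative shape of a worker output; field names are assumptions,
# not the actual CODITECT schema.
from dataclasses import dataclass, field

@dataclass
class UDOMComponent:
    type: str          # "heading", "paragraph", "equation", "table", ...
    content: str       # normalized text / LaTeX / HTML payload
    source: str        # "docling", "ar5iv", or "latex"
    confidence: float  # extractor's self-reported confidence in [0, 1]
    position: int      # reading-order index within the document
    meta: dict = field(default_factory=dict)

c = UDOMComponent("equation", r"E = mc^2", "latex", 0.95, 12)
```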

2.2 Fusion Engine

Deterministic component merger using confidence-weighted selection:

| Component Type     | Primary Source | Fallback 1 | Fallback 2 | Selection Rationale                                 |
|--------------------|----------------|------------|------------|-----------------------------------------------------|
| Heading            | Docling        | ar5iv      | LaTeX      | Docling best at structural detection                |
| Paragraph          | Docling        | ar5iv      | LaTeX      | Docling preserves reading order                     |
| Equation (display) | LaTeX          | ar5iv      | Docling    | LaTeX source is ground truth for math               |
| Equation (inline)  | ar5iv          | LaTeX      | Docling    | ar5iv alttext captures inline math context          |
| Table              | ar5iv          | Docling    | LaTeX      | ar5iv preserves HTML table structure                |
| Figure             | Docling        | ar5iv      | —          | Docling detects figure boundaries; ar5iv for images |
| Citation           | LaTeX          | ar5iv      | Docling    | LaTeX source has precise citation keys              |
| Bibliography       | LaTeX          | ar5iv      | Docling    | LaTeX BibTeX entries are most complete              |
| Abstract           | ar5iv          | Docling    | LaTeX      | ar5iv has clean abstract section                    |
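The per-type precedence rules above can be sketched as a fallback lookup. This is a hypothetical simplification: the real fusion engine also weighs the extractors' confidence scores, which are omitted here.

```python
# Hypothetical sketch of the fusion engine's per-type source precedence;
# the real engine also weighs extractor confidence scores.
PRECEDENCE = {
    "heading":          ["docling", "ar5iv", "latex"],
    "equation_display": ["latex", "ar5iv", "docling"],
    "equation_inline":  ["ar5iv", "latex", "docling"],
    "table":            ["ar5iv", "docling", "latex"],
    "citation":         ["latex", "ar5iv", "docling"],
}

def select_source(component_type, candidates):
    """Pick the highest-precedence source that actually produced a candidate.

    `candidates` maps source name -> extracted component (or None).
    """
    for source in PRECEDENCE.get(component_type, ["docling", "ar5iv", "latex"]):
        if candidates.get(source) is not None:
            return source
    return None

# With the LaTeX source missing, a display equation falls back to ar5iv:
select_source("equation_display", {"ar5iv": "<math>", "docling": "$?$"})
```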

2.3 Quality Scorer

9-dimension evaluation producing a scalar grade:

| Dimension         | Weight | Measures                                              | Grade A Threshold |
|-------------------|--------|-------------------------------------------------------|-------------------|
| Structure         | 0.12   | Heading hierarchy completeness, section detection     | ≥ 0.85            |
| Tables            | 0.12   | Column alignment, header detection, cell content      | ≥ 0.80            |
| Math              | 0.15   | LaTeX validity, no broken delimiters, macro expansion | ≥ 0.85            |
| Citations         | 0.10   | Reference detection, bracketed citation format        | ≥ 0.80            |
| Images            | 0.08   | Figure reference integrity, caption presence          | ≥ 0.70            |
| Content Density   | 0.10   | Words per component, no empty sections                | ≥ 0.85            |
| LaTeX Residual    | 0.13   | No stray \begin, \end, raw LaTeX in text              | ≥ 0.90            |
| Heading Hierarchy | 0.10   | Monotonic nesting, no skipped levels                  | ≥ 0.85            |
| Bibliography      | 0.10   | Reference completeness, author/year/title present     | ≥ 0.80            |

Overall score = weighted sum. Grade A ≥ 0.85.
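A worked example of the grade computation, using the weights from the table above (the dimension scores themselves are made up for illustration):

```python
# Worked example of the overall grade: weighted sum of the nine
# dimension scores. Weights come from the table above; the sample
# scores are invented for illustration.
WEIGHTS = {
    "structure": 0.12, "tables": 0.12, "math": 0.15, "citations": 0.10,
    "images": 0.08, "content_density": 0.10, "latex_residual": 0.13,
    "heading_hierarchy": 0.10, "bibliography": 0.10,
}
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights sum to 1

def overall_score(scores):
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)

scores = {d: 0.90 for d in WEIGHTS}            # a uniformly strong paper
grade = "A" if overall_score(scores) >= 0.85 else "B/C"
```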

2.4 UDOM Store

PostgreSQL with JSONB storage, GIN indexes, and full-text search:

CREATE TABLE udom_documents (
    id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id     UUID NOT NULL REFERENCES tenants(id),
    arxiv_id      TEXT NOT NULL,
    title         TEXT,
    components    JSONB NOT NULL,            -- Array of UDOMComponent
    quality_score JSONB NOT NULL,            -- 9-dimension scores
    source_stats  JSONB,                     -- Per-source extraction metadata
    corpus        TEXT DEFAULT 'default',    -- Named corpus for organization
    created_at    TIMESTAMPTZ DEFAULT NOW(),
    updated_at    TIMESTAMPTZ DEFAULT NOW(),

    UNIQUE (tenant_id, arxiv_id)
);

-- RLS for multi-tenancy
ALTER TABLE udom_documents ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON udom_documents
    USING (tenant_id = current_setting('app.current_tenant')::UUID);

-- Performance indexes
CREATE INDEX idx_udom_components ON udom_documents USING GIN (components);
CREATE INDEX idx_udom_quality ON udom_documents USING GIN (quality_score);
CREATE INDEX idx_udom_corpus ON udom_documents (tenant_id, corpus);
-- coalesce: a NULL title would otherwise null out the whole tsvector
CREATE INDEX idx_udom_fulltext ON udom_documents
    USING GIN (to_tsvector('english', coalesce(title, '') || ' ' || components::text));

2.5 UDOM Navigator

Static HTML/JS viewer (viewer.html) for human browsing:

| Feature              | Implementation                                      |
|----------------------|-----------------------------------------------------|
| Light/dark theme     | CSS custom properties, [data-theme="dark"] override |
| Batch run browser    | Hamburger menu scanning run-* directories           |
| KaTeX math rendering | CDN v0.16.11 with auto-render + custom macros       |
| Component inspector  | JSON viewer (@alenaksu/json-viewer)                 |
| Quality dashboard    | Per-paper and batch-level score visualization       |

3. Data & Control Flows

3.1 Single Paper Flow

1. Orchestrator receives paper_id

2. Dispatches 3 extraction tasks (parallel)
   ├── NATS: udom.extract.docling → Docling Worker
   ├── NATS: udom.extract.ar5iv → ar5iv Worker
   └── NATS: udom.extract.latex → LaTeX Worker

3. Workers extract components independently (5–12s each, parallel)

4. All 3 results arrive at Fusion Engine

5. Fusion Engine merges → canonical UDOM Document

6. Quality Scorer evaluates 9 dimensions

7. Quality Gate Decision:
   ├── Grade A (≥0.85) → Store in PostgreSQL, publish audit event
   └── Grade B/C (<0.85) → Retry with enhanced parameters (max 2 retries)
       └── If still below threshold → Store with quality warning, checkpoint for human review

8. Audit event published to Event Bus (NATS)

9. Available for agent consumption via UDOM Store API
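The quality-gate and retry logic in step 7 can be sketched as follows. The function names and return shape are illustrative, not the orchestrator's actual API.

```python
# Sketch of the step-7 quality gate (illustrative API, not CODITECT's).
GRADE_A_THRESHOLD = 0.85
MAX_RETRIES = 2

def quality_gate(score_fn):
    """Score a paper, retrying with enhanced parameters up to MAX_RETRIES.

    `score_fn(attempt)` re-runs extraction+scoring; attempt > 0 implies
    enhanced extraction parameters.
    """
    for attempt in range(MAX_RETRIES + 1):
        score = score_fn(attempt)
        if score >= GRADE_A_THRESHOLD:
            return {"stored": True, "warning": False, "attempts": attempt + 1}
    # Still below threshold: store with warning, checkpoint for human review
    return {"stored": True, "warning": True, "attempts": MAX_RETRIES + 1}

# A paper scoring 0.80 first, then 0.86 on retry, passes on attempt 2:
result = quality_gate(lambda attempt: [0.80, 0.86][attempt])
```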

3.2 Batch Flow

1. Batch submitted (list of paper_ids + configuration)

2. Orchestrator creates batch record in State Store

3. For each paper (bounded concurrency via semaphore):
   └── Execute Single Paper Flow (above)

4. Batch-level quality aggregation:
   ├── Grade distribution (A/B/C counts)
   ├── Average processing time
   ├── Total component count
   └── Per-dimension score distributions

5. Batch report written to State Store

6. Audit trail: complete batch provenance chain

7. UDOM Navigator updated (new run-* directory)
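The bounded concurrency in step 3 can be sketched with an asyncio semaphore; `process_paper` stands in for the Single Paper Flow and the concurrency limit is an assumed parameter.

```python
# Minimal sketch of step 3's bounded concurrency; process_paper stands
# in for the Single Paper Flow.
import asyncio

async def run_batch(paper_ids, process_paper, max_concurrent=10):
    sem = asyncio.Semaphore(max_concurrent)

    async def guarded(pid):
        async with sem:                 # at most max_concurrent papers in flight
            return await process_paper(pid)

    # gather preserves submission order in its results
    return await asyncio.gather(*(guarded(p) for p in paper_ids))

async def fake_process(pid):            # stand-in worker for illustration
    await asyncio.sleep(0)
    return (pid, "A")

results = asyncio.run(run_batch(["2401.00001", "2401.00002"], fake_process, 2))
```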

4. Scaling Model

Horizontal Scaling (Primary Strategy)

| Component       | Scaling Axis          | Constraint                   | Target                                 |
|-----------------|-----------------------|------------------------------|----------------------------------------|
| Docling Workers | Replica count         | CPU-bound (PDF parsing)      | 1 worker per 2 CPU cores               |
| ar5iv Workers   | Replica count         | Network I/O bound            | 2–5 replicas per ar5iv rate limit tier |
| LaTeX Workers   | Replica count         | CPU-bound (pandoc)           | 1 worker per 2 CPU cores               |
| Fusion Engine   | Stateless, horizontal | Memory (component assembly)  | 1 instance per 10 concurrent papers    |
| Quality Scorer  | Stateless, horizontal | CPU (scoring calculations)   | Co-located with Fusion                 |
| UDOM Store      | Read replicas         | PostgreSQL connection limits | Primary + 2 read replicas              |

Throughput Model

| Configuration                               | Papers/Hour | Papers/Day | Notes             |
|---------------------------------------------|-------------|------------|-------------------|
| Single instance (1 worker each)             | ~120        | ~2,880     | Development/POC   |
| Small cluster (3 Docling, 2 ar5iv, 2 LaTeX) | ~500        | ~12,000    | Pilot deployment  |
| Production (10 Docling, 5 ar5iv, 5 LaTeX)   | ~1,800      | ~43,200    | Enterprise tenant |

Bottleneck Analysis

  1. ar5iv rate limiting — external service, cannot scale beyond their limits. Mitigation: cache ar5iv responses aggressively (papers don't change after publication).
  2. PostgreSQL write throughput — batch inserts of JSONB components. Mitigation: batch commits (10–50 papers per transaction), write-ahead log tuning.
  3. Network egress — downloading PDFs + HTML + LaTeX source. Mitigation: geographic co-location with arXiv CDN, local PDF cache.

5. Failure Modes

| Failure              | Detection                                   | Impact                              | Recovery                                                             |
|----------------------|---------------------------------------------|-------------------------------------|----------------------------------------------------------------------|
| Docling OOM          | Worker health check, memory threshold       | Single paper extraction fails       | Circuit breaker → restart with smaller batch; degrade to 2-source    |
| ar5iv unavailable    | HTTP 429/503 for >5 consecutive requests    | Math and table quality degrades     | Circuit breaker → exponential backoff; continue with Docling + LaTeX |
| LaTeX source missing | 404 from arXiv e-print endpoint             | Math fidelity slightly reduced      | Graceful degradation → 2-source mode (Docling + ar5iv)               |
| pandoc crash         | Non-zero exit code from subprocess          | Single paper LaTeX extraction fails | Retry with --from=latex+raw_tex; if persistent, skip LaTeX source    |
| Fusion conflict      | Multiple sources disagree on component type | Incorrect component classification  | Default to Docling structure; log conflict for offline analysis      |
| Quality gate failure | Score < threshold after max retries         | Paper below quality standard        | Store with warning flag; checkpoint for human review                 |
| PostgreSQL full      | Disk space alert                            | Cannot store new documents          | Alert → prune old batch runs; expand storage                         |
| NATS partition       | Worker can't connect to NATS                | Workers stall, no new extractions   | NATS cluster self-heals; workers reconnect with backoff              |
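The circuit-breaker behavior in the ar5iv row can be sketched as a toy state machine that trips after more than five consecutive failures; the backoff timer and half-open probing are omitted.

```python
# Toy circuit breaker for the ar5iv failure mode: trips after more than
# `threshold` consecutive failures (backoff and half-open probing omitted).
class CircuitBreaker:
    def __init__(self, threshold=5):
        self.threshold = threshold
        self.consecutive_failures = 0

    @property
    def open(self):
        return self.consecutive_failures > self.threshold

    def record(self, ok):
        # Any success closes the breaker; failures accumulate
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1

cb = CircuitBreaker()
for _ in range(6):           # six HTTP 429/503 responses in a row
    cb.record(ok=False)
# cb.open is now True: route around ar5iv, continue with Docling + LaTeX
```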

Graceful Degradation Matrix

| Sources Available       | Quality Impact                 | Expected Grade  |
|-------------------------|--------------------------------|-----------------|
| Docling + ar5iv + LaTeX | Full fidelity                  | A (0.87–0.94)   |
| Docling + ar5iv         | Slightly reduced math fidelity | A (0.85–0.90)   |
| Docling + LaTeX         | Reduced table quality          | A/B (0.82–0.89) |
| Docling only            | Baseline extraction            | B (0.75–0.84)   |
| ar5iv only              | No structural backbone         | B/C (0.70–0.80) |

6. Observability Story

Tracing Strategy

Each processed paper creates a single trace with spans for each extraction source, plus fusion, scoring, and storage. Traces are correlated across a batch via batch_id and per paper via paper_id.

Key Dashboards

| Dashboard       | Metrics                                            | Audience             |
|-----------------|----------------------------------------------------|----------------------|
| Pipeline Health | Worker uptime, error rates, circuit breaker state  | Platform engineering |
| Quality Trends  | Grade distribution over time, per-dimension trends | Quality engineering  |
| Throughput      | Papers/hour, processing time P50/P95/P99           | Operations           |
| Tenant Activity | Per-tenant extraction volume, corpus size growth   | Customer success     |
| Cost            | Compute hours per paper, storage growth rate       | Finance / platform   |

SLOs

| SLO                                         | Target | Measurement                |
|---------------------------------------------|--------|----------------------------|
| Quality: Grade A rate                       | ≥ 98%  | Rolling 7-day window       |
| Latency: P95 per paper                      | < 45s  | Per-extraction measurement |
| Availability: pipeline uptime               | 99.5%  | Monthly                    |
| Data integrity: component hash verification | 100%   | Every read operation       |
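The latency SLO check can be sketched as a nearest-rank P95 over a measurement window; the latency samples below are synthetic, and the measurement plumbing is omitted.

```python
# Sketch of the latency SLO check: nearest-rank P95 of per-paper
# processing times against the 45 s target (synthetic samples).
import math

def percentile(samples, pct):
    """Nearest-rank percentile; adequate for coarse SLO reporting."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

latencies_s = [12, 14, 15, 16, 18, 20, 22, 25, 30, 60]  # synthetic window
p95 = percentile(latencies_s, 95)   # the one slow outlier dominates
slo_met = p95 < 45                  # this window breaches the SLO
```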

7. Platform Boundary — Framework Provides vs. CODITECT Builds

| Capability        | Framework Provides              | CODITECT Builds                                                     |
|-------------------|---------------------------------|---------------------------------------------------------------------|
| PDF extraction    | Docling engine (IBM)            | Worker orchestration, error handling, circuit breakers              |
| HTML parsing      | BeautifulSoup/lxml              | ar5iv-specific extraction logic, math alttext handling              |
| LaTeX conversion  | Pandoc                          | Custom macro expansion, preamble parsing, source detection          |
| Document fusion   | —                               | Entire fusion engine (confidence-weighted, type-specific selection) |
| Quality scoring   | —                               | 9-dimension scoring system, Grade thresholds, regression detection  |
| Storage           | PostgreSQL, JSONB, GIN indexing | UDOM schema, RLS policies, tenant isolation, search queries         |
| Message bus       | NATS                            | Worker dispatch patterns, batch coordination, event schemas         |
| Observability     | OpenTelemetry, Prometheus       | UDOM-specific metrics, dashboards, SLOs, alerting rules             |
| Viewer            | KaTeX (CDN), JSON viewer lib    | UDOM Navigator (viewer.html), theme system, batch browser           |
| Agent integration | —                               | UDOM agent tools, semantic search, cross-paper comparison           |

Build vs. use ratio: ~60% CODITECT-built, ~40% leveraging existing frameworks. The fusion engine, quality scoring, and agent integration are pure CODITECT IP — these are the value differentiators.


System Design Document covers: context, components, data flows, scaling, failure modes, observability, and platform boundaries.