System Design Document: UDOM Pipeline

Version: 1.0 | Date: 2026-02-09
Classification: Architecture — System Design
Subsystem: UDOM (Universal Document Object Model) Pipeline within CODITECT Platform


1. Context Diagram

The UDOM Pipeline operates as a subsystem within CODITECT, mediating between external paper sources (arXiv, publishers) and CODITECT's agent layer. It transforms unstructured scientific documents into typed, queryable knowledge components.

                     ┌─────────────────────────────────────────┐
                     │            CODITECT Platform            │
                     │                                         │
┌──────────┐         │   ┌──────────────────────────────────┐  │
│  arXiv   │───PDF───┼──►│                                  │  │
│  Server  │──LaTeX──┼──►│     UDOM Pipeline Subsystem      │  │
└──────────┘         │   │                                  │  │
                     │   │    [Extraction] → [Fusion] →     │  │
┌──────────┐         │   │     [Quality] → [Store]          │  │
│  ar5iv   │──HTML───┼──►│                                  │  │
│  Server  │         │   └────────────────┬─────────────────┘  │
└──────────┘         │                    │                    │
                     │                    ▼                    │
┌──────────┐         │   ┌──────────────────────────────────┐  │
│Publisher │───API───┼──►│     UDOM Store (PostgreSQL)      │  │
│  APIs    │         │   └────────────────┬─────────────────┘  │
└──────────┘         │                    │                    │
                     │                    ▼                    │
                     │   ┌──────────────────────────────────┐  │
                     │   │       CODITECT Agent Layer       │  │
                     │   │      (Research, Compliance,      │  │
                     │   │        Synthesis agents)         │  │
                     │   └──────────────────────────────────┘  │
                     │                                         │
                     │   ┌──────────────────────────────────┐  │
                     │   │  UDOM Navigator (viewer.html)    │◄─┼── Human Users
                     │   └──────────────────────────────────┘  │
                     └─────────────────────────────────────────┘

Actors:

  • arXiv Server — provides PDF, LaTeX source, and metadata via API
  • ar5iv Server — provides LaTeXML-rendered HTML with preserved math/tables
  • Publisher APIs — PubMed, IEEE, Springer, Elsevier (future adapters)
  • CODITECT Agents — consume UDOM components for research synthesis
  • Human Users — browse results via UDOM Navigator, review quality reports

2. Component Breakdown

2.1 Extraction Layer

Three independent, stateless workers operating in parallel:

| Worker         | Input                | Output          | Performance                   | Strength                                      |
|----------------|----------------------|-----------------|-------------------------------|-----------------------------------------------|
| Docling Worker | PDF binary           | UDOM components | ~5–7s/paper (62× pymupdf4llm) | Document structure, paragraphs, headings      |
| ar5iv Worker   | HTML page            | UDOM components | ~2–3s/paper                   | Math (alttext), tables, inline formatting     |
| LaTeX Worker   | .tex source + pandoc | UDOM components | ~5–12s/paper                  | Display math, macros, citations, bibliography |

Each worker produces typed UDOMComponent objects with source, confidence, and position metadata.
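As an illustration, such a component might look like the following Python sketch. The class shape and field names beyond source, confidence, and position are assumptions for this document, not the actual CODITECT schema.

```python
# Illustrative shape of a worker output; field names are assumptions,
# not the actual CODITECT schema.
from dataclasses import dataclass, field

@dataclass
class UDOMComponent:
    type: str          # "heading", "paragraph", "equation", "table", ...
    content: str       # normalized text / LaTeX / HTML payload
    source: str        # "docling", "ar5iv", or "latex"
    confidence: float  # extractor's self-reported confidence in [0, 1]
    position: int      # reading-order index within the document
    meta: dict = field(default_factory=dict)

c = UDOMComponent("equation", r"E = mc^2", "latex", 0.95, 12)
```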

2.2 Fusion Engine

Deterministic component merger using confidence-weighted selection:

| Component Type     | Primary Source | Fallback 1 | Fallback 2 | Selection Rationale                                 |
|--------------------|----------------|------------|------------|-----------------------------------------------------|
| Heading            | Docling        | ar5iv      | LaTeX      | Docling best at structural detection                |
| Paragraph          | Docling        | ar5iv      | LaTeX      | Docling preserves reading order                     |
| Equation (display) | LaTeX          | ar5iv      | Docling    | LaTeX source is ground truth for math               |
| Equation (inline)  | ar5iv          | LaTeX      | Docling    | ar5iv alttext captures inline math context          |
| Table              | ar5iv          | Docling    | LaTeX      | ar5iv preserves HTML table structure                |
| Figure             | Docling        | ar5iv      | —          | Docling detects figure boundaries; ar5iv for images |
| Citation           | LaTeX          | ar5iv      | Docling    | LaTeX source has precise citation keys              |
| Bibliography       | LaTeX          | ar5iv      | Docling    | LaTeX BibTeX entries are most complete              |
| Abstract           | ar5iv          | Docling    | LaTeX      | ar5iv has clean abstract section                    |
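The per-type precedence rules above can be sketched as a fallback lookup. This is a hypothetical simplification: the real fusion engine also weighs the extractors' confidence scores, which are omitted here.

```python
# Hypothetical sketch of the fusion engine's per-type source precedence;
# the real engine also weighs extractor confidence scores.
PRECEDENCE = {
    "heading":          ["docling", "ar5iv", "latex"],
    "equation_display": ["latex", "ar5iv", "docling"],
    "equation_inline":  ["ar5iv", "latex", "docling"],
    "table":            ["ar5iv", "docling", "latex"],
    "citation":         ["latex", "ar5iv", "docling"],
}

def select_source(component_type, candidates):
    """Pick the highest-precedence source that actually produced a candidate.

    `candidates` maps source name -> extracted component (or None).
    """
    for source in PRECEDENCE.get(component_type, ["docling", "ar5iv", "latex"]):
        if candidates.get(source) is not None:
            return source
    return None

# With the LaTeX source missing, a display equation falls back to ar5iv:
select_source("equation_display", {"ar5iv": "<math>", "docling": "$?$"})
```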

2.3 Quality Scorer

9-dimension evaluation producing a scalar grade:

| Dimension         | Weight | Measures                                              | Grade A Threshold |
|-------------------|--------|-------------------------------------------------------|-------------------|
| Structure         | 0.12   | Heading hierarchy completeness, section detection     | ≥ 0.85            |
| Tables            | 0.12   | Column alignment, header detection, cell content      | ≥ 0.80            |
| Math              | 0.15   | LaTeX validity, no broken delimiters, macro expansion | ≥ 0.85            |
| Citations         | 0.10   | Reference detection, bracketed citation format        | ≥ 0.80            |
| Images            | 0.08   | Figure reference integrity, caption presence          | ≥ 0.70            |
| Content Density   | 0.10   | Words per component, no empty sections                | ≥ 0.85            |
| LaTeX Residual    | 0.13   | No stray \begin, \end, raw LaTeX in text              | ≥ 0.90            |
| Heading Hierarchy | 0.10   | Monotonic nesting, no skipped levels                  | ≥ 0.85            |
| Bibliography      | 0.10   | Reference completeness, author/year/title present     | ≥ 0.80            |

Overall score = weighted sum. Grade A ≥ 0.85.
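A worked example of the grade computation, using the weights from the table above (the dimension scores themselves are made up for illustration):

```python
# Worked example of the overall grade: weighted sum of the nine
# dimension scores. Weights come from the table above; the sample
# scores are invented for illustration.
WEIGHTS = {
    "structure": 0.12, "tables": 0.12, "math": 0.15, "citations": 0.10,
    "images": 0.08, "content_density": 0.10, "latex_residual": 0.13,
    "heading_hierarchy": 0.10, "bibliography": 0.10,
}
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights sum to 1

def overall_score(scores):
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)

scores = {d: 0.90 for d in WEIGHTS}            # a uniformly strong paper
grade = "A" if overall_score(scores) >= 0.85 else "B/C"
```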

2.4 UDOM Store

PostgreSQL with JSONB storage, GIN indexes, and full-text search:

CREATE TABLE udom_documents (
    id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id     UUID NOT NULL REFERENCES tenants(id),
    arxiv_id      TEXT NOT NULL,
    title         TEXT,
    components    JSONB NOT NULL,            -- Array of UDOMComponent
    quality_score JSONB NOT NULL,            -- 9-dimension scores
    source_stats  JSONB,                     -- Per-source extraction metadata
    corpus        TEXT DEFAULT 'default',    -- Named corpus for organization
    created_at    TIMESTAMPTZ DEFAULT NOW(),
    updated_at    TIMESTAMPTZ DEFAULT NOW(),

    UNIQUE (tenant_id, arxiv_id)
);

-- RLS for multi-tenancy
ALTER TABLE udom_documents ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON udom_documents
    USING (tenant_id = current_setting('app.current_tenant')::UUID);

-- Performance indexes
CREATE INDEX idx_udom_components ON udom_documents USING GIN (components);
CREATE INDEX idx_udom_quality ON udom_documents USING GIN (quality_score);
CREATE INDEX idx_udom_corpus ON udom_documents (tenant_id, corpus);
-- coalesce: a NULL title would otherwise null out the whole tsvector
CREATE INDEX idx_udom_fulltext ON udom_documents
    USING GIN (to_tsvector('english', coalesce(title, '') || ' ' || components::text));

2.5 UDOM Navigator

Static HTML/JS viewer (viewer.html) for human browsing:

| Feature              | Implementation                                      |
|----------------------|-----------------------------------------------------|
| Light/dark theme     | CSS custom properties, [data-theme="dark"] override |
| Batch run browser    | Hamburger menu scanning run-* directories           |
| KaTeX math rendering | CDN v0.16.11 with auto-render + custom macros       |
| Component inspector  | JSON viewer (@alenaksu/json-viewer)                 |
| Quality dashboard    | Per-paper and batch-level score visualization       |

3. Data & Control Flows

3.1 Single Paper Flow

1. Orchestrator receives paper_id

2. Dispatches 3 extraction tasks (parallel)
   ├── NATS: udom.extract.docling → Docling Worker
   ├── NATS: udom.extract.ar5iv → ar5iv Worker
   └── NATS: udom.extract.latex → LaTeX Worker

3. Workers extract components independently (5–12s each, parallel)

4. All 3 results arrive at Fusion Engine

5. Fusion Engine merges → canonical UDOM Document

6. Quality Scorer evaluates 9 dimensions

7. Quality Gate Decision:
   ├── Grade A (≥0.85) → Store in PostgreSQL, publish audit event
   └── Grade B/C (<0.85) → Retry with enhanced parameters (max 2 retries)
       └── If still below threshold → Store with quality warning, checkpoint for human review

8. Audit event published to Event Bus (NATS)

9. Available for agent consumption via UDOM Store API
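The quality-gate and retry logic in step 7 can be sketched as follows. The function names and return shape are illustrative, not the orchestrator's actual API.

```python
# Sketch of the step-7 quality gate (illustrative API, not CODITECT's).
GRADE_A_THRESHOLD = 0.85
MAX_RETRIES = 2

def quality_gate(score_fn):
    """Score a paper, retrying with enhanced parameters up to MAX_RETRIES.

    `score_fn(attempt)` re-runs extraction+scoring; attempt > 0 implies
    enhanced extraction parameters.
    """
    for attempt in range(MAX_RETRIES + 1):
        score = score_fn(attempt)
        if score >= GRADE_A_THRESHOLD:
            return {"stored": True, "warning": False, "attempts": attempt + 1}
    # Still below threshold: store with warning, checkpoint for human review
    return {"stored": True, "warning": True, "attempts": MAX_RETRIES + 1}

# A paper scoring 0.80 first, then 0.86 on retry, passes on attempt 2:
result = quality_gate(lambda attempt: [0.80, 0.86][attempt])
```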

3.2 Batch Flow

1. Batch submitted (list of paper_ids + configuration)

2. Orchestrator creates batch record in State Store

3. For each paper (bounded concurrency via semaphore):
   └── Execute Single Paper Flow (above)

4. Batch-level quality aggregation:
   ├── Grade distribution (A/B/C counts)
   ├── Average processing time
   ├── Total component count
   └── Per-dimension score distributions

5. Batch report written to State Store

6. Audit trail: complete batch provenance chain

7. UDOM Navigator updated (new run-* directory)
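The bounded concurrency in step 3 can be sketched with an asyncio semaphore; `process_paper` stands in for the Single Paper Flow and the concurrency limit is an assumed parameter.

```python
# Minimal sketch of step 3's bounded concurrency; process_paper stands
# in for the Single Paper Flow.
import asyncio

async def run_batch(paper_ids, process_paper, max_concurrent=10):
    sem = asyncio.Semaphore(max_concurrent)

    async def guarded(pid):
        async with sem:                 # at most max_concurrent papers in flight
            return await process_paper(pid)

    # gather preserves submission order in its results
    return await asyncio.gather(*(guarded(p) for p in paper_ids))

async def fake_process(pid):            # stand-in worker for illustration
    await asyncio.sleep(0)
    return (pid, "A")

results = asyncio.run(run_batch(["2401.00001", "2401.00002"], fake_process, 2))
```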

4. Scaling Model

Horizontal Scaling (Primary Strategy)

| Component       | Scaling Axis          | Constraint                   | Target                                 |
|-----------------|-----------------------|------------------------------|----------------------------------------|
| Docling Workers | Replica count         | CPU-bound (PDF parsing)      | 1 worker per 2 CPU cores               |
| ar5iv Workers   | Replica count         | Network I/O bound            | 2–5 replicas per ar5iv rate limit tier |
| LaTeX Workers   | Replica count         | CPU-bound (pandoc)           | 1 worker per 2 CPU cores               |
| Fusion Engine   | Stateless, horizontal | Memory (component assembly)  | 1 instance per 10 concurrent papers    |
| Quality Scorer  | Stateless, horizontal | CPU (scoring calculations)   | Co-located with Fusion                 |
| UDOM Store      | Read replicas         | PostgreSQL connection limits | Primary + 2 read replicas              |

Throughput Model

| Configuration                               | Papers/Hour | Papers/Day | Notes             |
|---------------------------------------------|-------------|------------|-------------------|
| Single instance (1 worker each)             | ~120        | ~2,880     | Development/POC   |
| Small cluster (3 Docling, 2 ar5iv, 2 LaTeX) | ~500        | ~12,000    | Pilot deployment  |
| Production (10 Docling, 5 ar5iv, 5 LaTeX)   | ~1,800      | ~43,200    | Enterprise tenant |

Bottleneck Analysis

  1. ar5iv rate limiting — external service, cannot scale beyond their limits. Mitigation: cache ar5iv responses aggressively (papers don't change after publication).
  2. PostgreSQL write throughput — batch inserts of JSONB components. Mitigation: batch commits (10–50 papers per transaction), write-ahead log tuning.
  3. Network egress — downloading PDFs + HTML + LaTeX source. Mitigation: geographic co-location with arXiv CDN, local PDF cache.

5. Failure Modes

| Failure              | Detection                                   | Impact                              | Recovery                                                             |
|----------------------|---------------------------------------------|-------------------------------------|----------------------------------------------------------------------|
| Docling OOM          | Worker health check, memory threshold       | Single paper extraction fails       | Circuit breaker → restart with smaller batch; degrade to 2-source    |
| ar5iv unavailable    | HTTP 429/503 for >5 consecutive requests    | Math and table quality degrades     | Circuit breaker → exponential backoff; continue with Docling + LaTeX |
| LaTeX source missing | 404 from arXiv e-print endpoint             | Math fidelity slightly reduced      | Graceful degradation → 2-source mode (Docling + ar5iv)               |
| pandoc crash         | Non-zero exit code from subprocess          | Single paper LaTeX extraction fails | Retry with --from=latex+raw_tex; if persistent, skip LaTeX source    |
| Fusion conflict      | Multiple sources disagree on component type | Incorrect component classification  | Default to Docling structure; log conflict for offline analysis      |
| Quality gate failure | Score < threshold after max retries         | Paper below quality standard        | Store with warning flag; checkpoint for human review                 |
| PostgreSQL full      | Disk space alert                            | Cannot store new documents          | Alert → prune old batch runs; expand storage                         |
| NATS partition       | Worker can't connect to NATS                | Workers stall, no new extractions   | NATS cluster self-heals; workers reconnect with backoff              |
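The circuit-breaker behavior in the ar5iv row can be sketched as a toy state machine that trips after more than five consecutive failures; the backoff timer and half-open probing are omitted.

```python
# Toy circuit breaker for the ar5iv failure mode: trips after more than
# `threshold` consecutive failures (backoff and half-open probing omitted).
class CircuitBreaker:
    def __init__(self, threshold=5):
        self.threshold = threshold
        self.consecutive_failures = 0

    @property
    def open(self):
        return self.consecutive_failures > self.threshold

    def record(self, ok):
        # Any success closes the breaker; failures accumulate
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1

cb = CircuitBreaker()
for _ in range(6):           # six HTTP 429/503 responses in a row
    cb.record(ok=False)
# cb.open is now True: route around ar5iv, continue with Docling + LaTeX
```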

Graceful Degradation Matrix

| Sources Available       | Quality Impact                 | Expected Grade  |
|-------------------------|--------------------------------|-----------------|
| Docling + ar5iv + LaTeX | Full fidelity                  | A (0.87–0.94)   |
| Docling + ar5iv         | Slightly reduced math fidelity | A (0.85–0.90)   |
| Docling + LaTeX         | Reduced table quality          | A/B (0.82–0.89) |
| Docling only            | Baseline extraction            | B (0.75–0.84)   |
| ar5iv only              | No structural backbone         | B/C (0.70–0.80) |

6. Observability Story

Tracing Strategy

Each processed paper creates a single trace with spans for each extraction source, plus fusion, scoring, and storage. Traces are correlated across a batch via batch_id and per paper via paper_id.

Key Dashboards

| Dashboard       | Metrics                                            | Audience             |
|-----------------|----------------------------------------------------|----------------------|
| Pipeline Health | Worker uptime, error rates, circuit breaker state  | Platform engineering |
| Quality Trends  | Grade distribution over time, per-dimension trends | Quality engineering  |
| Throughput      | Papers/hour, processing time P50/P95/P99           | Operations           |
| Tenant Activity | Per-tenant extraction volume, corpus size growth   | Customer success     |
| Cost            | Compute hours per paper, storage growth rate       | Finance / platform   |

SLOs

| SLO                                         | Target | Measurement                |
|---------------------------------------------|--------|----------------------------|
| Quality: Grade A rate                       | ≥ 98%  | Rolling 7-day window       |
| Latency: P95 per paper                      | < 45s  | Per-extraction measurement |
| Availability: pipeline uptime               | 99.5%  | Monthly                    |
| Data integrity: component hash verification | 100%   | Every read operation       |
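The latency SLO check can be sketched as a nearest-rank P95 over a measurement window; the latency samples below are synthetic, and the measurement plumbing is omitted.

```python
# Sketch of the latency SLO check: nearest-rank P95 of per-paper
# processing times against the 45 s target (synthetic samples).
import math

def percentile(samples, pct):
    """Nearest-rank percentile; adequate for coarse SLO reporting."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

latencies_s = [12, 14, 15, 16, 18, 20, 22, 25, 30, 60]  # synthetic window
p95 = percentile(latencies_s, 95)   # the one slow outlier dominates
slo_met = p95 < 45                  # this window breaches the SLO
```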

7. Platform Boundary — Framework Provides vs. CODITECT Builds

| Capability        | Framework Provides              | CODITECT Builds                                                     |
|-------------------|---------------------------------|---------------------------------------------------------------------|
| PDF extraction    | Docling engine (IBM)            | Worker orchestration, error handling, circuit breakers              |
| HTML parsing      | BeautifulSoup/lxml              | ar5iv-specific extraction logic, math alttext handling              |
| LaTeX conversion  | Pandoc                          | Custom macro expansion, preamble parsing, source detection          |
| Document fusion   | —                               | Entire fusion engine (confidence-weighted, type-specific selection) |
| Quality scoring   | —                               | 9-dimension scoring system, Grade thresholds, regression detection  |
| Storage           | PostgreSQL, JSONB, GIN indexing | UDOM schema, RLS policies, tenant isolation, search queries         |
| Message bus       | NATS                            | Worker dispatch patterns, batch coordination, event schemas         |
| Observability     | OpenTelemetry, Prometheus       | UDOM-specific metrics, dashboards, SLOs, alerting rules             |
| Viewer            | KaTeX (CDN), JSON viewer lib    | UDOM Navigator (viewer.html), theme system, batch browser           |
| Agent integration | —                               | UDOM agent tools, semantic search, cross-paper comparison           |

Build vs. use ratio: ~60% CODITECT-built, ~40% leveraging existing frameworks. The fusion engine, quality scoring, and agent integration are pure CODITECT IP — these are the value differentiators.


System Design Document covers: context, components, data flows, scaling, failure modes, observability, and platform boundaries.