System Design Document: UDOM Pipeline
Version: 1.0 | Date: 2026-02-09
Classification: Architecture — System Design
Subsystem: UDOM (Universal Document Object Model) Pipeline within CODITECT Platform
1. Context Diagram
The UDOM Pipeline operates as a subsystem within CODITECT, mediating between external paper sources (arXiv, publishers) and CODITECT's agent layer. It transforms unstructured scientific documents into typed, queryable knowledge components.
                        ┌─────────────────────────────────────────┐
                        │            CODITECT Platform            │
                        │                                         │
┌──────────┐            │  ┌──────────────────────────────────┐   │
│  arXiv   │───PDF──────│──│                                  │   │
│  Server  │───LaTeX────│──│     UDOM Pipeline Subsystem      │   │
└──────────┘            │  │                                  │   │
                        │  │  [Extraction] → [Fusion] →       │   │
┌──────────┐            │  │  [Quality] → [Store]             │   │
│  ar5iv   │───HTML─────│──│                                  │   │
│  Server  │            │  └────────────────┬─────────────────┘   │
└──────────┘            │                   │                     │
                        │                   ▼                     │
┌──────────┐            │  ┌──────────────────────────────────┐   │
│Publisher │───API──────│──│     UDOM Store (PostgreSQL)      │   │
│   APIs   │            │  └────────────────┬─────────────────┘   │
└──────────┘            │                   │                     │
                        │                   ▼                     │
                        │  ┌──────────────────────────────────┐   │
                        │  │       CODITECT Agent Layer       │   │
                        │  │      (Research, Compliance,      │   │
                        │  │       Synthesis agents)          │   │
                        │  └──────────────────────────────────┘   │
                        │                                         │
                        │  ┌──────────────────────────────────┐   │
                        │  │   UDOM Navigator (viewer.html)   │◄──│── Human Users
                        │  └──────────────────────────────────┘   │
                        └─────────────────────────────────────────┘
Actors:
- arXiv Server — provides PDF, LaTeX source, and metadata via API
- ar5iv Server — provides LaTeXML-rendered HTML with preserved math/tables
- Publisher APIs — PubMed, IEEE, Springer, Elsevier (future adapters)
- CODITECT Agents — consume UDOM components for research synthesis
- Human Users — browse results via UDOM Navigator, review quality reports
2. Component Breakdown
2.1 Extraction Layer
Three independent, stateless workers operating in parallel:
| Worker | Input | Output | Performance | Strength |
|---|---|---|---|---|
| Docling Worker | PDF binary | UDOM components | ~5–7s/paper (62× pymupdf4llm) | Document structure, paragraphs, headings |
| ar5iv Worker | HTML page | UDOM components | ~2–3s/paper | Math (alttext), tables, inline formatting |
| LaTeX Worker | .tex source + pandoc | UDOM components | ~5–12s/paper | Display math, macros, citations, bibliography |
Each worker produces typed UDOMComponent objects with source, confidence, and position metadata.
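As an illustration of what such a typed object might look like, here is a minimal Python sketch. Only the source, confidence, and position fields come from the description above; the remaining field names, the `Source` enum, and the [0, 1] confidence range are assumptions made for the example, not the actual UDOM schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any


class Source(str, Enum):
    """The three extraction workers described in this section."""
    DOCLING = "docling"
    AR5IV = "ar5iv"
    LATEX = "latex"


@dataclass(frozen=True)
class UDOMComponent:
    """One typed unit of extracted content (hypothetical field layout)."""
    type: str           # e.g. "heading", "paragraph", "table" (names assumed)
    content: str        # extracted text, LaTeX, or table markup
    source: Source      # which worker produced this component
    confidence: float   # worker-reported confidence, assumed to lie in [0, 1]
    position: int       # reading-order index within the paper (assumed meaning)
    meta: dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Reject values outside the assumed confidence range early.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError(f"confidence out of range: {self.confidence}")
```

Keeping the object immutable is a design choice for the sketch: the Fusion Engine can then compare candidates from different workers without worrying about one being modified mid-merge.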
2.2 Fusion Engine
Deterministic component merger using confidence-weighted selection:
| Component Type | Primary Source | Fallback 1 | Fallback 2 | Selection Rationale |
|---|---|---|---|---|
| Heading | Docling | ar5iv | LaTeX | Docling best at structural detection |
| Paragraph | Docling | ar5iv | LaTeX | Docling preserves reading order |
| Equation (display) | LaTeX | ar5iv | Docling | LaTeX source is ground truth for math |
| Equation (inline) | ar5iv | LaTeX | Docling | ar5iv alttext captures inline math context |
| Table | ar5iv | Docling | LaTeX | ar5iv preserves HTML table structure |
| Figure | Docling | ar5iv | — | Docling detects figure boundaries; ar5iv for images |
| Citation | LaTeX | ar5iv | Docling | LaTeX source has precise citation keys |
| Bibliography | LaTeX | ar5iv | Docling | LaTeX BibTeX entries are most complete |
| Abstract | ar5iv | Docling | LaTeX | ar5iv has clean abstract section |
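The priority table can be read as a lookup followed by a fallback walk. The sketch below is one possible interpretation: it tries sources in table order and skips any candidate below a confidence floor. The floor value (`MIN_CONFIDENCE`), the component type keys, and the candidate dictionary shape are all assumptions; the document does not say exactly how confidence enters the selection.

```python
from typing import Optional

# Source priority per component type, transcribed from the table above.
PRIORITY = {
    "heading": ["docling", "ar5iv", "latex"],
    "paragraph": ["docling", "ar5iv", "latex"],
    "equation_display": ["latex", "ar5iv", "docling"],
    "equation_inline": ["ar5iv", "latex", "docling"],
    "table": ["ar5iv", "docling", "latex"],
    "figure": ["docling", "ar5iv"],
    "citation": ["latex", "ar5iv", "docling"],
    "bibliography": ["latex", "ar5iv", "docling"],
    "abstract": ["ar5iv", "docling", "latex"],
}

MIN_CONFIDENCE = 0.5  # assumed cut-off; not specified in this document


def select(component_type: str, candidates: dict[str, dict]) -> Optional[dict]:
    """Return the candidate from the highest-priority usable source.

    `candidates` maps a source name to {"content": ..., "confidence": float}.
    A source is usable if it produced a candidate at or above MIN_CONFIDENCE.
    """
    for source in PRIORITY[component_type]:
        cand = candidates.get(source)
        if cand is not None and cand["confidence"] >= MIN_CONFIDENCE:
            return {**cand, "source": source}
    return None  # no listed source produced a usable candidate
```

Because the walk order is fixed and no randomness is involved, the same inputs always yield the same choice, which matches the "deterministic" property claimed for the merger.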
2.3 Quality Scorer
Nine-dimension evaluation producing a weighted scalar score and a letter grade:
| Dimension | Weight | Measures | Grade A Threshold |
|---|---|---|---|
| Structure | 0.12 | Heading hierarchy completeness, section detection | ≥ 0.85 |
| Tables | 0.12 | Column alignment, header detection, cell content | ≥ 0.80 |
| Math | 0.15 | LaTeX validity, no broken delimiters, macro expansion | ≥ 0.85 |
| Citations | 0.10 | Reference detection, bracketed citation format | ≥ 0.80 |
| Images | 0.08 | Figure reference integrity, caption presence | ≥ 0.70 |
| Content Density | 0.10 | Words per component, no empty sections | ≥ 0.85 |
| LaTeX Residual | 0.13 | No stray \begin, \end, raw LaTeX in text | ≥ 0.90 |
| Heading Hierarchy | 0.10 | Monotonic nesting, no skipped levels | ≥ 0.85 |
| Bibliography | 0.10 | Reference completeness, author/year/title present | ≥ 0.80 |
Overall score = weighted sum of the nine dimension scores (the weights sum to 1.00). Grade A requires an overall score ≥ 0.85.
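The weighted sum can be sketched in a few lines of Python. The weights and the Grade A cut-off are taken from the table; the dictionary keys and the rounding to four decimals (to avoid floating-point noise exactly at the threshold) are assumptions for the example. Whether Grade A also requires every per-dimension threshold to be met is not stated here, so the sketch checks only the overall score.

```python
# Dimension weights from the table above (key names are assumed).
WEIGHTS = {
    "structure": 0.12,
    "tables": 0.12,
    "math": 0.15,
    "citations": 0.10,
    "images": 0.08,
    "content_density": 0.10,
    "latex_residual": 0.13,
    "heading_hierarchy": 0.10,
    "bibliography": 0.10,
}
GRADE_A = 0.85


def overall_score(scores: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores, each assumed to lie in [0, 1]."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise KeyError(f"missing dimensions: {sorted(missing)}")
    # Round so that float noise cannot push a borderline score below 0.85.
    return round(sum(WEIGHTS[d] * scores[d] for d in WEIGHTS), 4)


def is_grade_a(scores: dict[str, float]) -> bool:
    """True when the overall score reaches the Grade A cut-off."""
    return overall_score(scores) >= GRADE_A
```

Math carries the largest weight (0.15), so a paper scoring 0.60 on math and 0.90 everywhere else would lose 0.045 from its overall score relative to a uniform 0.90.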
2.4 UDOM Store
PostgreSQL with JSONB storage, GIN indexes, and full-text search:
CREATE TABLE udom_documents (
    id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id     UUID NOT NULL REFERENCES tenants(id),
    arxiv_id      TEXT NOT NULL,
    title         TEXT,
    components    JSONB NOT NULL,           -- Array of UDOMComponent
    quality_score JSONB NOT NULL,           -- 9-dimension scores
    source_stats  JSONB,                    -- Per-source extraction metadata
    corpus        TEXT DEFAULT 'default',   -- Named corpus for organization
    created_at    TIMESTAMPTZ DEFAULT NOW(),
    updated_at    TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE (tenant_id, arxiv_id)
);
-- RLS for multi-tenancy
ALTER TABLE udom_documents ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON udom_documents
    USING (tenant_id = current_setting('app.current_tenant')::UUID);
-- Performance indexes
CREATE INDEX idx_udom_components ON udom_documents USING GIN (components);
CREATE INDEX idx_udom_quality ON udom_documents USING GIN (quality_score);
CREATE INDEX idx_udom_corpus ON udom_documents (tenant_id, corpus);
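-- coalesce() keeps documents with a NULL title searchable (NULL || text yields NULL)
CREATE INDEX idx_udom_fulltext ON udom_documents
    USING GIN (to_tsvector('english', coalesce(title, '') || ' ' || components::text));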
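To show how an application might work with this schema, the Python sketch below builds two parameterized statements: one that sets the tenant for the current transaction so the RLS policy applies, and one JSONB containment query that the GIN index on components can serve. It is driver-agnostic and only returns SQL plus parameters using `%s` placeholders. The helper names and the assumption that each component element carries a "type" key are hypothetical.

```python
import json
import uuid


def tenant_context(tenant_id: uuid.UUID) -> tuple[str, tuple]:
    """Statement that sets app.current_tenant for the current transaction.

    set_config(name, value, true) is the parameterizable counterpart of
    SET LOCAL, so the tenant id never has to be interpolated into SQL text.
    """
    return "SELECT set_config('app.current_tenant', %s, true)", (str(tenant_id),)


def find_by_component_type(component_type: str, corpus: str = "default") -> tuple[str, tuple]:
    """Containment (@>) query over the components array.

    Assumes every array element has a "type" key (hypothetical field name).
    """
    sql = (
        "SELECT arxiv_id, title FROM udom_documents "
        "WHERE corpus = %s AND components @> %s::jsonb"
    )
    return sql, (corpus, json.dumps([{"type": component_type}]))
```

Both statements would be executed inside the same transaction, tenant context first, so that the policy filters rows before the containment predicate is evaluated.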
2.5 UDOM Navigator
Static HTML/JS viewer (viewer.html) for human browsing:
| Feature | Implementation |
|---|---|
| Light/dark theme | CSS custom properties, [data-theme="dark"] override |
| Batch run browser | Hamburger menu scanning run-* directories |
| KaTeX math rendering | CDN v0.16.11 with auto-render + custom macros |
| Component inspector | JSON viewer (@alenaksu/json-viewer) |
| Quality dashboard | Per-paper and batch-level score visualization |
3. Data & Control Flows
3.1 Single Paper Flow
1. Orchestrator receives paper_id
   │
2. Dispatches 3 extraction tasks (parallel)
   ├── NATS: udom.extract.docling → Docling Worker
   ├── NATS: udom.extract.ar5iv → ar5iv Worker
   └── NATS: udom.extract.latex → LaTeX Worker
   │
3. Workers extract components independently (~2–12s each, in parallel)
   │
4. All 3 results arrive at Fusion Engine
   │
5. Fusion Engine merges → canonical UDOM Document
   │
6. Quality Scorer evaluates 9 dimensions
   │
7. Quality Gate Decision:
   ├── Grade A (≥0.85) → Store in PostgreSQL, publish audit event
   └── Grade B/C (<0.85) → Retry with enhanced parameters (max 2 retries)
       └── If still below threshold → Store with quality warning, checkpoint for human review
   │
8. Audit event published to Event Bus (NATS)
   │
9. Available for agent consumption via UDOM Store API
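The control flow of steps 2 through 7 can be sketched with asyncio. This is a minimal illustration under stated assumptions: `extract` stands in for a NATS request to a worker, `fuse_and_score` is a placeholder whose score simply improves with each retry, and the status strings are invented for the example. None of these names come from the actual pipeline.

```python
import asyncio

GRADE_A = 0.85
MAX_RETRIES = 2
SOURCES = ("docling", "ar5iv", "latex")


async def extract(source: str, paper_id: str, attempt: int) -> dict:
    """Stand-in for a request on udom.extract.<source> (hypothetical stub)."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return {"source": source, "paper_id": paper_id, "attempt": attempt}


def fuse_and_score(results: list[dict]) -> float:
    """Placeholder for fusion plus scoring: the score rises with each retry."""
    return 0.80 + 0.03 * results[0]["attempt"]


async def process_paper(paper_id: str, scorer=fuse_and_score) -> dict:
    """Run the extract, fuse, score loop with at most MAX_RETRIES retries."""
    for attempt in range(MAX_RETRIES + 1):  # initial try plus retries
        # Step 2-4: dispatch all three extractions and wait for every result.
        results = await asyncio.gather(
            *(extract(s, paper_id, attempt) for s in SOURCES)
        )
        score = scorer(list(results))  # steps 5-6
        if score >= GRADE_A:  # step 7, Grade A branch
            return {"paper_id": paper_id, "score": score,
                    "status": "stored", "attempts": attempt + 1}
    # Step 7, fallback branch: keep the paper but flag it for human review.
    return {"paper_id": paper_id, "score": score,
            "status": "stored_with_warning", "attempts": MAX_RETRIES + 1}
```

With the placeholder scorer, `asyncio.run(process_paper("2401.00001"))` passes the gate on the third attempt; passing a scorer that always returns a low value exercises the warning path.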
3.2 Batch Flow
1. Batch submitted (list of paper_ids + configuration)
   │
2. Orchestrator creates batch record in State Store
   │
3. For each paper (bounded concurrency via semaphore):
   └── Execute Single Paper Flow (above)
   │
4. Batch-level quality aggregation:
   ├── Grade distribution (A/B/C counts)
   ├── Average processing time
   ├── Total component count
   └── Per-dimension score distributions
   │
5. Batch report written to State Store
   │
6. Audit trail: complete batch provenance chain
   │
7. UDOM Navigator updated (new run-* directory)
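Steps 3 and 4 of the batch flow, bounded concurrency and grade aggregation, might look like the following Python sketch. The `process` callable stands for the single-paper flow and is supplied by the caller. The concurrency limit and the B/C boundary at 0.75 are assumptions, since this document only defines the Grade A cut-off.

```python
import asyncio
from collections import Counter


def grade(score: float) -> str:
    """Map a score to a letter grade.

    The 0.85 cut-off for A is from this document; the 0.75 boundary
    between B and C is an assumed placeholder.
    """
    if score >= 0.85:
        return "A"
    return "B" if score >= 0.75 else "C"


async def run_batch(paper_ids, process, max_concurrency: int = 4) -> dict:
    """Process papers with at most max_concurrency in flight, then aggregate."""
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(pid):
        async with sem:  # blocks while max_concurrency papers are running
            return await process(pid)

    results = await asyncio.gather(*(guarded(p) for p in paper_ids))
    return {
        "papers": len(results),
        "grades": dict(Counter(grade(r["score"]) for r in results)),
        "mean_score": (sum(r["score"] for r in results) / len(results)
                       if results else 0.0),
    }
```

The semaphore bounds how many papers are being processed at once without limiting how many are queued, which keeps worker load predictable regardless of batch size.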
4. Scaling Model
Horizontal Scaling (Primary Strategy)
| Component | Scaling Axis | Constraint | Target |
|---|---|---|---|
| Docling Workers | Replica count | CPU-bound (PDF parsing) | 1 worker per 2 CPU cores |
| ar5iv Workers | Replica count | Network I/O bound | 2–5 replicas per ar5iv rate limit tier |
| LaTeX Workers | Replica count | CPU-bound (pandoc) | 1 worker per 2 CPU cores |
| Fusion Engine | Stateless, horizontal | Memory (component assembly) | 1 instance per 10 concurrent papers |
| Quality Scorer | Stateless, horizontal | CPU (scoring calculations) | Co-located with Fusion |
| UDOM Store | Read replicas | PostgreSQL connection limits | Primary + 2 read replicas |
Throughput Model
| Configuration | Papers/Hour | Papers/Day | Notes |
|---|---|---|---|
| Single instance (1 worker each) | ~120 | ~2,880 | Development/POC |
| Small cluster (3 Docling, 2 ar5iv, 2 LaTeX) | ~500 | ~12,000 | Pilot deployment |
| Production (10 Docling, 5 ar5iv, 5 LaTeX) | ~1,800 | ~43,200 | Enterprise tenant |
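The Papers/Day column is the hourly rate multiplied by 24, so it assumes sustained, uninterrupted operation. The short Python check below reproduces that arithmetic and derives the implied wall-clock time per paper; the configuration keys are shorthand introduced for the example.

```python
# Hourly throughput figures from the table above.
PAPERS_PER_HOUR = {
    "single_instance": 120,
    "small_cluster": 500,
    "production": 1800,
}


def papers_per_day(per_hour: int) -> int:
    """Daily volume, assuming the hourly rate is sustained for 24 hours."""
    return per_hour * 24


def seconds_per_paper(per_hour: int) -> float:
    """Effective wall-clock seconds per paper across the whole deployment."""
    return 3600 / per_hour
```

At ~120 papers/hour the single-instance configuration averages 30 seconds of wall-clock time per paper, while the production configuration averages 2 seconds because many papers are in flight at once.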
Bottleneck Analysis
- ar5iv rate limiting — external service, cannot scale beyond their limits. Mitigation: cache ar5iv responses aggressively (papers don't change after publication).
- PostgreSQL write throughput — batch inserts of JSONB components. Mitigation: batch commits (10–50 papers per transaction), write-ahead log tuning.
- Network egress — downloading PDFs + HTML + LaTeX source. Mitigation: geographic co-location with arXiv CDN, local PDF cache.
5. Failure Modes
| Failure | Detection | Impact | Recovery |
|---|---|---|---|
| Docling OOM | Worker health check, memory threshold | Single paper extraction fails | Circuit breaker → restart with smaller batch; degrade to 2-source |
| ar5iv unavailable | HTTP 429/503 for >5 consecutive requests | Math and table quality degrades | Circuit breaker → exponential backoff; continue with Docling + LaTeX |
| LaTeX source missing | 404 from arXiv e-print endpoint | Math fidelity slightly reduced | Graceful degradation → 2-source mode (Docling + ar5iv) |
| pandoc crash | Non-zero exit code from subprocess | Single paper LaTeX extraction fails | Retry with --from=latex+raw_tex; if persistent, skip LaTeX source |
| Fusion conflict | Multiple sources disagree on component type | Incorrect component classification | Default to Docling structure; log conflict for offline analysis |
| Quality gate failure | Score < threshold after max retries | Paper below quality standard | Store with warning flag; checkpoint for human review |
| PostgreSQL full | Disk space alert | Cannot store new documents | Alert → prune old batch runs; expand storage |
| NATS partition | Worker can't connect to NATS | Workers stall, no new extractions | NATS cluster self-heals; workers reconnect with backoff |
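Several rows above rely on a circuit breaker with exponential backoff. A minimal Python sketch of that pattern follows, using the "more than 5 consecutive failures" trigger from the ar5iv row. The class name, the delay values, and the injectable clock are assumptions for illustration and do not describe the actual implementation.

```python
import time


class CircuitBreaker:
    """Opens after more than `failure_threshold` consecutive failures."""

    def __init__(self, failure_threshold: int = 5, base_delay: float = 1.0,
                 max_delay: float = 60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.base_delay = base_delay    # first open period in seconds (assumed)
        self.max_delay = max_delay      # cap on the open period (assumed)
        self.clock = clock              # injectable for testing
        self.failures = 0               # consecutive failures so far
        self.trips = 0                  # times opened since the last success
        self.open_until = 0.0

    def allow(self) -> bool:
        """True when a request may be attempted."""
        return self.clock() >= self.open_until

    def record_success(self) -> None:
        """Close the breaker and reset the backoff."""
        self.failures = 0
        self.trips = 0
        self.open_until = 0.0

    def record_failure(self) -> None:
        """Count a failure; open (or re-open) with a doubled delay if over threshold."""
        self.failures += 1
        if self.failures > self.failure_threshold:
            delay = min(self.base_delay * 2 ** self.trips, self.max_delay)
            self.open_until = self.clock() + delay
            self.trips += 1
```

While the breaker is open, the orchestrator would skip the affected source and continue in two-source mode, as the recovery column describes.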
Graceful Degradation Matrix
| Sources Available | Quality Impact | Expected Grade |
|---|---|---|
| Docling + ar5iv + LaTeX | Full fidelity | A (0.87–0.94) |
| Docling + ar5iv | Slightly reduced math fidelity | A (0.85–0.90) |
| Docling + LaTeX | Reduced table quality | A/B (0.82–0.89) |
| Docling only | Baseline extraction | B (0.75–0.84) |
| ar5iv only | No structural backbone | B/C (0.70–0.80) |
6. Observability Story
Tracing Strategy
Each processed paper produces a single trace with spans for each extraction source, fusion, scoring, and storage. Traces are correlated across a batch via batch_id and per paper via paper_id.
Key Dashboards
| Dashboard | Metrics | Audience |
|---|---|---|
| Pipeline Health | Worker uptime, error rates, circuit breaker state | Platform engineering |
| Quality Trends | Grade distribution over time, per-dimension trends | Quality engineering |
| Throughput | Papers/hour, processing time P50/P95/P99 | Operations |
| Tenant Activity | Per-tenant extraction volume, corpus size growth | Customer success |
| Cost | Compute hours per paper, storage growth rate | Finance / platform |
SLOs
| SLO | Target | Measurement |
|---|---|---|
| Quality: Grade A rate | ≥ 98% | Rolling 7-day window |
| Latency: P95 per paper | < 45s | Per-extraction measurement |
| Availability: pipeline uptime | 99.5% | Monthly |
| Data integrity: component hash verification | 100% | Every read operation |
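The data-integrity SLO implies that a hash is stored with each component and rechecked on read. One way this could work is sketched below in Python, assuming SHA-256 over a canonical JSON serialization. The hash algorithm, the canonical form, and the function names are assumptions, as the document does not specify them.

```python
import hashlib
import json


def component_hash(component: dict) -> str:
    """SHA-256 over a canonical JSON form (sorted keys, compact separators)."""
    canonical = json.dumps(component, sort_keys=True,
                           separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_on_read(component: dict, stored_hash: str) -> dict:
    """Return the component if its hash matches, otherwise raise."""
    if component_hash(component) != stored_hash:
        raise ValueError("component hash mismatch")
    return component
```

Sorting keys before hashing matters here because JSONB does not preserve key order, so the same component can come back from PostgreSQL with its keys arranged differently from when it was written.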
7. Platform Boundary — Framework Provides vs. CODITECT Builds
| Capability | Framework Provides | CODITECT Builds |
|---|---|---|
| PDF extraction | Docling engine (IBM) | Worker orchestration, error handling, circuit breakers |
| HTML parsing | BeautifulSoup/lxml | ar5iv-specific extraction logic, math alttext handling |
| LaTeX conversion | Pandoc | Custom macro expansion, preamble parsing, source detection |
| Document fusion | — | Entire fusion engine (confidence-weighted, type-specific selection) |
| Quality scoring | — | 9-dimension scoring system, Grade thresholds, regression detection |
| Storage | PostgreSQL, JSONB, GIN indexing | UDOM schema, RLS policies, tenant isolation, search queries |
| Message bus | NATS | Worker dispatch patterns, batch coordination, event schemas |
| Observability | OpenTelemetry, Prometheus | UDOM-specific metrics, dashboards, SLOs, alerting rules |
| Viewer | KaTeX (CDN), JSON viewer lib | UDOM Navigator (viewer.html), theme system, batch browser |
| Agent integration | — | UDOM agent tools, semantic search, cross-paper comparison |
Build vs. use ratio: ~60% CODITECT-built, ~40% leveraging existing frameworks. The fusion engine, quality scoring, and agent integration are pure CODITECT IP — these are the value differentiators.
This System Design Document covers context, components, data flows, scaling, failure modes, observability, and platform boundaries.