ADR-001: Adoption of 3-Source UDOM Extraction Pipeline
Status
Accepted
Context
CODITECT's autonomous agent platform requires machine-readable scientific literature to enable research discovery agents in regulated industries. Single-source PDF extraction (pymupdf4llm) achieved only 46% Grade A quality — insufficient for agent consumption where mathematical precision and tabular structure are critical for regulated decision-making. Iterative improvements (v2.5 hybrid) reached 80% but plateaued due to fundamental limitations of single-engine extraction.
Decision
Adopt a 3-source extraction architecture combining Docling PDF engine, ar5iv HTML (LaTeXML-rendered), and arXiv LaTeX source (pandoc conversion), fused into a Universal Document Object Model (UDOM) with 25 typed components. Each source contributes its strengths: Docling for structure, ar5iv for math/tables, LaTeX for display equations and citations. Quality is validated via a 9-dimension scoring system with Grade A threshold of 0.85.
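The fusion step can be sketched as a per-component-type source-preference merge. This is a simplified illustration, not the real pipeline code: the `Component` shape, the `PREFERENCE` table (an illustrative subset of the 25 typed UDOM components), and the `grade` helper are all assumptions; only the preference ordering (Docling for structure, ar5iv for math/tables, LaTeX for display equations and citations) and the 0.85 Grade A threshold come from the decision above.

```python
from dataclasses import dataclass

@dataclass
class Component:
    type: str      # e.g. "heading", "table", "display_equation" (illustrative names)
    content: str
    source: str    # "docling", "ar5iv", or "latex"

# Preferred source per component type, most-trusted first. Illustrative
# subset of the 25 typed UDOM components; not the real schema.
PREFERENCE = {
    "heading": ["docling", "ar5iv", "latex"],
    "table": ["ar5iv", "docling", "latex"],
    "inline_math": ["ar5iv", "latex", "docling"],
    "display_equation": ["latex", "ar5iv", "docling"],
    "citation": ["latex", "ar5iv", "docling"],
}

def fuse(candidates: list[Component]) -> list[Component]:
    """Group candidate components by type and keep, for each type, the
    version from the highest-preference source that produced one.
    (A real fusion engine would align components positionally, too.)"""
    by_type: dict[str, list[Component]] = {}
    for c in candidates:
        by_type.setdefault(c.type, []).append(c)
    fused = []
    for ctype, comps in by_type.items():
        order = PREFERENCE.get(ctype, ["docling", "ar5iv", "latex"])
        comps.sort(key=lambda c: order.index(c.source))
        fused.append(comps[0])
    return fused

def grade(score: float) -> str:
    """Map a fused 9-dimension quality score to a letter grade;
    only the 0.85 Grade A cutoff is from the ADR."""
    return "A" if score >= 0.85 else "B"
```

The deterministic preference table is what keeps extraction LLM-token-free: source selection is a lookup, not a model call.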
Consequences
Positive:
- 100% Grade A on the 135/218 papers processed to date (vs. 46% with single-source)
- Processing time of 10–35s per paper is operationally viable for batch and near-real-time use
- 3-source fusion creates a quality ceiling far above any single engine
- Typed components enable structured agent tool-use (search equations, filter tables)
- Zero LLM tokens consumed during extraction — pure deterministic processing
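The typed-component benefit can be illustrated with a minimal query helper: because every UDOM block carries a type tag, an agent tool can filter equations or tables directly, without re-parsing the document. The component dictionaries and `find_components` function below are hypothetical, not the real UDOM API.

```python
# Hypothetical sketch: typed components make structured agent tool-use
# (search equations, filter tables) a simple filter over the UDOM.
def find_components(udom: list[dict], ctype: str, query: str = "") -> list[dict]:
    """Return components of the given type whose content contains query
    (case-insensitive); empty query matches all components of the type."""
    return [
        c for c in udom
        if c["type"] == ctype and query.lower() in c["content"].lower()
    ]

# Toy UDOM fragment (invented content, illustrative type names).
udom = [
    {"type": "display_equation", "content": r"\frac{dC}{dt} = -kC"},
    {"type": "table", "content": "Dose (mg) | Response (%)"},
    {"type": "paragraph", "content": "We fit the dose-response curve."},
]

equations = find_components(udom, "display_equation")
tables = find_components(udom, "table", query="dose")
```

A flat-markdown extraction would force the agent to regex-hunt for `$...$` spans and pipe characters instead; the type tags move that work into the pipeline, once.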
Negative:
- Pipeline complexity increases (3 workers, fusion engine, quality scorer vs. 1 worker)
- Dependency on external services (ar5iv availability, arXiv API rate limits)
- Higher compute cost per paper (~3× single source) — offset by dramatically higher quality
Neutral:
- Architecture follows CODITECT's existing orchestrator-workers pattern — no new infrastructure patterns needed
- ar5iv coverage is not 100% — graceful degradation to 2-source mode handles gaps
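The 2-source fallback amounts to a small source-selection step before fusion. A minimal sketch, assuming Docling (PDF) is always available and the other two sources are probed first; the function names and return shapes are assumptions, not the real pipeline interface.

```python
# Hypothetical sketch of graceful degradation: if ar5iv has no render
# for a paper, fusion proceeds in 2-source mode (Docling + LaTeX).
def select_sources(ar5iv_available: bool, latex_available: bool) -> list[str]:
    """Pick the extraction sources for one paper; Docling runs on the
    PDF and is assumed always available."""
    sources = ["docling"]
    if ar5iv_available:
        sources.append("ar5iv")
    if latex_available:
        sources.append("latex")
    return sources

def fusion_mode(sources: list[str]) -> str:
    """Label the fusion mode by how many sources survived probing."""
    return {3: "3-source", 2: "2-source", 1: "single-source"}[len(sources)]
```

Keeping the degradation decision explicit (rather than letting a missing source fail mid-fusion) is what makes the coverage gap a Neutral consequence instead of a Negative one.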
Alternatives Considered
- Enhanced single-source (pymupdf4llm with AI post-processing): Reached an 80% Grade A ceiling. Rejected because AI post-processing adds LLM token cost and cannot recover information lost during PDF extraction.
- LlamaParse commercial API: Better math handling than pymupdf4llm, but still single-source, with vendor lock-in risk. Rejected because multi-source fusion architecturally surpasses any single-engine improvement.
- Marker (open-source PDF conversion tool): Good table extraction but weak on LaTeX math; the single-source limitation applies. Rejected for the same reason as LlamaParse.
- Custom OCR pipeline (Tesseract + LaTeX reconstruction): Highest potential math fidelity, but extremely slow (~2–5 min/paper) and brittle. Rejected as operationally infeasible at scale.