ADR-001: Adoption of 3-Source UDOM Extraction Pipeline
Status
Accepted
Context
CODITECT's autonomous agent platform requires machine-readable scientific literature to enable research discovery agents in regulated industries. Single-source PDF extraction (pymupdf4llm) achieved only 46% Grade A quality — insufficient for agent consumption where mathematical precision and tabular structure are critical for regulated decision-making. Iterative improvements (v2.5 hybrid) reached 80% but plateaued due to fundamental limitations of single-engine extraction.
Decision
Adopt a 3-source extraction architecture combining Docling PDF engine, ar5iv HTML (LaTeXML-rendered), and arXiv LaTeX source (pandoc conversion), fused into a Universal Document Object Model (UDOM) with 25 typed components. Each source contributes its strengths: Docling for structure, ar5iv for math/tables, LaTeX for display equations and citations. Quality is validated via a 9-dimension scoring system with Grade A threshold of 0.85.
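The fusion step can be sketched as a per-component-type source-preference merge. This is a simplified illustration, not the real pipeline code: the `Component` shape, the `PREFERENCE` table (an illustrative subset of the 25 typed UDOM components), and the `grade` helper are all assumptions; only the preference ordering (Docling for structure, ar5iv for math/tables, LaTeX for display equations and citations) and the 0.85 Grade A threshold come from the decision above.

```python
from dataclasses import dataclass

@dataclass
class Component:
    type: str      # e.g. "heading", "table", "display_equation" (illustrative names)
    content: str
    source: str    # "docling", "ar5iv", or "latex"

# Preferred source per component type, most-trusted first. Illustrative
# subset of the 25 typed UDOM components; not the real schema.
PREFERENCE = {
    "heading": ["docling", "ar5iv", "latex"],
    "table": ["ar5iv", "docling", "latex"],
    "inline_math": ["ar5iv", "latex", "docling"],
    "display_equation": ["latex", "ar5iv", "docling"],
    "citation": ["latex", "ar5iv", "docling"],
}

def fuse(candidates: list[Component]) -> list[Component]:
    """Group candidate components by type and keep, for each type, the
    version from the highest-preference source that produced one.
    (A real fusion engine would align components positionally, too.)"""
    by_type: dict[str, list[Component]] = {}
    for c in candidates:
        by_type.setdefault(c.type, []).append(c)
    fused = []
    for ctype, comps in by_type.items():
        order = PREFERENCE.get(ctype, ["docling", "ar5iv", "latex"])
        comps.sort(key=lambda c: order.index(c.source))
        fused.append(comps[0])
    return fused

def grade(score: float) -> str:
    """Map a fused 9-dimension quality score to a letter grade;
    only the 0.85 Grade A cutoff is from the ADR."""
    return "A" if score >= 0.85 else "B"
```

The deterministic preference table is what keeps extraction LLM-token-free: source selection is a lookup, not a model call.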
Consequences
Positive:
- 100% Grade A on the 135/218 papers processed to date (vs. 46% with single-source)
- Processing time of 10–35s per paper is operationally viable for batch and near-real-time use
- 3-source fusion creates a quality ceiling far above any single engine
- Typed components enable structured agent tool-use (search equations, filter tables)
- Zero LLM tokens consumed during extraction — pure deterministic processing
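The typed-component benefit can be illustrated with a minimal query helper: because every UDOM block carries a type tag, an agent tool can filter equations or tables directly, without re-parsing the document. The component dictionaries and `find_components` function below are hypothetical, not the real UDOM API.

```python
# Hypothetical sketch: typed components make structured agent tool-use
# (search equations, filter tables) a simple filter over the UDOM.
def find_components(udom: list[dict], ctype: str, query: str = "") -> list[dict]:
    """Return components of the given type whose content contains query
    (case-insensitive); empty query matches all components of the type."""
    return [
        c for c in udom
        if c["type"] == ctype and query.lower() in c["content"].lower()
    ]

# Toy UDOM fragment (invented content, illustrative type names).
udom = [
    {"type": "display_equation", "content": r"\frac{dC}{dt} = -kC"},
    {"type": "table", "content": "Dose (mg) | Response (%)"},
    {"type": "paragraph", "content": "We fit the dose-response curve."},
]

equations = find_components(udom, "display_equation")
tables = find_components(udom, "table", query="dose")
```

A flat-markdown extraction would force the agent to regex-hunt for `$...$` spans and pipe characters instead; the type tags move that work into the pipeline, once.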
Negative:
- Pipeline complexity increases (3 workers, fusion engine, quality scorer vs. 1 worker)
- Dependency on external services (ar5iv availability, arXiv API rate limits)
- Higher compute cost per paper (~3× single source) — offset by dramatically higher quality
Neutral:
- Architecture follows CODITECT's existing orchestrator-workers pattern — no new infrastructure patterns needed
- ar5iv coverage is not 100% — graceful degradation to 2-source mode handles gaps
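The 2-source fallback amounts to a small source-selection step before fusion. A minimal sketch, assuming Docling (PDF) is always available and the other two sources are probed first; the function names and return shapes are assumptions, not the real pipeline interface.

```python
# Hypothetical sketch of graceful degradation: if ar5iv has no render
# for a paper, fusion proceeds in 2-source mode (Docling + LaTeX).
def select_sources(ar5iv_available: bool, latex_available: bool) -> list[str]:
    """Pick the extraction sources for one paper; Docling runs on the
    PDF and is assumed always available."""
    sources = ["docling"]
    if ar5iv_available:
        sources.append("ar5iv")
    if latex_available:
        sources.append("latex")
    return sources

def fusion_mode(sources: list[str]) -> str:
    """Label the fusion mode by how many sources survived probing."""
    return {3: "3-source", 2: "2-source", 1: "single-source"}[len(sources)]
```

Keeping the degradation decision explicit (rather than letting a missing source fail mid-fusion) is what makes the coverage gap a Neutral consequence instead of a Negative one.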
Alternatives Considered
- Enhanced single-source (pymupdf4llm with AI post-processing): Reached an 80% Grade A ceiling. Rejected because AI post-processing adds LLM token cost and cannot recover information lost during PDF extraction.
- LlamaParse commercial API: Better math handling than pymupdf4llm, but still single-source, with vendor lock-in risk. Rejected because multi-source fusion architecturally surpasses any single-engine improvement.
- Marker (open-source PDF conversion tool): Good table extraction but weak on LaTeX math; the single-source limitation applies. Rejected for the same reason as LlamaParse.
- Custom OCR pipeline (Tesseract + LaTeX reconstruction): Highest potential math fidelity, but extremely slow (~2–5 min/paper) and brittle. Rejected as operationally infeasible at scale.