ADR-006: Zero LLM Token Extraction Architecture
Status
Accepted
Context
CODITECT's token economics show 4–15× cost multipliers for agent workflows (Section 8 of system prompt). Any LLM dependency in the extraction pipeline would multiply costs across every paper processed. With batch sizes of 100–10,000 papers, even $0.01 per paper in LLM costs becomes significant. Additionally, LLM-based extraction is non-deterministic — the same paper may produce different output on re-extraction, which violates compliance audit requirements.
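A quick back-of-envelope check of these figures, using the batch sizes above and the $0.05–0.50/paper range cited later under Alternatives Considered (computed in integer cents to keep the arithmetic exact):

```python
# Illustrative cost arithmetic only; batch size and per-paper costs are
# the figures quoted in this ADR, not measured values.
low_cents, high_cents = 1, 50   # $0.01 ... $0.50 per paper, in cents
batch = 10_000                  # upper end of the stated batch sizes

print(batch * low_cents / 100)   # dollars per batch at $0.01/paper
print(batch * high_cents / 100)  # dollars per batch at $0.50/paper
```

Even the "cheap" end is $100 per 10,000-paper batch, and re-extraction for compliance audits would pay it again each time.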
Decision
The UDOM extraction pipeline uses zero LLM tokens. All extraction (Docling, ar5iv, LaTeX), fusion (confidence-weighted), and quality scoring (9-dimension) are purely deterministic operations. LLM tokens are consumed only when CODITECT agents consume UDOM output for synthesis tasks — and even then, UDOM's typed components reduce token consumption by 47–90% compared to raw PDF text.
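The fusion step can be sketched as a pure function over per-source candidates. The source names mirror the three extractors above, but the `Candidate` type, the tie-break order, and the confidences are illustrative assumptions, not the real UDOM schema:

```python
# Hypothetical sketch of deterministic confidence-weighted fusion.
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    source: str       # "docling", "ar5iv", or "latex"
    content: str
    confidence: float

def fuse(candidates: list[Candidate]) -> Candidate:
    """Pick the highest-confidence candidate; ties break on a fixed
    source order so the same inputs always give the same output."""
    order = {"latex": 0, "ar5iv": 1, "docling": 2}  # assumed priority
    return min(candidates, key=lambda c: (-c.confidence, order[c.source]))

best = fuse([
    Candidate("docling", "E = mc^2 (garbled)", 0.70),
    Candidate("ar5iv",   "E = mc^2",           0.95),
])
```

Because there is no sampling anywhere in the function, re-running it on the same paper is guaranteed to reproduce the same UDOM output.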
Consequences
Positive:
- Extraction cost is pure compute (CPU/memory), not per-token — dramatically cheaper at scale
- Deterministic: same paper always produces identical UDOM output (compliance requirement)
- No model version dependency in extraction — Docling version changes are manageable; LLM version changes would affect every paper
- Token savings compound: 47% reduction for full-document agent consumption, 90% for targeted queries
Negative:
- Cannot use LLM "intelligence" to resolve ambiguous extraction cases (e.g., is this a table or a figure?)
- Quality ceiling is bounded by deterministic algorithms — an LLM might achieve higher fidelity on edge cases
Neutral:
- Future enhancement: optional LLM-based post-processing for papers that fail quality gate, consuming tokens only for the ~2% that need it
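The deferred enhancement could be gated deterministically, so tokens are spent only on quality-gate failures. The function name, threshold, and dimension names below are hypothetical stand-ins for the real 9-dimension scoring:

```python
# Hypothetical sketch: route only gate failures (~2% of papers) to an
# optional LLM post-processing step. Threshold and dimensions are
# illustrative, not the actual UDOM quality rubric.

def needs_llm_review(scores: dict[str, float], threshold: float = 0.85) -> bool:
    """A paper fails the gate if its mean dimension score is below threshold."""
    return sum(scores.values()) / len(scores) < threshold

needs_llm_review({"structure": 0.90, "math": 0.60})  # mean 0.75 -> fails gate
needs_llm_review({"structure": 0.95, "math": 0.90})  # mean 0.925 -> passes
```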
Alternatives Considered
- LLM-enhanced extraction (AI reviews and fixes each component): higher potential quality, but $0.05–0.50/paper in LLM costs, non-deterministic, and 10× slower. Rejected for cost and compliance reasons.
- Hybrid (deterministic extraction + LLM quality review): adds LLM cost proportional to papers processed. Deferred as an optional enhancement for quality-gate failures only.
ADR-007: Docling as Primary PDF Extraction Engine
Status
Accepted
Context
Multiple PDF extraction libraries were evaluated for the primary extraction engine: pymupdf4llm (original choice, v1.0–v2.5), Docling (IBM), Marker (Meta), LlamaParse (commercial), and Unstructured (open-source). Selection criteria: extraction speed, structural fidelity, math handling, open-source availability, and active maintenance.
Decision
Adopt Docling (IBM) as the primary PDF extraction engine, replacing pymupdf4llm. Docling provides 62× speed improvement (~5–7s vs. ~5 minutes per paper with pymupdf4llm), superior structural detection (headings, sections, document hierarchy), and active IBM maintenance. Docling runs locally (no API dependency), is MIT-licensed, and produces structured output that maps naturally to UDOM component types.
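A sketch of the "maps naturally" claim: structural labels from the extractor dispatch to UDOM component types through a plain lookup table. The label strings and type names below are illustrative assumptions, not the actual Docling labels or UDOM schema:

```python
# Hypothetical label-to-component mapping; real Docling item labels and
# UDOM type names will differ.
DOCLING_TO_UDOM = {
    "section_header": "Heading",
    "text":           "Paragraph",
    "table":          "Table",
    "formula":        "Equation",
    "picture":        "Figure",
}

def to_udom_type(docling_label: str) -> str:
    # Unknown labels fall back to a generic block instead of failing.
    return DOCLING_TO_UDOM.get(docling_label, "GenericBlock")
```

A flat, total mapping like this keeps the conversion deterministic and makes regression testing against a reference corpus straightforward.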
Consequences
Positive:
- 62× speed improvement enables batch processing at production scale
- Structural detection (headings, sections) provides reliable document backbone for fusion
- Local execution — no external API dependency, no per-call cost
- MIT license — no commercial restrictions
Negative:
- Docling's math extraction is mediocre (confidence 0.70) — mitigated by ar5iv and LaTeX sources
- IBM may change maintenance priority — mitigated by version pinning and reference corpus regression tests
- ~1.2GB container image due to model files
Neutral:
- Docling v2 API differs from v1; pin to `docling>=2.0,<3.0` with a compatibility wrapper
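The pin could additionally be enforced with a small runtime guard. `docling_version_ok` is a hypothetical helper, and real code would likely use `packaging.version` rather than naive string parsing:

```python
# Minimal sketch of a guard for the >=2.0,<3.0 pin; simplified version
# parsing (no pre-release or build-metadata handling).

def docling_version_ok(version: str) -> bool:
    """True only for Docling 2.x releases."""
    return int(version.split(".")[0]) == 2
```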
Alternatives Considered
- pymupdf4llm (original): 62× slower, limited structural detection. Rejected as primary engine; retained as emergency fallback.
- Marker: good table extraction but weak on LaTeX math; similar speed to Docling. Rejected — Docling's structural output is more natural for UDOM component mapping.
- LlamaParse: commercial API with usage-based pricing. Better math than pymupdf4llm but single-source limitation. Rejected — vendor dependency, per-call cost, and privacy concerns for regulated content.
- Unstructured: good general-purpose extraction but less specialized for scientific papers. Rejected — Docling's academic paper handling is superior.