ADR-006: Zero LLM Token Extraction Architecture

Status

Accepted

Context

CODITECT's token economics show 4–15× cost multipliers for agent workflows (Section 8 of the system prompt). Any LLM dependency in the extraction pipeline would multiply costs across every paper processed. With batch sizes of 100–10,000 papers, even $0.01 per paper in LLM costs becomes significant. Additionally, LLM-based extraction is non-deterministic — the same paper may produce different output on re-extraction, which violates compliance audit requirements.

Decision

The UDOM extraction pipeline uses zero LLM tokens. All extraction (Docling, ar5iv, LaTeX), fusion (confidence-weighted), and quality scoring (9-dimension) are purely deterministic operations. LLM tokens are consumed only when CODITECT agents consume UDOM output for synthesis tasks — and even then, UDOM's typed components reduce token consumption by 47–90% compared to raw PDF text.
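The deterministic, confidence-weighted fusion step can be sketched as below. This is a minimal illustration, not the actual UDOM implementation; the source names, confidence values, and component shape are assumptions made for the example:

```python
# Hypothetical confidence-weighted fusion: for each component slot, the
# candidate from the source with the highest confidence wins. No LLM call,
# no randomness — the same inputs always produce the same output.

def fuse(candidates: dict) -> dict:
    """Pick the candidate with the highest confidence score.

    candidates maps source name -> {"confidence": float, "content": str}.
    Ties break on source name so re-extraction is always reproducible.
    """
    best_source = max(candidates, key=lambda s: (candidates[s]["confidence"], s))
    winner = dict(candidates[best_source])
    winner["source"] = best_source
    return winner

# Example: one equation component extracted by three sources.
equation_candidates = {
    "docling": {"confidence": 0.70, "content": "E = mc2"},
    "ar5iv":   {"confidence": 0.95, "content": "E = mc^2"},
    "latex":   {"confidence": 0.98, "content": "E = mc^{2}"},
}
fused = fuse(equation_candidates)
# fused["source"] == "latex"
```

Because the selection rule is a pure function of its inputs, re-running fusion on the same paper yields byte-identical output, which is what the compliance requirement demands.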

Consequences

Positive:

  • Extraction cost is pure compute (CPU/memory), not per-token — dramatically cheaper at scale
  • Deterministic: same paper always produces identical UDOM output (compliance requirement)
  • No model version dependency in extraction — Docling version changes are manageable; LLM version changes would affect every paper
  • Token savings compound: 47% reduction for full-document agent consumption, 90% for targeted queries
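The targeted-query savings in the last bullet can be illustrated with a toy sketch; the component types and token counts below are invented for the example, not real UDOM figures:

```python
# Hypothetical typed UDOM components with pre-computed token counts.
components = [
    {"type": "heading",   "tokens": 6},
    {"type": "paragraph", "tokens": 850},
    {"type": "equation",  "tokens": 40},
    {"type": "table",     "tokens": 120},
    {"type": "paragraph", "tokens": 900},
]

def tokens_for(components, wanted_types=None):
    """Sum token cost; a targeted query selects only the needed types."""
    if wanted_types is None:
        return sum(c["tokens"] for c in components)
    return sum(c["tokens"] for c in components if c["type"] in wanted_types)

full = tokens_for(components)                             # 1916
targeted = tokens_for(components, {"equation", "table"})  # 160
# A targeted query feeds the agent ~92% fewer tokens on this toy input.
```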

Negative:

  • Cannot use LLM "intelligence" to resolve ambiguous extraction cases (e.g., is this a table or a figure?)
  • Quality ceiling is bounded by deterministic algorithms — an LLM might achieve higher fidelity on edge cases

Neutral:

  • Future enhancement: optional LLM-based post-processing for papers that fail quality gate, consuming tokens only for the ~2% that need it
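The deferred enhancement could route papers as in this sketch. The 0.85 threshold and the mean-of-dimensions aggregate are assumptions for illustration, not the pipeline's actual gate:

```python
# Hypothetical quality gate: only papers scoring below the threshold are
# flagged for optional, token-consuming LLM post-processing; the rest
# (~98% per the ADR) stay on the zero-token path.
QUALITY_THRESHOLD = 0.85  # assumed gate value, not UDOM's real one

def needs_llm_review(dimension_scores: list) -> bool:
    """Gate on the mean of the 9 dimension scores (simplified aggregate)."""
    overall = sum(dimension_scores) / len(dimension_scores)
    return overall < QUALITY_THRESHOLD

passing = [0.9] * 9                  # mean 0.90 -> stays zero-token
failing = [0.9] * 7 + [0.4, 0.5]     # mean 0.80 -> flagged for review
```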

Alternatives Considered

  1. LLM-enhanced extraction (AI reviews and fixes each component): Higher potential quality but $0.05–0.50/paper in LLM costs, non-deterministic, and 10× slower. Rejected for cost and compliance reasons.

  2. Hybrid: deterministic extraction + LLM quality review: Adds LLM cost proportional to papers processed. Deferred as optional enhancement for quality-gate failures only.


ADR-007: Docling as Primary PDF Extraction Engine

Status

Accepted

Context

Multiple PDF extraction libraries were evaluated for the primary extraction engine: pymupdf4llm (original choice, v1.0–v2.5), Docling (IBM), Marker (Meta), LlamaParse (commercial), and Unstructured (open-source). Selection criteria: extraction speed, structural fidelity, math handling, open-source availability, and active maintenance.

Decision

Adopt Docling (IBM) as the primary PDF extraction engine, replacing pymupdf4llm. Docling provides 62× speed improvement (~5–7s vs. ~5 minutes per paper with pymupdf4llm), superior structural detection (headings, sections, document hierarchy), and active IBM maintenance. Docling runs locally (no API dependency), is MIT-licensed, and produces structured output that maps naturally to UDOM component types.

Consequences

Positive:

  • 62× speed improvement enables batch processing at production scale
  • Structural detection (headings, sections) provides reliable document backbone for fusion
  • Local execution — no external API dependency, no per-call cost
  • MIT license — no commercial restrictions

Negative:

  • Docling's math extraction is mediocre (confidence 0.70) — mitigated by ar5iv and LaTeX sources
  • IBM may change maintenance priority — mitigated by version pinning and reference corpus regression tests
  • ~1.2GB container image due to model files

Neutral:

  • Docling v2 API differs from v1; pin to >=2.0,<3.0 with compatibility wrapper
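A compatibility wrapper of the kind mentioned above might look like this sketch. Only the `>=2.0,<3.0` pin comes from the ADR; the class and method names are illustrative, and the converter is injected so the demo runs against a stub rather than Docling itself:

```python
# Sketch of a compatibility wrapper: a single choke point for Docling calls,
# so a v2 -> v3 API change touches one class instead of the whole pipeline.
# Pin the dependency as: docling>=2.0,<3.0

class ExtractionResult:
    def __init__(self, markdown: str):
        self.markdown = markdown

class DoclingWrapper:
    def __init__(self, converter):
        # Injected converter (in production, Docling's DocumentConverter);
        # injection lets tests stub it without importing Docling.
        self._converter = converter

    def extract(self, pdf_path: str) -> ExtractionResult:
        result = self._converter.convert(pdf_path)
        return ExtractionResult(result.document.export_to_markdown())

# Demo with a stub converter standing in for Docling:
class _StubDoc:
    def export_to_markdown(self) -> str:
        return "# Sample Paper"

class _StubResult:
    document = _StubDoc()

class _StubConverter:
    def convert(self, pdf_path):
        return _StubResult()

wrapped = DoclingWrapper(_StubConverter())
# wrapped.extract("paper.pdf").markdown == "# Sample Paper"
```

Keeping every Docling call behind one wrapper also makes the reference-corpus regression tests (mentioned under Negative) the natural place to catch upstream behavior changes.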

Alternatives Considered

  1. pymupdf4llm (original): 62× slower, limited structural detection. Rejected as primary engine; retained as emergency fallback.

  2. Marker (Meta): Good table extraction but weak on LaTeX math. Similar speed to Docling. Rejected — Docling's structural output is more natural for UDOM component mapping.

  3. LlamaParse: Commercial API with usage-based pricing. Better math than pymupdf4llm but single-source limitation. Rejected — vendor dependency + cost + privacy concerns for regulated content.

  4. Unstructured: Good general-purpose extraction but less specialized for scientific papers. Rejected — Docling's academic paper handling is superior.