ADR-006: Zero LLM Token Extraction Architecture
Status
Accepted
Context
CODITECT's token economics show 4–15× cost multipliers for agent workflows (Section 8 of system prompt). Any LLM dependency in the extraction pipeline would multiply costs across every paper processed. With batch sizes of 100–10,000 papers, even $0.01 per paper in LLM costs becomes significant. Additionally, LLM-based extraction is non-deterministic — the same paper may produce different output on re-extraction, which violates compliance audit requirements.
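A quick back-of-envelope check of these figures, using the batch sizes above and the $0.05–0.50/paper range cited later under Alternatives Considered (computed in integer cents to keep the arithmetic exact):

```python
# Illustrative cost arithmetic only; batch size and per-paper costs are
# the figures quoted in this ADR, not measured values.
low_cents, high_cents = 1, 50   # $0.01 ... $0.50 per paper, in cents
batch = 10_000                  # upper end of the stated batch sizes

print(batch * low_cents / 100)   # dollars per batch at $0.01/paper
print(batch * high_cents / 100)  # dollars per batch at $0.50/paper
```

Even the "cheap" end is $100 per 10,000-paper batch, and re-extraction for compliance audits would pay it again each time.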
Decision
The UDOM extraction pipeline uses zero LLM tokens. All extraction (Docling, ar5iv, LaTeX), fusion (confidence-weighted), and quality scoring (9-dimension) are purely deterministic operations. LLM tokens are consumed only when CODITECT agents consume UDOM output for synthesis tasks — and even then, UDOM's typed components reduce token consumption by 47–90% compared to raw PDF text.
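The fusion step can be sketched as a pure function over per-source candidates. The source names mirror the three extractors above, but the `Candidate` type, the tie-break order, and the confidences are illustrative assumptions, not the real UDOM schema:

```python
# Hypothetical sketch of deterministic confidence-weighted fusion.
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    source: str       # "docling", "ar5iv", or "latex"
    content: str
    confidence: float

def fuse(candidates: list[Candidate]) -> Candidate:
    """Pick the highest-confidence candidate; ties break on a fixed
    source order so the same inputs always give the same output."""
    order = {"latex": 0, "ar5iv": 1, "docling": 2}  # assumed priority
    return min(candidates, key=lambda c: (-c.confidence, order[c.source]))

best = fuse([
    Candidate("docling", "E = mc^2 (garbled)", 0.70),
    Candidate("ar5iv",   "E = mc^2",           0.95),
])
```

Because there is no sampling anywhere in the function, re-running it on the same paper is guaranteed to reproduce the same UDOM output.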
Consequences
Positive:
- Extraction cost is pure compute (CPU/memory), not per-token — dramatically cheaper at scale
- Deterministic: same paper always produces identical UDOM output (compliance requirement)
- No model version dependency in extraction — Docling version changes are manageable; LLM version changes would affect every paper
- Token savings compound: 47% reduction for full-document agent consumption, 90% for targeted queries
Negative:
- Cannot use LLM "intelligence" to resolve ambiguous extraction cases (e.g., is this a table or a figure?)
- Quality ceiling is bounded by deterministic algorithms — an LLM might achieve higher fidelity on edge cases
Neutral:
- Future enhancement: optional LLM-based post-processing for papers that fail quality gate, consuming tokens only for the ~2% that need it
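The deferred enhancement could be gated deterministically, so tokens are spent only on quality-gate failures. The function name, threshold, and dimension names below are hypothetical stand-ins for the real 9-dimension scoring:

```python
# Hypothetical sketch: route only gate failures (~2% of papers) to an
# optional LLM post-processing step. Threshold and dimensions are
# illustrative, not the actual UDOM quality rubric.

def needs_llm_review(scores: dict[str, float], threshold: float = 0.85) -> bool:
    """A paper fails the gate if its mean dimension score is below threshold."""
    return sum(scores.values()) / len(scores) < threshold

needs_llm_review({"structure": 0.90, "math": 0.60})  # mean 0.75 -> fails gate
needs_llm_review({"structure": 0.95, "math": 0.90})  # mean 0.925 -> passes
```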
Alternatives Considered
- LLM-enhanced extraction (AI reviews and fixes each component): higher potential quality, but $0.05–0.50/paper in LLM costs, non-deterministic, and 10× slower. Rejected for cost and compliance reasons.
- Hybrid (deterministic extraction + LLM quality review): adds LLM cost proportional to papers processed. Deferred as an optional enhancement for quality-gate failures only.
ADR-007: Docling as Primary PDF Extraction Engine
Status
Accepted
Context
Multiple PDF extraction libraries were evaluated for the primary extraction engine: pymupdf4llm (original choice, v1.0–v2.5), Docling (IBM), Marker (Meta), LlamaParse (commercial), and Unstructured (open-source). Selection criteria: extraction speed, structural fidelity, math handling, open-source availability, and active maintenance.
Decision
Adopt Docling (IBM) as the primary PDF extraction engine, replacing pymupdf4llm. Docling provides 62× speed improvement (~5–7s vs. ~5 minutes per paper with pymupdf4llm), superior structural detection (headings, sections, document hierarchy), and active IBM maintenance. Docling runs locally (no API dependency), is MIT-licensed, and produces structured output that maps naturally to UDOM component types.
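A sketch of the "maps naturally" claim: structural labels from the extractor dispatch to UDOM component types through a plain lookup table. The label strings and type names below are illustrative assumptions, not the actual Docling labels or UDOM schema:

```python
# Hypothetical label-to-component mapping; real Docling item labels and
# UDOM type names will differ.
DOCLING_TO_UDOM = {
    "section_header": "Heading",
    "text":           "Paragraph",
    "table":          "Table",
    "formula":        "Equation",
    "picture":        "Figure",
}

def to_udom_type(docling_label: str) -> str:
    # Unknown labels fall back to a generic block instead of failing.
    return DOCLING_TO_UDOM.get(docling_label, "GenericBlock")
```

A flat, total mapping like this keeps the conversion deterministic and makes regression testing against a reference corpus straightforward.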
Consequences
Positive:
- 62× speed improvement enables batch processing at production scale
- Structural detection (headings, sections) provides reliable document backbone for fusion
- Local execution — no external API dependency, no per-call cost
- MIT license — no commercial restrictions
Negative:
- Docling's math extraction is mediocre (confidence 0.70) — mitigated by ar5iv and LaTeX sources
- IBM may change maintenance priority — mitigated by version pinning and reference corpus regression tests
- ~1.2GB container image due to model files
Neutral:
- Docling v2 API differs from v1; pin to `docling>=2.0,<3.0` with a compatibility wrapper
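The pin could additionally be enforced with a small runtime guard. `docling_version_ok` is a hypothetical helper, and real code would likely use `packaging.version` rather than naive string parsing:

```python
# Minimal sketch of a guard for the >=2.0,<3.0 pin; simplified version
# parsing (no pre-release or build-metadata handling).

def docling_version_ok(version: str) -> bool:
    """True only for Docling 2.x releases."""
    return int(version.split(".")[0]) == 2
```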
Alternatives Considered
- pymupdf4llm (original): 62× slower, limited structural detection. Rejected as primary engine; retained as emergency fallback.
- Marker: good table extraction but weak on LaTeX math; similar speed to Docling. Rejected — Docling's structural output is more natural for UDOM component mapping.
- LlamaParse: commercial API with usage-based pricing. Better math than pymupdf4llm but single-source limitation. Rejected — vendor dependency, per-call cost, and privacy concerns for regulated content.
- Unstructured: good general-purpose extraction but less specialized for scientific papers. Rejected — Docling's academic paper handling is superior.