ADR-002: Confidence-Weighted Fusion Strategy

Status

Accepted

Context

With three extraction sources producing overlapping components for the same document, a fusion strategy is needed to select the best component for each position. The strategy must be deterministic (auditable for compliance), explainable (humans can verify why a source was chosen), and domain-configurable (different regulated industries may prioritize different quality dimensions).

Decision

Implement confidence-weighted fusion where each source has a pre-assigned confidence score per component type. For each document position, the component with the highest confidence is selected. Confidence weights are configurable per tenant, allowing regulated tenants to prioritize sources based on domain requirements.

Default confidence matrix:

| Component Type     | Docling | ar5iv | LaTeX |
|--------------------|---------|-------|-------|
| Heading            | 0.95    | 0.85  | 0.80  |
| Paragraph          | 0.90    | 0.85  | 0.80  |
| Equation (display) | 0.70    | 0.90  | 0.95  |
| Equation (inline)  | 0.65    | 0.95  | 0.90  |
| Table              | 0.80    | 0.92  | 0.75  |
| Figure             | 0.88    | 0.82  |       |
| Citation           | 0.75    | 0.85  | 0.95  |
| Bibliography       | 0.70    | 0.80  | 0.95  |
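The selection rule can be sketched as follows. This is an illustrative sketch, not the project's actual code: the `fuse` function, the candidate tuple shape, and the source keys are assumptions based on the matrix above; only a few rows of the matrix are reproduced.

```python
# Hypothetical sketch of confidence-weighted fusion. Source names,
# component-type keys, and the candidate shape are assumptions; the
# weight values come from the default matrix above (abridged).

DEFAULT_WEIGHTS = {
    # component_type: {source: confidence}
    "heading": {"docling": 0.95, "ar5iv": 0.85, "latex": 0.80},
    "equation_display": {"docling": 0.70, "ar5iv": 0.90, "latex": 0.95},
    "table": {"docling": 0.80, "ar5iv": 0.92, "latex": 0.75},
}

def fuse(candidates, weights=DEFAULT_WEIGHTS):
    """Pick, for each document position, the candidate from the
    highest-confidence source for its component type.

    candidates: dict mapping position -> list of
                (source, component_type, payload) tuples.
    Returns a dict mapping position -> winning tuple. Ties are broken
    by source name so the output is fully deterministic.
    """
    fused = {}
    for position, options in sorted(candidates.items()):
        fused[position] = max(
            options,
            key=lambda o: (weights.get(o[1], {}).get(o[0], 0.0), o[0]),
        )
    return fused

# Example: all three sources emit a table at position 4;
# ar5iv wins because 0.92 is the highest table weight.
candidates = {4: [("docling", "table", "<tbl-a>"),
                  ("ar5iv", "table", "<tbl-b>"),
                  ("latex", "table", "<tbl-c>")]}
print(fuse(candidates)[4][0])  # ar5iv
```

Because selection is a pure `max` over pre-assigned weights with a deterministic tie-break, re-running fusion on the same inputs always yields the same output, which is what the audit requirement demands.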

Consequences

Positive:

  • Deterministic — same inputs always produce same output (audit requirement)
  • Explainable — routing rationale captured per component in metadata
  • Configurable — pharmaceutical tenants can boost table accuracy; ML research tenants can boost math accuracy
  • Fast — no ML inference in fusion path; simple weighted selection
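Per-tenant configurability could be as simple as a shallow override merge onto the default matrix. A minimal sketch, assuming a dict-of-dicts weight layout; the tenant ID and override values are illustrative, not real configuration:

```python
# Hypothetical sketch of per-tenant weight overrides. DEFAULTS is an
# abridged copy of the matrix above; the tenant ID and the 0.98 boost
# are made-up examples.

DEFAULTS = {"table": {"docling": 0.80, "ar5iv": 0.92, "latex": 0.75}}

# e.g. a pharmaceutical tenant boosts table accuracy by privileging
# ar5iv even further for tables.
TENANT_OVERRIDES = {
    "pharma-tenant": {"table": {"ar5iv": 0.98}},
}

def weights_for(tenant_id):
    """Merge a tenant's overrides onto the defaults, cell by cell,
    without mutating the shared default matrix."""
    merged = {ct: dict(srcs) for ct, srcs in DEFAULTS.items()}
    for ct, srcs in TENANT_OVERRIDES.get(tenant_id, {}).items():
        merged.setdefault(ct, {}).update(srcs)
    return merged

print(weights_for("pharma-tenant")["table"]["ar5iv"])  # 0.98
print(weights_for("other-tenant")["table"]["ar5iv"])   # 0.92
```

Keeping overrides sparse (only the cells a tenant changes) keeps the audit trail small: reviewers diff the override, not the whole matrix.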

Negative:

  • Static weights don't adapt to per-paper quality variations (a Docling extraction that's unusually bad for one paper still gets high confidence)
  • Requires periodic calibration of weights against reference corpus

Neutral:

  • Future enhancement: dynamic confidence based on per-extraction quality signals (e.g., Docling reports low confidence on a specific table → automatically prefer ar5iv)
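One way the dynamic-confidence enhancement could work is to scale the static weight by the extractor's own per-component quality signal. This is a sketch of one possible rule, not a committed design; the multiplicative scaling and the example values are assumptions:

```python
# Hypothetical sketch of dynamic confidence: downgrade a static weight
# when the extractor reports low confidence for a specific component.
# The multiplicative rule and all values here are assumptions.

STATIC = {"table": {"docling": 0.80, "ar5iv": 0.92}}

def effective_confidence(source, component_type, self_reported, static=STATIC):
    """Scale the static weight by the extractor's own quality signal.

    self_reported: 0.0-1.0 confidence emitted by the extractor for this
    specific component, or None if the source provides no such signal.
    """
    base = static[component_type][source]
    if self_reported is None:
        return base
    return base * self_reported

# Docling flags one specific table as low quality (0.4), so ar5iv wins
# that position even though Docling's static weight is unchanged.
assert effective_confidence("docling", "table", 0.4) < \
       effective_confidence("ar5iv", "table", None)
```

A rule like this stays deterministic and explainable (both factors are recorded per component), at the cost of depending on how well each extractor calibrates its own confidence.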

Alternatives Considered

  1. Majority voting: Select component agreed upon by 2+ sources. Rejected because it reduces to lowest-common-denominator quality and doesn't leverage source specialization.

  2. ML-based fusion: Train a classifier to select best source per component. Rejected because it's not deterministic (model updates change output), not explainable (black box), and adds LLM/ML inference cost to extraction.

  3. Source priority chain (fixed order): Always prefer LaTeX > ar5iv > Docling. Rejected because no single source is universally best — Docling beats LaTeX on structure, ar5iv beats LaTeX on tables.