ADR-172: General Multi-Source Provider Architecture

Status

Accepted

Context

The UDOM Pipeline v1.3 achieves 218/218 Grade A (100%) on arXiv papers using 3-source extraction (Docling PDF + ar5iv HTML + arXiv LaTeX). However, the pipeline is currently arXiv-specific — it hardcodes:

  • ar5iv.labs.arxiv.org/html/{id} for HTML enrichment
  • arxiv.org/e-print/{id} for LaTeX source download
  • arxiv.org/ps/{id} for PostScript (not yet utilized)
  • arXiv-specific ID extraction (r"(\d{4}\.\d{4,5})")

To achieve the goal of 100% lossless conversion to machine-readable markdown for any document, the pipeline must generalize beyond arXiv to support:

  1. PostScript (PS) — arXiv's 4th available format, generated via dvips from LaTeX source, providing precise glyph positioning, vector figure data, and rendered math verification
  2. JATS/NLM XML — PubMed Central's structured XML (~8.5M papers) with MathML
  3. Publisher HTML — IEEE, Springer, Elsevier web pages
  4. Publisher XML — ScienceDirect, SpringerLink structured feeds
  5. Local files — arbitrary PDF, DOCX, PS, LaTeX from user uploads

The UDOM mapper already supports N-source fusion via SOURCE_PRIORITY and SOURCE_PREFERENCE dictionaries. The missing piece is a pluggable source discovery layer that decouples "where do sources come from?" from "how do we extract components from them?"

Decision

Introduce a SourceProvider abstraction that decouples source discovery from extraction:

1. SourceProvider Interface

from abc import ABC, abstractmethod
from pathlib import Path


class SourceProvider(ABC):
    """Discovers available source formats for a given document identifier."""

    name: str  # e.g., "arxiv", "pubmed", "local"
    supported_formats: set[SourceFormat]

    @abstractmethod
    def discover(self, doc_id: str) -> list[AvailableSource]:
        """Return all available sources for a document."""

    @abstractmethod
    def fetch(self, source: AvailableSource, output_dir: Path) -> Path:
        """Download/locate the source file, return local path."""

    @abstractmethod
    def extract_doc_id(self, input_path: Path) -> str | None:
        """Extract a document ID from a file path/name, or None if not recognized."""

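As an illustration of what implementing this interface might involve, here is a minimal sketch of a concrete provider. `DirectoryProvider` and its extension table are hypothetical names introduced only for this example, and `AvailableSource` is reduced to three fields with a plain-string format so the snippet runs on its own.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path


@dataclass
class AvailableSource:
    format: str  # simplified: the ADR uses the SourceFormat enum here
    url: str
    provider: str


class SourceProvider(ABC):
    name: str

    @abstractmethod
    def discover(self, doc_id: str) -> list[AvailableSource]: ...

    @abstractmethod
    def fetch(self, source: AvailableSource, output_dir: Path) -> Path: ...

    @abstractmethod
    def extract_doc_id(self, input_path: Path) -> str | None: ...


class DirectoryProvider(SourceProvider):
    """Hypothetical provider: looks for <doc_id>.<ext> files in one directory."""

    name = "directory"
    # Assumed mapping from file extension to format name
    EXTENSIONS = {".pdf": "pdf", ".ps": "postscript", ".tex": "latex", ".docx": "docx"}

    def __init__(self, root: Path) -> None:
        self.root = root

    def discover(self, doc_id: str) -> list[AvailableSource]:
        found = []
        for ext, fmt in self.EXTENSIONS.items():
            candidate = self.root / f"{doc_id}{ext}"
            if candidate.is_file():
                found.append(AvailableSource(fmt, str(candidate), self.name))
        return found

    def fetch(self, source: AvailableSource, output_dir: Path) -> Path:
        # Files are already local, so nothing is copied into output_dir
        return Path(source.url)

    def extract_doc_id(self, input_path: Path) -> str | None:
        if input_path.suffix.lower() in self.EXTENSIONS:
            return input_path.stem
        return None
```

The three methods stay independent of any extractor: the provider only reports where sources are and hands back a local path.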
2. AvailableSource Dataclass

from dataclasses import dataclass, field


@dataclass
class AvailableSource:
    format: SourceFormat           # pdf, html, latex, postscript, xml, docx
    url: str                       # download URL or local path
    provider: str                  # source provider name
    confidence_boost: float = 0.0  # provider-specific quality boost
    rate_limit_s: float = 0.0      # minimum delay between requests
    metadata: dict = field(default_factory=dict)  # provider-specific metadata
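One way the `rate_limit_s` field could be honored at fetch time is sketched below. The `Throttle` helper is hypothetical, and tracking the delay per URL host (rather than per provider) is an assumption made here so that one provider can carry different limits for different hosts. The format is simplified to a plain string.

```python
import time
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class AvailableSource:
    format: str  # simplified to a plain string for this sketch
    url: str
    provider: str
    confidence_boost: float = 0.0
    rate_limit_s: float = 0.0
    metadata: dict = field(default_factory=dict)


class Throttle:
    """Hypothetical helper: spaces out requests to the same host."""

    def __init__(self, clock=time.monotonic, sleep=time.sleep):
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = {}  # host -> earliest time of the next request

    def wait(self, source: AvailableSource) -> float:
        """Block until the source's host may be contacted; return seconds slept."""
        host = urlsplit(source.url).netloc  # empty string for local paths
        now = self._clock()
        delay = max(0.0, self._next_allowed.get(host, now) - now)
        if delay > 0:
            self._sleep(delay)
        # The next request to this host must wait at least rate_limit_s
        self._next_allowed[host] = now + delay + source.rate_limit_s
        return delay
```

Calling `throttle.wait(source)` immediately before each fetch would keep the delay logic out of the providers themselves. The injectable `clock` and `sleep` exist only to make the helper testable without real waiting.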

3. SourceFormat Extension

Add POSTSCRIPT and JATS_XML to the existing SourceFormat enum:

from enum import Enum


class SourceFormat(str, Enum):
    PDF = "pdf"
    HTML = "html"
    LATEX = "latex"
    POSTSCRIPT = "postscript"  # NEW: arXiv PS via dvips
    JATS_XML = "jats_xml"      # NEW: PubMed Central NLM/JATS
    DOCX = "docx"
    XML = "xml"
    MANUAL = "manual"
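Because the enum mixes in `str`, members compare equal to their string values and can be looked up by value. The sketch below shows that behavior along with a hypothetical extension-to-format table (`EXTENSION_TO_FORMAT` and `format_from_path` are not part of the ADR; the `.nxml` entry assumes the extension used for article XML in PMC packages).

```python
from enum import Enum
from pathlib import Path
from typing import Optional


class SourceFormat(str, Enum):
    PDF = "pdf"
    HTML = "html"
    LATEX = "latex"
    POSTSCRIPT = "postscript"
    JATS_XML = "jats_xml"
    DOCX = "docx"
    XML = "xml"
    MANUAL = "manual"


# Hypothetical lookup table; the real pipeline may detect formats differently
EXTENSION_TO_FORMAT = {
    ".pdf": SourceFormat.PDF,
    ".html": SourceFormat.HTML,
    ".htm": SourceFormat.HTML,
    ".tex": SourceFormat.LATEX,
    ".ps": SourceFormat.POSTSCRIPT,
    ".nxml": SourceFormat.JATS_XML,
    ".docx": SourceFormat.DOCX,
    ".xml": SourceFormat.XML,
}


def format_from_path(path: Path) -> Optional[SourceFormat]:
    """Guess the source format from the file extension, or None if unknown."""
    return EXTENSION_TO_FORMAT.get(path.suffix.lower())


# Lookup by value and comparison with plain strings both work
assert SourceFormat("postscript") is SourceFormat.POSTSCRIPT
assert SourceFormat.JATS_XML == "jats_xml"
```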

4. Provider Registry

class ProviderRegistry:
    """Registry of all source providers. Pipeline queries this to discover sources."""

    def discover_all(self, doc_id: str, input_path: Path) -> list[AvailableSource]:
        """Query all providers, return merged list of available sources."""

    def register(self, provider: SourceProvider) -> None:
        """Register a new source provider."""

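A minimal sketch of how the registry's two methods might be filled in follows. Several choices here are assumptions rather than decisions recorded in this ADR: providers are keyed by `name`, duplicates are detected by `(format, url)`, a provider is skipped when it cannot resolve a document ID, and a `fetch` method forwards to the provider that discovered the source.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class AvailableSource:
    format: str  # simplified to a plain string for this sketch
    url: str
    provider: str


class ProviderRegistry:
    """Sketch of a registry; provider objects only need the SourceProvider methods."""

    def __init__(self) -> None:
        self._providers = {}  # provider name -> provider, in registration order

    def register(self, provider) -> None:
        if provider.name in self._providers:
            raise ValueError(f"provider already registered: {provider.name}")
        self._providers[provider.name] = provider

    def discover_all(self, doc_id: str | None, input_path: Path) -> list[AvailableSource]:
        merged, seen = [], set()
        for provider in self._providers.values():
            # Assumption: fall back to the provider's own ID extraction
            resolved = doc_id or provider.extract_doc_id(input_path)
            if resolved is None:
                continue
            for source in provider.discover(resolved):
                key = (source.format, source.url)
                if key not in seen:  # drop exact duplicates across providers
                    seen.add(key)
                    merged.append(source)
        return merged

    def fetch(self, source: AvailableSource, output_dir: Path) -> Path:
        # Assumption: delegate to the provider that reported the source
        return self._providers[source.provider].fetch(source, output_dir)
```

Keeping registration order means the merged list is deterministic, which can help when comparing batch runs.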
5. Built-in Providers (Phase 1)

| Provider | Sources | ID Pattern | Rate Limit |
| --- | --- | --- | --- |
| ArxivProvider | PDF, LaTeX, HTML (ar5iv), PS | \d{4}\.\d{4,5} | 2.0s (arXiv API), 0.5s (ar5iv) |
| LocalFileProvider | PDF, DOCX, PS, LaTeX | filename match | none |
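The discovery half of ArxivProvider could be sketched as below, using the ID pattern from the table and the URL patterns listed in the Context section. The PDF URL, the `https://` scheme, the helper names, and applying the 2.0s delay to every arxiv.org download are assumptions. The function only builds candidate URLs and does not check that each format actually exists for a paper.

```python
import re
from pathlib import Path
from typing import Optional

ARXIV_ID_RE = re.compile(r"(\d{4}\.\d{4,5})")

# (format, URL template, minimum delay in seconds between requests)
ARXIV_SOURCES = [
    ("pdf", "https://arxiv.org/pdf/{id}", 2.0),
    ("latex", "https://arxiv.org/e-print/{id}", 2.0),
    ("postscript", "https://arxiv.org/ps/{id}", 2.0),
    ("html", "https://ar5iv.labs.arxiv.org/html/{id}", 0.5),
]


def extract_arxiv_id(input_path: Path) -> Optional[str]:
    """Return the arXiv ID embedded in a file name, or None if there is none."""
    match = ARXIV_ID_RE.search(input_path.name)
    return match.group(1) if match else None


def discover_arxiv_sources(doc_id: str) -> list:
    """Build candidate sources for one arXiv ID as plain dictionaries."""
    return [
        {
            "format": fmt,
            "url": template.format(id=doc_id),
            "provider": "arxiv",
            "rate_limit_s": delay,
        }
        for fmt, template, delay in ARXIV_SOURCES
    ]
```

A file named `2301.12345v2.pdf` yields the ID `2301.12345`, since the pattern ignores the version suffix.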

6. PostScript Extractor

New udom/extractors/ps_extractor.py with capabilities:

| Capability | Tool | Value |
| --- | --- | --- |
| Text with bounding boxes | pstotext -bboxes | Layout validation against Docling |
| Vector figure extraction | GhostScript epswrite/pdfwrite | Lossless EPS figures without rasterization |
| Rendered math verification | Font metric analysis | Cross-validate LaTeX extraction |
| Fallback text | ps2ascii | When LaTeX source unavailable |
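The extractor would presumably shell out to these tools. The sketch below only builds the command lines and runs them when the binary is present; the function names are hypothetical. Note that recent Ghostscript releases provide an `eps2write` device in place of the older `epswrite`, so the device name is left as a parameter.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def gs_convert_command(ps_path: Path, out_path: Path, device: str = "pdfwrite") -> List[str]:
    """Ghostscript command that renders a PS file through the given output device."""
    return [
        "gs",
        "-q",          # suppress startup banner
        "-dSAFER",     # restrict file access from PostScript code
        "-dBATCH",     # exit after processing
        "-dNOPAUSE",   # do not prompt between pages
        f"-sDEVICE={device}",
        f"-sOutputFile={out_path}",
        str(ps_path),
    ]


def pstotext_command(ps_path: Path) -> List[str]:
    """pstotext command that emits words with their bounding boxes."""
    return ["pstotext", "-bboxes", str(ps_path)]


def run_if_available(cmd: List[str]) -> Optional[str]:
    """Run a command and return its stdout, or None if the tool is not installed."""
    if shutil.which(cmd[0]) is None:
        return None
    completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return completed.stdout
```

Returning `None` for a missing tool lets the pipeline treat PostScript as an optional source instead of failing the whole run.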

Source Priority: PostScript is assigned priority 1.5, between PDF (1) and HTML (2). It yields better results than raw PDF extraction but is less structured than HTML or LaTeX.
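As a rough sketch of how this ordering might look in the SOURCE_PRIORITY dictionary: the PDF, PostScript, and HTML values come from the text above, while the LaTeX value of 3.0, the string keys, and the `rank_sources` helper are assumptions for illustration.

```python
# Higher value = more trusted source. The LaTeX value is assumed, not from the ADR.
SOURCE_PRIORITY = {
    "pdf": 1.0,
    "postscript": 1.5,
    "html": 2.0,
    "latex": 3.0,
}


def rank_sources(formats):
    """Order available formats from most to least preferred; unknown formats go last."""
    return sorted(formats, key=lambda fmt: SOURCE_PRIORITY.get(fmt, 0.0), reverse=True)
```

Using a fractional value places PostScript between two existing entries without renumbering them.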

Component Preferences:

| Component Type | PS Advantage | PS Priority |
| --- | --- | --- |
| Figure | Lossless vector EPS extraction | Primary when no HTML figures |
| Equation (rendered) | Font metric verification | Verification only (LaTeX remains primary) |
| Paragraph | Better word boundaries via bounding boxes | Validation only |
| Table | Layout-based column detection | Fallback (HTML tables are superior) |
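The table could translate into per-component preference lists along the following lines. This is a sketch under assumptions: the exact orderings, the `VALIDATION_ONLY` set, and `choose_source` are hypothetical, and only the relative position of PostScript reflects the table (paragraph ordering in particular is a guess).

```python
# Ordered from most to least preferred as the primary source (assumed orderings)
SOURCE_PREFERENCE = {
    "figure": ["html", "postscript", "pdf"],    # PS is primary when no HTML figures
    "equation": ["latex", "html", "pdf"],       # LaTeX remains primary
    "paragraph": ["html", "latex", "pdf"],      # assumed; the table only covers PS
    "table": ["html", "latex", "postscript", "pdf"],  # PS is a fallback
}

# Component types where PostScript only cross-checks another source's output
VALIDATION_ONLY = {"equation", "paragraph"}


def choose_source(component_type, available):
    """Return (primary source or None, list of validation-only sources)."""
    primary = next(
        (fmt for fmt in SOURCE_PREFERENCE[component_type] if fmt in available),
        None,
    )
    validators = []
    if component_type in VALIDATION_ONLY and "postscript" in available:
        validators.append("postscript")
    return primary, validators
```

Separating validators from the primary choice keeps PostScript from overriding LaTeX equations while still letting it flag disagreements.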

7. Pipeline Integration

run_udom_pipeline() changes from:

# Current (hardcoded 3 sources)
pdf_result = extract_from_pdf(pdf_path, ...)
html_result = extract_from_html(arxiv_id, ...) # ar5iv specific
latex_result = extract_from_latex(arxiv_id, ...) # arXiv specific

To:

# General (N sources via provider registry)
registry = ProviderRegistry()
registry.register(ArxivProvider())
registry.register(LocalFileProvider())
# Future: registry.register(PubMedProvider())

sources = registry.discover_all(doc_id, input_path)
results = []
for source in sources:
    # The registry forwards to the provider named in source.provider,
    # whose fetch() takes the output directory (see SourceProvider.fetch)
    local_path = registry.fetch(source, output_dir)
    result = extract(source.format, local_path, ...)  # dispatches to correct extractor
    results.append(result)

document = map_and_fuse(results)  # existing mapper handles N sources
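The `extract(source.format, ...)` dispatch in the loop above could be a simple lookup table keyed by format. The sketch below is one possible shape; the `extractor` decorator, the stub extractors, and the result dictionaries are hypothetical stand-ins for the real extractors.

```python
from pathlib import Path
from typing import Callable, Dict

# format name -> function that turns a local file into an extraction result
EXTRACTORS: Dict[str, Callable[[Path], dict]] = {}


def extractor(fmt: str):
    """Decorator that registers a function as the extractor for one format."""
    def decorator(fn):
        EXTRACTORS[fmt] = fn
        return fn
    return decorator


@extractor("pdf")
def extract_pdf_stub(path: Path) -> dict:
    # Placeholder: a real extractor would parse the file here
    return {"source": "pdf", "path": str(path), "components": []}


@extractor("postscript")
def extract_ps_stub(path: Path) -> dict:
    return {"source": "postscript", "path": str(path), "components": []}


def extract(fmt: str, local_path: Path) -> dict:
    """Dispatch to the extractor registered for this format."""
    try:
        fn = EXTRACTORS[fmt]
    except KeyError:
        raise ValueError(f"no extractor registered for format {fmt!r}") from None
    return fn(local_path)
```

With this shape, supporting a new format means registering one more function, and the pipeline loop itself does not change.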

Consequences

Positive:

  • Adding a new source (PubMed XML, IEEE HTML, Nature XML) requires only: (1) a SourceProvider subclass, and (2) an extractor if the format is new
  • PostScript becomes immediately available for arXiv papers as a 4th validation source
  • Pipeline remains backward-compatible — ArxivProvider + LocalFileProvider reproduce current behavior exactly
  • Path to 47.4M+ papers across 6 major publishers
  • General solution handles local files (PDF, DOCX, PS) without arXiv dependency

Negative:

  • Adds abstraction layer to pipeline (provider registry, source discovery)
  • PostScript extraction provides incremental (not revolutionary) quality improvement for arXiv papers where LaTeX source is already available
  • Publisher API integrations require API keys and may have usage limits

Neutral:

  • Mapper and assembler remain unchanged — they already handle N sources via dictionaries
  • Existing arXiv behavior is preserved as ArxivProvider with identical rate limits and fallback logic
  • PostScript is most valuable as a fallback when LaTeX source is unavailable, and for vector figure extraction where it's uniquely superior

Implementation Plan

Phase 1: Foundation (H.21.1-H.21.3) — 1-2 weeks

  1. Define SourceProvider ABC, AvailableSource dataclass, ProviderRegistry
  2. Implement ArxivProvider (wraps current arXiv-specific logic)
  3. Implement LocalFileProvider (wraps current file detection)
  4. Add POSTSCRIPT and JATS_XML to SourceFormat enum
  5. Refactor run_udom_pipeline() to use provider registry
  6. Verify zero regression on 218-paper batch

Phase 2: PostScript Extractor (H.21.4) — 1 week

  1. Create udom/extractors/ps_extractor.py
  2. Implement pstotext -bboxes integration (GhostScript already self-provisioned)
  3. Implement EPS figure extraction from PS files
  4. Add PS to SOURCE_PRIORITY and SOURCE_PREFERENCE
  5. Test on 10 arXiv papers with PS available
  6. Batch test (218 papers) with PS as 4th source

Phase 3: PubMed XML Extractor (H.21.5) — 2 weeks

  1. Implement PubMedProvider (PMC OAI-PMH API)
  2. Create udom/extractors/jats_extractor.py (JATS/NLM XML → UDOM components)
  3. MathML → LaTeX conversion for equations
  4. Test on 50 PubMed Central papers
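To give a sense of the MathML → LaTeX step in item 3, here is a minimal recursive converter. It is a sketch that handles only a handful of presentation MathML elements (`mi`, `mn`, `mo`, `mrow`, `msup`, `msub`, `mfrac`, `msqrt`) and does not translate operator entities or symbols. The function name is hypothetical, and a production extractor would likely rely on a fuller converter.

```python
import xml.etree.ElementTree as ET


def _local_name(tag: str) -> str:
    """Strip an XML namespace such as {http://www.w3.org/1998/Math/MathML}."""
    return tag.rsplit("}", 1)[-1]


def mathml_to_latex(node: ET.Element) -> str:
    """Convert a small subset of presentation MathML to a LaTeX string."""
    tag = _local_name(node.tag)
    if tag in ("mi", "mn", "mo"):
        return (node.text or "").strip()  # token elements carry text directly
    kids = [mathml_to_latex(child) for child in node]
    if tag in ("math", "mrow"):
        return " ".join(kids)
    if tag == "msup" and len(kids) == 2:
        return f"{kids[0]}^{{{kids[1]}}}"
    if tag == "msub" and len(kids) == 2:
        return f"{kids[0]}_{{{kids[1]}}}"
    if tag == "mfrac" and len(kids) == 2:
        return f"\\frac{{{kids[0]}}}{{{kids[1]}}}"
    if tag == "msqrt":
        return f"\\sqrt{{{' '.join(kids)}}}"
    # Fail loudly so unsupported constructs are not silently dropped
    raise ValueError(f"unsupported MathML element: {tag}")


sample = (
    '<math xmlns="http://www.w3.org/1998/Math/MathML"><mrow>'
    "<msup><mi>x</mi><mn>2</mn></msup><mo>+</mo>"
    "<mfrac><mn>1</mn><mi>y</mi></mfrac>"
    "</mrow></math>"
)
latex = mathml_to_latex(ET.fromstring(sample))
```

Raising on unknown elements fits the lossless goal: a converted equation is either complete or explicitly flagged for another source to supply.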

Phase 4: Publisher Adapters (H.21.6) — 2-4 weeks

  1. IEEEProvider (IEEE Xplore API)
  2. SpringerProvider (Springer Nature API)
  3. ElsevierProvider (ScienceDirect API)
  4. Publisher-specific extractors for proprietary HTML/XML formats

Alternatives Considered

  1. Keep arXiv-specific pipeline, add PS as inline code: Simpler but blocks PubMed/IEEE/Springer expansion. Rejected — the abstraction cost is low and the payoff is 47M+ additional papers.

  2. Use a third-party document hub (e.g., Semantic Scholar API) as single source: Provides metadata but not full-text structured extraction. Rejected — UDOM requires raw source access for component-level extraction.

  3. PostScript only (no general abstraction): Would add PS but not generalize. Rejected — PS alone provides incremental value; the general architecture provides transformative value by unlocking all publishers.


Decision Date: 2026-02-09
Decision Makers: Hal Casteel (CTO), Claude (Opus 4.6)
Track: H.20 → H.21 (General Multi-Source Provider)