ADR-172: General Multi-Source Provider Architecture

Status

Accepted

Context

The UDOM Pipeline v1.3 achieves 218/218 Grade A (100%) on arXiv papers using 3-source extraction (Docling PDF + ar5iv HTML + arXiv LaTeX). However, the pipeline is currently arXiv-specific — it hardcodes:

ar5iv.labs.arxiv.org/html/{id} for HTML enrichment
arxiv.org/e-print/{id} for LaTeX source download
arxiv.org/ps/{id} for PostScript (not yet utilized)
arXiv-specific ID extraction (r"(\d{4}\.\d{4,5})")
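
As a rough illustration of the coupling listed above, the following sketch applies the quoted ID pattern to a filename and fills in the three hardcoded URL templates. The names `arxiv_urls` and `URL_TEMPLATES` and the `https://` scheme are assumptions made for this example, not identifiers from the pipeline.

```python
import re

# Pattern quoted in the list above; matches IDs such as 2301.01234.
ARXIV_ID_RE = re.compile(r"(\d{4}\.\d{4,5})")

# URL paths come from the list above; the https scheme is an assumption.
URL_TEMPLATES = {
    "html": "https://ar5iv.labs.arxiv.org/html/{id}",
    "latex": "https://arxiv.org/e-print/{id}",
    "postscript": "https://arxiv.org/ps/{id}",
}


def arxiv_urls(filename: str) -> dict[str, str]:
    """Return source URLs for the arXiv ID found in filename, or {} if none."""
    match = ARXIV_ID_RE.search(filename)
    if match is None:
        return {}
    return {fmt: tpl.format(id=match.group(1)) for fmt, tpl in URL_TEMPLATES.items()}
```

A file with no arXiv-style ID in its name yields an empty dictionary, which is the case the current pipeline cannot enrich.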

To achieve the goal of 100% lossless conversion to machine-readable markdown for any document, the pipeline must generalize beyond arXiv to support:

PostScript (PS) — arXiv's 4th available format, generated via dvips from LaTeX source, providing precise glyph positioning, vector figure data, and rendered math verification
JATS/NLM XML — PubMed Central's structured XML (~8.5M papers) with MathML
Publisher HTML — IEEE, Springer, Elsevier web pages
Publisher XML — ScienceDirect, SpringerLink structured feeds
Local files — arbitrary PDF, DOCX, PS, LaTeX from user uploads

The UDOM mapper already supports N-source fusion via SOURCE_PRIORITY and SOURCE_PREFERENCE dictionaries. The missing piece is a pluggable source discovery layer that decouples "where do sources come from?" from "how do we extract components from them?"

Decision

Introduce a SourceProvider abstraction that decouples source discovery from extraction:

1. SourceProvider Interface

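from __future__ import annotations  # AvailableSource and SourceFormat are defined in sections 2 and 3

from abc import ABC, abstractmethod
from pathlib import Path


class SourceProvider(ABC):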
    """Discovers available source formats for a given document identifier."""

    name: str                    # e.g., "arxiv", "pubmed", "local"
    supported_formats: set[SourceFormat]

    @abstractmethod
    def discover(self, doc_id: str) -> list[AvailableSource]:
        """Return all available sources for a document."""

    @abstractmethod
    def fetch(self, source: AvailableSource, output_dir: Path) -> Path:
        """Download/locate the source file, return local path."""

    @abstractmethod
    def extract_doc_id(self, input_path: Path) -> str | None:
        """Extract a document ID from a file path/name, or None if not recognized."""

2. AvailableSource Dataclass

from dataclasses import dataclass, field


@dataclass
class AvailableSource:
    format: SourceFormat           # pdf, html, latex, postscript, xml, docx
    url: str                       # download URL or local path
    provider: str                  # source provider name
    confidence_boost: float = 0.0  # provider-specific quality boost
    rate_limit_s: float = 0.0      # minimum delay between requests
    metadata: dict = field(default_factory=dict)  # provider-specific metadata

3. SourceFormat Extension

Add POSTSCRIPT and JATS_XML to the existing SourceFormat enum:

from enum import Enum


class SourceFormat(str, Enum):
    PDF = "pdf"
    HTML = "html"
    LATEX = "latex"
    POSTSCRIPT = "postscript"   # NEW — arXiv PS via dvips
    JATS_XML = "jats_xml"       # NEW — PubMed Central NLM/JATS
    DOCX = "docx"
    XML = "xml"
    MANUAL = "manual"

4. Provider Registry

class ProviderRegistry:
    """Registry of all source providers. Pipeline queries this to discover sources."""

    def discover_all(self, doc_id: str, input_path: Path) -> list[AvailableSource]:
        """Query all providers, return merged list of available sources."""

    def register(self, provider: SourceProvider) -> None:
        """Register a new source provider."""

    def fetch(self, source: AvailableSource, output_dir: Path) -> Path:
        """Delegate to the provider named in source.provider, return local path."""
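
One way the registry could be filled in is sketched here, under stated assumptions: providers are keyed by `name`, `discover_all` falls back to each provider's `extract_doc_id` when no ID is passed, and `fetch` is routed through `source.provider`. `StubProvider` is a hypothetical stand-in that does no I/O, and `AvailableSource` is trimmed to three fields.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class AvailableSource:
    """Trimmed stand-in for the dataclass in section 2."""
    format: str
    url: str
    provider: str


class StubProvider:
    """Hypothetical provider used only to exercise the registry."""

    def __init__(self, name: str, formats: list[str]) -> None:
        self.name = name
        self._formats = formats

    def discover(self, doc_id: str) -> list[AvailableSource]:
        return [AvailableSource(f, f"{self.name}://{doc_id}.{f}", self.name)
                for f in self._formats]

    def fetch(self, source: AvailableSource, output_dir: Path) -> Path:
        # A real provider would download here; the stub only computes a target path.
        return output_dir / f"{source.provider}.{source.format}"

    def extract_doc_id(self, input_path: Path) -> Optional[str]:
        return input_path.stem or None


class ProviderRegistry:
    def __init__(self) -> None:
        self._providers: dict[str, StubProvider] = {}

    def register(self, provider: StubProvider) -> None:
        self._providers[provider.name] = provider

    def discover_all(self, doc_id: Optional[str], input_path: Path) -> list[AvailableSource]:
        sources: list[AvailableSource] = []
        for provider in self._providers.values():
            # Assumption: use the provider's own ID extraction when no ID is given.
            resolved = doc_id or provider.extract_doc_id(input_path)
            if resolved is not None:
                sources.extend(provider.discover(resolved))
        return sources

    def fetch(self, source: AvailableSource, output_dir: Path) -> Path:
        # Route the request to the provider that advertised this source.
        return self._providers[source.provider].fetch(source, output_dir)
```

Keying providers by name keeps `fetch` a dictionary lookup and makes re-registering a provider an overwrite rather than a duplicate.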

5. Built-in Providers (Phase 1)

Provider	Sources	ID Pattern	Rate Limit
`ArxivProvider`	PDF, LaTeX, HTML (ar5iv), PS	`\d{4}\.\d{4,5}`	2.0s (arXiv API), 0.5s (ar5iv)
`LocalFileProvider`	PDF, DOCX, PS, LaTeX	filename match	none

6. PostScript Extractor

New udom/extractors/ps_extractor.py with capabilities:

Capability	Tool	Value
Text with bounding boxes	`pstotext -bboxes`	Layout validation against Docling
Vector figure extraction	Ghostscript `eps2write`/`pdfwrite`	Lossless EPS figures without rasterization
Rendered math verification	Font metric analysis	Cross-validate LaTeX extraction
Fallback text	`ps2ascii`	When LaTeX source unavailable
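
For the vector conversion row, a command line for Ghostscript could be assembled as sketched below. The helper name `gs_command` is hypothetical, and the sketch only builds the argument list; running it (for example with `subprocess.run(cmd, check=True)`) assumes a `gs` binary on the path.

```python
from pathlib import Path


def gs_command(ps_path: Path, out_path: Path, device: str = "pdfwrite") -> list[str]:
    """Build a Ghostscript argument list that converts a PS file with a vector device."""
    return [
        "gs",
        "-dBATCH",      # exit after processing the input
        "-dNOPAUSE",    # do not prompt between pages
        "-dSAFER",      # restrict file system access
        f"-sDEVICE={device}",
        f"-sOutputFile={out_path}",
        str(ps_path),
    ]
```

Passing the device as a parameter lets the same helper serve both PDF and EPS output.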

Source Priority: PostScript ranks between PDF (1) and HTML (2) at priority 1.5 — better than raw PDF extraction but not as structured as HTML or LaTeX.
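
The ordering this implies can be sketched as follows. The dictionary shape and the LaTeX value are assumptions (the text gives only PDF, PostScript, and HTML), and `rank_sources` is a hypothetical helper, not the mapper's actual API.

```python
# Higher number = more structured source. PDF, PostScript, and HTML values come
# from the paragraph above; the LaTeX value is an assumption for illustration.
SOURCE_PRIORITY = {"pdf": 1.0, "postscript": 1.5, "html": 2.0, "latex": 3.0}


def rank_sources(formats: list[str]) -> list[str]:
    """Order formats from most to least preferred; unknown formats sort last."""
    return sorted(formats, key=lambda fmt: SOURCE_PRIORITY.get(fmt, 0.0), reverse=True)
```

A fractional value lets PostScript slot in without renumbering the existing entries.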

Component Preferences:

Component Type	PS Advantage	PS Priority
Figure	Lossless vector EPS extraction	Primary when no HTML figures
Equation (rendered)	Font metric verification	Verification only (LaTeX remains primary)
Paragraph	Better word boundaries via bounding boxes	Validation only
Table	Layout-based column detection	Fallback (HTML tables are superior)
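
A per-component lookup consistent with the table might look like the sketch below. Only the position of PostScript relative to HTML and LaTeX comes from the table; the rest of each ordering, the dictionary shape, and the `pick_source` helper are assumptions for illustration.

```python
from typing import Optional

# Preferred sources per component type, best first. PostScript is omitted for
# equations and paragraphs because the table marks it as verification/validation only.
SOURCE_PREFERENCE = {
    "figure": ["html", "postscript", "pdf"],
    "table": ["html", "postscript", "pdf"],
    "equation": ["latex", "html", "pdf"],
    "paragraph": ["latex", "html", "pdf"],
}


def pick_source(component_type: str, available: set[str]) -> Optional[str]:
    """Return the most preferred available source for a component type, if any."""
    for fmt in SOURCE_PREFERENCE.get(component_type, []):
        if fmt in available:
            return fmt
    return None
```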

7. Pipeline Integration

run_udom_pipeline() changes from:

# Current (hardcoded 3 sources)
pdf_result = extract_from_pdf(pdf_path, ...)
html_result = extract_from_html(arxiv_id, ...)  # ar5iv specific
latex_result = extract_from_latex(arxiv_id, ...)  # arXiv specific

To:

# General (N sources via provider registry)
registry = ProviderRegistry()
registry.register(ArxivProvider())
registry.register(LocalFileProvider())
# Future: registry.register(PubMedProvider())

sources = registry.discover_all(doc_id, input_path)
results = []
for source in sources:
    local_path = registry.fetch(source, output_dir)  # delegates to the provider named in source.provider
    result = extract(source.format, local_path, ...)  # dispatches to correct extractor
    results.append(result)

document = map_and_fuse(results)  # existing mapper handles N sources
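
The `extract(...)` dispatch in the loop above could be a simple table lookup, sketched here under assumptions: the extractor functions are placeholders that return a small dictionary, and the names `EXTRACTORS` and `extract_ps` are hypothetical.

```python
from pathlib import Path
from typing import Callable

ExtractorFn = Callable[[Path], dict]


def extract_pdf(path: Path) -> dict:
    # Placeholder for the real PDF extractor.
    return {"source": "pdf", "path": str(path)}


def extract_ps(path: Path) -> dict:
    # Placeholder for the planned PostScript extractor.
    return {"source": "postscript", "path": str(path)}


# One entry per supported format; a new format adds one line here.
EXTRACTORS: dict[str, ExtractorFn] = {
    "pdf": extract_pdf,
    "postscript": extract_ps,
}


def extract(fmt: str, local_path: Path) -> dict:
    """Dispatch to the extractor registered for fmt, or fail loudly."""
    try:
        extractor = EXTRACTORS[fmt]
    except KeyError:
        raise ValueError(f"no extractor registered for format {fmt!r}") from None
    return extractor(local_path)
```

Failing on an unknown format, rather than skipping it silently, surfaces a provider that advertises a format no extractor handles.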

Consequences

Positive:

Adding a new source (PubMed XML, IEEE HTML, Nature XML) requires only: (1) a SourceProvider subclass, and (2) an extractor, if the format is new
PostScript becomes immediately available for arXiv papers as a 4th validation source
Pipeline remains backward-compatible — ArxivProvider + LocalFileProvider reproduce current behavior exactly
Path to 47.4M+ papers across 6 major publishers
General solution handles local files (PDF, DOCX, PS) without arXiv dependency

Negative:

Adds abstraction layer to pipeline (provider registry, source discovery)
PostScript extraction provides incremental (not revolutionary) quality improvement for arXiv papers where LaTeX source is already available
Publisher API integrations require API keys and may have usage limits

Neutral:

Mapper and assembler remain unchanged — they already handle N sources via dictionaries
Existing arXiv behavior is preserved as ArxivProvider with identical rate limits and fallback logic
PostScript is most valuable as a fallback when LaTeX source is unavailable, and for vector figure extraction where it's uniquely superior

Implementation Plan

Phase 1: Foundation (H.21.1-H.21.3) — 1-2 weeks

Define SourceProvider ABC, AvailableSource dataclass, ProviderRegistry
Implement ArxivProvider (wraps current arXiv-specific logic)
Implement LocalFileProvider (wraps current file detection)
Add POSTSCRIPT and JATS_XML to SourceFormat enum
Refactor run_udom_pipeline() to use provider registry
Verify zero regression on 218-paper batch

Phase 2: PostScript Extractor (H.21.4) — 1 week

Create udom/extractors/ps_extractor.py
Implement pstotext -bboxes integration (Ghostscript already self-provisioned)
Implement EPS figure extraction from PS files
Add PS to SOURCE_PRIORITY and SOURCE_PREFERENCE
Test on 10 arXiv papers with PS available
Batch test (218 papers) with PS as 4th source

Phase 3: PubMed XML Extractor (H.21.5) — 2 weeks

Implement PubMedProvider (PMC OAI-PMH API)
Create udom/extractors/jats_extractor.py (JATS/NLM XML → UDOM components)
MathML → LaTeX conversion for equations
Test on 50 PubMed Central papers

Phase 4: Publisher Adapters (H.21.6) — 2-4 weeks

IEEEProvider (IEEE Xplore API)
SpringerProvider (Springer Nature API)
ElsevierProvider (ScienceDirect API)
Publisher-specific extractors for proprietary HTML/XML formats

Alternatives Considered

Keep arXiv-specific pipeline, add PS as inline code: Simpler but blocks PubMed/IEEE/Springer expansion. Rejected because the abstraction cost is low and the payoff is 47.4M+ additional papers.
Use a third-party document hub (e.g., Semantic Scholar API) as single source: Provides metadata but not full-text structured extraction. Rejected — UDOM requires raw source access for component-level extraction.
PostScript only (no general abstraction): Would add PS but not generalize. Rejected — PS alone provides incremental value; the general architecture provides transformative value by unlocking all publishers.

Decision Date: 2026-02-09
Decision Makers: Hal Casteel (CTO), Claude (Opus 4.6)
Track: H.20 → H.21 (General Multi-Source Provider)
