ADR-172: General Multi-Source Provider Architecture
Status
Accepted
Context
The UDOM Pipeline v1.3 achieves 218/218 Grade A (100%) on arXiv papers using 3-source extraction (Docling PDF + ar5iv HTML + arXiv LaTeX). However, the pipeline is currently arXiv-specific — it hardcodes:
- ar5iv.labs.arxiv.org/html/{id} for HTML enrichment
- arxiv.org/e-print/{id} for LaTeX source download
- arxiv.org/ps/{id} for PostScript (not yet utilized)
- arXiv-specific ID extraction (r"(\d{4}\.\d{4,5})")
To achieve the goal of 100% lossless conversion to machine-readable markdown for any document, the pipeline must generalize beyond arXiv to support:
- PostScript (PS): arXiv's 4th available format, generated via dvips from LaTeX source, providing precise glyph positioning, vector figure data, and rendered math verification
- JATS/NLM XML: PubMed Central's structured XML (~8.5M papers) with MathML
- Publisher HTML: IEEE, Springer, Elsevier web pages
- Publisher XML: ScienceDirect, SpringerLink structured feeds
- Local files: arbitrary PDF, DOCX, PS, LaTeX from user uploads
The UDOM mapper already supports N-source fusion via SOURCE_PRIORITY and SOURCE_PREFERENCE dictionaries. The missing piece is a pluggable source discovery layer that decouples "where do sources come from?" from "how do we extract components from them?"
Decision
Introduce a SourceProvider abstraction that decouples source discovery from extraction:
1. SourceProvider Interface
from __future__ import annotations  # allows forward references to SourceFormat / AvailableSource

from abc import ABC, abstractmethod
from pathlib import Path

class SourceProvider(ABC):
    """Discovers available source formats for a given document identifier."""

    name: str  # e.g., "arxiv", "pubmed", "local"
    supported_formats: set[SourceFormat]

    @abstractmethod
    def discover(self, doc_id: str) -> list[AvailableSource]:
        """Return all available sources for a document."""

    @abstractmethod
    def fetch(self, source: AvailableSource, output_dir: Path) -> Path:
        """Download/locate the source file, return local path."""

    @abstractmethod
    def extract_doc_id(self, input_path: Path) -> str | None:
        """Extract a document ID from a file path/name, or None if not recognized."""
2. AvailableSource Dataclass
from dataclasses import dataclass, field

@dataclass
class AvailableSource:
    format: SourceFormat  # pdf, html, latex, postscript, xml, docx
    url: str  # download URL or local path
    provider: str  # source provider name
    confidence_boost: float = 0.0  # provider-specific quality boost
    rate_limit_s: float = 0.0  # minimum delay between requests
    metadata: dict = field(default_factory=dict)  # provider-specific metadata
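The rate_limit_s field implies that whatever fetches sources has to space out its requests. A minimal sketch of one way to honor it follows, assuming a hypothetical Throttle helper (not part of this ADR) that tracks the next allowed request time per host, so arxiv.org and ar5iv.labs.arxiv.org are throttled independently. The dataclass is a trimmed copy with format as a plain string, and the clock and sleep functions are injectable so the behavior can be checked without actually waiting.

```python
from __future__ import annotations

import time
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class AvailableSource:
    # Trimmed copy of the ADR dataclass; format is a plain string here.
    format: str
    url: str
    provider: str
    rate_limit_s: float = 0.0
    metadata: dict = field(default_factory=dict)


class Throttle:
    """Hypothetical helper that enforces rate_limit_s per host."""

    def __init__(self, clock=time.monotonic, sleep=time.sleep):
        self._clock = clock
        self._sleep = sleep
        self._next_allowed: dict[str, float] = {}

    def wait(self, source: AvailableSource) -> float:
        """Sleep until the host may be contacted again; return the delay used."""
        host = urlsplit(source.url).netloc  # empty string for local paths
        now = self._clock()
        delay = max(0.0, self._next_allowed.get(host, now) - now)
        if delay > 0:
            self._sleep(delay)
        # The next request to this host must wait rate_limit_s from this one.
        self._next_allowed[host] = now + delay + source.rate_limit_s
        return delay
```

A caller would invoke throttle.wait(source) immediately before each download; sources with rate_limit_s of 0.0, such as local files, never sleep.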
3. SourceFormat Extension
Add POSTSCRIPT and JATS_XML to the existing SourceFormat enum:
from enum import Enum

class SourceFormat(str, Enum):
    PDF = "pdf"
    HTML = "html"
    LATEX = "latex"
    POSTSCRIPT = "postscript"  # NEW: arXiv PS via dvips
    JATS_XML = "jats_xml"  # NEW: PubMed Central NLM/JATS
    DOCX = "docx"
    XML = "xml"
    MANUAL = "manual"
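Because SourceFormat mixes in str, its members compare equal to their plain string values and can be looked up by value. That is convenient when formats arrive as text, for example from configuration or provider metadata. The sketch below illustrates this; parse_format is a hypothetical helper, not something this ADR defines.

```python
from __future__ import annotations

from enum import Enum


class SourceFormat(str, Enum):
    PDF = "pdf"
    HTML = "html"
    LATEX = "latex"
    POSTSCRIPT = "postscript"
    JATS_XML = "jats_xml"
    DOCX = "docx"
    XML = "xml"
    MANUAL = "manual"


def parse_format(raw: str) -> SourceFormat | None:
    """Map free-form text to a SourceFormat, or None if it is not a known value."""
    try:
        return SourceFormat(raw.strip().lower())
    except ValueError:
        return None


assert parse_format(" PostScript ") is SourceFormat.POSTSCRIPT
assert parse_format("ps") is None  # only exact enum values are accepted
assert SourceFormat.JATS_XML == "jats_xml"  # str mixin: equal to the raw value
```

Returning None for unknown text keeps the decision about unsupported formats with the caller rather than raising deep inside discovery.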
4. Provider Registry
class ProviderRegistry:
    """Registry of all source providers. Pipeline queries this to discover sources."""

    def discover_all(self, doc_id: str, input_path: Path) -> list[AvailableSource]:
        """Query all providers, return merged list of available sources."""

    def register(self, provider: SourceProvider) -> None:
        """Register a new source provider."""
5. Built-in Providers (Phase 1)
| Provider | Sources | ID Pattern | Rate Limit |
|---|---|---|---|
| ArxivProvider | PDF, LaTeX, HTML (ar5iv), PS | \d{4}\.\d{4,5} | 2.0s (arXiv API), 0.5s (ar5iv) |
| LocalFileProvider | PDF, DOCX, PS, LaTeX | filename match | none |
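As a rough illustration of the ArxivProvider row, the sketch below builds a source list from the URL patterns named in the Context section and the rate limits in the table. Several details are assumptions: the https scheme, the arxiv.org/pdf/{id} URL for the PDF, and applying the 2.0s limit to every arxiv.org URL. The dataclass is trimmed to the fields used here, and discover performs no network access.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from pathlib import Path

ARXIV_ID_RE = re.compile(r"(\d{4}\.\d{4,5})")  # pattern quoted in the ADR


@dataclass
class AvailableSource:
    format: str
    url: str
    provider: str
    rate_limit_s: float = 0.0


class ArxivProvider:
    """Sketch only: discovery and ID extraction, no fetch logic."""

    name = "arxiv"

    def extract_doc_id(self, input_path: Path) -> str | None:
        # Look for an arXiv-style ID anywhere in the file name.
        match = ARXIV_ID_RE.search(input_path.name)
        return match.group(1) if match else None

    def discover(self, doc_id: str) -> list[AvailableSource]:
        if not ARXIV_ID_RE.fullmatch(doc_id):
            return []  # not an arXiv ID; leave it to other providers
        arxiv, ar5iv = "https://arxiv.org", "https://ar5iv.labs.arxiv.org"
        return [
            AvailableSource("pdf", f"{arxiv}/pdf/{doc_id}", self.name, 2.0),  # assumed URL
            AvailableSource("latex", f"{arxiv}/e-print/{doc_id}", self.name, 2.0),
            AvailableSource("postscript", f"{arxiv}/ps/{doc_id}", self.name, 2.0),
            AvailableSource("html", f"{ar5iv}/html/{doc_id}", self.name, 0.5),
        ]
```

Note that the quoted pattern only covers new-style identifiers such as 2301.12345; old-style identifiers of the form hep-th/9901001 would need a second pattern.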
6. PostScript Extractor
New udom/extractors/ps_extractor.py with capabilities:
| Capability | Tool | Value |
|---|---|---|
| Text with bounding boxes | pstotext -bboxes | Layout validation against Docling |
| Vector figure extraction | Ghostscript eps2write/pdfwrite (epswrite on older Ghostscript releases) | Lossless EPS figures without rasterization |
| Rendered math verification | Font metric analysis | Cross-validate LaTeX extraction |
| Fallback text | ps2ascii | When LaTeX source unavailable |
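For the fallback-text row, a minimal sketch of shelling out to ps2ascii (shipped with Ghostscript) is shown below. It assumes the tool is on PATH and that calling it with only an input file writes the extracted text to standard output. The function name ps_fallback_text and its which parameter are illustrative, not the extractor's actual API.

```python
from __future__ import annotations

import shutil
import subprocess
from pathlib import Path


def ps_fallback_text(ps_path: Path, which=shutil.which) -> str | None:
    """Return plain text extracted by ps2ascii, or None if the tool is missing."""
    if which("ps2ascii") is None:
        return None  # caller should fall back to another source
    completed = subprocess.run(
        ["ps2ascii", str(ps_path)],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return completed.stdout
```

Checking for the binary first lets the pipeline degrade gracefully on machines without Ghostscript instead of failing the whole document.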
Source Priority: PostScript ranks between PDF (1) and HTML (2) at priority 1.5 — better than raw PDF extraction but not as structured as HTML or LaTeX.
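One way to read this ranking, assuming SOURCE_PRIORITY maps format names to numbers where a higher number means a more trusted source: the PDF, PostScript, and HTML values below come from the text, while the LaTeX value of 3.0 and the rank_sources helper are assumptions for illustration.

```python
# Higher number = preferred source. Only pdf, postscript, and html are from the ADR.
SOURCE_PRIORITY = {
    "pdf": 1.0,
    "postscript": 1.5,
    "html": 2.0,
    "latex": 3.0,  # assumed value; the ADR only implies LaTeX ranks above PS
}


def rank_sources(formats):
    """Order formats from most to least preferred; unknown formats go last."""
    return sorted(formats, key=lambda fmt: SOURCE_PRIORITY.get(fmt, 0.0), reverse=True)


print(rank_sources(["pdf", "latex", "postscript", "html"]))
# ['latex', 'html', 'postscript', 'pdf']
```

Using a fractional value lets PostScript slot in without renumbering the existing entries.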
Component Preferences:
| Component Type | PS Advantage | PS Priority |
|---|---|---|
| Figure | Lossless vector EPS extraction | Primary when no HTML figures |
| Equation (rendered) | Font metric verification | Verification only (LaTeX remains primary) |
| Paragraph | Better word boundaries via bounding boxes | Validation only |
| Table | Layout-based column detection | Fallback (HTML tables are superior) |
7. Pipeline Integration
run_udom_pipeline() changes from:
# Current (hardcoded 3 sources)
pdf_result = extract_from_pdf(pdf_path, ...)
html_result = extract_from_html(arxiv_id, ...) # ar5iv specific
latex_result = extract_from_latex(arxiv_id, ...) # arXiv specific
To:
# General (N sources via provider registry)
registry = ProviderRegistry()
registry.register(ArxivProvider())
registry.register(LocalFileProvider())
# Future: registry.register(PubMedProvider())
sources = registry.discover_all(doc_id, input_path)
results = []
for source in sources:
    local_path = registry.fetch(source)
    result = extract(source.format, local_path, ...)  # dispatches to correct extractor
    results.append(result)
document = map_and_fuse(results) # existing mapper handles N sources
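To make the generalized loop concrete, here is a self-contained sketch of a registry and a format-keyed dispatcher. It rests on several assumptions: the registry exposes a fetch method that delegates to the provider named in source.provider (the interface in section 4 lists only discover_all and register), fetch takes an output directory as SourceProvider.fetch does, and exact duplicate sources are dropped. EXTRACTORS, StubProvider, and run are hypothetical stand-ins for the real extractors, providers, and run_udom_pipeline().

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass(frozen=True)
class AvailableSource:
    format: str
    url: str
    provider: str


class StubProvider:
    """Stand-in provider with canned sources; real ones subclass SourceProvider."""

    def __init__(self, name: str, sources: list[AvailableSource]) -> None:
        self.name = name
        self._sources = sources

    def discover(self, doc_id: str) -> list[AvailableSource]:
        return list(self._sources)

    def fetch(self, source: AvailableSource, output_dir: Path) -> Path:
        return output_dir / Path(source.url).name  # pretend the file was downloaded


class ProviderRegistry:
    def __init__(self) -> None:
        self._providers: dict[str, StubProvider] = {}

    def register(self, provider: StubProvider) -> None:
        self._providers[provider.name] = provider

    def discover_all(self, doc_id: str, input_path: Path) -> list[AvailableSource]:
        # input_path is unused in this sketch; providers are queried by doc_id only.
        merged: list[AvailableSource] = []
        for provider in self._providers.values():
            for source in provider.discover(doc_id):
                if source not in merged:  # drop exact duplicates
                    merged.append(source)
        return merged

    def fetch(self, source: AvailableSource, output_dir: Path) -> Path:
        # Delegate to the provider that advertised the source.
        return self._providers[source.provider].fetch(source, output_dir)


# Format name -> extractor. Each extractor here just reports what it was given.
EXTRACTORS: dict[str, Callable[[Path], dict]] = {
    "pdf": lambda path: {"format": "pdf", "path": path.name},
    "latex": lambda path: {"format": "latex", "path": path.name},
}


def extract(fmt: str, local_path: Path) -> dict:
    if fmt not in EXTRACTORS:
        raise ValueError(f"no extractor registered for format {fmt!r}")
    return EXTRACTORS[fmt](local_path)


def run(doc_id: str, input_path: Path, registry: ProviderRegistry, out: Path) -> list[dict]:
    results = []
    for source in registry.discover_all(doc_id, input_path):
        local_path = registry.fetch(source, out)
        results.append(extract(source.format, local_path))
    return results
```

Keeping the extractor table separate from the registry mirrors the decision itself: providers answer where a source comes from, and the dispatcher decides how a given format is parsed.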
Consequences
Positive:
- Adding a new source (PubMed XML, IEEE HTML, Nature XML) requires only: (1) a SourceProvider subclass, (2) an extractor if the format is new
- PostScript becomes immediately available for arXiv papers as a 4th validation source
- Pipeline remains backward-compatible: ArxivProvider + LocalFileProvider reproduce current behavior exactly
- Path to 47.4M+ papers across 6 major publishers
- General solution handles local files (PDF, DOCX, PS) without arXiv dependency
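To give a sense of what "a SourceProvider subclass" involves, here is a sketch of a local-file provider. The abstract base is restated in trimmed form so the block runs on its own. The extension map, the root-directory constructor argument, and the rule that the document ID equals the file stem are assumptions, since this ADR only says "filename match".

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum
from pathlib import Path


class SourceFormat(str, Enum):
    PDF = "pdf"
    POSTSCRIPT = "postscript"
    LATEX = "latex"
    DOCX = "docx"


@dataclass
class AvailableSource:
    format: SourceFormat
    url: str
    provider: str


class SourceProvider(ABC):
    name: str

    @abstractmethod
    def discover(self, doc_id: str) -> list[AvailableSource]: ...

    @abstractmethod
    def fetch(self, source: AvailableSource, output_dir: Path) -> Path: ...

    @abstractmethod
    def extract_doc_id(self, input_path: Path) -> str | None: ...


# Hypothetical extension map; not specified by the ADR.
EXTENSIONS = {
    ".pdf": SourceFormat.PDF,
    ".ps": SourceFormat.POSTSCRIPT,
    ".tex": SourceFormat.LATEX,
    ".docx": SourceFormat.DOCX,
}


class LocalFileProvider(SourceProvider):
    name = "local"

    def __init__(self, root: Path) -> None:
        self.root = root

    def discover(self, doc_id: str) -> list[AvailableSource]:
        # Every file named <doc_id>.<known extension> under root counts as a source.
        found = []
        for path in sorted(self.root.glob(f"{doc_id}.*")):
            fmt = EXTENSIONS.get(path.suffix.lower())
            if fmt is not None:
                found.append(AvailableSource(fmt, str(path), self.name))
        return found

    def fetch(self, source: AvailableSource, output_dir: Path) -> Path:
        return Path(source.url)  # already on disk, nothing to download

    def extract_doc_id(self, input_path: Path) -> str | None:
        return input_path.stem if input_path.suffix.lower() in EXTENSIONS else None
```

The three methods are the whole contract, which is why a provider for an already-supported format needs no change to the mapper or extractors.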
Negative:
- Adds abstraction layer to pipeline (provider registry, source discovery)
- PostScript extraction provides incremental (not revolutionary) quality improvement for arXiv papers where LaTeX source is already available
- Publisher API integrations require API keys and may have usage limits
Neutral:
- Mapper and assembler remain unchanged — they already handle N sources via dictionaries
- Existing arXiv behavior is preserved as ArxivProvider with identical rate limits and fallback logic
- PostScript is most valuable as a fallback when LaTeX source is unavailable, and for vector figure extraction where it's uniquely superior
Implementation Plan
Phase 1: Foundation (H.21.1-H.21.3) — 1-2 weeks
- Define SourceProvider ABC, AvailableSource dataclass, ProviderRegistry
- Implement ArxivProvider (wraps current arXiv-specific logic)
- Implement LocalFileProvider (wraps current file detection)
- Add POSTSCRIPT and JATS_XML to SourceFormat enum
- Refactor run_udom_pipeline() to use provider registry
- Verify zero regression on 218-paper batch
Phase 2: PostScript Extractor (H.21.4) — 1 week
- Create udom/extractors/ps_extractor.py
- Implement pstotext -bboxes integration (GhostScript already self-provisioned)
- Implement EPS figure extraction from PS files
- Add PS to SOURCE_PRIORITY and SOURCE_PREFERENCE
- Test on 10 arXiv papers with PS available
- Batch test (218 papers) with PS as 4th source
Phase 3: PubMed XML Extractor (H.21.5) — 2 weeks
- Implement PubMedProvider (PMC OAI-PMH API)
- Create udom/extractors/jats_extractor.py (JATS/NLM XML → UDOM components)
- MathML → LaTeX conversion for equations
- Test on 50 PubMed Central papers
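The MathML → LaTeX step can be prototyped with the standard library. The sketch below is a deliberately small recursive converter covering only a few presentation MathML elements (mi, mn, mo, mtext, mfrac, msup, msub, msqrt, and containers such as mrow). It is not the planned jats_extractor.py: it ignores entities, Greek letters, operators that need LaTeX macros, and content MathML, and the function names are illustrative.

```python
import xml.etree.ElementTree as ET

TOKEN_TAGS = {"mi", "mn", "mo", "mtext"}


def _local_name(tag: str) -> str:
    """Strip an XML namespace such as {http://www.w3.org/1998/Math/MathML}."""
    return tag.split("}", 1)[-1]


def _convert(node: ET.Element) -> str:
    tag = _local_name(node.tag)
    if tag in TOKEN_TAGS:
        return (node.text or "").strip()
    kids = [_convert(child) for child in node]
    if tag == "mfrac" and len(kids) == 2:
        return "\\frac{" + kids[0] + "}{" + kids[1] + "}"
    if tag == "msup" and len(kids) == 2:
        return "{" + kids[0] + "}^{" + kids[1] + "}"
    if tag == "msub" and len(kids) == 2:
        return "{" + kids[0] + "}_{" + kids[1] + "}"
    if tag == "msqrt":
        return "\\sqrt{" + " ".join(kids) + "}"
    # math, mrow, and anything unrecognized: join the children in order.
    return " ".join(kids)


def mathml_to_latex(xml_text: str) -> str:
    return _convert(ET.fromstring(xml_text))


SAMPLE = (
    '<math xmlns="http://www.w3.org/1998/Math/MathML">'
    "<mfrac><mi>a</mi><msup><mi>b</mi><mn>2</mn></msup></mfrac>"
    "</math>"
)
print(mathml_to_latex(SAMPLE))  # \frac{a}{{b}^{2}}
```

Falling through to a plain join for unknown elements keeps the text content instead of dropping it, which suits a pipeline whose goal is lossless conversion, though the result may not be valid LaTeX for constructs outside this subset.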
Phase 4: Publisher Adapters (H.21.6) — 2-4 weeks
- IEEEProvider (IEEE Xplore API)
- SpringerProvider (Springer Nature API)
- ElsevierProvider (ScienceDirect API)
- Publisher-specific extractors for proprietary HTML/XML formats
Alternatives Considered
- Keep arXiv-specific pipeline, add PS as inline code: Simpler but blocks PubMed/IEEE/Springer expansion. Rejected: the abstraction cost is low and the payoff is 47M+ additional papers.
- Use a third-party document hub (e.g., Semantic Scholar API) as single source: Provides metadata but not full-text structured extraction. Rejected: UDOM requires raw source access for component-level extraction.
- PostScript only (no general abstraction): Would add PS but not generalize. Rejected: PS alone provides incremental value; the general architecture provides transformative value by unlocking all publishers.
Decision Date: 2026-02-09
Decision Makers: Hal Casteel (CTO), Claude (Opus 4.6)
Track: H.20 → H.21 (General Multi-Source Provider)