ADR-164: Universal Document Object Model (UDOM)
Status
PROPOSED (2026-02-09)
Executive Summary
The Universal Document Object Model (UDOM) defines a canonical JSON intermediate representation for any structured document type — academic papers, invoices, purchase orders, contracts, proposals, technical reports, and more. Content is captured as typed, positioned components (text blocks, headings, tables, figures, equations, line items, clauses, signatures) independent of source format (PDF, HTML, LaTeX, DOCX) and output format (Markdown, HTML, PDF, DOCX, JSON).
A document mapping function cross-checks components extracted from multiple source formats, aligns them by position and content, selects the highest-fidelity version of each component, and assembles the final output. This is designed to guarantee complete content coverage: every element of the original document is represented in the canonical model.
Design principle: Content is sovereign. Format is rendering. The UDOM captures WHAT the content is (semantically), WHERE it appears (structurally), and FROM WHERE it was extracted (provenance). Any output format is a projection of this model.
1. Problem Statement
1.1 The Format Coupling Problem
Current document processing pipelines are format-to-format converters (PDF→Markdown, DOCX→HTML). They operate on text streams, not structured content. This creates three critical problems:
- Information loss: Tables flatten to text, images lose captions, equations break, hierarchies collapse
- No cross-validation: When a PDF table extraction fails, there's no mechanism to recover from an HTML or LaTeX source of the same document
- Format lock-in: Output is coupled to a single target format. Converting Markdown→DOCX requires re-parsing, losing structure again
1.2 The Multi-Source Opportunity
Many documents exist in multiple formats simultaneously:
| Document Type | Available Sources |
|---|---|
| Academic paper | PDF, arXiv LaTeX source, ar5iv HTML, PubMed XML, publisher HTML |
| Invoice | PDF, email HTML, EDI/XML, accounting system export |
| Contract | PDF, DOCX (redline), HTML (e-sign platform), plain text |
| Proposal | DOCX, PDF, Google Docs HTML, presentation PPTX |
| Technical report | PDF, LaTeX, HTML, Confluence wiki |
Each source captures different components with different fidelity. No single source is best at everything. PDF preserves layout but loses structure. HTML preserves structure but loses pagination. LaTeX preserves math but requires compilation. The UDOM enables a best-of-all-sources assembly.
1.3 The Agentic Processing Vision
CODITECT's goal is to automate agentic discovery using scientific papers and business documents. This requires documents that can be:
- Decomposed into atomic components for retrieval, embedding, and reasoning
- Recomposed into any format for different consumers (researchers, executives, systems)
- Cross-referenced with other documents (citation graphs, contract chains, invoice lineage)
- Validated for completeness (every component from source is accounted for)
- Versioned as enrichment passes add information
1.4 Prior Art
| System | Approach | Limitation |
|---|---|---|
| S2ORC (Allen AI) | JSON with body_text + cite_spans + ref_entries | Academic papers only, no layout info |
| Docling (IBM) | DoclingDocument Pydantic model with typed items | No multi-source alignment |
| PaperMage (Allen AI) | Character-span entities over symbols string | Academic papers only, complex API |
| GROBID | TEI XML with 68 labels | XML-only output, academic focus |
| Schema.org | ScholarlyArticle / Invoice / Order vocabulary | Metadata-only, no content structure |
Gap: No existing system combines: (1) universal document types, (2) multi-source extraction, (3) component-level alignment, (4) format-independent content model, and (5) lossless round-trip assembly.
2. Decision
Implement the Universal Document Object Model (UDOM) as a JSON schema and processing pipeline within CODITECT that:
- Defines a canonical component taxonomy covering all structured document types
- Extracts components from multiple source formats into a unified JSON representation
- Aligns and merges components across sources using a document mapping function
- Assembles final output in any target format from the canonical model
- Guarantees 100% content coverage with provenance tracking
2.1 Architecture Overview
Source Formats               UDOM Pipeline                    Output Formats
──────────────               ─────────────                    ──────────────
PDF (pymupdf4llm)    ──┐   ┌────────────────┐     ┌──────────┐
                       │   │                │     │          │──► Markdown
HTML (ar5iv/web)     ──┤   │   Extraction   │     │ Assembly │──► HTML
                       ├──►│     Layer      │     │  Engine  │──► PDF
LaTeX (arXiv src)    ──┤   │  (per-source)  │     │          │──► DOCX
                       │   │                │     │          │──► JSON-LD
DOCX (python-docx)   ──┤   └───────┬────────┘     └────┬─────┘
                       │           │                   │
XML/EDI              ──┘           ▼                   ▲
                           ┌────────────────┐          │
                           │                │          │
                           │  Document Map  │──────────┘
                           │  (Alignment +  │
                           │   Selection)   │
                           │                │
                           └───────┬────────┘
                                   │
                                   ▼
                           ┌────────────────┐
                           │   UDOM JSON    │
                           │   (Canonical   │
                           │     Model)     │
                           └────────────────┘
2.2 Canonical Component Taxonomy
Every document is decomposed into typed components. Each component has content, position, provenance, and relationships.
Universal Components (all document types)
| Component Type | Description | Properties |
|---|---|---|
| metadata | Document-level metadata | title, authors/parties, dates, identifiers, type |
| heading | Section heading | level (1-6), text, numbering |
| paragraph | Body text block | text, inline_refs (citations, links, cross-refs) |
| table | Structured tabular data | headers, rows, caption, label, merged_cells |
| figure | Image with context | image_data (path/base64), caption, label, alt_text |
| list | Ordered or unordered list | items (recursive), list_type |
| code_block | Code or pseudocode | content, language, caption |
| blockquote | Quoted text | content, attribution |
| page_break | Page boundary | page_number |
| header_footer | Running headers/footers | content, position (header/footer), page_range |
Academic Components
| Component Type | Description | Properties |
|---|---|---|
| equation | Mathematical expression | latex, display (inline/block), label, number |
| algorithm | Formal algorithm | content, caption, label |
| theorem | Theorem/lemma/proof/etc. | content, theorem_type, label, number |
| citation | Inline citation reference | keys, style (author-year/numeric), expanded_text |
| bibliography_entry | Reference list item | key, authors, title, venue, year, doi, urls |
| abstract | Paper abstract | content |
| footnote | Footnote/endnote | content, marker, position |
Business Document Components
| Component Type | Description | Properties |
|---|---|---|
| line_item | Invoice/PO line | description, quantity, unit_price, amount, tax, sku |
| clause | Contract clause | number, title, content, clause_type (obligation/right/condition) |
| term | Defined term | term, definition |
| signature_block | Signature area | signer_name, title, organization, date, signed (bool) |
| field_value | Form key-value pair | key, value, field_type |
| total | Summary amount | label, amount, currency |
| address_block | Postal/billing address | lines, entity_name, role (sender/recipient/billing/shipping) |
| date_field | Significant date | label, date, date_type (effective/expiry/due/payment) |
| stamp | Official stamp/seal | type, text, image_data |
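The taxonomy above could be expressed in code as a string-valued enum, which keeps the JSON `type` field and the Python constants in sync. The following is a minimal sketch, not the project's actual `taxonomy.py`: it covers only a subset of the types, and the `FAMILY` mapping and `family_of` helper are hypothetical names introduced for illustration.

```python
from enum import Enum


class ComponentType(str, Enum):
    """Subset of the component taxonomy; values match the JSON "type" field."""
    # Universal components
    HEADING = "heading"
    PARAGRAPH = "paragraph"
    TABLE = "table"
    FIGURE = "figure"
    # Academic components
    EQUATION = "equation"
    CITATION = "citation"
    # Business document components
    LINE_ITEM = "line_item"
    CLAUSE = "clause"
    TOTAL = "total"


# Hypothetical lookup from component type to the taxonomy table it comes from.
FAMILY = {
    ComponentType.HEADING: "universal",
    ComponentType.PARAGRAPH: "universal",
    ComponentType.TABLE: "universal",
    ComponentType.FIGURE: "universal",
    ComponentType.EQUATION: "academic",
    ComponentType.CITATION: "academic",
    ComponentType.LINE_ITEM: "business",
    ComponentType.CLAUSE: "business",
    ComponentType.TOTAL: "business",
}


def family_of(component_type: ComponentType) -> str:
    """Return the taxonomy family ("universal", "academic", "business")."""
    return FAMILY[component_type]
```

Because the enum subclasses `str`, a value read from UDOM JSON can be converted with `ComponentType("line_item")`, and an unknown type raises `ValueError` instead of passing through silently.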
2.3 UDOM JSON Schema
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://coditect.ai/schemas/udom/v1.0.0",
"title": "Universal Document Object Model",
"version": "1.0.0",
"type": "object",
"required": ["schema_version", "document_id", "document_type", "metadata", "body"],
"properties": {
"schema_version": "1.0.0",
"document_id": "<unique identifier>",
"document_type": "<academic_paper|invoice|purchase_order|contract|proposal|report|other>",
"metadata": {
"title": "string",
"subtitle": "string | null",
"authors": [
{
"name": "string",
"role": "string | null",
"affiliation": "string | null",
"email": "string | null",
"identifiers": { "orcid": "...", "scopus": "..." }
}
],
"parties": [
{
"name": "string",
"role": "buyer | seller | licensor | licensee | party_a | party_b",
"address": "AddressBlock | null",
"identifiers": { "tax_id": "...", "duns": "..." }
}
],
"dates": {
"published": "ISO-8601 | null",
"effective": "ISO-8601 | null",
"expiry": "ISO-8601 | null",
"due": "ISO-8601 | null",
"created": "ISO-8601 | null"
},
"identifiers": {
"doi": "string | null",
"arxiv_id": "string | null",
"isbn": "string | null",
"invoice_number": "string | null",
"po_number": "string | null",
"contract_number": "string | null",
"internal_id": "string | null"
},
"keywords": ["string"],
"language": "en",
"page_count": "integer",
"abstract": "string | null"
},
"body": {
"type": "component_list",
"description": "Ordered list of document components in reading order",
"children": [
{
"id": "comp_001",
"type": "<component_type from taxonomy>",
"content": "<type-specific content>",
"position": {
"page": "integer | null",
"order": "integer (reading order index)",
"section_path": "string (e.g., '2.1.3')",
"bbox": {
"x0": "float", "y0": "float",
"x1": "float", "y1": "float",
"page": "integer"
}
},
"provenance": {
"source": "pdf | html | latex | docx | xml | manual",
"extractor": "pymupdf4llm | ar5iv | pandoc | python-docx | ...",
"confidence": "float (0-1)",
"alternatives": [
{
"source": "html",
"content": "<alternative extraction>",
"confidence": 0.85
}
]
},
"relationships": {
"parent": "comp_id | null",
"children": ["comp_id"],
"references": ["comp_id"],
"referenced_by": ["comp_id"]
}
}
]
},
"bibliography": {
"description": "Resolved reference entries (academic/legal citations)",
"entries": {
"BIBREF_001": {
"key": "string",
"authors": ["string"],
"title": "string",
"venue": "string | null",
"year": "integer | null",
"doi": "string | null",
"arxiv_id": "string | null",
"url": "string | null",
"raw_text": "string"
}
}
},
"assets": {
"description": "Binary assets (images, embedded files)",
"items": [
{
"id": "asset_001",
"filename": "figure1.png",
"mime_type": "image/png",
"storage": "file_path | base64 | url",
"data": "string (path, base64 data, or URL)",
"dimensions": { "width": "integer", "height": "integer" },
"referenced_by": ["comp_003"]
}
]
},
"extraction_report": {
"sources_used": [
{
"format": "pdf",
"path": "/path/to/source.pdf",
"extractor": "pymupdf4llm",
"components_extracted": 47,
"confidence_mean": 0.92
},
{
"format": "html",
"url": "https://ar5iv.labs.arxiv.org/html/2303.15256",
"extractor": "beautifulsoup4",
"components_extracted": 52,
"confidence_mean": 0.88
}
],
"alignment": {
"components_matched": 45,
"components_source_only": { "pdf": 2, "html": 7 },
"components_upgraded": 12,
"coverage_score": 1.0
},
"quality": {
"grade": "A",
"score": 96.5,
"dimension_scores": {},
"completeness": 1.0
},
"timestamp": "ISO-8601"
}
}
}
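To make the component shape concrete, the sketch below models one entry of `body.children` with Python dataclasses and serializes it to JSON. It is an illustration under simplifying assumptions: `bbox` and `relationships` are omitted, and the class names `Position`, `Provenance`, and `Component` plus the `to_json` helper are hypothetical, not names fixed by this ADR.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass
class Position:
    order: int                          # reading order index
    page: Optional[int] = None          # None for sources without pagination
    section_path: Optional[str] = None  # e.g. "2.1.3"


@dataclass
class Provenance:
    source: str        # "pdf", "html", "latex", "docx", "xml", or "manual"
    extractor: str     # e.g. "pymupdf4llm"
    confidence: float  # 0.0 to 1.0
    alternatives: list[dict[str, Any]] = field(default_factory=list)


@dataclass
class Component:
    id: str
    type: str
    content: Any
    position: Position
    provenance: Provenance


def to_json(component: Component) -> str:
    """Serialize a component; sorted keys keep the output stable across runs."""
    return json.dumps(asdict(component), sort_keys=True)


example = Component(
    id="comp_001",
    type="heading",
    content={"level": 1, "text": "Problem Statement"},
    position=Position(order=0, page=1, section_path="1"),
    provenance=Provenance(source="pdf", extractor="pymupdf4llm", confidence=0.92),
)
```

Calling `to_json(example)` yields a JSON object with the same top-level keys as the component entry in the schema above (minus the omitted fields).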
2.4 Document Mapping Function
The document mapper is the core algorithm that aligns components across sources:
Phase 1: Extract
  For each source format:
    Parse document → extract typed components with positions
    Each component gets a provisional ID and source provenance

Phase 2: Align
  For each component from primary source (usually PDF, which preserves layout):
    Search other sources for matching component using:
      1. Position match: same page + similar bbox (±10% tolerance)
      2. Content match: normalized text similarity (≥0.85 Jaccard on tokens)
      3. Type match: same component type
      4. Structural match: same section path + relative position
    Link matched components across sources

Phase 3: Select
  For each aligned component group:
    Score each source version on type-specific quality metrics:
      - Tables: column count, cell completeness, separator integrity
      - Figures: resolution, caption presence, alt text
      - Equations: LaTeX validity, delimiter balance
      - Headings: hierarchy consistency, numbering preservation
      - Text: completeness (char count), formatting preservation
      - Citations: expansion completeness, link resolution
    Select highest-scoring version as canonical
    Store alternatives in provenance.alternatives

Phase 4: Supplement
  For components found in only one source:
    If confidence ≥ 0.7: include in model
    If confidence < 0.7: flag for review
  For components found in secondary sources but NOT primary:
    Include with source attribution (e.g., HTML-only appendix)

Phase 5: Validate
  Coverage check: every component from every source is either:
    - Included as canonical, OR
    - Stored as alternative, OR
    - Explicitly flagged as duplicate/noise
  Structural check: heading hierarchy valid, references resolve,
    tables have consistent column counts
  Completeness check: page coverage ≥ 95% of source page count
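The four Phase 2 match criteria can be sketched as a small scoring function. This is one possible reading, not the definitive mapper: components are plain dicts with hypothetical keys (`text`, `page`, `bbox`, `section_path`), "normalized text" is approximated by lowercase word tokens, the ±10% bbox tolerance is taken relative to the larger box extent, and the structural criterion checks the section path only. The "at least 2 of 4" acceptance rule comes from the Risks table.

```python
import re

JACCARD_MIN = 0.85  # content threshold from Phase 2
BBOX_TOL = 0.10     # assumed meaning of "±10% tolerance"
MIN_CRITERIA = 2    # conservative matching rule from the Risks table


def tokens(text: str) -> set[str]:
    """Lowercase word tokens, a stand-in for text normalization."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two token sets."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def bbox_close(b1, b2, tol: float = BBOX_TOL) -> bool:
    """True if every edge differs by at most tol times the larger extent."""
    width = max(b1[2] - b1[0], b2[2] - b2[0])
    height = max(b1[3] - b1[1], b2[3] - b2[1])
    return (abs(b1[0] - b2[0]) <= tol * width
            and abs(b1[2] - b2[2]) <= tol * width
            and abs(b1[1] - b2[1]) <= tol * height
            and abs(b1[3] - b2[3]) <= tol * height)


def criteria_met(a: dict, b: dict) -> int:
    """Count how many of the four alignment criteria hold for a and b."""
    position = (a.get("page") is not None
                and a.get("page") == b.get("page")
                and a.get("bbox") is not None
                and b.get("bbox") is not None
                and bbox_close(a["bbox"], b["bbox"]))
    content = jaccard(a["text"], b["text"]) >= JACCARD_MIN
    same_type = a["type"] == b["type"]
    structural = (a.get("section_path") is not None
                  and a.get("section_path") == b.get("section_path"))
    return sum([position, content, same_type, structural])


def is_match(a: dict, b: dict) -> bool:
    return criteria_met(a, b) >= MIN_CRITERIA
```

Under this rule, two different paragraphs in the same section already satisfy two criteria (type and section path), so a real mapper would likely rank candidates by `criteria_met` and content similarity and pick the best one instead of accepting the first match.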
2.5 Assembly Engine
The assembly engine renders UDOM JSON into any target format:
| Target | Assembler | Strategy |
|---|---|---|
| Markdown | assemble_markdown(udom) | Component→markdown fragment, concatenate in reading order |
| HTML | assemble_html(udom) | Component→HTML element, semantic tags (article, section, figure, table) |
| DOCX | assemble_docx(udom) | Component→python-docx elements, style mapping |
| PDF | assemble_pdf(udom) | Component→reportlab/weasyprint elements |
| JSON-LD | assemble_jsonld(udom) | Component→Schema.org types (ScholarlyArticle, Invoice, etc.) |
| Plain text | assemble_text(udom) | Component→text content only, no formatting |
Each assembler is a pure function: UDOM → Format. No side effects. Deterministic output.
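As an illustration of the pure-function contract, here is a minimal Markdown assembler sketch. It handles only headings, paragraphs, and tables, and it assumes a content shape (`level`/`text` for headings, `headers`/`rows` for tables) that this ADR does not fix; `render_component` is a hypothetical helper.

```python
def render_component(comp: dict) -> str:
    """Render one component to a Markdown fragment (subset of types)."""
    ctype, content = comp["type"], comp["content"]
    if ctype == "heading":
        return "#" * content["level"] + " " + content["text"]
    if ctype == "table":
        header = "| " + " | ".join(content["headers"]) + " |"
        separator = "|" + "|".join(["---"] * len(content["headers"])) + "|"
        rows = ["| " + " | ".join(str(cell) for cell in row) + " |"
                for row in content["rows"]]
        return "\n".join([header, separator, *rows])
    # Paragraphs and any unhandled type fall back to their text content.
    return content.get("text", "")


def assemble_markdown(udom: dict) -> str:
    """UDOM → Markdown: reads the model, never mutates it, no I/O."""
    children = sorted(udom["body"]["children"],
                      key=lambda c: c["position"]["order"])
    return "\n\n".join(render_component(c) for c in children) + "\n"
```

Because the function only reads its argument and sorts a copy of the child list, calling it twice on the same model returns the same string, which is what makes assembled output diffable between pipeline versions.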
2.6 Document Type Profiles
Each document type has a profile that defines expected components and validation rules:
{
"academic_paper": {
"required_components": ["heading", "paragraph", "bibliography_entry"],
"expected_components": ["abstract", "equation", "figure", "table", "citation"],
"validation": {
"must_have_abstract": true,
"must_have_bibliography": true,
"heading_hierarchy_required": true
}
},
"invoice": {
"required_components": ["metadata", "line_item", "total", "address_block"],
"expected_components": ["date_field", "field_value", "stamp"],
"validation": {
"line_items_sum_to_total": true,
"must_have_invoice_number": true,
"must_have_dates": ["due"]
}
},
"contract": {
"required_components": ["heading", "clause", "signature_block"],
"expected_components": ["term", "date_field", "address_block", "paragraph"],
"validation": {
"must_have_effective_date": true,
"must_have_parties": true,
"clauses_numbered": true
}
},
"purchase_order": {
"required_components": ["line_item", "total", "address_block"],
"expected_components": ["field_value", "date_field", "signature_block"],
"validation": {
"must_have_po_number": true,
"must_have_shipping_address": true,
"line_items_sum_to_total": true
}
},
"proposal": {
"required_components": ["heading", "paragraph"],
"expected_components": ["table", "figure", "line_item", "total", "signature_block"],
"validation": {
"heading_hierarchy_required": true
}
}
}
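A profile can drive validation without type-specific code paths in the caller. The sketch below checks `required_components` and the `line_items_sum_to_total` rule for an invoice. It rests on assumptions not stated in the profile: amounts are decimal strings under `content["amount"]`, the document has a single `total`, and the total equals the plain sum of line item amounts (tax handling is ignored). The `validate` function name is hypothetical.

```python
from decimal import Decimal

# Excerpt of the invoice profile shown above.
INVOICE_PROFILE = {
    "required_components": ["metadata", "line_item", "total", "address_block"],
    "validation": {"line_items_sum_to_total": True},
}


def validate(components: list[dict], profile: dict) -> list[str]:
    """Return a list of human-readable validation errors (empty if valid)."""
    errors = []
    present = {c["type"] for c in components}
    for required in profile["required_components"]:
        if required not in present:
            errors.append(f"missing required component: {required}")

    if profile["validation"].get("line_items_sum_to_total"):
        # Decimal avoids binary floating-point drift on currency amounts.
        line_sum = sum(
            (Decimal(c["content"]["amount"])
             for c in components if c["type"] == "line_item"),
            Decimal("0"),
        )
        totals = [Decimal(c["content"]["amount"])
                  for c in components if c["type"] == "total"]
        if totals and line_sum != totals[-1]:
            errors.append(f"line items sum to {line_sum}, total is {totals[-1]}")
    return errors
```

Returning a list of errors instead of raising on the first one lets the extraction report record every profile violation in a single pass.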
3. Key Architecture Decisions
| # | Decision | Rationale | Alternatives Rejected |
|---|---|---|---|
| 1 | JSON as canonical format | Human-readable, universally parseable, schema-validatable, Git-diffable, embeddable in databases | XML/TEI (verbose, harder tooling), Protobuf (binary, not human-readable), custom binary (not inspectable) |
| 2 | Flat component list with relationships | Simple to iterate, filter, transform; relationships via IDs not nesting | Deep nesting (hard to flatten for search), graph database (infrastructure overhead), pure tree (can't represent cross-references) |
| 3 | Provenance on every component | Enables trust scoring, debugging, audit trail; critical for 100% coverage guarantee | Document-level provenance only (can't debug per-component), no provenance (no trust signal) |
| 4 | Multi-source extraction, not single-source conversion | Every source has strengths; combining gives best result for every component | Single best source (always has blind spots), manual enrichment (doesn't scale) |
| 5 | Document type profiles | Validation rules vary by document type; profiles make the model extensible without code changes | Hardcoded validation (rigid), no validation (no quality guarantee), universal rules (too loose or too strict) |
| 6 | Components carry alternatives | When alignment finds multiple versions, all are preserved; enables re-selection without re-extraction | Best-only (loses information), separate alternatives file (complex), re-extract on demand (expensive) |
| 7 | Position includes bbox + reading order + section path | Three independent positioning systems allow alignment even when one is missing; bbox for PDF, section path for HTML, reading order for text | Bbox-only (HTML has no bbox), reading-order-only (ambiguous for multi-column), section-path-only (flat documents have no sections) |
| 8 | Assets stored externally with references | Images can be large; storing as file paths or URLs keeps the JSON manageable; base64 option for self-contained use | Inline base64 only (bloats JSON), file paths only (not portable), URLs only (requires hosting) |
4. Implementation Plan
Phase 1: Schema & Core (Week 1)
| Task | Description |
|---|---|
| Define JSON Schema | Formal JSON Schema 2020-12 in config/schemas/udom-v1.schema.json |
| Component taxonomy | Python enums/dataclasses for all component types |
| UDOM dataclass | UDOMDocument with serialization/deserialization |
| Document type profiles | Profile definitions for 5 document types |
Phase 2: Extractors (Week 2-3)
| Extractor | Source | Components Extracted |
|---|---|---|
| extract_from_pdf(pdf_path) | PDF via pymupdf4llm + pdfplumber + pdfminer | All universal + academic components |
| extract_from_html(html/url) | HTML via BeautifulSoup (ar5iv, web) | Headings, tables, figures, citations, math |
| extract_from_latex(tex_path) | LaTeX via pandoc + pylatexenc | Equations, structure, bibliography, theorems |
| extract_from_docx(docx_path) | DOCX via python-docx | All universal + business components |
Each extractor returns list[Component] — raw components with provenance.
Phase 3: Document Mapper (Week 3-4)
| Module | Function |
|---|---|
| align_components() | Match components across sources by position + content + type |
| score_component() | Type-specific quality scoring for selection |
| select_canonical() | Pick best version of each component, store alternatives |
| validate_coverage() | Ensure 100% coverage, flag gaps |
| build_udom() | Orchestrate: extract → align → select → validate → UDOM JSON |
Phase 4: Assemblers (Week 4-5)
| Assembler | Output |
|---|---|
| assemble_markdown() | GitHub-flavored Markdown with proper tables, images, math |
| assemble_html() | Semantic HTML5 with Schema.org microdata |
| assemble_text() | Plain text with structural indicators |
Additional assemblers (DOCX, PDF) in Phase 5.
Phase 5: Integration (Week 5-6)
- Replace pipeline.py text-level enrichment with UDOM-based extraction → mapping → assembly
- Batch processing: udom_batch.py for 218 academic papers
- QA validation: compare UDOM-assembled markdown against v1.2 output
- Pipeline report enhancement: include UDOM extraction_report
5. File Structure
skills/pdf-to-markdown/
├── src/
│   ├── convert.py                  # Existing v2.5 converter (unchanged)
│   ├── pipeline.py                 # Enhanced to use UDOM
│   └── udom/
│       ├── __init__.py
│       ├── schema.py               # UDOM dataclasses + JSON schema
│       ├── taxonomy.py             # Component type enums + profiles
│       ├── extractors/
│       │   ├── __init__.py
│       │   ├── pdf_extractor.py    # PDF → components
│       │   ├── html_extractor.py   # HTML → components (ar5iv, web)
│       │   ├── latex_extractor.py  # LaTeX → components
│       │   └── docx_extractor.py   # DOCX → components
│       ├── mapper.py               # Alignment + selection + validation
│       └── assemblers/
│           ├── __init__.py
│           ├── markdown.py         # UDOM → Markdown
│           ├── html.py             # UDOM → HTML
│           └── text.py             # UDOM → Plain text
config/schemas/
└── udom-v1.schema.json             # Formal JSON Schema
6. Quality Guarantees
6.1 Coverage Guarantee
Every component from every source must be accounted for:
Coverage = (canonical + alternatives + flagged_noise) / total_extracted
Target: Coverage = 1.0 (100%)
If coverage < 1.0, the pipeline fails rather than silently dropping content.
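The coverage formula and the fail-fast behaviour can be sketched directly over sets of component IDs. The names `coverage`, `enforce_coverage`, and `CoverageError` are hypothetical; the sketch assumes every extracted component carries a unique provisional ID and that the mapper records which bucket each ID ended up in.

```python
class CoverageError(RuntimeError):
    """Raised when some extracted components are unaccounted for."""


def coverage(extracted: set[str], canonical: set[str],
             alternatives: set[str], flagged_noise: set[str]) -> float:
    """(canonical + alternatives + flagged_noise) / total_extracted."""
    if not extracted:
        return 1.0  # nothing extracted means nothing can be missing
    accounted = (canonical | alternatives | flagged_noise) & extracted
    return len(accounted) / len(extracted)


def enforce_coverage(extracted: set[str], canonical: set[str],
                     alternatives: set[str], flagged_noise: set[str]) -> None:
    """Fail loudly instead of silently dropping content."""
    missing = extracted - (canonical | alternatives | flagged_noise)
    if missing:
        score = coverage(extracted, canonical, alternatives, flagged_noise)
        raise CoverageError(
            f"coverage {score:.3f} < 1.0; unaccounted: {sorted(missing)}"
        )
```

Listing the missing IDs in the error message points the reviewer at the exact components that were lost, which a bare ratio would not.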
6.2 Round-Trip Fidelity
For any document D and any supported format F:
D → UDOM(D) → assemble_F(UDOM(D)) ≈ D_in_format_F
"Approximately equals" means: all textual content preserved, all structural relationships preserved, visual formatting may differ (font, spacing, pagination).
6.3 Idempotent Extraction
Re-extracting from the same source produces identical UDOM JSON (deterministic).
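One way to test this property is to compare content fingerprints of two extraction runs. The sketch below is an assumption about how such a check might look: it hashes a canonical JSON serialization and excludes `extraction_report.timestamp`, since that field would otherwise differ between runs. The `fingerprint` helper is a hypothetical name.

```python
import copy
import hashlib
import json


def fingerprint(udom: dict) -> str:
    """SHA-256 over a canonical serialization, ignoring the run timestamp."""
    stable = copy.deepcopy(udom)  # leave the caller's model untouched
    stable.get("extraction_report", {}).pop("timestamp", None)
    payload = json.dumps(stable, sort_keys=True, separators=(",", ":"),
                         ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Sorting keys makes the hash independent of dict insertion order, so two runs that build the same model in a different order still compare equal.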
6.4 Monotonic Enrichment
Adding a new source can only improve quality, never degrade it:
quality(UDOM(source_A)) ≤ quality(UDOM(source_A, source_B))
This is guaranteed by the selection algorithm: alternatives are stored but never replace a higher-scoring canonical version.
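At the level of a single aligned component group, the argument reduces to taking a maximum: adding a candidate can never lower the best score. The sketch below illustrates this, assuming each version carries a numeric `score` that is comparable across sources; the dict shape is hypothetical, and the property says nothing about errors introduced by a false alignment.

```python
def select_canonical(versions: list[dict]) -> tuple[dict, list[dict]]:
    """Return (canonical, alternatives) for one aligned component group.

    The sort is stable, so on a tie the earlier entry (by convention the
    primary source) stays canonical.
    """
    ranked = sorted(versions, key=lambda v: v["score"], reverse=True)
    return ranked[0], ranked[1:]


pdf_version = {"source": "pdf", "score": 0.72}
html_version = {"source": "html", "score": 0.91}

canonical_a, _ = select_canonical([pdf_version])
canonical_ab, alternatives = select_canonical([pdf_version, html_version])
```

Here `canonical_ab` is the HTML version and the PDF version moves to `alternatives`; had the HTML version scored lower, the PDF version would have stayed canonical and the quality of the group would be unchanged.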
7. Migration Path
From Pipeline v1.2
- Pipeline v1.2 continues to work unchanged (--skip-enrichment)
- Pipeline v1.3 uses UDOM internally but outputs identical Markdown
- --udom-json flag outputs the canonical UDOM JSON alongside Markdown
- --udom-only flag outputs only UDOM JSON (no Markdown assembly)
- Batch reprocessing: all 218 papers through UDOM pipeline, compare grades
Incremental Adoption
- Phase 1: Academic papers only (PDF + HTML + LaTeX extractors)
- Phase 2: Add DOCX extractor → contracts, proposals
- Phase 3: Add business document profiles → invoices, POs
- Phase 4: Add additional assemblers → DOCX, PDF output
8. Consequences
Positive
- Format independence: Content captured once, rendered anywhere
- 100% coverage: Provenance tracking ensures nothing is lost
- Multi-source quality: Best component from best source, every time
- Extensible: New document types via profiles, new formats via extractors/assemblers
- Agentic-ready: JSON components are individually embeddable, retrievable, referenceable
- Auditable: Provenance chain from source → extraction → alignment → selection → assembly
- Decomposable: Any component can be extracted, transformed, or replaced independently
Negative
- Complexity increase: Three-layer architecture (extract → map → assemble) vs single-pass conversion
- Processing time: Multi-source extraction is slower than single-source (mitigated by caching)
- Storage: UDOM JSON + alternatives + assets is larger than Markdown output alone
- Schema evolution: Adding new component types requires schema versioning
Risks
| Risk | Mitigation |
|---|---|
| Schema becomes too complex | Start minimal, extend via document type profiles |
| Alignment algorithm produces false matches | Conservative matching (require ≥2 of 4 match criteria) |
| Extraction quality varies by source | Confidence scores + fallback to primary source |
| Performance on large documents (100+ pages) | Streaming extraction, page-level parallelism |
9. References
- S2ORC (Allen AI): structured JSON for 81M papers, body_text with cite_spans/ref_spans
- Docling (IBM Research): DoclingDocument schema with typed content items (TextItem, TableItem, PictureItem, FormulaItem)
- PaperMage (Allen AI): character-span entity layers over unified symbols string
- GROBID: TEI XML with 68 fine-grained extraction labels
- OmniDocBench (CVPR 2025): multi-format verification benchmark
- Uni-Parser: cross-page merging with JSON/Markdown output
- Schema.org: ScholarlyArticle, Invoice, Order vocabularies
ADR-164 | Track: F (Documentation) | Task: F.0 | Author: Claude (Opus 4.6) | Created: 2026-02-09