
ADR-164: Universal Document Object Model (UDOM)

Status

PROPOSED (2026-02-09)

Executive Summary

The Universal Document Object Model (UDOM) defines a canonical JSON intermediate representation for any structured document type — academic papers, invoices, purchase orders, contracts, proposals, technical reports, and more. Content is captured as typed, positioned components (text blocks, headings, tables, figures, equations, line items, clauses, signatures) independent of source format (PDF, HTML, LaTeX, DOCX) and output format (Markdown, HTML, PDF, DOCX, JSON).

A document mapping function cross-checks components extracted from multiple source formats, aligns them by position and content, selects the highest-fidelity version of each component, and assembles the final output. This guarantees complete (100%) content coverage: every element of the original document is represented in the canonical model.

Design principle: Content is sovereign. Format is rendering. The UDOM captures WHAT the content is (semantically), WHERE it appears (structurally), and FROM WHERE it was extracted (provenance). Any output format is a projection of this model.


1. Problem Statement

1.1 The Format Coupling Problem

Current document processing pipelines are format-to-format converters (PDF→Markdown, DOCX→HTML). They operate on text streams, not structured content. This creates three critical problems:

  1. Information loss: Tables flatten to text, images lose captions, equations break, hierarchies collapse
  2. No cross-validation: When a PDF table extraction fails, there's no mechanism to recover from an HTML or LaTeX source of the same document
  3. Format lock-in: Output is coupled to a single target format. Converting Markdown→DOCX requires re-parsing, losing structure again

1.2 The Multi-Source Opportunity

Many documents exist in multiple formats simultaneously:

| Document Type | Available Sources |
| --- | --- |
| Academic paper | PDF, arXiv LaTeX source, ar5iv HTML, PubMed XML, publisher HTML |
| Invoice | PDF, email HTML, EDI/XML, accounting system export |
| Contract | PDF, DOCX (redline), HTML (e-sign platform), plain text |
| Proposal | DOCX, PDF, Google Docs HTML, presentation PPTX |
| Technical report | PDF, LaTeX, HTML, Confluence wiki |

Each source captures different components with different fidelity. No single source is best at everything. PDF preserves layout but loses structure. HTML preserves structure but loses pagination. LaTeX preserves math but requires compilation. The UDOM enables a best-of-all-sources assembly.

1.3 The Agentic Processing Vision

CODITECT's goal is to automate agentic discovery using scientific papers and business documents. This requires documents that can be:

  • Decomposed into atomic components for retrieval, embedding, and reasoning
  • Recomposed into any format for different consumers (researchers, executives, systems)
  • Cross-referenced with other documents (citation graphs, contract chains, invoice lineage)
  • Validated for completeness (every component from source is accounted for)
  • Versioned as enrichment passes add information

1.4 Prior Art

| System | Approach | Limitation |
| --- | --- | --- |
| S2ORC (Allen AI) | JSON with body_text + cite_spans + ref_entries | Academic papers only, no layout info |
| Docling (IBM) | DoclingDocument Pydantic model with typed items | No multi-source alignment |
| PaperMage (Allen AI) | Character-span entities over symbols string | Academic papers only, complex API |
| GROBID | TEI XML with 68 labels | XML-only output, academic focus |
| Schema.org | ScholarlyArticle / Invoice / Order vocabulary | Metadata-only, no content structure |

Gap: No existing system combines: (1) universal document types, (2) multi-source extraction, (3) component-level alignment, (4) format-independent content model, and (5) lossless round-trip assembly.


2. Decision

Implement the Universal Document Object Model (UDOM) as a JSON schema and processing pipeline within CODITECT that:

  1. Defines a canonical component taxonomy covering all structured document types
  2. Extracts components from multiple source formats into a unified JSON representation
  3. Aligns and merges components across sources using a document mapping function
  4. Assembles final output in any target format from the canonical model
  5. Guarantees 100% content coverage with provenance tracking

2.1 Architecture Overview

Source Formats                UDOM Pipeline               Output Formats
──────────────                ─────────────               ──────────────

PDF (pymupdf4llm)    ──┐   ┌────────────────┐     ┌──────────┐
                       │   │                │     │          │──► Markdown
HTML (ar5iv/web)     ──┤   │   Extraction   │     │ Assembly │──► HTML
                       ├──►│     Layer      │     │  Engine  │──► PDF
LaTeX (arXiv src)    ──┤   │  (per-source)  │     │          │──► DOCX
                       │   │                │     │          │──► JSON-LD
DOCX (python-docx)   ──┤   └───────┬────────┘     └────┬─────┘
                       │           │                   │
XML/EDI              ──┘           ▼                   ▲
                           ┌────────────────┐          │
                           │                │          │
                           │  Document Map  │──────────┘
                           │  (Alignment +  │
                           │   Selection)   │
                           │                │
                           └───────┬────────┘
                                   │
                                   ▼
                           ┌────────────────┐
                           │   UDOM JSON    │
                           │   (Canonical   │
                           │     Model)     │
                           └────────────────┘

2.2 Canonical Component Taxonomy

Every document is decomposed into typed components. Each component has content, position, provenance, and relationships.

Universal Components (all document types)

| Component Type | Description | Properties |
| --- | --- | --- |
| metadata | Document-level metadata | title, authors/parties, dates, identifiers, type |
| heading | Section heading | level (1-6), text, numbering |
| paragraph | Body text block | text, inline_refs (citations, links, cross-refs) |
| table | Structured tabular data | headers, rows, caption, label, merged_cells |
| figure | Image with context | image_data (path/base64), caption, label, alt_text |
| list | Ordered or unordered list | items (recursive), list_type |
| code_block | Code or pseudocode | content, language, caption |
| blockquote | Quoted text | content, attribution |
| page_break | Page boundary | page_number |
| header_footer | Running headers/footers | content, position (header/footer), page_range |

Academic Components

| Component Type | Description | Properties |
| --- | --- | --- |
| equation | Mathematical expression | latex, display (inline/block), label, number |
| algorithm | Formal algorithm | content, caption, label |
| theorem | Theorem/lemma/proof/etc. | content, theorem_type, label, number |
| citation | Inline citation reference | keys, style (author-year/numeric), expanded_text |
| bibliography_entry | Reference list item | key, authors, title, venue, year, doi, urls |
| abstract | Paper abstract | content |
| footnote | Footnote/endnote | content, marker, position |

Business Document Components

| Component Type | Description | Properties |
| --- | --- | --- |
| line_item | Invoice/PO line | description, quantity, unit_price, amount, tax, sku |
| clause | Contract clause | number, title, content, clause_type (obligation/right/condition) |
| term | Defined term | term, definition |
| signature_block | Signature area | signer_name, title, organization, date, signed (bool) |
| field_value | Form key-value pair | key, value, field_type |
| total | Summary amount | label, amount, currency |
| address_block | Postal/billing address | lines, entity_name, role (sender/recipient/billing/shipping) |
| date_field | Significant date | label, date, date_type (effective/expiry/due/payment) |
| stamp | Official stamp/seal | type, text, image_data |
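
The taxonomy above could map onto Python enums and dataclasses along the following lines. This is a minimal sketch, not the project's actual `schema.py`: the class names `ComponentType`, `Position`, `Provenance`, and `Component` are assumptions made for illustration, and only a handful of component types are listed.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Optional


class ComponentType(str, Enum):
    # Illustrative subset of the taxonomy; a full enum would list every type.
    HEADING = "heading"
    PARAGRAPH = "paragraph"
    TABLE = "table"
    EQUATION = "equation"
    LINE_ITEM = "line_item"
    CLAUSE = "clause"


@dataclass
class Position:
    order: int                          # reading-order index
    page: Optional[int] = None          # None for sources without pagination
    section_path: Optional[str] = None  # e.g. "2.1.3"


@dataclass
class Provenance:
    source: str        # "pdf", "html", "latex", ...
    extractor: str     # tool that produced the component
    confidence: float  # 0.0 to 1.0


@dataclass
class Component:
    id: str
    type: ComponentType
    content: Any       # type-specific payload (the Properties column)
    position: Position
    provenance: Provenance
    references: list[str] = field(default_factory=list)


# Example: a level-2 heading extracted from a PDF.
heading = Component(
    id="comp_001",
    type=ComponentType.HEADING,
    content={"level": 2, "text": "Decision", "numbering": "2"},
    position=Position(order=0, page=1, section_path="2"),
    provenance=Provenance(source="pdf", extractor="pymupdf4llm", confidence=0.95),
)
```

Keeping `content` as a type-specific payload means new component types need a new enum member and a payload convention, not a new class hierarchy.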

2.3 UDOM JSON Schema

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://coditect.ai/schemas/udom/v1.0.0",
  "title": "Universal Document Object Model",
  "version": "1.0.0",

  "type": "object",
  "required": ["schema_version", "document_id", "document_type", "metadata", "body"],

  "properties": {
    "schema_version": "1.0.0",
    "document_id": "<unique identifier>",
    "document_type": "<academic_paper|invoice|purchase_order|contract|proposal|report|other>",

    "metadata": {
      "title": "string",
      "subtitle": "string | null",
      "authors": [
        {
          "name": "string",
          "role": "string | null",
          "affiliation": "string | null",
          "email": "string | null",
          "identifiers": { "orcid": "...", "scopus": "..." }
        }
      ],
      "parties": [
        {
          "name": "string",
          "role": "buyer | seller | licensor | licensee | party_a | party_b",
          "address": "AddressBlock | null",
          "identifiers": { "tax_id": "...", "duns": "..." }
        }
      ],
      "dates": {
        "published": "ISO-8601 | null",
        "effective": "ISO-8601 | null",
        "expiry": "ISO-8601 | null",
        "due": "ISO-8601 | null",
        "created": "ISO-8601 | null"
      },
      "identifiers": {
        "doi": "string | null",
        "arxiv_id": "string | null",
        "isbn": "string | null",
        "invoice_number": "string | null",
        "po_number": "string | null",
        "contract_number": "string | null",
        "internal_id": "string | null"
      },
      "keywords": ["string"],
      "language": "en",
      "page_count": "integer",
      "abstract": "string | null"
    },

    "body": {
      "type": "component_list",
      "description": "Ordered list of document components in reading order",
      "children": [
        {
          "id": "comp_001",
          "type": "<component_type from taxonomy>",
          "content": "<type-specific content>",
          "position": {
            "page": "integer | null",
            "order": "integer (reading order index)",
            "section_path": "string (e.g., '2.1.3')",
            "bbox": {
              "x0": "float", "y0": "float",
              "x1": "float", "y1": "float",
              "page": "integer"
            }
          },
          "provenance": {
            "source": "pdf | html | latex | docx | xml | manual",
            "extractor": "pymupdf4llm | ar5iv | pandoc | python-docx | ...",
            "confidence": "float (0-1)",
            "alternatives": [
              {
                "source": "html",
                "content": "<alternative extraction>",
                "confidence": 0.85
              }
            ]
          },
          "relationships": {
            "parent": "comp_id | null",
            "children": ["comp_id"],
            "references": ["comp_id"],
            "referenced_by": ["comp_id"]
          }
        }
      ]
    },

    "bibliography": {
      "description": "Resolved reference entries (academic/legal citations)",
      "entries": {
        "BIBREF_001": {
          "key": "string",
          "authors": ["string"],
          "title": "string",
          "venue": "string | null",
          "year": "integer | null",
          "doi": "string | null",
          "arxiv_id": "string | null",
          "url": "string | null",
          "raw_text": "string"
        }
      }
    },

    "assets": {
      "description": "Binary assets (images, embedded files)",
      "items": [
        {
          "id": "asset_001",
          "filename": "figure1.png",
          "mime_type": "image/png",
          "storage": "file_path | base64 | url",
          "data": "string (path, base64 data, or URL)",
          "dimensions": { "width": "integer", "height": "integer" },
          "referenced_by": ["comp_003"]
        }
      ]
    },

    "extraction_report": {
      "sources_used": [
        {
          "format": "pdf",
          "path": "/path/to/source.pdf",
          "extractor": "pymupdf4llm",
          "components_extracted": 47,
          "confidence_mean": 0.92
        },
        {
          "format": "html",
          "url": "https://ar5iv.labs.arxiv.org/html/2303.15256",
          "extractor": "beautifulsoup4",
          "components_extracted": 52,
          "confidence_mean": 0.88
        }
      ],
      "alignment": {
        "components_matched": 45,
        "components_source_only": { "pdf": 2, "html": 7 },
        "components_upgraded": 12,
        "coverage_score": 1.0
      },
      "quality": {
        "grade": "A",
        "score": 96.5,
        "dimension_scores": {},
        "completeness": 1.0
      },
      "timestamp": "ISO-8601"
    }
  }
}
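
As a quick illustration of the `required` list in the schema, the sketch below checks a document instance for the five mandatory top-level fields using only the standard library. The helper name `missing_required` and the sample values in `minimal_doc` are hypothetical; a real pipeline would more likely validate against the formal JSON Schema file.

```python
# Top-level fields listed under "required" in the schema above.
REQUIRED_FIELDS = ("schema_version", "document_id", "document_type", "metadata", "body")


def missing_required(doc: dict) -> list[str]:
    """Return the required top-level fields that are absent from doc."""
    return [name for name in REQUIRED_FIELDS if name not in doc]


# Smallest instance that passes the check; all values are made up.
minimal_doc = {
    "schema_version": "1.0.0",
    "document_id": "doc-0001",
    "document_type": "invoice",
    "metadata": {"title": "Example invoice", "language": "en"},
    "body": {"type": "component_list", "children": []},
}

print(missing_required(minimal_doc))
print(missing_required({"body": {}}))
```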

2.4 Document Mapping Function

The document mapper is the core algorithm that aligns components across sources:

Phase 1: Extract
  For each source format:
    Parse document → extract typed components with positions
    Each component gets a provisional ID and source provenance

Phase 2: Align
  For each component from primary source (usually PDF, which preserves layout):
    Search other sources for matching component using:
      1. Position match: same page + similar bbox (±10% tolerance)
      2. Content match: normalized text similarity (≥0.85 Jaccard on tokens)
      3. Type match: same component type
      4. Structural match: same section path + relative position
    Link matched components across sources

Phase 3: Select
  For each aligned component group:
    Score each source version on type-specific quality metrics:
      - Tables: column count, cell completeness, separator integrity
      - Figures: resolution, caption presence, alt text
      - Equations: LaTeX validity, delimiter balance
      - Headings: hierarchy consistency, numbering preservation
      - Text: completeness (char count), formatting preservation
      - Citations: expansion completeness, link resolution
    Select highest-scoring version as canonical
    Store alternatives in provenance.alternatives

Phase 4: Supplement
  For components found in only one source:
    If confidence ≥ 0.7: include in model
    If confidence < 0.7: flag for review
  For components found in secondary sources but NOT primary:
    Include with source attribution (e.g., HTML-only appendix)

Phase 5: Validate
  Coverage check: every component from every source is either:
    - Included as canonical, OR
    - Stored as alternative, OR
    - Explicitly flagged as duplicate/noise
  Structural check: heading hierarchy valid, references resolve,
    tables have consistent column counts
  Completeness check: page coverage ≥ 95% of source page count
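
The Phase 2 matching criteria can be sketched as follows. This is one possible reading, not the pipeline's actual `align_components()`. It makes several assumptions: components are plain dicts, the ±10% bbox tolerance is measured against the first box's width and height, the structural criterion is reduced to an equal section path, and a match needs at least 2 of the 4 criteria (the threshold named in the Risks table).

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word tokens; punctuation is dropped."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity on token sets."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def bbox_close(a, b, tol=0.10) -> bool:
    """True if every edge of b lies within tol of a's width/height from a's edge."""
    width, height = a[2] - a[0], a[3] - a[1]
    return (
        abs(a[0] - b[0]) <= tol * width
        and abs(a[2] - b[2]) <= tol * width
        and abs(a[1] - b[1]) <= tol * height
        and abs(a[3] - b[3]) <= tol * height
    )


def is_match(c1: dict, c2: dict, min_criteria: int = 2) -> bool:
    """Count how many of the four criteria hold and compare to the threshold."""
    position_ok = (
        c1.get("bbox") is not None
        and c2.get("bbox") is not None
        and c1.get("page") == c2.get("page")
        and bbox_close(c1["bbox"], c2["bbox"])
    )
    criteria = [
        position_ok,                                   # 1. position
        jaccard(c1["text"], c2["text"]) >= 0.85,       # 2. content
        c1["type"] == c2["type"],                      # 3. type
        c1["section_path"] == c2["section_path"],      # 4. structure (simplified)
    ]
    return sum(criteria) >= min_criteria


pdf_par = {"type": "paragraph", "text": "The UDOM defines a canonical JSON model",
           "page": 1, "bbox": (50, 100, 550, 140), "section_path": "1"}
html_par = {"type": "paragraph", "text": "The UDOM defines a canonical JSON model.",
            "page": None, "bbox": None, "section_path": "1"}
print(is_match(pdf_par, html_par))
```

The HTML component has no bbox, so the position criterion fails, but content, type, and section path still agree. This is the situation the three independent positioning systems are meant to handle.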

2.5 Assembly Engine

The assembly engine renders UDOM JSON into any target format:

| Target | Assembler | Strategy |
| --- | --- | --- |
| Markdown | assemble_markdown(udom) | Component→markdown fragment, concatenate in reading order |
| HTML | assemble_html(udom) | Component→HTML element, semantic tags (article, section, figure, table) |
| DOCX | assemble_docx(udom) | Component→python-docx elements, style mapping |
| PDF | assemble_pdf(udom) | Component→reportlab/weasyprint elements |
| JSON-LD | assemble_jsonld(udom) | Component→Schema.org types (ScholarlyArticle, Invoice, etc.) |
| Plain text | assemble_text(udom) | Component→text content only, no formatting |

Each assembler is a pure function: UDOM → Format. No side effects. Deterministic output.
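
A pure assembler in this sense might look like the sketch below, which covers only headings, paragraphs, and tables. The per-type payload keys (`level`, `text`, `headers`, `rows`) follow the taxonomy's property names, but the exact layout of `content` is an assumption. Raising on unsupported types is also a design choice of this sketch, made so that content is never dropped silently.

```python
def render_component(comp: dict) -> str:
    """Render one component to a Markdown fragment."""
    kind, content = comp["type"], comp["content"]
    if kind == "heading":
        return "#" * content["level"] + " " + content["text"]
    if kind == "paragraph":
        return content["text"]
    if kind == "table":
        header = "| " + " | ".join(content["headers"]) + " |"
        divider = "| " + " | ".join("---" for _ in content["headers"]) + " |"
        rows = ["| " + " | ".join(row) + " |" for row in content["rows"]]
        return "\n".join([header, divider, *rows])
    raise ValueError(f"no Markdown renderer for component type: {kind}")


def assemble_markdown(udom: dict) -> str:
    """Pure function: same UDOM in, same Markdown out; the input is not mutated."""
    children = sorted(udom["body"]["children"], key=lambda c: c["position"]["order"])
    return "\n\n".join(render_component(c) for c in children) + "\n"


udom = {
    "body": {
        "children": [
            {"type": "paragraph", "content": {"text": "Body text."}, "position": {"order": 1}},
            {"type": "heading", "content": {"level": 1, "text": "Title"}, "position": {"order": 0}},
        ]
    }
}
print(assemble_markdown(udom))
```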

2.6 Document Type Profiles

Each document type has a profile that defines expected components and validation rules:

{
  "academic_paper": {
    "required_components": ["heading", "paragraph", "bibliography_entry"],
    "expected_components": ["abstract", "equation", "figure", "table", "citation"],
    "validation": {
      "must_have_abstract": true,
      "must_have_bibliography": true,
      "heading_hierarchy_required": true
    }
  },
  "invoice": {
    "required_components": ["metadata", "line_item", "total", "address_block"],
    "expected_components": ["date_field", "field_value", "stamp"],
    "validation": {
      "line_items_sum_to_total": true,
      "must_have_invoice_number": true,
      "must_have_dates": ["due"]
    }
  },
  "contract": {
    "required_components": ["heading", "clause", "signature_block"],
    "expected_components": ["term", "date_field", "address_block", "paragraph"],
    "validation": {
      "must_have_effective_date": true,
      "must_have_parties": true,
      "clauses_numbered": true
    }
  },
  "purchase_order": {
    "required_components": ["line_item", "total", "address_block"],
    "expected_components": ["field_value", "date_field", "signature_block"],
    "validation": {
      "must_have_po_number": true,
      "must_have_shipping_address": true,
      "line_items_sum_to_total": true
    }
  },
  "proposal": {
    "required_components": ["heading", "paragraph"],
    "expected_components": ["table", "figure", "line_item", "total", "signature_block"],
    "validation": {
      "heading_hierarchy_required": true
    }
  }
}
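
To show how a profile might drive validation, the sketch below checks `required_components` and the `line_items_sum_to_total` rule for the invoice profile. It rests on assumptions: the function name `validate_profile` is hypothetical, amounts are stored as decimal strings, and the last `total` component is taken as the grand total with no tax or discount handling.

```python
from decimal import Decimal

# Subset of the invoice profile shown above.
INVOICE_PROFILE = {
    "required_components": ["metadata", "line_item", "total", "address_block"],
    "validation": {"line_items_sum_to_total": True},
}


def validate_profile(components: list[dict], profile: dict) -> list[str]:
    """Return a list of human-readable violations; empty means the profile is satisfied."""
    errors = []
    present = {c["type"] for c in components}
    for required in profile["required_components"]:
        if required not in present:
            errors.append(f"missing required component: {required}")

    if profile["validation"].get("line_items_sum_to_total"):
        # Decimal avoids binary floating-point drift on currency amounts.
        line_sum = sum(
            (Decimal(c["content"]["amount"]) for c in components if c["type"] == "line_item"),
            Decimal("0"),
        )
        totals = [Decimal(c["content"]["amount"]) for c in components if c["type"] == "total"]
        if totals and line_sum != totals[-1]:
            errors.append(f"line items sum to {line_sum}, total states {totals[-1]}")
    return errors


invoice = [
    {"type": "metadata", "content": {}},
    {"type": "address_block", "content": {}},
    {"type": "line_item", "content": {"amount": "10.00"}},
    {"type": "line_item", "content": {"amount": "5.50"}},
    {"type": "total", "content": {"amount": "15.50"}},
]
print(validate_profile(invoice, INVOICE_PROFILE))
```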

3. Key Architecture Decisions

| # | Decision | Rationale | Alternatives Rejected |
| --- | --- | --- | --- |
| 1 | JSON as canonical format | Human-readable, universally parseable, schema-validatable, Git-diffable, embeddable in databases | XML/TEI (verbose, harder tooling), Protobuf (binary, not human-readable), custom binary (not inspectable) |
| 2 | Flat component list with relationships | Simple to iterate, filter, transform; relationships via IDs not nesting | Deep nesting (hard to flatten for search), graph database (infrastructure overhead), pure tree (can't represent cross-references) |
| 3 | Provenance on every component | Enables trust scoring, debugging, audit trail; critical for 100% coverage guarantee | Document-level provenance only (can't debug per-component), no provenance (no trust signal) |
| 4 | Multi-source extraction, not single-source conversion | Every source has strengths; combining gives best result for every component | Single best source (always has blind spots), manual enrichment (doesn't scale) |
| 5 | Document type profiles | Validation rules vary by document type; profiles make the model extensible without code changes | Hardcoded validation (rigid), no validation (no quality guarantee), universal rules (too loose or too strict) |
| 6 | Components carry alternatives | When alignment finds multiple versions, all are preserved; enables re-selection without re-extraction | Best-only (loses information), separate alternatives file (complex), re-extract on demand (expensive) |
| 7 | Position includes bbox + reading order + section path | Three independent positioning systems allow alignment even when one is missing; bbox for PDF, section path for HTML, reading order for text | Bbox-only (HTML has no bbox), reading-order-only (ambiguous for multi-column), section-path-only (flat documents have no sections) |
| 8 | Assets stored externally with references | Images can be large; storing as file paths or URLs keeps the JSON manageable; base64 option for self-contained use | Inline base64 only (bloats JSON), file paths only (not portable), URLs only (requires hosting) |

4. Implementation Plan

Phase 1: Schema & Core (Week 1)

| Task | Description |
| --- | --- |
| Define JSON Schema | Formal JSON Schema 2020-12 in config/schemas/udom-v1.schema.json |
| Component taxonomy | Python enums/dataclasses for all component types |
| UDOM dataclass | UDOMDocument with serialization/deserialization |
| Document type profiles | Profile definitions for 5 document types |

Phase 2: Extractors (Week 2-3)

| Extractor | Source | Components Extracted |
| --- | --- | --- |
| extract_from_pdf(pdf_path) | PDF via pymupdf4llm + pdfplumber + pdfminer | All universal + academic components |
| extract_from_html(html/url) | HTML via BeautifulSoup (ar5iv, web) | Headings, tables, figures, citations, math |
| extract_from_latex(tex_path) | LaTeX via pandoc + pylatexenc | Equations, structure, bibliography, theorems |
| extract_from_docx(docx_path) | DOCX via python-docx | All universal + business components |

Each extractor returns list[Component] — raw components with provenance.

Phase 3: Document Mapper (Week 3-4)

| Module | Function |
| --- | --- |
| align_components() | Match components across sources by position + content + type |
| score_component() | Type-specific quality scoring for selection |
| select_canonical() | Pick best version of each component, store alternatives |
| validate_coverage() | Ensure 100% coverage, flag gaps |
| build_udom() | Orchestrate: extract → align → select → validate → UDOM JSON |
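
Type-specific scoring and selection could be sketched as below for tables. The metric (an equal-weight blend of column-count consistency and cell completeness) and the dict layout of each version are assumptions for illustration; the ADR names the criteria but does not fix a formula. The third table criterion, separator integrity, is left out of this sketch.

```python
def score_table(table: dict) -> float:
    """Score in [0, 1]: half for rows matching the header width, half for non-empty cells."""
    headers, rows = table["headers"], table["rows"]
    if not headers or not rows:
        return 0.0
    consistent = sum(1 for row in rows if len(row) == len(headers)) / len(rows)
    cells = [cell for row in rows for cell in row]
    filled = sum(1 for cell in cells if str(cell).strip()) / len(cells) if cells else 0.0
    return 0.5 * consistent + 0.5 * filled


def select_canonical(versions: list[dict]) -> dict:
    """Pick the highest-scoring version; keep the rest as alternatives."""
    # sorted() is stable, so on a tie the earlier (primary) source wins.
    ranked = sorted(versions, key=lambda v: score_table(v["content"]), reverse=True)
    best, rest = ranked[0], ranked[1:]
    return {
        "content": best["content"],
        "provenance": {
            "source": best["source"],
            "alternatives": [{"source": v["source"], "content": v["content"]} for v in rest],
        },
    }


# The PDF extraction lost a cell and truncated a row; the HTML version is intact.
pdf_version = {"source": "pdf", "content": {"headers": ["a", "b"], "rows": [["1", ""], ["2"]]}}
html_version = {"source": "html", "content": {"headers": ["a", "b"], "rows": [["1", "x"], ["2", "y"]]}}

chosen = select_canonical([pdf_version, html_version])
print(chosen["provenance"]["source"])
```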

Phase 4: Assemblers (Week 4-5)

| Assembler | Output |
| --- | --- |
| assemble_markdown() | GitHub-flavored Markdown with proper tables, images, math |
| assemble_html() | Semantic HTML5 with Schema.org microdata |
| assemble_text() | Plain text with structural indicators |

Additional assemblers (DOCX, PDF) are deferred to Phase 5 at the earliest; Incremental Adoption (section 7) lists them as the final step.

Phase 5: Integration (Week 5-6)

  • Replace pipeline.py text-level enrichment with UDOM-based extraction → mapping → assembly
  • Batch processing: udom_batch.py for 218 academic papers
  • QA validation: compare UDOM-assembled markdown against v1.2 output
  • Pipeline report enhancement: include UDOM extraction_report

5. File Structure

skills/pdf-to-markdown/
├── src/
│   ├── convert.py                  # Existing v2.5 converter (unchanged)
│   ├── pipeline.py                 # Enhanced to use UDOM
│   └── udom/
│       ├── __init__.py
│       ├── schema.py               # UDOM dataclasses + JSON schema
│       ├── taxonomy.py             # Component type enums + profiles
│       ├── extractors/
│       │   ├── __init__.py
│       │   ├── pdf_extractor.py    # PDF → components
│       │   ├── html_extractor.py   # HTML → components (ar5iv, web)
│       │   ├── latex_extractor.py  # LaTeX → components
│       │   └── docx_extractor.py   # DOCX → components
│       ├── mapper.py               # Alignment + selection + validation
│       └── assemblers/
│           ├── __init__.py
│           ├── markdown.py         # UDOM → Markdown
│           ├── html.py             # UDOM → HTML
│           └── text.py             # UDOM → Plain text

config/schemas/
└── udom-v1.schema.json             # Formal JSON Schema

6. Quality Guarantees

6.1 Coverage Guarantee

Every component from every source must be accounted for:

Coverage = (canonical + alternatives + flagged_noise) / total_extracted
Target: Coverage = 1.0 (100%)

If coverage < 1.0, the pipeline fails rather than silently dropping content.
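
The coverage formula and the fail-fast rule can be sketched as follows. The names `coverage`, `validate_coverage`, and `CoverageError` are hypothetical. The example counts mirror the extraction_report sample in section 2.3 (47 PDF + 52 HTML components, 45 matched pairs, 9 single-source components), under the assumption that each matched pair yields one canonical and one alternative and that every single-source component is kept as canonical.

```python
class CoverageError(RuntimeError):
    """Raised when some extracted component is unaccounted for."""


def coverage(canonical: int, alternatives: int, flagged_noise: int, total_extracted: int) -> float:
    """Coverage = (canonical + alternatives + flagged_noise) / total_extracted."""
    if total_extracted == 0:
        return 1.0  # nothing was extracted, so nothing can be missing
    return (canonical + alternatives + flagged_noise) / total_extracted


def validate_coverage(canonical: int, alternatives: int, flagged_noise: int, total_extracted: int) -> float:
    """Return the coverage score, or raise instead of silently dropping content."""
    score = coverage(canonical, alternatives, flagged_noise, total_extracted)
    if score < 1.0:
        missing = total_extracted - (canonical + alternatives + flagged_noise)
        raise CoverageError(f"{missing} component(s) unaccounted for (coverage {score:.3f})")
    return score


# 45 matched pairs -> 45 canonical + 45 alternatives; 2 PDF-only + 7 HTML-only -> 9 more canonical.
total = 47 + 52
print(validate_coverage(canonical=45 + 9, alternatives=45, flagged_noise=0, total_extracted=total))
```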

6.2 Round-Trip Fidelity

For any document D and any supported format F:

D → UDOM(D) → assemble_F(UDOM(D)) ≈ D_in_format_F

"Approximately equals" means: all textual content preserved, all structural relationships preserved, visual formatting may differ (font, spacing, pagination).

6.3 Idempotent Extraction

Re-extracting from the same source produces identical UDOM JSON (deterministic).
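
One way to test this property is to compare fingerprints of two extraction runs, as in the sketch below. It assumes a canonical serialization (sorted keys, fixed separators) and that `extraction_report.timestamp` is excluded from the comparison, since a run timestamp would otherwise differ between runs. The function names are hypothetical.

```python
import copy
import hashlib
import json


def canonical_json(udom: dict) -> str:
    """Serialize with sorted keys and fixed separators so equal content gives equal text."""
    return json.dumps(udom, sort_keys=True, ensure_ascii=False, separators=(",", ":"))


def fingerprint(udom: dict) -> str:
    """SHA-256 of the canonical serialization, ignoring the run timestamp."""
    stable = copy.deepcopy(udom)  # leave the caller's document untouched
    # Assumption: the timestamp is run metadata, not document content.
    stable.get("extraction_report", {}).pop("timestamp", None)
    return hashlib.sha256(canonical_json(stable).encode("utf-8")).hexdigest()


# Same content, different key order and run time.
run_1 = {"document_id": "doc-0001", "extraction_report": {"timestamp": "2026-02-09T10:00:00Z"}}
run_2 = {"extraction_report": {"timestamp": "2026-02-09T11:30:00Z"}, "document_id": "doc-0001"}

print(fingerprint(run_1) == fingerprint(run_2))
```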

6.4 Monotonic Enrichment

Adding a new source can only improve quality, never degrade it:

quality(UDOM(source_A)) ≤ quality(UDOM(source_A, source_B))

This is guaranteed by the selection algorithm: alternatives are stored but never replace a higher-scoring canonical version.
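
Read per component, the argument can be written out as follows. This assumes that quality is the selection score $s$ of the canonical version and that selection takes the maximum over the available sources. Both the symbols and the per-group reading are assumptions, since the ADR does not define `quality` formally.

```latex
% For every aligned component group g that source A contributes to:
q_{A}(g) = s_{A}(g), \qquad
q_{A,B}(g) = \max\bigl(s_{A}(g),\, s_{B}(g)\bigr) \;\ge\; s_{A}(g) = q_{A}(g).
```

The inequality holds group by group. An aggregate such as a mean over all components is not automatically monotonic, because components found only in source B enter the model with their own scores.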


7. Migration Path

From Pipeline v1.2

  1. Pipeline v1.2 continues to work unchanged (--skip-enrichment)
  2. Pipeline v1.3 uses UDOM internally but outputs identical Markdown
  3. --udom-json flag outputs the canonical UDOM JSON alongside Markdown
  4. --udom-only flag outputs only UDOM JSON (no Markdown assembly)
  5. Batch reprocessing: all 218 papers through UDOM pipeline, compare grades
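
The flags named above could be wired up with `argparse` roughly as shown. This is only a sketch: the positional `pdf_path` argument, the help texts, and the choice to make `--udom-json` and `--udom-only` mutually exclusive are assumptions, not documented behavior of `pipeline.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="pipeline.py")
    parser.add_argument("pdf_path", help="input document (hypothetical positional argument)")
    parser.add_argument("--skip-enrichment", action="store_true",
                        help="run the unchanged v1.2 conversion path")
    # Assumption: the two UDOM output modes cannot be combined.
    output = parser.add_mutually_exclusive_group()
    output.add_argument("--udom-json", action="store_true",
                        help="write UDOM JSON alongside the Markdown output")
    output.add_argument("--udom-only", action="store_true",
                        help="write only UDOM JSON, skipping Markdown assembly")
    return parser


args = build_parser().parse_args(["paper.pdf", "--udom-json"])
print(args.udom_json, args.udom_only, args.skip_enrichment)
```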

Incremental Adoption

  • Phase 1: Academic papers only (PDF + HTML + LaTeX extractors)
  • Phase 2: Add DOCX extractor → contracts, proposals
  • Phase 3: Add business document profiles → invoices, POs
  • Phase 4: Add additional assemblers → DOCX, PDF output

8. Consequences

Positive

  • Format independence: Content captured once, rendered anywhere
  • 100% coverage: Coverage accounting with per-component provenance ensures no extracted component is silently dropped
  • Multi-source quality: Each component is taken from the source that scores highest for it
  • Extensible: New document types via profiles, new formats via extractors/assemblers
  • Agentic-ready: JSON components are individually embeddable, retrievable, referenceable
  • Auditable: Provenance chain from source → extraction → alignment → selection → assembly
  • Decomposable: Any component can be extracted, transformed, or replaced independently

Negative

  • Complexity increase: Three-layer architecture (extract → map → assemble) vs single-pass conversion
  • Processing time: Multi-source extraction is slower than single-source (mitigated by caching)
  • Storage: UDOM JSON + alternatives + assets is larger than Markdown output alone
  • Schema evolution: Adding new component types requires schema versioning

Risks

| Risk | Mitigation |
| --- | --- |
| Schema becomes too complex | Start minimal, extend via document type profiles |
| Alignment algorithm produces false matches | Conservative matching (require ≥2 of 4 match criteria) |
| Extraction quality varies by source | Confidence scores + fallback to primary source |
| Performance on large documents (100+ pages) | Streaming extraction, page-level parallelism |

9. References

  • S2ORC (Allen AI) — Structured JSON for 81M papers: body_text with cite_spans/ref_spans
  • Docling (IBM Research) — DoclingDocument schema with typed content items (TextItem, TableItem, PictureItem, FormulaItem)
  • PaperMage (Allen AI) — Character-span entity layers over unified symbols string
  • GROBID — TEI XML with 68 fine-grained extraction labels
  • OmniDocBench (CVPR 2025) — Multi-format verification benchmark
  • Uni-Parser — Cross-page merging with JSON/Markdown output
  • Schema.org — ScholarlyArticle, Invoice, Order vocabularies

ADR-164 | Track: F (Documentation) | Task: F.0 | Author: Claude (Opus 4.6) | Created: 2026-02-09