ADR-164: Universal Document Object Model (UDOM)

Status

PROPOSED (2026-02-09)

Executive Summary

The Universal Document Object Model (UDOM) defines a canonical JSON intermediate representation for any structured document type — academic papers, invoices, purchase orders, contracts, proposals, technical reports, and more. Content is captured as typed, positioned components (text blocks, headings, tables, figures, equations, line items, clauses, signatures) independent of source format (PDF, HTML, LaTeX, DOCX) and output format (Markdown, HTML, PDF, DOCX, JSON).

A document mapping function cross-checks components extracted from multiple source formats, aligns them by position and content, selects the highest-fidelity version of each component, and assembles the final output. The aim is complete content coverage: every element of the original document is represented in the canonical model, and the pipeline fails rather than silently dropping content.

Design principle: Content is sovereign. Format is rendering. The UDOM captures WHAT the content is (semantically), WHERE it appears (structurally), and FROM WHERE it was extracted (provenance). Any output format is a projection of this model.

1. Problem Statement

1.1 The Format Coupling Problem

Current document processing pipelines are format-to-format converters (PDF→Markdown, DOCX→HTML). They operate on text streams, not structured content. This creates three critical problems:

Information loss: Tables flatten to text, images lose captions, equations break, hierarchies collapse
No cross-validation: When a PDF table extraction fails, there's no mechanism to recover from an HTML or LaTeX source of the same document
Format lock-in: Output is coupled to a single target format. Converting Markdown→DOCX requires re-parsing, losing structure again

1.2 The Multi-Source Opportunity

Many documents exist in multiple formats simultaneously:

Document Type	Available Sources
Academic paper	PDF, arXiv LaTeX source, ar5iv HTML, PubMed XML, publisher HTML
Invoice	PDF, email HTML, EDI/XML, accounting system export
Contract	PDF, DOCX (redline), HTML (e-sign platform), plain text
Proposal	DOCX, PDF, Google Docs HTML, presentation PPTX
Technical report	PDF, LaTeX, HTML, Confluence wiki

Each source captures different components with different fidelity. No single source is best at everything. PDF preserves layout but loses structure. HTML preserves structure but loses pagination. LaTeX preserves math but requires compilation. The UDOM enables a best-of-all-sources assembly.

1.3 The Agentic Processing Vision

CODITECT's goal is to automate agentic discovery using scientific papers and business documents. This requires documents that can be:

Decomposed into atomic components for retrieval, embedding, and reasoning
Recomposed into any format for different consumers (researchers, executives, systems)
Cross-referenced with other documents (citation graphs, contract chains, invoice lineage)
Validated for completeness (every component from source is accounted for)
Versioned as enrichment passes add information

1.4 Prior Art

System	Approach	Limitation
S2ORC (Allen AI)	JSON with body_text + cite_spans + ref_entries	Academic papers only, no layout info
Docling (IBM)	DoclingDocument Pydantic model with typed items	No multi-source alignment
PaperMage (Allen AI)	Character-span entities over symbols string	Academic papers only, complex API
GROBID	TEI XML with 68 labels	XML-only output, academic focus
Schema.org	ScholarlyArticle / Invoice / Order vocabulary	Metadata-only, no content structure

Gap: No existing system combines all five of the following: (1) universal document types, (2) multi-source extraction, (3) component-level alignment, (4) a format-independent content model, and (5) lossless round-trip assembly.

2. Decision

Implement the Universal Document Object Model (UDOM) as a JSON schema and processing pipeline within CODITECT that:

Defines a canonical component taxonomy covering all structured document types
Extracts components from multiple source formats into a unified JSON representation
Aligns and merges components across sources using a document mapping function
Assembles final output in any target format from the canonical model
Guarantees 100% content coverage with provenance tracking

2.1 Architecture Overview

Source Formats                 UDOM Pipeline                    Output Formats
──────────────                 ─────────────                    ──────────────

PDF (pymupdf4llm)  ──┐    ┌────────────────┐    ┌──────────┐
                      │    │                │    │          │──► Markdown
HTML (ar5iv/web)   ──┤    │  Extraction    │    │ Assembly │──► HTML
                      ├──►│  Layer         │    │ Engine   │──► PDF
LaTeX (arXiv src)  ──┤    │  (per-source)  │    │          │──► DOCX
                      │    │                │    │          │──► JSON-LD
DOCX (python-docx) ──┤    └───────┬────────┘    └────┬─────┘
                      │            │                   │
XML/EDI            ──┘            ▼                   ▲
                          ┌────────────────┐          │
                          │                │          │
                          │  Document Map  │──────────┘
                          │  (Alignment +  │
                          │   Selection)   │
                          │                │
                          └───────┬────────┘
                                  │
                                  ▼
                          ┌────────────────┐
                          │  UDOM JSON     │
                          │  (Canonical    │
                          │   Model)       │
                          └────────────────┘

2.2 Canonical Component Taxonomy

Every document is decomposed into typed components. Each component has content, position, provenance, and relationships.

Universal Components (all document types)

Component Type	Description	Properties
`metadata`	Document-level metadata	title, authors/parties, dates, identifiers, type
`heading`	Section heading	level (1-6), text, numbering
`paragraph`	Body text block	text, inline_refs (citations, links, cross-refs)
`table`	Structured tabular data	headers, rows, caption, label, merged_cells
`figure`	Image with context	image_data (path/base64), caption, label, alt_text
`list`	Ordered or unordered list	items (recursive), list_type
`code_block`	Code or pseudocode	content, language, caption
`blockquote`	Quoted text	content, attribution
`page_break`	Page boundary	page_number
`header_footer`	Running headers/footers	content, position (header/footer), page_range

Academic Components

Component Type	Description	Properties
`equation`	Mathematical expression	latex, display (inline/block), label, number
`algorithm`	Formal algorithm	content, caption, label
`theorem`	Theorem/lemma/proof/etc.	content, theorem_type, label, number
`citation`	Inline citation reference	keys, style (author-year/numeric), expanded_text
`bibliography_entry`	Reference list item	key, authors, title, venue, year, doi, urls
`abstract`	Paper abstract	content
`footnote`	Footnote/endnote	content, marker, position

Business Document Components

Component Type	Description	Properties
`line_item`	Invoice/PO line	description, quantity, unit_price, amount, tax, sku
`clause`	Contract clause	number, title, content, clause_type (obligation/right/condition)
`term`	Defined term	term, definition
`signature_block`	Signature area	signer_name, title, organization, date, signed (bool)
`field_value`	Form key-value pair	key, value, field_type
`total`	Summary amount	label, amount, currency
`address_block`	Postal/billing address	lines, entity_name, role (sender/recipient/billing/shipping)
`date_field`	Significant date	label, date, date_type (effective/expiry/due/payment)
`stamp`	Official stamp/seal	type, text, image_data

2.3 UDOM JSON Schema

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://coditect.ai/schemas/udom/v1.0.0",
  "title": "Universal Document Object Model",
  "version": "1.0.0",

  "type": "object",
  "required": ["schema_version", "document_id", "document_type", "metadata", "body"],

  "properties": {
    "schema_version": "1.0.0",
    "document_id": "<unique identifier>",
    "document_type": "<academic_paper|invoice|purchase_order|contract|proposal|report|other>",

    "metadata": {
      "title": "string",
      "subtitle": "string | null",
      "authors": [
        {
          "name": "string",
          "role": "string | null",
          "affiliation": "string | null",
          "email": "string | null",
          "identifiers": { "orcid": "...", "scopus": "..." }
        }
      ],
      "parties": [
        {
          "name": "string",
          "role": "buyer | seller | licensor | licensee | party_a | party_b",
          "address": "AddressBlock | null",
          "identifiers": { "tax_id": "...", "duns": "..." }
        }
      ],
      "dates": {
        "published": "ISO-8601 | null",
        "effective": "ISO-8601 | null",
        "expiry": "ISO-8601 | null",
        "due": "ISO-8601 | null",
        "created": "ISO-8601 | null"
      },
      "identifiers": {
        "doi": "string | null",
        "arxiv_id": "string | null",
        "isbn": "string | null",
        "invoice_number": "string | null",
        "po_number": "string | null",
        "contract_number": "string | null",
        "internal_id": "string | null"
      },
      "keywords": ["string"],
      "language": "en",
      "page_count": "integer",
      "abstract": "string | null"
    },

    "body": {
      "type": "component_list",
      "description": "Ordered list of document components in reading order",
      "children": [
        {
          "id": "comp_001",
          "type": "<component_type from taxonomy>",
          "content": "<type-specific content>",
          "position": {
            "page": "integer | null",
            "order": "integer (reading order index)",
            "section_path": "string (e.g., '2.1.3')",
            "bbox": {
              "x0": "float", "y0": "float",
              "x1": "float", "y1": "float",
              "page": "integer"
            }
          },
          "provenance": {
            "source": "pdf | html | latex | docx | xml | manual",
            "extractor": "pymupdf4llm | ar5iv | pandoc | python-docx | ...",
            "confidence": "float (0-1)",
            "alternatives": [
              {
                "source": "html",
                "content": "<alternative extraction>",
                "confidence": 0.85
              }
            ]
          },
          "relationships": {
            "parent": "comp_id | null",
            "children": ["comp_id"],
            "references": ["comp_id"],
            "referenced_by": ["comp_id"]
          }
        }
      ]
    },

    "bibliography": {
      "description": "Resolved reference entries (academic/legal citations)",
      "entries": {
        "BIBREF_001": {
          "key": "string",
          "authors": ["string"],
          "title": "string",
          "venue": "string | null",
          "year": "integer | null",
          "doi": "string | null",
          "arxiv_id": "string | null",
          "url": "string | null",
          "raw_text": "string"
        }
      }
    },

    "assets": {
      "description": "Binary assets (images, embedded files)",
      "items": [
        {
          "id": "asset_001",
          "filename": "figure1.png",
          "mime_type": "image/png",
          "storage": "file_path | base64 | url",
          "data": "string (path, base64 data, or URL)",
          "dimensions": { "width": "integer", "height": "integer" },
          "referenced_by": ["comp_003"]
        }
      ]
    },

    "extraction_report": {
      "sources_used": [
        {
          "format": "pdf",
          "path": "/path/to/source.pdf",
          "extractor": "pymupdf4llm",
          "components_extracted": 47,
          "confidence_mean": 0.92
        },
        {
          "format": "html",
          "url": "https://ar5iv.labs.arxiv.org/html/2303.15256",
          "extractor": "beautifulsoup4",
          "components_extracted": 52,
          "confidence_mean": 0.88
        }
      ],
      "alignment": {
        "components_matched": 45,
        "components_source_only": { "pdf": 2, "html": 7 },
        "components_upgraded": 12,
        "coverage_score": 1.0
      },
      "quality": {
        "grade": "A",
        "score": 96.5,
        "dimension_scores": {},
        "completeness": 1.0
      },
      "timestamp": "ISO-8601"
    }
  }
}
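
As a rough illustration of the component shape above, the following Python sketch builds one body component and serializes it to JSON. The class names `Position`, `Provenance`, and `Component` are hypothetical (the ADR itself only names `UDOMDocument`), and only a subset of the fields is modelled.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Any, Optional


@dataclass
class Position:
    order: int                          # reading order index
    page: Optional[int] = None
    section_path: Optional[str] = None  # e.g. "2.1.3"


@dataclass
class Provenance:
    source: str        # pdf | html | latex | docx | xml | manual
    extractor: str
    confidence: float  # 0-1
    alternatives: list[dict[str, Any]] = field(default_factory=list)


@dataclass
class Component:
    id: str
    type: str          # a component type from the taxonomy
    content: Any
    position: Position
    provenance: Provenance


heading = Component(
    id="comp_001",
    type="heading",
    content={"level": 1, "text": "Introduction", "numbering": "1"},
    position=Position(order=0, page=1, section_path="1"),
    provenance=Provenance(source="pdf", extractor="pymupdf4llm", confidence=0.95),
)

# asdict() recurses into nested dataclasses, so the result is plain JSON data.
serialized = json.dumps(asdict(heading), indent=2, sort_keys=True)
restored = json.loads(serialized)
```

Relationships and bounding boxes are left out here to keep the sketch short; they would follow the same pattern.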

2.4 Document Mapping Function

The document mapper is the core algorithm that aligns components across sources:

Phase 1: Extract
  For each source format:
    Parse document → extract typed components with positions
    Each component gets a provisional ID and source provenance

Phase 2: Align
  For each component from primary source (usually PDF — preserves layout):
    Search other sources for matching component using:
      1. Position match: same page + similar bbox (±10% tolerance)
      2. Content match: normalized text similarity (≥0.85 Jaccard on tokens)
      3. Type match: same component type
      4. Structural match: same section path + relative position
    Link matched components across sources

Phase 3: Select
  For each aligned component group:
    Score each source version on type-specific quality metrics:
      - Tables: column count, cell completeness, separator integrity
      - Figures: resolution, caption presence, alt text
      - Equations: LaTeX validity, delimiter balance
      - Headings: hierarchy consistency, numbering preservation
      - Text: completeness (char count), formatting preservation
      - Citations: expansion completeness, link resolution
    Select highest-scoring version as canonical
    Store alternatives in provenance.alternatives

Phase 4: Supplement
  For components found in only one source:
    If confidence ≥ 0.7: include in model
    If confidence < 0.7: flag for review
  For components found in secondary sources but NOT primary:
    Include with source attribution (e.g., HTML-only appendix)

Phase 5: Validate
  Coverage check: every component from every source is either:
    - Included as canonical, OR
    - Stored as alternative, OR
    - Explicitly flagged as duplicate/noise
  Structural check: heading hierarchy valid, references resolve,
    tables have consistent column counts
  Completeness check: page coverage ≥ 95% of source page count
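
The Phase 2 alignment criteria can be sketched in Python as follows. This is a minimal illustration, not the mapper itself: the `Candidate` class and the helper names are hypothetical, the structural criterion is simplified to section-path equality (relative position is ignored), and the "at least 2 of 4 criteria" threshold is taken from the Risks table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    type: str
    text: str
    section_path: Optional[str] = None
    page: Optional[int] = None
    bbox: Optional[tuple[float, float, float, float]] = None  # x0, y0, x1, y1


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity on lowercased whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def bbox_close(a, b, tol: float = 0.10) -> bool:
    """True if each coordinate differs by at most tol of the reference box size."""
    width, height = a[2] - a[0], a[3] - a[1]
    limits = (tol * width, tol * height, tol * width, tol * height)
    return all(abs(x - y) <= lim for x, y, lim in zip(a, b, limits))


def match_count(primary: Candidate, other: Candidate) -> int:
    """Count how many of the four Phase 2 criteria hold."""
    position = (
        primary.page is not None
        and primary.page == other.page
        and primary.bbox is not None
        and other.bbox is not None
        and bbox_close(primary.bbox, other.bbox)
    )
    content = jaccard(primary.text, other.text) >= 0.85
    same_type = primary.type == other.type
    # Simplification: section path only, relative position is not compared.
    structural = (
        primary.section_path is not None
        and primary.section_path == other.section_path
    )
    return sum([position, content, same_type, structural])


def is_match(primary: Candidate, other: Candidate, required: int = 2) -> bool:
    return match_count(primary, other) >= required


pdf = Candidate("paragraph", "Transformers scale well with data", "2.1",
                page=3, bbox=(72.0, 100.0, 540.0, 160.0))
html = Candidate("paragraph", "Transformers scale well with data", "2.1")
table = Candidate("table", "Results on the benchmark", "4")
```

Here `pdf` and `html` match on content, type, and section path even though the HTML source has no page or bbox, which is the situation the three independent positioning systems are meant to handle.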

2.5 Assembly Engine

The assembly engine renders UDOM JSON into any target format:

Target	Assembler	Strategy
Markdown	`assemble_markdown(udom)`	Component→markdown fragment, concatenate in reading order
HTML	`assemble_html(udom)`	Component→HTML element, semantic tags (article, section, figure, table)
DOCX	`assemble_docx(udom)`	Component→python-docx elements, style mapping
PDF	`assemble_pdf(udom)`	Component→reportlab/weasyprint elements
JSON-LD	`assemble_jsonld(udom)`	Component→Schema.org types (ScholarlyArticle, Invoice, etc.)
Plain text	`assemble_text(udom)`	Component→text content only, no formatting

Each assembler is a pure function: UDOM → Format. No side effects. Deterministic output.
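
A minimal sketch of what such a pure assembler might look like for Markdown, covering only three component types. The function name `assemble_markdown` comes from the table above; the helper `render_component` and the exact content keys (`level`, `text`, `headers`, `rows`) are assumptions based on the taxonomy.

```python
def render_component(comp: dict) -> str:
    """Map one component to a Markdown fragment (subset of the taxonomy)."""
    kind, content = comp["type"], comp["content"]
    if kind == "heading":
        return "#" * content["level"] + " " + content["text"]
    if kind == "paragraph":
        return content["text"]
    if kind == "table":
        header = "| " + " | ".join(content["headers"]) + " |"
        rule = "| " + " | ".join("---" for _ in content["headers"]) + " |"
        rows = ["| " + " | ".join(row) + " |" for row in content["rows"]]
        return "\n".join([header, rule, *rows])
    raise ValueError(f"no Markdown renderer for component type {kind!r}")


def assemble_markdown(udom: dict) -> str:
    """Pure function: sorts a copy by reading order, never mutates the input."""
    children = sorted(udom["body"]["children"], key=lambda c: c["position"]["order"])
    return "\n\n".join(render_component(c) for c in children) + "\n"


udom = {"body": {"children": [
    {"type": "paragraph", "content": {"text": "Body text."}, "position": {"order": 1}},
    {"type": "heading", "content": {"level": 2, "text": "Results"}, "position": {"order": 0}},
]}}
markdown = assemble_markdown(udom)
```

Because the function reads only its argument and uses `sorted()` rather than sorting in place, calling it twice on the same model yields the same string.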

2.6 Document Type Profiles

Each document type has a profile that defines expected components and validation rules:

{
  "academic_paper": {
    "required_components": ["heading", "paragraph", "bibliography_entry"],
    "expected_components": ["abstract", "equation", "figure", "table", "citation"],
    "validation": {
      "must_have_abstract": true,
      "must_have_bibliography": true,
      "heading_hierarchy_required": true
    }
  },
  "invoice": {
    "required_components": ["metadata", "line_item", "total", "address_block"],
    "expected_components": ["date_field", "field_value", "stamp"],
    "validation": {
      "line_items_sum_to_total": true,
      "must_have_invoice_number": true,
      "must_have_dates": ["due"]
    }
  },
  "contract": {
    "required_components": ["heading", "clause", "signature_block"],
    "expected_components": ["term", "date_field", "address_block", "paragraph"],
    "validation": {
      "must_have_effective_date": true,
      "must_have_parties": true,
      "clauses_numbered": true
    }
  },
  "purchase_order": {
    "required_components": ["line_item", "total", "address_block"],
    "expected_components": ["field_value", "date_field", "signature_block"],
    "validation": {
      "must_have_po_number": true,
      "must_have_shipping_address": true,
      "line_items_sum_to_total": true
    }
  },
  "proposal": {
    "required_components": ["heading", "paragraph"],
    "expected_components": ["table", "figure", "line_item", "total", "signature_block"],
    "validation": {
      "heading_hierarchy_required": true
    }
  }
}
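
To show how a profile could drive validation without code changes per document type, here is a small Python sketch that checks the `purchase_order` profile. The function `validate_profile` is hypothetical, and it assumes that amounts are stored as decimal strings, that tax is ignored, and that the last `total` component is the grand total.

```python
from decimal import Decimal

PURCHASE_ORDER_PROFILE = {
    "required_components": ["line_item", "total", "address_block"],
    "validation": {"line_items_sum_to_total": True},
}


def validate_profile(components: list[dict], profile: dict) -> list[str]:
    """Return a list of human-readable violations (empty means valid)."""
    errors = []
    present = {c["type"] for c in components}
    for required in profile["required_components"]:
        if required not in present:
            errors.append(f"missing required component: {required}")

    if profile["validation"].get("line_items_sum_to_total"):
        items = [c["content"] for c in components if c["type"] == "line_item"]
        totals = [c["content"] for c in components if c["type"] == "total"]
        line_sum = sum((Decimal(i["amount"]) for i in items), Decimal("0"))
        # Assumption: the last `total` component is the grand total.
        if totals and line_sum != Decimal(totals[-1]["amount"]):
            errors.append(
                f"line items sum to {line_sum}, total says {totals[-1]['amount']}"
            )
    return errors


components = [
    {"type": "line_item", "content": {"description": "Widget", "amount": "40.00"}},
    {"type": "line_item", "content": {"description": "Gadget", "amount": "60.00"}},
    {"type": "total", "content": {"label": "Total", "amount": "100.00", "currency": "USD"}},
]
errors = validate_profile(components, PURCHASE_ORDER_PROFILE)
```

`Decimal` is used instead of `float` so that currency sums compare exactly.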

3. Key Architecture Decisions

#	Decision	Rationale	Alternatives Rejected
1	JSON as canonical format	Human-readable, universally parseable, schema-validatable, Git-diffable, embeddable in databases	XML/TEI (verbose, harder tooling), Protobuf (binary, not human-readable), custom binary (not inspectable)
2	Flat component list with relationships	Simple to iterate, filter, transform; relationships via IDs not nesting	Deep nesting (hard to flatten for search), graph database (infrastructure overhead), pure tree (can't represent cross-references)
3	Provenance on every component	Enables trust scoring, debugging, audit trail; critical for 100% coverage guarantee	Document-level provenance only (can't debug per-component), no provenance (no trust signal)
4	Multi-source extraction, not single-source conversion	Every source has strengths; combining gives best result for every component	Single best source (always has blind spots), manual enrichment (doesn't scale)
5	Document type profiles	Validation rules vary by document type; profiles make the model extensible without code changes	Hardcoded validation (rigid), no validation (no quality guarantee), universal rules (too loose or too strict)
6	Components carry alternatives	When alignment finds multiple versions, all are preserved; enables re-selection without re-extraction	Best-only (loses information), separate alternatives file (complex), re-extract on demand (expensive)
7	Position includes bbox + reading order + section path	Three independent positioning systems allow alignment even when one is missing; bbox for PDF, section path for HTML, reading order for text	Bbox-only (HTML has no bbox), reading-order-only (ambiguous for multi-column), section-path-only (flat documents have no sections)
8	Assets stored externally with references	Images can be large; storing as file paths or URLs keeps the JSON manageable; base64 option for self-contained use	Inline base64 only (bloats JSON), file paths only (not portable), URLs only (requires hosting)

4. Implementation Plan

Phase 1: Schema & Core (Week 1)

Task	Description
Define JSON Schema	Formal JSON Schema 2020-12 in `config/schemas/udom-v1.schema.json`
Component taxonomy	Python enums/dataclasses for all component types
UDOM dataclass	`UDOMDocument` with serialization/deserialization
Document type profiles	Profile definitions for 5 document types

Phase 2: Extractors (Week 2-3)

Extractor	Source	Components Extracted
`extract_from_pdf(pdf_path)`	PDF via pymupdf4llm + pdfplumber + pdfminer	All universal + academic components
`extract_from_html(html/url)`	HTML via BeautifulSoup (ar5iv, web)	Headings, tables, figures, citations, math
`extract_from_latex(tex_path)`	LaTeX via pandoc + pylatexenc	Equations, structure, bibliography, theorems
`extract_from_docx(docx_path)`	DOCX via python-docx	All universal + business components

Each extractor returns list[Component] — raw components with provenance.

Phase 3: Document Mapper (Week 3-4)

Module	Function
`align_components()`	Match components across sources by position + content + type
`score_component()`	Type-specific quality scoring for selection
`select_canonical()`	Pick best version of each component, store alternatives
`validate_coverage()`	Ensure 100% coverage, flag gaps
`build_udom()`	Orchestrate: extract → align → select → validate → UDOM JSON

Phase 4: Assemblers (Week 4-5)

Assembler	Output
`assemble_markdown()`	GitHub-flavored Markdown with proper tables, images, math
`assemble_html()`	Semantic HTML5 with Schema.org microdata
`assemble_text()`	Plain text with structural indicators

Additional assemblers (DOCX, PDF) in Phase 5.

Phase 5: Integration (Week 5-6)

Replace pipeline.py text-level enrichment with UDOM-based extraction → mapping → assembly
Batch processing: udom_batch.py for 218 academic papers
QA validation: compare UDOM-assembled markdown against v1.2 output
Pipeline report enhancement: include UDOM extraction_report

5. File Structure

skills/pdf-to-markdown/
├── src/
│   ├── convert.py                 # Existing v2.5 converter (unchanged)
│   ├── pipeline.py                # Enhanced to use UDOM
│   └── udom/
│       ├── __init__.py
│       ├── schema.py              # UDOM dataclasses + JSON schema
│       ├── taxonomy.py            # Component type enums + profiles
│       ├── extractors/
│       │   ├── __init__.py
│       │   ├── pdf_extractor.py   # PDF → components
│       │   ├── html_extractor.py  # HTML → components (ar5iv, web)
│       │   ├── latex_extractor.py # LaTeX → components
│       │   └── docx_extractor.py  # DOCX → components
│       ├── mapper.py              # Alignment + selection + validation
│       └── assemblers/
│           ├── __init__.py
│           ├── markdown.py        # UDOM → Markdown
│           ├── html.py            # UDOM → HTML
│           └── text.py            # UDOM → Plain text

config/schemas/
└── udom-v1.schema.json            # Formal JSON Schema

6. Quality Guarantees

6.1 Coverage Guarantee

Every component from every source must be accounted for:

Coverage = (canonical + alternatives + flagged_noise) / total_extracted
Target: Coverage = 1.0 (100%)

If coverage < 1.0, the pipeline fails rather than silently dropping content.
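
The coverage formula and the fail-fast rule can be sketched as below. `validate_coverage` is named in the implementation plan, but its signature here, the `coverage` helper, and the `CoverageError` exception are assumptions made for illustration; components are identified by provisional ID strings.

```python
class CoverageError(RuntimeError):
    """Raised when extracted components are unaccounted for."""


def coverage(extracted_ids: set[str], canonical: set[str],
             alternatives: set[str], flagged_noise: set[str]) -> float:
    """(canonical + alternatives + flagged_noise) / total_extracted."""
    accounted = (canonical | alternatives | flagged_noise) & extracted_ids
    return len(accounted) / len(extracted_ids) if extracted_ids else 1.0


def validate_coverage(extracted_ids: set[str], canonical: set[str],
                      alternatives: set[str], flagged_noise: set[str]) -> float:
    """Return the coverage score, or raise instead of silently dropping content."""
    score = coverage(extracted_ids, canonical, alternatives, flagged_noise)
    if score < 1.0:
        missing = sorted(extracted_ids - canonical - alternatives - flagged_noise)
        raise CoverageError(f"coverage {score:.3f}; unaccounted: {missing}")
    return score


extracted = {"pdf_1", "pdf_2", "html_1", "html_2"}
full = validate_coverage(extracted, {"pdf_1", "html_2"}, {"html_1"}, {"pdf_2"})
partial = coverage(extracted, {"pdf_1", "html_2"}, {"html_1"}, set())
```

In the second call `pdf_2` is neither canonical, an alternative, nor flagged, so the score drops to 0.75 and `validate_coverage` would raise.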

6.2 Round-Trip Fidelity

For any document D and any supported format F:

D → UDOM(D) → assemble_F(UDOM(D)) ≈ D_in_format_F

"Approximately equals" means: all textual content preserved, all structural relationships preserved, visual formatting may differ (font, spacing, pagination).

6.3 Idempotent Extraction

Re-extracting from the same source produces identical UDOM JSON (deterministic).
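
One way to test this property is to compare fingerprints of two extraction runs, as in the sketch below. Note that the schema's `extraction_report.timestamp` would differ between runs, so the sketch assumes such volatile fields are excluded from the comparison; that exclusion, along with the names `VOLATILE_KEYS`, `strip_volatile`, and `udom_fingerprint`, is an assumption and not part of the ADR.

```python
import hashlib
import json

VOLATILE_KEYS = {"timestamp"}  # assumption: fields excluded from the comparison


def strip_volatile(value):
    """Recursively drop keys that legitimately differ between runs."""
    if isinstance(value, dict):
        return {k: strip_volatile(v) for k, v in value.items()
                if k not in VOLATILE_KEYS}
    if isinstance(value, list):
        return [strip_volatile(v) for v in value]
    return value


def udom_fingerprint(udom: dict) -> str:
    """SHA-256 over a canonical serialization (sorted keys, fixed separators)."""
    canonical = json.dumps(strip_volatile(udom), sort_keys=True,
                           separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


body = {"children": [{"id": "comp_001", "type": "paragraph"}]}
run_a = {"document_id": "doc-1", "body": body,
         "extraction_report": {"timestamp": "2026-02-09T10:00:00Z"}}
run_b = {"extraction_report": {"timestamp": "2026-02-09T11:30:00Z"},
         "body": body, "document_id": "doc-1"}
run_c = {"document_id": "doc-1",
         "body": {"children": [{"id": "comp_001", "type": "heading"}]}}
```

`run_a` and `run_b` differ only in key order and timestamp and therefore share a fingerprint, while `run_c` changes a component type and does not.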

6.4 Monotonic Enrichment

Adding a new source can only improve quality, never degrade it:

quality(UDOM(source_A)) ≤ quality(UDOM(source_A, source_B))

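This is guaranteed by the selection algorithm: a version from a new source becomes canonical only if it scores higher than the current canonical version; otherwise it is stored as an alternative and the canonical version is left unchanged.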
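
The per-component argument can be illustrated with a short Python sketch. `select_canonical` is named in the implementation plan, but the signature and the `score` field used here are assumptions; the sketch also assumes that scores from different sources are comparable on one scale.

```python
def select_canonical(versions: list[dict]) -> dict:
    """Pick the highest-scoring version; keep the rest as alternatives."""
    # max() returns the first maximal item, so ties keep the earlier source.
    best = max(versions, key=lambda v: v["score"])
    alternatives = [v for v in versions if v is not best]
    return {"canonical": best, "alternatives": alternatives}


pdf = {"source": "pdf", "score": 0.72}
html = {"source": "html", "score": 0.91}
latex = {"source": "latex", "score": 0.40}

only_pdf = select_canonical([pdf])
with_better = select_canonical([pdf, html])   # html replaces pdf as canonical
with_worse = select_canonical([pdf, latex])   # pdf stays canonical
```

Since the canonical score is the maximum over all versions, adding a version can raise it or leave it unchanged but never lower it.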

7. Migration Path

From Pipeline v1.2

Pipeline v1.2 continues to work unchanged (--skip-enrichment)
Pipeline v1.3 uses UDOM internally but outputs identical Markdown
--udom-json flag outputs the canonical UDOM JSON alongside Markdown
--udom-only flag outputs only UDOM JSON (no Markdown assembly)
Batch reprocessing: all 218 papers through UDOM pipeline, compare grades

Incremental Adoption

Phase 1: Academic papers only (PDF + HTML + LaTeX extractors)
Phase 2: Add DOCX extractor → contracts, proposals
Phase 3: Add business document profiles → invoices, POs
Phase 4: Add additional assemblers → DOCX, PDF output

8. Consequences

Positive

Format independence: Content captured once, rendered anywhere
100% coverage: The coverage check and per-component provenance ensure no extracted component is silently dropped
Multi-source quality: Each component is taken from whichever source scores highest for it
Extensible: New document types via profiles, new formats via extractors/assemblers
Agentic-ready: JSON components are individually embeddable, retrievable, referenceable
Auditable: Provenance chain from source → extraction → alignment → selection → assembly
Decomposable: Any component can be extracted, transformed, or replaced independently

Negative

Complexity increase: Three-layer architecture (extract → map → assemble) vs single-pass conversion
Processing time: Multi-source extraction is slower than single-source (mitigated by caching)
Storage: UDOM JSON + alternatives + assets is larger than Markdown output alone
Schema evolution: Adding new component types requires schema versioning

Risks

Risk	Mitigation
Schema becomes too complex	Start minimal, extend via document type profiles
Alignment algorithm produces false matches	Conservative matching (require ≥2 of 4 match criteria)
Extraction quality varies by source	Confidence scores + fallback to primary source
Performance on large documents (100+ pages)	Streaming extraction, page-level parallelism

9. References

S2ORC (Allen AI) — Structured JSON for 81M papers: body_text with cite_spans/ref_spans
Docling (IBM Research) — DoclingDocument schema with typed content items (TextItem, TableItem, PictureItem, FormulaItem)
PaperMage (Allen AI) — Character-span entity layers over unified symbols string
GROBID — TEI XML with 68 fine-grained extraction labels
OmniDocBench (CVPR 2025) — Multi-format verification benchmark
Uni-Parser — Cross-page merging with JSON/Markdown output
Schema.org — ScholarlyArticle, Invoice, Order vocabularies

ADR-164 | Track: F (Documentation) | Task: F.0 | Author: Claude (Opus 4.6) | Created: 2026-02-09
