UDOM Pipeline — Glossary
Version: 1.0 | Date: 2026-02-09
| Term | Definition | CODITECT Equivalent | Ecosystem Analogs |
|---|---|---|---|
| UDOM | Universal Document Object Model: a canonical representation of scientific papers built from 25 typed components and assembled from multiple extraction sources. | CODITECT's structured knowledge format for agent consumption. | JSON-LD (structured data), DOM (web documents), AST (code) |
| 3-Source Extraction | Pipeline architecture that extracts from PDF (Docling), HTML (ar5iv), and LaTeX source (pandoc) independently, then fuses results. | Orchestrator-Workers pattern with 3 specialized workers. | Multi-view learning, ensemble methods, data fusion |
| Docling | IBM's open-source document understanding engine. Converts PDFs to structured output at 62× the speed of pymupdf4llm. Primary source for document structure. | UDOM extraction worker (data plane). | pymupdf4llm, Marker (Datalab), LlamaParse, Unstructured |
| ar5iv | Community service rendering arXiv papers as LaTeXML HTML. Provides high-fidelity math (via alttext attributes) and structured tables. | UDOM extraction worker (data plane). | LaTeXML, MathJax server-side rendering |
| Confidence Score | Per-component float (0.0–1.0) indicating extraction fidelity from a given source. Used by fusion engine for source selection. | Model routing confidence in Task Classifier. | Softmax probability, prediction confidence |
| Fusion Engine | Deterministic component merger that selects the highest-confidence version of each component across 3 sources. | Aggregator in parallelization pattern. | Ensemble voting, feature fusion, late fusion |
| Quality Score | 9-dimension weighted evaluation of UDOM document fidelity: structure, tables, math, citations, images, content density, LaTeX residual, heading hierarchy, bibliography. | Evaluator in evaluator-optimizer pattern. | Code coverage metrics, BLEU/ROUGE scores |
| Grade A/B/C | Quality classification derived from the Quality Score. A: score ≥ 0.85 (production); B: 0.70 ≤ score < 0.85 (retry); C: score < 0.70 (human review). | Checkpoint gate outcomes (approve/retry/escalate). | Pass/fail thresholds, SLA tiers |
| Typed Component | A UDOM element with explicit semantic type (heading, equation, table, etc.) enabling structured agent tool-use. | Agent tool parameter types. | HTML semantic elements, schema.org types |
| Component Type | One of 25 enumerated types: heading, paragraph, equation, figure, table, citation, code, list, abstract, bibliography, caption, footnote, algorithm, theorem, proof, definition, example, remark, appendix, acknowledgment, author_info, metadata, reference, supplementary, unknown. | CODITECT document model types. | Markdown AST node types, DITA topic types |
| LaTeX Residual | Stray LaTeX commands (e.g., \begin{environment}, raw macros) remaining in extracted text. A quality dimension scored inversely — lower residual = higher score. | Compliance violation detection (unexpected content in output). | Code smell detection, lint errors |
| KaTeX | Fast math rendering library used in UDOM Navigator. Renders LaTeX equations to HTML without MathJax overhead. | UDOM Navigator rendering engine. | MathJax, MathML, LaTeXML |
| Custom Macros | Paper-specific LaTeX \newcommand definitions (e.g., \hatg → \hat{g}) that must be expanded or registered for rendering. | Tenant-specific configuration extensions. | Babel plugins, preprocessor macros |
| UDOM Navigator | Static HTML/JS viewer (viewer.html) for browsing UDOM batch results with KaTeX math, theme system, and quality dashboards. | CODITECT IDE Shell (Theia) widget. | Jupyter Notebook viewer, Overleaf preview |
| Batch Run | Processing of multiple papers as a unit with shared configuration, quality aggregation, and audit trail. Stored in run-* directories. | CODITECT batch job execution. | CI/CD pipeline runs, ETL batch jobs |
| Corpus | Named collection of UDOM documents within a tenant, organized by research domain or project. | CODITECT project/workspace. | Database schema, document collection, knowledge base |
| Knowledge Moat | Competitive strategy where cumulative processed UDOM corpora create switching costs — the structured knowledge base becomes tenant-specific IP. | CODITECT platform stickiness mechanism. | Data moat, network effects, ecosystem lock-in |
| Source Provenance | Metadata tracking which extraction source contributed each UDOM component. Required for compliance audit trails. | CODITECT audit event correlation. | Git blame, data lineage, supply chain traceability |
| Graceful Degradation | Pipeline behavior when a source is unavailable (e.g., ar5iv returns HTTP 429 Too Many Requests). The pipeline continues with the remaining sources and adjusts quality expectations accordingly. | Circuit breaker recovery in agent workers. | Fallback patterns, feature flags, progressive enhancement |
| alttext | HTML attribute on `<math>` elements in ar5iv, containing the original LaTeX source of an equation. Primary extraction target for inline math. | Component metadata field. | alt text (images), ARIA labels |
| GIN Index | PostgreSQL Generalized Inverted Index used for efficient JSONB and full-text queries on UDOM component stores. | State Store query optimization. | Elasticsearch inverted index, MongoDB text index |
| RLS | PostgreSQL Row-Level Security: database-enforced tenant isolation on UDOM tables. | CODITECT multi-tenancy enforcement. | Oracle VPD, Citus tenant-based sharding, application-level WHERE clauses |
Glossary covers: pipeline terminology, quality framework, architecture patterns, and ecosystem analogs.
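
The Fusion Engine, Quality Score, and Grade A/B/C entries describe a small decision procedure, which can be sketched in Python. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the names `Candidate`, `fuse`, `quality_score`, `grade`, and `SOURCE_PRIORITY` are hypothetical, the tie-break order between sources is assumed, and the nine dimensions are weighted equally because the glossary does not state the real weights.

```python
from __future__ import annotations

from dataclasses import dataclass

# Assumed tie-break order when two sources report equal confidence.
SOURCE_PRIORITY = ("docling", "ar5iv", "pandoc")

# The nine quality dimensions named in the glossary.
DIMENSIONS = (
    "structure", "tables", "math", "citations", "images",
    "content_density", "latex_residual", "heading_hierarchy", "bibliography",
)


@dataclass(frozen=True)
class Candidate:
    """One source's version of a component (hypothetical shape)."""
    source: str          # one of SOURCE_PRIORITY
    component_type: str  # e.g. "equation", "table"
    content: str
    confidence: float    # 0.0-1.0 extraction fidelity


def fuse(slots: dict[str, list[Candidate]]) -> dict[str, Candidate]:
    """Pick the highest-confidence candidate for each component slot.

    Slots with no candidates (for example, every source failed) are skipped,
    which mirrors graceful degradation. Ties are broken by SOURCE_PRIORITY so
    the result is deterministic.
    """
    fused: dict[str, Candidate] = {}
    for slot_id, candidates in slots.items():
        if not candidates:
            continue
        fused[slot_id] = max(
            candidates,
            key=lambda c: (c.confidence, -SOURCE_PRIORITY.index(c.source)),
        )
    return fused


def quality_score(scores: dict[str, float],
                  weights: dict[str, float] | None = None) -> float:
    """Weighted mean over the nine dimensions (equal weights by assumption)."""
    if weights is None:
        weights = {name: 1.0 for name in DIMENSIONS}
    total = sum(weights[name] for name in DIMENSIONS)
    return sum(weights[name] * scores[name] for name in DIMENSIONS) / total


def grade(score: float) -> str:
    """Map a quality score to the glossary's A/B/C thresholds."""
    if score >= 0.85:
        return "A"  # production
    if score >= 0.70:
        return "B"  # retry
    return "C"      # human review
```

In this sketch, calling `fuse` with a slot that holds a Docling candidate at 0.60 and an ar5iv candidate at 0.95 keeps the ar5iv version, and the `source` field of the winner doubles as the Source Provenance record for that component.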