
UDOM Pipeline — Glossary

Version: 1.0 | Date: 2026-02-09


| Term | Definition | CODITECT Equivalent | Ecosystem Analogs |
| --- | --- | --- | --- |
| UDOM | Universal Document Object Model: a canonical representation of scientific papers built from 25 typed components and assembled from multiple extraction sources. | CODITECT's structured knowledge format for agent consumption. | JSON-LD (structured data), DOM (web documents), AST (code) |
| 3-Source Extraction | Pipeline architecture that extracts from PDF (Docling), HTML (ar5iv), and LaTeX source (pandoc) independently, then fuses the results. | Orchestrator-Workers pattern with 3 specialized workers. | Multi-view learning, ensemble methods, data fusion |
| Docling | IBM's open-source document understanding engine. Converts PDFs to structured output at 62× the speed of pymupdf4llm. Primary source for document structure. | UDOM extraction worker (data plane). | pymupdf4llm, Marker (Datalab), LlamaParse, Unstructured |
| ar5iv | Community service that renders arXiv papers as HTML via LaTeXML. Provides high-fidelity math (via `alttext` attributes) and structured tables. | UDOM extraction worker (data plane). | LaTeXML, MathJax server-side rendering |
| Confidence Score | Per-component float (0.0–1.0) indicating extraction fidelity from a given source. Used by the fusion engine for source selection. | Model routing confidence in Task Classifier. | Softmax probability, prediction confidence |
| Fusion Engine | Deterministic component merger that selects the highest-confidence version of each component across the 3 sources. | Aggregator in parallelization pattern. | Ensemble voting, feature fusion, late fusion |
| Quality Score | 9-dimension weighted evaluation of UDOM document fidelity: structure, tables, math, citations, images, content density, LaTeX residual, heading hierarchy, bibliography. | Evaluator in evaluator-optimizer pattern. | Code coverage metrics, BLEU/ROUGE scores |
| Grade A/B/C | Quality classification by quality score. A: score ≥ 0.85 (production); B: 0.70 ≤ score < 0.85 (retry); C: score < 0.70 (human review). | Checkpoint gate outcomes (approve/retry/escalate). | Pass/fail thresholds, SLA tiers |
| Typed Component | A UDOM element with an explicit semantic type (heading, equation, table, etc.), enabling structured agent tool-use. | Agent tool parameter types. | HTML semantic elements, schema.org types |
| Component Type | One of 25 enumerated types: heading, paragraph, equation, figure, table, citation, code, list, abstract, bibliography, caption, footnote, algorithm, theorem, proof, definition, example, remark, appendix, acknowledgment, author_info, metadata, reference, supplementary, unknown. | CODITECT document model types. | Markdown AST node types, DITA topic types |
| LaTeX Residual | Stray LaTeX commands (e.g., `\begin{environment}`, raw macros) remaining in extracted text. A quality dimension scored inversely: lower residual means a higher score. | Compliance violation detection (unexpected content in output). | Code smell detection, lint errors |
| KaTeX | Fast math rendering library used in UDOM Navigator. Renders LaTeX equations to HTML without MathJax overhead. | UDOM Navigator rendering engine. | MathJax, MathML, LaTeXML |
| Custom Macros | Paper-specific LaTeX `\newcommand` definitions (e.g., `\hatg` → `\hat{g}`) that must be expanded or registered for rendering. | Tenant-specific configuration extensions. | Babel plugins, preprocessor macros |
| UDOM Navigator | Static HTML/JS viewer (`viewer.html`) for browsing UDOM batch results, with KaTeX math, a theme system, and quality dashboards. | CODITECT IDE Shell (Theia) widget. | Jupyter Notebook viewer, Overleaf preview |
| Batch Run | Processing of multiple papers as a unit with shared configuration, quality aggregation, and an audit trail. Stored in `run-*` directories. | CODITECT batch job execution. | CI/CD pipeline runs, ETL batch jobs |
| Corpus | Named collection of UDOM documents within a tenant, organized by research domain or project. | CODITECT project/workspace. | Database schema, document collection, knowledge base |
| Knowledge Moat | Competitive strategy in which cumulative processed UDOM corpora create switching costs, because the structured knowledge base becomes tenant-specific IP. | CODITECT platform stickiness mechanism. | Data moat, network effects, ecosystem lock-in |
| Source Provenance | Metadata tracking which extraction source contributed each UDOM component. Required for compliance audit trails. | CODITECT audit event correlation. | Git blame, data lineage, supply chain traceability |
| Graceful Degradation | Pipeline behavior when a source is unavailable (e.g., ar5iv returns HTTP 429). The pipeline continues with the available sources and adjusts quality expectations. | Circuit breaker recovery in agent workers. | Fallback patterns, feature flags, progressive enhancement |
| alttext | HTML attribute on `<math>` elements in ar5iv, containing the original LaTeX source of an equation. Primary extraction target for inline math. | Component metadata field. | alt text (images), ARIA labels |
| GIN Index | PostgreSQL Generalized Inverted Index, used for efficient JSONB and full-text queries on UDOM component stores. | State Store query optimization. | Elasticsearch inverted index, MongoDB text index |
| RLS | PostgreSQL Row-Level Security: database-enforced tenant isolation on UDOM tables. | CODITECT multi-tenancy enforcement. | Oracle VPD, Citus row-level, application-level WHERE |
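
The Fusion Engine, Confidence Score, and Source Provenance entries can be illustrated with a minimal Python sketch. It is not the pipeline's actual implementation: the `Candidate` shape, the `component_id` key, and the tie-break order in `SOURCE_PRIORITY` are all assumptions made for illustration, since the glossary only states that the highest-confidence version of each component is selected deterministically.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One extracted version of a component from one source (hypothetical shape)."""
    component_id: str   # assumed key that is shared across sources
    source: str         # "docling", "ar5iv", or "pandoc"
    confidence: float   # 0.0-1.0 extraction fidelity
    content: str


# Assumed tie-break order; the glossary does not say how ties are resolved.
SOURCE_PRIORITY = ("docling", "ar5iv", "pandoc")


def _rank(cand):
    # Higher confidence wins; on equal confidence, the source listed
    # earlier in SOURCE_PRIORITY wins, which keeps the merge deterministic.
    return (cand.confidence, -SOURCE_PRIORITY.index(cand.source))


def fuse(candidates):
    """Return {component_id: winning Candidate} across all sources."""
    best = {}
    for cand in candidates:
        current = best.get(cand.component_id)
        if current is None or _rank(cand) > _rank(current):
            best[cand.component_id] = cand
    return best


candidates = [
    Candidate("eq-1", "docling", 0.60, "E = mc2"),
    Candidate("eq-1", "ar5iv", 0.95, "E = mc^2"),
    Candidate("tbl-1", "docling", 0.80, "table text"),
]
fused = fuse(candidates)
for component_id, winner in fused.items():
    # The winner keeps its `source` field, which doubles as provenance.
    print(component_id, winner.source, winner.confidence)
```

In this sketch, graceful degradation needs no special case: if a source is unavailable, it simply contributes no candidates, and each component is chosen from whatever remains.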

Glossary covers: pipeline terminology, quality framework, architecture patterns, and ecosystem analogs.
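
As a rough illustration of the quality framework, the following Python sketch combines the nine Quality Score dimensions into one score and maps it to a grade. The grade thresholds come from the glossary; the equal default weights, the snake_case dimension names, and the assumption that every per-dimension input is already normalized to 0.0–1.0 with higher meaning better (so LaTeX residual is passed in after inversion) are illustrative guesses, not the pipeline's documented behavior.

```python
# Dimension names follow the glossary's list; the identifiers are assumed.
DIMENSIONS = (
    "structure", "tables", "math", "citations", "images",
    "content_density", "latex_residual", "heading_hierarchy", "bibliography",
)


def quality_score(scores, weights=None):
    """Weighted mean of the nine per-dimension scores (each 0.0-1.0)."""
    if weights is None:
        # Assumption: equal weights. The real weighting is not given.
        weights = {name: 1.0 for name in DIMENSIONS}
    total_weight = sum(weights[name] for name in DIMENSIONS)
    weighted_sum = sum(scores[name] * weights[name] for name in DIMENSIONS)
    return weighted_sum / total_weight


def grade(score):
    """Apply the glossary thresholds: A >= 0.85, B >= 0.70, otherwise C."""
    if score >= 0.85:
        return "A"  # production
    if score >= 0.70:
        return "B"  # retry
    return "C"      # human review


example = {name: 0.9 for name in DIMENSIONS}
example["math"] = 0.5  # one weak dimension pulls the overall score down
overall = quality_score(example)
print(round(overall, 3), grade(overall))
```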