UDOM Pipeline — Glossary

Version: 1.0 | Date: 2026-02-09

Term	Definition	CODITECT Equivalent	Ecosystem Analogs
UDOM	Universal Document Object Model: a canonical representation of scientific papers built from 25 typed component kinds and assembled from multiple extraction sources.	CODITECT's structured knowledge format for agent consumption.	JSON-LD (structured data), DOM (web documents), AST (code)
3-Source Extraction	Pipeline architecture that extracts from PDF (Docling), HTML (ar5iv), and LaTeX source (pandoc) independently, then fuses results.	Orchestrator-Workers pattern with 3 specialized workers.	Multi-view learning, ensemble methods, data fusion
Docling	IBM's open-source document understanding engine. Converts PDFs to structured output at 62× the speed of pymupdf4llm. Primary source for document structure.	UDOM extraction worker (data plane).	pymupdf4llm, Marker (Datalab), LlamaParse, Unstructured
ar5iv	Community service rendering arXiv papers as LaTeXML HTML. Provides high-fidelity math (via alttext attributes) and structured tables.	UDOM extraction worker (data plane).	LaTeXML, MathJax server-side rendering
Confidence Score	Per-component float (0.0–1.0) indicating extraction fidelity from a given source. Used by fusion engine for source selection.	Model routing confidence in Task Classifier.	Softmax probability, prediction confidence
Fusion Engine	Deterministic component merger that selects the highest-confidence version of each component across 3 sources.	Aggregator in parallelization pattern.	Ensemble voting, feature fusion, late fusion
Quality Score	9-dimension weighted evaluation of UDOM document fidelity: structure, tables, math, citations, images, content density, LaTeX residual, heading hierarchy, bibliography.	Evaluator in evaluator-optimizer pattern.	Code coverage metrics, BLEU/ROUGE scores
Grade A/B/C	Quality classification derived from the Quality Score. A: score ≥ 0.85 (production); B: 0.70 ≤ score < 0.85 (retry); C: score < 0.70 (human review).	Checkpoint gate outcomes (approve/retry/escalate).	Pass/fail thresholds, SLA tiers
Typed Component	A UDOM element with explicit semantic type (heading, equation, table, etc.) enabling structured agent tool-use.	Agent tool parameter types.	HTML semantic elements, schema.org types
Component Type	One of 25 enumerated types: heading, paragraph, equation, figure, table, citation, code, list, abstract, bibliography, caption, footnote, algorithm, theorem, proof, definition, example, remark, appendix, acknowledgment, author_info, metadata, reference, supplementary, unknown.	CODITECT document model types.	Markdown AST node types, DITA topic types
LaTeX Residual	Stray LaTeX commands (e.g., `\begin{environment}`, raw macros) remaining in extracted text. A quality dimension scored inversely: the less residual, the higher the score.	Compliance violation detection (unexpected content in output).	Code smell detection, lint errors
KaTeX	Fast math rendering library used in UDOM Navigator. Renders LaTeX equations to HTML without MathJax overhead.	UDOM Navigator rendering engine.	MathJax, MathML, LaTeXML
Custom Macros	Paper-specific LaTeX `\newcommand` definitions (e.g., `\hatg` expanding to `\hat{g}`) that must be expanded or registered for rendering.	Tenant-specific configuration extensions.	Babel plugins, preprocessor macros
UDOM Navigator	Static HTML/JS viewer (viewer.html) for browsing UDOM batch results with KaTeX math, theme system, and quality dashboards.	CODITECT IDE Shell (Theia) widget.	Jupyter Notebook viewer, Overleaf preview
Batch Run	Processing of multiple papers as a unit with shared configuration, quality aggregation, and audit trail. Stored in `run-*` directories.	CODITECT batch job execution.	CI/CD pipeline runs, ETL batch jobs
Corpus	Named collection of UDOM documents within a tenant, organized by research domain or project.	CODITECT project/workspace.	Database schema, document collection, knowledge base
Knowledge Moat	Competitive strategy where cumulative processed UDOM corpora create switching costs — the structured knowledge base becomes tenant-specific IP.	CODITECT platform stickiness mechanism.	Data moat, network effects, ecosystem lock-in
Source Provenance	Metadata tracking which extraction source contributed each UDOM component. Required for compliance audit trails.	CODITECT audit event correlation.	Git blame, data lineage, supply chain traceability
Graceful Degradation	Pipeline behavior when a source is unavailable (e.g., ar5iv returns HTTP 429 Too Many Requests). The pipeline continues with the remaining sources and adjusts quality expectations.	Circuit breaker recovery in agent workers.	Fallback patterns, feature flags, progressive enhancement
alttext	Attribute on MathML `<math>` elements in ar5iv HTML, containing the original LaTeX source of an equation. Primary extraction target for inline math.	Component metadata field.	alt text (images), ARIA labels
GIN Index	PostgreSQL Generalized Inverted Index used for efficient JSONB and full-text queries on UDOM component stores.	State Store query optimization.	Elasticsearch inverted index, MongoDB text index
RLS	PostgreSQL Row-Level Security — database-enforced tenant isolation on UDOM tables.	CODITECT multi-tenancy enforcement.	Oracle VPD, Citus row-level, application-level WHERE
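
The Fusion Engine, Confidence Score, Source Provenance, and Graceful Degradation entries above describe one selection rule: for each component, keep the version with the highest confidence, remember which source it came from, and carry on if a source is missing. The sketch below illustrates that rule only; it is not the pipeline's actual code. The `Candidate` class, the `component_id` alignment key, the source labels, and the tie-break order in `SOURCE_PRIORITY` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable

# Hypothetical tie-break order, used only when two sources report the same
# confidence. The glossary does not specify how ties are resolved.
SOURCE_PRIORITY = ("docling", "ar5iv", "latex")


@dataclass(frozen=True)
class Candidate:
    """One extracted version of a component (hypothetical shape)."""
    component_id: str   # assumed key that aligns a component across sources
    type: str           # e.g. "equation", "heading", "table"
    content: str
    source: str         # provenance: which extractor produced this version
    confidence: float   # 0.0 to 1.0, as in the Confidence Score entry


def _rank(cand: Candidate) -> tuple:
    # Higher confidence wins; on a tie, the earlier source in
    # SOURCE_PRIORITY wins. Tuples compare left to right, so the result
    # does not depend on input order (deterministic).
    return (cand.confidence, -SOURCE_PRIORITY.index(cand.source))


def fuse(candidates: Iterable[Candidate]) -> Dict[str, Candidate]:
    """Pick the highest-confidence candidate per component_id.

    A source that contributed nothing (for example after an HTTP 429)
    simply has no candidates here, so fusion proceeds with the rest.
    """
    best: Dict[str, Candidate] = {}
    for cand in candidates:
        if not 0.0 <= cand.confidence <= 1.0:
            raise ValueError(f"confidence out of range: {cand.confidence}")
        if cand.source not in SOURCE_PRIORITY:
            raise ValueError(f"unknown source: {cand.source}")
        current = best.get(cand.component_id)
        if current is None or _rank(cand) > _rank(current):
            best[cand.component_id] = cand
    return best


if __name__ == "__main__":
    fused = fuse([
        Candidate("eq-1", "equation", r"\hat g", "docling", 0.62),
        Candidate("eq-1", "equation", r"\hat{g}(x)", "ar5iv", 0.97),
        Candidate("h-1", "heading", "Introduction", "docling", 0.90),
    ])
    for cid, cand in fused.items():
        print(cid, cand.source, cand.confidence)
```

Because the winning `Candidate` keeps its `source` field, the fused result doubles as a provenance record for each component.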

Glossary covers: pipeline terminology, quality framework, architecture patterns, and ecosystem analogs.
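
The Quality Score and Grade A/B/C entries can be read as a small scoring function. The following is a minimal sketch under stated assumptions: the dimension names are taken from the Quality Score entry, the grade thresholds from the Grade A/B/C entry, and the equal default weights are hypothetical, since the glossary does not list the real weights. The function names `quality_score` and `grade` are likewise illustrative.

```python
from typing import Dict, Optional

# The nine dimensions named in the Quality Score entry.
DIMENSIONS = (
    "structure", "tables", "math", "citations", "images",
    "content_density", "latex_residual", "heading_hierarchy", "bibliography",
)


def quality_score(scores: Dict[str, float],
                  weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted mean of per-dimension scores, each in the range 0.0 to 1.0.

    `latex_residual` is expected to be already inverted by the caller
    (less residual gives a higher value). Equal weights are an assumption;
    the real weighting is not given in the glossary.
    """
    if weights is None:
        weights = {dim: 1.0 for dim in DIMENSIONS}
    missing = [dim for dim in DIMENSIONS if dim not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    total_weight = sum(weights[dim] for dim in DIMENSIONS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(scores[dim] * weights[dim] for dim in DIMENSIONS)
    return weighted / total_weight


def grade(score: float) -> str:
    """Map a quality score to a grade using the glossary thresholds."""
    if score >= 0.85:
        return "A"   # production
    if score >= 0.70:
        return "B"   # retry
    return "C"       # human review


if __name__ == "__main__":
    example = {dim: 0.9 for dim in DIMENSIONS}
    example["math"] = 0.5   # one weak dimension pulls the mean down
    s = quality_score(example)
    print(round(s, 3), grade(s))
```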