UDOM Pipeline — CODITECT Impact Analysis

Version: 1.0 | Date: 2026-02-09
Classification: Architecture Impact Assessment
Scope: Integration of UDOM Pipeline into CODITECT Platform


Integration Architecture

Control Plane vs. Data Plane Placement

The UDOM Pipeline spans both planes with clear separation:

Control Plane components:

  • Batch Orchestrator — lives in the Agent Orchestrator container, manages paper queues, dispatches extraction workers, enforces quality gates, and tracks token budgets. This is a natural fit for the existing orchestrator-workers pattern.
  • Quality Gate Evaluator — operates as an evaluator-optimizer loop. The 9-dimension scorer evaluates output; if below threshold (0.85), the pipeline retries with enhanced extraction parameters. This maps directly to CODITECT's existing checkpoint framework.
  • Model Router integration — research synthesis tasks route to Opus; extraction workers (deterministic, no LLM needed) bypass model routing entirely, consuming zero LLM tokens for the extraction phase.
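The Quality Gate Evaluator's retry behavior can be sketched in a few lines; `extract_fn` and `score_fn` below are hypothetical stand-ins for the extraction worker call and the 9-dimension scorer, and the `enhanced` parameter is illustrative:

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    passed: bool
    score: float
    attempts: int


def run_quality_gate(extract_fn, score_fn, threshold=0.85, max_retries=2):
    """Evaluator-optimizer loop: re-run extraction with enhanced
    parameters until the score clears the gate or retries run out."""
    params = {"enhanced": False}
    score = 0.0
    for attempt in range(1, max_retries + 2):
        output = extract_fn(params)
        score = score_fn(output)
        if score >= threshold:
            return GateResult(True, score, attempt)
        params = {"enhanced": True}  # optimizer step: tighten extraction settings
    return GateResult(False, score, attempt)  # falls through to a human checkpoint
```

A failed result maps onto the "quality gate failure" checkpoint described under orchestration.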

Data Plane components:

  • Extraction Workers (Docling, ar5iv, LaTeX) — stateless, horizontally scalable worker containers. Each consumes a paper ID from NATS, produces UDOM components, and publishes results. No LLM dependency — pure document processing.
  • Fusion Engine — deterministic component merger. Takes 3 source outputs, applies confidence-weighted selection rules, produces canonical UDOM document. Runs as a post-processing step after all workers complete.
  • UDOM Store — PostgreSQL JSONB tables storing typed components with full-text search, GIN indexes, and per-tenant isolation. Integrates with existing State Store container.
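A minimal sketch of the Fusion Engine's confidence-weighted selection rule, assuming each source emits candidate components keyed by type and position (the field names here are illustrative, not the actual UDOM schema):

```python
def fuse(candidates):
    """For each (type, index) slot, keep the candidate with the highest
    confidence across the three source outputs (docling, ar5iv, latex)."""
    best = {}
    for c in candidates:
        key = (c["type"], c["index"])
        if key not in best or c["confidence"] > best[key]["confidence"]:
            best[key] = c
    # Return components in document order
    return sorted(best.values(), key=lambda c: c["index"])
```

Because the rule is deterministic, re-running fusion over the same worker outputs always yields the same canonical document.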

Architecture Decision

The UDOM Pipeline does NOT require new containers. It deploys as specialized agent workers within the existing Agent Workers container, using the existing Event Bus (NATS) for coordination and the existing State Store (PostgreSQL) for persistence. This follows Anthropic Principle 1 (Simplicity First) — no new infrastructure beyond what CODITECT already provides.


Multi-Tenancy & Isolation

Isolation Model: Row-Level Security + Tenant-Scoped Corpora

| Layer | Isolation Mechanism | Details |
|---|---|---|
| UDOM Store | PostgreSQL Row-Level Security (RLS) | tenant_id column on all UDOM tables; RLS policies enforce the tenant boundary |
| Paper Corpora | Tenant-scoped collections | Each tenant defines named corpora (e.g., "pharma-ssl-papers", "clinical-trials-2025") |
| Extraction Workers | Shared compute, isolated I/O | Workers are stateless; tenant context passed via NATS message headers |
| Quality Scores | Per-tenant thresholds | Regulated tenants can require higher quality thresholds (e.g., 0.90 for FDA workflows) |
| Audit Trail | Tenant-partitioned audit events | All extraction, fusion, and scoring events tagged with tenant_id |

Shared vs. Tenant-Specific Resources

| Resource | Shared | Tenant-Specific |
|---|---|---|
| Docling extraction engine | ✅ | |
| ar5iv HTTP client | ✅ | |
| LaTeX pandoc converter | ✅ | |
| Fusion rules | ✅ (default) | ✅ (tenant can override confidence weights) |
| Quality thresholds | | ✅ (per-tenant configurable) |
| UDOM component store | | ✅ (RLS-isolated) |
| KaTeX macro dictionaries | ✅ (common) | ✅ (tenant can add domain-specific macros) |
| Audit logs | | ✅ (partitioned) |

Compliance Surface

Auditability Hooks

Every UDOM pipeline operation generates compliance-auditable events:

# Event: paper_extraction_started
- event_type: udom.extraction.started
  tenant_id: tenant-123
  arxiv_id: "2003.05991"
  sources_requested: ["docling", "ar5iv", "latex"]
  quality_threshold: 0.85
  timestamp: "2026-02-09T11:35:24Z"
  correlation_id: "batch-run-20260209-001"

# Event: component_extracted
- event_type: udom.component.extracted
  source: "ar5iv"
  component_type: "equation"
  confidence: 0.95
  content_hash: "sha256:a1b2c3..."  # Content integrity verification

# Event: quality_gate_passed
- event_type: udom.quality.gate_passed
  overall_score: 0.91
  dimension_scores: {structure: 0.93, math: 0.95, tables: 0.88, ...}
  grade: "A"
  threshold: 0.85

# Event: document_published
- event_type: udom.document.published
  component_count: 312
  sources_used: ["docling", "ar5iv", "latex"]
  provenance_chain: ["pdf→docling", "html→ar5iv", "tex→pandoc→latex"]

Policy Injection Points

| Policy | Injection Point | Example |
|---|---|---|
| Data classification | Pre-extraction | Block papers containing PHI identifiers before processing |
| Source provenance | Per-component metadata | Track which source contributed each component for traceability |
| Quality minimum | Quality gate evaluator | Regulated tenants enforce Grade A minimum (0.85+) |
| Retention policy | UDOM Store | Auto-purge components older than retention window per tenant policy |
| Access control | API Gateway | Role-based access to corpora (researcher read-only, admin write) |
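The pre-extraction data-classification hook reduces to a predicate over the paper text. The sketch below is purely illustrative: a single SSN-shaped regex stands in for what would, in practice, be a much broader PHI ruleset:

```python
import re

# Illustrative only: one SSN-shaped pattern standing in for a real
# PHI detection ruleset, which would be far broader in production.
PHI_PATTERNS = [re.compile(r"\b\d{3}-\d{2}-\d{4}\b")]


def blocked_by_classification(text: str) -> bool:
    """Pre-extraction policy hook: True if the paper text matches
    any configured PHI identifier pattern, blocking processing."""
    return any(p.search(text) for p in PHI_PATTERNS)
```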

E-Signature Support

For regulated workflows where UDOM output feeds into validated processes (e.g., FDA submission literature reviews), the pipeline integrates with CODITECT's existing e-signature infrastructure:

  • Quality scoring reports can be electronically signed by compliance officers
  • Batch run reports include immutable hash chains for data integrity (21 CFR Part 11 compliance)
  • Component provenance chains provide the "who, what, when, why" required by Part 11.10(e)
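The immutable hash-chain idea behind the batch run reports can be sketched with stdlib hashing; the event field names below are illustrative:

```python
import hashlib
import json

GENESIS = "0" * 64


def chain_events(events, prev=GENESIS):
    """Link each batch-report event to its predecessor's digest so
    any later tampering breaks verification (data-integrity sketch)."""
    chained = []
    for event in events:
        digest = hashlib.sha256(
            (json.dumps(event, sort_keys=True) + prev).encode()
        ).hexdigest()
        chained.append({**event, "chain_hash": digest})
        prev = digest
    return chained


def verify_chain(chained, prev=GENESIS):
    """Recompute every link; False as soon as one digest disagrees."""
    for event in chained:
        body = {k: v for k, v in event.items() if k != "chain_hash"}
        digest = hashlib.sha256(
            (json.dumps(body, sort_keys=True) + prev).encode()
        ).hexdigest()
        if digest != event["chain_hash"]:
            return False
        prev = digest
    return True
```

Modifying any event, or reordering events, invalidates every subsequent link, which is what makes the chain useful as integrity evidence.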

Validation Documentation

The 9-dimension quality scoring system provides built-in IQ/OQ/PQ evidence:

| Validation Phase | UDOM Evidence |
|---|---|
| IQ (Installation) | Docling version verification, ar5iv connectivity test, pandoc version check |
| OQ (Operational) | 9-dimension scoring against reference corpus (known-good papers with manually verified scores) |
| PQ (Performance) | Batch-level statistics: Grade A rate, processing time, component count distributions |

Observability

Tracing

The UDOM Pipeline integrates with CODITECT's OpenTelemetry stack:

Trace: udom.batch.process
├─ Span: udom.paper.2003.05991
│ ├─ Span: udom.extract.docling (5.2s)
│ ├─ Span: udom.extract.ar5iv (2.1s)
│ ├─ Span: udom.extract.latex (8.4s)
│ ├─ Span: udom.fusion (0.3s)
│ ├─ Span: udom.quality.score (0.1s)
│ └─ Span: udom.store.write (0.05s)
├─ Span: udom.paper.2104.14294
│ └─ ...
└─ Span: udom.batch.report
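In production this nesting rides on the OpenTelemetry SDK; as a stdlib-only illustration of how the span tree above is produced, a toy tracer might look like:

```python
import time
from contextlib import contextmanager


class MiniTracer:
    """Toy stand-in for an OpenTelemetry tracer: records
    (nested span path, duration) pairs as spans close."""

    def __init__(self):
        self.finished = []
        self._stack = []

    @contextmanager
    def span(self, name):
        self._stack.append(name)
        start = time.perf_counter()
        try:
            yield
        finally:
            self.finished.append(
                ("/".join(self._stack), time.perf_counter() - start)
            )
            self._stack.pop()


tracer = MiniTracer()
with tracer.span("udom.batch.process"):
    with tracer.span("udom.paper.2003.05991"):
        with tracer.span("udom.extract.docling"):
            pass  # real work would run here
```

Child spans close before their parents, mirroring the per-paper extract/fusion/score/store nesting shown above.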

Metrics

| Metric | Type | Labels | Purpose |
|---|---|---|---|
| udom_papers_processed_total | Counter | tenant_id, grade | Volume tracking |
| udom_extraction_duration_seconds | Histogram | source, tenant_id | Performance monitoring |
| udom_quality_score | Gauge | dimension, tenant_id | Quality trend analysis |
| udom_component_count | Histogram | type, source | Extraction completeness |
| udom_fusion_confidence | Histogram | component_type | Fusion quality |
| udom_quality_gate_failures_total | Counter | tenant_id | Quality regression detection |
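These map naturally onto Prometheus-style instruments. A stdlib-only stand-in, using metric names from the table above (label handling simplified to tuples):

```python
from collections import defaultdict


class MetricsRegistry:
    """Minimal stand-in for Counter/Histogram instruments,
    keyed by (metric name, label-value tuple)."""

    def __init__(self):
        self.counters = defaultdict(float)
        self.observations = defaultdict(list)

    def inc(self, name, labels=(), value=1.0):
        self.counters[(name, tuple(labels))] += value

    def observe(self, name, labels, value):
        self.observations[(name, tuple(labels))].append(value)


metrics = MetricsRegistry()
metrics.inc("udom_papers_processed_total", ("tenant-123", "A"))
metrics.observe("udom_extraction_duration_seconds", ("ar5iv", "tenant-123"), 2.1)
```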

Logging

Structured JSON logs with correlation IDs linking extraction → fusion → scoring → storage.
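A sketch of such a log emitter using the stdlib `logging` module; the JSON field names are illustrative:

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, carrying the correlation ID
    that links extraction, fusion, scoring, and storage events."""

    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "msg": record.getMessage(),
            "correlation_id": getattr(record, "correlation_id", None),
        })


logger = logging.getLogger("udom")
_handler = logging.StreamHandler(sys.stdout)
_handler.setFormatter(JsonFormatter())
logger.addHandler(_handler)
logger.setLevel(logging.INFO)

# Every pipeline stage logs with the same correlation ID
logger.info("fusion complete", extra={"correlation_id": "batch-run-20260209-001"})
```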

Alerting Rules

| Alert | Condition | Severity |
|---|---|---|
| Quality degradation | Grade A rate < 95% over 1-hour window | Warning |
| Extraction failure spike | >10% failure rate in 15-minute window | Critical |
| ar5iv unavailability | >5 consecutive 429/503 responses | Warning |
| Processing latency | P95 > 60s per paper | Warning |
| Storage capacity | UDOM store > 80% allocated space | Warning |
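The quality-degradation rule, for example, can be evaluated directly from the grades observed in a window; a sketch:

```python
def quality_degraded(window_grades, min_grade_a_rate=0.95):
    """True when the Grade A rate over the window falls below
    the alert threshold (95% in the table above)."""
    if not window_grades:
        return False  # no data: leave alerting to a separate absence rule
    rate = sum(1 for g in window_grades if g == "A") / len(window_grades)
    return rate < min_grade_a_rate
```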

Multi-Agent Orchestration Fit

Agent Tasks Mapping

| CODITECT Agent Role | UDOM Interaction | Pattern Used |
|---|---|---|
| Orchestrator | Dispatches extraction batches, manages quality gates | Orchestrator-Workers |
| Research Agent | Queries UDOM store for literature synthesis | Augmented LLM (search + retrieval) |
| Compliance Agent | Validates UDOM provenance chains, checks data classification | Evaluator-Optimizer |
| Implementer Agent | Builds adapters for new paper sources (publishers) | Prompt Chaining |

Checkpoint Integration

The UDOM Pipeline surfaces mandatory checkpoints at:

  1. Batch configuration — human approves paper list, quality threshold, tenant assignment before processing starts
  2. Quality gate failure — if a paper fails Grade A after max retries, checkpoint triggers for human review (inspect specific dimension failures)
  3. New source adapter — when adding a publisher adapter (e.g., PubMed), architecture checkpoint for adapter design review
  4. Regulatory corpus update — when UDOM output feeds into a validated workflow, checkpoint for compliance officer sign-off

Circuit Breaker Mapping

| Worker | Circuit Breaker Trigger | Recovery |
|---|---|---|
| Docling worker | 3 consecutive OOM errors | Restart with reduced batch size |
| ar5iv worker | 5 consecutive HTTP 429 | Exponential backoff, 60s → 120s → 240s |
| LaTeX worker | 3 consecutive pandoc crashes | Skip LaTeX source, degrade to 2-source mode |
| Fusion engine | Invalid component schema | Reject paper, flag for manual review |
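The consecutive-failure triggers and the ar5iv backoff schedule can be sketched as follows (a simplified model: real breakers also track half-open probing):

```python
class CircuitBreaker:
    """Open after `threshold` consecutive failures; a success resets the count."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.failures = 0

    def record(self, success: bool):
        self.failures = 0 if success else self.failures + 1

    @property
    def open(self) -> bool:
        return self.failures >= self.threshold


def ar5iv_backoff_seconds(consecutive_trips: int) -> int:
    """Doubling schedule from the table above: 60s, 120s, 240s."""
    return 60 * 2 ** consecutive_trips
```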

Advantages — What UDOM Gives CODITECT That Would Be Hard to Build

  1. 3-source fusion is architecturally unique. No existing tool combines Docling + ar5iv + LaTeX source. Building any single extraction engine to match the fidelity of 3-source fusion would require 10–50× the engineering effort.

  2. 25 typed components create a structured knowledge API. Agents don't consume raw text — they query typed components (equations, tables, figures, citations). This enables tool-use patterns that are impossible with unstructured markdown.

  3. 9-dimension quality scoring is a compliance accelerator. The scoring system provides built-in validation evidence (IQ/OQ/PQ) that would otherwise require months of manual test protocol development for regulated environments.

  4. Cumulative knowledge moat. Every paper processed becomes a queryable knowledge asset. Over time, tenant corpora become irreplaceable — the switching cost is the entire processed knowledge base, not just a software subscription.

  5. Zero LLM tokens for extraction. The extraction pipeline is entirely deterministic (no LLM calls). LLM tokens are only consumed when agents synthesize from UDOM output — a 30–50% token efficiency gain over feeding raw PDFs to agents.


Gaps & Risks — Explicit Assessment

Critical Gaps

| Gap | Impact | Effort to Close |
|---|---|---|
| Non-arXiv sources | Pipeline currently optimized for arXiv ecosystem. PubMed, IEEE, Springer, Elsevier, Nature papers require new adapters. | High — each publisher has unique PDF layouts, HTML formats, and access APIs. Estimate 2–3 weeks per publisher adapter. |
| Figure extraction | Docling extracts figure references but not figure content (images). ar5iv provides some images but inconsistently. | Medium — need image extraction + captioning pipeline. Docling v2 has basic image support; needs quality scoring for images. |
| Real-time ingestion | Current pipeline is batch-oriented. No push-based ingestion when new papers are published. | Medium — add arXiv RSS feed listener + NATS topic for new paper events. |
| Cross-paper citation graph | Individual papers are processed independently. No citation graph linking papers within a corpus. | Medium — extract citation references from bibliography components, build adjacency graph in PostgreSQL. |
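Closing the citation-graph gap largely amounts to inverting bibliography references into an adjacency structure before persisting it; a sketch, where the input shape is an assumption:

```python
def build_citation_graph(bibliographies):
    """bibliographies: {paper_id: [cited_paper_id, ...]}, as would be
    extracted from bibliography components. Returns a reverse index:
    cited paper -> set of papers in the corpus that cite it."""
    cited_by = {}
    for paper_id, refs in bibliographies.items():
        for ref in refs:
            cited_by.setdefault(ref, set()).add(paper_id)
    return cited_by
```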

Risks

| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| ar5iv service deprecation | Low | High | ar5iv is community-maintained; monitor status, build fallback to direct LaTeXML rendering |
| Docling breaking changes | Medium | Medium | Version pin, maintain compatibility test suite against reference corpus |
| Publisher IP restrictions | Medium | High | Negotiate API access for enterprise tenants; implement institutional proxy support |
| Quality scoring drift | Low | Medium | Maintain golden reference set of manually scored papers; run regression tests weekly |

Integration Patterns — Concrete Adapter Interfaces

UDOM Store Adapter (Python)

from abc import ABC, abstractmethod


class UDOMStoreAdapter(ABC):
    """Interface for UDOM document storage — swappable backends."""

    @abstractmethod
    async def store_document(self, doc: UDOMDocument, tenant_id: str) -> str:
        """Store UDOM document, return document ID."""
        ...

    @abstractmethod
    async def get_components(
        self, arxiv_id: str, tenant_id: str,
        component_type: str | None = None,
    ) -> list[UDOMComponent]:
        """Retrieve components, optionally filtered by type."""
        ...

    @abstractmethod
    async def semantic_search(
        self, query: str, tenant_id: str,
        corpus: str | None = None, limit: int = 20,
    ) -> list[UDOMComponent]:
        """Semantic search across components in tenant's corpus."""
        ...

    @abstractmethod
    async def get_quality_report(
        self, tenant_id: str, corpus: str | None = None,
    ) -> dict:
        """Aggregate quality statistics for tenant's corpus."""
        ...

Source Adapter (Python) — For New Publishers

class SourceAdapter(ABC):
    """Interface for adding new paper sources (publishers, preprint servers)."""

    @abstractmethod
    async def extract(self, paper_id: str) -> list[UDOMComponent]:
        """Extract UDOM components from this source."""
        ...

    @abstractmethod
    def supports(self, paper_id: str) -> bool:
        """Whether this adapter handles the given paper ID format."""
        ...

    @property
    @abstractmethod
    def source_name(self) -> str:
        """Unique source identifier (e.g., 'pubmed', 'ieee', 'springer')."""
        ...

    @property
    @abstractmethod
    def default_confidence(self) -> float:
        """Baseline confidence for components from this source."""
        ...
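A hypothetical registry shows how the orchestrator might dispatch paper IDs across adapters via `supports()` (the registry itself is not part of the interface above):

```python
class AdapterRegistry:
    """Resolve a paper ID to the first registered adapter that supports it."""

    def __init__(self, adapters):
        self._adapters = list(adapters)

    def resolve(self, paper_id: str):
        for adapter in self._adapters:
            if adapter.supports(paper_id):
                return adapter
        raise LookupError(f"no source adapter for {paper_id!r}")
```

Registration order matters: more specific adapters (e.g., a PubMed ID matcher) should precede catch-all ones.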

Agent Tool Interface (TypeScript)

interface UDOMAgentTools {
  /** Search UDOM corpus for components matching a semantic query. */
  searchComponents(params: {
    query: string;
    tenantId: string;
    corpus?: string;
    componentType?: ComponentType;
    limit?: number;
  }): Promise<UDOMComponent[]>;

  /** Get all components of a specific paper. */
  getPaperComponents(params: {
    arxivId: string;
    tenantId: string;
    componentType?: ComponentType;
  }): Promise<UDOMComponent[]>;

  /** Compare a metric across multiple papers. */
  crossPaperComparison(params: {
    metric: string;
    paperIds: string[];
    tenantId: string;
  }): Promise<ComparisonResult>;

  /** Get quality report for a corpus. */
  getCorpusQuality(params: {
    tenantId: string;
    corpus?: string;
  }): Promise<QualityReport>;
}

Impact assessment covers: architecture placement, multi-tenancy, compliance, observability, agent orchestration, advantages, gaps, and concrete integration interfaces.