UDOM Pipeline — CODITECT Impact Analysis
Version: 1.0 | Date: 2026-02-09
Classification: Architecture Impact Assessment
Scope: Integration of UDOM Pipeline into CODITECT Platform
Integration Architecture
Control Plane vs. Data Plane Placement
The UDOM Pipeline spans both planes with clear separation:
Control Plane components:
- Batch Orchestrator — lives in the Agent Orchestrator container, manages paper queues, dispatches extraction workers, enforces quality gates, and tracks token budgets. This is a natural fit for the existing orchestrator-workers pattern.
- Quality Gate Evaluator — operates as an evaluator-optimizer loop. The 9-dimension scorer evaluates output; if below threshold (0.85), the pipeline retries with enhanced extraction parameters. This maps directly to CODITECT's existing checkpoint framework.
- Model Router integration — research synthesis tasks route to Opus; extraction workers (deterministic, no LLM needed) bypass model routing entirely, consuming zero LLM tokens for the extraction phase.
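The evaluator-optimizer loop of the Quality Gate Evaluator can be sketched in a few lines. This is a minimal illustration, not CODITECT's implementation; the function and parameter names (`run_quality_gate`, `enhanced`) are assumptions.

```python
def run_quality_gate(extract, score, threshold=0.85, max_retries=2):
    """Evaluator-optimizer loop: re-run extraction with enhanced
    parameters until the 9-dimension score clears the threshold.
    All names here are illustrative assumptions."""
    params = {"enhanced": False}
    doc, s = None, 0.0
    for _ in range(max_retries + 1):
        doc = extract(params)
        s = score(doc)
        if s >= threshold:
            return doc, s            # gate passed
        params = {"enhanced": True}  # retry with enhanced extraction
    return None, s                   # gate failed: escalate to human review
```

A gate failure after `max_retries` is what surfaces the human checkpoint described later in this document.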
Data Plane components:
- Extraction Workers (Docling, ar5iv, LaTeX) — stateless, horizontally scalable worker containers. Each consumes a paper ID from NATS, produces UDOM components, and publishes results. No LLM dependency — pure document processing.
- Fusion Engine — deterministic component merger. Takes 3 source outputs, applies confidence-weighted selection rules, produces canonical UDOM document. Runs as a post-processing step after all workers complete.
- UDOM Store — PostgreSQL JSONB tables storing typed components with full-text search, GIN indexes, and per-tenant isolation. Integrates with existing State Store container.
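The Fusion Engine's confidence-weighted selection can be sketched as follows. The per-source weights and field names here are illustrative assumptions, not the pipeline's actual fusion rules.

```python
# Hypothetical per-source confidence weights (assumptions, not shipped defaults).
SOURCE_WEIGHTS = {"ar5iv": 0.95, "latex": 0.90, "docling": 0.80}

def fuse(components: list[dict]) -> list[dict]:
    """For each component key, keep the candidate whose weighted
    confidence is highest across the extraction sources (sketch)."""
    best: dict[str, tuple[float, dict]] = {}
    for c in components:  # c: {"key", "source", "confidence", "content"}
        weighted = c["confidence"] * SOURCE_WEIGHTS.get(c["source"], 0.5)
        key = c["key"]
        if key not in best or weighted > best[key][0]:
            best[key] = (weighted, c)
    return [c for _, c in best.values()]
```

Tenant-overridable weights (see the multi-tenancy section) would simply replace `SOURCE_WEIGHTS` with a per-tenant lookup.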
Architecture Decision
The UDOM Pipeline does NOT require new containers. It deploys as specialized agent workers within the existing Agent Workers container, using the existing Event Bus (NATS) for coordination and the existing State Store (PostgreSQL) for persistence. This follows Anthropic Principle 1 (Simplicity First) — no new infrastructure beyond what CODITECT already provides.
Multi-Tenancy & Isolation
Isolation Model: Row-Level Security + Tenant-Scoped Corpora
| Layer | Isolation Mechanism | Details |
|---|---|---|
| UDOM Store | PostgreSQL Row-Level Security (RLS) | tenant_id column on all UDOM tables, RLS policies enforce tenant boundary |
| Paper Corpora | Tenant-scoped collections | Each tenant defines named corpora (e.g., "pharma-ssl-papers", "clinical-trials-2025") |
| Extraction Workers | Shared compute, isolated I/O | Workers are stateless; tenant context passed via NATS message headers |
| Quality Scores | Per-tenant thresholds | Regulated tenants can require higher quality thresholds (e.g., 0.90 for FDA workflows) |
| Audit Trail | Tenant-partitioned audit events | All extraction, fusion, and scoring events tagged with tenant_id |
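A minimal sketch of what the RLS layer could look like for one UDOM table. The table name, columns, and setting name (`udom_components`, `app.tenant_id`) are assumptions, not the platform's actual schema.

```python
# Hypothetical DDL for a tenant-isolated UDOM component table.
UDOM_COMPONENTS_DDL = """
CREATE TABLE udom_components (
    id             BIGSERIAL PRIMARY KEY,
    tenant_id      TEXT NOT NULL,
    arxiv_id       TEXT NOT NULL,
    component_type TEXT NOT NULL,
    payload        JSONB NOT NULL
);
-- GIN index supports containment queries over typed components.
CREATE INDEX udom_components_payload_gin
    ON udom_components USING GIN (payload);
ALTER TABLE udom_components ENABLE ROW LEVEL SECURITY;
-- Every query sees only rows for the tenant bound to the session.
CREATE POLICY tenant_isolation ON udom_components
    USING (tenant_id = current_setting('app.tenant_id'));
"""

def bind_tenant(cursor, tenant_id: str) -> None:
    """Bind the session to a tenant before any UDOM query (sketch)."""
    cursor.execute("SELECT set_config('app.tenant_id', %s, false)",
                   (tenant_id,))
```

Because the policy reads `current_setting('app.tenant_id')`, forgetting to call `bind_tenant` yields zero rows rather than cross-tenant leakage.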
Shared vs. Tenant-Specific Resources
| Resource | Shared | Tenant-Specific |
|---|---|---|
| Docling extraction engine | ✅ | |
| ar5iv HTTP client | ✅ | |
| LaTeX pandoc converter | ✅ | |
| Fusion rules | ✅ (default) | ✅ (tenant can override confidence weights) |
| Quality thresholds | | ✅ (per-tenant configurable) |
| UDOM component store | ✅ (RLS-isolated) | |
| KaTeX macro dictionaries | ✅ (common) | ✅ (tenant can add domain-specific macros) |
| Audit logs | ✅ (partitioned) | |
Compliance Surface
Auditability Hooks
Every UDOM pipeline operation generates compliance-auditable events:
# Event: paper_extraction_started
- event_type: udom.extraction.started
  tenant_id: tenant-123
  arxiv_id: "2003.05991"
  sources_requested: ["docling", "ar5iv", "latex"]
  quality_threshold: 0.85
  timestamp: "2026-02-09T11:35:24Z"
  correlation_id: "batch-run-20260209-001"

# Event: component_extracted
- event_type: udom.component.extracted
  source: "ar5iv"
  component_type: "equation"
  confidence: 0.95
  content_hash: "sha256:a1b2c3..."  # Content integrity verification

# Event: quality_gate_passed
- event_type: udom.quality.gate_passed
  overall_score: 0.91
  dimension_scores: {structure: 0.93, math: 0.95, tables: 0.88, ...}
  grade: "A"
  threshold: 0.85

# Event: document_published
- event_type: udom.document.published
  component_count: 312
  sources_used: ["docling", "ar5iv", "latex"]
  provenance_chain: ["pdf→docling", "html→ar5iv", "tex→pandoc→latex"]
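A builder for one of these events can be sketched as below. The field names come from the examples above; the helper function itself is an illustration, not pipeline code.

```python
import hashlib
from datetime import datetime, timezone

def component_extracted_event(tenant_id: str, source: str,
                              component_type: str, content: str,
                              confidence: float) -> dict:
    """Build a udom.component.extracted audit event (sketch; field
    names follow the event examples in this document)."""
    return {
        "event_type": "udom.component.extracted",
        "tenant_id": tenant_id,
        "source": source,
        "component_type": component_type,
        "confidence": confidence,
        # Content integrity hash, as in the example event above.
        "content_hash": "sha256:"
                        + hashlib.sha256(content.encode()).hexdigest(),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```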
Policy Injection Points
| Policy | Injection Point | Example |
|---|---|---|
| Data classification | Pre-extraction | Block papers containing PHI identifiers before processing |
| Source provenance | Per-component metadata | Track which source contributed each component for traceability |
| Quality minimum | Quality gate evaluator | Regulated tenants enforce Grade A minimum (0.85+) |
| Retention policy | UDOM Store | Auto-purge components older than retention window per tenant policy |
| Access control | API Gateway | Role-based access to corpora (researcher read-only, admin write) |
E-Signature Support
For regulated workflows where UDOM output feeds into validated processes (e.g., FDA submission literature reviews), the pipeline integrates with CODITECT's existing e-signature infrastructure:
- Quality scoring reports can be electronically signed by compliance officers
- Batch run reports include immutable hash chains for data integrity (21 CFR Part 11 compliance)
- Component provenance chains provide the "who, what, when, why" required by Part 11.10(e)
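The immutable hash chain mentioned above can be illustrated with a minimal sketch; nothing here reflects the platform's actual implementation.

```python
import hashlib
import json

def chain_events(events: list[dict]) -> list[dict]:
    """Link audit events into a tamper-evident hash chain: each event's
    hash covers its payload plus the previous hash, so any mutation
    breaks every later link. Illustrative sketch only."""
    prev = "sha256:" + hashlib.sha256(b"genesis").hexdigest()
    out = []
    for ev in events:
        body = json.dumps(ev, sort_keys=True).encode()
        digest = "sha256:" + hashlib.sha256(prev.encode() + body).hexdigest()
        out.append({**ev, "prev_hash": prev, "event_hash": digest})
        prev = digest
    return out
```

Verifying a batch report then reduces to recomputing the chain and comparing the final hash against the signed value.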
Validation Documentation
The 9-dimension quality scoring system provides built-in IQ/OQ/PQ evidence:
| Validation Phase | UDOM Evidence |
|---|---|
| IQ (Installation) | Docling version verification, ar5iv connectivity test, pandoc version check |
| OQ (Operational) | 9-dimension scoring against reference corpus (known-good papers with manually verified scores) |
| PQ (Performance) | Batch-level statistics: Grade A rate, processing time, component count distributions |
Observability
Tracing
The UDOM Pipeline integrates with CODITECT's OpenTelemetry stack:
Trace: udom.batch.process
├─ Span: udom.paper.2003.05991
│ ├─ Span: udom.extract.docling (5.2s)
│ ├─ Span: udom.extract.ar5iv (2.1s)
│ ├─ Span: udom.extract.latex (8.4s)
│ ├─ Span: udom.fusion (0.3s)
│ ├─ Span: udom.quality.score (0.1s)
│ └─ Span: udom.store.write (0.05s)
├─ Span: udom.paper.2104.14294
│ └─ ...
└─ Span: udom.batch.report
Metrics
| Metric | Type | Labels | Purpose |
|---|---|---|---|
| udom_papers_processed_total | Counter | tenant_id, grade | Volume tracking |
| udom_extraction_duration_seconds | Histogram | source, tenant_id | Performance monitoring |
| udom_quality_score | Gauge | dimension, tenant_id | Quality trend analysis |
| udom_component_count | Histogram | type, source | Extraction completeness |
| udom_fusion_confidence | Histogram | component_type | Fusion quality |
| udom_quality_gate_failures_total | Counter | tenant_id | Quality regression detection |
Logging
Structured JSON logs with correlation IDs linking extraction → fusion → scoring → storage.
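A stdlib-only sketch of such a logger; the exact field set is an assumption, not the platform's log schema.

```python
import json
import logging

class JSONFormatter(logging.Formatter):
    """Render each record as one JSON line carrying the pipeline stage
    and correlation ID (field names are illustrative)."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "stage": getattr(record, "stage", None),
            "correlation_id": getattr(record, "correlation_id", None),
        })
```

Handlers attach it via `handler.setFormatter(JSONFormatter())`, and each stage passes `extra={"stage": ..., "correlation_id": ...}` so one grep on the correlation ID reconstructs the full extraction-to-storage path.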
Alerting Rules
| Alert | Condition | Severity |
|---|---|---|
| Quality degradation | Grade A rate < 95% over 1-hour window | Warning |
| Extraction failure spike | >10% failure rate in 15-minute window | Critical |
| ar5iv unavailability | >5 consecutive 429/503 responses | Warning |
| Processing latency | P95 > 60s per paper | Warning |
| Storage capacity | UDOM store > 80% allocated space | Warning |
Multi-Agent Orchestration Fit
Agent Tasks Mapping
| CODITECT Agent Role | UDOM Interaction | Pattern Used |
|---|---|---|
| Orchestrator | Dispatches extraction batches, manages quality gates | Orchestrator-Workers |
| Research Agent | Queries UDOM store for literature synthesis | Augmented LLM (search + retrieval) |
| Compliance Agent | Validates UDOM provenance chains, checks data classification | Evaluator-Optimizer |
| Implementer Agent | Builds adapters for new paper sources (publishers) | Prompt Chaining |
Checkpoint Integration
The UDOM Pipeline surfaces mandatory checkpoints at:
- Batch configuration — human approves paper list, quality threshold, tenant assignment before processing starts
- Quality gate failure — if a paper fails Grade A after max retries, checkpoint triggers for human review (inspect specific dimension failures)
- New source adapter — when adding a publisher adapter (e.g., PubMed), architecture checkpoint for adapter design review
- Regulatory corpus update — when UDOM output feeds into a validated workflow, checkpoint for compliance officer sign-off
Circuit Breaker Mapping
| Worker | Circuit Breaker Trigger | Recovery |
|---|---|---|
| Docling worker | 3 consecutive OOM errors | Restart with reduced batch size |
| ar5iv worker | 5 consecutive HTTP 429 | Exponential backoff, 60s → 120s → 240s |
| LaTeX worker | 3 consecutive pandoc crashes | Skip LaTeX source, degrade to 2-source mode |
| Fusion engine | Invalid component schema | Reject paper, flag for manual review |
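The consecutive-failure breakers in the table can be modeled with a minimal sketch like the one below; the class is illustrative, not CODITECT's actual breaker.

```python
class ConsecutiveFailureBreaker:
    """Open after N consecutive failures; any success resets the count."""
    def __init__(self, threshold: int, backoff_base: float = 60.0):
        self.threshold = threshold
        self.backoff_base = backoff_base  # e.g. 60s for the ar5iv worker
        self.failures = 0
        self.is_open = False

    def record_success(self) -> None:
        self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.is_open = True

    def backoff_seconds(self, attempt: int) -> float:
        """Exponential backoff: 60s, 120s, 240s for attempts 0..2."""
        return self.backoff_base * (2 ** attempt)
```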
Advantages — What UDOM Gives CODITECT That Would Be Hard to Build
- 3-source fusion is architecturally unique. No existing tool combines Docling + ar5iv + LaTeX source. Building any single extraction engine to match the fidelity of 3-source fusion would require 10–50× the engineering effort.
- 25 typed components create a structured knowledge API. Agents don't consume raw text — they query typed components (equations, tables, figures, citations). This enables tool-use patterns that are impossible with unstructured markdown.
- 9-dimension quality scoring is a compliance accelerator. The scoring system provides built-in validation evidence (IQ/OQ/PQ) that would otherwise require months of manual test protocol development for regulated environments.
- Cumulative knowledge moat. Every paper processed becomes a queryable knowledge asset. Over time, tenant corpora become irreplaceable — the switching cost is the entire processed knowledge base, not just a software subscription.
- Zero LLM tokens for extraction. The extraction pipeline is entirely deterministic (no LLM calls). LLM tokens are only consumed when agents synthesize from UDOM output — a 30–50% token efficiency gain over feeding raw PDFs to agents.
Gaps & Risks — Explicit Assessment
Critical Gaps
| Gap | Impact | Effort to Close |
|---|---|---|
| Non-arXiv sources | Pipeline currently optimized for arXiv ecosystem. PubMed, IEEE, Springer, Elsevier, Nature papers require new adapters. | High — each publisher has unique PDF layouts, HTML formats, and access APIs. Estimate 2–3 weeks per publisher adapter. |
| Figure extraction | Docling extracts figure references but not figure content (images). ar5iv provides some images but inconsistently. | Medium — need image extraction + captioning pipeline. Docling v2 has basic image support; needs quality scoring for images. |
| Real-time ingestion | Current pipeline is batch-oriented. No push-based ingestion when new papers are published. | Medium — add arXiv RSS feed listener + NATS topic for new paper events. |
| Cross-paper citation graph | Individual papers are processed independently. No citation graph linking papers within a corpus. | Medium — extract citation references from bibliography components, build adjacency graph in PostgreSQL. |
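Closing the citation-graph gap could start from a sketch like the one below; the component field names (`arxiv_id`, `cited_ids`) are assumptions about the bibliography component shape.

```python
def citation_edges(papers: list[dict]) -> list[tuple[str, str]]:
    """Build intra-corpus citation edges from bibliography components:
    an edge (A, B) means paper A cites paper B and both are in the
    corpus. Field names are assumptions, not the UDOM schema."""
    in_corpus = {p["arxiv_id"] for p in papers}
    edges = []
    for p in papers:
        for cited in p.get("cited_ids", []):
            if cited in in_corpus and cited != p["arxiv_id"]:
                edges.append((p["arxiv_id"], cited))
    return edges
```

The resulting edge list maps directly onto an adjacency table in PostgreSQL, as the table above suggests.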
Risks
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| ar5iv service deprecation | Low | High | ar5iv is community-maintained; monitor status, build fallback to direct LaTeXML rendering |
| Docling breaking changes | Medium | Medium | Version pin, maintain compatibility test suite against reference corpus |
| Publisher IP restrictions | Medium | High | Negotiate API access for enterprise tenants; implement institutional proxy support |
| Quality scoring drift | Low | Medium | Maintain golden reference set of manually scored papers; run regression tests weekly |
Integration Patterns — Concrete Adapter Interfaces
UDOM Store Adapter (Python)
from abc import ABC, abstractmethod

# UDOMDocument and UDOMComponent are the pipeline's typed models,
# defined elsewhere in the platform.


class UDOMStoreAdapter(ABC):
    """Interface for UDOM document storage — swappable backends."""

    @abstractmethod
    async def store_document(self, doc: UDOMDocument, tenant_id: str) -> str:
        """Store UDOM document, return document ID."""
        ...

    @abstractmethod
    async def get_components(
        self, arxiv_id: str, tenant_id: str,
        component_type: str | None = None,
    ) -> list[UDOMComponent]:
        """Retrieve components, optionally filtered by type."""
        ...

    @abstractmethod
    async def semantic_search(
        self, query: str, tenant_id: str,
        corpus: str | None = None, limit: int = 20,
    ) -> list[UDOMComponent]:
        """Semantic search across components in tenant's corpus."""
        ...

    @abstractmethod
    async def get_quality_report(
        self, tenant_id: str, corpus: str | None = None,
    ) -> dict:
        """Aggregate quality statistics for tenant's corpus."""
        ...
Source Adapter (Python) — For New Publishers
class SourceAdapter(ABC):
    """Interface for adding new paper sources (publishers, preprint servers)."""

    @abstractmethod
    async def extract(self, paper_id: str) -> list[UDOMComponent]:
        """Extract UDOM components from this source."""
        ...

    @abstractmethod
    def supports(self, paper_id: str) -> bool:
        """Whether this adapter handles the given paper ID format."""
        ...

    @property
    @abstractmethod
    def source_name(self) -> str:
        """Unique source identifier (e.g., 'pubmed', 'ieee', 'springer')."""
        ...

    @property
    @abstractmethod
    def default_confidence(self) -> float:
        """Baseline confidence for components from this source."""
        ...
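For illustration, a hypothetical concrete adapter is sketched below; it is duck-typed so the snippet stands alone, but in practice it would subclass SourceAdapter. The regex and class are assumptions, not shipped code.

```python
import re

# arXiv's modern identifier format, e.g. "2003.05991" or "2003.05991v2".
ARXIV_ID = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")

class Ar5ivAdapter:
    """Hypothetical adapter showing how supports(), source_name, and
    default_confidence map onto the interface above."""
    source_name = "ar5iv"
    default_confidence = 0.95

    def supports(self, paper_id: str) -> bool:
        return bool(ARXIV_ID.match(paper_id))
```

A registry of such adapters lets the Batch Orchestrator route each paper ID to the first adapter whose `supports()` returns true.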
Agent Tool Interface (TypeScript)
interface UDOMAgentTools {
  /** Search UDOM corpus for components matching a semantic query. */
  searchComponents(params: {
    query: string;
    tenantId: string;
    corpus?: string;
    componentType?: ComponentType;
    limit?: number;
  }): Promise<UDOMComponent[]>;

  /** Get all components of a specific paper. */
  getPaperComponents(params: {
    arxivId: string;
    tenantId: string;
    componentType?: ComponentType;
  }): Promise<UDOMComponent[]>;

  /** Compare a metric across multiple papers. */
  crossPaperComparison(params: {
    metric: string;
    paperIds: string[];
    tenantId: string;
  }): Promise<ComparisonResult>;

  /** Get quality report for a corpus. */
  getCorpusQuality(params: {
    tenantId: string;
    corpus?: string;
  }): Promise<QualityReport>;
}
This impact assessment covers architecture placement, multi-tenancy, compliance surface, observability, agent orchestration fit, advantages, gaps and risks, and concrete integration interfaces.