UDOM Pipeline — CODITECT Impact Analysis
Version: 1.0 | Date: 2026-02-09
Classification: Architecture Impact Assessment
Scope: Integration of UDOM Pipeline into CODITECT Platform
Integration Architecture
Control Plane vs. Data Plane Placement
The UDOM Pipeline spans both planes with clear separation:
Control Plane components:
- Batch Orchestrator — lives in the Agent Orchestrator container, manages paper queues, dispatches extraction workers, enforces quality gates, and tracks token budgets. This is a natural fit for the existing orchestrator-workers pattern.
- Quality Gate Evaluator — operates as an evaluator-optimizer loop. The 9-dimension scorer evaluates output; if below threshold (0.85), the pipeline retries with enhanced extraction parameters. This maps directly to CODITECT's existing checkpoint framework.
- Model Router integration — research synthesis tasks route to Opus; extraction workers (deterministic, no LLM needed) bypass model routing entirely, consuming zero LLM tokens for the extraction phase.
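The evaluator-optimizer loop of the Quality Gate Evaluator can be sketched in a few lines. This is a minimal illustration, not CODITECT's implementation; the function and parameter names (`run_quality_gate`, `enhanced`) are assumptions.

```python
def run_quality_gate(extract, score, threshold=0.85, max_retries=2):
    """Evaluator-optimizer loop: re-run extraction with enhanced
    parameters until the 9-dimension score clears the threshold.
    All names here are illustrative assumptions."""
    params = {"enhanced": False}
    doc, s = None, 0.0
    for _ in range(max_retries + 1):
        doc = extract(params)
        s = score(doc)
        if s >= threshold:
            return doc, s            # gate passed
        params = {"enhanced": True}  # retry with enhanced extraction
    return None, s                   # gate failed: escalate to human review
```

A gate failure after `max_retries` is what surfaces the human checkpoint described later in this document.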
Data Plane components:
- Extraction Workers (Docling, ar5iv, LaTeX) — stateless, horizontally scalable worker containers. Each consumes a paper ID from NATS, produces UDOM components, and publishes results. No LLM dependency — pure document processing.
- Fusion Engine — deterministic component merger. Takes 3 source outputs, applies confidence-weighted selection rules, produces canonical UDOM document. Runs as a post-processing step after all workers complete.
- UDOM Store — PostgreSQL JSONB tables storing typed components with full-text search, GIN indexes, and per-tenant isolation. Integrates with existing State Store container.
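The Fusion Engine's confidence-weighted selection can be sketched as follows. The per-source weights and field names here are illustrative assumptions, not the pipeline's actual fusion rules.

```python
# Hypothetical per-source confidence weights (assumptions, not shipped defaults).
SOURCE_WEIGHTS = {"ar5iv": 0.95, "latex": 0.90, "docling": 0.80}

def fuse(components: list[dict]) -> list[dict]:
    """For each component key, keep the candidate whose weighted
    confidence is highest across the extraction sources (sketch)."""
    best: dict[str, tuple[float, dict]] = {}
    for c in components:  # c: {"key", "source", "confidence", "content"}
        weighted = c["confidence"] * SOURCE_WEIGHTS.get(c["source"], 0.5)
        key = c["key"]
        if key not in best or weighted > best[key][0]:
            best[key] = (weighted, c)
    return [c for _, c in best.values()]
```

Tenant-overridable weights (see the multi-tenancy section) would simply replace `SOURCE_WEIGHTS` with a per-tenant lookup.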
Architecture Decision
The UDOM Pipeline does NOT require new containers. It deploys as specialized agent workers within the existing Agent Workers container, using the existing Event Bus (NATS) for coordination and the existing State Store (PostgreSQL) for persistence. This follows Anthropic Principle 1 (Simplicity First) — no new infrastructure beyond what CODITECT already provides.
Multi-Tenancy & Isolation
Isolation Model: Row-Level Security + Tenant-Scoped Corpora
| Layer | Isolation Mechanism | Details |
|---|---|---|
| UDOM Store | PostgreSQL Row-Level Security (RLS) | tenant_id column on all UDOM tables, RLS policies enforce tenant boundary |
| Paper Corpora | Tenant-scoped collections | Each tenant defines named corpora (e.g., "pharma-ssl-papers", "clinical-trials-2025") |
| Extraction Workers | Shared compute, isolated I/O | Workers are stateless; tenant context passed via NATS message headers |
| Quality Scores | Per-tenant thresholds | Regulated tenants can require higher quality thresholds (e.g., 0.90 for FDA workflows) |
| Audit Trail | Tenant-partitioned audit events | All extraction, fusion, and scoring events tagged with tenant_id |
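A minimal sketch of what the RLS layer could look like for one UDOM table. The table name, columns, and setting name (`udom_components`, `app.tenant_id`) are assumptions, not the platform's actual schema.

```python
# Hypothetical DDL for a tenant-isolated UDOM component table.
UDOM_COMPONENTS_DDL = """
CREATE TABLE udom_components (
    id             BIGSERIAL PRIMARY KEY,
    tenant_id      TEXT NOT NULL,
    arxiv_id       TEXT NOT NULL,
    component_type TEXT NOT NULL,
    payload        JSONB NOT NULL
);
-- GIN index supports containment queries over typed components.
CREATE INDEX udom_components_payload_gin
    ON udom_components USING GIN (payload);
ALTER TABLE udom_components ENABLE ROW LEVEL SECURITY;
-- Every query sees only rows for the tenant bound to the session.
CREATE POLICY tenant_isolation ON udom_components
    USING (tenant_id = current_setting('app.tenant_id'));
"""

def bind_tenant(cursor, tenant_id: str) -> None:
    """Bind the session to a tenant before any UDOM query (sketch)."""
    cursor.execute("SELECT set_config('app.tenant_id', %s, false)",
                   (tenant_id,))
```

Because the policy reads `current_setting('app.tenant_id')`, forgetting to call `bind_tenant` yields zero rows rather than cross-tenant leakage.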
Shared vs. Tenant-Specific Resources
| Resource | Shared | Tenant-Specific |
|---|---|---|
| Docling extraction engine | ✅ | |
| ar5iv HTTP client | ✅ | |
| LaTeX pandoc converter | ✅ | |
| Fusion rules | ✅ (default) | ✅ (tenant can override confidence weights) |
| Quality thresholds | | ✅ (per-tenant configurable) |
| UDOM component store | ✅ (RLS-isolated) | |
| KaTeX macro dictionaries | ✅ (common) | ✅ (tenant can add domain-specific macros) |
| Audit logs | ✅ (partitioned) | |
Compliance Surface
Auditability Hooks
Every UDOM pipeline operation generates compliance-auditable events:
# Event: paper_extraction_started
- event_type: udom.extraction.started
  tenant_id: tenant-123
  arxiv_id: "2003.05991"
  sources_requested: ["docling", "ar5iv", "latex"]
  quality_threshold: 0.85
  timestamp: "2026-02-09T11:35:24Z"
  correlation_id: "batch-run-20260209-001"

# Event: component_extracted
- event_type: udom.component.extracted
  source: "ar5iv"
  component_type: "equation"
  confidence: 0.95
  content_hash: "sha256:a1b2c3..."  # Content integrity verification

# Event: quality_gate_passed
- event_type: udom.quality.gate_passed
  overall_score: 0.91
  dimension_scores: {structure: 0.93, math: 0.95, tables: 0.88, ...}
  grade: "A"
  threshold: 0.85

# Event: document_published
- event_type: udom.document.published
  component_count: 312
  sources_used: ["docling", "ar5iv", "latex"]
  provenance_chain: ["pdf→docling", "html→ar5iv", "tex→pandoc→latex"]
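A builder for one of these events can be sketched as below. The field names come from the examples above; the helper function itself is an illustration, not pipeline code.

```python
import hashlib
from datetime import datetime, timezone

def component_extracted_event(tenant_id: str, source: str,
                              component_type: str, content: str,
                              confidence: float) -> dict:
    """Build a udom.component.extracted audit event (sketch; field
    names follow the event examples in this document)."""
    return {
        "event_type": "udom.component.extracted",
        "tenant_id": tenant_id,
        "source": source,
        "component_type": component_type,
        "confidence": confidence,
        # Content integrity hash, as in the example event above.
        "content_hash": "sha256:"
                        + hashlib.sha256(content.encode()).hexdigest(),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```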
Policy Injection Points
| Policy | Injection Point | Example |
|---|---|---|
| Data classification | Pre-extraction | Block papers containing PHI identifiers before processing |
| Source provenance | Per-component metadata | Track which source contributed each component for traceability |
| Quality minimum | Quality gate evaluator | Regulated tenants enforce Grade A minimum (0.85+) |
| Retention policy | UDOM Store | Auto-purge components older than retention window per tenant policy |
| Access control | API Gateway | Role-based access to corpora (researcher read-only, admin write) |
E-Signature Support
For regulated workflows where UDOM output feeds into validated processes (e.g., FDA submission literature reviews), the pipeline integrates with CODITECT's existing e-signature infrastructure:
- Quality scoring reports can be electronically signed by compliance officers
- Batch run reports include immutable hash chains for data integrity (21 CFR Part 11 compliance)
- Component provenance chains provide the "who, what, when, why" required by Part 11.10(e)
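The immutable hash chain mentioned above can be illustrated with a minimal sketch; nothing here reflects the platform's actual implementation.

```python
import hashlib
import json

def chain_events(events: list[dict]) -> list[dict]:
    """Link audit events into a tamper-evident hash chain: each event's
    hash covers its payload plus the previous hash, so any mutation
    breaks every later link. Illustrative sketch only."""
    prev = "sha256:" + hashlib.sha256(b"genesis").hexdigest()
    out = []
    for ev in events:
        body = json.dumps(ev, sort_keys=True).encode()
        digest = "sha256:" + hashlib.sha256(prev.encode() + body).hexdigest()
        out.append({**ev, "prev_hash": prev, "event_hash": digest})
        prev = digest
    return out
```

Verifying a batch report then reduces to recomputing the chain and comparing the final hash against the signed value.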
Validation Documentation
The 9-dimension quality scoring system provides built-in IQ/OQ/PQ evidence:
| Validation Phase | UDOM Evidence |
|---|---|
| IQ (Installation) | Docling version verification, ar5iv connectivity test, pandoc version check |
| OQ (Operational) | 9-dimension scoring against reference corpus (known-good papers with manually verified scores) |
| PQ (Performance) | Batch-level statistics: Grade A rate, processing time, component count distributions |
Observability
Tracing
The UDOM Pipeline integrates with CODITECT's OpenTelemetry stack:
Trace: udom.batch.process
├─ Span: udom.paper.2003.05991
│ ├─ Span: udom.extract.docling (5.2s)
│ ├─ Span: udom.extract.ar5iv (2.1s)
│ ├─ Span: udom.extract.latex (8.4s)
│ ├─ Span: udom.fusion (0.3s)
│ ├─ Span: udom.quality.score (0.1s)
│ └─ Span: udom.store.write (0.05s)
├─ Span: udom.paper.2104.14294
│ └─ ...
└─ Span: udom.batch.report
Metrics
| Metric | Type | Labels | Purpose |
|---|---|---|---|
| udom_papers_processed_total | Counter | tenant_id, grade | Volume tracking |
| udom_extraction_duration_seconds | Histogram | source, tenant_id | Performance monitoring |
| udom_quality_score | Gauge | dimension, tenant_id | Quality trend analysis |
| udom_component_count | Histogram | type, source | Extraction completeness |
| udom_fusion_confidence | Histogram | component_type | Fusion quality |
| udom_quality_gate_failures_total | Counter | tenant_id | Quality regression detection |
Logging
Structured JSON logs with correlation IDs linking extraction → fusion → scoring → storage.
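A stdlib-only sketch of such a logger; the exact field set is an assumption, not the platform's log schema.

```python
import json
import logging

class JSONFormatter(logging.Formatter):
    """Render each record as one JSON line carrying the pipeline stage
    and correlation ID (field names are illustrative)."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "stage": getattr(record, "stage", None),
            "correlation_id": getattr(record, "correlation_id", None),
        })
```

Handlers attach it via `handler.setFormatter(JSONFormatter())`, and each stage passes `extra={"stage": ..., "correlation_id": ...}` so one grep on the correlation ID reconstructs the full extraction-to-storage path.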
Alerting Rules
| Alert | Condition | Severity |
|---|---|---|
| Quality degradation | Grade A rate < 95% over 1-hour window | Warning |
| Extraction failure spike | >10% failure rate in 15-minute window | Critical |
| ar5iv unavailability | >5 consecutive 429/503 responses | Warning |
| Processing latency | P95 > 60s per paper | Warning |
| Storage capacity | UDOM store > 80% allocated space | Warning |
Multi-Agent Orchestration Fit
Agent Tasks Mapping
| CODITECT Agent Role | UDOM Interaction | Pattern Used |
|---|---|---|
| Orchestrator | Dispatches extraction batches, manages quality gates | Orchestrator-Workers |
| Research Agent | Queries UDOM store for literature synthesis | Augmented LLM (search + retrieval) |
| Compliance Agent | Validates UDOM provenance chains, checks data classification | Evaluator-Optimizer |
| Implementer Agent | Builds adapters for new paper sources (publishers) | Prompt Chaining |
Checkpoint Integration
The UDOM Pipeline surfaces mandatory checkpoints at:
- Batch configuration — human approves paper list, quality threshold, tenant assignment before processing starts
- Quality gate failure — if a paper fails Grade A after max retries, checkpoint triggers for human review (inspect specific dimension failures)
- New source adapter — when adding a publisher adapter (e.g., PubMed), architecture checkpoint for adapter design review
- Regulatory corpus update — when UDOM output feeds into a validated workflow, checkpoint for compliance officer sign-off
Circuit Breaker Mapping
| Worker | Circuit Breaker Trigger | Recovery |
|---|---|---|
| Docling worker | 3 consecutive OOM errors | Restart with reduced batch size |
| ar5iv worker | 5 consecutive HTTP 429 | Exponential backoff, 60s → 120s → 240s |
| LaTeX worker | 3 consecutive pandoc crashes | Skip LaTeX source, degrade to 2-source mode |
| Fusion engine | Invalid component schema | Reject paper, flag for manual review |
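The consecutive-failure breakers in the table can be modeled with a minimal sketch like the one below; the class is illustrative, not CODITECT's actual breaker.

```python
class ConsecutiveFailureBreaker:
    """Open after N consecutive failures; any success resets the count."""
    def __init__(self, threshold: int, backoff_base: float = 60.0):
        self.threshold = threshold
        self.backoff_base = backoff_base  # e.g. 60s for the ar5iv worker
        self.failures = 0
        self.is_open = False

    def record_success(self) -> None:
        self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.is_open = True

    def backoff_seconds(self, attempt: int) -> float:
        """Exponential backoff: 60s, 120s, 240s for attempts 0..2."""
        return self.backoff_base * (2 ** attempt)
```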
Advantages — What UDOM Gives CODITECT That Would Be Hard to Build
- 3-source fusion is architecturally unique. No existing tool combines Docling + ar5iv + LaTeX source. Building any single extraction engine to match the fidelity of 3-source fusion would require 10–50× the engineering effort.
- 25 typed components create a structured knowledge API. Agents don't consume raw text — they query typed components (equations, tables, figures, citations). This enables tool-use patterns that are impossible with unstructured markdown.
- 9-dimension quality scoring is a compliance accelerator. The scoring system provides built-in validation evidence (IQ/OQ/PQ) that would otherwise require months of manual test protocol development for regulated environments.
- Cumulative knowledge moat. Every paper processed becomes a queryable knowledge asset. Over time, tenant corpora become irreplaceable — the switching cost is the entire processed knowledge base, not just a software subscription.
- Zero LLM tokens for extraction. The extraction pipeline is entirely deterministic (no LLM calls). LLM tokens are only consumed when agents synthesize from UDOM output — a 30–50% token efficiency gain over feeding raw PDFs to agents.
Gaps & Risks — Explicit Assessment
Critical Gaps
| Gap | Impact | Effort to Close |
|---|---|---|
| Non-arXiv sources | Pipeline currently optimized for arXiv ecosystem. PubMed, IEEE, Springer, Elsevier, Nature papers require new adapters. | High — each publisher has unique PDF layouts, HTML formats, and access APIs. Estimate 2–3 weeks per publisher adapter. |
| Figure extraction | Docling extracts figure references but not figure content (images). ar5iv provides some images but inconsistently. | Medium — need image extraction + captioning pipeline. Docling v2 has basic image support; needs quality scoring for images. |
| Real-time ingestion | Current pipeline is batch-oriented. No push-based ingestion when new papers are published. | Medium — add arXiv RSS feed listener + NATS topic for new paper events. |
| Cross-paper citation graph | Individual papers are processed independently. No citation graph linking papers within a corpus. | Medium — extract citation references from bibliography components, build adjacency graph in PostgreSQL. |
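Closing the citation-graph gap could start from a sketch like the one below; the component field names (`arxiv_id`, `cited_ids`) are assumptions about the bibliography component shape.

```python
def citation_edges(papers: list[dict]) -> list[tuple[str, str]]:
    """Build intra-corpus citation edges from bibliography components:
    an edge (A, B) means paper A cites paper B and both are in the
    corpus. Field names are assumptions, not the UDOM schema."""
    in_corpus = {p["arxiv_id"] for p in papers}
    edges = []
    for p in papers:
        for cited in p.get("cited_ids", []):
            if cited in in_corpus and cited != p["arxiv_id"]:
                edges.append((p["arxiv_id"], cited))
    return edges
```

The resulting edge list maps directly onto an adjacency table in PostgreSQL, as the table above suggests.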
Risks
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| ar5iv service deprecation | Low | High | ar5iv is community-maintained; monitor status, build fallback to direct LaTeXML rendering |
| Docling breaking changes | Medium | Medium | Version pin, maintain compatibility test suite against reference corpus |
| Publisher IP restrictions | Medium | High | Negotiate API access for enterprise tenants; implement institutional proxy support |
| Quality scoring drift | Low | Medium | Maintain golden reference set of manually scored papers; run regression tests weekly |
Integration Patterns — Concrete Adapter Interfaces
UDOM Store Adapter (Python)
from abc import ABC, abstractmethod

# UDOMDocument and UDOMComponent are the pipeline's typed models,
# defined elsewhere in the platform.


class UDOMStoreAdapter(ABC):
    """Interface for UDOM document storage — swappable backends."""

    @abstractmethod
    async def store_document(self, doc: UDOMDocument, tenant_id: str) -> str:
        """Store UDOM document, return document ID."""
        ...

    @abstractmethod
    async def get_components(
        self, arxiv_id: str, tenant_id: str,
        component_type: str | None = None,
    ) -> list[UDOMComponent]:
        """Retrieve components, optionally filtered by type."""
        ...

    @abstractmethod
    async def semantic_search(
        self, query: str, tenant_id: str,
        corpus: str | None = None, limit: int = 20,
    ) -> list[UDOMComponent]:
        """Semantic search across components in tenant's corpus."""
        ...

    @abstractmethod
    async def get_quality_report(
        self, tenant_id: str, corpus: str | None = None,
    ) -> dict:
        """Aggregate quality statistics for tenant's corpus."""
        ...
Source Adapter (Python) — For New Publishers
class SourceAdapter(ABC):
    """Interface for adding new paper sources (publishers, preprint servers)."""

    @abstractmethod
    async def extract(self, paper_id: str) -> list[UDOMComponent]:
        """Extract UDOM components from this source."""
        ...

    @abstractmethod
    def supports(self, paper_id: str) -> bool:
        """Whether this adapter handles the given paper ID format."""
        ...

    @property
    @abstractmethod
    def source_name(self) -> str:
        """Unique source identifier (e.g., 'pubmed', 'ieee', 'springer')."""
        ...

    @property
    @abstractmethod
    def default_confidence(self) -> float:
        """Baseline confidence for components from this source."""
        ...
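For illustration, a hypothetical concrete adapter is sketched below; it is duck-typed so the snippet stands alone, but in practice it would subclass SourceAdapter. The regex and class are assumptions, not shipped code.

```python
import re

# arXiv's modern identifier format, e.g. "2003.05991" or "2003.05991v2".
ARXIV_ID = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")

class Ar5ivAdapter:
    """Hypothetical adapter showing how supports(), source_name, and
    default_confidence map onto the interface above."""
    source_name = "ar5iv"
    default_confidence = 0.95

    def supports(self, paper_id: str) -> bool:
        return bool(ARXIV_ID.match(paper_id))
```

A registry of such adapters lets the Batch Orchestrator route each paper ID to the first adapter whose `supports()` returns true.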
Agent Tool Interface (TypeScript)
interface UDOMAgentTools {
  /** Search UDOM corpus for components matching a semantic query. */
  searchComponents(params: {
    query: string;
    tenantId: string;
    corpus?: string;
    componentType?: ComponentType;
    limit?: number;
  }): Promise<UDOMComponent[]>;

  /** Get all components of a specific paper. */
  getPaperComponents(params: {
    arxivId: string;
    tenantId: string;
    componentType?: ComponentType;
  }): Promise<UDOMComponent[]>;

  /** Compare a metric across multiple papers. */
  crossPaperComparison(params: {
    metric: string;
    paperIds: string[];
    tenantId: string;
  }): Promise<ComparisonResult>;

  /** Get quality report for a corpus. */
  getCorpusQuality(params: {
    tenantId: string;
    corpus?: string;
  }): Promise<QualityReport>;
}
This impact assessment covers architecture placement, multi-tenancy, compliance surface, observability, agent orchestration fit, advantages, gaps and risks, and concrete integration interfaces.