C4 Architecture: UDOM Pipeline Integration into CODITECT

Version: 1.0 | Date: 2026-02-09
Classification: Architecture — C4 Model
Scope: UDOM Pipeline as subsystem within CODITECT Platform


C1 — System Context

Narrative

At the system context level, the UDOM Pipeline extends CODITECT's capabilities by adding a new interaction pathway: CODITECT now reaches into the scientific publishing ecosystem (arXiv, ar5iv, future publisher APIs) to autonomously acquire, process, and structure research knowledge for agent consumption. This is a significant expansion of CODITECT's system boundary — from "automating work tasks" to "automating knowledge acquisition."

The key insight at C1 is that the UDOM Pipeline turns CODITECT from a task automation platform into a knowledge automation platform. Agents don't just execute pre-configured workflows; they actively build and query structured knowledge bases derived from the global scientific literature.

For value creation, C1 reveals three stakeholder value streams: researchers gain automated literature processing (time savings), compliance officers gain provenance-tracked evidence chains (regulatory value), and executives gain a defensible competitive position (knowledge moat).

Diagram

Value Stream Analysis (C1)

| Actor | Value Delivered | Metric |
| --- | --- | --- |
| Researcher | Autonomous literature processing | 360–1,440× time savings per paper |
| Compliance Officer | Provenance-tracked quality evidence | 100% audit trail coverage |
| Executive | Defensible competitive position | Cumulative knowledge moat (switching cost = corpus) |
| Research Agent | Structured knowledge tools | 47–90% token reduction vs. raw PDF |

C2 — Container Diagram

Narrative

Zooming into the CODITECT platform boundary, the UDOM Pipeline integrates into 3 existing containers and adds 1 new conceptual grouping:

Modified containers:

  1. Agent Orchestrator — gains UDOM batch management, quality gate logic, and extraction worker dispatch. This is the biggest change: the orchestrator now handles a new task type ("extract_paper") alongside existing code/compliance tasks.
  2. Agent Workers — gain 3 new specialized workers (Docling, ar5iv, LaTeX) plus the Fusion Engine and Quality Scorer. These are stateless, horizontally scalable, and communicate via NATS.
  3. State Store (PostgreSQL) — gains UDOM tables (udom_documents, udom_batch_runs, udom_audit_events, udom_corpora) with RLS policies and GIN indexes.

New consumer-facing container:

  4. UDOM Navigator — a lightweight static HTML/JS viewer for human browsing of batch results, deployed as an nginx container and reading from the UDOM Store read replica.
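The orchestrator change described above amounts to registering one more task type alongside the existing handlers. A minimal sketch of that routing, with hypothetical class and handler names not drawn from the CODITECT codebase:

```python
from dataclasses import dataclass


@dataclass
class Task:
    task_type: str
    payload: dict


class Orchestrator:
    """Hypothetical sketch: routes tasks by type, now including UDOM extraction."""

    def __init__(self):
        # Existing handlers plus the new "extract_paper" handler.
        self._handlers = {
            "run_code_task": self._handle_code_task,
            "extract_paper": self._handle_extract_paper,
        }

    def dispatch(self, task: Task) -> str:
        handler = self._handlers.get(task.task_type)
        if handler is None:
            raise ValueError(f"unknown task type: {task.task_type}")
        return handler(task)

    def _handle_code_task(self, task: Task) -> str:
        # Stand-in for an existing code/compliance task path.
        return f"code:{task.payload['job_id']}"

    def _handle_extract_paper(self, task: Task) -> str:
        # Would hand the paper off to the UDOM Batch Manager queue.
        return f"udom:{task.payload['paper_id']}"
```

Because the new task type plugs into the same dispatch table as existing ones, this is the "follows existing patterns" / low-risk change noted in the impact table below.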

The compliance and observability implications at C2 are significant: every extraction event flows through the existing Compliance Engine (provenance tracking) and Observability Stack (tracing, metrics). No new cross-cutting infrastructure is needed.

Diagram

Container Impact Assessment

| Container | Change Type | Effort | Risk |
| --- | --- | --- | --- |
| Agent Orchestrator | Extend (new task type) | Medium (2 weeks) | Low — follows existing patterns |
| Agent Workers | Add (3 new workers + fusion + scorer) | High (4 weeks) | Medium — new extraction dependencies |
| State Store | Extend (new tables + indexes) | Low (1 week) | Low — standard PostgreSQL |
| Event Bus | Extend (new event types) | Low (< 1 week) | Low — standard NATS topics |
| Compliance Engine | Extend (new provenance rules) | Medium (1 week) | Low — follows existing patterns |
| UDOM Navigator | New container | Medium (2 weeks) | Low — static files, read-only |
| API Gateway | Extend (new routes) | Low (< 1 week) | Low — standard route addition |

C3 — Component Diagram (UDOM Pipeline Internals)

Narrative

Inside the UDOM Pipeline "virtual container" (actually distributed across Agent Workers and Agent Orchestrator), six components handle the complete extraction-to-consumption lifecycle.

The Batch Manager receives extraction requests from the Orchestrator and manages paper queues, concurrency limits, and retry state. The Source Dispatcher routes papers to the appropriate extraction workers via NATS, handling the parallel dispatch pattern. The Fusion Engine is the most architecturally novel component — it implements confidence-weighted selection across the 3 sources, with a configurable weight matrix per tenant. The Quality Evaluator implements the 9-dimension scoring system and controls the quality gate (Grade A pass, retry, or human checkpoint). The UDOM Serializer converts the fused document into both JSONB (for PostgreSQL storage) and markdown (for agent consumption and Navigator display). The Agent API provides the structured query interface that research agents use to search, compare, and traverse the UDOM corpus.
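The quality gate controlled by the Quality Evaluator can be sketched as a pure decision function. This is an illustrative assumption: the source specifies nine scoring dimensions and three outcomes (pass, retry, human checkpoint), but the aggregation method and thresholds here are invented for the sketch.

```python
def quality_gate(dimension_scores: dict[str, float],
                 pass_threshold: float = 0.9,
                 retry_threshold: float = 0.7) -> str:
    """Map 9-dimension scores to one of the three gate outcomes.

    Assumed aggregation: simple mean over dimensions; thresholds are
    placeholders, not values from the CODITECT specification.
    """
    if not dimension_scores:
        raise ValueError("no dimension scores provided")
    overall = sum(dimension_scores.values()) / len(dimension_scores)
    if overall >= pass_threshold:
        return "pass"              # Grade A: serialize and store
    if overall >= retry_threshold:
        return "retry"             # re-dispatch extraction
    return "human_checkpoint"      # route to human review
```

Keeping the gate a pure function of the scores makes each grading decision replayable, which matters for the regression-detection role the table below assigns to it.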

Diagram

Component Interfaces

| Component | Input | Output | Value Contribution |
| --- | --- | --- | --- |
| Batch Manager | Extraction requests | Queued paper jobs | Enables batch-scale processing (218+ papers) |
| Source Dispatcher | Paper IDs | 3 parallel extraction tasks | 3-source architecture enables 100% Grade A |
| Fusion Engine | 3× component lists | Canonical UDOM document | Confidence-weighted selection = higher fidelity than any single source |
| Quality Evaluator | Fused document | Grade + dimension scores | Quality gate = compliance evidence + regression detection |
| UDOM Serializer | Graded document | JSONB + Markdown + audit event | Dual format enables both agent consumption and human review |
| Agent API | Search/compare queries | Typed component results | 47–90% token reduction for agent workflows |

C4 — Code Diagram (Fusion Engine)

Narrative

The Fusion Engine is the most differentiated component — it's where CODITECT's architectural advantage materializes in code. At the code level, four classes implement the fusion logic.

FusionEngine is the entry point. It receives three lists of UDOMComponent objects (one per source) and a FusionConfig (tenant-specific confidence weights). It iterates through the structural backbone (Docling components, ordered by position) and for each component, queries the ComponentMatcher to find corresponding components from ar5iv and LaTeX sources. The ConfidenceSelector then applies the weight matrix to select the highest-confidence version. The FusionAuditLog records every selection decision (which source won, by what margin) for compliance traceability.

The design is intentionally simple and deterministic. There are no ML models, no probabilistic decisions, and no LLM calls. Every fusion decision can be replayed and audited — a hard requirement for FDA 21 CFR Part 11 and SOC2 compliance.
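The ConfidenceSelector's weight-matrix step could look like the following sketch. The Candidate shape, the (component_type, source) keying of the weight matrix, and the tie-breaking rule are all assumptions for illustration; only the confidence-weighted, deterministic selection itself comes from the source.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    source: str        # "docling" | "ar5iv" | "latex"
    confidence: float  # raw extractor confidence in [0, 1]
    content: str


def select_best(component_type: str,
                candidates: list[Candidate],
                weights: dict[tuple[str, str], float]) -> Candidate:
    """Pick the candidate with the highest tenant-weighted confidence.

    Weighted score = raw confidence * weight for (component_type, source).
    Ties break on a fixed source order, so the result is fully deterministic
    and replayable, with no randomness and no model calls.
    """
    order = {"docling": 0, "ar5iv": 1, "latex": 2}
    return max(
        candidates,
        key=lambda c: (
            c.confidence * weights.get((component_type, c.source), 1.0),
            -order.get(c.source, 99),  # deterministic tie-break
        ),
    )
```

A tenant that trusts LaTeX source for equations, for example, might set `weights[("equation", "latex")] = 1.2` so a LaTeX candidate beats an equally confident Docling one.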

Diagram

Implementation Reference

```python
class FusionEngine:
    """
    Deterministic 3-source fusion — auditable, explainable, tenant-configurable.

    Value creation: This is where single-source extraction quality ceilings are broken.
    By selecting the highest-confidence component per position across 3 independent
    extraction engines, fusion achieves fidelity that no single engine can match.

    The confidence-weighted approach is deliberately non-ML: every decision is
    traceable, reproducible, and compliance-auditable.
    """

    def __init__(self, config: FusionConfig):
        self.config = config
        self.matcher = ComponentMatcher(similarity_threshold=0.7)
        self.selector = ConfidenceSelector(config)
        self.audit_log = FusionAuditLog()

    def fuse(
        self,
        docling: list[UDOMComponent],
        ar5iv: list[UDOMComponent],
        latex: list[UDOMComponent],
    ) -> UDOMDocument:
        """
        Fuse 3 sources into canonical UDOM document.

        Strategy:
        1. Use Docling as structural backbone (best heading/paragraph ordering)
        2. For each Docling component, find matching ar5iv and LaTeX components
        3. Select highest-confidence version using tenant-configurable weights
        4. Log every decision for compliance audit trail
        """
        fused_components = []
        all_candidates = {"ar5iv": ar5iv, "latex": latex}

        for position, doc_component in enumerate(docling):
            candidates = [doc_component]

            # Find matching components from other sources
            for source_name, source_components in all_candidates.items():
                matches = self.matcher.find_matches(doc_component, source_components)
                candidates.extend([m.source_component for m in matches])

            # Select best using confidence weights
            winner = self.selector.select_best(
                component_type=doc_component.type.value,
                candidates=candidates,
            )

            # Log decision for audit trail
            self.audit_log.log_decision(
                position=position,
                winner=winner,
                candidates=candidates,
                rationale=self.selector.get_selection_rationale(),
            )

            fused_components.append(winner)

        return UDOMDocument(
            components=fused_components,
            source_stats={
                "docling": len(docling),
                "ar5iv": len(ar5iv),
                "latex": len(latex),
                "fused": len(fused_components),
            },
            audit_trail=self.audit_log.to_audit_event(),
        )
```

Compliance Implications at C4

The Fusion Engine's code-level design directly addresses regulatory requirements:

| Requirement | Implementation |
| --- | --- |
| 21 CFR Part 11 — Audit Trail | FusionAuditLog records every selection decision with timestamp, rationale, and candidate list |
| 21 CFR Part 11 — Data Integrity | Content hashes (CandidateInfo.content_hash) verify component integrity through the fusion pipeline |
| SOC2 — Change Management | FusionConfig is version-controlled; weight changes are tracked as ADRs |
| HIPAA — Access Control | FusionConfig.tenant_id ensures tenant-specific weights are isolated |
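The content-hash integrity check behind CandidateInfo.content_hash could be implemented along these lines; the choice of SHA-256 and the helper names are assumptions, since the source only states that hashes verify component integrity through the pipeline.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable digest of a component's text, recorded at extraction time.

    SHA-256 is an assumed choice; any collision-resistant hash would serve
    the data-integrity requirement equally well.
    """
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def verify_integrity(text: str, recorded_hash: str) -> bool:
    """Re-hash the component after fusion and compare to the recorded value."""
    return content_hash(text) == recorded_hash
```

Recomputing the hash after every pipeline stage turns "data integrity" from an assertion into a checkable invariant that the audit trail can carry.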

This C4 architecture covers all four levels: System Context (C1), Container (C2), Component (C3), and Code (C4), each with a narrative, diagram, and compliance analysis.