C4 Architecture: UDOM Pipeline Integration into CODITECT
Version: 1.0 | Date: 2026-02-09
Classification: Architecture — C4 Model
Scope: UDOM Pipeline as subsystem within CODITECT Platform
C1 — System Context
Narrative
At the system context level, the UDOM Pipeline extends CODITECT's capabilities by adding a new interaction pathway: CODITECT now reaches into the scientific publishing ecosystem (arXiv, ar5iv, future publisher APIs) to autonomously acquire, process, and structure research knowledge for agent consumption. This is a significant expansion of CODITECT's system boundary — from "automating work tasks" to "automating knowledge acquisition."
The key insight at C1 is that the UDOM Pipeline turns CODITECT from a task automation platform into a knowledge automation platform. Agents don't just execute pre-configured workflows; they actively build and query structured knowledge bases derived from the global scientific literature.
For value creation, C1 reveals three stakeholder value streams: researchers gain automated literature processing (time savings), compliance officers gain provenance-tracked evidence chains (regulatory value), and executives gain a defensible competitive position (knowledge moat).
Diagram
Value Stream Analysis (C1)
| Actor | Value Delivered | Metric |
|---|---|---|
| Researcher | Autonomous literature processing | 360–1,440× time savings per paper |
| Compliance Officer | Provenance-tracked quality evidence | 100% audit trail coverage |
| Executive | Defensible competitive position | Cumulative knowledge moat (switching cost = corpus) |
| Research Agent | Structured knowledge tools | 47–90% token reduction vs. raw PDF |
C2 — Container Diagram
Narrative
Zooming into the CODITECT platform boundary, the UDOM Pipeline modifies three existing containers and adds one new consumer-facing container:
Modified containers:
- Agent Orchestrator — gains UDOM batch management, quality gate logic, and extraction worker dispatch. This is the biggest change: the orchestrator now handles a new task type ("extract_paper") alongside existing code/compliance tasks.
- Agent Workers — gain 3 new specialized workers (Docling, ar5iv, LaTeX) plus a Fusion Engine and a Quality Scorer. These workers are stateless and horizontally scalable, and they communicate via NATS.
- State Store (PostgreSQL) — gains UDOM tables (`udom_documents`, `udom_batch_runs`, `udom_audit_events`, `udom_corpora`) with RLS policies and GIN indexes.
New container:
- UDOM Navigator — a lightweight static HTML/JS viewer for human browsing of batch results. Deployed as an nginx container; reads from a UDOM Store read replica.
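To make the new "extract_paper" task type concrete, here is a minimal sketch of the event the Orchestrator might publish to NATS for each paper. The field names and values are illustrative assumptions, not the actual CODITECT schema.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical shape of an "extract_paper" task message; every field
# name here is an assumption for illustration only.
@dataclass
class ExtractPaperTask:
    task_type: str   # "extract_paper", alongside existing code/compliance tasks
    paper_id: str    # e.g. an arXiv identifier
    batch_id: str    # groups papers into a UDOM batch run
    tenant_id: str   # scopes the task to one tenant (the RLS boundary)
    sources: tuple   # extraction workers to dispatch in parallel

task = ExtractPaperTask(
    task_type="extract_paper",
    paper_id="2402.01234",
    batch_id="batch-001",
    tenant_id="tenant-a",
    sources=("docling", "ar5iv", "latex"),
)
payload = json.dumps(asdict(task))  # serialized body of the NATS message
```

Because the workers are stateless, any free Docling/ar5iv/LaTeX worker subscribed to the relevant topic can pick up and process such a message.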
The compliance and observability implications at C2 are significant: every extraction event flows through the existing Compliance Engine (provenance tracking) and Observability Stack (tracing, metrics). No new cross-cutting infrastructure is needed.
Diagram
Container Impact Assessment
| Container | Change Type | Effort | Risk |
|---|---|---|---|
| Agent Orchestrator | Extend (new task type) | Medium (2 weeks) | Low — follows existing patterns |
| Agent Workers | Add (3 new workers + fusion + scorer) | High (4 weeks) | Medium — new extraction dependencies |
| State Store | Extend (new tables + indexes) | Low (1 week) | Low — standard PostgreSQL |
| Event Bus | Extend (new event types) | Low (< 1 week) | Low — standard NATS topics |
| Compliance Engine | Extend (new provenance rules) | Medium (1 week) | Low — follows existing patterns |
| UDOM Navigator | New container | Medium (2 weeks) | Low — static files, read-only |
| API Gateway | Extend (new routes) | Low (< 1 week) | Low — standard route addition |
C3 — Component Diagram (UDOM Pipeline Internals)
Narrative
Inside the UDOM Pipeline "virtual container" (actually distributed across Agent Workers and Agent Orchestrator), six components handle the complete extraction-to-consumption lifecycle.
- Batch Manager — receives extraction requests from the Orchestrator and manages paper queues, concurrency limits, and retry state.
- Source Dispatcher — routes papers to the appropriate extraction workers via NATS, implementing the parallel dispatch pattern.
- Fusion Engine — the most architecturally novel component; it implements confidence-weighted selection across the 3 sources, with a configurable weight matrix per tenant.
- Quality Evaluator — implements the 9-dimension scoring system and controls the quality gate (Grade A pass, retry, or human checkpoint).
- UDOM Serializer — converts the fused document into both JSONB (for PostgreSQL storage) and markdown (for agent consumption and Navigator display).
- Agent API — provides the structured query interface that research agents use to search, compare, and traverse the UDOM corpus.
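The quality-gate routing described for the Quality Evaluator can be sketched as follows. The grade thresholds, retry limit, and dimension-averaging scheme are assumptions for illustration; the document specifies only a 9-dimension score and three outcomes (pass, retry, human checkpoint).

```python
# Illustrative quality-gate sketch; thresholds and grade boundaries
# are assumed values, not the documented scoring rules.
def grade(dimension_scores: dict[str, float]) -> str:
    """Collapse nine 0-1 dimension scores into a letter grade (assumed mapping)."""
    avg = sum(dimension_scores.values()) / len(dimension_scores)
    if avg >= 0.90:
        return "A"
    if avg >= 0.75:
        return "B"
    return "C"

def quality_gate(dimension_scores: dict[str, float],
                 retries: int, max_retries: int = 2) -> str:
    """Route a graded document: pass, re-run extraction, or escalate to a human."""
    if grade(dimension_scores) == "A":
        return "pass"
    if retries < max_retries:
        return "retry"
    return "human_checkpoint"
```

The gate is deterministic for the same scores and retry count, which keeps the decision replayable for compliance evidence.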
Diagram
Component Interfaces
| Component | Input | Output | Value Contribution |
|---|---|---|---|
| Batch Manager | Extraction requests | Queued paper jobs | Enables batch-scale processing (218+ papers) |
| Source Dispatcher | Paper IDs | 3 parallel extraction tasks | 3-source architecture enables 100% Grade A |
| Fusion Engine | 3× component lists | Canonical UDOM document | Confidence-weighted selection = higher fidelity than any single source |
| Quality Evaluator | Fused document | Grade + dimension scores | Quality gate = compliance evidence + regression detection |
| UDOM Serializer | Graded document | JSONB + Markdown + audit event | Dual format enables both agent consumption and human review |
| Agent API | Search/compare queries | Typed component results | 47–90% token reduction for agent workflows |
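The Agent API row above can be illustrated with a small sketch. The `AgentAPI` class, its `search` method, and the `ComponentResult` shape are hypothetical names for this example; the point is that returning typed components rather than whole papers is what drives the 47–90% token reduction.

```python
from dataclasses import dataclass

# Hypothetical typed-component result; field names are assumptions.
@dataclass
class ComponentResult:
    paper_id: str
    component_type: str  # e.g. "equation", "table", "section"
    content: str

class AgentAPI:
    """Sketch of a structured query interface over a UDOM corpus."""
    def __init__(self, corpus: list[ComponentResult]):
        self.corpus = corpus  # stand-in for the UDOM Store

    def search(self, component_type: str, term: str) -> list[ComponentResult]:
        """Return only matching typed components, not whole documents."""
        return [c for c in self.corpus
                if c.component_type == component_type
                and term.lower() in c.content.lower()]

corpus = [
    ComponentResult("2402.01234", "equation", "E = mc^2"),
    ComponentResult("2402.01234", "table", "ablation results"),
]
hits = AgentAPI(corpus).search("equation", "mc^2")
```

An agent asking for "equations mentioning mc^2" receives one small component instead of the full paper text.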
C4 — Code Diagram (Fusion Engine)
Narrative
The Fusion Engine is the most differentiated component — it's where CODITECT's architectural advantage materializes in code. At the code level, four classes implement the fusion logic.
FusionEngine is the entry point. It receives three lists of UDOMComponent objects (one per source) and a FusionConfig (tenant-specific confidence weights). It iterates through the structural backbone (Docling components, ordered by position) and for each component, queries the ComponentMatcher to find corresponding components from ar5iv and LaTeX sources. The ConfidenceSelector then applies the weight matrix to select the highest-confidence version. The FusionAuditLog records every selection decision (which source won, by what margin) for compliance traceability.
The design is intentionally simple and deterministic. There are no ML models, no probabilistic decisions, and no LLM calls. Every fusion decision can be replayed and audited — a hard requirement for FDA 21 CFR Part 11 and SOC2 compliance.
Diagram
Implementation Reference
```python
class FusionEngine:
    """
    Deterministic 3-source fusion — auditable, explainable, tenant-configurable.

    Value creation: this is where single-source extraction quality ceilings are
    broken. By selecting the highest-confidence component per position across 3
    independent extraction engines, fusion achieves fidelity that no single
    engine can match. The confidence-weighted approach is deliberately non-ML:
    every decision is traceable, reproducible, and compliance-auditable.
    """

    def __init__(self, config: FusionConfig):
        self.config = config
        self.matcher = ComponentMatcher(similarity_threshold=0.7)
        self.selector = ConfidenceSelector(config)
        self.audit_log = FusionAuditLog()

    def fuse(
        self,
        docling: list[UDOMComponent],
        ar5iv: list[UDOMComponent],
        latex: list[UDOMComponent],
    ) -> UDOMDocument:
        """
        Fuse 3 sources into a canonical UDOM document.

        Strategy:
        1. Use Docling as the structural backbone (best heading/paragraph ordering)
        2. For each Docling component, find matching ar5iv and LaTeX components
        3. Select the highest-confidence version using tenant-configurable weights
        4. Log every decision for the compliance audit trail
        """
        fused_components = []
        all_candidates = {"ar5iv": ar5iv, "latex": latex}

        for position, doc_component in enumerate(docling):
            candidates = [doc_component]

            # Find matching components from the other sources
            for source_name, source_components in all_candidates.items():
                matches = self.matcher.find_matches(doc_component, source_components)
                candidates.extend([m.source_component for m in matches])

            # Select the best candidate using confidence weights
            winner = self.selector.select_best(
                component_type=doc_component.type.value,
                candidates=candidates,
            )

            # Log the decision for the audit trail
            self.audit_log.log_decision(
                position=position,
                winner=winner,
                candidates=candidates,
                rationale=self.selector.get_selection_rationale(),
            )
            fused_components.append(winner)

        return UDOMDocument(
            components=fused_components,
            source_stats={
                "docling": len(docling),
                "ar5iv": len(ar5iv),
                "latex": len(latex),
                "fused": len(fused_components),
            },
            audit_trail=self.audit_log.to_audit_event(),
        )
```
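The listing above depends on a `ConfidenceSelector`, which is not shown. A minimal, self-contained sketch is below; the `Candidate` shape and the weight-matrix layout (component type → source → weight) are assumptions for illustration, not the actual CODITECT types.

```python
from dataclasses import dataclass

# Hypothetical candidate shape; FusionEngine passes richer UDOMComponent
# objects, but this is enough to show the selection logic.
@dataclass
class Candidate:
    source: str        # "docling", "ar5iv", or "latex"
    confidence: float  # raw extractor confidence, 0-1
    content: str

class ConfidenceSelector:
    """Sketch of tenant-weighted, deterministic candidate selection."""
    def __init__(self, weights: dict[str, dict[str, float]]):
        self.weights = weights  # component_type -> source -> weight (assumed layout)
        self._rationale = ""

    def select_best(self, component_type: str,
                    candidates: list[Candidate]) -> Candidate:
        """Pick the candidate with the highest weighted confidence."""
        def score(c: Candidate) -> float:
            return self.weights.get(component_type, {}).get(c.source, 1.0) * c.confidence
        winner = max(candidates, key=score)
        self._rationale = (
            f"{winner.source} selected for {component_type} "
            f"(weighted score {score(winner):.2f})"
        )
        return winner

    def get_selection_rationale(self) -> str:
        return self._rationale

# A tenant that trusts LaTeX most for equations:
weights = {"equation": {"latex": 1.2, "ar5iv": 1.0, "docling": 0.8}}
selector = ConfidenceSelector(weights)
winner = selector.select_best("equation", [
    Candidate("docling", 0.90, "E = mc2"),
    Candidate("latex", 0.85, "E = mc^2"),
])
```

Here LaTeX wins (1.2 × 0.85 = 1.02) over Docling (0.8 × 0.90 = 0.72) despite its lower raw confidence, which is exactly the per-tenant tuning the weight matrix exists for.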
Compliance Implications at C4
The Fusion Engine's code-level design directly addresses regulatory requirements:
| Requirement | Implementation |
|---|---|
| 21 CFR Part 11 — Audit Trail | FusionAuditLog records every selection decision with timestamp, rationale, and candidate list |
| 21 CFR Part 11 — Data Integrity | Content hashes (CandidateInfo.content_hash) verify component integrity through the fusion pipeline |
| SOC2 — Change Management | FusionConfig is version-controlled; weight changes are tracked as ADRs |
| HIPAA — Access Control | FusionConfig.tenant_id ensures tenant-specific weights are isolated |
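The data-integrity row can be illustrated with a short sketch. Using SHA-256 over the component text is an assumption for this example; the table states only that `CandidateInfo.content_hash` verifies integrity through the pipeline.

```python
import hashlib

def content_hash(text: str) -> str:
    """Hash a component's text content (SHA-256 assumed for illustration)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def verify_integrity(component_text: str, recorded_hash: str) -> bool:
    """Recompute the hash after fusion and compare it to the recorded one."""
    return content_hash(component_text) == recorded_hash
```

Recording the hash before fusion and re-verifying after serialization gives a cheap, deterministic tamper check per component.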
C4 Architecture covers all four levels: System Context (C1), Container Diagram (C2), Component Diagram (C3), and Code Diagram (C4) with narratives, diagrams, and compliance analysis.