C4 Architecture: UDOM Pipeline Integration into CODITECT
Version: 1.0 | Date: 2026-02-09
Classification: Architecture — C4 Model
Scope: UDOM Pipeline as subsystem within CODITECT Platform
C1 — System Context
Narrative
At the system context level, the UDOM Pipeline extends CODITECT's capabilities by adding a new interaction pathway: CODITECT now reaches into the scientific publishing ecosystem (arXiv, ar5iv, future publisher APIs) to autonomously acquire, process, and structure research knowledge for agent consumption. This is a significant expansion of CODITECT's system boundary — from "automating work tasks" to "automating knowledge acquisition."
The key insight at C1 is that the UDOM Pipeline turns CODITECT from a task automation platform into a knowledge automation platform. Agents don't just execute pre-configured workflows; they actively build and query structured knowledge bases derived from the global scientific literature.
For value creation, C1 reveals three stakeholder value streams: researchers gain automated literature processing (time savings), compliance officers gain provenance-tracked evidence chains (regulatory value), and executives gain a defensible competitive position (knowledge moat).
Diagram
Value Stream Analysis (C1)
| Actor | Value Delivered | Metric |
|---|---|---|
| Researcher | Autonomous literature processing | 360–1,440× time savings per paper |
| Compliance Officer | Provenance-tracked quality evidence | 100% audit trail coverage |
| Executive | Defensible competitive position | Cumulative knowledge moat (switching cost = corpus) |
| Research Agent | Structured knowledge tools | 47–90% token reduction vs. raw PDF |
C2 — Container Diagram
Narrative
Zooming into the CODITECT platform boundary, the UDOM Pipeline modifies three existing containers and adds one new consumer-facing container:
Modified containers:
- Agent Orchestrator — gains UDOM batch management, quality gate logic, and extraction worker dispatch. This is the biggest change: the orchestrator now handles a new task type ("extract_paper") alongside existing code/compliance tasks.
- Agent Workers — gain 3 new specialized workers (Docling, ar5iv, LaTeX) plus a Fusion Engine and a Quality Scorer. These workers are stateless and horizontally scalable, and they communicate via NATS.
- State Store (PostgreSQL) — gains UDOM tables (`udom_documents`, `udom_batch_runs`, `udom_audit_events`, `udom_corpora`) with RLS policies and GIN indexes.
New container:
- UDOM Navigator — a lightweight static HTML/JS viewer for human browsing of batch results. Deployed as an nginx container; reads from a UDOM Store read replica.
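To make the new "extract_paper" task type concrete, here is a minimal sketch of the event the Orchestrator might publish to NATS for each paper. The field names and values are illustrative assumptions, not the actual CODITECT schema.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical shape of an "extract_paper" task message; every field
# name here is an assumption for illustration only.
@dataclass
class ExtractPaperTask:
    task_type: str   # "extract_paper", alongside existing code/compliance tasks
    paper_id: str    # e.g. an arXiv identifier
    batch_id: str    # groups papers into a UDOM batch run
    tenant_id: str   # scopes the task to one tenant (the RLS boundary)
    sources: tuple   # extraction workers to dispatch in parallel

task = ExtractPaperTask(
    task_type="extract_paper",
    paper_id="2402.01234",
    batch_id="batch-001",
    tenant_id="tenant-a",
    sources=("docling", "ar5iv", "latex"),
)
payload = json.dumps(asdict(task))  # serialized body of the NATS message
```

Because the workers are stateless, any free Docling/ar5iv/LaTeX worker subscribed to the relevant topic can pick up and process such a message.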
The compliance and observability implications at C2 are significant: every extraction event flows through the existing Compliance Engine (provenance tracking) and Observability Stack (tracing, metrics). No new cross-cutting infrastructure is needed.
Diagram
Container Impact Assessment
| Container | Change Type | Effort | Risk |
|---|---|---|---|
| Agent Orchestrator | Extend (new task type) | Medium (2 weeks) | Low — follows existing patterns |
| Agent Workers | Add (3 new workers + fusion + scorer) | High (4 weeks) | Medium — new extraction dependencies |
| State Store | Extend (new tables + indexes) | Low (1 week) | Low — standard PostgreSQL |
| Event Bus | Extend (new event types) | Low (< 1 week) | Low — standard NATS topics |
| Compliance Engine | Extend (new provenance rules) | Medium (1 week) | Low — follows existing patterns |
| UDOM Navigator | New container | Medium (2 weeks) | Low — static files, read-only |
| API Gateway | Extend (new routes) | Low (< 1 week) | Low — standard route addition |
C3 — Component Diagram (UDOM Pipeline Internals)
Narrative
Inside the UDOM Pipeline "virtual container" (actually distributed across Agent Workers and Agent Orchestrator), six components handle the complete extraction-to-consumption lifecycle.
- Batch Manager — receives extraction requests from the Orchestrator and manages paper queues, concurrency limits, and retry state.
- Source Dispatcher — routes papers to the appropriate extraction workers via NATS, implementing the parallel dispatch pattern.
- Fusion Engine — the most architecturally novel component; it implements confidence-weighted selection across the 3 sources, with a configurable weight matrix per tenant.
- Quality Evaluator — implements the 9-dimension scoring system and controls the quality gate (Grade A pass, retry, or human checkpoint).
- UDOM Serializer — converts the fused document into both JSONB (for PostgreSQL storage) and markdown (for agent consumption and Navigator display).
- Agent API — provides the structured query interface that research agents use to search, compare, and traverse the UDOM corpus.
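The quality-gate routing described for the Quality Evaluator can be sketched as follows. The grade thresholds, retry limit, and dimension-averaging scheme are assumptions for illustration; the document specifies only a 9-dimension score and three outcomes (pass, retry, human checkpoint).

```python
# Illustrative quality-gate sketch; thresholds and grade boundaries
# are assumed values, not the documented scoring rules.
def grade(dimension_scores: dict[str, float]) -> str:
    """Collapse nine 0-1 dimension scores into a letter grade (assumed mapping)."""
    avg = sum(dimension_scores.values()) / len(dimension_scores)
    if avg >= 0.90:
        return "A"
    if avg >= 0.75:
        return "B"
    return "C"

def quality_gate(dimension_scores: dict[str, float],
                 retries: int, max_retries: int = 2) -> str:
    """Route a graded document: pass, re-run extraction, or escalate to a human."""
    if grade(dimension_scores) == "A":
        return "pass"
    if retries < max_retries:
        return "retry"
    return "human_checkpoint"
```

The gate is deterministic for the same scores and retry count, which keeps the decision replayable for compliance evidence.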
Diagram
Component Interfaces
| Component | Input | Output | Value Contribution |
|---|---|---|---|
| Batch Manager | Extraction requests | Queued paper jobs | Enables batch-scale processing (218+ papers) |
| Source Dispatcher | Paper IDs | 3 parallel extraction tasks | 3-source architecture enables 100% Grade A |
| Fusion Engine | 3× component lists | Canonical UDOM document | Confidence-weighted selection = higher fidelity than any single source |
| Quality Evaluator | Fused document | Grade + dimension scores | Quality gate = compliance evidence + regression detection |
| UDOM Serializer | Graded document | JSONB + Markdown + audit event | Dual format enables both agent consumption and human review |
| Agent API | Search/compare queries | Typed component results | 47–90% token reduction for agent workflows |
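The Agent API row above can be illustrated with a small sketch. The `AgentAPI` class, its `search` method, and the `ComponentResult` shape are hypothetical names for this example; the point is that returning typed components rather than whole papers is what drives the 47–90% token reduction.

```python
from dataclasses import dataclass

# Hypothetical typed-component result; field names are assumptions.
@dataclass
class ComponentResult:
    paper_id: str
    component_type: str  # e.g. "equation", "table", "section"
    content: str

class AgentAPI:
    """Sketch of a structured query interface over a UDOM corpus."""
    def __init__(self, corpus: list[ComponentResult]):
        self.corpus = corpus  # stand-in for the UDOM Store

    def search(self, component_type: str, term: str) -> list[ComponentResult]:
        """Return only matching typed components, not whole documents."""
        return [c for c in self.corpus
                if c.component_type == component_type
                and term.lower() in c.content.lower()]

corpus = [
    ComponentResult("2402.01234", "equation", "E = mc^2"),
    ComponentResult("2402.01234", "table", "ablation results"),
]
hits = AgentAPI(corpus).search("equation", "mc^2")
```

An agent asking for "equations mentioning mc^2" receives one small component instead of the full paper text.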
C4 — Code Diagram (Fusion Engine)
Narrative
The Fusion Engine is the most differentiated component — it's where CODITECT's architectural advantage materializes in code. At the code level, four classes implement the fusion logic.
FusionEngine is the entry point. It receives three lists of UDOMComponent objects (one per source) and a FusionConfig (tenant-specific confidence weights). It iterates through the structural backbone (Docling components, ordered by position) and for each component, queries the ComponentMatcher to find corresponding components from ar5iv and LaTeX sources. The ConfidenceSelector then applies the weight matrix to select the highest-confidence version. The FusionAuditLog records every selection decision (which source won, by what margin) for compliance traceability.
The design is intentionally simple and deterministic. There are no ML models, no probabilistic decisions, and no LLM calls. Every fusion decision can be replayed and audited — a hard requirement for FDA 21 CFR Part 11 and SOC2 compliance.
Diagram
Implementation Reference
```python
class FusionEngine:
    """
    Deterministic 3-source fusion — auditable, explainable, tenant-configurable.

    Value creation: this is where single-source extraction quality ceilings are
    broken. By selecting the highest-confidence component per position across 3
    independent extraction engines, fusion achieves fidelity that no single
    engine can match. The confidence-weighted approach is deliberately non-ML:
    every decision is traceable, reproducible, and compliance-auditable.
    """

    def __init__(self, config: FusionConfig):
        self.config = config
        self.matcher = ComponentMatcher(similarity_threshold=0.7)
        self.selector = ConfidenceSelector(config)
        self.audit_log = FusionAuditLog()

    def fuse(
        self,
        docling: list[UDOMComponent],
        ar5iv: list[UDOMComponent],
        latex: list[UDOMComponent],
    ) -> UDOMDocument:
        """
        Fuse 3 sources into a canonical UDOM document.

        Strategy:
        1. Use Docling as the structural backbone (best heading/paragraph ordering)
        2. For each Docling component, find matching ar5iv and LaTeX components
        3. Select the highest-confidence version using tenant-configurable weights
        4. Log every decision for the compliance audit trail
        """
        fused_components = []
        all_candidates = {"ar5iv": ar5iv, "latex": latex}

        for position, doc_component in enumerate(docling):
            candidates = [doc_component]

            # Find matching components from the other sources
            for source_name, source_components in all_candidates.items():
                matches = self.matcher.find_matches(doc_component, source_components)
                candidates.extend([m.source_component for m in matches])

            # Select the best candidate using confidence weights
            winner = self.selector.select_best(
                component_type=doc_component.type.value,
                candidates=candidates,
            )

            # Log the decision for the audit trail
            self.audit_log.log_decision(
                position=position,
                winner=winner,
                candidates=candidates,
                rationale=self.selector.get_selection_rationale(),
            )
            fused_components.append(winner)

        return UDOMDocument(
            components=fused_components,
            source_stats={
                "docling": len(docling),
                "ar5iv": len(ar5iv),
                "latex": len(latex),
                "fused": len(fused_components),
            },
            audit_trail=self.audit_log.to_audit_event(),
        )
```
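The listing above depends on a `ConfidenceSelector`, which is not shown. A minimal, self-contained sketch is below; the `Candidate` shape and the weight-matrix layout (component type → source → weight) are assumptions for illustration, not the actual CODITECT types.

```python
from dataclasses import dataclass

# Hypothetical candidate shape; FusionEngine passes richer UDOMComponent
# objects, but this is enough to show the selection logic.
@dataclass
class Candidate:
    source: str        # "docling", "ar5iv", or "latex"
    confidence: float  # raw extractor confidence, 0-1
    content: str

class ConfidenceSelector:
    """Sketch of tenant-weighted, deterministic candidate selection."""
    def __init__(self, weights: dict[str, dict[str, float]]):
        self.weights = weights  # component_type -> source -> weight (assumed layout)
        self._rationale = ""

    def select_best(self, component_type: str,
                    candidates: list[Candidate]) -> Candidate:
        """Pick the candidate with the highest weighted confidence."""
        def score(c: Candidate) -> float:
            return self.weights.get(component_type, {}).get(c.source, 1.0) * c.confidence
        winner = max(candidates, key=score)
        self._rationale = (
            f"{winner.source} selected for {component_type} "
            f"(weighted score {score(winner):.2f})"
        )
        return winner

    def get_selection_rationale(self) -> str:
        return self._rationale

# A tenant that trusts LaTeX most for equations:
weights = {"equation": {"latex": 1.2, "ar5iv": 1.0, "docling": 0.8}}
selector = ConfidenceSelector(weights)
winner = selector.select_best("equation", [
    Candidate("docling", 0.90, "E = mc2"),
    Candidate("latex", 0.85, "E = mc^2"),
])
```

Here LaTeX wins (1.2 × 0.85 = 1.02) over Docling (0.8 × 0.90 = 0.72) despite its lower raw confidence, which is exactly the per-tenant tuning the weight matrix exists for.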
Compliance Implications at C4
The Fusion Engine's code-level design directly addresses regulatory requirements:
| Requirement | Implementation |
|---|---|
| 21 CFR Part 11 — Audit Trail | FusionAuditLog records every selection decision with timestamp, rationale, and candidate list |
| 21 CFR Part 11 — Data Integrity | Content hashes (CandidateInfo.content_hash) verify component integrity through the fusion pipeline |
| SOC2 — Change Management | FusionConfig is version-controlled; weight changes are tracked as ADRs |
| HIPAA — Access Control | FusionConfig.tenant_id ensures tenant-specific weights are isolated |
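The data-integrity row can be illustrated with a short sketch. Using SHA-256 over the component text is an assumption for this example; the table states only that `CandidateInfo.content_hash` verifies integrity through the pipeline.

```python
import hashlib

def content_hash(text: str) -> str:
    """Hash a component's text content (SHA-256 assumed for illustration)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def verify_integrity(component_text: str, recorded_hash: str) -> bool:
    """Recompute the hash after fusion and compare it to the recorded one."""
    return content_hash(component_text) == recorded_hash
```

Recording the hash before fusion and re-verifying after serialization gives a cheap, deterministic tamper check per component.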
C4 Architecture covers all four levels: System Context (C1), Container Diagram (C2), Component Diagram (C3), and Code Diagram (C4) with narratives, diagrams, and compliance analysis.