ADR-174: Research Continuum — Agentic Knowledge Infrastructure
Document: ADR-174-research-continuum-agentic-knowledge-infrastructure
Version: 1.0.0
Purpose: Establish the architectural foundation for the Research Continuum product — an agentic knowledge infrastructure that transforms static document collections into compounding knowledge assets
Audience: Framework contributors, product architects, engineering leadership
Date Created: 2026-02-11
Status: PROPOSED
Related ADRs:
- ADR-164-universal-document-object-model (extraction layer — UDOM schema and pipeline)
- ADR-165 through ADR-169 (sidecar architecture — browser-native code intelligence)
Related Documents:
- internal/analysis/research-continuum/CODITECT-Research-Continuum-Vision-Document.md
- internal/analysis/research-continuum/CODITECT-Research-Continuum-MoE-Assessment.md
Context and Problem Statement
The Knowledge Production-Consumption Asymmetry
Academic and enterprise knowledge production has reached a scale that exceeds human consumption capacity:
- 2.5 million scientific papers published annually (growing 4-5% per year)
- A dedicated researcher can read approximately 250 papers/year at deep comprehension
- This creates a 10,000:1 production-to-consumption ratio — 99.99% of relevant knowledge is invisible to any individual researcher
- Enterprise organizations face the same asymmetry with internal reports, market analyses, regulatory filings, and competitive intelligence
Current Solutions Are Inadequate
| Approach | What It Does | Why It Fails |
|---|---|---|
| Search engines (Google Scholar) | Finds papers by keyword | No comprehension, no synthesis, no cross-document linking |
| Chatbots (ChatGPT) | Summarizes individual papers | No persistent memory, no compounding knowledge, hallucination risk |
| Reference managers (Zotero, Mendeley) | Organizes citations | No content understanding, no knowledge extraction |
| Literature review tools (Elicit, Consensus) | Searches + summarizes | Single-paper focus, no knowledge graph, no cross-document synthesis |
The gap: No system transforms a collection of documents into a compounding knowledge asset where each new document enriches the understanding of every prior document.
CODITECT's Position
CODITECT has built and validated the extraction layer of a complete agentic knowledge infrastructure:
- UDOM Pipeline v1.7 — Multi-source extraction (Docling PDF + ar5iv HTML + arXiv LaTeX) with 25-type component taxonomy
- Validated at scale — 218/218 papers Grade A (100%), average score 0.898
- Production-grade — 62x faster than the prior extraction approach (after adopting Docling), zero failures in batch processing
- Universal schema — UDOM supports academic papers, business documents, legal filings, patents
This ADR establishes the architectural decisions for building the remaining layers on top of this validated foundation.
Decision
We will architect the Research Continuum as a four-layer system that transforms document collections into compounding knowledge assets through agentic processing.
Architecture: Four-Layer Stack
┌─────────────────────────────────────────────────┐
│ Layer 4: INTERFACE                              │
│ Query, explore, generate from knowledge graph   │
├─────────────────────────────────────────────────┤
│ Layer 3: SYNTHESIS                              │
│ Cross-document reasoning, gap detection,        │
│ contradiction identification, trend analysis    │
├─────────────────────────────────────────────────┤
│ Layer 2: KNOWLEDGE GRAPH                        │
│ Entity extraction, relationship mapping,        │
│ cross-document linking, temporal tracking       │
├─────────────────────────────────────────────────┤
│ Layer 1: EXTRACTION (UDOM) ✅ BUILT             │
│ Multi-source ingestion, typed components,       │
│ provenance tracking, quality grading            │
└─────────────────────────────────────────────────┘
Layer 1: Extraction (UDOM) — Status: BUILT
Decision: Use UDOM Pipeline v1.7 as the extraction foundation.
| Component | Technology | Status |
|---|---|---|
| PDF extraction | Docling v2.72.0 + Tesseract OCR | Production |
| HTML extraction | ar5iv + BeautifulSoup | Production |
| LaTeX extraction | arXiv e-print + pandoc | Production |
| Component taxonomy | UDOM 25-type schema | Production |
| Quality grading | 9-dimension QA scoring | Production |
| Output formats | .md, .udom.json, .content.jsonl, .audit.jsonl | Production |
Reference: ADR-164 (Universal Document Object Model)
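To make the Layer 1 output formats concrete, here is a minimal sketch of consuming a `.content.jsonl` file, where each line carries one typed UDOM component. The field names (`type`, `text`, `source`) are illustrative assumptions, not the actual UDOM schema defined in ADR-164:

```python
import json
from collections import Counter

# Hypothetical .content.jsonl payload: one UDOM component per line.
# Field names here are assumptions for illustration only.
sample_jsonl = """\
{"type": "heading", "text": "3. Method", "source": "latex"}
{"type": "paragraph", "text": "We fine-tune with RLHF.", "source": "html"}
{"type": "table", "text": "Accuracy: 97.3%", "source": "pdf"}
{"type": "paragraph", "text": "Results improve over the baseline.", "source": "pdf"}
"""

def components_by_type(jsonl_text):
    """Group UDOM components by their declared component type."""
    counts = Counter()
    for line in jsonl_text.splitlines():
        component = json.loads(line)
        counts[component["type"]] += 1
    return dict(counts)

print(components_by_type(sample_jsonl))
# {'heading': 1, 'paragraph': 2, 'table': 1}
```

A pass like this is the natural seam between Layer 1 and Layer 2: the graph builder only ever sees typed, provenance-tagged components, never raw PDFs.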
Layer 2: Knowledge Graph — Status: PROPOSED
Decision: Build a knowledge graph that extracts entities and relationships from UDOM components and links them across documents.
Technology choice (to be validated):
- Primary: FoundationDB with a custom graph layer (aligns with CODITECT's existing FoundationDB expertise)
- Alternative: Neo4j for rapid prototyping, migrate to FoundationDB for production
- Evaluation criteria: Write throughput (>1,000 entities/sec), query latency (<100ms for 2-hop traversals), horizontal scalability
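One way a custom graph layer on FoundationDB could meet the traversal-latency criterion is the standard tuple-key adjacency layout: forward and reverse edge indexes that turn neighbor lookups into prefix range scans. The sketch below is a pure-Python stand-in (a dict instead of a transaction over `fdb.tuple`-packed keys), not FoundationDB API code:

```python
# Stand-in for an ordered key space; real code would pack these tuples
# with fdb.tuple and read/write them inside a FoundationDB transaction.
store = {}

def add_edge(src, rel, dst):
    # Forward index: range-scan all outgoing edges of src.
    store[("edge", src, rel, dst)] = b""
    # Reverse index: range-scan all incoming edges of dst.
    store[("redge", dst, rel, src)] = b""

def outgoing(src, rel=None):
    """Emulate a prefix range scan over the forward edge index."""
    return sorted(
        dst for (tag, s, r, dst) in store
        if tag == "edge" and s == src and (rel is None or r == rel)
    )

add_edge("paper-A", "USES_METHOD", "transformer")
add_edge("paper-A", "CITES", "paper-F")
add_edge("paper-B", "EXTENDS", "paper-A")
print(outgoing("paper-A"))  # ['paper-F', 'transformer']
```

The dual-index layout is what makes 2-hop traversals cheap: each hop is one contiguous range read rather than a scan of the whole edge set.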
Entity types (initial taxonomy):
| Entity Type | Example | Source |
|---|---|---|
| Concept | "attention mechanism", "gradient descent" | Headings, abstracts, definitions |
| Method | "transformer architecture", "RLHF" | Method sections, equations |
| Result | "97.3% accuracy on ImageNet" | Results sections, tables |
| Dataset | "CIFAR-10", "Common Crawl" | Data sections, references |
| Author | "Yann LeCun", "Geoffrey Hinton" | Metadata, citations |
| Institution | "Meta AI", "DeepMind" | Affiliations |
| Claim | "attention is all you need" | Abstracts, conclusions |
| Citation | paper-to-paper reference | Bibliography |
Relationship types:
| Relationship | Example |
|---|---|
| USES_METHOD | Paper A uses transformer architecture |
| EXTENDS | Paper B extends Paper A's approach |
| CONTRADICTS | Paper C's results contradict Paper D |
| CITES | Paper E cites Paper F |
| AUTHORED_BY | Paper G authored by researcher H |
| EVALUATES_ON | Method I evaluated on dataset J |
| IMPROVES_OVER | Result K improves over baseline L |
| TEMPORAL_FOLLOWS | Work M chronologically follows work N |
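The two taxonomies above can be sketched as typed records with provenance attached, which is roughly the shape ADR-175 will need to pin down. The field names (`entity_type`, `source_doc`, etc.) are assumptions for illustration, not the final schema:

```python
from dataclasses import dataclass

# Initial taxonomy from the tables above; validation catches typos early.
ENTITY_TYPES = {"Concept", "Method", "Result", "Dataset",
                "Author", "Institution", "Claim", "Citation"}
RELATIONSHIP_TYPES = {"USES_METHOD", "EXTENDS", "CONTRADICTS", "CITES",
                      "AUTHORED_BY", "EVALUATES_ON", "IMPROVES_OVER",
                      "TEMPORAL_FOLLOWS"}

@dataclass(frozen=True)
class Entity:
    entity_type: str   # one of ENTITY_TYPES
    name: str          # canonical surface form, e.g. "attention mechanism"
    source_doc: str    # UDOM document id, carrying Layer 1 provenance forward

    def __post_init__(self):
        if self.entity_type not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {self.entity_type}")

@dataclass(frozen=True)
class Relationship:
    rel_type: str      # one of RELATIONSHIP_TYPES
    src: Entity
    dst: Entity

    def __post_init__(self):
        if self.rel_type not in RELATIONSHIP_TYPES:
            raise ValueError(f"unknown relationship type: {self.rel_type}")

method = Entity("Method", "transformer architecture", "paper-A")
paper = Entity("Citation", "paper-A", "paper-A")
edge = Relationship("USES_METHOD", paper, method)
print(edge.rel_type)  # USES_METHOD
```

Keeping `source_doc` on every entity is what lets the graph inherit UDOM's provenance guarantee: any claim surfaced by Layer 3 can be traced back to an extracted component.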
Knowledge graph ADR (follow-up): A detailed ADR for knowledge graph schema, technology selection, and entity extraction pipeline will be created as ADR-175 once prototyping validates the approach.
Layer 3: Synthesis — Status: PLANNED
Decision: Build an agentic synthesis layer that performs cross-document reasoning over the knowledge graph.
Capabilities:
- Gap detection — Identify unexplored combinations of methods and domains
- Contradiction identification — Surface conflicting claims across papers
- Trend analysis — Track how concepts, methods, and results evolve over time
- Literature review generation — Produce structured reviews of any topic from the graph
- Research question generation — Suggest novel research directions based on graph topology
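Gap detection in particular reduces to graph topology: treat methods and domains as nodes, EVALUATES_ON edges as explored combinations, and report the missing edges in the bipartite grid. A toy sketch under that framing (the example edges are invented, not real findings):

```python
from itertools import product

# Explored (method, domain) pairs, as EVALUATES_ON edges would record them.
# These example pairs are illustrative only.
evaluated = {
    ("transformer", "text"),
    ("transformer", "vision"),
    ("graph neural network", "molecules"),
}

methods = {m for m, _ in evaluated}
domains = {d for _, d in evaluated}

# Gaps are the method x domain combinations with no EVALUATES_ON edge.
gaps = sorted(set(product(methods, domains)) - evaluated)
for method, domain in gaps:
    print(f"unexplored: {method} on {domain}")
```

A production version would rank gaps (many missing edges are missing for good reason) rather than enumerate them, but the core query stays this simple.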
Architecture: Agent orchestration using CODITECT's existing multi-agent framework, with the knowledge graph as the shared memory/context layer.
Layer 4: Interface — Status: PLANNED
Decision: Build interactive interfaces for querying, exploring, and generating from the knowledge graph.
Components:
- Natural language query — "What methods improve transformer efficiency?" → structured graph traversal + synthesis
- Visual graph explorer — Interactive knowledge graph visualization
- Report generator — Generate structured reports (literature reviews, competitive analyses, trend summaries)
- API — Programmatic access for integration with existing research workflows
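The natural language query component above could compile a parsed question into a small traversal plan over Layer 2. A minimal sketch of the example query, with hand-written edges standing in for the graph and for the parser (both hypothetical):

```python
# Toy edge list; real queries would hit the Layer 2 graph store.
# The specific papers/methods here are invented for illustration.
edges = [
    ("FlashAttention", "IMPROVES_OVER", "standard attention"),
    ("linear attention", "IMPROVES_OVER", "standard attention"),
    ("standard attention", "USES_METHOD", "transformer"),
]

def methods_improving_over(target):
    """2-hop plan for 'what improves <target> efficiency?':
    find baselines that use the target, then what improves over them."""
    baselines = {s for s, r, d in edges if r == "USES_METHOD" and d == target}
    return sorted(s for s, r, d in edges
                  if r == "IMPROVES_OVER" and d in baselines)

print(methods_improving_over("transformer"))
# ['FlashAttention', 'linear attention']
```

The point of the sketch is the shape of the pipeline: parse once, traverse deterministically, and only hand the traversal results to a synthesis agent, which keeps answers grounded in graph edges rather than free-form generation.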
Rationale
Why Four Layers?
Each layer builds on the one below, and no layer can be skipped:
- Without extraction, there is no structured data to work with
- Without a knowledge graph, cross-document reasoning is impossible (each paper is isolated)
- Without synthesis, the graph is a data structure, not a knowledge asset
- Without an interface, the system is inaccessible to non-technical users
Why Build on UDOM?
The UDOM extraction pipeline is the validated foundation because:
- Multi-source fusion — No other system combines 3 independent sources (PDF, HTML, LaTeX) for a single document
- Universal types — 25 component types cover academic, business, legal, and patent documents
- Provenance tracking — Every component traces back to its extraction source, enabling confidence scoring
- Production-validated — 218/218 Grade A, zero failures, 62x performance improvement over prior approach
Why Not Use Existing Knowledge Graph Systems?
| System | Limitation |
|---|---|
| Semantic Scholar API | Read-only, no custom entity types, no cross-domain support |
| Google Knowledge Graph | Web-focused, not document-focused, no synthesis capabilities |
| Neo4j + manual schema | No extraction pipeline, no provenance, no multi-source alignment |
| Elicit's internal graph | Proprietary, single-domain (academic only), no API access |
The gap is in the combination: extraction + graph + synthesis + interface as an integrated system.
Consequences
Positive
- Compounding knowledge moat — Each document processed enriches the entire graph, creating increasing returns to scale
- Multi-vertical applicability — UDOM's universal types enable expansion beyond academic papers to legal, medical, financial, and patent domains
- Platform foundation — The four-layer architecture enables multiple products (research assistant, competitive intelligence, regulatory monitoring)
- Differentiated from incumbents — No existing product combines multi-source extraction, knowledge graph, agentic synthesis, and interactive interface
Negative
- Significant engineering investment — Estimated 6-8 months to MVP for layers 2-4 (per Judge 5 assessment)
- Knowledge graph design risk — Entity and relationship taxonomy must be validated empirically; over-engineering the schema risks building the wrong abstractions
- Diverse corpus validation required — Current 218-paper corpus is homogeneous (ML/arXiv); medical, legal, and business documents have different structures
- Inference cost at scale — Agentic synthesis may be expensive ($5-$50 per deep query, per the MoE assessment); cost optimization is critical for unit economics
Risks and Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| Knowledge graph schema doesn't generalize | High | Validate on 3+ domains before committing schema |
| Inference costs exceed revenue per query | High | Tiered processing (fast/cheap for simple queries, deep/expensive for synthesis) |
| Competitor ships similar product first | Medium | Focus on depth of knowledge graph moat, not breadth of features |
| Extraction quality degrades on non-arXiv content | Medium | Expand UDOM extractors (DOCX, HTML-native, OCR-only) before graph layer |
| Team bandwidth insufficient for 4-layer build | Medium | Phase implementation: graph first, synthesis second, interface third |
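The tiered-processing mitigation can be sketched as a simple query router. The tier names and cost figures below are illustrative assumptions (loosely anchored to the $5-$50 deep-query range cited above), not measured numbers:

```python
def route_query(query_type):
    """Route queries to the cheapest tier that can answer them.
    Tiers and per-query cost estimates are illustrative only."""
    tiers = {
        "lookup": ("graph-traversal", 0.01),       # cached reads, no LLM
        "summary": ("single-pass-llm", 0.50),      # one model call
        "synthesis": ("agentic-pipeline", 25.00),  # multi-step reasoning
    }
    return tiers[query_type]

engine, est_cost = route_query("synthesis")
print(engine, est_cost)  # agentic-pipeline 25.0
```

Even a coarse router like this changes unit economics materially if, as is typical, the large majority of queries are lookups rather than deep synthesis.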
Implementation Plan
Phase 1: Knowledge Graph Foundation (Months 1-2)
- Design entity and relationship taxonomy (ADR-175)
- Prototype graph storage (FoundationDB graph layer or Neo4j)
- Build entity extraction pipeline from UDOM components
- Validate on 1,000+ papers across 3+ domains
- Deliverable: Populated knowledge graph with cross-document entity linking
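The Phase 1 entity extraction pipeline might start as a rule-based pass that routes each UDOM component type to candidate entity types, before any learned extraction is added. The routing table and component fields below are assumptions, not the UDOM spec:

```python
# Hypothetical first-pass routing from UDOM component types (Layer 1)
# to entity types (Layer 2). Deliberately coarse: recall over precision.
ROUTING = {
    "heading": "Concept",
    "equation": "Method",
    "table": "Result",
    "reference": "Citation",
}

def extract_entities(components):
    """Emit (entity_type, text, doc_id) candidates for graph ingestion."""
    entities = []
    for c in components:
        entity_type = ROUTING.get(c["type"])
        if entity_type is not None:
            entities.append((entity_type, c["text"], c["doc_id"]))
    return entities

components = [
    {"type": "heading", "text": "Attention Mechanisms", "doc_id": "paper-A"},
    {"type": "paragraph", "text": "We propose...", "doc_id": "paper-A"},
    {"type": "table", "text": "97.3% accuracy", "doc_id": "paper-A"},
]
print(extract_entities(components))
# [('Concept', 'Attention Mechanisms', 'paper-A'), ('Result', '97.3% accuracy', 'paper-A')]
```

Starting rule-based keeps the Phase 1 deliverable measurable: the 3-domain validation can compare this baseline against LLM-based extraction before ADR-175 commits to a pipeline.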
Phase 2: Synthesis Engine (Months 3-5)
- Build cross-document reasoning agents
- Implement gap detection, contradiction identification, trend analysis
- Define agent orchestration patterns for multi-step synthesis
- Deliverable: Automated literature review generation from graph queries
Phase 3: Interface Layer (Months 4-6)
- Natural language query interface
- Visual graph explorer (React + D3/vis.js)
- Report generation templates
- API for programmatic access
- Deliverable: Interactive Research Continuum product
Phase 4: Validation and Scale (Months 6-8)
- Multi-domain validation (medical, legal, financial)
- Performance optimization (query latency, inference cost)
- Customer pilot programs (2-3 LOIs)
- Deliverable: Investor-ready metrics and customer evidence
Related Decisions
| ADR | Relationship |
|---|---|
| ADR-164 | Parent — UDOM schema defines Layer 1 extraction |
| ADR-165 | Related — Sidecar architecture for browser-native intelligence |
| ADR-175 (future) | Child — Knowledge graph schema and technology selection |
| ADR-176 (future) | Child — Synthesis agent orchestration patterns |
MoE Assessment Reference
This ADR was informed by a 5-judge MoE evaluation panel (2026-02-11):
| Metric | Value |
|---|---|
| Weighted consensus score | 7.4/10 |
| Verdict | APPROVED WITH CONDITIONS |
| Strongest dimension | Technical Feasibility (8.7/10) |
| Weakest dimension | Revenue Model (5.2/10) |
| Critical finding | Only 20-25% of full stack exists (extraction layer only) |
Full assessment: internal/analysis/research-continuum/CODITECT-Research-Continuum-MoE-Assessment.md
Decision Made: 2026-02-11
Decision Maker: Hal Casteel (Lead Architect)
Status: PROPOSED — pending team review and Phase 1 prototype validation