ADR-174: Research Continuum — Agentic Knowledge Infrastructure

Document: ADR-174-research-continuum-agentic-knowledge-infrastructure
Version: 1.0.0
Purpose: Establish the architectural foundation for the Research Continuum product — an agentic knowledge infrastructure that transforms static document collections into compounding knowledge assets
Audience: Framework contributors, product architects, engineering leadership
Date Created: 2026-02-11
Status: PROPOSED
Related ADRs:
- ADR-164-universal-document-object-model (extraction layer — UDOM schema and pipeline)
- ADR-165 through ADR-169 (sidecar architecture — browser-native code intelligence)
Related Documents:
- internal/analysis/research-continuum/CODITECT-Research-Continuum-Vision-Document.md
- internal/analysis/research-continuum/CODITECT-Research-Continuum-MoE-Assessment.md

Context and Problem Statement

The Knowledge Production-Consumption Asymmetry

Academic and enterprise knowledge production has reached a scale that exceeds human consumption capacity:

  • 2.5 million scientific papers published annually (growing 4-5% per year)
  • A dedicated researcher can read approximately 250 papers/year at deep comprehension
  • This creates a 10,000:1 production-to-consumption ratio: about 99.99% of each year's published output goes unread by any individual researcher
  • Enterprise organizations face the same asymmetry with internal reports, market analyses, regulatory filings, and competitive intelligence
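The ratio and percentage quoted above follow directly from the two stated figures. A quick check, using only the numbers in the list (the variable names are illustrative):

```python
# Figures quoted in the list above.
papers_published_per_year = 2_500_000
papers_read_per_year = 250

# Production-to-consumption ratio.
ratio = papers_published_per_year // papers_read_per_year

# Share of one year's output that a single researcher cannot read.
unread_share = 1 - papers_read_per_year / papers_published_per_year

print(f"{ratio:,}:1")         # 10,000:1
print(f"{unread_share:.2%}")  # 99.99%
```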

Current Solutions Are Inadequate

| Approach | What It Does | Why It Fails |
| --- | --- | --- |
| Search engines (Google Scholar) | Finds papers by keyword | No comprehension, no synthesis, no cross-document linking |
| Chatbots (ChatGPT) | Summarizes individual papers | No persistent memory, no compounding knowledge, hallucination risk |
| Reference managers (Zotero, Mendeley) | Organizes citations | No content understanding, no knowledge extraction |
| Literature review tools (Elicit, Consensus) | Searches + summarizes | Single-paper focus, no knowledge graph, no cross-document synthesis |

The gap: No system transforms a collection of documents into a compounding knowledge asset where each new document enriches the understanding of every prior document.

CODITECT's Position

CODITECT has built and validated the extraction layer of a complete agentic knowledge infrastructure:

  • UDOM Pipeline v1.7 — Multi-source extraction (Docling PDF + ar5iv HTML + arXiv LaTeX) with 25-type component taxonomy
  • Validated at scale — 218/218 papers Grade A (100%), average score 0.898
  • Production-grade — 62x faster than prior approach (Docling), zero failures in batch processing
  • Universal schema — UDOM supports academic papers, business documents, legal filings, patents

This ADR establishes the architectural decisions for building the remaining layers on top of this validated foundation.


Decision

We will architect the Research Continuum as a four-layer system that transforms document collections into compounding knowledge assets through agentic processing.

Architecture: Four-Layer Stack

┌─────────────────────────────────────────────────┐
│ Layer 4: INTERFACE │
│ Query, explore, generate from knowledge graph │
├─────────────────────────────────────────────────┤
│ Layer 3: SYNTHESIS │
│ Cross-document reasoning, gap detection, │
│ contradiction identification, trend analysis │
├─────────────────────────────────────────────────┤
│ Layer 2: KNOWLEDGE GRAPH │
│ Entity extraction, relationship mapping, │
│ cross-document linking, temporal tracking │
├─────────────────────────────────────────────────┤
│ Layer 1: EXTRACTION (UDOM) ✅ BUILT │
│ Multi-source ingestion, typed components, │
│ provenance tracking, quality grading │
└─────────────────────────────────────────────────┘

Layer 1: Extraction (UDOM) — Status: BUILT

Decision: Use UDOM Pipeline v1.7 as the extraction foundation.

| Component | Technology | Status |
| --- | --- | --- |
| PDF extraction | Docling v2.72.0 + Tesseract OCR | Production |
| HTML extraction | ar5iv + BeautifulSoup | Production |
| LaTeX extraction | arXiv e-print + pandoc | Production |
| Component taxonomy | UDOM 25-type schema | Production |
| Quality grading | 9-dimension QA scoring | Production |
| Output formats | .md, .udom.json, .content.jsonl, .audit.jsonl | Production |
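As an illustration of how a downstream layer might consume the `.content.jsonl` output, here is a minimal sketch that tallies components by type. The record layout is an assumption: this ADR does not specify it (ADR-164 defines the actual schema), so the `type` field name and the `count_component_types` helper are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def count_component_types(jsonl_path: Path) -> Counter:
    """Tally records in a JSON Lines file by their (assumed) "type" field."""
    counts: Counter = Counter()
    with jsonl_path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            # "type" is a hypothetical field name; records without it are
            # grouped under "unknown" rather than dropped.
            counts[record.get("type", "unknown")] += 1
    return counts
```

A tally like this is a cheap sanity check before graph ingestion: a document whose components all land in one bucket probably deserves a second look.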

Reference: ADR-164 (Universal Document Object Model)

Layer 2: Knowledge Graph — Status: PROPOSED

Decision: Build a knowledge graph that extracts entities and relationships from UDOM components and links them across documents.

Technology choice (to be validated):

  • Primary: FoundationDB with custom graph layer (aligns with CODITECT's existing FoundationDB expertise per database architecture skills)
  • Alternative: Neo4j for rapid prototyping, migrate to FoundationDB for production
  • Evaluation criteria: Write throughput (>1,000 entities/sec), query latency (<100ms for 2-hop traversals), horizontal scalability
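The two quantified criteria above can be turned into an automated pass/fail gate for prototype benchmarks. The sketch below assumes a hypothetical `BenchmarkResult` record produced by whatever harness the prototype uses. Horizontal scalability has no numeric threshold in this ADR, so it is not checked here.

```python
from dataclasses import dataclass

# Thresholds taken from the evaluation criteria above.
MIN_WRITE_THROUGHPUT = 1_000     # entities per second (exclusive lower bound)
MAX_TWO_HOP_LATENCY_MS = 100.0   # milliseconds (exclusive upper bound)


@dataclass(frozen=True)
class BenchmarkResult:
    """Hypothetical record emitted by a prototype benchmark run."""
    backend: str
    entities_written_per_sec: float
    two_hop_latency_ms: float


def failed_criteria(result: BenchmarkResult) -> list[str]:
    """Return the names of the criteria this result does not meet."""
    failures = []
    if result.entities_written_per_sec <= MIN_WRITE_THROUGHPUT:
        failures.append("write_throughput")
    if result.two_hop_latency_ms >= MAX_TWO_HOP_LATENCY_MS:
        failures.append("two_hop_latency")
    return failures


def meets_criteria(result: BenchmarkResult) -> bool:
    return not failed_criteria(result)
```

Returning the list of failed criteria, rather than a bare boolean, makes it easier to compare candidate backends that fail for different reasons.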

Entity types (initial taxonomy):

| Entity Type | Example | Source |
| --- | --- | --- |
| Concept | "attention mechanism", "gradient descent" | Headings, abstracts, definitions |
| Method | "transformer architecture", "RLHF" | Method sections, equations |
| Result | "97.3% accuracy on ImageNet" | Results sections, tables |
| Dataset | "CIFAR-10", "Common Crawl" | Data sections, references |
| Author | "Yann LeCun", "Geoffrey Hinton" | Metadata, citations |
| Institution | "Meta AI", "DeepMind" | Affiliations |
| Claim | "attention is all you need" | Abstracts, conclusions |
| Citation | paper-to-paper reference | Bibliography |
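One way to picture the taxonomy above is as a typed mention record plus a linking key that joins mentions of the same entity across documents. This is a sketch only: the final schema is deferred to ADR-175, and `EntityMention`, `link_key`, and the naive case-and-whitespace normalization are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class EntityType(Enum):
    CONCEPT = "Concept"
    METHOD = "Method"
    RESULT = "Result"
    DATASET = "Dataset"
    AUTHOR = "Author"
    INSTITUTION = "Institution"
    CLAIM = "Claim"
    CITATION = "Citation"


@dataclass(frozen=True)
class EntityMention:
    """One occurrence of an entity in one document (hypothetical shape)."""
    entity_type: EntityType
    name: str
    document_id: str
    source_component: str  # e.g. "abstract" or "table"; the provenance hook


def link_key(mention: EntityMention) -> tuple[EntityType, str]:
    # Naive normalization: case-fold and collapse whitespace.
    normalized = " ".join(mention.name.casefold().split())
    return (mention.entity_type, normalized)


def link_across_documents(mentions):
    """Group document ids by the entity they mention."""
    linked = defaultdict(set)
    for mention in mentions:
        linked[link_key(mention)].add(mention.document_id)
    return dict(linked)


mentions = [
    EntityMention(EntityType.CONCEPT, "Attention Mechanism", "doc-1", "abstract"),
    EntityMention(EntityType.CONCEPT, "attention  mechanism", "doc-2", "heading"),
    EntityMention(EntityType.DATASET, "CIFAR-10", "doc-2", "table"),
]
linked = link_across_documents(mentions)
print(sorted(linked[(EntityType.CONCEPT, "attention mechanism")]))  # ['doc-1', 'doc-2']
```

Real entity resolution would need more than string normalization (synonyms, abbreviations such as "RLHF", author name variants). The point here is only that the entity type is part of the key, so a Method and a Concept with the same surface form stay distinct.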

Relationship types:

| Relationship | Example |
| --- | --- |
| USES_METHOD | Paper A uses transformer architecture |
| EXTENDS | Paper B extends Paper A's approach |
| CONTRADICTS | Paper C's results contradict Paper D |
| CITES | Paper E cites Paper F |
| AUTHORED_BY | Paper G authored by researcher H |
| EVALUATES_ON | Method I evaluated on dataset J |
| IMPROVES_OVER | Result K improves over baseline L |
| TEMPORAL_FOLLOWS | Work M chronologically follows work N |
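To make the relationship types concrete, the sketch below stores typed edges in an in-memory adjacency map and runs the kind of 2-hop traversal named in the evaluation criteria. It is a toy stand-in, not the FoundationDB or Neo4j design; `InMemoryGraph` and its methods are hypothetical names.

```python
from collections import defaultdict
from enum import Enum
from typing import Optional


class Relation(Enum):
    USES_METHOD = "USES_METHOD"
    EXTENDS = "EXTENDS"
    CONTRADICTS = "CONTRADICTS"
    CITES = "CITES"
    AUTHORED_BY = "AUTHORED_BY"
    EVALUATES_ON = "EVALUATES_ON"
    IMPROVES_OVER = "IMPROVES_OVER"
    TEMPORAL_FOLLOWS = "TEMPORAL_FOLLOWS"


class InMemoryGraph:
    """Directed, typed edges kept in a plain adjacency map."""

    def __init__(self) -> None:
        self._outgoing = defaultdict(list)  # node -> [(Relation, target), ...]

    def add_edge(self, source: str, relation: Relation, target: str) -> None:
        self._outgoing[source].append((relation, target))

    def neighbors(self, node: str, relation: Optional[Relation] = None) -> list[str]:
        # .get avoids creating empty entries for unknown nodes.
        edges = self._outgoing.get(node, [])
        return [t for r, t in edges if relation is None or r is relation]

    def two_hop(self, start: str, first: Relation, second: Relation) -> set[str]:
        """Follow `first` from start, then `second` from each intermediate node."""
        return {
            target
            for middle in self.neighbors(start, first)
            for target in self.neighbors(middle, second)
        }


graph = InMemoryGraph()
graph.add_edge("Paper B", Relation.EXTENDS, "Paper A")
graph.add_edge("Paper A", Relation.USES_METHOD, "transformer architecture")

# Which methods are used by the work that Paper B extends?
print(graph.two_hop("Paper B", Relation.EXTENDS, Relation.USES_METHOD))
# {'transformer architecture'}
```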

Knowledge graph ADR (follow-up): A detailed ADR for knowledge graph schema, technology selection, and entity extraction pipeline will be created as ADR-175 once prototyping validates the approach.

Layer 3: Synthesis — Status: PLANNED

Decision: Build an agentic synthesis layer that performs cross-document reasoning over the knowledge graph.

Capabilities:

  • Gap detection — Identify unexplored combinations of methods and domains
  • Contradiction identification — Surface conflicting claims across papers
  • Trend analysis — Track how concepts, methods, and results evolve over time
  • Literature review generation — Produce structured reviews of any topic from the graph
  • Research question generation — Suggest novel research directions based on graph topology
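Gap detection, the first capability above, can be illustrated in its simplest form as a set difference: enumerate every method and domain pairing, then subtract the pairings already observed in the graph. This is a sketch under that assumption. The `find_gaps` helper and the placeholder labels are hypothetical, and a production version would need to rank candidates rather than list them all.

```python
from itertools import product


def find_gaps(observed: set[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return (method, domain) pairs that never co-occur in `observed`.

    `observed` holds the pairs for which the graph already contains evidence,
    for example a paper applying the method in that domain.
    """
    methods = sorted({method for method, _ in observed})
    domains = sorted({domain for _, domain in observed})
    return [pair for pair in product(methods, domains) if pair not in observed]


# Placeholder labels, not real findings.
observed = {
    ("method-a", "domain-x"),
    ("method-a", "domain-y"),
    ("method-b", "domain-x"),
}
print(find_gaps(observed))  # [('method-b', 'domain-y')]
```

Not every missing pair is an interesting gap. Many combinations are unexplored because they make no sense, which is where the agentic reasoning described in this layer would have to filter and prioritize.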

Architecture: Agent orchestration using CODITECT's existing multi-agent framework, with the knowledge graph as the shared memory/context layer.

Layer 4: Interface — Status: PLANNED

Decision: Build interactive interfaces for querying, exploring, and generating from the knowledge graph.

Components:

  • Natural language query — "What methods improve transformer efficiency?" → structured graph traversal + synthesis
  • Visual graph explorer — Interactive knowledge graph visualization
  • Report generator — Generate structured reports (literature reviews, competitive analyses, trend summaries)
  • API — Programmatic access for integration with existing research workflows

Rationale

Why Four Layers?

Each layer builds on the one below, and no layer can be skipped:

  1. Without extraction, there is no structured data to work with
  2. Without a knowledge graph, cross-document reasoning is impossible (each paper is isolated)
  3. Without synthesis, the graph is a data structure, not a knowledge asset
  4. Without an interface, the system is inaccessible to non-technical users

Why Build on UDOM?

The UDOM extraction pipeline is the validated foundation because:

  • Multi-source fusion — No other system combines 3 independent sources (PDF, HTML, LaTeX) for a single document
  • Universal types — 25 component types cover academic, business, legal, and patent documents
  • Provenance tracking — Every component traces back to its extraction source, enabling confidence scoring
  • Production-validated — 218/218 Grade A, zero failures, 62x performance improvement over prior approach
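The link between provenance and confidence scoring can be sketched as a majority vote across extraction sources. This ADR does not say how UDOM computes confidence, so the approach below (share of the three sources that agree on a whitespace-normalized reading) is an assumption, and `agreement_confidence` is a hypothetical helper.

```python
from collections import Counter

# The three extraction sources named in this ADR.
SOURCES = ("pdf", "html", "latex")


def agreement_confidence(readings: dict[str, str]) -> tuple[str, float]:
    """Return the majority reading and the share of all sources supporting it.

    `readings` maps a source name to the text that source extracted for one
    component. Sources that produced nothing are simply absent from the dict
    and therefore lower the score.
    """
    if not readings:
        raise ValueError("at least one source reading is required")
    unknown = set(readings) - set(SOURCES)
    if unknown:
        raise ValueError(f"unknown sources: {sorted(unknown)}")
    # Collapse whitespace so trivial spacing differences still count as agreement.
    votes = Counter(" ".join(text.split()) for text in readings.values())
    best_text, best_votes = votes.most_common(1)[0]
    return best_text, best_votes / len(SOURCES)


text, confidence = agreement_confidence(
    {"pdf": "E = mc^2", "html": "E  =  mc^2", "latex": "E = m c^2"}
)
print(text, round(confidence, 2))  # E = mc^2 0.67
```

Dividing by the total number of sources, rather than by the number of readings present, means a component seen by only one extractor scores 1/3 instead of a misleading 1.0.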

Why Not Use Existing Knowledge Graph Systems?

| System | Limitation |
| --- | --- |
| Semantic Scholar API | Read-only, no custom entity types, no cross-domain support |
| Google Knowledge Graph | Web-focused, not document-focused, no synthesis capabilities |
| Neo4j + manual schema | No extraction pipeline, no provenance, no multi-source alignment |
| Elicit's internal graph | Proprietary, single-domain (academic only), no API access |

The gap is in the combination: extraction + graph + synthesis + interface as an integrated system.


Consequences

Positive

  1. Compounding knowledge moat — Each document processed enriches the entire graph, creating increasing returns to scale
  2. Multi-vertical applicability — UDOM's universal types enable expansion beyond academic papers to legal, medical, financial, and patent domains
  3. Platform foundation — The four-layer architecture enables multiple products (research assistant, competitive intelligence, regulatory monitoring)
  4. Differentiated from incumbents — No existing product combines multi-source extraction, knowledge graph, agentic synthesis, and interactive interface

Negative

  1. Significant engineering investment — Estimated 6-8 months to MVP for layers 2-4 (per Judge 5 assessment)
  2. Knowledge graph design risk — Entity and relationship taxonomy must be validated empirically; over-engineering the schema risks building the wrong abstractions
  3. Diverse corpus validation required — Current 218-paper corpus is homogeneous (ML/arXiv); medical, legal, and business documents have different structures
  4. Inference cost at scale — Agentic synthesis may be expensive ($5-$50/deep query per MoE assessment); cost optimization is critical for unit economics

Risks and Mitigations

| Risk | Impact | Mitigation |
| --- | --- | --- |
| Knowledge graph schema doesn't generalize | High | Validate on 3+ domains before committing schema |
| Inference costs exceed revenue per query | High | Tiered processing (fast/cheap for simple queries, deep/expensive for synthesis) |
| Competitor ships similar product first | Medium | Focus on depth of knowledge graph moat, not breadth of features |
| Extraction quality degrades on non-arXiv content | Medium | Expand UDOM extractors (DOCX, HTML-native, OCR-only) before graph layer |
| Team bandwidth insufficient for 4-layer build | Medium | Phase implementation: graph first, synthesis second, interface third |
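The tiered-processing mitigation can be sketched as a small router plus a blended-cost estimate. The routing rule (synthesis requests and traversals deeper than two hops go to the deep tier) and all names below are assumptions for illustration. Only the $5-$50 deep-query range comes from this ADR, and the fast-tier cost is left as a parameter because the document does not state one.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    FAST = "fast"  # cheap path for simple queries
    DEEP = "deep"  # expensive agentic synthesis


@dataclass(frozen=True)
class Query:
    """Hypothetical description of an incoming request."""
    text: str
    needs_synthesis: bool
    max_hops: int


def choose_tier(query: Query, hop_threshold: int = 2) -> Tier:
    # Assumed rule: escalate when synthesis is requested or the traversal is deep.
    if query.needs_synthesis or query.max_hops > hop_threshold:
        return Tier.DEEP
    return Tier.FAST


def blended_cost_per_query(deep_share: float, fast_cost: float, deep_cost: float) -> float:
    """Expected cost per query when `deep_share` of the traffic hits the deep tier."""
    if not 0.0 <= deep_share <= 1.0:
        raise ValueError("deep_share must be between 0 and 1")
    return deep_share * deep_cost + (1.0 - deep_share) * fast_cost


lookup = Query("Which papers cite Paper F?", needs_synthesis=False, max_hops=1)
review = Query("Summarize trends in transformer efficiency", needs_synthesis=True, max_hops=3)
print(choose_tier(lookup).value, choose_tier(review).value)  # fast deep
```

The blended-cost function makes the unit-economics question explicit: average cost per query is driven mostly by the share of traffic that escalates to the deep tier, so the routing rule matters as much as the cost of either tier.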

Implementation Plan

Phase 1: Knowledge Graph Foundation (Months 1-2)

  • Design entity and relationship taxonomy (ADR-175)
  • Prototype graph storage (FoundationDB graph layer or Neo4j)
  • Build entity extraction pipeline from UDOM components
  • Validate on 1,000+ papers across 3+ domains
  • Deliverable: Populated knowledge graph with cross-document entity linking

Phase 2: Synthesis Engine (Months 3-5)

  • Build cross-document reasoning agents
  • Implement gap detection, contradiction identification, trend analysis
  • Define agent orchestration patterns for multi-step synthesis
  • Deliverable: Automated literature review generation from graph queries

Phase 3: Interface Layer (Months 4-6)

  • Natural language query interface
  • Visual graph explorer (React + D3/vis.js)
  • Report generation templates
  • API for programmatic access
  • Deliverable: Interactive Research Continuum product

Phase 4: Validation and Scale (Months 6-8)

  • Multi-domain validation (medical, legal, financial)
  • Performance optimization (query latency, inference cost)
  • Customer pilot programs (2-3 LOIs)
  • Deliverable: Investor-ready metrics and customer evidence

| ADR | Relationship |
| --- | --- |
| ADR-164 | Parent: UDOM schema defines Layer 1 extraction |
| ADR-165 | Related: Sidecar architecture for browser-native intelligence |
| ADR-175 (future) | Child: Knowledge graph schema and technology selection |
| ADR-176 (future) | Child: Synthesis agent orchestration patterns |

MoE Assessment Reference

This ADR was informed by a 5-judge MoE evaluation panel (2026-02-11):

| Metric | Value |
| --- | --- |
| Weighted consensus score | 7.4/10 |
| Verdict | APPROVED WITH CONDITIONS |
| Strongest dimension | Technical Feasibility (8.7/10) |
| Weakest dimension | Revenue Model (5.2/10) |
| Critical finding | Only 20-25% of full stack exists (extraction layer only) |

Full assessment: internal/analysis/research-continuum/CODITECT-Research-Continuum-MoE-Assessment.md


Decision Made: 2026-02-11
Decision Maker: Hal Casteel (Lead Architect)
Status: PROPOSED, pending team review and Phase 1 prototype validation