ADR-174: Research Continuum — Agentic Knowledge Infrastructure
Document: ADR-174-research-continuum-agentic-knowledge-infrastructure
Version: 1.0.0
Purpose: Establish the architectural foundation for the Research Continuum product — an agentic knowledge infrastructure that transforms static document collections into compounding knowledge assets
Audience: Framework contributors, product architects, engineering leadership
Date Created: 2026-02-11
Status: PROPOSED
Related ADRs:
- ADR-164-universal-document-object-model (extraction layer — UDOM schema and pipeline)
- ADR-165 through ADR-169 (sidecar architecture — browser-native code intelligence)
Related Documents:
- internal/analysis/research-continuum/CODITECT-Research-Continuum-Vision-Document.md
- internal/analysis/research-continuum/CODITECT-Research-Continuum-MoE-Assessment.md
Context and Problem Statement
The Knowledge Production-Consumption Asymmetry
Academic and enterprise knowledge production has reached a scale that exceeds human consumption capacity:
- 2.5 million scientific papers published annually (growing 4-5% per year)
- A dedicated researcher can read approximately 250 papers/year at deep comprehension
- This creates a 10,000:1 production-to-consumption ratio — 99.99% of relevant knowledge is invisible to any individual researcher
- Enterprise organizations face the same asymmetry with internal reports, market analyses, regulatory filings, and competitive intelligence
Current Solutions Are Inadequate
| Approach | What It Does | Why It Fails |
|---|---|---|
| Search engines (Google Scholar) | Finds papers by keyword | No comprehension, no synthesis, no cross-document linking |
| Chatbots (ChatGPT) | Summarizes individual papers | No persistent memory, no compounding knowledge, hallucination risk |
| Reference managers (Zotero, Mendeley) | Organizes citations | No content understanding, no knowledge extraction |
| Literature review tools (Elicit, Consensus) | Searches + summarizes | Single-paper focus, no knowledge graph, no cross-document synthesis |
The gap: No system transforms a collection of documents into a compounding knowledge asset where each new document enriches the understanding of every prior document.
CODITECT's Position
CODITECT has built and validated the extraction layer of a complete agentic knowledge infrastructure:
- UDOM Pipeline v1.7 — Multi-source extraction (Docling PDF + ar5iv HTML + arXiv LaTeX) with 25-type component taxonomy
- Validated at scale — 218/218 papers Grade A (100%), average score 0.898
- Production-grade — 62x faster than the prior extraction approach (after adopting Docling), zero failures in batch processing
- Universal schema — UDOM supports academic papers, business documents, legal filings, patents
This ADR establishes the architectural decisions for building the remaining layers on top of this validated foundation.
Decision
We will architect the Research Continuum as a four-layer system that transforms document collections into compounding knowledge assets through agentic processing.
Architecture: Four-Layer Stack
┌─────────────────────────────────────────────────┐
│ Layer 4: INTERFACE                              │
│ Query, explore, generate from knowledge graph   │
├─────────────────────────────────────────────────┤
│ Layer 3: SYNTHESIS                              │
│ Cross-document reasoning, gap detection,        │
│ contradiction identification, trend analysis    │
├─────────────────────────────────────────────────┤
│ Layer 2: KNOWLEDGE GRAPH                        │
│ Entity extraction, relationship mapping,        │
│ cross-document linking, temporal tracking       │
├─────────────────────────────────────────────────┤
│ Layer 1: EXTRACTION (UDOM) ✅ BUILT             │
│ Multi-source ingestion, typed components,       │
│ provenance tracking, quality grading            │
└─────────────────────────────────────────────────┘
Layer 1: Extraction (UDOM) — Status: BUILT
Decision: Use UDOM Pipeline v1.7 as the extraction foundation.
| Component | Technology | Status |
|---|---|---|
| PDF extraction | Docling v2.72.0 + Tesseract OCR | Production |
| HTML extraction | ar5iv + BeautifulSoup | Production |
| LaTeX extraction | arXiv e-print + pandoc | Production |
| Component taxonomy | UDOM 25-type schema | Production |
| Quality grading | 9-dimension QA scoring | Production |
| Output formats | .md, .udom.json, .content.jsonl, .audit.jsonl | Production |
Reference: ADR-164 (Universal Document Object Model)
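To make the Layer 1 output formats concrete, here is a minimal sketch of consuming a `.content.jsonl` file, where each line carries one typed UDOM component. The field names (`type`, `text`, `source`) are illustrative assumptions, not the actual UDOM schema defined in ADR-164:

```python
import json
from collections import Counter

# Hypothetical .content.jsonl payload: one UDOM component per line.
# Field names here are assumptions for illustration only.
sample_jsonl = """\
{"type": "heading", "text": "3. Method", "source": "latex"}
{"type": "paragraph", "text": "We fine-tune with RLHF.", "source": "html"}
{"type": "table", "text": "Accuracy: 97.3%", "source": "pdf"}
{"type": "paragraph", "text": "Results improve over the baseline.", "source": "pdf"}
"""

def components_by_type(jsonl_text):
    """Group UDOM components by their declared component type."""
    counts = Counter()
    for line in jsonl_text.splitlines():
        component = json.loads(line)
        counts[component["type"]] += 1
    return dict(counts)

print(components_by_type(sample_jsonl))
# {'heading': 1, 'paragraph': 2, 'table': 1}
```

A pass like this is the natural seam between Layer 1 and Layer 2: the graph builder only ever sees typed, provenance-tagged components, never raw PDFs.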
Layer 2: Knowledge Graph — Status: PROPOSED
Decision: Build a knowledge graph that extracts entities and relationships from UDOM components and links them across documents.
Technology choice (to be validated):
- Primary: FoundationDB with a custom graph layer (aligns with CODITECT's existing FoundationDB expertise)
- Alternative: Neo4j for rapid prototyping, migrate to FoundationDB for production
- Evaluation criteria: Write throughput (>1,000 entities/sec), query latency (<100ms for 2-hop traversals), horizontal scalability
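One way a custom graph layer on FoundationDB could meet the traversal-latency criterion is the standard tuple-key adjacency layout: forward and reverse edge indexes that turn neighbor lookups into prefix range scans. The sketch below is a pure-Python stand-in (a dict instead of a transaction over `fdb.tuple`-packed keys), not FoundationDB API code:

```python
# Stand-in for an ordered key space; real code would pack these tuples
# with fdb.tuple and read/write them inside a FoundationDB transaction.
store = {}

def add_edge(src, rel, dst):
    # Forward index: range-scan all outgoing edges of src.
    store[("edge", src, rel, dst)] = b""
    # Reverse index: range-scan all incoming edges of dst.
    store[("redge", dst, rel, src)] = b""

def outgoing(src, rel=None):
    """Emulate a prefix range scan over the forward edge index."""
    return sorted(
        dst for (tag, s, r, dst) in store
        if tag == "edge" and s == src and (rel is None or r == rel)
    )

add_edge("paper-A", "USES_METHOD", "transformer")
add_edge("paper-A", "CITES", "paper-F")
add_edge("paper-B", "EXTENDS", "paper-A")
print(outgoing("paper-A"))  # ['paper-F', 'transformer']
```

The dual-index layout is what makes 2-hop traversals cheap: each hop is one contiguous range read rather than a scan of the whole edge set.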
Entity types (initial taxonomy):
| Entity Type | Example | Source |
|---|---|---|
| Concept | "attention mechanism", "gradient descent" | Headings, abstracts, definitions |
| Method | "transformer architecture", "RLHF" | Method sections, equations |
| Result | "97.3% accuracy on ImageNet" | Results sections, tables |
| Dataset | "CIFAR-10", "Common Crawl" | Data sections, references |
| Author | "Yann LeCun", "Geoffrey Hinton" | Metadata, citations |
| Institution | "Meta AI", "DeepMind" | Affiliations |
| Claim | "attention is all you need" | Abstracts, conclusions |
| Citation | paper-to-paper reference | Bibliography |
Relationship types:
| Relationship | Example |
|---|---|
| USES_METHOD | Paper A uses transformer architecture |
| EXTENDS | Paper B extends Paper A's approach |
| CONTRADICTS | Paper C's results contradict Paper D |
| CITES | Paper E cites Paper F |
| AUTHORED_BY | Paper G authored by researcher H |
| EVALUATES_ON | Method I evaluated on dataset J |
| IMPROVES_OVER | Result K improves over baseline L |
| TEMPORAL_FOLLOWS | Work M chronologically follows work N |
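The two taxonomies above can be sketched as typed records with provenance attached, which is roughly the shape ADR-175 will need to pin down. The field names (`entity_type`, `source_doc`, etc.) are assumptions for illustration, not the final schema:

```python
from dataclasses import dataclass

# Initial taxonomy from the tables above; validation catches typos early.
ENTITY_TYPES = {"Concept", "Method", "Result", "Dataset",
                "Author", "Institution", "Claim", "Citation"}
RELATIONSHIP_TYPES = {"USES_METHOD", "EXTENDS", "CONTRADICTS", "CITES",
                      "AUTHORED_BY", "EVALUATES_ON", "IMPROVES_OVER",
                      "TEMPORAL_FOLLOWS"}

@dataclass(frozen=True)
class Entity:
    entity_type: str   # one of ENTITY_TYPES
    name: str          # canonical surface form, e.g. "attention mechanism"
    source_doc: str    # UDOM document id, carrying Layer 1 provenance forward

    def __post_init__(self):
        if self.entity_type not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {self.entity_type}")

@dataclass(frozen=True)
class Relationship:
    rel_type: str      # one of RELATIONSHIP_TYPES
    src: Entity
    dst: Entity

    def __post_init__(self):
        if self.rel_type not in RELATIONSHIP_TYPES:
            raise ValueError(f"unknown relationship type: {self.rel_type}")

method = Entity("Method", "transformer architecture", "paper-A")
paper = Entity("Citation", "paper-A", "paper-A")
edge = Relationship("USES_METHOD", paper, method)
print(edge.rel_type)  # USES_METHOD
```

Keeping `source_doc` on every entity is what lets the graph inherit UDOM's provenance guarantee: any claim surfaced by Layer 3 can be traced back to an extracted component.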
Knowledge graph ADR (follow-up): A detailed ADR for knowledge graph schema, technology selection, and entity extraction pipeline will be created as ADR-175 once prototyping validates the approach.
Layer 3: Synthesis — Status: PLANNED
Decision: Build an agentic synthesis layer that performs cross-document reasoning over the knowledge graph.
Capabilities:
- Gap detection — Identify unexplored combinations of methods and domains
- Contradiction identification — Surface conflicting claims across papers
- Trend analysis — Track how concepts, methods, and results evolve over time
- Literature review generation — Produce structured reviews of any topic from the graph
- Research question generation — Suggest novel research directions based on graph topology
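Gap detection in particular reduces to graph topology: treat methods and domains as nodes, EVALUATES_ON edges as explored combinations, and report the missing edges in the bipartite grid. A toy sketch under that framing (the example edges are invented, not real findings):

```python
from itertools import product

# Explored (method, domain) pairs, as EVALUATES_ON edges would record them.
# These example pairs are illustrative only.
evaluated = {
    ("transformer", "text"),
    ("transformer", "vision"),
    ("graph neural network", "molecules"),
}

methods = {m for m, _ in evaluated}
domains = {d for _, d in evaluated}

# Gaps are the method x domain combinations with no EVALUATES_ON edge.
gaps = sorted(set(product(methods, domains)) - evaluated)
for method, domain in gaps:
    print(f"unexplored: {method} on {domain}")
```

A production version would rank gaps (many missing edges are missing for good reason) rather than enumerate them, but the core query stays this simple.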
Architecture: Agent orchestration using CODITECT's existing multi-agent framework, with the knowledge graph as the shared memory/context layer.
Layer 4: Interface — Status: PLANNED
Decision: Build interactive interfaces for querying, exploring, and generating from the knowledge graph.
Components:
- Natural language query — "What methods improve transformer efficiency?" → structured graph traversal + synthesis
- Visual graph explorer — Interactive knowledge graph visualization
- Report generator — Generate structured reports (literature reviews, competitive analyses, trend summaries)
- API — Programmatic access for integration with existing research workflows
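The natural language query component above could compile a parsed question into a small traversal plan over Layer 2. A minimal sketch of the example query, with hand-written edges standing in for the graph and for the parser (both hypothetical):

```python
# Toy edge list; real queries would hit the Layer 2 graph store.
# The specific papers/methods here are invented for illustration.
edges = [
    ("FlashAttention", "IMPROVES_OVER", "standard attention"),
    ("linear attention", "IMPROVES_OVER", "standard attention"),
    ("standard attention", "USES_METHOD", "transformer"),
]

def methods_improving_over(target):
    """2-hop plan for 'what improves <target> efficiency?':
    find baselines that use the target, then what improves over them."""
    baselines = {s for s, r, d in edges if r == "USES_METHOD" and d == target}
    return sorted(s for s, r, d in edges
                  if r == "IMPROVES_OVER" and d in baselines)

print(methods_improving_over("transformer"))
# ['FlashAttention', 'linear attention']
```

The point of the sketch is the shape of the pipeline: parse once, traverse deterministically, and only hand the traversal results to a synthesis agent, which keeps answers grounded in graph edges rather than free-form generation.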
Rationale
Why Four Layers?
Each layer builds on the one below, and no layer can be skipped:
- Without extraction, there is no structured data to work with
- Without a knowledge graph, cross-document reasoning is impossible (each paper is isolated)
- Without synthesis, the graph is a data structure, not a knowledge asset
- Without an interface, the system is inaccessible to non-technical users
Why Build on UDOM?
The UDOM extraction pipeline is the validated foundation because:
- Multi-source fusion — No other system combines 3 independent sources (PDF, HTML, LaTeX) for a single document
- Universal types — 25 component types cover academic, business, legal, and patent documents
- Provenance tracking — Every component traces back to its extraction source, enabling confidence scoring
- Production-validated — 218/218 Grade A, zero failures, 62x performance improvement over prior approach
Why Not Use Existing Knowledge Graph Systems?
| System | Limitation |
|---|---|
| Semantic Scholar API | Read-only, no custom entity types, no cross-domain support |
| Google Knowledge Graph | Web-focused, not document-focused, no synthesis capabilities |
| Neo4j + manual schema | No extraction pipeline, no provenance, no multi-source alignment |
| Elicit's internal graph | Proprietary, single-domain (academic only), no API access |
The gap is in the combination: extraction + graph + synthesis + interface as an integrated system.
Consequences
Positive
- Compounding knowledge moat — Each document processed enriches the entire graph, creating increasing returns to scale
- Multi-vertical applicability — UDOM's universal types enable expansion beyond academic papers to legal, medical, financial, and patent domains
- Platform foundation — The four-layer architecture enables multiple products (research assistant, competitive intelligence, regulatory monitoring)
- Differentiated from incumbents — No existing product combines multi-source extraction, knowledge graph, agentic synthesis, and interactive interface
Negative
- Significant engineering investment — Estimated 6-8 months to MVP for layers 2-4 (per Judge 5 assessment)
- Knowledge graph design risk — Entity and relationship taxonomy must be validated empirically; over-engineering the schema risks building the wrong abstractions
- Diverse corpus validation required — Current 218-paper corpus is homogeneous (ML/arXiv); medical, legal, and business documents have different structures
- Inference cost at scale — Agentic synthesis may be expensive ($5-$50 per deep query, per the MoE assessment); cost optimization is critical for unit economics
Risks and Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| Knowledge graph schema doesn't generalize | High | Validate on 3+ domains before committing schema |
| Inference costs exceed revenue per query | High | Tiered processing (fast/cheap for simple queries, deep/expensive for synthesis) |
| Competitor ships similar product first | Medium | Focus on depth of knowledge graph moat, not breadth of features |
| Extraction quality degrades on non-arXiv content | Medium | Expand UDOM extractors (DOCX, HTML-native, OCR-only) before graph layer |
| Team bandwidth insufficient for 4-layer build | Medium | Phase implementation: graph first, synthesis second, interface third |
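The tiered-processing mitigation can be sketched as a simple query router. The tier names and cost figures below are illustrative assumptions (loosely anchored to the $5-$50 deep-query range cited above), not measured numbers:

```python
def route_query(query_type):
    """Route queries to the cheapest tier that can answer them.
    Tiers and per-query cost estimates are illustrative only."""
    tiers = {
        "lookup": ("graph-traversal", 0.01),       # cached reads, no LLM
        "summary": ("single-pass-llm", 0.50),      # one model call
        "synthesis": ("agentic-pipeline", 25.00),  # multi-step reasoning
    }
    return tiers[query_type]

engine, est_cost = route_query("synthesis")
print(engine, est_cost)  # agentic-pipeline 25.0
```

Even a coarse router like this changes unit economics materially if, as is typical, the large majority of queries are lookups rather than deep synthesis.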
Implementation Plan
Phase 1: Knowledge Graph Foundation (Months 1-2)
- Design entity and relationship taxonomy (ADR-175)
- Prototype graph storage (FoundationDB graph layer or Neo4j)
- Build entity extraction pipeline from UDOM components
- Validate on 1,000+ papers across 3+ domains
- Deliverable: Populated knowledge graph with cross-document entity linking
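The Phase 1 entity extraction pipeline might start as a rule-based pass that routes each UDOM component type to candidate entity types, before any learned extraction is added. The routing table and component fields below are assumptions, not the UDOM spec:

```python
# Hypothetical first-pass routing from UDOM component types (Layer 1)
# to entity types (Layer 2). Deliberately coarse: recall over precision.
ROUTING = {
    "heading": "Concept",
    "equation": "Method",
    "table": "Result",
    "reference": "Citation",
}

def extract_entities(components):
    """Emit (entity_type, text, doc_id) candidates for graph ingestion."""
    entities = []
    for c in components:
        entity_type = ROUTING.get(c["type"])
        if entity_type is not None:
            entities.append((entity_type, c["text"], c["doc_id"]))
    return entities

components = [
    {"type": "heading", "text": "Attention Mechanisms", "doc_id": "paper-A"},
    {"type": "paragraph", "text": "We propose...", "doc_id": "paper-A"},
    {"type": "table", "text": "97.3% accuracy", "doc_id": "paper-A"},
]
print(extract_entities(components))
# [('Concept', 'Attention Mechanisms', 'paper-A'), ('Result', '97.3% accuracy', 'paper-A')]
```

Starting rule-based keeps the Phase 1 deliverable measurable: the 3-domain validation can compare this baseline against LLM-based extraction before ADR-175 commits to a pipeline.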
Phase 2: Synthesis Engine (Months 3-5)
- Build cross-document reasoning agents
- Implement gap detection, contradiction identification, trend analysis
- Define agent orchestration patterns for multi-step synthesis
- Deliverable: Automated literature review generation from graph queries
Phase 3: Interface Layer (Months 4-6)
- Natural language query interface
- Visual graph explorer (React + D3/vis.js)
- Report generation templates
- API for programmatic access
- Deliverable: Interactive Research Continuum product
Phase 4: Validation and Scale (Months 6-8)
- Multi-domain validation (medical, legal, financial)
- Performance optimization (query latency, inference cost)
- Customer pilot programs (2-3 LOIs)
- Deliverable: Investor-ready metrics and customer evidence
Related Decisions
| ADR | Relationship |
|---|---|
| ADR-164 | Parent — UDOM schema defines Layer 1 extraction |
| ADR-165 | Related — Sidecar architecture for browser-native intelligence |
| ADR-175 (future) | Child — Knowledge graph schema and technology selection |
| ADR-176 (future) | Child — Synthesis agent orchestration patterns |
MoE Assessment Reference
This ADR was informed by a 5-judge MoE evaluation panel (2026-02-11):
| Metric | Value |
|---|---|
| Weighted consensus score | 7.4/10 |
| Verdict | APPROVED WITH CONDITIONS |
| Strongest dimension | Technical Feasibility (8.7/10) |
| Weakest dimension | Revenue Model (5.2/10) |
| Critical finding | Only 20-25% of full stack exists (extraction layer only) |
Full assessment: internal/analysis/research-continuum/CODITECT-Research-Continuum-MoE-Assessment.md
Decision Made: 2026-02-11
Decision Maker: Hal Casteel (Lead Architect)
Status: PROPOSED — pending team review and Phase 1 prototype validation