Biomni / Research Continuum Integration Analysis

Date: 2026-02-14
Author: Claude (Opus 4.6)

1. Project Comparison

| Dimension | Research Continuum | Biomni |
| --- | --- | --- |
| Purpose | Agentic Knowledge Infrastructure - extraction, knowledge graphs, synthesis across all domains | General-purpose biomedical AI agent for autonomous research tasks |
| Scope | All document types, all research domains | Biomedical-specific (genomics, pharmacology, cell biology, etc.) |
| Phase | Architecture & Vision (Layer 1 built, Layers 2-4 planned) | Production-ready v0.0.8 with 20+ tool modules |
| License | Proprietary (AZ1.AI) | Apache 2.0 (Stanford SNAP) |
| Tech Stack | Python, UDOM pipeline, Docling, multi-source extractors | Python, LangChain/LangGraph, pydantic, code execution sandbox |
| LLM Integration | CODITECT's 776 agents via Claude | Multi-provider (OpenAI, Anthropic, Ollama, Gemini, Bedrock, Groq, Custom) |
| Data Assets | 218 academic papers (UDOM processed), ar5iv HTML coverage | 11GB biomedical data lake (auto-download), tool descriptions, know-how library |
| Key Capability | Document extraction (100% Grade A), UDOM schema (25 types) | Task execution: CRISPR screens, scRNA-seq, ADMET prediction, literature search |
| UI | None (CLI pipeline) | Gradio web interface, PDF report generation |
| Architecture | 4-layer: Extraction, Knowledge Graph, Synthesis, Interface | Single agent loop: Plan -> Code -> Execute -> Observe |
| MCP Support | CODITECT MCP servers (semantic search, call graph, impact) | MCP client support for external tool integration |
| Eval | QA grading (A-F per paper) | Biomni-Eval1: 433 instances across 10 biological reasoning tasks |
| LOC | ~15,478 (Python, UDOM pipeline) | ~83K utils.py alone, 20+ tool modules, massive env |
| Size on disk | 824K (code) | ~11GB (with data lake) |

2. Overlap Analysis

Where They Overlap

Both process academic papers (Biomni has literature.py, Research Continuum has UDOM)
Both use LLMs for reasoning about research content
Both aim to automate research workflows
Both are Python-based

Where They Are Complementary

| Research Continuum Provides | Biomni Provides |
| --- | --- |
| Superior document extraction (UDOM, 100% Grade A) | Domain-specific biomedical tools (20+ modules) |
| Multi-source alignment (PDF + HTML + LaTeX) | Executable research tasks (CRISPR, scRNA-seq, ADMET) |
| Provenance tracking to exact sentences | Pre-built biomedical data lake (11GB) |
| Knowledge graph architecture (planned) | Evaluation benchmarks (Biomni-Eval1) |
| 776 CODITECT agents for general reasoning | Know-How library for lab protocols |
| Proprietary competitive moat | Open-source community contributions |

Where They Diverge

Domain specificity: Research Continuum is domain-agnostic; Biomni is biomedical-only
Execution model: Research Continuum is a document-centric pipeline; Biomni is a task-centric agent
Code execution: Biomni executes LLM-generated code with full system privileges; Research Continuum doesn't execute code
Data dependency: Biomni requires 11GB data lake; Research Continuum is lightweight
Release cadence: Biomni is community-driven open source; Research Continuum follows a proprietary roadmap

3. Integration Options Evaluated

Option A: Merge Into One Project

Verdict: NOT RECOMMENDED

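License mixing: Apache 2.0 (Biomni) code may be included in a proprietary product, but its license and attribution notices must be preserved, which blurs the IP boundary with the proprietary Research Continuum code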
Scope mismatch: Research Continuum targets all domains; merging Biomni narrows the brand
Dependency bloat: Biomni's massive conda env (setup.sh) and 11GB data lake would bloat the general-purpose product
Upstream divergence: Biomni is actively developed by Stanford; merging makes upstream tracking impossible

Option B: Fork Biomni Into Research Continuum

Verdict: NOT RECOMMENDED

Loses upstream updates from Stanford SNAP's active development community
Creates maintenance burden to cherry-pick fixes
Biomni's setup.sh/conda env is deeply tied to its own structure
Research Continuum's UDOM pipeline has no structural relationship to Biomni's agent loop

Option C: Keep Separate, Integrate at API/Interface Boundary (RECOMMENDED)

Verdict: RECOMMENDED

Architecture:

coditect-research-continuum/          # Proprietary - Agentic Knowledge Infrastructure
  src/pdf-to-markdown/                # UDOM extraction pipeline
  [planned] knowledge-graph/          # Entity/relationship layer
  [planned] synthesis/                # Agent orchestration layer
  [planned] interface/                # Query API

coditect-biomedical-research/         # Submodule of snap-stanford/Biomni (Apache 2.0)
  biomni/                             # Third-party biomedical agent
    agent/                            # A1 agent with LangGraph
    tool/                             # 20+ biomedical tool modules

Integration points (built in research-continuum):
  1. UDOM → Biomni: Feed extracted documents to Biomni agent as context
  2. Biomni → Knowledge Graph: Write Biomni task results back to KG
  3. CODITECT agents → Biomni tools: Expose Biomni tools as MCP servers
  4. Shared LLM routing: Both use Claude/Anthropic; unify via CODITECT's LLM layer

Why this works:

Clean IP boundary: Proprietary code stays in research-continuum; open-source stays in submodule
Upstream tracking: git submodule update --remote pulls the latest Biomni; because no proprietary changes live inside the submodule, there is nothing to cause merge conflicts
Domain extensibility: Other domain-specific agents (legal, financial) can be added as sibling submodules
No dependency contamination: Biomni's massive env stays isolated; research-continuum remains lightweight
MCP bridge: Biomni already supports MCP; CODITECT already has MCP servers. The integration protocol exists.
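
Integration point 1 above (UDOM → Biomni) could be sketched roughly as follows. This is a minimal illustration, not the actual adapter: the UDOM field names (`title`, `blocks`, `id`, `type`, `text`) and both function names are assumptions made for the example, and the real UDOM JSON schema may differ.

```python
from typing import Any


def udom_to_context(doc: dict[str, Any], max_chars: int = 4000) -> str:
    """Flatten a UDOM-style document (hypothetical schema) into plain text."""
    lines = [f"Title: {doc.get('title', 'Untitled')}"]
    for block in doc.get("blocks", []):
        text = block.get("text", "").strip()
        if not text:
            continue  # skip empty blocks
        # Keep the block id so each statement stays traceable to its source.
        block_id = block.get("id", "?")
        block_type = block.get("type", "unknown")
        lines.append(f"[{block_id}] ({block_type}) {text}")
    # Truncate so the context fits a bounded prompt budget.
    return "\n".join(lines)[:max_chars]


def build_task_prompt(task: str, doc: dict[str, Any]) -> str:
    """Combine a research task with the extracted paper context."""
    context = udom_to_context(doc)
    return f"{task}\n\nContext extracted from the source paper:\n{context}"
```

The resulting string would then be handed to the Biomni agent as its task prompt; the exact entry point depends on the Biomni version pinned in the submodule. Keeping the block ids in the context is one way to preserve Research Continuum's sentence-level provenance across the boundary.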

4. Recommendation

Keep them as separate submodules under submodules/products/.

| Project | Path | Role |
| --- | --- | --- |
| `coditect-research-continuum` | `submodules/products/coditect-research-continuum` | Platform: extraction, knowledge graph, synthesis engine |
| `coditect-biomedical-research` | `submodules/products/coditect-biomedical-research` | Domain module: biomedical-specific agent + tools |

Integration Roadmap

| Phase | Work | Priority / Status |
| --- | --- | --- |
| Phase 1 (Now) | Add Biomni submodule, document architecture boundary | Done |
| Phase 2 | Build UDOM-to-Biomni adapter: feed extracted papers as Biomni context | Medium |
| Phase 3 | Expose Biomni tools as MCP servers for CODITECT agent access | Medium |
| Phase 4 | Build results-to-KG writer: Biomni outputs flow into Research Continuum knowledge graph | High (when KG layer is built) |
| Phase 5 | Unified UI: Research Continuum interface embeds Biomni biomedical capabilities | Low |
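
As a rough illustration of the Phase 4 results-to-KG writer, the sketch below turns a task result into provenance-tagged triples and writes them to a store. Everything here is hypothetical: the result shape (`run_id`, `findings`), the `Triple` type, and the in-memory store stand in for a knowledge graph layer that is still planned.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    obj: str
    source: str  # provenance: which task run produced this edge


def result_to_triples(result: dict) -> list[Triple]:
    """Convert a task result (assumed shape) into provenance-tagged triples."""
    source = f"biomni:{result['run_id']}"
    return [
        Triple(f["subject"], f["predicate"], f["object"], source)
        for f in result.get("findings", [])
    ]


class InMemoryKG:
    """Stand-in for the planned knowledge graph API."""

    def __init__(self) -> None:
        self._edges: set[Triple] = set()

    def write(self, triples: list[Triple]) -> int:
        """Add triples and return how many were new (writes are idempotent)."""
        before = len(self._edges)
        self._edges.update(triples)
        return len(self._edges) - before
```

Making the triples frozen (hashable) lets the store deduplicate repeated writes of the same run, which matters if a task is retried.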

Key Design Principle

Research Continuum is the platform. Biomni is a domain module. The platform handles extraction, knowledge graphs, and synthesis orchestration. Domain modules (biomedical, legal, financial) plug in via well-defined interfaces (MCP, UDOM JSON, knowledge graph API). This allows the Research Continuum to grow horizontally across domains while each domain module evolves independently.
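
The plug-in relationship described above might look something like the following sketch. The `DomainModule` protocol, the `Platform` registry, and the stub module are all assumptions for illustration; a real module would wrap the Biomni agent behind MCP or a similar interface rather than echoing its inputs.

```python
from typing import Protocol


class DomainModule(Protocol):
    """Hypothetical interface every domain module would satisfy."""

    name: str

    def run_task(self, task: str, context: str) -> dict: ...


class StubBiomedicalModule:
    """Placeholder for a Biomni-backed module; echoes inputs only."""

    name = "biomedical"

    def run_task(self, task: str, context: str) -> dict:
        return {"module": self.name, "task": task, "context_chars": len(context)}


class Platform:
    """Registry that routes tasks to domain modules by name."""

    def __init__(self) -> None:
        self._modules: dict[str, DomainModule] = {}

    def register(self, module: DomainModule) -> None:
        if module.name in self._modules:
            raise ValueError(f"module already registered: {module.name}")
        self._modules[module.name] = module

    def dispatch(self, domain: str, task: str, context: str = "") -> dict:
        # Raises KeyError if no module is registered for the domain.
        return self._modules[domain].run_task(task, context)
```

Because the platform depends only on the protocol, a legal or financial module could be registered later without changes to platform code.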

5. License & IP Considerations

Biomni is Apache 2.0 - commercial use permitted, but some integrated tools/databases may have restrictions
Biomni's commercial_mode=True flag filters non-commercial datasets
Any proprietary integration code must live in coditect-research-continuum, not in the Biomni submodule
The Biomni submodule remote is https://github.com/snap-stanford/Biomni.git - NEVER push to this remote (Git Push Safety directive)
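
One way the push-safety rule above could be enforced mechanically is a small check on the remote URL. This is only a sketch: the function names and the `UPSTREAM_READ_ONLY` set are made up for the example, and it is not the existing Git Push Safety directive's implementation.

```python
# Remotes that must never be pushed to, in normalized form (assumed list).
UPSTREAM_READ_ONLY = {"github.com/snap-stanford/biomni"}


def normalize_remote(url: str) -> str:
    """Reduce HTTPS and SSH remote URLs to a comparable host/owner/repo form."""
    url = url.strip().lower()
    if url.startswith("git@"):
        # git@github.com:owner/repo.git -> github.com/owner/repo.git
        url = url[len("git@"):].replace(":", "/", 1)
    else:
        # https://github.com/owner/repo.git -> github.com/owner/repo.git
        url = url.split("://", 1)[-1]
    url = url.rstrip("/")
    if url.endswith(".git"):
        url = url[: -len(".git")]
    return url


def is_push_allowed(remote_url: str) -> bool:
    """Return False for any remote on the read-only upstream list."""
    return normalize_remote(remote_url) not in UPSTREAM_READ_ONLY
```

A check like this could be called from a Git pre-push hook, which receives the remote name and URL as its two arguments, and exit non-zero when `is_push_allowed` returns False.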

6. Immediate Actions Completed

Biomni added as git submodule at submodules/products/coditect-biomedical-research (commit 400c1f3)
This analysis document created at internal/analysis/biomni-integration/
Session log updated with inception and milestone entries
