Biomni / Research Continuum Integration Analysis
Date: 2026-02-14 Author: Claude (Opus 4.6)
1. Project Comparison
| Dimension | Research Continuum | Biomni |
|---|---|---|
| Purpose | Agentic Knowledge Infrastructure - extraction, knowledge graphs, synthesis across all domains | General-purpose biomedical AI agent for autonomous research tasks |
| Scope | All document types, all research domains | Biomedical-specific (genomics, pharmacology, cell biology, etc.) |
| Phase | Architecture & Vision (Layer 1 built, Layers 2-4 planned) | Production-ready v0.0.8 with 20+ tool modules |
| License | Proprietary (AZ1.AI) | Apache 2.0 (Stanford SNAP) |
| Tech Stack | Python, UDOM pipeline, Docling, multi-source extractors | Python, LangChain/LangGraph, pydantic, code execution sandbox |
| LLM Integration | CODITECT's 776 agents via Claude | Multi-provider (OpenAI, Anthropic, Ollama, Gemini, Bedrock, Groq, Custom) |
| Data Assets | 218 academic papers (UDOM processed), ar5iv HTML coverage | 11GB biomedical data lake (auto-download), tool descriptions, know-how library |
| Key Capability | Document extraction (100% Grade A), UDOM schema (25 types) | Task execution: CRISPR screens, scRNA-seq, ADMET prediction, literature search |
| UI | None (CLI pipeline) | Gradio web interface, PDF report generation |
| Architecture | 4-layer: Extraction, Knowledge Graph, Synthesis, Interface | Single agent loop: Plan -> Code -> Execute -> Observe |
| MCP Support | CODITECT MCP servers (semantic search, call graph, impact) | MCP client support for external tool integration |
| Eval | QA grading (A-F per paper) | Biomni-Eval1: 433 instances across 10 biological reasoning tasks |
| LOC | ~15,478 (Python, UDOM pipeline) | ~83K utils.py alone, 20+ tool modules, massive env |
| Size on disk | 824K (code) | ~11GB (with data lake) |
2. Overlap Analysis
Where They Overlap
- Both process academic papers (Biomni has literature.py, Research Continuum has UDOM)
- Both use LLMs for reasoning about research content
- Both aim to automate research workflows
- Both are Python-based
Where They Are Complementary
| Research Continuum Provides | Biomni Provides |
|---|---|
| Superior document extraction (UDOM, 100% Grade A) | Domain-specific biomedical tools (20+ modules) |
| Multi-source alignment (PDF + HTML + LaTeX) | Executable research tasks (CRISPR, scRNA-seq, ADMET) |
| Provenance tracking to exact sentences | Pre-built biomedical data lake (11GB) |
| Knowledge graph architecture (planned) | Evaluation benchmarks (Biomni-Eval1) |
| 776 CODITECT agents for general reasoning | Know-How library for lab protocols |
| Proprietary competitive moat | Open-source community contributions |
Where They Diverge
- Domain specificity: Research Continuum is domain-agnostic; Biomni is biomedical-only
- Execution model: Research Continuum is a document-centric pipeline; Biomni is a task-centric agent
- Code execution: Biomni executes LLM-generated code with full system privileges; Research Continuum doesn't execute code
- Data dependency: Biomni requires 11GB data lake; Research Continuum is lightweight
- Release cadence: Biomni is community-driven open source; Research Continuum is proprietary and roadmap-driven
3. Integration Options Evaluated
Option A: Merge Into One Project
Verdict: NOT RECOMMENDED
- License incompatibility: Apache 2.0 (Biomni) vs Proprietary (Research Continuum) creates legal complexity
- Scope mismatch: Research Continuum targets all domains; merging Biomni narrows the brand
- Dependency bloat: Biomni's massive conda env (setup.sh) and 11GB data lake would bloat the general-purpose product
- Upstream divergence: Biomni is actively developed by Stanford; once its code is merged, tracking upstream changes becomes impractical
Option B: Fork Biomni Into Research Continuum
Verdict: NOT RECOMMENDED
- Loses upstream updates from Stanford SNAP's active development community
- Creates maintenance burden to cherry-pick fixes
- Biomni's setup.sh/conda env is deeply tied to its own structure
- Research Continuum's UDOM pipeline has no structural relationship to Biomni's agent loop
Option C: Keep Separate, Integrate at API/Interface Boundary (RECOMMENDED)
Verdict: RECOMMENDED
Architecture:

    coditect-research-continuum/          # Proprietary - Agentic Knowledge Infrastructure
      src/pdf-to-markdown/                # UDOM extraction pipeline
      [planned] knowledge-graph/          # Entity/relationship layer
      [planned] synthesis/                # Agent orchestration layer
      [planned] interface/                # Query API
    coditect-biomedical-research/         # Submodule of snap-stanford/Biomni (Apache 2.0)
      biomni/                             # Third-party biomedical agent
        agent/                            # A1 agent with LangGraph
        tool/                             # 20+ biomedical tool modules

Integration points (built in research-continuum):
1. UDOM → Biomni: Feed extracted documents to Biomni agent as context
2. Biomni → Knowledge Graph: Write Biomni task results back to KG
3. CODITECT agents → Biomni tools: Expose Biomni tools as MCP servers
4. Shared LLM routing: Both use Claude/Anthropic; unify via CODITECT's LLM layer
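As an illustration of integration point 1, the following is a minimal sketch of a UDOM-to-Biomni adapter. The UDOM field names used here (`title`, `blocks`, `type`, `text`) and both function names are assumptions made for the example, not the actual UDOM schema or an existing adapter. The sketch only builds the prompt text; it does not call Biomni.

```python
from typing import Any


def udom_to_context(doc: dict[str, Any], max_chars: int = 4000) -> str:
    """Flatten a UDOM-like document into a plain-text context block.

    The keys 'title', 'blocks', 'type', and 'text' are hypothetical;
    the real UDOM schema may name these differently.
    """
    lines = [f"# {doc.get('title', 'Untitled')}"]
    for block in doc.get("blocks", []):
        text = block.get("text", "").strip()
        if not text:
            continue  # skip empty blocks so they do not pad the prompt
        kind = block.get("type", "paragraph")
        lines.append(f"[{kind}] {text}")
    # Truncate so one long paper cannot exhaust the LLM context window.
    return "\n".join(lines)[:max_chars]


def build_task_prompt(task: str, docs: list[dict[str, Any]]) -> str:
    """Combine extracted documents and a task description into one prompt."""
    sections = [udom_to_context(d) for d in docs]
    return "Context documents:\n\n" + "\n\n".join(sections) + f"\n\nTask: {task}"
```

The resulting string would then be handed to the Biomni agent as its task input. Keeping the adapter as a pure function in research-continuum means the Biomni submodule needs no changes.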
Why this works:
- Clean IP boundary: Proprietary code stays in research-continuum; open-source stays in submodule
- Upstream tracking: `git submodule update --remote` pulls the latest Biomni without merge conflicts, provided the submodule carries no local changes
- Domain extensibility: Other domain-specific agents (legal, financial) can be added as sibling submodules
- No dependency contamination: Biomni's massive env stays isolated; research-continuum remains lightweight
- MCP bridge: Biomni already supports MCP; CODITECT already has MCP servers. The integration protocol exists.
4. Recommendation
Keep them as separate submodules under submodules/products/.
| Project | Path | Role |
|---|---|---|
| coditect-research-continuum | submodules/products/coditect-research-continuum | Platform: extraction, knowledge graph, synthesis engine |
| coditect-biomedical-research | submodules/products/coditect-biomedical-research | Domain module: biomedical-specific agent + tools |
Integration Roadmap
| Phase | Work | Priority |
|---|---|---|
| Phase 1 (Now) | Add Biomni submodule, document architecture boundary | Done |
| Phase 2 | Build UDOM-to-Biomni adapter: feed extracted papers as Biomni context | Medium |
| Phase 3 | Expose Biomni tools as MCP servers for CODITECT agent access | Medium |
| Phase 4 | Build results-to-KG writer: Biomni outputs flow into Research Continuum knowledge graph | High (when KG layer is built) |
| Phase 5 | Unified UI: Research Continuum interface embeds Biomni biomedical capabilities | Low |
Key Design Principle
Research Continuum is the platform. Biomni is a domain module. The platform handles extraction, knowledge graphs, and synthesis orchestration. Domain modules (biomedical, legal, financial) plug in via well-defined interfaces (MCP, UDOM JSON, knowledge graph API). This allows the Research Continuum to grow horizontally across domains while each domain module evolves independently.
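One way to picture the platform/domain-module split is a small registry keyed by domain name. The sketch below is illustrative only: `DomainModule`, `Platform`, and `EchoBiomedicalModule` are hypothetical names, and a real biomedical module would wrap Biomni behind MCP rather than echo its input.

```python
from typing import Protocol


class DomainModule(Protocol):
    """Hypothetical interface every domain module would satisfy."""

    name: str

    def run_task(self, task: str, context: str) -> dict: ...


class Platform:
    """Minimal registry that routes tasks to domain modules by name."""

    def __init__(self) -> None:
        self._modules: dict[str, DomainModule] = {}

    def register(self, module: DomainModule) -> None:
        if module.name in self._modules:
            raise ValueError(f"domain {module.name!r} is already registered")
        self._modules[module.name] = module

    def dispatch(self, domain: str, task: str, context: str = "") -> dict:
        try:
            module = self._modules[domain]
        except KeyError:
            raise LookupError(f"no module registered for domain {domain!r}") from None
        return module.run_task(task, context)


class EchoBiomedicalModule:
    """Stand-in for a Biomni-backed module; it only echoes its input."""

    name = "biomedical"

    def run_task(self, task: str, context: str) -> dict:
        return {"domain": self.name, "task": task, "context_chars": len(context)}
```

Because the platform depends only on the interface, a legal or financial module could be registered the same way without touching platform code.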
5. License & IP Considerations
- Biomni is Apache 2.0 - commercial use permitted, but some integrated tools/databases may have restrictions
- Biomni's `commercial_mode=True` flag filters non-commercial datasets
- Any proprietary integration code must live in `coditect-research-continuum`, not in the Biomni submodule
- The Biomni submodule remote is `https://github.com/snap-stanford/Biomni.git` - NEVER push to this remote (Git Push Safety directive)
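The no-push rule could be enforced mechanically with a git `pre-push` hook, which git invokes with the remote name and remote URL as arguments and which blocks the push when it exits non-zero. The following is a minimal sketch under that assumption; the function names are hypothetical and this hook is not part of the current setup.

```python
import sys

# Upstream repository that must never receive pushes (normalized form).
PROTECTED = "github.com/snap-stanford/biomni"


def normalize_remote(url: str) -> str:
    """Reduce https, ssh, and scp-style git URLs to 'host/owner/repo'."""
    url = url.strip().lower()
    if url.startswith("git@"):
        # scp-style form: git@github.com:owner/repo.git
        url = url[len("git@"):].replace(":", "/", 1)
    else:
        if "://" in url:
            url = url.split("://", 1)[1]
        host = url.split("/", 1)[0]
        if "@" in host:
            url = url.split("@", 1)[1]  # drop a 'user@' prefix
    url = url.rstrip("/")
    if url.endswith(".git"):
        url = url[: -len(".git")]
    return url


def is_protected(url: str) -> bool:
    return normalize_remote(url) == PROTECTED


def main(argv: list[str]) -> int:
    """Return 1 (block the push) when the target is the protected upstream."""
    # argv mirrors the hook call: [script, remote_name, remote_url]
    if len(argv) >= 3 and is_protected(argv[2]):
        print("Push blocked: upstream Biomni remote is read-only here.", file=sys.stderr)
        return 1
    return 0

# In an actual hook script the last line would be: sys.exit(main(sys.argv))
```

The URL is normalized first so that the HTTPS and SSH spellings of the same repository are both caught.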
6. Immediate Actions Completed
- Biomni added as git submodule at `submodules/products/coditect-biomedical-research` (commit `400c1f3`)
- This analysis document created at `internal/analysis/biomni-integration/`
- Session log updated with inception and milestone entries