Research Continuum Code Migration Analysis
Date: 2026-02-11
Author: Claude (Opus 4.6)
Task: T.6 (Research Continuum product architecture)
Project: PILOT
Status: PROPOSED (awaiting architectural decision)
Problem Statement
The pdf-to-markdown pipeline and UDOM extraction system (~15K LOC, 25 Python files) currently live in coditect-core/skills/pdf-to-markdown/. Per the Research Continuum vision (session log 2026-02-11T08:56:47Z), this code is Layer 1 (Extraction) of a four-layer product (ADR-174). It needs to move to the coditect-research-continuum submodule for independent product development.
The question: what is the optimal long-term code ownership model?
Current Location (coditect-core)
skills/pdf-to-markdown/
├── SKILL.md # 16KB skill documentation
├── .venv/ # Python venv (regenerable)
├── analyze-new-artifacts/ # ~1.4GB batch outputs (artifacts)
└── src/
    ├── convert.py # PDF-to-Markdown converter (v2.5, standalone)
    ├── pipeline.py # Batch pipeline + UDOM orchestrator (v1.7)
    ├── udom/ # UDOM package
    │   ├── __init__.py # Package init
    │   ├── schema.py # Type definitions (29K)
    │   ├── taxonomy.py # 25-type component taxonomy (13K)
    │   ├── mapper.py # Cross-source alignment (38K)
    │   ├── assembler.py # Canonical output generation (40K)
    │   ├── extractors/ # PDF/HTML/LaTeX extractors
    │   ├── providers/ # Data source providers
    │   └── assemblers/ # Output format assemblers
    ├── udom_artifact_finder.py # Artifact discovery utilities
    ├── udom_dup_finder.py # Deduplication utilities
    └── udom_linter.py # Schema linting
config/schemas/
└── udom-v1.schema.json # JSON Schema for UDOM
Dependency Graph
pipeline.py (orchestrator)
├── imports convert.py (PDF extraction engine)
├── imports udom/ (UDOM extraction, mapping, assembly)
│ ├── extractors/ (PDF/HTML/LaTeX source extraction)
│ ├── mapper.py (cross-source alignment)
│ ├── assembler.py (canonical output generation)
│ ├── schema.py (UDOM type definitions)
│ └── taxonomy.py (component type taxonomy)
└── outputs: .md, .udom.json, .content.jsonl, .audit.jsonl
convert.py (standalone)
├── NO dependency on pipeline.py or udom/
├── Standalone PDF-to-Markdown conversion
└── outputs: .md only
Key observation: convert.py is independently useful and has zero dependency on UDOM. pipeline.py depends on both convert.py and udom/.
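The independence of convert.py is what makes the split options viable, so it may be worth guarding with an automated check. The following is a minimal sketch, not part of the existing codebase: it parses a source string with the standard `ast` module and reports any import of `pipeline` or `udom`. The function names and the `FORBIDDEN` set are hypothetical.

```python
import ast

# Assumption: these are the module names convert.py must never import.
FORBIDDEN = {"pipeline", "udom"}


def imported_top_level_modules(source: str) -> set:
    """Return the top-level module names imported anywhere in `source`."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import a.b.c" contributes "a"
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            if node.module:
                # "from a.b import c" and "from .a import c" contribute "a"
                modules.add(node.module.split(".")[0])
            else:
                # "from . import a" contributes "a"
                modules.update(alias.name for alias in node.names)
    return modules


def forbidden_imports(source: str) -> set:
    """Return the forbidden modules that `source` imports (empty set if clean)."""
    return imported_top_level_modules(source) & FORBIDDEN
```

In practice the check would be run on the text of convert.py, for example `forbidden_imports(Path("convert.py").read_text())`, and a non-empty result would fail the build.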
Architecture Options
Option A: Full Move + Stub
Move ALL source code to research-continuum. Leave a stub SKILL.md in coditect-core pointing to the new location.
| Dimension | Assessment |
|---|---|
| SSOT | Clean — one location for all code |
| Framework impact | Breaks /pdf-to-markdown skill for non-RC users |
| Dependency | None — everything in one repo |
| Migration effort | Medium — move files, update imports |
| Long-term maintenance | Simple — single repo |
Verdict: Clean but removes general-purpose PDF utility from the framework.
Option B: Split — Generic Stays, Product Moves
convert.py (generic PDF-to-Markdown) stays in coditect-core. pipeline.py + udom/ moves to research-continuum.
| Dimension | Assessment |
|---|---|
| SSOT | Split — convert.py in core, pipeline+udom in product |
| Framework impact | None — /pdf-to-markdown still works for basic conversion |
| Dependency | Cross-repo: pipeline.py imports convert.py |
| Migration effort | Medium — move subset, add import path |
| Long-term maintenance | Moderate — two repos to maintain, import bridge |
Verdict: Natural boundary respects dual purpose but creates cross-repo dependency.
Option C: Product Owns, Framework Installs
All code moves to research-continuum. Published as installable package. coditect-core's skill invokes the installed package.
| Dimension | Assessment |
|---|---|
| SSOT | Clean — product owns, framework consumes |
| Framework impact | None if package installed; broken if not |
| Dependency | Proper: pip install or path-based |
| Migration effort | High — needs package infrastructure |
| Long-term maintenance | Best — versioned, proper dependency management |
Verdict: Best long-term architecture but premature for current stage.
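The "broken if not installed" failure mode of Option C can at least be made explicit. The sketch below shows one way a framework-side skill could check for the product package before delegating to it. The import name `coditect_research_continuum` is a placeholder; the source does not name the package.

```python
import importlib.util

# Hypothetical import name for the packaged product (not specified in this analysis).
PACKAGE_NAME = "coditect_research_continuum"


def package_available(name: str = PACKAGE_NAME) -> bool:
    """Return True if a top-level module called `name` can be imported here."""
    return importlib.util.find_spec(name) is not None


def require_package(name: str = PACKAGE_NAME) -> None:
    """Raise a descriptive error instead of a bare ImportError deep in the skill."""
    if not package_available(name):
        raise RuntimeError(
            f"The /pdf-to-markdown skill needs the '{name}' package. "
            "Install it into this environment and retry."
        )
```

Calling `require_package()` at the top of the skill entry point would turn a missing dependency into a clear, actionable message.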
Option D: Keep in coditect-core, Product References It
Code stays in coditect-core. research-continuum depends on coditect-core for extraction.
| Dimension | Assessment |
|---|---|
| SSOT | Single location |
| Framework impact | None |
| Dependency | Product depends on framework (inversion) |
| Migration effort | Zero |
| Long-term maintenance | Poor — product code in framework, bloats core |
Verdict: Easy but wrong direction architecturally. Product code should not live in the framework.
Recommendation: Option B (Split) with Evolution Path
Immediate (now)
Split at the natural boundary:
| Stays in coditect-core | Moves to research-continuum |
|---|---|
| src/convert.py | src/pipeline.py |
| SKILL.md (updated for convert.py only) | src/udom/ (entire package) |
| .venv/ (regenerable) | src/udom_artifact_finder.py |
| analyze-new-artifacts/ (batch outputs) | src/udom_dup_finder.py |
| | src/udom_linter.py |
| | config/schemas/udom-v1.schema.json |
Cross-repo import solution
pipeline.py imports convert.py. In the rollout-master monorepo tree, both submodules are siblings:
coditect-rollout-master/
├── submodules/core/coditect-core/skills/pdf-to-markdown/src/convert.py
└── submodules/products/coditect-research-continuum/src/pipeline/pipeline.py
Approach: sys.path injection at pipeline.py startup to find convert.py via relative traversal or CODITECT_CORE_PATH environment variable.
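A minimal sketch of that approach follows. It assumes `CODITECT_CORE_PATH`, when set, points at the root of the coditect-core checkout (the source does not specify this), and otherwise walks up from a starting directory treating each ancestor as a candidate rollout-master root. The function names are hypothetical.

```python
import os
import sys
from pathlib import Path
from typing import Mapping, Optional

# Directory holding convert.py, relative to the coditect-core root (from the tree above).
SRC_IN_CORE = Path("skills/pdf-to-markdown/src")
# coditect-core, relative to the rollout-master root (from the tree above).
CORE_IN_ROLLOUT = Path("submodules/core/coditect-core")


def find_convert_dir(start: Path, env: Mapping[str, str] = os.environ) -> Optional[Path]:
    """Return the directory containing convert.py, or None if it cannot be found."""
    candidates = []
    override = env.get("CODITECT_CORE_PATH")
    if override:
        # Assumption: the variable names the coditect-core root; it takes precedence.
        candidates.append(Path(override) / SRC_IN_CORE)
    # Relative traversal: try every ancestor of `start` as the rollout-master root.
    for ancestor in [start, *start.parents]:
        candidates.append(ancestor / CORE_IN_ROLLOUT / SRC_IN_CORE)
    for candidate in candidates:
        if (candidate / "convert.py").is_file():
            return candidate.resolve()
    return None


def inject_convert_path(start: Path, env: Mapping[str, str] = os.environ) -> Path:
    """Put convert.py's directory on sys.path so `import convert` works."""
    found = find_convert_dir(start, env)
    if found is None:
        raise ImportError(
            "convert.py not found; set CODITECT_CORE_PATH to the coditect-core checkout"
        )
    if str(found) not in sys.path:
        sys.path.insert(0, str(found))
    return found
```

pipeline.py would call `inject_convert_path(Path(__file__).resolve().parent)` before its `import convert` statement. Failing loudly when neither lookup succeeds addresses the import-breakage risk more directly than a silent fallback.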
Short-term (1-2 months)
- research-continuum's pipeline imports convert.py via relative path (both in same rollout-master tree)
- SKILL.md in coditect-core updated to document convert.py-only usage
- New README/docs in research-continuum for the full UDOM pipeline
Medium-term (3-6 months)
- Extract convert.py into a shared coditect-pdf-utils package
- Both repos depend on it via pip install
Long-term (6+ months)
- Full Option C: research-continuum is a self-contained product package
- coditect-core's /pdf-to-markdown skill is a thin wrapper around the package
Target Structure (research-continuum after move)
coditect-research-continuum/
├── README.md
├── docs/
│ ├── vision/
│ │ └── CODITECT-Research-Continuum-Vision-Document.md
│ ├── architecture/
│ │ └── ADR-174-research-continuum-agentic-knowledge-infrastructure.md
│ └── analysis/
│ └── CODITECT-Research-Continuum-MoE-Assessment.md
├── schemas/
│ └── udom-v1.schema.json
└── src/
    ├── pipeline/
    │   └── pipeline.py
    ├── udom/
    │   ├── __init__.py
    │   ├── schema.py
    │   ├── taxonomy.py
    │   ├── mapper.py
    │   ├── assembler.py
    │   ├── extractors/
    │   ├── providers/
    │   └── assemblers/
    └── tools/
        ├── udom_artifact_finder.py
        ├── udom_dup_finder.py
        └── udom_linter.py
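The mapping from current to target locations can be written down as data, which makes the move reviewable before any file is touched. This is a sketch only. The paths are read off the two trees in this document, with sources assumed relative to the coditect-core root and destinations relative to the coditect-research-continuum root. `MOVE_PLAN` and the helper names are hypothetical.

```python
from pathlib import PurePosixPath

# Source (in coditect-core) -> destination (in coditect-research-continuum).
# convert.py is deliberately absent: it stays in coditect-core under Option B.
MOVE_PLAN = {
    "skills/pdf-to-markdown/src/pipeline.py": "src/pipeline/pipeline.py",
    "skills/pdf-to-markdown/src/udom": "src/udom",
    "skills/pdf-to-markdown/src/udom_artifact_finder.py": "src/tools/udom_artifact_finder.py",
    "skills/pdf-to-markdown/src/udom_dup_finder.py": "src/tools/udom_dup_finder.py",
    "skills/pdf-to-markdown/src/udom_linter.py": "src/tools/udom_linter.py",
    "config/schemas/udom-v1.schema.json": "schemas/udom-v1.schema.json",
}


def destination_dirs(plan: dict) -> list:
    """Directories that must exist in the target repo before anything is copied."""
    return sorted({str(PurePosixPath(dest).parent) for dest in plan.values()})


def describe(plan: dict) -> list:
    """Human-readable dry-run listing, one line per move."""
    return [f"{src} -> {dest}" for src, dest in plan.items()]
```

Printing `describe(MOVE_PLAN)` gives a dry-run listing that can be attached to the migration commit message to document lineage.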
Risk Assessment
| Risk | Impact | Likelihood | Mitigation |
|---|---|---|---|
| Import breakage (pipeline cannot find convert.py) | High | Medium | sys.path injection + CODITECT_CORE_PATH env var |
| UDOM schema referenced by other tools | Medium | Low | Schema stays in both repos; research-continuum is SSOT |
| Batch pipeline regression | High | Low | Validate with 5-paper test batch post-move |
| Git history loss | Low | Certain | Document lineage in commit message; git log --follow pre-move |
| coditect-core SKILL.md references break | Medium | Medium | Update SKILL.md to document convert.py only |
| analyze-new-artifacts becomes orphaned | Low | Low | Stays in coditect-core; future batches output to research-continuum |
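The post-move test batch could be checked mechanically. The sketch below verifies that every paper in a batch produced the four output types listed in the dependency graph. It assumes, hypothetically, that outputs are written to a single directory and named `<stem><suffix>`; the actual pipeline layout may differ.

```python
from pathlib import Path

# Output suffixes listed for pipeline.py in the dependency graph.
EXPECTED_SUFFIXES = (".md", ".udom.json", ".content.jsonl", ".audit.jsonl")


def missing_outputs(output_dir: Path, stems: list) -> dict:
    """Map each paper stem to the expected output files that are absent or empty."""
    problems = {}
    for stem in stems:
        missing = []
        for suffix in EXPECTED_SUFFIXES:
            path = output_dir / (stem + suffix)
            # An empty file is treated as a failed output, same as a missing one.
            if not path.is_file() or path.stat().st_size == 0:
                missing.append(path.name)
        if missing:
            problems[stem] = missing
    return problems
```

An empty result for all five papers would indicate the batch produced the full set of files. It says nothing about content quality, so comparing against pre-move outputs would still be needed to rule out regressions.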
Decision Required
Awaiting approval from Lead Architect on:
- Which option? (Recommended: Option B — Split)
- Cross-repo import strategy? (sys.path vs env var vs symlink)
- Timeline? (Immediate move vs. staged migration)
- Batch artifacts disposition? (Stay in core vs. move to product)
Related Documents
- ADR-174: internal/architecture/adrs/ADR-174-research-continuum-agentic-knowledge-infrastructure.md
- Vision Document: internal/analysis/research-continuum/CODITECT-Research-Continuum-Vision-Document.md
- MoE Assessment: internal/analysis/research-continuum/CODITECT-Research-Continuum-MoE-Assessment.md
- UDOM Schema ADR: internal/architecture/adrs/ADR-164-universal-document-object-model.md