Research Continuum Code Migration Analysis

Date: 2026-02-11
Author: Claude (Opus 4.6)
Task: T.6, Research Continuum product architecture
Project: PILOT
Status: PROPOSED, awaiting architectural decision

Problem Statement

The pdf-to-markdown pipeline and UDOM extraction system (~15K LOC, 25 Python files) currently live in coditect-core/skills/pdf-to-markdown/. Per the Research Continuum vision (session log 2026-02-11T08:56:47Z), this code is Layer 1 (Extraction) of a four-layer product (ADR-174). It needs to move to the coditect-research-continuum submodule so that it can be developed as an independent product.

The question: what is the optimal long-term code ownership model?

Current Location (coditect-core)

skills/pdf-to-markdown/
├── SKILL.md                    # 16KB skill documentation
├── .venv/                      # Python venv (regenerable)
├── analyze-new-artifacts/      # ~1.4GB batch outputs (artifacts)
└── src/
    ├── convert.py              # PDF-to-Markdown converter (v2.5, standalone)
    ├── pipeline.py             # Batch pipeline + UDOM orchestrator (v1.7)
    ├── udom/                   # UDOM package
    │   ├── __init__.py         # Package init
    │   ├── schema.py           # Type definitions (29K)
    │   ├── taxonomy.py         # 25-type component taxonomy (13K)
    │   ├── mapper.py           # Cross-source alignment (38K)
    │   ├── assembler.py        # Canonical output generation (40K)
    │   ├── extractors/         # PDF/HTML/LaTeX extractors
    │   ├── providers/          # Data source providers
    │   └── assemblers/         # Output format assemblers
    ├── udom_artifact_finder.py # Artifact discovery utilities
    ├── udom_dup_finder.py      # Deduplication utilities
    └── udom_linter.py          # Schema linting

config/schemas/
    └── udom-v1.schema.json     # JSON Schema for UDOM

Dependency Graph

pipeline.py (orchestrator)
  ├── imports convert.py (PDF extraction engine)
  ├── imports udom/ (UDOM extraction, mapping, assembly)
  │   ├── extractors/ (PDF/HTML/LaTeX source extraction)
  │   ├── mapper.py (cross-source alignment)
  │   ├── assembler.py (canonical output generation)
  │   ├── schema.py (UDOM type definitions)
  │   └── taxonomy.py (component type taxonomy)
  └── outputs: .md, .udom.json, .content.jsonl, .audit.jsonl

convert.py (standalone)
  ├── NO dependency on pipeline.py or udom/
  ├── Standalone PDF-to-Markdown conversion
  └── outputs: .md only

Key observation: convert.py is independently useful and has zero dependency on UDOM. pipeline.py depends on both convert.py and udom/.
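
The independence claim above can be re-checked mechanically before any files move. The following is a minimal sketch, assuming a static scan of import statements is enough; it would not detect dynamic imports such as `importlib.import_module`. The helper names `imported_modules` and `forbidden_imports` are illustrative and do not exist in the codebase.

```python
import ast


def imported_modules(source: str) -> set[str]:
    """Return the top-level module names referenced by import statements."""
    names: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # import a.b.c  -> "a"
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            if node.module:
                # from a.b import c  /  from .a import c  -> "a"
                names.add(node.module.split(".")[0])
            else:
                # from . import a  -> "a"
                names.update(alias.name for alias in node.names)
    return names


def forbidden_imports(source: str, forbidden=("udom", "pipeline")) -> set[str]:
    """Return the forbidden modules that the given source imports."""
    return imported_modules(source) & set(forbidden)
```

Running `forbidden_imports(Path("convert.py").read_text())` from the skill's src/ directory should return an empty set if the observation holds.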

Architecture Options

Option A: Full Move + Stub

Move ALL source code to research-continuum. Leave a stub SKILL.md in coditect-core pointing to the new location.

Dimension	Assessment
SSOT	Clean — one location for all code
Framework impact	Breaks the `/pdf-to-markdown` skill for users who do not have research-continuum
Dependency	None — everything in one repo
Migration effort	Medium — move files, update imports
Long-term maintenance	Simple — single repo

Verdict: Clean but removes general-purpose PDF utility from the framework.

Option B: Split — Generic Stays, Product Moves

convert.py (generic PDF-to-Markdown) stays in coditect-core. pipeline.py + udom/ moves to research-continuum.

Dimension	Assessment
SSOT	Split — convert.py in core, pipeline+udom in product
Framework impact	None — `/pdf-to-markdown` still works for basic conversion
Dependency	Cross-repo: pipeline.py imports convert.py
Migration effort	Medium — move subset, add import path
Long-term maintenance	Moderate — two repos to maintain, import bridge

Verdict: Natural boundary respects dual purpose but creates cross-repo dependency.

Option C: Product Owns, Framework Installs

All code moves to research-continuum. Published as installable package. coditect-core's skill invokes the installed package.

Dimension	Assessment
SSOT	Clean — product owns, framework consumes
Framework impact	None if package installed; broken if not
Dependency	Proper: pip install or path-based
Migration effort	High — needs package infrastructure
Long-term maintenance	Best — versioned, proper dependency management

Verdict: Best long-term architecture but premature for current stage.
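
Under Option C, the "broken if not installed" case could at least fail with a clear message. A minimal sketch of such a preflight check follows, using the standard library's `importlib.util.find_spec`. The module name `research_continuum` is a hypothetical placeholder; no package name has been decided in this document.

```python
import importlib.util


def package_available(module_name: str) -> bool:
    """True if a top-level module of this name can be imported here."""
    return importlib.util.find_spec(module_name) is not None


def skill_preflight(module_name: str = "research_continuum") -> str:
    """Return a status line for the skill wrapper.

    The default module name is a placeholder, not a real package.
    """
    if package_available(module_name):
        return f"ok: {module_name} is installed"
    return f"missing: install {module_name} before running the full pipeline"
```

The check only takes top-level module names; passing a dotted name whose parent is missing makes `find_spec` raise instead of returning None.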

Option D: Keep in coditect-core, Product References It

Code stays in coditect-core. research-continuum depends on coditect-core for extraction.

Dimension	Assessment
SSOT	Single location
Framework impact	None
Dependency	Product depends on framework (inversion)
Migration effort	Zero
Long-term maintenance	Poor — product code in framework, bloats core

Verdict: Easy but wrong direction architecturally. Product code should not live in the framework.

Recommendation: Option B (Split) with Evolution Path

Immediate (now)

Split at the natural boundary:

Stays in coditect-core	Moves to research-continuum
`src/convert.py`	`src/pipeline.py`
`SKILL.md` (updated for convert.py only)	`src/udom/` (entire package)
`.venv/` (regenerable)	`src/udom_artifact_finder.py`
`analyze-new-artifacts/` (batch outputs)	`src/udom_dup_finder.py`
	`src/udom_linter.py`
	`config/schemas/udom-v1.schema.json`
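
The split can also be written down as data and reviewed as a dry run before anything is moved. This is a sketch only: target paths follow the target structure given later in this document, and it assumes `config/schemas/` is relative to the coditect-core root. `MOVE_PLAN`, `STAYS`, and `dry_run` are illustrative names.

```python
from pathlib import PurePosixPath

# Sources are relative to coditect-core; targets are relative to
# coditect-research-continuum.
CORE_SKILL = PurePosixPath("skills/pdf-to-markdown")

MOVE_PLAN = {
    CORE_SKILL / "src/pipeline.py": PurePosixPath("src/pipeline/pipeline.py"),
    CORE_SKILL / "src/udom": PurePosixPath("src/udom"),
    CORE_SKILL / "src/udom_artifact_finder.py": PurePosixPath("src/tools/udom_artifact_finder.py"),
    CORE_SKILL / "src/udom_dup_finder.py": PurePosixPath("src/tools/udom_dup_finder.py"),
    CORE_SKILL / "src/udom_linter.py": PurePosixPath("src/tools/udom_linter.py"),
    # Assumption: config/ sits at the coditect-core root.
    PurePosixPath("config/schemas/udom-v1.schema.json"): PurePosixPath("schemas/udom-v1.schema.json"),
}

STAYS = [
    CORE_SKILL / "src/convert.py",
    CORE_SKILL / "SKILL.md",
    CORE_SKILL / ".venv",
    CORE_SKILL / "analyze-new-artifacts",
]


def dry_run(plan: dict) -> list[str]:
    """Describe each planned move without touching the filesystem."""
    return [f"{src} -> {dst}" for src, dst in plan.items()]
```

Printing `dry_run(MOVE_PLAN)` gives a reviewable list, and checking that `set(MOVE_PLAN) & set(STAYS)` is empty confirms that no path is assigned to both sides of the split.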

Cross-repo import solution

pipeline.py imports convert.py. In the rollout-master monorepo tree, both submodules sit under the same submodules/ directory:

coditect-rollout-master/
├── submodules/core/coditect-core/skills/pdf-to-markdown/src/convert.py
└── submodules/products/coditect-research-continuum/src/pipeline/pipeline.py

Approach: at startup, pipeline.py adds the directory containing convert.py to sys.path. It locates that directory either by relative traversal of the rollout-master tree or through a CODITECT_CORE_PATH environment variable.
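
A minimal sketch of this lookup, assuming CODITECT_CORE_PATH points at the coditect-core checkout and takes precedence over traversal. The function names are illustrative, and the precedence order is an assumption that the architectural decision may change.

```python
import os
import sys
from pathlib import Path

# Location of convert.py inside coditect-core, from the tree above.
CONVERT_SUBPATH = Path("skills/pdf-to-markdown/src")
# Location of coditect-core inside the rollout-master tree.
CORE_IN_TREE = Path("submodules/core/coditect-core")


def find_convert_dir(start: Path, env=os.environ):
    """Return the directory containing convert.py, or None if not found."""
    # 1. Explicit override via environment variable.
    override = env.get("CODITECT_CORE_PATH")
    if override:
        candidate = Path(override) / CONVERT_SUBPATH
        if (candidate / "convert.py").is_file():
            return candidate
    # 2. Relative traversal: walk upward until a directory contains
    #    submodules/core/coditect-core with convert.py inside it.
    for parent in [start, *start.parents]:
        candidate = parent / CORE_IN_TREE / CONVERT_SUBPATH
        if (candidate / "convert.py").is_file():
            return candidate
    return None


def inject_convert_path(start: Path) -> Path:
    """Put convert.py's directory on sys.path, or fail with a clear error."""
    found = find_convert_dir(start.resolve())
    if found is None:
        raise ImportError(
            "convert.py not found; set CODITECT_CORE_PATH to the coditect-core checkout"
        )
    if str(found) not in sys.path:
        sys.path.insert(0, str(found))
    return found
```

pipeline.py would call `inject_convert_path(Path(__file__).parent)` before `import convert`. An override that is set but does not contain convert.py falls through to traversal here; raising immediately in that case would be a stricter alternative.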

Short-term (1-2 months)

research-continuum's pipeline imports convert.py via a relative path (both repos are in the same rollout-master tree)
SKILL.md in coditect-core updated to document convert.py-only usage
New README/docs in research-continuum for the full UDOM pipeline

Medium-term (3-6 months)

Extract convert.py into a shared coditect-pdf-utils package
Both repos depend on it via pip install

Long-term (6+ months)

Full Option C — research-continuum is self-contained product package
coditect-core's /pdf-to-markdown skill is a thin wrapper around the package

Target Structure (research-continuum after move)

coditect-research-continuum/
├── README.md
├── docs/
│   ├── vision/
│   │   └── CODITECT-Research-Continuum-Vision-Document.md
│   ├── architecture/
│   │   └── ADR-174-research-continuum-agentic-knowledge-infrastructure.md
│   └── analysis/
│       └── CODITECT-Research-Continuum-MoE-Assessment.md
├── schemas/
│   └── udom-v1.schema.json
└── src/
    ├── pipeline/
    │   └── pipeline.py
    ├── udom/
    │   ├── __init__.py
    │   ├── schema.py
    │   ├── taxonomy.py
    │   ├── mapper.py
    │   ├── assembler.py
    │   ├── extractors/
    │   ├── providers/
    │   └── assemblers/
    └── tools/
        ├── udom_artifact_finder.py
        ├── udom_dup_finder.py
        └── udom_linter.py

Risk Assessment

Risk	Impact	Likelihood	Mitigation
Import breakage (pipeline cannot find convert.py)	High	Medium	sys.path injection + CODITECT_CORE_PATH env var
UDOM schema referenced by other tools	Medium	Low	Keep a copy of the schema in both repos; the research-continuum copy is the SSOT
Batch pipeline regression	High	Low	Validate with 5-paper test batch post-move
Git history loss	Low	Certain	Document lineage in the commit message; record `git log --follow` output for each file before the move
coditect-core SKILL.md references break	Medium	Medium	Update SKILL.md to document convert.py only
analyze-new-artifacts becomes orphaned	Low	Low	Stays in coditect-core; future batches output to research-continuum
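
The post-move batch validation could include a check that every paper in the test batch produced all four pipeline outputs listed in the dependency graph. The sketch below assumes outputs are written to one flat directory and named `<stem><suffix>`; that naming convention is an assumption, not something this document confirms.

```python
from pathlib import Path

# Output suffixes produced by pipeline.py, per the dependency graph.
EXPECTED_SUFFIXES = (".md", ".udom.json", ".content.jsonl", ".audit.jsonl")


def missing_outputs(output_dir: Path, paper_stems: list[str]) -> dict[str, list[str]]:
    """Map each paper stem to the expected outputs that are absent or empty."""
    report: dict[str, list[str]] = {}
    for stem in paper_stems:
        missing = []
        for suffix in EXPECTED_SUFFIXES:
            path = output_dir / f"{stem}{suffix}"
            # Treat a zero-byte file as a failed output.
            if not path.is_file() or path.stat().st_size == 0:
                missing.append(suffix)
        if missing:
            report[stem] = missing
    return report
```

An empty report for the 5-paper batch is a necessary condition only. Comparing the new outputs against a pre-move run would be a stronger regression check.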

Decision Required

Awaiting approval from Lead Architect on:

Which option? (Recommended: Option B — Split)
Cross-repo import strategy? (sys.path vs env var vs symlink)
Timeline? (Immediate move vs. staged migration)
Batch artifacts disposition? (Stay in core vs. move to product)

Related Documents

ADR-174: internal/architecture/adrs/ADR-174-research-continuum-agentic-knowledge-infrastructure.md
Vision Document: internal/analysis/research-continuum/CODITECT-Research-Continuum-Vision-Document.md
MoE Assessment: internal/analysis/research-continuum/CODITECT-Research-Continuum-MoE-Assessment.md
UDOM Schema ADR: internal/architecture/adrs/ADR-164-universal-document-object-model.md
