Research Continuum Code Migration Analysis

Date: 2026-02-11
Author: Claude (Opus 4.6)
Task: T.6 (Research Continuum product architecture)
Project: PILOT
Status: PROPOSED (awaiting architectural decision)


Problem Statement

The pdf-to-markdown pipeline and UDOM extraction system (~15K LOC, 25 Python files) currently lives in coditect-core/skills/pdf-to-markdown/. Per the Research Continuum vision (session log 2026-02-11T08:56:47Z), this code is Layer 1 (Extraction) of a four-layer product (ADR-174). It needs to move to the coditect-research-continuum submodule for independent product development.

The question: what is the optimal long-term code ownership model?


Current Location (coditect-core)

skills/pdf-to-markdown/
├── SKILL.md                      # 16KB skill documentation
├── .venv/                        # Python venv (regenerable)
├── analyze-new-artifacts/        # ~1.4GB batch outputs (artifacts)
└── src/
    ├── convert.py                # PDF-to-Markdown converter (v2.5, standalone)
    ├── pipeline.py               # Batch pipeline + UDOM orchestrator (v1.7)
    ├── udom/                     # UDOM package
    │   ├── __init__.py           # Package init
    │   ├── schema.py             # Type definitions (29K)
    │   ├── taxonomy.py           # 25-type component taxonomy (13K)
    │   ├── mapper.py             # Cross-source alignment (38K)
    │   ├── assembler.py          # Canonical output generation (40K)
    │   ├── extractors/           # PDF/HTML/LaTeX extractors
    │   ├── providers/            # Data source providers
    │   └── assemblers/           # Output format assemblers
    ├── udom_artifact_finder.py   # Artifact discovery utilities
    ├── udom_dup_finder.py        # Deduplication utilities
    └── udom_linter.py            # Schema linting

config/schemas/
└── udom-v1.schema.json # JSON Schema for UDOM

Dependency Graph

pipeline.py (orchestrator)
├── imports convert.py (PDF extraction engine)
├── imports udom/ (UDOM extraction, mapping, assembly)
│   ├── extractors/ (PDF/HTML/LaTeX source extraction)
│   ├── mapper.py (cross-source alignment)
│   ├── assembler.py (canonical output generation)
│   ├── schema.py (UDOM type definitions)
│   └── taxonomy.py (component type taxonomy)
└── outputs: .md, .udom.json, .content.jsonl, .audit.jsonl

convert.py (standalone)
├── NO dependency on pipeline.py or udom/
├── Standalone PDF-to-Markdown conversion
└── outputs: .md only

Key observation: convert.py is independently useful and has zero dependency on UDOM. pipeline.py depends on both convert.py and udom/.
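The zero-dependency claim is the basis for the split, so it may be worth checking mechanically before and after the move. The following is a minimal sketch, assuming the dependency would show up as an ordinary import statement; the function names and the sample source string (including the `Document` name) are illustrative and not taken from the codebase.

```python
import ast

# Modules that convert.py must not depend on, per the dependency graph above.
FORBIDDEN = {"pipeline", "udom"}


def imported_modules(source: str) -> set[str]:
    """Return the top-level module names imported by a Python source string."""
    names: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import udom.schema" counts as a dependency on "udom"
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            if node.module:
                # "from udom.schema import X" and "from .udom import X"
                names.add(node.module.split(".")[0])
            else:
                # "from . import udom" names sibling modules directly
                names.update(alias.name for alias in node.names)
    return names


def forbidden_imports(source: str) -> set[str]:
    """Return the forbidden modules that the given source imports."""
    return imported_modules(source) & FORBIDDEN


# Hypothetical source text, used only to show the check firing.
sample = "import sys\nfrom udom.schema import Document\n"
print(forbidden_imports(sample))
```

Run against the real file with `forbidden_imports(Path("convert.py").read_text())`; an empty set supports the observation. Dynamic imports (for example through `importlib`) would not be caught by this sketch.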


Architecture Options

Option A: Full Move + Stub

Move ALL source code to research-continuum. Leave a stub SKILL.md in coditect-core pointing to the new location.

| Dimension | Assessment |
|---|---|
| SSOT | Clean: one location for all code |
| Framework impact | Breaks /pdf-to-markdown skill for non-RC users |
| Dependency | None: everything in one repo |
| Migration effort | Medium: move files, update imports |
| Long-term maintenance | Simple: single repo |

Verdict: Clean but removes general-purpose PDF utility from the framework.

Option B: Split — Generic Stays, Product Moves

convert.py (generic PDF-to-Markdown) stays in coditect-core. pipeline.py + udom/ moves to research-continuum.

| Dimension | Assessment |
|---|---|
| SSOT | Split: convert.py in core, pipeline + udom in product |
| Framework impact | None: /pdf-to-markdown still works for basic conversion |
| Dependency | Cross-repo: pipeline.py imports convert.py |
| Migration effort | Medium: move subset, add import path |
| Long-term maintenance | Moderate: two repos to maintain, import bridge |

Verdict: Natural boundary respects dual purpose but creates cross-repo dependency.

Option C: Product Owns, Framework Installs

All code moves to research-continuum. Published as installable package. coditect-core's skill invokes the installed package.

| Dimension | Assessment |
|---|---|
| SSOT | Clean: product owns, framework consumes |
| Framework impact | None if package installed; broken if not |
| Dependency | Proper: pip install or path-based |
| Migration effort | High: needs package infrastructure |
| Long-term maintenance | Best: versioned, proper dependency management |

Verdict: Best long-term architecture but premature for current stage.

Option D: Keep in coditect-core, Product References It

Code stays in coditect-core. research-continuum depends on coditect-core for extraction.

| Dimension | Assessment |
|---|---|
| SSOT | Single location |
| Framework impact | None |
| Dependency | Product depends on framework (inversion) |
| Migration effort | Zero |
| Long-term maintenance | Poor: product code in framework, bloats core |

Verdict: Easy but wrong direction architecturally. Product code should not live in the framework.


Recommendation: Option B (Split) with Evolution Path

Immediate (now)

Split at the natural boundary:

| Stays in coditect-core | Moves to research-continuum |
|---|---|
| src/convert.py | src/pipeline.py |
| SKILL.md (updated for convert.py only) | src/udom/ (entire package) |
| .venv/ (regenerable) | src/udom_artifact_finder.py |
| analyze-new-artifacts/ (batch outputs) | src/udom_dup_finder.py |
| | src/udom_linter.py |
| | config/schemas/udom-v1.schema.json |

Cross-repo import solution

pipeline.py imports convert.py. In the rollout-master monorepo tree, both submodules are siblings:

coditect-rollout-master/
├── submodules/core/coditect-core/skills/pdf-to-markdown/src/convert.py
└── submodules/products/coditect-research-continuum/src/pipeline/pipeline.py

Approach: at startup, pipeline.py adds the directory containing convert.py to sys.path. It locates that directory either through the CODITECT_CORE_PATH environment variable or by relative traversal of the rollout-master tree.
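One way the lookup could work is sketched below. It rests on several assumptions not fixed by this document: that CODITECT_CORE_PATH points at the coditect-core checkout root, that the environment variable is tried before traversal, and that a missing convert.py should fail loudly. The helper names `find_convert_dir` and `inject_convert_path` are hypothetical.

```python
import os
import sys
from pathlib import Path
from typing import Mapping, Optional

# Location of convert.py's directory relative to the rollout-master root,
# copied from the tree above.
CONVERT_SRC_REL = Path("submodules/core/coditect-core/skills/pdf-to-markdown/src")


def find_convert_dir(start: Path, env: Optional[Mapping[str, str]] = None) -> Optional[Path]:
    """Return the directory holding convert.py, or None if it cannot be found."""
    env = os.environ if env is None else env

    # Assumption: CODITECT_CORE_PATH is the coditect-core root and wins if valid.
    core = env.get("CODITECT_CORE_PATH")
    if core:
        candidate = Path(core) / "skills" / "pdf-to-markdown" / "src"
        if (candidate / "convert.py").is_file():
            return candidate

    # Fallback: walk upward from `start` until the rollout-master layout matches.
    for parent in [start, *start.parents]:
        candidate = parent / CONVERT_SRC_REL
        if (candidate / "convert.py").is_file():
            return candidate
    return None


def inject_convert_path(start: Path) -> Path:
    """Put convert.py's directory on sys.path so `import convert` resolves."""
    convert_dir = find_convert_dir(start)
    if convert_dir is None:
        raise ImportError(
            "convert.py not found; set CODITECT_CORE_PATH to the coditect-core checkout"
        )
    entry = str(convert_dir)
    if entry not in sys.path:
        sys.path.insert(0, entry)
    return convert_dir
```

At the top of pipeline.py this would be called as `inject_convert_path(Path(__file__).resolve().parent)` before `import convert`. Raising ImportError with a pointer to the environment variable keeps the "pipeline cannot find convert.py" failure visible at startup rather than in the middle of a batch.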

Short-term (1-2 months)

  • research-continuum's pipeline imports convert.py via relative path (both in same rollout-master tree)
  • SKILL.md in coditect-core updated to document convert.py-only usage
  • New README/docs in research-continuum for the full UDOM pipeline

Medium-term (3-6 months)

  • Extract convert.py into a shared coditect-pdf-utils package
  • Both repos depend on it via pip install

Long-term (6+ months)

  • Full Option C — research-continuum is self-contained product package
  • coditect-core's /pdf-to-markdown skill is a thin wrapper around the package
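For the long-term state, the thin wrapper mainly needs to detect whether the product package is installed and fail with a clear message when it is not (the "broken if not" case noted under Option C). A minimal sketch follows; the package name `coditect_research_continuum` is a placeholder, since no package name has been chosen, and the function names are illustrative.

```python
import importlib
import importlib.util
from types import ModuleType

# Placeholder name: the real distribution/package name is not yet decided.
PACKAGE = "coditect_research_continuum"


def package_available(name: str = PACKAGE) -> bool:
    """Report whether a top-level package can be imported, without importing it."""
    return importlib.util.find_spec(name) is not None


def load_package(name: str = PACKAGE) -> ModuleType:
    """Import the product package, or raise with install guidance if it is absent."""
    if not package_available(name):
        raise RuntimeError(
            f"The /pdf-to-markdown skill requires the '{name}' package, "
            "which is not installed in this environment."
        )
    return importlib.import_module(name)
```

The skill entry point would call `load_package()` once and delegate to whatever conversion function the package eventually exposes; that API is not defined here.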

Target Structure (research-continuum after move)

coditect-research-continuum/
├── README.md
├── docs/
│   ├── vision/
│   │   └── CODITECT-Research-Continuum-Vision-Document.md
│   ├── architecture/
│   │   └── ADR-174-research-continuum-agentic-knowledge-infrastructure.md
│   └── analysis/
│       └── CODITECT-Research-Continuum-MoE-Assessment.md
├── schemas/
│   └── udom-v1.schema.json
└── src/
    ├── pipeline/
    │   └── pipeline.py
    ├── udom/
    │   ├── __init__.py
    │   ├── schema.py
    │   ├── taxonomy.py
    │   ├── mapper.py
    │   ├── assembler.py
    │   ├── extractors/
    │   ├── providers/
    │   └── assemblers/
    └── tools/
        ├── udom_artifact_finder.py
        ├── udom_dup_finder.py
        └── udom_linter.py

Risk Assessment

| Risk | Impact | Likelihood | Mitigation |
|---|---|---|---|
| Import breakage (pipeline cannot find convert.py) | High | Medium | sys.path injection + CODITECT_CORE_PATH env var |
| UDOM schema referenced by other tools | Medium | Low | Keep a copy of the schema in both repos; the research-continuum copy is the SSOT |
| Batch pipeline regression | High | Low | Validate with 5-paper test batch post-move |
| Git history loss | Low | Certain | Document lineage in the move commit message; pre-move history remains available via git log --follow in coditect-core |
| coditect-core SKILL.md references break | Medium | Medium | Update SKILL.md to document convert.py only |
| analyze-new-artifacts becomes orphaned | Low | Low | Stays in coditect-core; future batches output to research-continuum |
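The post-move test batch can start with a cheap structural check: every paper should produce the four output types listed in the dependency graph. The sketch below assumes outputs land in one flat directory and are named `<stem><suffix>`; both the naming scheme and the function name are assumptions, not confirmed pipeline behavior.

```python
from pathlib import Path

# Output types produced by pipeline.py, per the dependency graph.
EXPECTED_SUFFIXES = (".md", ".udom.json", ".content.jsonl", ".audit.jsonl")


def missing_outputs(output_dir: Path, stems: list[str]) -> list[str]:
    """List expected output file names that are absent from output_dir."""
    missing = []
    for stem in stems:
        for suffix in EXPECTED_SUFFIXES:
            name = stem + suffix  # assumed naming: <stem><suffix>
            if not (output_dir / name).is_file():
                missing.append(name)
    return missing
```

An empty list for all five papers shows only that the pipeline ran end to end after the move. Comparing file contents against a pre-move run would be a stronger regression check.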

Decision Required

Awaiting approval from Lead Architect on:

  1. Which option? (Recommended: Option B — Split)
  2. Cross-repo import strategy? (sys.path vs env var vs symlink)
  3. Timeline? (Immediate move vs. staged migration)
  4. Batch artifacts disposition? (Stay in core vs. move to product)

References

  • ADR-174: internal/architecture/adrs/ADR-174-research-continuum-agentic-knowledge-infrastructure.md
  • Vision Document: internal/analysis/research-continuum/CODITECT-Research-Continuum-Vision-Document.md
  • MoE Assessment: internal/analysis/research-continuum/CODITECT-Research-Continuum-MoE-Assessment.md
  • UDOM Schema ADR: internal/architecture/adrs/ADR-164-universal-document-object-model.md