ADR-027: Corpus Processing Subsystem Architecture

Status

PROPOSED

Date

2026-01-15

Context

Coditect requires the ability to process large document corpora (100-10,000+ documents) for autonomous software development tasks including:

  • Requirements analysis from stakeholder interviews
  • Competitive intelligence gathering
  • Regulatory document parsing (FDA, HIPAA, SOC2)
  • Legacy codebase analysis
  • Technical documentation synthesis

Current LLM context window limitations (even with 200K+ token models) make naive sequential processing impractical. A video demonstrating "unlimited memory" via file-based checkpointing validates the core concept but lacks enterprise-grade characteristics required for regulated industries.

Requirements

| Requirement | Priority | Rationale |
|---|---|---|
| Process 1,000+ documents in <1 hour | P0 | Competitive parity |
| 21 CFR Part 11 audit trails | P0 | FDA compliance |
| Token cost reduction >70% | P1 | Unit economics |
| Incremental corpus updates | P1 | Operational efficiency |
| Multi-granularity retrieval | P1 | Query flexibility |
| Parallel agent execution | P0 | Performance |
| Checkpoint/resume on failure | P0 | Reliability |
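As a sanity check on the P0 throughput requirement, the required degree of parallelism follows from simple arithmetic. The 90-second per-document latency below is an illustrative assumption, not a measured figure:

```python
import math

def required_parallel_agents(num_docs: int, seconds_per_doc: float,
                             deadline_seconds: float) -> int:
    """Minimum number of parallel mapper agents needed to meet a deadline,
    assuming documents are distributed evenly across agents."""
    total_work_seconds = num_docs * seconds_per_doc
    return math.ceil(total_work_seconds / deadline_seconds)

# 1,000 documents at an assumed 90 s each, within the 1-hour P0 target:
print(required_parallel_agents(1_000, 90, 3_600))  # 25
```

This is why parallel agent execution is itself a P0 requirement: the deadline is unreachable at any plausible per-document latency without it.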

Constraints

  • Must integrate with existing FoundationDB event store
  • Must support Rust/WASM browser execution path
  • Must maintain compliance audit capabilities
  • Token budget awareness (15x multiplier for multi-agent)
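The last constraint can be made concrete: with a 15x multi-agent token multiplier, net cost relative to a single-agent baseline depends directly on the pre-processing reduction rate. A minimal sketch (the function name is ours, not a Coditect API):

```python
def effective_token_multiplier(agent_multiplier: float,
                               preprocess_reduction: float) -> float:
    """Net token spend relative to a single-agent, no-preprocessing baseline.

    agent_multiplier: e.g. 15.0 for the multi-agent path.
    preprocess_reduction: fraction of raw tokens removed before analysis.
    """
    return agent_multiplier * (1.0 - preprocess_reduction)

# At the P1 target of >70% reduction, the 15x multiplier nets out to ~4.5x
# of the single-agent baseline; at 85% reduction it drops to ~2.25x.
print(effective_token_multiplier(15.0, 0.70))
print(effective_token_multiplier(15.0, 0.85))
```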

Decision

Implement a Hybrid Corpus Processing Subsystem consisting of five coordinated components:

```
┌──────────────────────────────────────────────────────────────────┐
│                   CORPUS PROCESSING SUBSYSTEM                    │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌────────────────┐    ┌────────────────┐    ┌────────────────┐  │
│  │ PRE-PROCESSOR  │───►│   MAP-REDUCE   │───►│  HIERARCHICAL  │  │
│  │     AGENTS     │    │  ORCHESTRATOR  │    │ KNOWLEDGE STORE│  │
│  └────────┬───────┘    └────────┬───────┘    └────────┬───────┘  │
│           │                     │                     │          │
│           ▼                     ▼                     ▼          │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │                      RAG QUERY ENGINE                      │  │
│  │           (Retrieval-Augmented Generation Layer)           │  │
│  └─────────────────────────────┬──────────────────────────────┘  │
│                                ▼                                 │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │                   COMPLIANCE AUDIT LAYER                   │  │
│  │               (21 CFR Part 11, HIPAA, SOC2)                │  │
│  └────────────────────────────────────────────────────────────┘  │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
```

Component Breakdown

| Component | ADR | Purpose |
|---|---|---|
| Pre-Processor Agents | ADR-028 | Token reduction, entity extraction, filtering |
| Map-Reduce Orchestrator | ADR-029 | Parallel agent coordination, aggregation |
| Hierarchical Knowledge Store | ADR-030 | Multi-level summary storage, drill-down |
| RAG Query Engine | ADR-031 | Semantic retrieval, cited generation |
| Compliance Audit Layer | ADR-032 | Audit trails, access control, signatures |
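The data flow between these components can be sketched end to end. Every function here is a hypothetical placeholder standing in for the real component, not the Coditect implementation:

```python
def preprocess(docs):
    """Pre-Processor Agents: clean and filter raw documents."""
    return [d.strip() for d in docs if d.strip()]

def map_reduce(docs):
    """Map-Reduce Orchestrator: per-document map, then one aggregation."""
    mapped = [{"words": len(d.split())} for d in docs]           # map phase
    return {"total_words": sum(m["words"] for m in mapped)}      # reduce phase

def index(summary):
    """Hierarchical Knowledge Store: persist the aggregate for retrieval."""
    return {"corpus_summary": summary}

def audit(event, log):
    """Compliance Audit Layer: append-only record of every pipeline run."""
    log.append(event)

def run_pipeline(docs, audit_log):
    cleaned = preprocess(docs)
    summary = map_reduce(cleaned)
    store = index(summary)
    audit(("pipeline_run", len(cleaned)), audit_log)
    return store
```

The RAG Query Engine would then read from the indexed store; it is omitted here because retrieval is covered separately under ADR-031.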

New Agent Types

```yaml
corpus_agents:
  - name: PreProcessorAgent
    role: Document cleaning, NER, keyword filtering
    tools: [ocr_extract, entity_recognize, keyword_filter, text_clean]
    token_budget: 5000        # per document
    parallelizable: true

  - name: MapperAgent
    role: Per-document analysis and extraction
    tools: [analyze_document, extract_schema, summarize_chunk]
    token_budget: 15000       # per document
    parallelizable: true

  - name: ReducerAgent
    role: Aggregate mapper outputs into synthesis
    tools: [merge_extractions, deduplicate, synthesize]
    token_budget: 50000       # per batch
    parallelizable: false     # sequential reduction

  - name: IndexerAgent
    role: Build vector embeddings and knowledge graph
    tools: [embed_chunks, build_graph, index_entities]
    token_budget: 10000       # per batch
    parallelizable: true

  - name: QueryAgent
    role: RAG retrieval and response generation
    tools: [vector_search, graph_traverse, generate_cited]
    token_budget: 20000       # per query
    parallelizable: true
```
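One way to enforce the per-agent budgets above is a simple accounting guard that fails fast on overrun. The budget values mirror the agent definitions, while `charge_tokens` and `TokenBudgetExceeded` are illustrative names, not an existing API:

```python
AGENT_BUDGETS = {  # per-unit budgets from the agent definitions above
    "PreProcessorAgent": 5_000,   # per document
    "MapperAgent": 15_000,        # per document
    "ReducerAgent": 50_000,       # per batch
    "IndexerAgent": 10_000,       # per batch
    "QueryAgent": 20_000,         # per query
}

class TokenBudgetExceeded(RuntimeError):
    """Raised when an agent spends past its configured budget."""

def charge_tokens(agent: str, used: int, spent: dict) -> None:
    """Accumulate token usage and fail fast once the budget is exhausted."""
    spent[agent] = spent.get(agent, 0) + used
    if spent[agent] > AGENT_BUDGETS[agent]:
        raise TokenBudgetExceeded(
            f"{agent} spent {spent[agent]} of {AGENT_BUDGETS[agent]} tokens"
        )
```

Failing fast keeps a single runaway agent from silently consuming the corpus-wide budget, which matters under the 15x multi-agent multiplier.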

New Skills

```yaml
corpus_skills:
  - name: corpus-ingest
    description: Ingest document corpus with pre-processing
    location: /mnt/skills/coditect/corpus-ingest/SKILL.md

  - name: corpus-analyze
    description: Map-reduce analysis of document corpus
    location: /mnt/skills/coditect/corpus-analyze/SKILL.md

  - name: corpus-query
    description: RAG-powered queries against indexed corpus
    location: /mnt/skills/coditect/corpus-query/SKILL.md

  - name: corpus-export
    description: Export analysis results with audit trail
    location: /mnt/skills/coditect/corpus-export/SKILL.md
```

New Commands

```yaml
corpus_commands:
  - command: "@corpus:ingest"
    description: Ingest documents into corpus processing pipeline
    parameters:
      - source_path: string (required)
      - extraction_schema: string (optional)
      - pre_process_level: enum[minimal, standard, aggressive]

  - command: "@corpus:analyze"
    description: Run map-reduce analysis on corpus
    parameters:
      - analysis_type: enum[extract, summarize, compare, custom]
      - output_format: enum[json, markdown, structured]
      - parallel_agents: int (default: 10)

  - command: "@corpus:query"
    description: RAG query against indexed corpus
    parameters:
      - query: string (required)
      - top_k: int (default: 10)
      - require_citations: bool (default: true)

  - command: "@corpus:status"
    description: Check corpus processing status
    parameters:
      - corpus_id: string (required)
```
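Applying the documented defaults for `@corpus:query` might look like the following; the dispatcher shape and function name are hypothetical, only the defaults come from the command definition:

```python
QUERY_DEFAULTS = {"top_k": 10, "require_citations": True}

def parse_corpus_query(params: dict) -> dict:
    """Validate @corpus:query parameters and fill in documented defaults."""
    if "query" not in params:
        raise ValueError("@corpus:query requires a 'query' parameter")
    # Caller-supplied values override defaults; missing keys fall back.
    return {**QUERY_DEFAULTS, **params}

print(parse_corpus_query({"query": "list FDA audit findings"}))
```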

Consequences

Positive

  • Performance: 10-50x faster than sequential processing
  • Cost: 70-85% token reduction through pre-processing + RAG
  • Compliance: Full audit trail for regulated industries
  • Flexibility: Multi-granularity retrieval (summary → detail)
  • Reliability: Checkpoint/resume on any failure
  • Scalability: Horizontal scaling via parallel agents
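The checkpoint/resume behavior can be sketched minimally with a file of completed document IDs; the JSON file format here is illustrative and not the FoundationDB event-store schema:

```python
import json
import os

def load_checkpoint(path: str) -> set:
    """Return the set of document IDs already processed, if any."""
    if os.path.exists(path):
        with open(path) as f:
            return set(json.load(f))
    return set()

def process_with_resume(doc_ids, path, process_fn):
    """Process documents, skipping any completed before a prior failure."""
    done = load_checkpoint(path)
    for doc_id in doc_ids:
        if doc_id in done:
            continue  # already processed on an earlier run
        process_fn(doc_id)
        done.add(doc_id)
        with open(path, "w") as f:
            json.dump(sorted(done), f)  # persist after every document
    return done
```

Persisting after every document trades write overhead for at-most-one document of repeated work on restart; a production design would batch commits against the event store.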

Negative

  • Complexity: Five coordinated subsystems to maintain
  • Infrastructure: Requires vector database (additional dependency)
  • Learning curve: New command vocabulary for users
  • Initial latency: Indexing phase before queries are fast

Risks

| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Hallucination propagation in hierarchical summaries | Medium | High | Extractive summaries at L0, grounding at each level |
| Vector DB performance at scale | Low | Medium | Benchmark early, horizontal sharding strategy |
| Agent coordination deadlock | Medium | High | Circuit breakers, timeout policies |
| Compliance audit overhead | Low | Medium | Async audit writes, batch commits |
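The circuit-breaker mitigation for agent coordination deadlock can be sketched as follows; the failure threshold is illustrative, and timeout policies are omitted for brevity:

```python
class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures; while open,
    further agent calls are rejected immediately instead of piling up."""

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.failures = 0
        self.open = False

    def call(self, fn, *args, **kwargs):
        if self.open:
            raise RuntimeError("circuit open: agent calls suspended")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.open = True
            raise
        self.failures = 0  # any success resets the consecutive-failure count
        return result
```

Rejecting calls while open converts a potential deadlock into a fast, observable failure that the orchestrator can retry or route around.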

Alternatives Considered

Alternative 1: Sequential File-Based (Video Approach)

  • Rejected: O(n) time complexity unacceptable for enterprise scale
  • Learning: Checkpoint-resume pattern is valid, implementation is not

Alternative 2: Pure RAG (No Pre-Processing)

  • Rejected: Token costs prohibitive without filtering
  • Learning: RAG is necessary but not sufficient

Alternative 3: Single Monolithic Agent

  • Rejected: Context window limits, no parallelization
  • Learning: Multi-agent required for scale

Alternative 4: External Service (LangChain Cloud, etc.)

  • Rejected: Compliance requirements demand self-hosted
  • Learning: Architecture patterns are reusable

Implementation Plan

| Phase | Sprint | Deliverable |
|---|---|---|
| 1 | S1-S2 | Pre-Processor Agent pipeline |
| 2 | S3-S4 | Map-Reduce Orchestrator |
| 3 | S5-S6 | Hierarchical Knowledge Store |
| 4 | S7-S8 | RAG Query Engine |
| 5 | S9-S10 | Compliance Audit Layer |
| 6 | S11-S12 | Integration testing, documentation |

References

Approval

| Role | Name | Date | Decision |
|---|---|---|---|
| CTO | Hal Casteel | | |
| Tech Lead | | | |
| Security | | | |