ADR-027: Corpus Processing Subsystem Architecture
Status
PROPOSED
Date
2026-01-15
Context
Coditect requires the ability to process large document corpora (100-10,000+ documents) for autonomous software development tasks including:
- Requirements analysis from stakeholder interviews
- Competitive intelligence gathering
- Regulatory document parsing (FDA, HIPAA, SOC2)
- Legacy codebase analysis
- Technical documentation synthesis
Current LLM context window limitations (even with 200K+ token models) make naive sequential processing impractical. A video demonstrating "unlimited memory" via file-based checkpointing validates the core concept but lacks enterprise-grade characteristics required for regulated industries.
Requirements
| Requirement | Priority | Rationale |
|---|---|---|
| Process 1,000+ documents in <1 hour | P0 | Competitive parity |
| 21 CFR Part 11 audit trails | P0 | FDA compliance |
| Token cost reduction >70% | P1 | Unit economics |
| Incremental corpus updates | P1 | Operational efficiency |
| Multi-granularity retrieval | P1 | Query flexibility |
| Parallel agent execution | P0 | Performance |
| Checkpoint/resume on failure | P0 | Reliability |
Constraints
- Must integrate with existing FoundationDB event store
- Must support Rust/WASM browser execution path
- Must maintain compliance audit capabilities
- Token budget awareness (multi-agent workflows consume roughly 15x the tokens of a single-agent run)
Decision
Implement a Hybrid Corpus Processing Subsystem consisting of five coordinated components:
┌─────────────────────────────────────────────────────────────────────┐
│ CORPUS PROCESSING SUBSYSTEM │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ PRE-PROCESSOR │───►│ MAP-REDUCE │───►│ HIERARCHICAL │ │
│ │ AGENTS │ │ ORCHESTRATOR │ │ KNOWLEDGE STORE│ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │ │ │ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ RAG QUERY ENGINE │ │
│ │ (Retrieval-Augmented Generation Layer) │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ COMPLIANCE AUDIT LAYER │ │
│ │ (21 CFR Part 11, HIPAA, SOC2) │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Component Breakdown
| Component | ADR | Purpose |
|---|---|---|
| Pre-Processor Agents | ADR-028 | Token reduction, entity extraction, filtering |
| Map-Reduce Orchestrator | ADR-029 | Parallel agent coordination, aggregation |
| Hierarchical Knowledge Store | ADR-030 | Multi-level summary storage, drill-down |
| RAG Query Engine | ADR-031 | Semantic retrieval, cited generation |
| Compliance Audit Layer | ADR-032 | Audit trails, access control, signatures |
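The map phase fans documents out to parallel MapperAgents; the reduce phase folds their outputs sequentially. A minimal sketch of that control flow, assuming placeholder work functions (`map_document` and `reduce_pair` are illustrative names, not part of ADR-029):

```python
from concurrent.futures import ThreadPoolExecutor

def map_document(doc: str) -> dict:
    # Stand-in for a MapperAgent call: per-document extraction.
    return {"doc": doc, "entities": sorted(set(doc.lower().split()))}

def reduce_pair(acc: dict, item: dict) -> dict:
    # Stand-in for a ReducerAgent call: merge and deduplicate.
    merged = sorted(set(acc["entities"]) | set(item["entities"]))
    return {"docs": acc["docs"] + [item["doc"]], "entities": merged}

def run_corpus_analysis(docs: list[str], parallel_agents: int = 10) -> dict:
    # Map phase is parallel; reduce phase is sequential, matching the
    # parallelizable: true/false flags in the agent definitions below.
    with ThreadPoolExecutor(max_workers=parallel_agents) as pool:
        mapped = list(pool.map(map_document, docs))
    result = {"docs": [], "entities": []}
    for item in mapped:
        result = reduce_pair(result, item)
    return result
```

The sequential reduce is the scalability bottleneck; ADR-029 would own any tree-reduction refinement.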
New Agent Types
corpus_agents:
- name: PreProcessorAgent
role: Document cleaning, NER, keyword filtering
tools: [ocr_extract, entity_recognize, keyword_filter, text_clean]
token_budget: 5,000 per document
parallelizable: true
- name: MapperAgent
role: Per-document analysis and extraction
tools: [analyze_document, extract_schema, summarize_chunk]
token_budget: 15,000 per document
parallelizable: true
- name: ReducerAgent
role: Aggregate mapper outputs into synthesis
tools: [merge_extractions, deduplicate, synthesize]
token_budget: 50,000 per batch
parallelizable: false (sequential reduction)
- name: IndexerAgent
role: Build vector embeddings and knowledge graph
tools: [embed_chunks, build_graph, index_entities]
token_budget: 10,000 per batch
parallelizable: true
- name: QueryAgent
role: RAG retrieval and response generation
tools: [vector_search, graph_traverse, generate_cited]
token_budget: 20,000 per query
parallelizable: true
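The per-agent budgets above only matter if they are enforced at dispatch time. A sketch of one enforcement approach — the budget figure comes from the table, but the `AgentBudget` type and fail-fast policy are illustrative assumptions:

```python
from dataclasses import dataclass

class TokenBudgetExceeded(RuntimeError):
    pass

@dataclass
class AgentBudget:
    name: str
    limit: int   # tokens per unit of work (document, batch, or query)
    spent: int = 0

    def charge(self, tokens: int) -> None:
        # Reject work that would exceed the budget up front,
        # rather than truncating agent output mid-generation.
        if self.spent + tokens > self.limit:
            raise TokenBudgetExceeded(
                f"{self.name}: {self.spent + tokens} > {self.limit}")
        self.spent += tokens

budget = AgentBudget(name="PreProcessorAgent", limit=5_000)
budget.charge(3_000)
```

Failing fast keeps the 15x multi-agent token multiplier bounded per document rather than per run.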
New Skills
corpus_skills:
- name: corpus-ingest
description: Ingest document corpus with pre-processing
location: /mnt/skills/coditect/corpus-ingest/SKILL.md
- name: corpus-analyze
description: Map-reduce analysis of document corpus
location: /mnt/skills/coditect/corpus-analyze/SKILL.md
- name: corpus-query
description: RAG-powered queries against indexed corpus
location: /mnt/skills/coditect/corpus-query/SKILL.md
- name: corpus-export
description: Export analysis results with audit trail
location: /mnt/skills/coditect/corpus-export/SKILL.md
New Commands
corpus_commands:
- command: "@corpus:ingest"
description: Ingest documents into corpus processing pipeline
parameters:
- source_path: string (required)
- extraction_schema: string (optional)
- pre_process_level: enum[minimal, standard, aggressive]
- command: "@corpus:analyze"
description: Run map-reduce analysis on corpus
parameters:
- analysis_type: enum[extract, summarize, compare, custom]
- output_format: enum[json, markdown, structured]
- parallel_agents: int (default: 10)
- command: "@corpus:query"
description: RAG query against indexed corpus
parameters:
- query: string (required)
- top_k: int (default: 10)
- require_citations: bool (default: true)
- command: "@corpus:status"
description: Check corpus processing status
parameters:
- corpus_id: string (required)
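The command table fixes defaults (`top_k: 10`, `require_citations: true`) and one required parameter per command. A sketch of how `@corpus:query` parameters could be validated — the parser shape is illustrative, not a specified interface:

```python
def parse_corpus_query(params: dict) -> dict:
    # Apply the @corpus:query defaults from the command table;
    # "query" is the only required parameter.
    if "query" not in params:
        raise ValueError("@corpus:query requires a 'query' parameter")
    return {
        "query": params["query"],
        "top_k": int(params.get("top_k", 10)),
        "require_citations": bool(params.get("require_citations", True)),
    }
```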
Consequences
Positive
- Performance: 10-50x faster than sequential processing
- Cost: 70-85% token reduction through pre-processing + RAG
- Compliance: Full audit trail for regulated industries
- Flexibility: Multi-granularity retrieval (summary → detail)
- Reliability: Checkpoint/resume on any failure
- Scalability: Horizontal scaling via parallel agents
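Checkpoint/resume can be as simple as persisting the last completed document index per run. A sketch under the assumption of a JSON file; in the actual system the checkpoint would live in the FoundationDB event store per ADR-001:

```python
import json
import os

def load_checkpoint(path: str) -> int:
    # Resume from the last completed document index, or start fresh.
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)["completed"]
    return 0

def process_with_resume(docs: list[str], path: str) -> list[str]:
    start = load_checkpoint(path)
    out = []
    for i in range(start, len(docs)):
        out.append(docs[i].upper())        # stand-in for agent work
        with open(path, "w") as f:         # checkpoint after each document
            json.dump({"completed": i + 1}, f)
    return out
```

On restart the loop skips already-completed documents, so a failure mid-corpus costs at most one document's worth of rework.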
Negative
- Complexity: Five coordinated subsystems to maintain
- Infrastructure: Requires vector database (additional dependency)
- Learning curve: New command vocabulary for users
- Initial latency: corpus must be fully indexed before queries become fast
Risks
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Hallucination propagation in hierarchical summaries | Medium | High | Extractive summaries at L0, grounding at each level |
| Vector DB performance at scale | Low | Medium | Benchmark early, horizontal sharding strategy |
| Agent coordination deadlock | Medium | High | Circuit breakers, timeout policies |
| Compliance audit overhead | Low | Medium | Async audit writes, batch commits |
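The circuit-breaker mitigation for coordination deadlock can be sketched as follows; the threshold and cooldown values, and the breaker shape itself, are illustrative assumptions rather than a specified design:

```python
import time

class CircuitBreaker:
    # Opens after `threshold` consecutive failures, then fails fast
    # until `cooldown` elapses, so one stuck agent cannot stall the
    # whole pipeline waiting on it.
    def __init__(self, threshold: int = 3, cooldown: float = 30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None   # half-open: allow one probe call
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

Wrapping each agent invocation in a breaker plus a per-call timeout converts a silent deadlock into a fast, auditable failure.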
Alternatives Considered
Alternative 1: Sequential File-Based (Video Approach)
- Rejected: O(n) wall-clock time (no parallelism) is unacceptable at enterprise scale
- Learning: Checkpoint-resume pattern is valid, implementation is not
Alternative 2: Pure RAG (No Pre-Processing)
- Rejected: Token costs prohibitive without filtering
- Learning: RAG is necessary but not sufficient
Alternative 3: Single Monolithic Agent
- Rejected: Context window limits, no parallelization
- Learning: Multi-agent required for scale
Alternative 4: External Service (LangChain Cloud, etc.)
- Rejected: Compliance requirements demand self-hosted
- Learning: Architecture patterns are reusable
Implementation Plan
| Phase | Sprint | Deliverable |
|---|---|---|
| 1 | S1-S2 | Pre-Processor Agent pipeline |
| 2 | S3-S4 | Map-Reduce Orchestrator |
| 3 | S5-S6 | Hierarchical Knowledge Store |
| 4 | S7-S8 | RAG Query Engine |
| 5 | S9-S10 | Compliance Audit Layer |
| 6 | S11-S12 | Integration testing, documentation |
References
- ADR-001: FoundationDB as Core Database
- ADR-015: Multi-Agent Orchestration Framework
- ADR-018: Token Economics and Budget Management
- ADR-022: Compliance Framework (21 CFR Part 11)
- LLM×MapReduce Paper
- Hierarchical Summarization - Pieces.app
- RAG Best Practices 2025
Approval
| Role | Name | Date | Decision |
|---|---|---|---|
| CTO | Hal Casteel | | |
| Tech Lead | | | |
| Security | | | |