ADR-027: Hybrid Document Processing Architecture
Status
PROPOSED
Date
2026-01-15
Context
Coditect requires enterprise-grade document processing capabilities for regulated industries. Current approaches to large-scale document analysis face fundamental limitations:
- Sequential processing (as demonstrated in consumer tools) cannot scale to enterprise volumes
- Context window constraints limit single-pass analysis to ~200K tokens
- Compliance requirements (21 CFR Part 11, HIPAA, SOC2) demand audit trails for all document access
- Token economics: multi-agent workflows carry a roughly 15x token multiplier, which makes naive approaches cost-prohibitive
- Quality assurance requires validation to prevent hallucination propagation
Market research indicates that organizations implementing advanced RAG systems report a 78% improvement in response accuracy for domain-specific queries. The opportunity is to package these patterns into Coditect's autonomous development platform.
Requirements
| Requirement | Priority | Rationale |
|---|---|---|
| Process 1000+ documents | P0 | Enterprise scale |
| Sub-minute query response | P0 | Interactive UX |
| 21 CFR Part 11 compliance | P0 | Regulated industries |
| 70%+ token reduction | P1 | Cost viability |
| Multi-granularity retrieval | P1 | Varied query complexity |
| Incremental updates | P2 | Avoid full reprocessing |
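The scale and cost requirements above can be made concrete with a back-of-envelope estimate. The sketch below combines the 200K-token context limit and the 15x multi-agent multiplier from the Context section with the 70% reduction target. The average document size is a hypothetical input, not a figure from this ADR, and applying the multiplier to raw corpus tokens is a deliberate simplification.

```python
import math

CONTEXT_WINDOW_TOKENS = 200_000  # single-pass limit cited in Context
MULTI_AGENT_MULTIPLIER = 15      # multi-agent token multiplier cited in Context
TARGET_REDUCTION = 0.70          # P1 requirement: 70%+ token reduction


def estimate(num_docs: int, avg_tokens_per_doc: int) -> dict:
    """Rough token estimate for one full pass over a corpus.

    avg_tokens_per_doc is an assumed input; measure it on real data.
    """
    raw_tokens = num_docs * avg_tokens_per_doc
    # Minimum number of context windows needed to read everything once.
    context_windows = math.ceil(raw_tokens / CONTEXT_WINDOW_TOKENS)
    # Simplification: the multiplier is applied to every raw token.
    naive_multi_agent_tokens = raw_tokens * MULTI_AGENT_MULTIPLIER
    reduced_tokens = round(naive_multi_agent_tokens * (1 - TARGET_REDUCTION))
    return {
        "raw_tokens": raw_tokens,
        "context_windows": context_windows,
        "naive_multi_agent_tokens": naive_multi_agent_tokens,
        "reduced_tokens": reduced_tokens,
    }


if __name__ == "__main__":
    # Hypothetical corpus: 1000 documents of 8000 tokens each.
    for name, value in estimate(1000, 8000).items():
        print(f"{name}: {value:,}")
```

Under these assumed inputs the corpus cannot fit in a single context window, which is the motivation for splitting the work across stages rather than attempting one pass.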
Decision
Implement a Hybrid Document Processing Architecture combining five complementary techniques:
┌─────────────────────────────────────────────────────────────────┐
│ CODITECT HYBRID PIPELINE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ STAGE 1: PRE-PROCESSING AGENTS │ │
│ │ • DocumentParserAgent (OCR, format detection) │ │
│ │ • EntityExtractionAgent (NER, keyword extraction) │ │
│ │ • FilterAgent (relevance scoring, deduplication) │ │
│ │ Token Impact: 60-80% reduction │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ STAGE 2: INDEXING SERVICE │ │
│ │ • ChunkingService (semantic boundaries) │ │
│ │ • EmbeddingService (vector generation) │ │
│ │ • KnowledgeGraphBuilder (entity relationships) │ │
│ │ Storage: FoundationDB + Vector Index │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ STAGE 3: MAP-REDUCE ORCHESTRATOR │ │
│ │ • MapAgent pool (parallel document analysis) │ │
│ │ • ReduceAgent (synthesis and aggregation) │ │
│ │ • CheckpointManager (fault tolerance) │ │
│ │ Time Impact: 80-90% less wall-clock time │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ STAGE 4: HIERARCHICAL MEMORY STORE │ │
│ │ • WorkingMemory (in-context, current session) │ │
│ │ • EpisodicMemory (compressed session summaries) │ │
│ │ • SemanticMemory (extracted knowledge graph) │ │
│ │ Storage: FoundationDB with tiered compression │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ STAGE 5: COMPLIANCE-AWARE RAG │ │
│ │ • QueryProcessor (intent classification, expansion) │ │
│ │ • Retriever (hybrid BM25 + vector + graph) │ │
│ │ • Generator (cited response with audit trail) │ │
│ │ Compliance: Full 21 CFR Part 11 audit logging │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
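Stage 3 in the diagram above can be sketched as a parallel map over documents followed by a single reduce step. This is a minimal illustration, not the design specified in ADR-028: `map_agent`, `reduce_agent`, and `run_map_reduce` are hypothetical names, a thread pool stands in for the MapAgent pool, and term counting stands in for real per-document analysis.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import List


def map_agent(doc: str) -> Counter:
    """Hypothetical MapAgent: analyze one document independently.

    Term counting is a placeholder for an LLM-backed analysis call.
    """
    return Counter(doc.lower().split())


def reduce_agent(partials: List[Counter]) -> Counter:
    """Hypothetical ReduceAgent: merge per-document results into one."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total


def run_map_reduce(docs: List[str], max_workers: int = 4) -> Counter:
    # Documents are independent, so the map step can run in parallel.
    # pool.map returns results in the same order as the inputs.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(map_agent, docs))
    return reduce_agent(partials)


if __name__ == "__main__":
    corpus = ["audit trail audit", "trail record"]
    print(run_map_reduce(corpus).most_common())
```

Because each map call touches only its own document, wall-clock time for the map step scales with the slowest batch rather than the corpus size, which is where the time savings in Stage 3 would come from.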
Architecture Principles
- Event-Driven: All stages emit events to a FoundationDB event store for replay and audit
- Agent Specialization: Each stage uses purpose-built agents with bounded responsibilities
- Checkpoint-Resume: All long-running operations support interruption and recovery
- Token-Aware: Budget management is enforced at the orchestration layer
- Compliance-First: Audit trail creation is mandatory, not an afterthought
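Three of the principles above (Event-Driven, Checkpoint-Resume, Token-Aware) interact in the orchestration loop, and a small sketch may help show how. Everything here is an assumption for illustration: the `Orchestrator` class, the JSON checkpoint file, and the in-memory `events` list (a stand-in for the FoundationDB event store) are not defined by this ADR.

```python
import json
import os
from dataclasses import dataclass, field
from typing import Callable, Dict, List


class BudgetExceeded(RuntimeError):
    """Raised when the next document would exceed the token budget."""


@dataclass
class Orchestrator:
    checkpoint_path: str
    token_budget: int
    # Stand-in for the event store; a real system would persist these.
    events: List[dict] = field(default_factory=list)

    def _load_done(self) -> Dict[str, int]:
        if os.path.exists(self.checkpoint_path):
            with open(self.checkpoint_path) as f:
                return json.load(f)
        return {}

    def _save_done(self, done: Dict[str, int]) -> None:
        # Write to a temp file, then rename, so a crash mid-write
        # never leaves a truncated checkpoint behind.
        tmp = self.checkpoint_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(done, f)
        os.replace(tmp, self.checkpoint_path)

    def run(self, docs: Dict[str, str],
            estimate_tokens: Callable[[str], int]) -> Dict[str, int]:
        done = self._load_done()
        spent = sum(done.values())
        for doc_id, text in docs.items():
            if doc_id in done:
                continue  # already processed in an earlier run
            cost = estimate_tokens(text)
            if spent + cost > self.token_budget:
                self.events.append({"type": "budget_exceeded", "doc": doc_id})
                raise BudgetExceeded(doc_id)
            # The real agent call for this document would happen here.
            spent += cost
            done[doc_id] = cost
            self._save_done(done)
            self.events.append(
                {"type": "doc_processed", "doc": doc_id, "tokens": cost}
            )
        return done
```

A run that stops on `BudgetExceeded` (or a crash) can be restarted with the same checkpoint path, and only the unprocessed documents are handled on the second run. Tokens recorded in the checkpoint count against the budget of the resumed run.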
Consequences
Positive
- Scalability: Handles 10,000+ documents with horizontal scaling
- Performance: Sub-second query response after indexing
- Cost Efficiency: 70-85% token reduction vs naive approaches
- Compliance: Built-in audit trails satisfy regulatory requirements
- Quality: Multi-stage validation limits hallucination propagation
Negative
- Complexity: Five-stage pipeline requires careful orchestration
- Initial Latency: First-time indexing has upfront cost
- Storage Requirements: Vector indices and knowledge graphs consume space
- Operational Overhead: More components to monitor and maintain
Risks
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Stage coupling creates brittleness | Medium | High | Event-driven decoupling, circuit breakers |
| Vector index query latency at scale | Low | Medium | Partitioning, caching hot queries |
| Hallucination in hierarchical summaries | Medium | High | Extractive grounding, validation agents |
| Compliance audit storage growth | High | Low | Tiered retention policies |
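The "extractive grounding" mitigation in the table above can be illustrated with a simple check that flags summary sentences whose words are mostly absent from the source text. This is a crude lexical sketch under stated assumptions: the function names and the 0.8 threshold are hypothetical, and a production validation agent would likely use stronger signals such as entailment models or span-level citations.

```python
import re
from typing import List, Set


def _tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric word set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter on ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def ungrounded_sentences(summary: str, source: str,
                         min_overlap: float = 0.8) -> List[str]:
    """Return summary sentences with too little lexical support in source.

    min_overlap is the fraction of a sentence's distinct words that
    must also appear in the source; 0.8 is an assumed default.
    """
    source_tokens = _tokens(source)
    flagged = []
    for sentence in split_sentences(summary):
        words = _tokens(sentence)
        if not words:
            continue
        overlap = len(words & source_tokens) / len(words)
        if overlap < min_overlap:
            flagged.append(sentence)
    return flagged


if __name__ == "__main__":
    source = "The batch record was signed on 2024-03-01 by the QA lead."
    summary = ("The batch record was signed by the QA lead. "
               "The batch failed sterility testing.")
    print(ungrounded_sentences(summary, source))
```

Flagged sentences could be dropped, regenerated, or routed to a validation agent before a summary is written to the hierarchical memory store, so that an unsupported claim at one level is not compounded at the next.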
Implementation
See related ADRs:
- ADR-028: Map-Reduce Agent Orchestration
- ADR-029: Hierarchical Knowledge Store
- ADR-030: Compliance-Aware RAG System
- ADR-031: Pre-Processing Pipeline
- ADR-032: Multi-Level Memory Hierarchy
- ADR-033: Document Processing CLI Commands