ADR-027: Hybrid Document Processing Architecture

Status

PROPOSED

Date

2026-01-15

Context

Coditect requires enterprise-grade document processing capabilities for regulated industries. Current approaches to large-scale document analysis face fundamental limitations:

Sequential processing (as demonstrated in consumer tools) cannot scale to enterprise volumes
Context window constraints limit single-pass analysis to ~200K tokens
Compliance requirements (21 CFR Part 11, HIPAA, SOC2) demand audit trails for all document access
The 15x token multiplier of multi-agent processing makes naive approaches cost-prohibitive
Quality assurance requires validation to prevent hallucination propagation
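The interaction between the context-window limit and the multi-agent multiplier can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name `estimate`, the corpus size, the average document length, and the 70% reduction are assumptions chosen for the example, while the 200K-token window and the 15x multiplier are the figures quoted in the list above.

```python
import math
from dataclasses import dataclass

CONTEXT_WINDOW_TOKENS = 200_000  # ~200K single-pass limit quoted above
MULTI_AGENT_MULTIPLIER = 15      # 15x multi-agent multiplier quoted above


@dataclass(frozen=True)
class CorpusEstimate:
    raw_tokens: int      # tokens in the corpus before any filtering
    windows_needed: int  # minimum number of full context windows to read it once
    naive_tokens: int    # raw tokens scaled by the multi-agent multiplier
    reduced_tokens: int  # naive tokens after a pre-processing reduction


def estimate(doc_count: int, avg_tokens_per_doc: int, reduction: float) -> CorpusEstimate:
    """Rough token budget for a corpus (hypothetical helper, not a Coditect API)."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    raw = doc_count * avg_tokens_per_doc
    windows = math.ceil(raw / CONTEXT_WINDOW_TOKENS)
    naive = raw * MULTI_AGENT_MULTIPLIER
    reduced = round(naive * (1.0 - reduction))
    return CorpusEstimate(raw, windows, naive, reduced)


if __name__ == "__main__":
    # Hypothetical corpus: 1,000 documents averaging 8,000 tokens each.
    print(estimate(doc_count=1_000, avg_tokens_per_doc=8_000, reduction=0.70))
```

Under these assumed inputs the corpus needs at least 40 context windows to read once, which is why single-pass analysis is not an option at this scale.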

Market research indicates that organizations implementing advanced RAG systems report a 78% improvement in response accuracy for domain-specific queries. The opportunity is to package these patterns into Coditect's autonomous development platform.

Requirements

Requirement	Priority	Rationale
Process 1000+ documents	P0	Enterprise scale
Sub-minute query response	P0	Interactive UX
21 CFR Part 11 compliance	P0	Regulated industries
70%+ token reduction	P1	Cost viability
Multi-granularity retrieval	P1	Varied query complexity
Incremental updates	P2	Avoid full reprocessing

Decision

Implement a Hybrid Document Processing Architecture combining five complementary techniques:

┌─────────────────────────────────────────────────────────────────┐
│                    CODITECT HYBRID PIPELINE                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ STAGE 1: PRE-PROCESSING AGENTS                          │   │
│  │ • DocumentParserAgent (OCR, format detection)           │   │
│  │ • EntityExtractionAgent (NER, keyword extraction)       │   │
│  │ • FilterAgent (relevance scoring, deduplication)        │   │
│  │ Token Impact: -60-80%                                   │   │
│  └─────────────────────────────────────────────────────────┘   │
│                              │                                  │
│                              ▼                                  │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ STAGE 2: INDEXING SERVICE                               │   │
│  │ • ChunkingService (semantic boundaries)                 │   │
│  │ • EmbeddingService (vector generation)                  │   │
│  │ • KnowledgeGraphBuilder (entity relationships)          │   │
│  │ Storage: FoundationDB + Vector Index                    │   │
│  └─────────────────────────────────────────────────────────┘   │
│                              │                                  │
│                              ▼                                  │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ STAGE 3: MAP-REDUCE ORCHESTRATOR                        │   │
│  │ • MapAgent pool (parallel document analysis)            │   │
│  │ • ReduceAgent (synthesis and aggregation)               │   │
│  │ • CheckpointManager (fault tolerance)                   │   │
│  │ Time Impact: -80-90% wall-clock                         │   │
│  └─────────────────────────────────────────────────────────┘   │
│                              │                                  │
│                              ▼                                  │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ STAGE 4: HIERARCHICAL MEMORY STORE                      │   │
│  │ • WorkingMemory (in-context, current session)           │   │
│  │ • EpisodicMemory (compressed session summaries)         │   │
│  │ • SemanticMemory (extracted knowledge graph)            │   │
│  │ Storage: FoundationDB with tiered compression           │   │
│  └─────────────────────────────────────────────────────────┘   │
│                              │                                  │
│                              ▼                                  │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ STAGE 5: COMPLIANCE-AWARE RAG                           │   │
│  │ • QueryProcessor (intent classification, expansion)     │   │
│  │ • Retriever (hybrid BM25 + vector + graph)              │   │
│  │ • Generator (cited response with audit trail)           │   │
│  │ Compliance: Full 21 CFR Part 11 audit logging           │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
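To illustrate how Stage 3 might combine a MapAgent pool, a ReduceAgent, and a CheckpointManager, here is a minimal sketch. It makes several assumptions: map and reduce steps are plain Python callables rather than LLM agents, the checkpoint is a JSON file rather than FoundationDB, and the names `map_reduce`, `count_words`, and `total_words` are hypothetical. The detailed design belongs to ADR-028.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable


def map_reduce(
    docs: dict[str, str],
    map_fn: Callable[[str], dict],
    reduce_fn: Callable[[list[dict]], dict],
    checkpoint_path: Path,
    max_workers: int = 4,
) -> dict:
    """Run map_fn over docs in parallel, checkpointing each result, then reduce."""
    # Resume: load per-document results saved by an earlier, interrupted run.
    done: dict[str, dict] = {}
    if checkpoint_path.exists():
        done = json.loads(checkpoint_path.read_text())

    pending = [doc_id for doc_id in docs if doc_id not in done]
    tmp_path = checkpoint_path.with_suffix(".tmp")

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {doc_id: pool.submit(map_fn, docs[doc_id]) for doc_id in pending}
        for doc_id, future in futures.items():
            done[doc_id] = future.result()
            # Write to a temp file and rename, so a crash never leaves a
            # half-written checkpoint behind.
            tmp_path.write_text(json.dumps(done))
            tmp_path.replace(checkpoint_path)

    # Reduce in a stable order so reruns produce the same synthesis input.
    return reduce_fn([done[doc_id] for doc_id in sorted(docs)])


def count_words(text: str) -> dict:
    """Stand-in for a MapAgent: analyse one document."""
    return {"words": len(text.split())}


def total_words(parts: list[dict]) -> dict:
    """Stand-in for the ReduceAgent: aggregate the per-document results."""
    return {"words": sum(p["words"] for p in parts)}
```

Calling `map_reduce` a second time with the same checkpoint path skips every document already recorded, so only unfinished work is repeated after an interruption.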

Architecture Principles

Event-Driven: All stages emit events to a FoundationDB event store for replay and audit
Agent Specialization: Each stage uses purpose-built agents with bounded responsibilities
Checkpoint-Resume: All long-running operations support interruption and recovery
Token-Aware: Budget management is enforced at the orchestration layer
Compliance-First: Audit trail creation is non-optional, not an afterthought
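One way to picture the Event-Driven and Compliance-First principles together is an append-only log in which every stage event is chained to the previous one by a hash, so later tampering is detectable. This is a sketch under stated assumptions: hash chaining is a common tamper-evidence technique and is not specified by this ADR, the in-memory `AuditLog` class stands in for the FoundationDB event store, and the field names are hypothetical. It does not by itself establish 21 CFR Part 11 compliance.

```python
import hashlib
import json
from dataclasses import dataclass, field

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body: dict) -> str:
    """SHA-256 over a canonical (sorted-key) JSON encoding of the entry body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


@dataclass
class AuditLog:
    """In-memory stand-in for the event store (assumption, not a FoundationDB client)."""

    entries: list[dict] = field(default_factory=list)

    def append(self, stage: str, actor: str, action: str, doc_id: str, timestamp: str) -> dict:
        """Record one event, linking it to the hash of the previous entry."""
        prev_hash = self.entries[-1]["hash"] if self.entries else GENESIS_HASH
        body = {
            "stage": stage,
            "actor": actor,
            "action": action,
            "doc_id": doc_id,
            "timestamp": timestamp,
            "prev_hash": prev_hash,
        }
        entry = {**body, "hash": _digest(body)}
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute every hash and link; False means the log was altered."""
        prev_hash = GENESIS_HASH
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if body["prev_hash"] != prev_hash or _digest(body) != entry["hash"]:
                return False
            prev_hash = entry["hash"]
        return True
```

Because each stage calls `append` as part of doing its work, the audit trail is produced as a side effect of processing rather than reconstructed afterwards, and the same entries can be replayed in order.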

Consequences

Positive

Scalability: Handles 10,000+ documents with horizontal scaling
Performance: Sub-second query response after indexing
Cost Efficiency: 70-85% token reduction vs naive approaches
Compliance: Built-in audit trails satisfy regulatory requirements
Quality: Multi-stage validation prevents hallucination propagation

Negative

Complexity: Five-stage pipeline requires careful orchestration
Initial Latency: First-time indexing has upfront cost
Storage Requirements: Vector indices and knowledge graphs consume space
Operational Overhead: More components to monitor and maintain

Risks

Risk	Probability	Impact	Mitigation
Stage coupling creates brittleness	Medium	High	Event-driven decoupling, circuit breakers
Vector index query latency at scale	Low	Medium	Partitioning, caching hot queries
Hallucination in hierarchical summaries	Medium	High	Extractive grounding, validation agents
Compliance audit storage growth	High	Low	Tiered retention policies
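The "extractive grounding" mitigation in the table can be sketched as a check that every claim in a generated summary carries an evidence span that actually occurs in the source document. The `Claim` structure, the `ungrounded_claims` function, and the whitespace and case normalization are assumptions made for this example. A production validator would likely need fuzzier matching and per-document offsets.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    text: str      # statement produced by a summarizing agent
    evidence: str  # span the agent reports having copied from the source


def _normalize(s: str) -> str:
    """Collapse runs of whitespace and ignore case so line wraps do not break matches."""
    return re.sub(r"\s+", " ", s).strip().casefold()


def ungrounded_claims(claims: list[Claim], source: str) -> list[Claim]:
    """Return the claims whose evidence is empty or absent from the source text."""
    haystack = _normalize(source)
    return [
        claim
        for claim in claims
        if not claim.evidence.strip() or _normalize(claim.evidence) not in haystack
    ]


if __name__ == "__main__":
    # Hypothetical source text and claims, for illustration only.
    source = "The batch record was reviewed\nand signed by the QA lead."
    claims = [
        Claim("QA signed the record.", "reviewed and signed by the QA lead"),
        Claim("The regulator approved the batch.", "approved by the regulator"),
    ]
    for claim in ungrounded_claims(claims, source):
        print("UNGROUNDED:", claim.text)
```

Claims returned by the check can be dropped or sent to a validation agent before they reach a higher-level summary, which keeps an unsupported statement from propagating upward.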

Implementation

See related ADRs:

ADR-028: Map-Reduce Agent Orchestration
ADR-029: Hierarchical Knowledge Store
ADR-030: Compliance-Aware RAG System
ADR-031: Pre-Processing Pipeline
ADR-032: Multi-Level Memory Hierarchy
ADR-033: Document Processing CLI Commands
