ADR-027: Hybrid Document Processing Architecture

Status

PROPOSED

Date

2026-01-15

Context

Coditect requires enterprise-grade document processing capabilities for regulated industries. Current approaches to large-scale document analysis face fundamental limitations:

  1. Sequential processing (as demonstrated in consumer tools) cannot scale to enterprise volumes
  2. Context window constraints limit single-pass analysis to ~200K tokens
  3. Compliance requirements (21 CFR Part 11, HIPAA, SOC 2) demand audit trails for all document access
  4. Token economics, with a 15x token multiplier for multi-agent workflows, make naive approaches cost-prohibitive
  5. Quality assurance requires validation to prevent hallucination propagation

Market research indicates that organizations implementing advanced RAG systems report a 78% improvement in response accuracy for domain-specific queries. The opportunity is to package these patterns into Coditect's autonomous development platform.

Requirements

| Requirement | Priority | Rationale |
|---|---|---|
| Process 1000+ documents | P0 | Enterprise scale |
| Sub-minute query response | P0 | Interactive UX |
| 21 CFR Part 11 compliance | P0 | Regulated industries |
| 70%+ token reduction | P1 | Cost viability |
| Multi-granularity retrieval | P1 | Varied query complexity |
| Incremental updates | P2 | Avoid full reprocessing |

Decision

Implement a Hybrid Document Processing Architecture combining five complementary techniques:

┌─────────────────────────────────────────────────────────────────┐
│ CODITECT HYBRID PIPELINE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ STAGE 1: PRE-PROCESSING AGENTS │ │
│ │ • DocumentParserAgent (OCR, format detection) │ │
│ │ • EntityExtractionAgent (NER, keyword extraction) │ │
│ │ • FilterAgent (relevance scoring, deduplication) │ │
│ │ Token Impact: -60-80% │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ STAGE 2: INDEXING SERVICE │ │
│ │ • ChunkingService (semantic boundaries) │ │
│ │ • EmbeddingService (vector generation) │ │
│ │ • KnowledgeGraphBuilder (entity relationships) │ │
│ │ Storage: FoundationDB + Vector Index │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ STAGE 3: MAP-REDUCE ORCHESTRATOR │ │
│ │ • MapAgent pool (parallel document analysis) │ │
│ │ • ReduceAgent (synthesis and aggregation) │ │
│ │ • CheckpointManager (fault tolerance) │ │
│ │ Time Impact: -80-90% wall-clock │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ STAGE 4: HIERARCHICAL MEMORY STORE │ │
│ │ • WorkingMemory (in-context, current session) │ │
│ │ • EpisodicMemory (compressed session summaries) │ │
│ │ • SemanticMemory (extracted knowledge graph) │ │
│ │ Storage: FoundationDB with tiered compression │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ STAGE 5: COMPLIANCE-AWARE RAG │ │
│ │ • QueryProcessor (intent classification, expansion) │ │
│ │ • Retriever (hybrid BM25 + vector + graph) │ │
│ │ • Generator (cited response with audit trail) │ │
│ │ Compliance: Full 21 CFR Part 11 audit logging │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
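
Stage 3 can be illustrated with a minimal sketch. It assumes Python, a thread pool standing in for the MapAgent pool, and a JSON file standing in for FoundationDB. The functions `map_agent`, `reduce_agent`, and `run_map_reduce` are hypothetical names, and the word count is a placeholder for an LLM-backed analysis. `CheckpointManager` is named in the diagram, but its interface here is an assumption.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


class CheckpointManager:
    """Persists per-document map results so an interrupted run can resume."""

    def __init__(self, path: Path) -> None:
        self.path = path
        # Reload earlier results if a checkpoint file already exists.
        self.done: dict[str, int] = (
            json.loads(path.read_text()) if path.exists() else {}
        )

    def save(self, doc_id: str, result: int) -> None:
        self.done[doc_id] = result
        self.path.write_text(json.dumps(self.done))


def map_agent(text: str) -> int:
    # Placeholder for an LLM-backed analysis: counts words.
    return len(text.split())


def reduce_agent(partials: dict[str, int]) -> int:
    # Placeholder synthesis step: aggregates the per-document results.
    return sum(partials.values())


def run_map_reduce(
    docs: dict[str, str], ckpt: CheckpointManager, workers: int = 4
) -> int:
    # Skip documents that a previous run already completed.
    pending = {d: t for d, t in docs.items() if d not in ckpt.done}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map phase: pending documents are analysed in parallel.
        results = pool.map(map_agent, pending.values())
        for doc_id, result in zip(pending, results):
            ckpt.save(doc_id, result)  # written from the main thread only
    # Reduce phase: combine checkpointed and fresh results.
    return reduce_agent({d: ckpt.done[d] for d in docs})


if __name__ == "__main__":
    path = Path(tempfile.mkdtemp()) / "checkpoint.json"
    docs = {"a": "one two three", "b": "four five"}
    print(run_map_reduce(docs, CheckpointManager(path)))
```

Because completed documents are skipped on the next run, a job that is interrupted and restarted with the same checkpoint file only analyses the documents it has not yet processed.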

Architecture Principles

  1. Event-Driven: All stages emit events to the FoundationDB event store for replay and audit
  2. Agent Specialization: Each stage uses purpose-built agents with bounded responsibilities
  3. Checkpoint-Resume: All long-running operations support interruption and recovery
  4. Token-Aware: Token budgets are enforced at the orchestration layer
  5. Compliance-First: Audit trail creation is mandatory, not an afterthought
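
Principles 1 and 4 can be sketched together: an orchestrator that charges each stage against a token budget and appends an event for every decision. This is an illustrative sketch only. The `Orchestrator` class, its `charge` and `emit` methods, and the event field names are all hypothetical, and an in-memory list stands in for the FoundationDB event store.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when a stage requests more tokens than remain in the budget."""


@dataclass
class Orchestrator:
    token_budget: int
    spent: int = 0
    # Stand-in for the event store: append-only list of event records.
    events: list = field(default_factory=list)

    def emit(self, kind: str, **payload) -> None:
        # Events are only ever appended, never updated or deleted.
        self.events.append({"ts": time.time(), "kind": kind, **payload})

    def charge(self, stage: str, tokens: int) -> None:
        if self.spent + tokens > self.token_budget:
            # Rejections are recorded too, so the audit trail shows refusals.
            self.emit("budget_rejected", stage=stage, requested=tokens,
                      spent=self.spent)
            remaining = self.token_budget - self.spent
            raise BudgetExceeded(
                f"{stage} requested {tokens} tokens, {remaining} remaining"
            )
        self.spent += tokens
        self.emit("tokens_charged", stage=stage, tokens=tokens,
                  spent=self.spent)


if __name__ == "__main__":
    orch = Orchestrator(token_budget=100)
    orch.charge("map", 60)
    try:
        orch.charge("reduce", 50)
    except BudgetExceeded as exc:
        print(exc)
    print([e["kind"] for e in orch.events])
```

Placing the check in one orchestration-level method, rather than inside each agent, means no stage can spend tokens without both passing the budget check and leaving an event behind.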

Consequences

Positive

  • Scalability: Handles 10,000+ documents with horizontal scaling
  • Performance: Sub-second query response after indexing
  • Cost Efficiency: 70-85% token reduction vs naive approaches
  • Compliance: Built-in audit trails satisfy regulatory requirements
  • Quality: Multi-stage validation prevents hallucination propagation

Negative

  • Complexity: Five-stage pipeline requires careful orchestration
  • Initial Latency: First-time indexing has upfront cost
  • Storage Requirements: Vector indices and knowledge graphs consume space
  • Operational Overhead: More components to monitor and maintain

Risks

| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Stage coupling creates brittleness | Medium | High | Event-driven decoupling, circuit breakers |
| Vector index query latency at scale | Low | Medium | Partitioning, caching hot queries |
| Hallucination in hierarchical summaries | Medium | High | Extractive grounding, validation agents |
| Compliance audit storage growth | High | Low | Tiered retention policies |
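
The circuit-breaker mitigation for stage coupling could look roughly like the sketch below. It is a generic version of the pattern, not a description of Coditect's implementation. The `CircuitBreaker` class, its thresholds, and the injectable `clock` parameter are assumptions made for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpen(RuntimeError):
    """Raised when calls are being short-circuited."""


class CircuitBreaker:
    def __init__(
        self,
        max_failures: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: fail fast instead of calling the failing stage.
                raise CircuitOpen("downstream stage is failing; call skipped")
            # Half-open: allow one trial call; one more failure re-opens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # a success closes the circuit fully
        return result
```

Wrapping each cross-stage call in a breaker like this keeps a failing stage from stalling the stages around it: callers get an immediate `CircuitOpen` error and can queue or skip work until the stage recovers.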

Implementation

See related ADRs:

  • ADR-028: Map-Reduce Agent Orchestration
  • ADR-029: Hierarchical Knowledge Store
  • ADR-030: Compliance-Aware RAG System
  • ADR-031: Pre-Processing Pipeline
  • ADR-032: Multi-Level Memory Hierarchy
  • ADR-033: Document Processing CLI Commands

References