Corpus Processing Subsystem: Suggestions for Further Improvement

Executive Summary

The ADR-027 through ADR-032 architecture provides a solid foundation for enterprise corpus processing. This document identifies 54 improvement opportunities across 12 categories, prioritized by impact and implementation complexity.


1. PERFORMANCE OPTIMIZATIONS

1.1 Streaming Pipeline Architecture

Priority: HIGH
Current Gap: Documents fully loaded into memory before processing
Suggestion: Implement streaming/chunked processing for large documents
Benefit: Handle 100MB+ documents without memory pressure
Implementation: Use async generators, process document in 64KB chunks
Complexity: Medium
Reference: ADR-028, ADR-029
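
The async-generator approach in 1.1 might look roughly like the sketch below. It assumes the document arrives as a binary file-like object; `stream_chunks` and `count_bytes` are hypothetical names used only for illustration, not part of the existing pipeline.

```python
import asyncio
import io
from typing import AsyncIterator, BinaryIO

CHUNK_SIZE = 64 * 1024  # 64 KB, matching the chunk size suggested in 1.1


async def stream_chunks(source: BinaryIO, chunk_size: int = CHUNK_SIZE) -> AsyncIterator[bytes]:
    """Yield the document in fixed-size chunks instead of loading it whole."""
    loop = asyncio.get_running_loop()
    while True:
        # Run the blocking read in a worker thread so the event loop stays free.
        chunk = await loop.run_in_executor(None, source.read, chunk_size)
        if not chunk:
            break
        yield chunk


async def count_bytes(source: BinaryIO) -> int:
    """Hypothetical consumer: handles one chunk at a time, never the full document."""
    total = 0
    async for chunk in stream_chunks(source):
        total += len(chunk)
    return total


if __name__ == "__main__":
    document = io.BytesIO(b"x" * 200_000)
    print(asyncio.run(count_bytes(document)))
```

Peak memory here is bounded by the chunk size rather than the document size; a real mapper would replace the byte counter with its own per-chunk processing.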

1.2 Speculative Execution for Map-Reduce

Priority: MEDIUM
Current Gap: Slow mappers block overall job completion
Suggestion: Implement speculative execution: launch duplicate tasks for stragglers
Benefit: Reduce P99 latency by 30-50% for large jobs
Implementation: Monitor task duration percentiles; relaunch tasks whose runtime exceeds the P90 duration
Complexity: High
Reference: ADR-029
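
Straggler detection for 1.2 could be sketched as follows, assuming task durations are tracked in seconds. `find_stragglers`, `min_samples`, and the input shapes are assumptions made for this example; the coordinator's actual bookkeeping may differ.

```python
import statistics
from typing import Dict, List, Sequence


def find_stragglers(
    completed_durations: Sequence[float],
    running_elapsed: Dict[str, float],
    min_samples: int = 10,
) -> List[str]:
    """Return ids of running tasks whose elapsed time exceeds the P90 of completed tasks."""
    if len(completed_durations) < min_samples:
        return []  # too few finished tasks to estimate a percentile
    # quantiles(n=10) returns the 9 decile cut points; the last one is P90.
    p90 = statistics.quantiles(completed_durations, n=10)[-1]
    return [task_id for task_id, elapsed in running_elapsed.items() if elapsed > p90]
```

Each returned task id would be a candidate for a duplicate launch, with whichever copy finishes first winning and the other being cancelled.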

1.3 Embedding Cache with TTL

Priority: HIGH
Current Gap: Identical content is re-embedded on each ingestion
Suggestion: Content-addressable embedding cache with configurable TTL
Benefit: 50-80% reduction in embedding API calls for overlapping corpora
Implementation: SHA-256 content hash → embedding lookup before API call
Complexity: Low
Reference: ADR-030, ADR-031
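
A minimal in-memory sketch of the hash-then-lookup flow in 1.3 is shown below. `EmbeddingCache` and the `embed_fn` callable are hypothetical; a production version would presumably sit on a shared store rather than a process-local dict.

```python
import hashlib
import time
from typing import Callable, Dict, List, Tuple


class EmbeddingCache:
    """Content-addressable cache: SHA-256 of the text is the key, entries expire after a TTL."""

    def __init__(
        self,
        embed_fn: Callable[[str], List[float]],  # assumed wrapper around the embedding API
        ttl_seconds: float = 86_400.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._embed_fn = embed_fn
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, List[float]]] = {}

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, text: str) -> List[float]:
        key = self._key(text)
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # fresh entry: no API call needed
        vector = self._embed_fn(text)  # miss or expired: call the API and store the result
        self._store[key] = (now, vector)
        return vector
```

Because the key depends only on content, the same chunk appearing in two corpora resolves to a single cached embedding.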

1.4 Query Result Caching

Priority: MEDIUM
Current Gap: Identical RAG queries re-execute full pipeline
Suggestion: LRU cache for query results with cache invalidation on corpus update
Benefit: Sub-10ms response for repeated queries
Implementation: Query hash → cached response, invalidate on corpus version change
Complexity: Low
Reference: ADR-031
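
One way to get the version-based invalidation in 1.4 is to fold the corpus version into the cache key, so entries for an old version simply stop matching and age out of the LRU. The sketch below assumes that design; `QueryCache` and the whitespace/case normalization are illustrative choices, not existing behavior.

```python
import hashlib
from collections import OrderedDict
from typing import Optional


class QueryCache:
    """LRU cache keyed by a hash of (corpus version, normalized query)."""

    def __init__(self, max_entries: int = 1024) -> None:
        self._max = max_entries
        self._entries: "OrderedDict[str, str]" = OrderedDict()

    @staticmethod
    def _key(query: str, corpus_version: str) -> str:
        normalized = " ".join(query.lower().split())  # assumption: case and spacing are irrelevant
        return hashlib.sha256(f"{corpus_version}\x00{normalized}".encode("utf-8")).hexdigest()

    def get(self, query: str, corpus_version: str) -> Optional[str]:
        key = self._key(query, corpus_version)
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, query: str, corpus_version: str, response: str) -> None:
        key = self._key(query, corpus_version)
        self._entries[key] = response
        self._entries.move_to_end(key)
        while len(self._entries) > self._max:
            self._entries.popitem(last=False)  # evict the least recently used entry
```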

1.5 Batch Embedding Requests

Priority: HIGH
Current Gap: Single embedding per API call
Suggestion: Batch up to 100 chunks per embedding API request
Benefit: 10x reduction in API round-trips, lower latency
Implementation: Accumulator with flush on size/time threshold
Complexity: Low
Reference: ADR-030
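
The accumulator in 1.5 could be sketched as below. `EmbeddingBatcher` and `embed_batch` are hypothetical names, and the time threshold is only checked when a new chunk arrives; a real implementation would likely add a background timer so a quiet stream still flushes.

```python
import time
from typing import Callable, List, Optional


class EmbeddingBatcher:
    """Collect chunks and send them in one request once a size or age threshold is hit."""

    def __init__(
        self,
        embed_batch: Callable[[List[str]], List[List[float]]],  # assumed batch API wrapper
        max_size: int = 100,
        max_wait_seconds: float = 0.5,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._embed_batch = embed_batch
        self._max_size = max_size
        self._max_wait = max_wait_seconds
        self._clock = clock
        self._pending: List[str] = []
        self._first_at: Optional[float] = None

    def add(self, chunk: str) -> List[List[float]]:
        """Queue a chunk; returns embeddings if this call triggered a flush, else []."""
        if not self._pending:
            self._first_at = self._clock()
        self._pending.append(chunk)
        full = len(self._pending) >= self._max_size
        stale = self._clock() - self._first_at >= self._max_wait
        return self.flush() if (full or stale) else []

    def flush(self) -> List[List[float]]:
        """Send whatever is pending as a single batch request."""
        if not self._pending:
            return []
        batch, self._pending = self._pending, []
        self._first_at = None
        return self._embed_batch(batch)
```

Callers should invoke `flush()` once at the end of ingestion so the final partial batch is not lost.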

2. COST REDUCTION

2.1 Tiered Model Selection

Priority: HIGH
Current Gap: Single model (Opus/Sonnet) for all operations
Suggestion: Use cheaper models for simpler tasks (Haiku for extraction, Sonnet for synthesis, Opus for complex reasoning)
Benefit: 60-80% cost reduction with <5% quality loss
Implementation: Task complexity classifier → model router
Complexity: Medium
Reference: ADR-027, ADR-029

    MODEL_ROUTING = {
        "extraction": "claude-haiku",      # Simple pattern matching
        "summarization": "claude-sonnet",  # Standard synthesis
        "analysis": "claude-sonnet",       # Complex reasoning
        "synthesis": "claude-opus",        # Final reports only
    }

2.2 Adaptive Chunk Sizing

Priority: MEDIUM
Current Gap: Fixed chunk size regardless of content density
Suggestion: Dynamically size chunks based on information density
Benefit: 20-40% token reduction for sparse documents
Implementation: Entropy-based chunk boundary detection
Complexity: High
Reference: ADR-028, ADR-030

2.3 Early Termination for Simple Queries

Priority: MEDIUM
Current Gap: Full retrieval pipeline for all queries
Suggestion: Short-circuit for queries answerable from master/section summaries
Benefit: 70% cost reduction for simple factual queries
Implementation: Confidence threshold on high-level retrieval; skip drill-down if sufficient
Complexity: Medium
Reference: ADR-031

2.4 Compression of Stored Summaries

Priority: LOW
Current Gap: Summaries stored as plain text
Suggestion: LZ4/Zstd compression for hierarchy nodes
Benefit: 60-70% storage reduction
Implementation: Compress on write, decompress on read (transparent)
Complexity: Low
Reference: ADR-030
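
The transparent compress-on-write, decompress-on-read step in 2.4 can be illustrated with a short sketch. It uses the standard-library `zlib` as a stand-in so the example runs without extra dependencies; LZ4 or Zstd would slot into the same two functions. The function names are hypothetical.

```python
import zlib


def compress_summary(text: str, level: int = 6) -> bytes:
    """Encode and compress a summary before it is written to storage."""
    return zlib.compress(text.encode("utf-8"), level)


def decompress_summary(blob: bytes) -> str:
    """Inverse of compress_summary, applied when a hierarchy node is read."""
    return zlib.decompress(blob).decode("utf-8")
```

Actual savings depend on the summaries themselves, so the ratio is worth measuring on real hierarchy nodes before committing to a codec.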

3. QUALITY & ACCURACY

3.1 Hallucination Detection Layer

Priority: HIGH
Current Gap: Self-correction relies on citation validation only
Suggestion: Dedicated hallucination detection using entailment models
Benefit: Catch claims not supported by any retrieved context
Implementation: NLI model (e.g., DeBERTa) scores claim↔context entailment
Complexity: Medium
Reference: ADR-031

3.2 Confidence Calibration

Priority: MEDIUM
Current Gap: Confidence scores not calibrated to actual accuracy
Suggestion: Calibrate confidence using held-out evaluation set
Benefit: Reliable confidence for downstream decision-making
Implementation: Platt scaling or isotonic regression on confidence scores
Complexity: Medium
Reference: ADR-031
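
Platt scaling, one of the two options named in 3.2, fits a sigmoid that maps raw confidence scores to observed accuracy on the held-out set. The dependency-free sketch below uses plain gradient descent; the function names, learning rate, and epoch count are assumptions for illustration, and a library implementation would normally be preferable in production.

```python
import math
from typing import Sequence, Tuple


def _sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_platt(
    scores: Sequence[float],
    labels: Sequence[int],
    lr: float = 0.1,
    epochs: int = 2000,
) -> Tuple[float, float]:
    """Fit p = sigmoid(a * score + b) by gradient descent on the log loss.

    `labels` are 1 where the answer was judged correct on the held-out set, else 0.
    """
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = _sigmoid(a * s + b) - y  # derivative of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def calibrate(score: float, a: float, b: float) -> float:
    """Map a raw confidence score to a calibrated probability."""
    return _sigmoid(a * score + b)
```

After fitting, `calibrate` would be applied to every confidence score before it is exposed to downstream consumers.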

3.3 Multi-Model Consensus

Priority: LOW
Current Gap: Single model for generation
Suggestion: For high-stakes queries, generate with multiple models and compare
Benefit: Higher confidence when models agree; flag disagreements
Implementation: Parallel generation, semantic similarity check on outputs
Complexity: High
Reference: ADR-031

3.4 Source Quality Scoring

Priority: MEDIUM
Current Gap: All source documents treated equally
Suggestion: Score documents by recency, authority, relevance; weight in retrieval
Benefit: Prefer authoritative sources over outdated/tertiary
Implementation: Metadata-based scoring multiplier on retrieval scores
Complexity: Low
Reference: ADR-030, ADR-031

3.5 Fact Verification Pipeline

Priority: MEDIUM
Current Gap: No external fact-checking
Suggestion: For factual claims, optionally verify against external sources
Benefit: Catch corpus errors or outdated information
Implementation: Web search verification for high-importance claims
Complexity: High
Reference: ADR-031

4. DEVELOPER EXPERIENCE

4.1 Visual Pipeline Builder

Priority: MEDIUM
Current Gap: Pipeline configuration via code/YAML only
Suggestion: Drag-and-drop visual builder for corpus processing pipelines
Benefit: Non-developers can configure analysis workflows
Implementation: React-based flow editor → generates pipeline config
Complexity: High
Reference: All ADRs

4.2 Interactive Debugging Console

Priority: HIGH
Current Gap: Limited visibility into pipeline execution
Suggestion: Real-time console showing agent decisions, tool calls, checkpoints
Benefit: Faster debugging, better understanding of system behavior
Implementation: WebSocket stream of execution events to UI
Complexity: Medium
Reference: ADR-029

4.3 Prompt Playground Integration

Priority: MEDIUM
Current Gap: Mapper/reducer prompts edited in config files
Suggestion: Integrated prompt playground with test execution
Benefit: Iterate on prompts with immediate feedback
Implementation: Monaco editor + sample document execution
Complexity: Medium
Reference: ADR-029

4.4 Schema Builder with Validation

Priority: HIGH
Current Gap: Extraction schemas defined manually
Suggestion: Visual schema builder with JSON Schema validation
Benefit: Reduce schema errors, faster iteration
Implementation: Form-based builder → JSON Schema output
Complexity: Medium
Reference: ADR-028, ADR-029

4.5 CLI Tool for Local Testing

Priority: MEDIUM
Current Gap: Pipelines must be deployed before they can be tested
Suggestion: coditect-corpus CLI for local pipeline testing
Benefit: Rapid development iteration without deployment
Implementation: Python CLI wrapping core services
Complexity: Medium
Reference: All ADRs

5. OPERATIONAL EXCELLENCE

5.1 Automated Quality Regression Tests

Priority: HIGH
Current Gap: No automated quality monitoring
Suggestion: Golden dataset with expected outputs; run on each deployment
Benefit: Catch quality regressions before production
Implementation: CI/CD step comparing outputs to golden set
Complexity: Medium
Reference: ADR-031

5.2 Cost Anomaly Detection

Priority: HIGH
Current Gap: Token costs monitored but no anomaly alerting
Suggestion: ML-based anomaly detection on token usage patterns
Benefit: Catch runaway costs early (e.g., infinite loops, prompt injection)
Implementation: Time-series anomaly detection on per-job token consumption
Complexity: Medium
Reference: ADR-029
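
Before investing in a learned model for 5.2, a simple statistical baseline may already catch the runaway cases. The sketch below flags a job whose token count sits far above the trailing mean using a z-score; it is a stand-in for the ML-based detector, and `is_token_anomaly` with its thresholds is a hypothetical example.

```python
import statistics
from typing import Sequence


def is_token_anomaly(
    history: Sequence[int],
    current: int,
    z_threshold: float = 3.0,
    min_history: int = 5,
) -> bool:
    """Flag `current` if it is more than `z_threshold` std devs above the trailing mean."""
    if len(history) < min_history:
        return False  # not enough prior jobs to judge
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        # Perfectly flat history: treat any increase as suspicious.
        return current > mean
    return (current - mean) / stdev > z_threshold
```

For example, with recent jobs at roughly 1,000 tokens each, a job consuming 5,000 tokens would be flagged while one consuming 1,050 would not.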

5.3 Graceful Degradation Modes

Priority: MEDIUM
Current Gap: System fails if any dependency is unavailable
Suggestion: Define degraded modes (e.g., skip reranking if reranker down)
Benefit: Maintain partial availability during outages
Implementation: Feature flags per capability, fallback behaviors
Complexity: Medium
Reference: All ADRs

5.4 Canary Deployments for Prompts

Priority: MEDIUM
Current Gap: Prompt changes deployed to all traffic
Suggestion: Route percentage of traffic to new prompts, compare metrics
Benefit: Safe prompt iteration with rollback capability
Implementation: A/B routing with metric collection
Complexity: High
Reference: ADR-029
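
The percentage routing in 5.4 is often done with a deterministic hash so that a given request (or tenant) always sees the same prompt version. A minimal sketch under that assumption follows; `prompt_variant` and the bucket granularity are illustrative choices.

```python
import hashlib


def prompt_variant(request_id: str, canary_percent: float) -> str:
    """Deterministically assign a request to the 'canary' or 'stable' prompt.

    `canary_percent` is in the range 0-100; the same id always maps to the same variant.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # buckets 0..9999 (0.01% steps)
    return "canary" if bucket < canary_percent * 100 else "stable"
```

Rolling back then amounts to setting `canary_percent` to 0, and metrics can be compared by grouping on the returned variant label.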

5.5 Automated Capacity Planning

Priority: LOW
Current Gap: Manual capacity estimation
Suggestion: Predict resource needs from corpus size and job configuration
Benefit: Right-size infrastructure, avoid over/under-provisioning
Implementation: Regression model on historical job metrics
Complexity: Medium
Reference: ADR-029

6. SECURITY ENHANCEMENTS

6.1 Prompt Injection Detection

Priority: HIGH
Current Gap: User documents could contain prompt injections
Suggestion: Scan ingested documents for prompt injection patterns
Benefit: Prevent adversarial content from manipulating agents
Implementation: Pattern matching + classifier for injection attempts
Complexity: Medium
Reference: ADR-028

6.2 Output Sanitization

Priority: HIGH
Current Gap: Generated outputs not sanitized
Suggestion: Sanitize outputs to prevent XSS, code injection in downstream use
Benefit: Safe integration with web UIs and downstream systems
Implementation: HTML escaping, code block sandboxing
Complexity: Low
Reference: ADR-031
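
The HTML-escaping half of 6.2 can be covered with the standard library, as in the sketch below. `sanitize_for_html` is a hypothetical helper name, and the sketch does not address code block sandboxing.

```python
import html


def sanitize_for_html(generated: str) -> str:
    """Escape model output so it renders as text, not markup, in a web UI."""
    # quote=True also escapes " and ', which matters inside HTML attributes.
    return html.escape(generated, quote=True)
```

Escaping should happen at the point of rendering, so the stored output stays unmodified for non-HTML consumers.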

6.3 Data Residency Controls

Priority: MEDIUM
Current Gap: No geographic data residency enforcement
Suggestion: Tag corpora with residency requirements; route to compliant regions
Benefit: GDPR, data sovereignty compliance
Implementation: Metadata-based routing to regional deployments
Complexity: High
Reference: ADR-032

6.4 Differential Privacy for Aggregations

Priority: LOW
Current Gap: Aggregated outputs could leak individual data points
Suggestion: Apply differential privacy to statistical outputs
Benefit: Prevent membership inference attacks
Implementation: Noise injection in aggregate statistics
Complexity: High
Reference: ADR-029, ADR-032
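
One common form of the noise injection in 6.4 is the Laplace mechanism, which adds noise scaled to sensitivity divided by the privacy budget epsilon. The sketch below applies it to a count; choosing the Laplace mechanism, and the names `laplace_noise` and `private_count`, are assumptions of this example rather than a decided design.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(
    true_count: int,
    epsilon: float,
    rng: random.Random,
    sensitivity: float = 1.0,
) -> float:
    """Return the count with Laplace noise of scale sensitivity / epsilon added.

    A sensitivity of 1 assumes one individual changes the count by at most 1.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(sensitivity / epsilon, rng)
```

Smaller epsilon values give stronger privacy but noisier statistics, so the budget would need to be set per report type.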

7. SCALABILITY

7.1 Sharded Vector Index

Priority: HIGH
Current Gap: Single vector index per corpus
Suggestion: Shard vector index across multiple nodes for large corpora
Benefit: Handle 10M+ chunks with consistent latency
Implementation: Hash-based sharding with scatter-gather queries
Complexity: High
Reference: ADR-030, ADR-031
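
The two halves of 7.1, hash-based placement and scatter-gather querying, can be sketched as follows. `shard_for`, `scatter_gather`, and the `query_fn` callback are hypothetical; the fan-out here is sequential for brevity, whereas a real deployment would query shards concurrently.

```python
import hashlib
import heapq
from typing import Callable, List, Sequence, Tuple, TypeVar

Shard = TypeVar("Shard")
Hit = Tuple[float, str]  # (similarity score, chunk id)


def shard_for(chunk_id: str, num_shards: int) -> int:
    """Pick the shard that stores a chunk, using a stable hash of its id."""
    digest = hashlib.sha256(chunk_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def scatter_gather(
    shards: Sequence[Shard],
    query_fn: Callable[[Shard, int], List[Hit]],  # assumed per-shard top-k search
    top_k: int,
) -> List[Hit]:
    """Ask every shard for its local top-k, then merge into a global top-k."""
    partials = [query_fn(shard, top_k) for shard in shards]  # scatter
    merged = (hit for hits in partials for hit in hits)      # gather
    return heapq.nlargest(top_k, merged, key=lambda hit: hit[0])
```

Requesting the full `top_k` from every shard guarantees the merged result matches what a single index would return, at the cost of some over-fetching.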

7.2 Async Job Queue

Priority: MEDIUM
Current Gap: Jobs executed synchronously within coordinator
Suggestion: Persistent job queue (e.g., Redis, SQS) for durability
Benefit: Survive coordinator restarts, enable distributed workers
Implementation: Queue-based job distribution to worker pool
Complexity: Medium
Reference: ADR-029

7.3 Multi-Tenant Resource Isolation

Priority: HIGH
Current Gap: Shared resources across tenants
Suggestion: Tenant-level resource quotas and isolation
Benefit: Noisy neighbor prevention, predictable performance
Implementation: Kubernetes resource quotas, tenant-aware scheduling
Complexity: High
Reference: ADR-027

7.4 Cold Storage Tiering

Priority: LOW
Current Gap: All hierarchy levels in hot storage
Suggestion: Move old/infrequently accessed data to cold storage
Benefit: 70% storage cost reduction for archival corpora
Implementation: Access-based tiering policy, lazy rehydration
Complexity: Medium
Reference: ADR-030

8. INTEGRATION OPPORTUNITIES

8.1 MCP Server for Corpus Tools

Priority: HIGH
Current Gap: Corpus tools only available within Coditect
Suggestion: Expose corpus commands as MCP server for external AI agents
Benefit: Enable Claude, GPT, etc. to query Coditect corpora
Implementation: MCP protocol wrapper around corpus services
Complexity: Medium
Reference: ADR-031

8.2 Webhook Notifications

Priority: MEDIUM
Current Gap: No event notifications for external systems
Suggestion: Webhook callbacks for job completion, errors, audit events
Benefit: Integration with Slack, PagerDuty, SIEM systems
Implementation: Configurable webhook endpoints per event type
Complexity: Low
Reference: ADR-029, ADR-032

8.3 Google Drive / SharePoint Connectors

Priority: MEDIUM
Current Gap: Manual document upload only
Suggestion: Native connectors for common document repositories
Benefit: Automatic corpus sync from existing document stores
Implementation: OAuth-based connectors with incremental sync
Complexity: Medium
Reference: ADR-028

8.4 Export to Knowledge Bases

Priority: LOW
Current Gap: Export limited to JSON/PDF
Suggestion: Export to Notion, Confluence, SharePoint
Benefit: Integrate analysis results into existing workflows
Implementation: API integrations for popular KB platforms
Complexity: Medium
Reference: ADR-030

8.5 BI Tool Integration

Priority: MEDIUM
Current Gap: No structured data export for analytics
Suggestion: Export extracted entities/metrics to data warehouse
Benefit: Enable Tableau/PowerBI dashboards on corpus insights
Implementation: Scheduled ETL to BigQuery/Snowflake
Complexity: Medium
Reference: ADR-029, ADR-030

9. ADVANCED FEATURES

9.1 Temporal Queries

Priority: MEDIUM
Current Gap: Queries against current corpus state only
Suggestion: Query corpus as of specific point in time
Benefit: Track how understanding evolved, audit historical queries
Implementation: Version tagging on hierarchy nodes, time-travel queries
Complexity: High
Reference: ADR-030

9.2 Cross-Corpus Analysis

Priority: HIGH
Current Gap: Analysis within single corpus
Suggestion: Map-reduce and RAG across multiple corpora simultaneously
Benefit: Compare analyses, aggregate insights across projects
Implementation: Multi-corpus query routing, federated reduction
Complexity: High
Reference: ADR-029, ADR-031

9.3 Continuous Corpus Updates

Priority: MEDIUM
Current Gap: Batch ingestion model
Suggestion: Streaming ingestion with real-time hierarchy updates
Benefit: Always-current corpus without reprocessing
Implementation: Change data capture → incremental update pipeline
Complexity: High
Reference: ADR-030

9.4 Natural Language Pipeline Config

Priority: LOW
Current Gap: Pipeline configured via commands/YAML
Suggestion: Describe desired analysis in natural language; system generates pipeline
Benefit: Zero-config corpus analysis for non-technical users
Implementation: LLM-based pipeline generator from description
Complexity: High
Reference: ADR-027

9.5 Automated Schema Discovery

Priority: MEDIUM
Current Gap: Extraction schema defined manually
Suggestion: Analyze sample documents to suggest extraction schema
Benefit: Faster pipeline setup, discover unexpected patterns
Implementation: Few-shot schema inference from document samples
Complexity: Medium
Reference: ADR-028

10. COMPLIANCE ENHANCEMENTS

10.1 GDPR Right to Erasure Automation

Priority: HIGH
Current Gap: Manual document removal process
Suggestion: Automated erasure workflow with cascade through hierarchy
Benefit: GDPR compliance with audit trail
Implementation: Erasure request → identify affected nodes → cascade delete → audit
Complexity: Medium
Reference: ADR-030, ADR-032
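
The request → identify → cascade delete → audit flow in 10.1 could be sketched as a graph traversal. The example assumes each hierarchy node records the ids it was derived from; `affected_nodes`, `erase_document`, and the dict-based store are hypothetical stand-ins for the real hierarchy storage.

```python
from collections import deque
from typing import Dict, List, Set


def affected_nodes(document_id: str, derived_from: Dict[str, List[str]]) -> Set[str]:
    """Return every node that transitively depends on `document_id`.

    `derived_from` maps a node id to the ids of the nodes it was built from.
    """
    dependents: Dict[str, List[str]] = {}
    for node, sources in derived_from.items():
        for source in sources:
            dependents.setdefault(source, []).append(node)
    seen: Set[str] = set()
    queue = deque([document_id])
    while queue:  # breadth-first walk up the hierarchy
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def erase_document(
    document_id: str,
    store: Dict[str, str],
    derived_from: Dict[str, List[str]],
    audit_log: List[dict],
) -> Set[str]:
    """Delete the document and all derived nodes, recording one audit entry per node."""
    targets = affected_nodes(document_id, derived_from) | {document_id}
    for node_id in sorted(targets):
        store.pop(node_id, None)
        audit_log.append({"action": "erase", "node": node_id, "request": document_id})
    return targets
```

Summaries that also draw on other documents would probably need to be regenerated from the remaining sources after deletion rather than simply dropped.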

10.2 Data Retention Policies

Priority: MEDIUM
Current Gap: No automatic data lifecycle management
Suggestion: Configurable retention policies with automatic archival/deletion
Benefit: Compliance with data minimization requirements
Implementation: Policy engine evaluating document age/access patterns
Complexity: Medium
Reference: ADR-032

10.3 Consent Tracking

Priority: MEDIUM
Current Gap: No tracking of data subject consent
Suggestion: Link documents to consent records; filter queries by consent scope
Benefit: Process data only within consented purposes
Implementation: Consent metadata on documents, query-time filtering
Complexity: High
Reference: ADR-032

10.4 Regulatory Report Generation

Priority: LOW
Current Gap: Manual report assembly for audits
Suggestion: Pre-built report templates for FDA, HIPAA, SOC2 audits
Benefit: Faster audit response, consistent formatting
Implementation: Report generators pulling from audit trail + hierarchy
Complexity: Medium
Reference: ADR-032

11. OBSERVABILITY

11.1 Distributed Tracing

Priority: HIGH
Current Gap: Limited request tracing
Suggestion: OpenTelemetry tracing across all services
Benefit: End-to-end latency visibility, bottleneck identification
Implementation: Instrument all services with OTEL SDK
Complexity: Medium
Reference: All ADRs

11.2 Token Usage Dashboard

Priority: HIGH
Current Gap: Token metrics in logs only
Suggestion: Real-time dashboard showing token consumption by job, agent, phase
Benefit: Cost visibility, budget tracking
Implementation: Metrics → Prometheus → Grafana dashboard
Complexity: Low
Reference: ADR-029

11.3 Quality Metrics Dashboard

Priority: MEDIUM
Current Gap: No visibility into output quality
Suggestion: Dashboard showing citation accuracy, confidence distributions, hallucination rates
Benefit: Proactive quality monitoring
Implementation: Sample evaluation pipeline feeding metrics dashboard
Complexity: High
Reference: ADR-031

11.4 Agent Decision Logging

Priority: MEDIUM
Current Gap: Limited visibility into agent reasoning
Suggestion: Structured logging of agent decisions, tool selections, strategy choices
Benefit: Debugging, behavior analysis, prompt optimization
Implementation: Standardized decision log format, searchable index
Complexity: Low
Reference: ADR-029, ADR-031

12. RESEARCH DIRECTIONS

12.1 Fine-Tuned Extraction Models

Priority: LOW
Current Gap: General-purpose LLMs for all extraction
Suggestion: Fine-tune smaller models for specific extraction patterns
Benefit: 10x cost reduction, faster inference for common patterns
Implementation: Collect extraction examples → fine-tune Haiku/Llama
Complexity: High
Reference: ADR-028

12.2 Learned Chunk Boundaries

Priority: LOW
Current Gap: Fixed chunking strategies
Suggestion: Train model to identify optimal chunk boundaries for retrieval
Benefit: Better retrieval precision, fewer incomplete contexts
Implementation: Contrastive learning on retrieval success signals
Complexity: Very High
Reference: ADR-030

12.3 Reinforcement Learning for Strategy Selection

Priority: LOW
Current Gap: Rule-based retrieval strategy selection
Suggestion: RL agent learning optimal strategy from query outcomes
Benefit: Adaptive strategy selection improving over time
Implementation: Bandit/RL on strategy choices with quality reward
Complexity: Very High
Reference: ADR-031

12.4 Multimodal Corpus Processing

Priority: MEDIUM
Current Gap: Text-only processing
Suggestion: Native support for images, diagrams, tables in documents
Benefit: Process technical documents with figures, charts
Implementation: Vision model integration for image understanding
Complexity: High
Reference: ADR-028

PRIORITIZATION MATRIX

Immediate (Sprint 13-14)

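| Suggestion | Impact | Effort | ROI |
|---|---|---|---|
| Embedding cache with TTL | High | Low | ★★★★★ |
| Batch embedding requests | High | Low | ★★★★★ |
| Tiered model selection | High | Medium | ★★★★☆ |
| Interactive debugging console | High | Medium | ★★★★☆ |
| Prompt injection detection | High | Medium | ★★★★☆ |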
Prompt injection detectionHighMedium★★★★☆

Near-Term (Sprint 15-18)

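| Suggestion | Impact | Effort | ROI |
|---|---|---|---|
| Hallucination detection layer | High | Medium | ★★★★☆ |
| Distributed tracing | High | Medium | ★★★★☆ |
| Token usage dashboard | High | Low | ★★★★★ |
| MCP server for corpus tools | High | Medium | ★★★★☆ |
| Cross-corpus analysis | High | High | ★★★☆☆ |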
Cross-corpus analysisHighHigh★★★☆☆

Medium-Term (Sprint 19-24)

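| Suggestion | Impact | Effort | ROI |
|---|---|---|---|
| Visual pipeline builder | Medium | High | ★★★☆☆ |
| Sharded vector index | High | High | ★★★☆☆ |
| Temporal queries | Medium | High | ★★☆☆☆ |
| GDPR erasure automation | High | Medium | ★★★★☆ |
| Multimodal corpus processing | Medium | High | ★★★☆☆ |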
Multimodal corpus processingMediumHigh★★★☆☆

Long-Term Research

| Suggestion | Impact | Effort | ROI |
|---|---|---|---|
| Fine-tuned extraction models | High | Very High | ★★☆☆☆ |
| Learned chunk boundaries | Medium | Very High | ★☆☆☆☆ |
| RL for strategy selection | Medium | Very High | ★☆☆☆☆ |

SUMMARY

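| Category | Count | High Priority |
|---|---|---|
| Performance | 5 | 3 |
| Cost Reduction | 4 | 1 |
| Quality | 5 | 1 |
| Developer Experience | 5 | 2 |
| Operations | 5 | 2 |
| Security | 4 | 2 |
| Scalability | 4 | 2 |
| Integration | 5 | 1 |
| Advanced Features | 5 | 1 |
| Compliance | 4 | 1 |
| Observability | 4 | 2 |
| Research | 4 | 0 |
| Total | 54 | 18 |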

Recommended Next Steps:

  1. Implement embedding cache + batch requests (Quick wins, Sprint 13)
  2. Add tiered model routing (Major cost reduction, Sprint 13-14)
  3. Build token usage dashboard (Operational visibility, Sprint 14)
  4. Integrate prompt injection detection (Security, Sprint 14)
  5. Add distributed tracing (Foundation for debugging, Sprint 15)