Corpus Processing Subsystem: Suggestions for Further Improvement
Executive Summary
The ADR-027 through ADR-032 architecture provides a solid foundation for enterprise corpus processing. This document identifies 54 improvement opportunities across 12 categories, prioritized by impact and implementation complexity.
1. PERFORMANCE OPTIMIZATIONS
1.1 Streaming Pipeline Architecture
| Priority | HIGH |
|---|---|
| Current Gap | Documents fully loaded into memory before processing |
| Suggestion | Implement streaming/chunked processing for large documents |
| Benefit | Handle 100MB+ documents without memory pressure |
| Implementation | Use async generators, process document in 64KB chunks |
| Complexity | Medium |
| Reference | ADR-028, ADR-029 |
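A minimal sketch of the chunked approach, assuming the document is available as a binary file-like object. The names `iter_chunks` and `count_chunks` are hypothetical and do not come from ADR-028 or ADR-029; a real mapper would replace the counting consumer.

```python
import asyncio
import io
from typing import AsyncIterator, BinaryIO

CHUNK_SIZE = 64 * 1024  # 64KB, as suggested in the table above


async def iter_chunks(stream: BinaryIO, chunk_size: int = CHUNK_SIZE) -> AsyncIterator[bytes]:
    """Yield the document chunk by chunk instead of loading it whole."""
    while True:
        # Run the blocking read in a worker thread so the event loop stays responsive.
        chunk = await asyncio.to_thread(stream.read, chunk_size)
        if not chunk:
            break
        yield chunk


async def count_chunks(stream: BinaryIO) -> tuple[int, int]:
    """Hypothetical consumer: returns (chunk count, total bytes)."""
    chunks = total = 0
    async for chunk in iter_chunks(stream):
        chunks += 1
        total += len(chunk)
    return chunks, total


# Usage: a 150KB in-memory stand-in for a large document.
document = io.BytesIO(b"x" * (150 * 1024))
chunk_count, total_bytes = asyncio.run(count_chunks(document))
```

Fixed-size byte chunks can split a multi-byte character or a sentence, so a production pipeline would likely re-align chunks on text boundaries before handing them to a mapper.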
1.2 Speculative Execution for Map-Reduce
| Priority | MEDIUM |
|---|---|
| Current Gap | Slow mappers block overall job completion |
| Suggestion | Implement speculative execution: launch duplicate tasks for stragglers and keep whichever copy finishes first |
| Benefit | Reduce P99 latency by 30-50% for large jobs |
| Implementation | Monitor task duration percentiles; launch a duplicate of any task whose runtime exceeds the P90 duration |
| Complexity | High |
| Reference | ADR-029 |
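The straggler check could look roughly like the sketch below. It assumes the coordinator tracks the durations of completed tasks and the elapsed time of running ones; `find_stragglers` and its arguments are hypothetical names, and launching the duplicate task is left out.

```python
import statistics


def p90(durations: list[float]) -> float:
    """90th percentile of completed task durations (needs at least 2 samples)."""
    # With n=10, quantiles() returns 9 cut points; the last one is the P90.
    return statistics.quantiles(durations, n=10, method="inclusive")[-1]


def find_stragglers(completed: list[float], running: dict[str, float]) -> list[str]:
    """Return IDs of running tasks whose elapsed time exceeds the P90 of completed tasks."""
    if len(completed) < 2:
        return []  # not enough history to estimate a percentile
    threshold = p90(completed)
    return sorted(task_id for task_id, elapsed in running.items() if elapsed > threshold)


# Usage: durations in seconds.
completed = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
running = {"map-17": 5.0, "map-21": 12.0, "map-30": 20.0}
candidates = find_stragglers(completed, running)
```

Because the original and the duplicate may both finish, mapper outputs need to be idempotent or deduplicated by task ID.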
1.3 Embedding Cache with TTL
| Priority | HIGH |
|---|---|
| Current Gap | Identical content is re-embedded on every ingestion |
| Suggestion | Content-addressable embedding cache with configurable TTL |
| Benefit | 50-80% reduction in embedding API calls for overlapping corpora |
| Implementation | SHA-256 content hash → embedding lookup before API call |
| Complexity | Low |
| Reference | ADR-030, ADR-031 |
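A minimal in-memory sketch of the hash-then-lookup flow. `EmbeddingCache` and `embed_fn` are hypothetical names; `embed_fn` stands in for whatever embedding API client the system uses, and a shared store would be needed in a real deployment.

```python
import hashlib
import time
from typing import Callable


class EmbeddingCache:
    """Content-addressable cache: SHA-256 of the text maps to its embedding."""

    def __init__(
        self,
        embed_fn: Callable[[str], list[float]],
        ttl_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._embed_fn = embed_fn
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: dict[str, tuple[float, list[float]]] = {}
        self.api_calls = 0  # counts cache misses, for illustration only

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, text: str) -> list[float]:
        key = self._key(text)
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # fresh hit: no API call
        self.api_calls += 1
        vector = self._embed_fn(text)
        self._store[key] = (now, vector)
        return vector
```

The key here covers content only. If the embedding model can change, including the model name in the hashed key avoids serving vectors from a different model.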
1.4 Query Result Caching
| Priority | MEDIUM |
|---|---|
| Current Gap | Identical RAG queries re-execute full pipeline |
| Suggestion | LRU cache for query results with cache invalidation on corpus update |
| Benefit | Sub-10ms response for repeated queries |
| Implementation | Query hash → cached response, invalidate on corpus version change |
| Complexity | Low |
| Reference | ADR-031 |
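One way to sketch this: include the corpus version in the cache key, so entries from an older version are never hit again and simply age out of the LRU. `QueryCache` and the whitespace/case normalization are assumptions for illustration, not something ADR-031 specifies.

```python
import hashlib
from collections import OrderedDict
from typing import Optional


class QueryCache:
    """LRU cache keyed on (corpus version, normalized query)."""

    def __init__(self, max_entries: int = 1024) -> None:
        self._max = max_entries
        self._entries: "OrderedDict[str, str]" = OrderedDict()

    @staticmethod
    def _key(query: str, corpus_version: str) -> str:
        normalized = " ".join(query.lower().split())
        raw = f"{corpus_version}\x00{normalized}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def get(self, query: str, corpus_version: str) -> Optional[str]:
        key = self._key(query, corpus_version)
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, query: str, corpus_version: str, response: str) -> None:
        key = self._key(query, corpus_version)
        self._entries[key] = response
        self._entries.move_to_end(key)
        if len(self._entries) > self._max:
            self._entries.popitem(last=False)  # evict least recently used
```

Keying on the version avoids an explicit purge step when the corpus changes, at the cost of stale entries occupying space until they are evicted.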
1.5 Batch Embedding Requests
| Priority | HIGH |
|---|---|
| Current Gap | Single embedding per API call |
| Suggestion | Batch up to 100 chunks per embedding API request |
| Benefit | 10x or greater reduction in API round-trips (up to 100x with full batches), lower latency |
| Implementation | Accumulator with flush on size/time threshold |
| Complexity | Low |
| Reference | ADR-030 |
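A sketch of the accumulator, assuming a synchronous caller that invokes `tick()` periodically. `BatchAccumulator`, `flush_fn`, and the 0.5-second default are hypothetical; `flush_fn` stands in for the batched embedding API call.

```python
import time
from typing import Callable, Optional


class BatchAccumulator:
    """Collect chunks and flush when the batch is full or has waited too long."""

    def __init__(
        self,
        flush_fn: Callable[[list[str]], None],
        max_size: int = 100,
        max_wait_seconds: float = 0.5,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._flush_fn = flush_fn
        self._max_size = max_size
        self._max_wait = max_wait_seconds
        self._clock = clock
        self._pending: list[str] = []
        self._first_added_at: Optional[float] = None

    def add(self, chunk: str) -> None:
        if not self._pending:
            self._first_added_at = self._clock()  # start the wait timer
        self._pending.append(chunk)
        if len(self._pending) >= self._max_size:
            self.flush()  # size threshold reached

    def tick(self) -> None:
        """Flush a partial batch once it has waited max_wait_seconds."""
        if self._pending and self._clock() - self._first_added_at >= self._max_wait:
            self.flush()

    def flush(self) -> None:
        if self._pending:
            batch, self._pending = self._pending, []
            self._first_added_at = None
            self._flush_fn(batch)
```

The time threshold bounds how long a chunk can wait when traffic is light, so small ingestions are not delayed indefinitely waiting for a full batch.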
2. COST REDUCTION
2.1 Tiered Model Selection
| Priority | HIGH |
|---|---|
| Current Gap | Single model (Opus/Sonnet) for all operations |
| Suggestion | Use cheaper models for simpler tasks (Haiku for extraction, Sonnet for synthesis, Opus for complex reasoning) |
| Benefit | 60-80% cost reduction with <5% quality loss |
| Implementation | Task complexity classifier → model router |
| Complexity | Medium |
| Reference | ADR-027, ADR-029 |
```python
# Task type -> model tier (tier aliases, not exact model IDs)
MODEL_ROUTING = {
    "extraction": "claude-haiku",      # Simple pattern matching
    "summarization": "claude-sonnet",  # Standard synthesis
    "analysis": "claude-sonnet",       # Moderately complex analysis
    "synthesis": "claude-opus",        # Complex reasoning, final reports only
}
```
2.2 Adaptive Chunk Sizing
| Priority | MEDIUM |
|---|---|
| Current Gap | Fixed chunk size regardless of content density |
| Suggestion | Dynamically size chunks based on information density |
| Benefit | 20-40% token reduction for sparse documents |
| Implementation | Entropy-based chunk boundary detection |
| Complexity | High |
| Reference | ADR-028, ADR-030 |
2.3 Early Termination for Simple Queries
| Priority | MEDIUM |
|---|---|
| Current Gap | Full retrieval pipeline for all queries |
| Suggestion | Short-circuit for queries answerable from master/section summaries |
| Benefit | 70% cost reduction for simple factual queries |
| Implementation | Confidence threshold on high-level retrieval; skip drill-down if sufficient |
| Complexity | Medium |
| Reference | ADR-031 |
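The short-circuit can be sketched as a threshold check on the summary-level retrieval. `Retrieval`, `answer_context`, the two callables, and the 0.8 threshold are all hypothetical; the sketch assumes the retriever reports a confidence in [0, 1].

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Retrieval:
    passages: list[str]
    confidence: float  # assumed to be in [0, 1]


def answer_context(
    query: str,
    summary_search: Callable[[str], Retrieval],
    drill_down: Callable[[str], Retrieval],
    threshold: float = 0.8,
) -> tuple[Retrieval, bool]:
    """Return (retrieval, drilled_down); skip drill-down when summaries suffice."""
    top_level = summary_search(query)
    if top_level.confidence >= threshold:
        return top_level, False  # cheap path: master/section summaries only
    return drill_down(query), True  # full pipeline for harder queries
```

The threshold is only meaningful if the confidence score tracks actual answer quality, which ties this item to the calibration work in section 3.2.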
2.4 Compression of Stored Summaries
| Priority | LOW |
|---|---|
| Current Gap | Summaries stored as plain text |
| Suggestion | LZ4/Zstd compression for hierarchy nodes |
| Benefit | 60-70% storage reduction |
| Implementation | Compress on write, decompress on read (transparent) |
| Complexity | Low |
| Reference | ADR-030 |
3. QUALITY & ACCURACY
3.1 Hallucination Detection Layer
| Priority | HIGH |
|---|---|
| Current Gap | Self-correction relies on citation validation only |
| Suggestion | Dedicated hallucination detection using entailment models |
| Benefit | Catch claims not supported by any retrieved context |
| Implementation | NLI model (e.g., DeBERTa) scores claim↔context entailment |
| Complexity | Medium |
| Reference | ADR-031 |
3.2 Confidence Calibration
| Priority | MEDIUM |
|---|---|
| Current Gap | Confidence scores not calibrated to actual accuracy |
| Suggestion | Calibrate confidence using held-out evaluation set |
| Benefit | Reliable confidence for downstream decision-making |
| Implementation | Platt scaling or isotonic regression on confidence scores |
| Complexity | Medium |
| Reference | ADR-031 |
3.3 Multi-Model Consensus
| Priority | LOW |
|---|---|
| Current Gap | Single model for generation |
| Suggestion | For high-stakes queries, generate with multiple models and compare |
| Benefit | Higher confidence when models agree; flag disagreements |
| Implementation | Parallel generation, semantic similarity check on outputs |
| Complexity | High |
| Reference | ADR-031 |
3.4 Source Quality Scoring
| Priority | MEDIUM |
|---|---|
| Current Gap | All source documents treated equally |
| Suggestion | Score documents by recency, authority, relevance; weight in retrieval |
| Benefit | Prefer authoritative sources over outdated/tertiary |
| Implementation | Metadata-based scoring multiplier on retrieval scores |
| Complexity | Low |
| Reference | ADR-030, ADR-031 |
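A sketch of a metadata-based multiplier. The `Hit` fields, the one-year recency half-life, and the 0.5/0.25/0.25 weighting are illustrative assumptions and would need tuning against real retrieval data.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    doc_id: str
    similarity: float  # retrieval score from the vector index
    age_days: float    # time since the document was last updated
    authority: float   # 0.0 (tertiary) to 1.0 (authoritative), from metadata


def quality_multiplier(hit: Hit, half_life_days: float = 365.0) -> float:
    """Map recency and authority to a multiplier in [0.5, 1.0]."""
    recency = 0.5 ** (hit.age_days / half_life_days)  # halves every half-life
    return 0.5 + 0.25 * recency + 0.25 * hit.authority


def rerank(hits: list[Hit]) -> list[Hit]:
    """Order hits by similarity weighted by source quality."""
    return sorted(hits, key=lambda h: h.similarity * quality_multiplier(h), reverse=True)
```

Keeping the multiplier bounded below at 0.5 means a highly relevant but old document can still surface; it is demoted rather than excluded.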
3.5 Fact Verification Pipeline
| Priority | MEDIUM |
|---|---|
| Current Gap | No external fact-checking |
| Suggestion | For factual claims, optionally verify against external sources |
| Benefit | Catch corpus errors or outdated information |
| Implementation | Web search verification for high-importance claims |
| Complexity | High |
| Reference | ADR-031 |
4. DEVELOPER EXPERIENCE
4.1 Visual Pipeline Builder
| Priority | MEDIUM |
|---|---|
| Current Gap | Pipeline configuration via code/YAML only |
| Suggestion | Drag-and-drop visual builder for corpus processing pipelines |
| Benefit | Non-developers can configure analysis workflows |
| Implementation | React-based flow editor → generates pipeline config |
| Complexity | High |
| Reference | All ADRs |
4.2 Interactive Debugging Console
| Priority | HIGH |
|---|---|
| Current Gap | Limited visibility into pipeline execution |
| Suggestion | Real-time console showing agent decisions, tool calls, checkpoints |
| Benefit | Faster debugging, better understanding of system behavior |
| Implementation | WebSocket stream of execution events to UI |
| Complexity | Medium |
| Reference | ADR-029 |
4.3 Prompt Playground Integration
| Priority | MEDIUM |
|---|---|
| Current Gap | Mapper/reducer prompts edited in config files |
| Suggestion | Integrated prompt playground with test execution |
| Benefit | Iterate on prompts with immediate feedback |
| Implementation | Monaco editor + sample document execution |
| Complexity | Medium |
| Reference | ADR-029 |
4.4 Schema Builder with Validation
| Priority | HIGH |
|---|---|
| Current Gap | Extraction schemas defined manually |
| Suggestion | Visual schema builder with JSON Schema validation |
| Benefit | Reduce schema errors, faster iteration |
| Implementation | Form-based builder → JSON Schema output |
| Complexity | Medium |
| Reference | ADR-028, ADR-029 |
4.5 CLI Tool for Local Testing
| Priority | MEDIUM |
|---|---|
| Current Gap | Must deploy to test pipelines |
| Suggestion | coditect-corpus CLI for local pipeline testing |
| Benefit | Rapid development iteration without deployment |
| Implementation | Python CLI wrapping core services |
| Complexity | Medium |
| Reference | All ADRs |
5. OPERATIONAL EXCELLENCE
5.1 Automated Quality Regression Tests
| Priority | HIGH |
|---|---|
| Current Gap | No automated quality monitoring |
| Suggestion | Golden dataset with expected outputs; run on each deployment |
| Benefit | Catch quality regressions before production |
| Implementation | CI/CD step comparing outputs to golden set |
| Complexity | Medium |
| Reference | ADR-031 |
5.2 Cost Anomaly Detection
| Priority | HIGH |
|---|---|
| Current Gap | Token costs monitored but no anomaly alerting |
| Suggestion | ML-based anomaly detection on token usage patterns |
| Benefit | Catch runaway costs early (e.g., infinite loops, prompt injection) |
| Implementation | Time-series anomaly detection on per-job token consumption |
| Complexity | Medium |
| Reference | ADR-029 |
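As a starting point before any ML model, a plain z-score check on per-job token counts illustrates the idea. This is a simpler statistical stand-in for the ML-based detector suggested above; `is_anomalous` and its thresholds are hypothetical.

```python
import statistics


def is_anomalous(
    history: list[float],
    current: float,
    z_threshold: float = 3.0,
    min_samples: int = 5,
) -> bool:
    """Flag a job whose token usage is far above the historical mean."""
    if len(history) < min_samples:
        return False  # too little history to judge
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return current > mean  # any increase over a perfectly flat history
    return (current - mean) / stdev > z_threshold


# Usage: token counts from previous runs of the same job type.
history = [1000.0, 1100.0, 950.0, 1050.0, 1020.0, 980.0]
runaway = is_anomalous(history, 5000.0)
normal = is_anomalous(history, 1080.0)
```

Only upward deviations are flagged, since the stated goal is catching runaway costs. Comparing against jobs of the same type and corpus size keeps the baseline meaningful.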
5.3 Graceful Degradation Modes
| Priority | MEDIUM |
|---|---|
| Current Gap | System fails if any dependency unavailable |
| Suggestion | Define degraded modes (e.g., skip reranking if reranker down) |
| Benefit | Maintain partial availability during outages |
| Implementation | Feature flags per capability, fallback behaviors |
| Complexity | Medium |
| Reference | All ADRs |
5.4 Canary Deployments for Prompts
| Priority | MEDIUM |
|---|---|
| Current Gap | Prompt changes deployed to all traffic |
| Suggestion | Route percentage of traffic to new prompts, compare metrics |
| Benefit | Safe prompt iteration with rollback capability |
| Implementation | A/B routing with metric collection |
| Complexity | High |
| Reference | ADR-029 |
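The traffic split can be sketched with a stable hash of a request or job identifier, so the same ID always sees the same prompt version. `prompt_variant` and the ID format are hypothetical; metric collection and rollback are not shown.

```python
import hashlib


def prompt_variant(request_id: str, canary_percent: float) -> str:
    """Deterministically assign a request to 'canary' or 'stable'."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    # canary_percent is 0..100; each bucket represents 0.01% of traffic.
    return "canary" if bucket < canary_percent * 100 else "stable"
```

Hashing instead of random sampling keeps assignments reproducible, which makes it easier to compare metrics for the same job across retries.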
5.5 Automated Capacity Planning
| Priority | LOW |
|---|---|
| Current Gap | Manual capacity estimation |
| Suggestion | Predict resource needs from corpus size and job configuration |
| Benefit | Right-size infrastructure, avoid over/under-provisioning |
| Implementation | Regression model on historical job metrics |
| Complexity | Medium |
| Reference | ADR-029 |
6. SECURITY ENHANCEMENTS
6.1 Prompt Injection Detection
| Priority | HIGH |
|---|---|
| Current Gap | User documents could contain prompt injections |
| Suggestion | Scan ingested documents for prompt injection patterns |
| Benefit | Prevent adversarial content from manipulating agents |
| Implementation | Pattern matching + classifier for injection attempts |
| Complexity | Medium |
| Reference | ADR-028 |
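The pattern-matching half could start from something like the sketch below. The patterns are illustrative examples only, not a vetted rule set, and `scan_for_injection` is a hypothetical name; the classifier mentioned above would handle phrasing these patterns miss.

```python
import re

# Illustrative patterns only; a real deployment would maintain and test a larger set.
INJECTION_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?(previous|prior|above)\s+instructions", re.IGNORECASE),
    re.compile(r"disregard\s+(the\s+)?(system|previous)\s+prompt", re.IGNORECASE),
    re.compile(r"you\s+are\s+now\s+(in\s+)?\w+\s+mode", re.IGNORECASE),
    re.compile(r"reveal\s+(your\s+)?(system\s+prompt|hidden\s+instructions)", re.IGNORECASE),
]


def scan_for_injection(text: str) -> list[str]:
    """Return the snippets of text that match a known injection pattern."""
    return [
        match.group(0)
        for pattern in INJECTION_PATTERNS
        for match in pattern.finditer(text)
    ]
```

Matches are better treated as a signal for quarantine or review than as proof of an attack, since legitimate documents (security guides, for example) can contain these phrases.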
6.2 Output Sanitization
| Priority | HIGH |
|---|---|
| Current Gap | Generated outputs not sanitized |
| Suggestion | Sanitize outputs to prevent XSS, code injection in downstream use |
| Benefit | Safe integration with web UIs and downstream systems |
| Implementation | HTML escaping, code block sandboxing |
| Complexity | Low |
| Reference | ADR-031 |
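For the HTML-escaping part, Python's standard library already covers the basic case. The function name `sanitize_for_html` is hypothetical, and code block sandboxing is not shown.

```python
import html


def sanitize_for_html(generated: str) -> str:
    """Escape model output before embedding it in an HTML page."""
    # quote=True also escapes single and double quotes, which matters
    # when the text ends up inside an HTML attribute.
    return html.escape(generated, quote=True)
```

Escaping should happen at the point of rendering, for the specific output context (HTML body, attribute, JSON), not once at generation time.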
6.3 Data Residency Controls
| Priority | MEDIUM |
|---|---|
| Current Gap | No geographic data residency enforcement |
| Suggestion | Tag corpora with residency requirements; route to compliant regions |
| Benefit | GDPR, data sovereignty compliance |
| Implementation | Metadata-based routing to regional deployments |
| Complexity | High |
| Reference | ADR-032 |
6.4 Differential Privacy for Aggregations
| Priority | LOW |
|---|---|
| Current Gap | Aggregated outputs could leak individual data points |
| Suggestion | Apply differential privacy to statistical outputs |
| Benefit | Prevent membership inference attacks |
| Implementation | Noise injection in aggregate statistics |
| Complexity | High |
| Reference | ADR-029, ADR-032 |
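For a simple count query, noise injection could follow the Laplace mechanism, sketched below. `private_count` is a hypothetical name, and the sketch assumes each individual contributes at most 1 to the count (sensitivity 1). The `random` module is used for illustration; a production mechanism should use a cryptographically secure noise source and track the privacy budget across queries.

```python
import random
from typing import Optional


def private_count(true_count: int, epsilon: float, rng: Optional[random.Random] = None) -> float:
    """Return the count plus Laplace noise with scale sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    scale = 1.0 / epsilon  # sensitivity of a count query is 1
    # The difference of two i.i.d. exponential samples is Laplace-distributed.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise
```

Smaller `epsilon` values give stronger privacy and noisier results, so the setting is a policy decision as much as a technical one.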
7. SCALABILITY
7.1 Sharded Vector Index
| Priority | HIGH |
|---|---|
| Current Gap | Single vector index per corpus |
| Suggestion | Shard vector index across multiple nodes for large corpora |
| Benefit | Handle 10M+ chunks with consistent latency |
| Implementation | Hash-based sharding with scatter-gather queries |
| Complexity | High |
| Reference | ADR-030, ADR-031 |
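A toy, single-process sketch of hash-based sharding with scatter-gather search. `ShardedIndex` is hypothetical, each shard is a plain dict with brute-force cosine similarity, and shards are queried sequentially here; a real deployment would place shards on separate nodes and query them in parallel.

```python
import hashlib
import heapq
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ShardedIndex:
    def __init__(self, num_shards: int) -> None:
        self._shards: list[dict[str, list[float]]] = [{} for _ in range(num_shards)]

    def _shard_for(self, chunk_id: str) -> int:
        digest = hashlib.sha256(chunk_id.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self._shards)

    def add(self, chunk_id: str, vector: list[float]) -> None:
        self._shards[self._shard_for(chunk_id)][chunk_id] = vector

    def search(self, query: list[float], k: int) -> list[tuple[str, float]]:
        partials: list[tuple[float, str]] = []
        for shard in self._shards:
            # Scatter: each shard computes its own local top-k.
            scored = ((cosine(query, vec), cid) for cid, vec in shard.items())
            partials.extend(heapq.nlargest(k, scored))
        # Gather: merge the local results into a global top-k.
        return [(cid, score) for score, cid in heapq.nlargest(k, partials)]
```

Each shard must return its full local top-k for the merged result to be exact, so query cost grows with the number of shards even when `k` is small.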
7.2 Async Job Queue
| Priority | MEDIUM |
|---|---|
| Current Gap | Jobs executed synchronously within coordinator |
| Suggestion | Persistent job queue (e.g., Redis, SQS) for durability |
| Benefit | Survive coordinator restarts, enable distributed workers |
| Implementation | Queue-based job distribution to worker pool |
| Complexity | Medium |
| Reference | ADR-029 |
7.3 Multi-Tenant Resource Isolation
| Priority | HIGH |
|---|---|
| Current Gap | Shared resources across tenants |
| Suggestion | Tenant-level resource quotas and isolation |
| Benefit | Noisy neighbor prevention, predictable performance |
| Implementation | Kubernetes resource quotas, tenant-aware scheduling |
| Complexity | High |
| Reference | ADR-027 |
7.4 Cold Storage Tiering
| Priority | LOW |
|---|---|
| Current Gap | All hierarchy levels in hot storage |
| Suggestion | Move old/infrequently accessed data to cold storage |
| Benefit | 70% storage cost reduction for archival corpora |
| Implementation | Access-based tiering policy, lazy rehydration |
| Complexity | Medium |
| Reference | ADR-030 |
8. INTEGRATION OPPORTUNITIES
8.1 MCP Server for Corpus Tools
| Priority | HIGH |
|---|---|
| Current Gap | Corpus tools only available within Coditect |
| Suggestion | Expose corpus commands as MCP server for external AI agents |
| Benefit | Enable Claude, GPT, etc. to query Coditect corpora |
| Implementation | MCP wrapper around corpus services |
| Complexity | Medium |
| Reference | ADR-031 |
8.2 Webhook Notifications
| Priority | MEDIUM |
|---|---|
| Current Gap | No event notifications for external systems |
| Suggestion | Webhook callbacks for job completion, errors, audit events |
| Benefit | Integration with Slack, PagerDuty, SIEM systems |
| Implementation | Configurable webhook endpoints per event type |
| Complexity | Low |
| Reference | ADR-029, ADR-032 |
8.3 Google Drive / SharePoint Connectors
| Priority | MEDIUM |
|---|---|
| Current Gap | Manual document upload only |
| Suggestion | Native connectors for common document repositories |
| Benefit | Automatic corpus sync from existing document stores |
| Implementation | OAuth-based connectors with incremental sync |
| Complexity | Medium |
| Reference | ADR-028 |
8.4 Export to Knowledge Bases
| Priority | LOW |
|---|---|
| Current Gap | Export limited to JSON/PDF |
| Suggestion | Export to Notion, Confluence, SharePoint |
| Benefit | Integrate analysis results into existing workflows |
| Implementation | API integrations for popular KB platforms |
| Complexity | Medium |
| Reference | ADR-030 |
8.5 BI Tool Integration
| Priority | MEDIUM |
|---|---|
| Current Gap | No structured data export for analytics |
| Suggestion | Export extracted entities/metrics to data warehouse |
| Benefit | Enable Tableau/Power BI dashboards on corpus insights |
| Implementation | Scheduled ETL to BigQuery/Snowflake |
| Complexity | Medium |
| Reference | ADR-029, ADR-030 |
9. ADVANCED FEATURES
9.1 Temporal Queries
| Priority | MEDIUM |
|---|---|
| Current Gap | Queries against current corpus state only |
| Suggestion | Query corpus as of specific point in time |
| Benefit | Track how understanding evolved, audit historical queries |
| Implementation | Version tagging on hierarchy nodes, time-travel queries |
| Complexity | High |
| Reference | ADR-030 |
9.2 Cross-Corpus Analysis
| Priority | HIGH |
|---|---|
| Current Gap | Analysis within single corpus |
| Suggestion | Map-reduce and RAG across multiple corpora simultaneously |
| Benefit | Compare analyses, aggregate insights across projects |
| Implementation | Multi-corpus query routing, federated reduction |
| Complexity | High |
| Reference | ADR-029, ADR-031 |
9.3 Continuous Corpus Updates
| Priority | MEDIUM |
|---|---|
| Current Gap | Batch ingestion model |
| Suggestion | Streaming ingestion with real-time hierarchy updates |
| Benefit | Always-current corpus without reprocessing |
| Implementation | Change data capture → incremental update pipeline |
| Complexity | High |
| Reference | ADR-030 |
9.4 Natural Language Pipeline Config
| Priority | LOW |
|---|---|
| Current Gap | Pipeline configured via commands/YAML |
| Suggestion | Describe desired analysis in natural language; system generates pipeline |
| Benefit | Zero-config corpus analysis for non-technical users |
| Implementation | LLM-based pipeline generator from description |
| Complexity | High |
| Reference | ADR-027 |
9.5 Automated Schema Discovery
| Priority | MEDIUM |
|---|---|
| Current Gap | Extraction schema defined manually |
| Suggestion | Analyze sample documents to suggest extraction schema |
| Benefit | Faster pipeline setup, discover unexpected patterns |
| Implementation | Few-shot schema inference from document samples |
| Complexity | Medium |
| Reference | ADR-028 |
10. COMPLIANCE ENHANCEMENTS
10.1 GDPR Right to Erasure Automation
| Priority | HIGH |
|---|---|
| Current Gap | Manual document removal process |
| Suggestion | Automated erasure workflow with cascade through hierarchy |
| Benefit | GDPR compliance with audit trail |
| Implementation | Erasure request → identify affected nodes → cascade delete → audit |
| Complexity | Medium |
| Reference | ADR-030, ADR-032 |
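The erasure workflow can be sketched over a flat map of hierarchy nodes, each recording which source documents it was derived from. `Node`, `erase_document`, and the audit record fields are hypothetical. The sketch also makes one assumption beyond the table: nodes derived from several documents are not deleted but marked stale for regeneration, since their summaries may still contain erased content.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Node:
    node_id: str
    source_docs: set[str]  # documents this chunk or summary was derived from


def erase_document(
    doc_id: str,
    nodes: dict[str, Node],
    audit_log: list[dict],
) -> tuple[list[str], list[str]]:
    """Cascade an erasure request; returns (deleted node IDs, stale node IDs)."""
    deleted: list[str] = []
    stale: list[str] = []
    for node_id in sorted(nodes):  # sorted() copies the keys, so deletion is safe
        node = nodes[node_id]
        if doc_id not in node.source_docs:
            continue
        node.source_docs.discard(doc_id)
        if node.source_docs:
            stale.append(node_id)    # must be re-summarized without doc_id
        else:
            del nodes[node_id]       # derived solely from the erased document
            deleted.append(node_id)
    audit_log.append({
        "event": "erasure",
        "doc_id": doc_id,
        "deleted": deleted,
        "stale": stale,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return deleted, stale
```

Cached embeddings and query results derived from the document would need the same treatment for the erasure to be complete.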
10.2 Data Retention Policies
| Priority | MEDIUM |
|---|---|
| Current Gap | No automatic data lifecycle management |
| Suggestion | Configurable retention policies with automatic archival/deletion |
| Benefit | Compliance with data minimization requirements |
| Implementation | Policy engine evaluating document age/access patterns |
| Complexity | Medium |
| Reference | ADR-032 |
10.3 Consent Tracking
| Priority | MEDIUM |
|---|---|
| Current Gap | No tracking of data subject consent |
| Suggestion | Link documents to consent records; filter queries by consent scope |
| Benefit | Process data only within consented purposes |
| Implementation | Consent metadata on documents, query-time filtering |
| Complexity | High |
| Reference | ADR-032 |
10.4 Regulatory Report Generation
| Priority | LOW |
|---|---|
| Current Gap | Manual report assembly for audits |
| Suggestion | Pre-built report templates for FDA, HIPAA, SOC 2 audits |
| Benefit | Faster audit response, consistent formatting |
| Implementation | Report generators pulling from audit trail + hierarchy |
| Complexity | Medium |
| Reference | ADR-032 |
11. OBSERVABILITY
11.1 Distributed Tracing
| Priority | HIGH |
|---|---|
| Current Gap | Limited request tracing |
| Suggestion | OpenTelemetry tracing across all services |
| Benefit | End-to-end latency visibility, bottleneck identification |
| Implementation | Instrument all services with the OpenTelemetry SDK |
| Complexity | Medium |
| Reference | All ADRs |
11.2 Token Usage Dashboard
| Priority | HIGH |
|---|---|
| Current Gap | Token metrics in logs only |
| Suggestion | Real-time dashboard showing token consumption by job, agent, phase |
| Benefit | Cost visibility, budget tracking |
| Implementation | Metrics → Prometheus → Grafana dashboard |
| Complexity | Low |
| Reference | ADR-029 |
11.3 Quality Metrics Dashboard
| Priority | MEDIUM |
|---|---|
| Current Gap | No visibility into output quality |
| Suggestion | Dashboard showing citation accuracy, confidence distributions, hallucination rates |
| Benefit | Proactive quality monitoring |
| Implementation | Sample evaluation pipeline feeding metrics dashboard |
| Complexity | High |
| Reference | ADR-031 |
11.4 Agent Decision Logging
| Priority | MEDIUM |
|---|---|
| Current Gap | Limited visibility into agent reasoning |
| Suggestion | Structured logging of agent decisions, tool selections, strategy choices |
| Benefit | Debugging, behavior analysis, prompt optimization |
| Implementation | Standardized decision log format, searchable index |
| Complexity | Low |
| Reference | ADR-029, ADR-031 |
12. RESEARCH DIRECTIONS
12.1 Fine-Tuned Extraction Models
| Priority | LOW |
|---|---|
| Current Gap | General-purpose LLMs for all extraction |
| Suggestion | Fine-tune smaller models for specific extraction patterns |
| Benefit | 10x cost reduction, faster inference for common patterns |
| Implementation | Collect extraction examples → fine-tune Haiku/Llama |
| Complexity | High |
| Reference | ADR-028 |
12.2 Learned Chunk Boundaries
| Priority | LOW |
|---|---|
| Current Gap | Fixed chunking strategies |
| Suggestion | Train model to identify optimal chunk boundaries for retrieval |
| Benefit | Better retrieval precision, fewer incomplete contexts |
| Implementation | Contrastive learning on retrieval success signals |
| Complexity | Very High |
| Reference | ADR-030 |
12.3 Reinforcement Learning for Strategy Selection
| Priority | LOW |
|---|---|
| Current Gap | Rule-based retrieval strategy selection |
| Suggestion | RL agent learning optimal strategy from query outcomes |
| Benefit | Adaptive strategy selection improving over time |
| Implementation | Bandit/RL on strategy choices with quality reward |
| Complexity | Very High |
| Reference | ADR-031 |
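The bandit variant is the simplest entry point. Below is an epsilon-greedy sketch, assuming each query yields a scalar quality reward; `StrategyBandit` and the strategy names in the usage are hypothetical, and a contextual bandit or full RL agent would replace this once query features are taken into account.

```python
import random
from typing import Optional


class StrategyBandit:
    """Epsilon-greedy selection over retrieval strategies."""

    def __init__(self, strategies: list[str], epsilon: float = 0.1,
                 rng: Optional[random.Random] = None) -> None:
        self._epsilon = epsilon
        self._rng = rng or random.Random()
        self._counts = {s: 0 for s in strategies}
        self._means = {s: 0.0 for s in strategies}

    def choose(self) -> str:
        if self._rng.random() < self._epsilon:
            return self._rng.choice(list(self._counts))  # explore
        return max(self._means, key=self._means.get)     # exploit best so far

    def update(self, strategy: str, reward: float) -> None:
        """Fold a new quality reward into the running mean for a strategy."""
        self._counts[strategy] += 1
        n = self._counts[strategy]
        self._means[strategy] += (reward - self._means[strategy]) / n

    def mean_reward(self, strategy: str) -> float:
        return self._means[strategy]


# Usage with hypothetical strategy names and rewards in [0, 1].
bandit = StrategyBandit(["dense", "hybrid"], epsilon=0.0)
bandit.update("dense", 0.2)
bandit.update("dense", 0.4)
bandit.update("hybrid", 1.0)
best = bandit.choose()
```

The quality of the learned policy depends on the reward signal, so this item presupposes a reliable per-query quality metric.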
12.4 Multimodal Corpus Processing
| Priority | MEDIUM |
|---|---|
| Current Gap | Text-only processing |
| Suggestion | Native support for images, diagrams, tables in documents |
| Benefit | Process technical documents with figures, charts |
| Implementation | Vision model integration for image understanding |
| Complexity | High |
| Reference | ADR-028 |
PRIORITIZATION MATRIX
Immediate (Sprint 13-14)
| Suggestion | Impact | Effort | ROI |
|---|---|---|---|
| Embedding cache with TTL | High | Low | ★★★★★ |
| Batch embedding requests | High | Low | ★★★★★ |
| Tiered model selection | High | Medium | ★★★★☆ |
| Interactive debugging console | High | Medium | ★★★★☆ |
| Prompt injection detection | High | Medium | ★★★★☆ |
Near-Term (Sprint 15-18)
| Suggestion | Impact | Effort | ROI |
|---|---|---|---|
| Hallucination detection layer | High | Medium | ★★★★☆ |
| Distributed tracing | High | Medium | ★★★★☆ |
| Token usage dashboard | High | Low | ★★★★★ |
| MCP server for corpus tools | High | Medium | ★★★★☆ |
| Cross-corpus analysis | High | High | ★★★☆☆ |
Medium-Term (Sprint 19-24)
| Suggestion | Impact | Effort | ROI |
|---|---|---|---|
| Visual pipeline builder | Medium | High | ★★★☆☆ |
| Sharded vector index | High | High | ★★★☆☆ |
| Temporal queries | Medium | High | ★★☆☆☆ |
| GDPR erasure automation | High | Medium | ★★★★☆ |
| Multimodal corpus processing | Medium | High | ★★★☆☆ |
Long-Term Research
| Suggestion | Impact | Effort | ROI |
|---|---|---|---|
| Fine-tuned extraction models | High | Very High | ★★☆☆☆ |
| Learned chunk boundaries | Medium | Very High | ★☆☆☆☆ |
| RL for strategy selection | Medium | Very High | ★☆☆☆☆ |
SUMMARY
| Category | Count | High Priority |
|---|---|---|
| Performance | 5 | 3 |
| Cost Reduction | 4 | 1 |
| Quality | 5 | 1 |
| Developer Experience | 5 | 2 |
| Operations | 5 | 2 |
| Security | 4 | 2 |
| Scalability | 4 | 2 |
| Integration | 5 | 1 |
| Advanced Features | 5 | 1 |
| Compliance | 4 | 1 |
| Observability | 4 | 2 |
| Research | 4 | 0 |
| Total | 54 | 18 |
Recommended Next Steps:
- Implement embedding cache + batch requests (Quick wins, Sprint 13)
- Add tiered model routing (Major cost reduction, Sprint 13-14)
- Integrate prompt injection detection (Security, Sprint 14)
- Build token usage dashboard (Operational visibility, Sprint 15)
- Add distributed tracing (Foundation for debugging, Sprint 15)