Coditect Corpus Processing Subsystem: Complete Component Inventory
Document Reference Key
| Code | Document | Purpose |
|---|---|---|
| ADR-027 | Corpus Processing Subsystem Architecture | Master architecture, agent types, skills, commands |
| ADR-028 | Pre-Processor Agent Pipeline | Document cleaning, NER, filtering |
| ADR-029 | Map-Reduce Agent Orchestrator | Parallel processing, coordination |
| ADR-030 | Hierarchical Knowledge Store | Multi-level storage, FoundationDB schema |
| ADR-031 | RAG Query Engine | Semantic retrieval, citations |
| ADR-032 | Compliance Audit Layer | 21 CFR Part 11, HIPAA, signatures |
| BMA | Better Methods Analysis | Alternative techniques research |
| CIA | Coditect Impact Analysis | Strategic alignment, roadmap |
1. AGENTS (5 New Agent Types)
1.1 PreProcessorAgent
| Attribute | Value |
|---|---|
| What | Agent that cleans, extracts entities, and filters documents before LLM processing |
| Why | Achieve 60-80% token reduction; remove noise that causes hallucinations; extract structured entities deterministically |
| Tools | ocr_extract, entity_recognize, keyword_filter, text_clean, extractive_summarize |
| Token Budget | 5,000 per document |
| Parallelizable | Yes |
| Reference | ADR-028 §Agent Definition |
1.2 MapperAgent
| Attribute | Value |
|---|---|
| What | Agent that processes individual documents in parallel, extracting structured data per extraction schema |
| Why | Cut wall-clock time from O(n) sequential to roughly O(n/p) with p concurrent mappers (p ≤ 50); isolate failures per document |
| Tools | analyze_document, extract_schema, summarize_chunk |
| Token Budget | 15,000 per document |
| Parallelizable | Yes (up to 50 concurrent) |
| Reference | ADR-029 §Mapper Agent |
1.3 ReducerAgent
| Attribute | Value |
|---|---|
| What | Agent that aggregates outputs from multiple MapperAgents into synthesized results |
| Why | Combine parallel results into coherent output; support hierarchical reduction for large corpora |
| Tools | merge_extractions, deduplicate, synthesize |
| Token Budget | 50,000 per batch |
| Parallelizable | No (sequential reduction levels) |
| Reference | ADR-029 §Reducer Agent |
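
The mapper/reducer split in 1.2 and 1.3 can be illustrated with a minimal sketch. It assumes an asyncio-based runner; `map_document`, `reduce_batch`, `run_map_phase`, `hierarchical_reduce`, and `run_job` are hypothetical names, and the word-count payload only stands in for real extraction. ADR-029 defines the actual orchestration.

```python
import asyncio

MAX_PARALLEL_MAPPERS = 50  # concurrency cap from the MapperAgent table


async def map_document(doc: str) -> dict:
    """Hypothetical stand-in for one MapperAgent call: counts words."""
    if not doc.strip():
        raise ValueError("empty document")
    await asyncio.sleep(0)  # placeholder for the real LLM/tool round trip
    return {"words": len(doc.split())}


def reduce_batch(outputs: list) -> dict:
    """Hypothetical stand-in for one ReducerAgent call: sums word counts."""
    return {"words": sum(o["words"] for o in outputs)}


async def run_map_phase(docs: list, max_parallel: int = MAX_PARALLEL_MAPPERS) -> list:
    """Map every document concurrently, at most max_parallel at a time."""
    semaphore = asyncio.Semaphore(max_parallel)

    async def guarded(doc: str):
        async with semaphore:
            try:
                return await map_document(doc)
            except Exception as exc:  # a failing document does not abort the batch
                return exc

    return await asyncio.gather(*(guarded(d) for d in docs))


def hierarchical_reduce(outputs: list, fan_in: int = 2) -> dict:
    """Reduce level by level; each level finishes before the next starts."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    if not outputs:
        return {"words": 0}
    level = outputs
    while len(level) > 1:
        level = [reduce_batch(level[i:i + fan_in]) for i in range(0, len(level), fan_in)]
    return level[0]


def run_job(docs: list) -> tuple:
    """Return (aggregated result, number of failed documents)."""
    results = asyncio.run(run_map_phase(docs))
    succeeded = [r for r in results if not isinstance(r, Exception)]
    return hierarchical_reduce(succeeded), len(results) - len(succeeded)
```

The semaphore bounds concurrency, so wall-clock time scales with the number of documents divided by the cap rather than staying constant. The reduction loop is sequential across levels, which matches the "No (sequential reduction levels)" entry for ReducerAgent.
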
1.4 IndexerAgent
| Attribute | Value |
|---|---|
| What | Agent that builds vector embeddings and knowledge graph from processed documents |
| Why | Enable semantic RAG retrieval; support entity-relationship queries via GraphRAG |
| Tools | embed_chunks, build_graph, index_entities |
| Token Budget | 10,000 per batch |
| Parallelizable | Yes |
| Reference | ADR-027 §New Agent Types |
1.5 QueryAgent
| Attribute | Value |
|---|---|
| What | Agent that executes RAG queries with adaptive retrieval strategy and mandatory citations |
| Why | Provide interactive corpus access; ensure grounded, cited responses; support compliance requirements |
| Tools | vector_search, graph_traverse, generate_cited |
| Token Budget | 20,000 per query |
| Parallelizable | Yes |
| Reference | ADR-031 §Core Implementation |
2. SKILLS (4 New Skill Definitions)
2.1 corpus-ingest
| Attribute | Value |
|---|---|
| What | Skill for ingesting document corpora with configurable pre-processing |
| Why | Standardize document intake; provide clear instructions for pre-processing level selection |
| Location | /mnt/skills/coditect/corpus-ingest/SKILL.md |
| Reference | ADR-027 §New Skills |
2.2 corpus-analyze
| Attribute | Value |
|---|---|
| What | Skill for map-reduce analysis of document corpora |
| Why | Guide agents through parallel processing workflow; define extraction schemas |
| Location | /mnt/skills/coditect/corpus-analyze/SKILL.md |
| Reference | ADR-027 §New Skills |
2.3 corpus-query
| Attribute | Value |
|---|---|
| What | Skill for RAG-powered queries against indexed corpora |
| Why | Ensure proper citation format; guide strategy selection; enforce access reasons |
| Location | /mnt/skills/coditect/corpus-query/SKILL.md |
| Reference | ADR-027 §New Skills |
2.4 corpus-export
| Attribute | Value |
|---|---|
| What | Skill for exporting analysis results with compliance audit trail |
| Why | Ensure exports include signatures; format for regulatory submission |
| Location | /mnt/skills/coditect/corpus-export/SKILL.md |
| Reference | ADR-027 §New Skills |
3. COMMANDS (14 New Commands)
3.1 Corpus Processing Commands
| Command | What | Why | Parameters | Reference |
|---|---|---|---|---|
| @corpus:ingest | Ingest documents into processing pipeline | Entry point for corpus processing with pre-processing level selection | source_path, extraction_schema, pre_process_level | ADR-027 |
| @corpus:analyze | Run map-reduce analysis | Execute parallel analysis with configurable agent count and token budget | analysis_type, output_format, parallel_agents, token_budget | ADR-027, ADR-029 |
| @corpus:query | RAG query against corpus | Interactive retrieval with mandatory citations | query, corpus_ids, top_k, require_citations, access_reason | ADR-027, ADR-031 |
| @corpus:status | Check processing job status | Monitor long-running jobs, view progress | job_id | ADR-027 |
| @corpus:cancel | Cancel running job | Stop processing with optional checkpoint | job_id, checkpoint | ADR-029 |
| @corpus:recover | Recover failed job | Resume from checkpoint after failure | job_id | ADR-029 |
3.2 Knowledge Hierarchy Commands
| Command | What | Why | Parameters | Reference |
|---|---|---|---|---|
| @knowledge:query | Query at specific hierarchy level | Access master/section/chunk/raw levels | corpus_id, level, query, filters, access_reason | ADR-030 |
| @knowledge:drill | Navigate down hierarchy | Drill from summary to source detail | node_id, target_level | ADR-030 |
| @knowledge:rollup | Navigate up hierarchy | Roll up from details to summary | node_ids, target_level | ADR-030 |
| @knowledge:update | Incrementally update corpus | Add/remove documents without full reprocess | corpus_id, add_documents, remove_documents | ADR-030 |
3.3 Compliance Commands
| Command | What | Why | Parameters | Reference |
|---|---|---|---|---|
| @compliance:audit | View audit trail | Inspect operation history for compliance review | corpus_id, start_date, end_date, actions, format | ADR-032 |
| @compliance:verify | Verify chain integrity | Detect tampering in audit trail | corpus_id | ADR-032 |
| @compliance:sign | Sign corpus/export | Apply electronic signature per 21 CFR Part 11 | corpus_id, meaning, meaning_text | ADR-032 |
| @compliance:export | Export with audit trail | Generate compliant export package | corpus_id, format, include_audit, include_signatures | ADR-032 |
4. TOOLS (17 New Tools)
4.1 Pre-Processing Tools
| Tool | What | Why | Parameters | Reference |
|---|---|---|---|---|
| ocr_extract | Extract text from images via OCR | Handle scanned documents, PDFs with images | image_path, engine, language | ADR-028 |
| entity_recognize | Run NER on text | Extract people, organizations, dates, amounts deterministically | text, model, custom_patterns | ADR-028 |
| keyword_filter | Filter text by keywords | Reduce token load by removing irrelevant sections | text, keywords, context_sentences | ADR-028 |
| text_clean | Clean and normalize text | Remove boilerplate, fix encoding, normalize whitespace | text, remove_boilerplate, normalize_whitespace, fix_encoding | ADR-028 |
| extractive_summarize | Select key sentences via TF-IDF | Compress documents while preserving entities | text, num_sentences, preserve_entities | ADR-028 |
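
As an illustration of how keyword_filter might reduce token load, the sketch below keeps only sentences that mention a keyword, plus a window of neighbouring sentences. The parameter names come from the table; the sentence-splitting regex and the reading of context_sentences as a symmetric window are assumptions, not the ADR-028 behaviour.

```python
import re


def keyword_filter(text: str, keywords: list, context_sentences: int = 1) -> str:
    """Keep sentences containing any keyword, plus N sentences on either side."""
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    lowered = [k.lower() for k in keywords]

    keep = set()
    for i, sentence in enumerate(sentences):
        if any(k in sentence.lower() for k in lowered):
            first = max(0, i - context_sentences)
            last = min(len(sentences), i + context_sentences + 1)
            keep.update(range(first, last))

    # Emit survivors in their original order.
    return " ".join(sentences[i] for i in sorted(keep))
```

With `context_sentences=0` only the matching sentences survive; larger windows trade token savings for surrounding context.
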
4.2 Mapping Tools
| Tool | What | Why | Parameters | Reference |
|---|---|---|---|---|
| analyze_document | Analyze single document per schema | Core mapper operation for extraction | document, schema | ADR-029 |
| extract_schema | Apply extraction schema to content | Structured data extraction with validation | content, schema, strict_mode | ADR-029 |
| summarize_chunk | Summarize document chunk | Generate chunk-level summary for hierarchy | chunk, max_tokens, preserve_quotes | ADR-029 |
4.3 Reduction Tools
| Tool | What | Why | Parameters | Reference |
|---|---|---|---|---|
| merge_extractions | Merge multiple extraction results | Combine parallel mapper outputs | extractions, merge_strategy | ADR-029 |
| deduplicate | Remove duplicate content | Eliminate redundancy across documents | items, similarity_threshold | ADR-029 |
| synthesize | Generate synthesis from merged data | Create coherent narrative from fragments | merged_data, synthesis_prompt | ADR-029 |
4.4 Indexing Tools
| Tool | What | Why | Parameters | Reference |
|---|---|---|---|---|
| embed_chunks | Generate vector embeddings | Enable semantic search in RAG | chunks, model | ADR-030 |
| build_graph | Build knowledge graph from entities | Enable GraphRAG entity traversal | entities, relationships | ADR-030 |
| index_entities | Index entities for lookup | Fast entity-based retrieval | entities, corpus_id | ADR-030 |
4.5 Query Tools
| Tool | What | Why | Parameters | Reference |
|---|---|---|---|---|
| vector_search | Semantic vector similarity search | Find relevant chunks by meaning | query_embedding, top_k, filters | ADR-031 |
| graph_traverse | Traverse knowledge graph | Find related entities across documents | entity, max_hops, relationship_types | ADR-031 |
| generate_cited | Generate response with citations | Ensure grounded, cited output | query, context, citation_format | ADR-031 |
5. DATA STRUCTURES (29 New Types)
5.1 Core Processing Types
| Type | What | Why | Key Fields | Reference |
|---|---|---|---|---|
| ProcessedDocument | Document after pre-processing | Carry cleaned content + entities through pipeline | content, entities, metadata, metrics | ADR-028 |
| PreProcessorAgentConfig | Configuration for pre-processor | Control aggressiveness level and extraction options | level, ocr_engine, ner_model, keyword_filter | ADR-028 |
| PreprocessingLevel | Enum of processing levels | Standardize minimal/standard/aggressive options | MINIMAL, STANDARD, AGGRESSIVE | ADR-028 |
| Entity | Extracted named entity | Structured entity with position and confidence | text, label, start, end, confidence | ADR-028 |
5.2 Map-Reduce Types
| Type | What | Why | Key Fields | Reference |
|---|---|---|---|---|
| MapReduceJob | Job definition for map-reduce | Configure parallelism, budget, prompts | job_id, mapper_prompt, reducer_prompt, max_parallel_mappers, total_token_budget | ADR-029 |
| MapperOutput | Output from single mapper | Carry extractions + metrics from mapper | batch_id, extractions, tokens_used, errors | ADR-029 |
| ReducerOutput | Output from reducer | Aggregated results with source tracking | aggregated, source_documents, tokens_used | ADR-029 |
| MapReduceCheckpoint | Checkpoint state for recovery | Enable resume after failure | job_id, phase, completed_batches, pending_batches, map_results | ADR-029 |
| MapReduceTokenBudget | Token budget allocation | Control costs across phases | total_budget, map_allocation, reduce_allocation | ADR-029 |
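
To make the budget fields concrete, here is a small sketch of how a total budget could be split across phases. The field names follow MapReduceTokenBudget; treating the allocations as integer percentages, the 60/30/10 split, and the class name `TokenBudgetSketch` are all assumptions made for illustration. The 15,000-token default is the per-document MapperAgent budget from section 1.2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenBudgetSketch:
    """Hypothetical budget split; allocations are percentages of total_budget."""
    total_budget: int
    map_allocation: int = 60        # assumed default split
    reduce_allocation: int = 30
    synthesis_allocation: int = 10

    def __post_init__(self):
        parts = self.map_allocation + self.reduce_allocation + self.synthesis_allocation
        if parts != 100:
            raise ValueError("allocations must sum to 100 percent")

    def tokens_for(self, phase: str) -> int:
        """Tokens available to 'map', 'reduce', or 'synthesis'."""
        percent = getattr(self, f"{phase}_allocation")
        return self.total_budget * percent // 100

    def max_documents(self, per_document_budget: int = 15_000) -> int:
        """How many documents the map phase can afford at a fixed per-document cost."""
        return self.tokens_for("map") // per_document_budget
```

Integer percentages keep the arithmetic exact, so a job can be rejected up front when the corpus size exceeds `max_documents()`.
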
5.3 Hierarchy Types
| Type | What | Why | Key Fields | Reference |
|---|---|---|---|---|
| HierarchyLevel | Enum of hierarchy levels | Navigate MASTER/SECTION/CHUNK/RAW | MASTER=3, SECTION=2, CHUNK=1, RAW=0 | ADR-030 |
| KnowledgeNode | Base class for all hierarchy nodes | Common fields for all levels | id, corpus_id, level, parent_ref, child_refs, created_at | ADR-030 |
| MasterNode | Top-level corpus summary | Entry point for hierarchy queries | summary, key_findings, statistics | ADR-030 |
| SectionNode | Section-level grouping | Category/topic organization | category, summary, entities, document_count | ADR-030 |
| ChunkNode | Chunk with embedding | RAG-searchable unit | summary, key_quotes, embedding, source_document | ADR-030 |
| RawNode | Original text span | Source of truth for citations | content, document_id, span_start, span_end | ADR-030 |
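
The hierarchy types lend themselves to a short sketch of drill-down and roll-up navigation. The enum values and the KnowledgeNode field names come from the table (created_at is omitted for brevity); the in-memory `store` dictionary and the function bodies are assumptions standing in for the FoundationDB-backed store in ADR-030.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Optional


class HierarchyLevel(IntEnum):
    RAW = 0
    CHUNK = 1
    SECTION = 2
    MASTER = 3


@dataclass
class KnowledgeNode:
    id: str
    corpus_id: str
    level: HierarchyLevel
    parent_ref: Optional[str] = None
    child_refs: list = field(default_factory=list)


def drill_down(store: dict, node_id: str, target_level: HierarchyLevel) -> list:
    """Follow child_refs from node_id until every node sits at target_level."""
    node = store[node_id]
    if target_level > node.level:
        raise ValueError("target_level is above the starting node")
    frontier = [node]
    while frontier and frontier[0].level > target_level:
        frontier = [store[ref] for n in frontier for ref in n.child_refs]
    return frontier


def roll_up(store: dict, node_ids: list, target_level: HierarchyLevel) -> list:
    """Follow parent_ref upward and return the distinct ancestors at target_level."""
    ancestors = {}  # dict keeps first-seen order and removes duplicates
    for node_id in node_ids:
        node = store[node_id]
        while node.level < target_level and node.parent_ref is not None:
            node = store[node.parent_ref]
        if node.level == target_level:
            ancestors.setdefault(node.id, node)
    return list(ancestors.values())
```

Using `IntEnum` lets levels be compared directly, which is what makes "drill from summary to source detail" a simple loop over decreasing level values.
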
5.4 RAG Types
| Type | What | Why | Key Fields | Reference |
|---|---|---|---|---|
| QueryIntent | Classification of query intent | Select retrieval strategy | FACTUAL, ANALYTICAL, COMPARATIVE, EXPLORATORY, PROCEDURAL | ADR-031 |
| RetrievalStrategy | Enum of retrieval strategies | Configure retrieval behavior | SIMPLE, STANDARD, MULTI_HOP, GRAPH_ENHANCED, EXHAUSTIVE | ADR-031 |
| RetrievalContext | Context assembled for generation | Carry retrieved chunks to generator | chunks, strategy_used, total_retrieved, hops_executed | ADR-031 |
| RAGResponse | Response with citations | Final output with grounding | answer, citations, confidence, context_used, tokens_used | ADR-031 |
| Citation | Structured citation | Link claim to source | citation_id, claim_text, source_chunk_id, source_text, confidence | ADR-031 |
| RAGConfig | Configuration for RAG engine | Control retrieval and generation | default_top_k, enable_reranking, require_citations, confidence_threshold | ADR-031 |
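
One way to read the Citation type is as a record that can be checked mechanically against the retrieved chunks. The sketch below uses the Citation fields from the table; the `validate_citations` function, the substring grounding test, and the 0.5 default threshold are assumptions, not the validation logic defined in ADR-031.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    citation_id: str
    claim_text: str
    source_chunk_id: str
    source_text: str
    confidence: float


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so quoting differences do not matter."""
    return " ".join(text.lower().split())


def validate_citations(citations: list, chunks: dict, confidence_threshold: float = 0.5) -> tuple:
    """Split citations into (grounded, rejected).

    A citation counts as grounded when its chunk was retrieved, its quoted
    source_text is non-empty and occurs in that chunk, and its confidence
    meets the threshold. `chunks` maps chunk id to chunk text.
    """
    grounded, rejected = [], []
    for citation in citations:
        chunk_text = chunks.get(citation.source_chunk_id)
        quote = _normalize(citation.source_text)
        ok = (
            chunk_text is not None
            and bool(quote)
            and quote in _normalize(chunk_text)
            and citation.confidence >= confidence_threshold
        )
        (grounded if ok else rejected).append(citation)
    return grounded, rejected
```

Rejected citations are the natural input to a self-correction pass, since each one identifies a claim that the retrieved context does not support.
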
5.5 Compliance Types
| Type | What | Why | Key Fields | Reference |
|---|---|---|---|---|
| AccessPolicy | Access policy definition | RBAC + ABAC access control | allowed_roles, conditions, corpus_patterns, require_reason | ADR-032 |
| AccessCondition | ABAC condition | Attribute-based filtering | attribute, operator, value | ADR-032 |
| PHICategory | HIPAA PHI categories | Identify redactable content | 18 categories per HIPAA Safe Harbor | ADR-032 |
| RedactionPolicy | PHI redaction configuration | Control what/how to redact | phi_categories, redaction_method, enable_safe_harbor | ADR-032 |
| RedactionRecord | Record of single redaction | Audit redaction without storing PHI | category, original_hash, replacement, span | ADR-032 |
| AuditEvent | Immutable audit event | 21 CFR Part 11 audit trail | event_id, sequence_number, previous_hash, event_hash, operator_id, action, timestamp | ADR-032 |
| ElectronicSignature | 21 CFR Part 11 signature | Non-repudiation for compliance | signer_id, component_1, component_2_hash, meaning, signature_value, signed_content_hash | ADR-032 |
| SignatureMeaning | Enum of signature meanings | Capture intent per regulation | CREATED, REVIEWED, APPROVED, VERIFIED, AUTHORIZED | ADR-032 |
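
The combination of RBAC and ABAC in AccessPolicy and AccessCondition can be sketched as a single check. Field names follow the table (corpus_patterns is left out for brevity); the operator vocabulary (`eq`, `ne`, `gte`, `lte`, `in`) and the `check_access` signature are assumptions for illustration.

```python
import operator as op
from dataclasses import dataclass, field

# Assumed operator vocabulary; ADR-032 may define a different set.
_OPERATORS = {
    "eq": op.eq,
    "ne": op.ne,
    "gte": op.ge,
    "lte": op.le,
    "in": lambda actual, allowed: actual in allowed,
}


@dataclass
class AccessCondition:
    attribute: str
    operator: str
    value: object

    def matches(self, attributes: dict) -> bool:
        """A missing attribute never satisfies a condition."""
        if self.attribute not in attributes:
            return False
        return bool(_OPERATORS[self.operator](attributes[self.attribute], self.value))


@dataclass
class AccessPolicy:
    allowed_roles: set
    conditions: list = field(default_factory=list)
    require_reason: bool = True


def check_access(policy: AccessPolicy, role: str, attributes: dict, access_reason: str = "") -> bool:
    """RBAC first (role), then the reason requirement, then every ABAC condition."""
    if role not in policy.allowed_roles:
        return False
    if policy.require_reason and not access_reason.strip():
        return False
    return all(condition.matches(attributes) for condition in policy.conditions)
```

Denying on a missing attribute keeps the check fail-closed, which is the safer default for regulated data.
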
6. SERVICES (8 New Services)
| Service | What | Why | Key Methods | Reference |
|---|---|---|---|---|
| MapReduceCoordinator | Orchestrate map-reduce jobs | Central coordination for parallel processing | submit_job(), _execute_map_phase(), _execute_reduce_phase() | ADR-029 |
| HierarchicalKnowledgeStore | Store and query hierarchy | Multi-level persistence and retrieval | store_hierarchy(), query_at_level(), drill_down(), roll_up() | ADR-030 |
| HierarchyUpdater | Incremental hierarchy updates | Add/remove docs without full reprocess | add_documents(), remove_documents() | ADR-030 |
| RAGQueryEngine | Adaptive RAG queries | Execute queries with strategy selection | query(), _retrieve(), _generate(), _validate_and_correct() | ADR-031 |
| AccessControlGateway | Enforce access control | RBAC + ABAC policy enforcement | check_access() | ADR-032 |
| PHIRedactor | Detect and redact PHI | HIPAA compliance | process_document(), _detect_phi() | ADR-032 |
| AuditRecorder | Record immutable audit trail | 21 CFR Part 11 audit | record(), verify_chain_integrity(), export_audit_trail() | ADR-032 |
| SignatureService | Electronic signatures | 21 CFR Part 11 signatures | sign(), verify() | ADR-032 |
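
The previous_hash and event_hash fields of AuditEvent suggest a hash chain, which is what lets verify_chain_integrity() detect tampering. A minimal sketch follows, assuming SHA-256 over a canonical JSON payload and an all-zero genesis hash. It uses a subset of the AuditEvent fields, and the free functions `record` and `verify_chain_integrity` only mirror the AuditRecorder method names; ADR-032 defines the real scheme.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace

GENESIS_HASH = "0" * 64  # assumed placeholder for the first event's previous_hash


@dataclass(frozen=True)
class AuditEvent:
    sequence_number: int
    previous_hash: str
    operator_id: str
    action: str
    timestamp: str
    event_hash: str = ""


def _digest(event: AuditEvent) -> str:
    """Hash every field except event_hash itself, in a stable key order."""
    payload = asdict(event)
    payload.pop("event_hash")
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode("utf-8")).hexdigest()


def record(chain: list, operator_id: str, action: str, timestamp: str) -> AuditEvent:
    """Append an event that commits to the hash of its predecessor."""
    previous_hash = chain[-1].event_hash if chain else GENESIS_HASH
    draft = AuditEvent(len(chain), previous_hash, operator_id, action, timestamp)
    event = replace(draft, event_hash=_digest(draft))
    chain.append(event)
    return event


def verify_chain_integrity(chain: list) -> bool:
    """Recompute every hash and link; any edit, reorder, or gap fails."""
    previous_hash = GENESIS_HASH
    for index, event in enumerate(chain):
        if (
            event.sequence_number != index
            or event.previous_hash != previous_hash
            or event.event_hash != _digest(event)
        ):
            return False
        previous_hash = event.event_hash
    return True
```

Because each event hashes its predecessor's hash, changing one historical event invalidates that event and every later link, so tampering cannot be hidden without rewriting the rest of the chain.
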
7. WORKFLOWS (4 New Workflows)
7.1 corpus-preprocess
| Attribute | Value |
|---|---|
| What | Workflow for document pre-processing pipeline |
| Why | Standardize extraction → cleaning → NER → filtering → compression |
| Trigger | @corpus:ingest |
| Steps | validate_input → detect_type → extract → clean → extract_entities → filter → compress → store → audit |
| Reference | ADR-028 §Workflow Integration |
7.2 corpus-analyze
| Attribute | Value |
|---|---|
| What | Workflow for map-reduce corpus analysis |
| Why | Coordinate parallel mappers and reducers with checkpointing |
| Trigger | @corpus:analyze |
| Steps | validate_job → allocate_budget → execute_map_phase → checkpoint → execute_shuffle → execute_reduce → synthesize → store → audit |
| Reference | ADR-029 §Core Components |
7.3 corpus-query
| Attribute | Value |
|---|---|
| What | Workflow for RAG query execution |
| Why | Standardize analyze → retrieve → generate → validate → audit |
| Trigger | @corpus:query |
| Steps | analyze_query → select_strategy → retrieve → assemble_context → generate → validate_citations → self_correct → audit |
| Reference | ADR-031 §Core Implementation |
7.4 compliance-export
| Attribute | Value |
|---|---|
| What | Workflow for compliant corpus export |
| Why | Ensure exports include audit trail and signatures per regulation |
| Trigger | @compliance:export |
| Steps | check_access → gather_hierarchy → gather_audit_trail → format_export → sign_export → record_export_audit |
| Reference | ADR-032 §Electronic Signature Service |
8. FOUNDATIONDB SCHEMA (6 New Directories)
| Directory | What | Why | Key Structure | Reference |
|---|---|---|---|---|
| ('coditect', 'knowledge', 'master') | Master-level summaries | Top of hierarchy | {corpus_id} → MasterNode | ADR-030 |
| ('coditect', 'knowledge', 'sections') | Section-level groupings | Category organization | {corpus_id}/{section_id} → SectionNode | ADR-030 |
| ('coditect', 'knowledge', 'chunks') | Chunk-level with embeddings | RAG-searchable units | {corpus_id}/{section_id}/{chunk_id} → ChunkNode | ADR-030 |
| ('coditect', 'knowledge', 'raw') | Original text spans | Source of truth | {corpus_id}/{document_id}/{span_start} → RawNode | ADR-030 |
| ('coditect', 'knowledge', 'indexes') | Entity and embedding indexes | Fast retrieval | entity/{type}/{value}, embedding/{chunk_id} | ADR-030 |
| ('coditect', 'knowledge', 'audit') | Immutable audit trail | 21 CFR Part 11 compliance | {corpus_id}/{sequence} → AuditEvent | ADR-032 |
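
The key structures above are ordered tuples, so a prefix of a key selects a contiguous range. The sketch below shows that idea with plain Python tuples and a sorted list; it does not call the FoundationDB client, and `chunk_key`, `raw_key`, and `prefix_scan` are hypothetical helpers. Python's tuple ordering stands in here for the ordered keys a tuple-encoded FoundationDB key space provides.

```python
import bisect

ROOT = ("coditect", "knowledge")


def chunk_key(corpus_id: str, section_id: str, chunk_id: str) -> tuple:
    """Key shape: chunks / {corpus_id} / {section_id} / {chunk_id}."""
    return ROOT + ("chunks", corpus_id, section_id, chunk_id)


def raw_key(corpus_id: str, document_id: str, span_start: int) -> tuple:
    """Key shape: raw / {corpus_id} / {document_id} / {span_start}."""
    return ROOT + ("raw", corpus_id, document_id, span_start)


def prefix_scan(sorted_keys: list, prefix: tuple) -> list:
    """Return every key that starts with prefix, in key order."""
    start = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if key[:len(prefix)] != prefix:
            break  # keys are sorted, so the first mismatch ends the range
        matches.append(key)
    return matches
```

Putting span_start last in the raw key means a scan over one document returns its spans in reading order, and putting section_id before chunk_id makes "all chunks in a section" a single range read.
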
9. CONFIGURATION OBJECTS (5 New Configs)
| Config | What | Why | Key Fields | Reference |
|---|---|---|---|---|
| PreProcessorAgentConfig | Pre-processor settings | Control extraction and filtering | level, ocr_engine, ner_model, remove_boilerplate_patterns | ADR-028 |
| MapReduceTokenBudget | Token allocation | Prevent cost overruns | total_budget, map_allocation, reduce_allocation, synthesis_allocation | ADR-029 |
| RAGConfig | RAG engine settings | Control retrieval behavior | default_top_k, similarity_threshold, enable_reranking, require_citations | ADR-031 |
| RedactionPolicy | PHI redaction settings | HIPAA compliance | phi_categories, redaction_method, enable_safe_harbor | ADR-032 |
| AccessPolicy | Access control policy | RBAC + ABAC rules | allowed_roles, conditions, require_reason, require_approval | ADR-032 |
10. EXTERNAL DEPENDENCIES (5 New Dependencies)
| Dependency | What | Why | Reference |
|---|---|---|---|
| spaCy (en_core_web_lg) | NER model | Entity extraction in pre-processing | ADR-028 |
| Cross-encoder reranker | ms-marco-MiniLM | Improve retrieval precision | ADR-031 |
| Vector database | Embedding index | Semantic search (Pinecone/Chroma/native) | ADR-030, ADR-031 |
| OCR engine | Text extraction | Handle scanned documents (Tesseract/Azure/Google) | ADR-028 |
| Timestamp authority | RFC 3161 | Trusted timestamps for audit | ADR-032 |
11. IMPLEMENTATION SUMMARY
By Sprint
| Sprint | Components | Reference |
|---|---|---|
| S1-S2 | PreProcessorAgent, PreProcessorAgentConfig, pre-processing tools, corpus-preprocess workflow | ADR-028 |
| S3-S4 | MapperAgent, ReducerAgent, MapReduceCoordinator, MapReduceJob, checkpoint types | ADR-029 |
| S5-S6 | HierarchicalKnowledgeStore, hierarchy nodes, FoundationDB schema, HierarchyUpdater | ADR-030 |
| S7-S8 | RAGQueryEngine, QueryAgent, retrieval strategies, citation types | ADR-031 |
| S9-S10 | AccessControlGateway, PHIRedactor, AuditRecorder, SignatureService | ADR-032 |
| S11-S12 | Integration testing, skills documentation, command finalization | All |
Component Counts
| Category | Count |
|---|---|
| Agents | 5 |
| Skills | 4 |
| Commands | 14 |
| Tools | 17 |
| Data Types | 29 |
| Services | 8 |
| Workflows | 4 |
| FDB Directories | 6 |
| Configs | 5 |
| Total New Components | 92 |
12. CROSS-REFERENCE MATRIX
| Component | ADR-027 | ADR-028 | ADR-029 | ADR-030 | ADR-031 | ADR-032 | BMA | CIA |
|---|---|---|---|---|---|---|---|---|
| PreProcessorAgent | ✓ | ✓✓ | ✓ | |||||
| MapperAgent | ✓ | ✓✓ | ✓ | |||||
| ReducerAgent | ✓ | ✓✓ | ✓ | |||||
| IndexerAgent | ✓ | ✓ | ✓ | |||||
| QueryAgent | ✓ | ✓✓ | ✓ | |||||
| Hierarchy Store | ✓✓ | ✓ | ✓ | ✓ | ✓ | |||
| RAG Engine | ✓ | ✓✓ | ✓ | ✓ | ||||
| Audit Layer | ✓ | ✓ | ✓✓ | ✓ | ||||
| Map-Reduce Coord | ✓ | ✓✓ | ✓ | ✓ | ||||
| Token Budget | ✓ | ✓✓ | ✓ |
✓✓ = Primary definition | ✓ = Referenced