Coditect Corpus Processing Subsystem: Complete Component Inventory

Document Reference Key

| Code | Document | Purpose |
|---|---|---|
| ADR-027 | Corpus Processing Subsystem Architecture | Master architecture, agent types, skills, commands |
| ADR-028 | Pre-Processor Agent Pipeline | Document cleaning, NER, filtering |
| ADR-029 | Map-Reduce Agent Orchestrator | Parallel processing, coordination |
| ADR-030 | Hierarchical Knowledge Store | Multi-level storage, FoundationDB schema |
| ADR-031 | RAG Query Engine | Semantic retrieval, citations |
| ADR-032 | Compliance Audit Layer | 21 CFR Part 11, HIPAA, signatures |
| BMA | Better Methods Analysis | Alternative techniques research |
| CIA | Coditect Impact Analysis | Strategic alignment, roadmap |

1. AGENTS (5 New Agent Types)

1.1 PreProcessorAgent

| Attribute | Value |
|---|---|
| What | Agent that cleans, extracts entities, and filters documents before LLM processing |
| Why | Achieve 60-80% token reduction; remove noise that causes hallucinations; extract structured entities deterministically |
| Tools | ocr_extract, entity_recognize, keyword_filter, text_clean, extractive_summarize |
| Token Budget | 5,000 per document |
| Parallelizable | Yes |
| Reference | ADR-028 §Agent Definition |

1.2 MapperAgent

| Attribute | Value |
|---|---|
| What | Agent that processes individual documents in parallel, extracting structured data per extraction schema |
| Why | Cut wall-clock time from O(n) sequential to roughly O(n/p) with p concurrent mappers; isolate failures per document |
| Tools | analyze_document, extract_schema, summarize_chunk |
| Token Budget | 15,000 per document |
| Parallelizable | Yes (up to 50 concurrent) |
| Reference | ADR-029 §Mapper Agent |

1.3 ReducerAgent

| Attribute | Value |
|---|---|
| What | Agent that aggregates outputs from multiple MapperAgents into synthesized results |
| Why | Combine parallel results into coherent output; support hierarchical reduction for large corpora |
| Tools | merge_extractions, deduplicate, synthesize |
| Token Budget | 50,000 per batch |
| Parallelizable | No (sequential reduction levels) |
| Reference | ADR-029 §Reducer Agent |
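
The mapper and reducer roles in 1.2 and 1.3 can be illustrated with a small sketch. This is not the ADR-029 implementation: `map_document`, `reduce_batch`, `hierarchical_reduce`, `run_job`, and the `fan_in` parameter are hypothetical names, and word counting stands in for the LLM calls. Only the cap of 50 concurrent mappers comes from the MapperAgent table.

```python
import asyncio
from collections import Counter

MAX_PARALLEL_MAPPERS = 50  # concurrency cap taken from the MapperAgent table


async def map_document(doc: str, sem: asyncio.Semaphore) -> Counter:
    """Hypothetical mapper: counts words as a stand-in for schema extraction."""
    async with sem:  # at most MAX_PARALLEL_MAPPERS mappers run at once
        await asyncio.sleep(0)  # placeholder for an LLM call
        return Counter(doc.lower().split())


def reduce_batch(outputs: list[Counter]) -> Counter:
    """Hypothetical reducer: merges one batch of mapper outputs."""
    merged = Counter()
    for out in outputs:
        merged.update(out)
    return merged


def hierarchical_reduce(outputs: list[Counter], fan_in: int = 4) -> Counter:
    """Reduce level by level, fan_in outputs at a time, until one result remains."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    if not outputs:
        return Counter()
    while len(outputs) > 1:
        outputs = [reduce_batch(outputs[i:i + fan_in])
                   for i in range(0, len(outputs), fan_in)]
    return outputs[0]


async def run_job(docs: list[str]) -> Counter:
    sem = asyncio.Semaphore(MAX_PARALLEL_MAPPERS)
    # return_exceptions=True isolates failures: one bad document does not abort the job
    results = await asyncio.gather(*(map_document(d, sem) for d in docs),
                                   return_exceptions=True)
    succeeded = [r for r in results if isinstance(r, Counter)]
    return hierarchical_reduce(succeeded)
```

The map phase fans out under a semaphore, while each reduction level runs only after the previous one finishes, which matches the "sequential reduction levels" note in the ReducerAgent table.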

1.4 IndexerAgent

| Attribute | Value |
|---|---|
| What | Agent that builds vector embeddings and knowledge graph from processed documents |
| Why | Enable semantic RAG retrieval; support entity-relationship queries via GraphRAG |
| Tools | embed_chunks, build_graph, index_entities |
| Token Budget | 10,000 per batch |
| Parallelizable | Yes |
| Reference | ADR-027 §New Agent Types |

1.5 QueryAgent

| Attribute | Value |
|---|---|
| What | Agent that executes RAG queries with adaptive retrieval strategy and mandatory citations |
| Why | Provide interactive corpus access; ensure grounded, cited responses; support compliance requirements |
| Tools | vector_search, graph_traverse, generate_cited |
| Token Budget | 20,000 per query |
| Parallelizable | Yes |
| Reference | ADR-031 §Core Implementation |

2. SKILLS (4 New Skill Definitions)

2.1 corpus-ingest

| Attribute | Value |
|---|---|
| What | Skill for ingesting document corpora with configurable pre-processing |
| Why | Standardize document intake; provide clear instructions for pre-processing level selection |
| Location | /mnt/skills/coditect/corpus-ingest/SKILL.md |
| Reference | ADR-027 §New Skills |

2.2 corpus-analyze

| Attribute | Value |
|---|---|
| What | Skill for map-reduce analysis of document corpora |
| Why | Guide agents through parallel processing workflow; define extraction schemas |
| Location | /mnt/skills/coditect/corpus-analyze/SKILL.md |
| Reference | ADR-027 §New Skills |

2.3 corpus-query

| Attribute | Value |
|---|---|
| What | Skill for RAG-powered queries against indexed corpora |
| Why | Ensure proper citation format; guide strategy selection; enforce access reasons |
| Location | /mnt/skills/coditect/corpus-query/SKILL.md |
| Reference | ADR-027 §New Skills |

2.4 corpus-export

| Attribute | Value |
|---|---|
| What | Skill for exporting analysis results with compliance audit trail |
| Why | Ensure exports include signatures; format for regulatory submission |
| Location | /mnt/skills/coditect/corpus-export/SKILL.md |
| Reference | ADR-027 §New Skills |

3. COMMANDS (14 New Commands)

3.1 Corpus Processing Commands

| Command | What | Why | Parameters | Reference |
|---|---|---|---|---|
| @corpus:ingest | Ingest documents into processing pipeline | Entry point for corpus processing with pre-processing level selection | source_path, extraction_schema, pre_process_level | ADR-027 |
| @corpus:analyze | Run map-reduce analysis | Execute parallel analysis with configurable agent count and token budget | analysis_type, output_format, parallel_agents, token_budget | ADR-027, ADR-029 |
| @corpus:query | RAG query against corpus | Interactive retrieval with mandatory citations | query, corpus_ids, top_k, require_citations, access_reason | ADR-027, ADR-031 |
| @corpus:status | Check processing job status | Monitor long-running jobs, view progress | job_id | ADR-027 |
| @corpus:cancel | Cancel running job | Stop processing with optional checkpoint | job_id, checkpoint | ADR-029 |
| @corpus:recover | Recover failed job | Resume from checkpoint after failure | job_id | ADR-029 |

3.2 Knowledge Hierarchy Commands

| Command | What | Why | Parameters | Reference |
|---|---|---|---|---|
| @knowledge:query | Query at specific hierarchy level | Access master/section/chunk/raw levels | corpus_id, level, query, filters, access_reason | ADR-030 |
| @knowledge:drill | Navigate down hierarchy | Drill from summary to source detail | node_id, target_level | ADR-030 |
| @knowledge:rollup | Navigate up hierarchy | Roll up from details to summary | node_ids, target_level | ADR-030 |
| @knowledge:update | Incrementally update corpus | Add/remove documents without full reprocess | corpus_id, add_documents, remove_documents | ADR-030 |

3.3 Compliance Commands

| Command | What | Why | Parameters | Reference |
|---|---|---|---|---|
| @compliance:audit | View audit trail | Inspect operation history for compliance review | corpus_id, start_date, end_date, actions, format | ADR-032 |
| @compliance:verify | Verify chain integrity | Detect tampering in audit trail | corpus_id | ADR-032 |
| @compliance:sign | Sign corpus/export | Apply electronic signature per 21 CFR Part 11 | corpus_id, meaning, meaning_text | ADR-032 |
| @compliance:export | Export with audit trail | Generate compliant export package | corpus_id, format, include_audit, include_signatures | ADR-032 |

4. TOOLS (17 New Tools)

4.1 Pre-Processing Tools

| Tool | What | Why | Parameters | Reference |
|---|---|---|---|---|
| ocr_extract | Extract text from images via OCR | Handle scanned documents, PDFs with images | image_path, engine, language | ADR-028 |
| entity_recognize | Run NER on text | Extract people, organizations, dates, amounts deterministically | text, model, custom_patterns | ADR-028 |
| keyword_filter | Filter text by keywords | Reduce token load by removing irrelevant sections | text, keywords, context_sentences | ADR-028 |
| text_clean | Clean and normalize text | Remove boilerplate, fix encoding, normalize whitespace | text, remove_boilerplate, normalize_whitespace, fix_encoding | ADR-028 |
| extractive_summarize | Select key sentences via TF-IDF | Compress documents while preserving entities | text, num_sentences, preserve_entities | ADR-028 |
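
To make the `keyword_filter` parameters concrete, here is a minimal sketch under stated assumptions: sentences are split naively on terminal punctuation, matching is case-insensitive substring matching, and `context_sentences` is read as the number of neighbouring sentences kept on each side of a match. The behaviour specified in ADR-028 may differ.

```python
import re


def keyword_filter(text: str, keywords: list[str], context_sentences: int = 1) -> str:
    """Keep sentences that mention a keyword, plus neighbouring context sentences."""
    # Naive split on ., ! or ? followed by whitespace (an assumption, not the ADR-028 rule).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    lowered = [k.lower() for k in keywords]
    keep: set[int] = set()
    for i, sentence in enumerate(sentences):
        if any(k in sentence.lower() for k in lowered):
            first = max(0, i - context_sentences)
            last = min(len(sentences), i + context_sentences + 1)
            keep.update(range(first, last))
    # Emit the surviving sentences in their original order.
    return " ".join(sentences[i] for i in sorted(keep))
```

Sentences outside every context window are dropped, which is where the token savings come from.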

4.2 Mapping Tools

| Tool | What | Why | Parameters | Reference |
|---|---|---|---|---|
| analyze_document | Analyze single document per schema | Core mapper operation for extraction | document, schema | ADR-029 |
| extract_schema | Apply extraction schema to content | Structured data extraction with validation | content, schema, strict_mode | ADR-029 |
| summarize_chunk | Summarize document chunk | Generate chunk-level summary for hierarchy | chunk, max_tokens, preserve_quotes | ADR-029 |

4.3 Reduction Tools

| Tool | What | Why | Parameters | Reference |
|---|---|---|---|---|
| merge_extractions | Merge multiple extraction results | Combine parallel mapper outputs | extractions, merge_strategy | ADR-029 |
| deduplicate | Remove duplicate content | Eliminate redundancy across documents | items, similarity_threshold | ADR-029 |
| synthesize | Generate synthesis from merged data | Create coherent narrative from fragments | merged_data, synthesis_prompt | ADR-029 |
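
One way `deduplicate` could use `similarity_threshold` is sketched below. The similarity measure is an assumption: Jaccard overlap of lowercased word sets is used purely for illustration, and the `jaccard` helper is hypothetical. ADR-029 may specify embedding-based or other similarity.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercased word sets of two strings."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty items count as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def deduplicate(items: list[str], similarity_threshold: float = 0.8) -> list[str]:
    """Keep the first occurrence of each item; drop later near-duplicates."""
    kept: list[str] = []
    for item in items:
        # An item survives only if it is below the threshold against everything kept so far.
        if all(jaccard(item, existing) < similarity_threshold for existing in kept):
            kept.append(item)
    return kept
```

This greedy pass is quadratic in the number of items, which is acceptable for a reducer batch but would need an index for very large inputs.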

4.4 Indexing Tools

| Tool | What | Why | Parameters | Reference |
|---|---|---|---|---|
| embed_chunks | Generate vector embeddings | Enable semantic search in RAG | chunks, model | ADR-030 |
| build_graph | Build knowledge graph from entities | Enable GraphRAG entity traversal | entities, relationships | ADR-030 |
| index_entities | Index entities for lookup | Fast entity-based retrieval | entities, corpus_id | ADR-030 |

4.5 Query Tools

| Tool | What | Why | Parameters | Reference |
|---|---|---|---|---|
| vector_search | Semantic vector similarity search | Find relevant chunks by meaning | query_embedding, top_k, filters | ADR-031 |
| graph_traverse | Traverse knowledge graph | Find related entities across documents | entity, max_hops, relationship_types | ADR-031 |
| generate_cited | Generate response with citations | Ensure grounded, cited output | query, context, citation_format | ADR-031 |
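
The `vector_search` parameters (`query_embedding`, `top_k`, `filters`) can be illustrated with a brute-force sketch. The assumptions are: cosine similarity as the metric, an in-memory `index` mapping chunk IDs to `(embedding, metadata)` pairs, and `filters` as exact-match metadata constraints. A production deployment would delegate this to the vector database listed in section 10.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def vector_search(query_embedding, index, top_k=5, filters=None):
    """Return up to top_k (chunk_id, score) pairs, best match first.

    index: hypothetical dict of chunk_id -> (embedding, metadata dict).
    filters: metadata key/value pairs that a chunk must match exactly.
    """
    filters = filters or {}
    scored = []
    for chunk_id, (embedding, metadata) in index.items():
        if any(metadata.get(key) != value for key, value in filters.items()):
            continue  # chunk fails a metadata filter
        scored.append((chunk_id, cosine_similarity(query_embedding, embedding)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```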

5. DATA STRUCTURES (29 New Types)

5.1 Core Processing Types

| Type | What | Why | Key Fields | Reference |
|---|---|---|---|---|
| ProcessedDocument | Document after pre-processing | Carry cleaned content + entities through pipeline | content, entities, metadata, metrics | ADR-028 |
| PreProcessorAgentConfig | Configuration for pre-processor | Control aggressiveness level and extraction options | level, ocr_engine, ner_model, keyword_filter | ADR-028 |
| PreprocessingLevel | Enum of processing levels | Standardize minimal/standard/aggressive options | MINIMAL, STANDARD, AGGRESSIVE | ADR-028 |
| Entity | Extracted named entity | Structured entity with position and confidence | text, label, start, end, confidence | ADR-028 |

5.2 Map-Reduce Types

| Type | What | Why | Key Fields | Reference |
|---|---|---|---|---|
| MapReduceJob | Job definition for map-reduce | Configure parallelism, budget, prompts | job_id, mapper_prompt, reducer_prompt, max_parallel_mappers, total_token_budget | ADR-029 |
| MapperOutput | Output from single mapper | Carry extractions + metrics from mapper | batch_id, extractions, tokens_used, errors | ADR-029 |
| ReducerOutput | Output from reducer | Aggregated results with source tracking | aggregated, source_documents, tokens_used | ADR-029 |
| MapReduceCheckpoint | Checkpoint state for recovery | Enable resume after failure | job_id, phase, completed_batches, pending_batches, map_results | ADR-029 |
| MapReduceTokenBudget | Token budget allocation | Control costs across phases | total_budget, map_allocation, reduce_allocation | ADR-029 |
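
A short worked example shows how `MapReduceTokenBudget` might bound a job. The field names come from the table above; treating `map_allocation` and `reduce_allocation` as fractions of `total_budget` is an assumption, as are the helper methods. The per-document and per-batch figures are the MapperAgent and ReducerAgent budgets from section 1.

```python
from dataclasses import dataclass

MAPPER_TOKENS_PER_DOC = 15_000     # MapperAgent budget from section 1.2
REDUCER_TOKENS_PER_BATCH = 50_000  # ReducerAgent budget from section 1.3


@dataclass(frozen=True)
class MapReduceTokenBudget:
    total_budget: int
    map_allocation: float     # assumed: fraction of total_budget for the map phase
    reduce_allocation: float  # assumed: fraction of total_budget for the reduce phase

    def __post_init__(self):
        if self.map_allocation < 0 or self.reduce_allocation < 0:
            raise ValueError("allocations must be non-negative")
        if self.map_allocation + self.reduce_allocation > 1.0 + 1e-9:
            raise ValueError("allocations exceed the total budget")

    def max_documents(self) -> int:
        """Documents the map phase can afford at the full mapper budget."""
        return round(self.total_budget * self.map_allocation) // MAPPER_TOKENS_PER_DOC

    def max_reduce_batches(self) -> int:
        """Reducer batches the reduce phase can afford at the full reducer budget."""
        return round(self.total_budget * self.reduce_allocation) // REDUCER_TOKENS_PER_BATCH
```

With a 1,000,000-token budget split 75/25, the map phase covers 750,000 / 15,000 = 50 documents and the reduce phase covers 250,000 / 50,000 = 5 batches.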

5.3 Hierarchy Types

| Type | What | Why | Key Fields | Reference |
|---|---|---|---|---|
| HierarchyLevel | Enum of hierarchy levels | Navigate MASTER/SECTION/CHUNK/RAW | MASTER=3, SECTION=2, CHUNK=1, RAW=0 | ADR-030 |
| KnowledgeNode | Base class for all hierarchy nodes | Common fields for all levels | id, corpus_id, level, parent_ref, child_refs, created_at | ADR-030 |
| MasterNode | Top-level corpus summary | Entry point for hierarchy queries | summary, key_findings, statistics | ADR-030 |
| SectionNode | Section-level grouping | Category/topic organization | category, summary, entities, document_count | ADR-030 |
| ChunkNode | Chunk with embedding | RAG-searchable unit | summary, key_quotes, embedding, source_document | ADR-030 |
| RawNode | Original text span | Source of truth for citations | content, document_id, span_start, span_end | ADR-030 |
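
The hierarchy types can be sketched as follows. The enum values and the `KnowledgeNode` field names come from the table above (`created_at` is omitted for brevity); the in-memory `store` dict and the exact semantics of `drill_down` and `roll_up` are assumptions, including the assumption that each node's children sit exactly one level below it.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Optional


class HierarchyLevel(IntEnum):
    RAW = 0
    CHUNK = 1
    SECTION = 2
    MASTER = 3


@dataclass
class KnowledgeNode:
    id: str
    corpus_id: str
    level: HierarchyLevel
    parent_ref: Optional[str] = None
    child_refs: list[str] = field(default_factory=list)


def drill_down(store: dict[str, KnowledgeNode], node_id: str,
               target_level: HierarchyLevel) -> list[KnowledgeNode]:
    """Return the descendants of node_id that sit at target_level."""
    node = store[node_id]
    if target_level > node.level:
        raise ValueError("target_level must be at or below the node's level")
    frontier = [node]
    # Descend one level per iteration until the frontier reaches target_level.
    while frontier and frontier[0].level > target_level:
        frontier = [store[ref] for n in frontier for ref in n.child_refs]
    return frontier


def roll_up(store: dict[str, KnowledgeNode], node_ids: list[str],
            target_level: HierarchyLevel) -> list[KnowledgeNode]:
    """Return the distinct ancestors of node_ids at target_level."""
    ancestors: dict[str, KnowledgeNode] = {}
    for node_id in node_ids:
        node = store[node_id]
        while node.level < target_level and node.parent_ref is not None:
            node = store[node.parent_ref]
        if node.level == target_level:
            ancestors[node.id] = node  # dict keys deduplicate shared parents
    return list(ancestors.values())
```

Using `IntEnum` lets level comparisons such as `node.level > target_level` follow the numeric ordering given in the table.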

5.4 RAG Types

| Type | What | Why | Key Fields | Reference |
|---|---|---|---|---|
| QueryIntent | Classification of query intent | Select retrieval strategy | FACTUAL, ANALYTICAL, COMPARATIVE, EXPLORATORY, PROCEDURAL | ADR-031 |
| RetrievalStrategy | Enum of retrieval strategies | Configure retrieval behavior | SIMPLE, STANDARD, MULTI_HOP, GRAPH_ENHANCED, EXHAUSTIVE | ADR-031 |
| RetrievalContext | Context assembled for generation | Carry retrieved chunks to generator | chunks, strategy_used, total_retrieved, hops_executed | ADR-031 |
| RAGResponse | Response with citations | Final output with grounding | answer, citations, confidence, context_used, tokens_used | ADR-031 |
| Citation | Structured citation | Link claim to source | citation_id, claim_text, source_chunk_id, source_text, confidence | ADR-031 |
| RAGConfig | Configuration for RAG engine | Control retrieval and generation | default_top_k, enable_reranking, require_citations, confidence_threshold | ADR-031 |
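
The `Citation` fields suggest a simple grounding check, sketched here under stated assumptions: a citation counts as grounded only if its `source_text` appears verbatim in the referenced chunk and its `confidence` meets the threshold. The `validate_citations` function, the `chunks` mapping, and the 0.5 default are illustrative and not taken from ADR-031.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    citation_id: str
    claim_text: str
    source_chunk_id: str
    source_text: str
    confidence: float


def validate_citations(citations: list[Citation], chunks: dict[str, str],
                       confidence_threshold: float = 0.5):
    """Split citations into (valid, rejected) lists.

    chunks: hypothetical mapping of chunk_id -> chunk text.
    """
    valid, rejected = [], []
    for citation in citations:
        chunk_text = chunks.get(citation.source_chunk_id, "")
        # Grounded means the quoted span really occurs in the cited chunk.
        grounded = bool(citation.source_text) and citation.source_text in chunk_text
        if grounded and citation.confidence >= confidence_threshold:
            valid.append(citation)
        else:
            rejected.append(citation)
    return valid, rejected
```

Rejected citations could feed a self-correction pass or be reported to the caller when citations are required.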

5.5 Compliance Types

| Type | What | Why | Key Fields | Reference |
|---|---|---|---|---|
| AccessPolicy | Access policy definition | RBAC + ABAC access control | allowed_roles, conditions, corpus_patterns, require_reason | ADR-032 |
| AccessCondition | ABAC condition | Attribute-based filtering | attribute, operator, value | ADR-032 |
| PHICategory | HIPAA PHI categories | Identify redactable content | 18 categories per HIPAA Safe Harbor | ADR-032 |
| RedactionPolicy | PHI redaction configuration | Control what/how to redact | phi_categories, redaction_method, enable_safe_harbor | ADR-032 |
| RedactionRecord | Record of single redaction | Audit redaction without storing PHI | category, original_hash, replacement, span | ADR-032 |
| AuditEvent | Immutable audit event | 21 CFR Part 11 audit trail | event_id, sequence_number, previous_hash, event_hash, operator_id, action, timestamp | ADR-032 |
| ElectronicSignature | 21 CFR Part 11 signature | Non-repudiation for compliance | signer_id, component_1, component_2_hash, meaning, signature_value, signed_content_hash | ADR-032 |
| SignatureMeaning | Enum of signature meanings | Capture intent per regulation | CREATED, REVIEWED, APPROVED, VERIFIED, AUTHORIZED | ADR-032 |
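
The `previous_hash` and `event_hash` fields on `AuditEvent` imply a hash chain, which is what allows tampering to be detected. The sketch below is one possible realization, assuming SHA-256 over a canonical JSON encoding and an all-zero genesis hash. `compute_hash`, `append_event`, and `GENESIS_HASH` are hypothetical names; ADR-032 may define the hash input differently.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace

GENESIS_HASH = "0" * 64  # assumed sentinel for the first event's previous_hash


@dataclass(frozen=True)
class AuditEvent:
    event_id: str
    sequence_number: int
    previous_hash: str
    operator_id: str
    action: str
    timestamp: str
    event_hash: str = ""


def compute_hash(event: AuditEvent) -> str:
    """SHA-256 over every field except event_hash, in a canonical JSON encoding."""
    payload = asdict(event)
    payload.pop("event_hash")
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_event(chain: list[AuditEvent], operator_id: str,
                 action: str, timestamp: str) -> AuditEvent:
    """Append a new event linked to the hash of the previous one."""
    previous_hash = chain[-1].event_hash if chain else GENESIS_HASH
    sequence = len(chain)
    draft = AuditEvent(f"evt-{sequence}", sequence, previous_hash,
                       operator_id, action, timestamp)
    event = replace(draft, event_hash=compute_hash(draft))
    chain.append(event)
    return event


def verify_chain_integrity(chain: list[AuditEvent]) -> bool:
    """Return False if any event was altered, reordered, or unlinked."""
    expected_previous = GENESIS_HASH
    for index, event in enumerate(chain):
        if event.sequence_number != index:
            return False
        if event.previous_hash != expected_previous:
            return False
        if event.event_hash != compute_hash(event):
            return False
        expected_previous = event.event_hash
    return True
```

Because each event's hash covers the previous hash, editing any earlier event invalidates every later link unless the whole tail is recomputed.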

6. SERVICES (8 New Services)

| Service | What | Why | Key Methods | Reference |
|---|---|---|---|---|
| MapReduceCoordinator | Orchestrate map-reduce jobs | Central coordination for parallel processing | submit_job(), _execute_map_phase(), _execute_reduce_phase() | ADR-029 |
| HierarchicalKnowledgeStore | Store and query hierarchy | Multi-level persistence and retrieval | store_hierarchy(), query_at_level(), drill_down(), roll_up() | ADR-030 |
| HierarchyUpdater | Incremental hierarchy updates | Add/remove docs without full reprocess | add_documents(), remove_documents() | ADR-030 |
| RAGQueryEngine | Adaptive RAG queries | Execute queries with strategy selection | query(), _retrieve(), _generate(), _validate_and_correct() | ADR-031 |
| AccessControlGateway | Enforce access control | RBAC + ABAC policy enforcement | check_access() | ADR-032 |
| PHIRedactor | Detect and redact PHI | HIPAA compliance | process_document(), _detect_phi() | ADR-032 |
| AuditRecorder | Record immutable audit trail | 21 CFR Part 11 audit | record(), verify_chain_integrity(), export_audit_trail() | ADR-032 |
| SignatureService | Electronic signatures | 21 CFR Part 11 signatures | sign(), verify() | ADR-032 |
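
The combined RBAC + ABAC check performed by `AccessControlGateway.check_access()` might look like the sketch below. The field names follow `AccessPolicy` and `AccessCondition` in section 5.5. Everything else is assumed for illustration: the operator vocabulary (`eq`, `ne`, `gte`, `in`), glob matching for `corpus_patterns`, the function signature, and the deny-by-default handling of unknown operators or missing attributes.

```python
import fnmatch
import operator as op
from dataclasses import dataclass

# Assumed operator vocabulary; ADR-032 may define a different set.
OPERATORS = {
    "eq": op.eq,
    "ne": op.ne,
    "gte": op.ge,
    "in": lambda actual, allowed: actual in allowed,
}


@dataclass(frozen=True)
class AccessCondition:
    attribute: str
    operator: str
    value: object


@dataclass(frozen=True)
class AccessPolicy:
    allowed_roles: frozenset
    conditions: tuple = ()
    corpus_patterns: tuple = ("*",)
    require_reason: bool = True


def check_access(policy: AccessPolicy, role: str, attributes: dict,
                 corpus_id: str, access_reason: str = "") -> bool:
    """Grant access only if the role, corpus, reason, and every condition pass."""
    if role not in policy.allowed_roles:  # RBAC check
        return False
    if not any(fnmatch.fnmatchcase(corpus_id, pattern)
               for pattern in policy.corpus_patterns):
        return False
    if policy.require_reason and not access_reason.strip():
        return False
    for condition in policy.conditions:  # ABAC checks
        compare = OPERATORS.get(condition.operator)
        if compare is None or condition.attribute not in attributes:
            return False  # deny by default
        if not compare(attributes[condition.attribute], condition.value):
            return False
    return True
```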

7. WORKFLOWS (4 New Workflows)

7.1 corpus-preprocess

| Attribute | Value |
|---|---|
| What | Workflow for document pre-processing pipeline |
| Why | Standardize extraction → cleaning → NER → filtering → compression |
| Trigger | @corpus:ingest |
| Steps | validate_input → detect_type → extract → clean → extract_entities → filter → compress → store → audit |
| Reference | ADR-028 §Workflow Integration |

7.2 corpus-analyze

| Attribute | Value |
|---|---|
| What | Workflow for map-reduce corpus analysis |
| Why | Coordinate parallel mappers and reducers with checkpointing |
| Trigger | @corpus:analyze |
| Steps | validate_job → allocate_budget → execute_map_phase → checkpoint → execute_shuffle → execute_reduce → synthesize → store → audit |
| Reference | ADR-029 §Core Components |

7.3 corpus-query

| Attribute | Value |
|---|---|
| What | Workflow for RAG query execution |
| Why | Standardize analyze → retrieve → generate → validate → audit |
| Trigger | @corpus:query |
| Steps | analyze_query → select_strategy → retrieve → assemble_context → generate → validate_citations → self_correct → audit |
| Reference | ADR-031 §Core Implementation |

7.4 compliance-export

| Attribute | Value |
|---|---|
| What | Workflow for compliant corpus export |
| Why | Ensure exports include audit trail and signatures per regulation |
| Trigger | @compliance:export |
| Steps | check_access → gather_hierarchy → gather_audit_trail → format_export → sign_export → record_export_audit |
| Reference | ADR-032 §Electronic Signature Service |

8. FOUNDATIONDB SCHEMA (6 New Directories)

| Directory | What | Why | Key Structure | Reference |
|---|---|---|---|---|
| ('coditect', 'knowledge', 'master') | Master-level summaries | Top of hierarchy | {corpus_id} → MasterNode | ADR-030 |
| ('coditect', 'knowledge', 'sections') | Section-level groupings | Category organization | {corpus_id}/{section_id} → SectionNode | ADR-030 |
| ('coditect', 'knowledge', 'chunks') | Chunk-level with embeddings | RAG-searchable units | {corpus_id}/{section_id}/{chunk_id} → ChunkNode | ADR-030 |
| ('coditect', 'knowledge', 'raw') | Original text spans | Source of truth | {corpus_id}/{document_id}/{span_start} → RawNode | ADR-030 |
| ('coditect', 'knowledge', 'indexes') | Entity and embedding indexes | Fast retrieval | entity/{type}/{value}, embedding/{chunk_id} | ADR-030 |
| ('coditect', 'knowledge', 'audit') | Immutable audit trail | 21 CFR Part 11 compliance | {corpus_id}/{sequence} → AuditEvent | ADR-032 |
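
The key structures above rely on ordered keys, so that all spans of one document can be read with a single prefix range scan. The sketch below imitates that behaviour with plain Python tuples and the standard library only. `OrderedKV` and `raw_key` are hypothetical stand-ins; a real deployment would use FoundationDB's directory and tuple layers rather than this class.

```python
import bisect

RAW_DIR = ("coditect", "knowledge", "raw")  # directory path from the table above


def raw_key(corpus_id: str, document_id: str, span_start: int) -> tuple:
    """Build a {corpus_id}/{document_id}/{span_start} key under the raw directory."""
    return RAW_DIR + (corpus_id, document_id, span_start)


class OrderedKV:
    """Tiny in-memory stand-in for an ordered key-value store."""

    def __init__(self):
        self._keys = []    # kept sorted at all times
        self._values = {}

    def set(self, key: tuple, value) -> None:
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def range_prefix(self, prefix: tuple) -> list:
        """Return (key, value) pairs whose key starts with prefix, in key order."""
        start = bisect.bisect_left(self._keys, prefix)
        results = []
        for key in self._keys[start:]:
            if key[:len(prefix)] != prefix:
                break  # sorted order means no later key can match
            results.append((key, self._values[key]))
        return results
```

Putting `span_start` last in the key means a scan over one document returns its spans in reading order without a separate sort.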

9. CONFIGURATION OBJECTS (5 New Configs)

| Config | What | Why | Key Fields | Reference |
|---|---|---|---|---|
| PreProcessorAgentConfig | Pre-processor settings | Control extraction and filtering | level, ocr_engine, ner_model, remove_boilerplate_patterns | ADR-028 |
| MapReduceTokenBudget | Token allocation | Prevent cost overruns | total_budget, map_allocation, reduce_allocation, synthesis_allocation | ADR-029 |
| RAGConfig | RAG engine settings | Control retrieval behavior | default_top_k, similarity_threshold, enable_reranking, require_citations | ADR-031 |
| RedactionPolicy | PHI redaction settings | HIPAA compliance | phi_categories, redaction_method, enable_safe_harbor | ADR-032 |
| AccessPolicy | Access control policy | RBAC + ABAC rules | allowed_roles, conditions, require_reason, require_approval | ADR-032 |

10. EXTERNAL DEPENDENCIES (5 New Dependencies)

| Dependency | What | Why | Reference |
|---|---|---|---|
| spaCy (en_core_web_lg) | NER model | Entity extraction in pre-processing | ADR-028 |
| Cross-encoder reranker | ms-marco-MiniLM | Improve retrieval precision | ADR-031 |
| Vector database | Embedding index | Semantic search (Pinecone/Chroma/native) | ADR-030, ADR-031 |
| OCR engine | Text extraction | Handle scanned documents (Tesseract/Azure/Google) | ADR-028 |
| Timestamp authority | RFC 3161 | Trusted timestamps for audit | ADR-032 |

11. IMPLEMENTATION SUMMARY

By Sprint

| Sprint | Components | Reference |
|---|---|---|
| S1-S2 | PreProcessorAgent, PreProcessorAgentConfig, pre-processing tools, corpus-preprocess workflow | ADR-028 |
| S3-S4 | MapperAgent, ReducerAgent, MapReduceCoordinator, MapReduceJob, checkpoint types | ADR-029 |
| S5-S6 | HierarchicalKnowledgeStore, hierarchy nodes, FoundationDB schema, HierarchyUpdater | ADR-030 |
| S7-S8 | RAGQueryEngine, QueryAgent, retrieval strategies, citation types | ADR-031 |
| S9-S10 | AccessControlGateway, PHIRedactor, AuditRecorder, SignatureService | ADR-032 |
| S11-S12 | Integration testing, skills documentation, command finalization | All |

Component Counts

| Category | Count |
|---|---|
| Agents | 5 |
| Skills | 4 |
| Commands | 14 |
| Tools | 17 |
| Data Types | 29 |
| Services | 8 |
| Workflows | 4 |
| FDB Directories | 6 |
| Configs | 5 |
| Total New Components | 92 |

12. CROSS-REFERENCE MATRIX

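| Component | ADR-027 | ADR-028 | ADR-029 | ADR-030 | ADR-031 | ADR-032 | BMA | CIA |
|---|---|---|---|---|---|---|---|---|
| PreProcessorAgent | | ✓✓ | | | | | | |
| MapperAgent | | | ✓✓ | | | | | |
| ReducerAgent | | | ✓✓ | | | | | |
| IndexerAgent | | | | | | | | |
| QueryAgent | | | | | ✓✓ | | | |
| Hierarchy Store | | | | ✓✓ | | | | |
| RAG Engine | | | | | ✓✓ | | | |
| Audit Layer | | | | | | ✓✓ | | |
| Map-Reduce Coord | | | ✓✓ | | | | | |
| Token Budget | | | ✓✓ | | | | | |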

✓✓ = Primary definition | ✓ = Referenced