
ADR-028: Pre-Processor Agent Pipeline

Status

PROPOSED

Date

2026-01-15

Context

Raw document ingestion into LLM pipelines is token-inefficient. Enterprise corpora contain:

  • Boilerplate (headers, footers, legal disclaimers): 10-30% of tokens
  • Duplicate content across documents: 5-20% of tokens
  • Irrelevant sections for specific analysis: 30-60% of tokens
  • Formatting artifacts from OCR/parsing: 5-15% of tokens

Pre-processing with traditional NLP/ML techniques before LLM analysis can achieve 60-80% token reduction while improving output quality by removing noise.

Key Insight from Research

Pieces.app discovered that over-preprocessing increases hallucinations:

"The more pre-processing we did, the more hallucinations were created, and the worse the final summaries."

This ADR defines a calibrated pre-processing pipeline that removes noise without introducing distortion.

Decision

Implement a staged Pre-Processor Agent pipeline with configurable aggressiveness levels.

Pipeline Architecture

                        PRE-PROCESSOR PIPELINE

STAGE 1: EXTRACTION
┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐
│   PDF   │  │  DOCX   │  │  HTML   │  │  Image  │
│ Parser  │  │ Parser  │  │ Parser  │  │  OCR    │
└────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘
     └────────────┴─────┬──────┴────────────┘
                        ▼
              ┌─────────────────────┐
              │  Raw Text + Meta    │
              └──────────┬──────────┘
                        ▼
STAGE 2: CLEANING (Configurable)
┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│ Boilerplate │  │ Whitespace  │  │  Encoding   │
│   Removal   │  │  Normalize  │  │    Fixup    │
└─────────────┘  └─────────────┘  └─────────────┘
                        │
                        ▼
STAGE 3: ENTITY EXTRACTION
┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│     NER     │  │  Date/Time  │  │   Domain    │
│   (spaCy)   │  │   Parsing   │  │  Entities   │
└─────────────┘  └─────────────┘  └─────────────┘
                        │
                        ▼
STAGE 4: FILTERING (Query-Dependent)
┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│   Keyword   │  │   Section   │  │  Relevance  │
│    Match    │  │   Headers   │  │   Scoring   │
└─────────────┘  └─────────────┘  └─────────────┘
                        │
                        ▼
STAGE 5: COMPRESSION (Optional)
┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│ Extractive  │  │  Sentence   │  │   TF-IDF    │
│   Summary   │  │   Scoring   │  │   Ranking   │
└─────────────┘  └─────────────┘  └─────────────┘

Aggressiveness Levels

preprocessing_levels:
  minimal:
    description: "Safe defaults, preserve maximum fidelity"
    stages: [extraction, cleaning]
    token_reduction: "20-30%"
    hallucination_risk: "minimal"
    use_case: "Legal documents, contracts, compliance"

  standard:
    description: "Balanced noise removal"
    stages: [extraction, cleaning, entity_extraction, filtering]
    token_reduction: "50-60%"
    hallucination_risk: "low"
    use_case: "General business documents, reports"

  aggressive:
    description: "Maximum compression for high-volume"
    stages: [extraction, cleaning, entity_extraction, filtering, compression]
    token_reduction: "70-85%"
    hallucination_risk: "medium"
    use_case: "Large corpora, initial triage, non-critical"
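
The level-to-stage mapping above can be made explicit in code. The following is an illustrative sketch only; the `STAGES_BY_LEVEL` mapping and `stages_for` helper are hypothetical names, not part of the agent's API:

```python
# Illustrative only: maps each aggressiveness level to the pipeline
# stages it enables, mirroring the preprocessing_levels table above.
STAGES_BY_LEVEL = {
    "minimal": ["extraction", "cleaning"],
    "standard": ["extraction", "cleaning", "entity_extraction", "filtering"],
    "aggressive": ["extraction", "cleaning", "entity_extraction",
                   "filtering", "compression"],
}


def stages_for(level: str) -> list[str]:
    """Return the enabled stages for a level, failing loudly on typos."""
    try:
        return STAGES_BY_LEVEL[level]
    except KeyError:
        raise ValueError(f"unknown preprocessing level: {level!r}") from None
```

Making the mapping data-driven keeps level definitions in one place, so adding a fourth level does not require touching the pipeline code.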

Agent Definition

from dataclasses import dataclass, field
from typing import Dict, List, Literal, Optional


@dataclass
class PreProcessorAgentConfig:
    """Configuration for the pre-processor agent"""

    # Processing level
    level: Literal["minimal", "standard", "aggressive"] = "standard"

    # Extraction settings
    ocr_engine: str = "tesseract"  # or "azure", "google"
    preserve_layout: bool = True
    extract_tables: bool = True
    extract_images: bool = False

    # Cleaning settings
    remove_headers_footers: bool = True
    normalize_whitespace: bool = True
    fix_encoding: bool = True
    remove_boilerplate_patterns: List[str] = field(default_factory=list)

    # Entity extraction
    ner_model: str = "en_core_web_lg"
    custom_entity_patterns: Dict[str, str] = field(default_factory=dict)
    extract_dates: bool = True
    extract_amounts: bool = True

    # Filtering (query-dependent)
    keyword_filter: Optional[List[str]] = None
    section_filter: Optional[List[str]] = None
    min_relevance_score: float = 0.3

    # Compression
    extractive_summary_sentences: int = 20
    use_tfidf_ranking: bool = True
    preserve_entities: bool = True  # Never compress away extracted entities


import re

import spacy


class PreProcessorAgent:
    """Agent for document pre-processing"""

    def __init__(self, config: PreProcessorAgentConfig):
        self.config = config
        self.nlp = spacy.load(config.ner_model)
        self.metrics = PreProcessorMetrics()

    async def process(self, document: RawDocument) -> ProcessedDocument:
        """Execute the pre-processing pipeline"""

        original_tokens = self._count_tokens(document.content)

        # Stage 1: Extraction
        extracted = await self._extract(document)

        # Stage 2: Cleaning
        cleaned = await self._clean(extracted)

        # Stage 3: Entity extraction (always run; entities are used for grounding)
        entities = await self._extract_entities(cleaned)

        # Stage 4: Filtering (only if keywords were provided)
        if self.config.keyword_filter:
            filtered = await self._filter(cleaned, self.config.keyword_filter)
        else:
            filtered = cleaned

        # Stage 5: Compression (only at the aggressive level)
        if self.config.level == "aggressive":
            compressed = await self._compress(filtered, entities)
        else:
            compressed = filtered

        final_tokens = self._count_tokens(compressed)

        return ProcessedDocument(
            content=compressed,
            entities=entities,
            metadata=document.metadata,
            metrics=ProcessingMetrics(
                original_tokens=original_tokens,
                final_tokens=final_tokens,
                reduction_ratio=(
                    1 - (final_tokens / original_tokens) if original_tokens else 0.0
                ),
                entities_extracted=len(entities),
            ),
        )

    async def _extract(self, document: RawDocument) -> str:
        """Stage 1: Extract text from the document"""

        if document.mime_type == "application/pdf":
            return await self._extract_pdf(document)
        elif document.mime_type == (
            "application/vnd.openxmlformats-officedocument"
            ".wordprocessingml.document"
        ):
            return await self._extract_docx(document)
        elif document.mime_type == "text/html":
            return await self._extract_html(document)
        elif document.mime_type.startswith("image/"):
            return await self._extract_ocr(document)
        else:
            return document.content  # Assume plain text

    async def _clean(self, text: str) -> str:
        """Stage 2: Clean extracted text"""

        result = text

        if self.config.normalize_whitespace:
            # Collapse horizontal whitespace only, so paragraph breaks
            # ('\n\n') survive for the paragraph-based filtering stage.
            result = re.sub(r'[ \t]+', ' ', result)
            result = re.sub(r'\n{3,}', '\n\n', result)

        if self.config.fix_encoding:
            # Drop code points that cannot be encoded as UTF-8
            # (e.g. unpaired surrogates left over from a bad decode).
            result = result.encode('utf-8', errors='ignore').decode('utf-8')

        if self.config.remove_headers_footers:
            result = self._remove_headers_footers(result)

        for pattern in self.config.remove_boilerplate_patterns:
            result = re.sub(pattern, '', result, flags=re.IGNORECASE)

        return result.strip()

    async def _extract_entities(self, text: str) -> List[Entity]:
        """Stage 3: Named entity recognition"""

        doc = self.nlp(text)
        entities = []

        for ent in doc.ents:
            entities.append(Entity(
                text=ent.text,
                label=ent.label_,
                start=ent.start_char,
                end=ent.end_char,
                confidence=1.0,  # spaCy's NER exposes no per-entity score
            ))

        # Custom regex patterns for domain entities
        for label, pattern in self.config.custom_entity_patterns.items():
            for match in re.finditer(pattern, text):
                entities.append(Entity(
                    text=match.group(),
                    label=label,
                    start=match.start(),
                    end=match.end(),
                    confidence=0.9,
                ))

        return entities

    async def _filter(self, text: str, keywords: List[str]) -> str:
        """Stage 4: Filter to relevant paragraphs"""

        paragraphs = text.split('\n\n')
        relevant = []

        for para in paragraphs:
            para_lower = para.lower()
            if any(kw.lower() in para_lower for kw in keywords):
                relevant.append(para)

        return '\n\n'.join(relevant)

    async def _compress(
        self,
        text: str,
        entities: List[Entity],
    ) -> str:
        """Stage 5: Extractive compression"""

        # Sentence segmentation
        doc = self.nlp(text)
        sentences = list(doc.sents)

        # TF-IDF scoring (uniform scores when ranking is disabled)
        if self.config.use_tfidf_ranking:
            scores = self._tfidf_scores(sentences)
        else:
            scores = {i: 1.0 for i in range(len(sentences))}

        # Boost sentences containing extracted entities
        if self.config.preserve_entities:
            entity_texts = {e.text.lower() for e in entities}
            for i, sent in enumerate(sentences):
                if any(et in sent.text.lower() for et in entity_texts):
                    scores[i] *= 2.0

        # Select the top-ranked sentences, restoring document order
        ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
        top_indices = sorted(
            idx for idx, _ in ranked[:self.config.extractive_summary_sentences]
        )

        selected = [sentences[i].text for i in top_indices]
        return ' '.join(selected)
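
The `_tfidf_scores` helper referenced above is not defined in this ADR. A minimal stdlib-only sketch, operating on plain sentence strings rather than spaCy spans and using a smoothed IDF (no stemming or stop-word removal), might look like:

```python
import math
import re
from collections import Counter


def tfidf_scores(sentences: list[str]) -> dict[int, float]:
    """Score each sentence by the summed TF-IDF weight of its terms.

    Sketch only: treats each sentence as a 'document', lowercases,
    and tokenizes on word characters.
    """
    token_lists = [re.findall(r"\w+", s.lower()) for s in sentences]
    n = len(token_lists)

    # Document frequency: number of sentences containing each term
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))

    scores = {}
    for i, tokens in enumerate(token_lists):
        if not tokens:
            scores[i] = 0.0
            continue
        tf = Counter(tokens)
        # Smoothed IDF keeps terms that appear everywhere from zeroing out
        scores[i] = sum(
            (count / len(tokens)) * math.log((1 + n) / (1 + df[t]) + 1)
            for t, count in tf.items()
        )
    return scores
```

Sentences full of corpus-wide terms score low, while sentences with distinctive vocabulary score high, which is exactly the signal the compression stage ranks on.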

Tools Provided

preprocessor_tools:
  - name: ocr_extract
    description: Extract text from images using OCR
    parameters:
      - image_path: string
      - engine: enum[tesseract, azure, google]
      - language: string (default: "eng")
    returns: ExtractedText

  - name: entity_recognize
    description: Run NER on text
    parameters:
      - text: string
      - model: string (default: "en_core_web_lg")
      - custom_patterns: Dict[str, str]
    returns: List[Entity]

  - name: keyword_filter
    description: Filter text to sections containing keywords
    parameters:
      - text: string
      - keywords: List[string]
      - context_sentences: int (default: 2)
    returns: FilteredText

  - name: text_clean
    description: Clean and normalize text
    parameters:
      - text: string
      - remove_boilerplate: bool
      - normalize_whitespace: bool
      - fix_encoding: bool
    returns: CleanedText

  - name: extractive_summarize
    description: Select key sentences using TF-IDF
    parameters:
      - text: string
      - num_sentences: int
      - preserve_entities: bool
    returns: CompressedText
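
The `keyword_filter` tool's `context_sentences` parameter keeps neighbors of matching sentences so matches are not stripped of surrounding context. A minimal sketch of that behavior, assuming naive regex sentence splitting (the standalone function here is illustrative, not the tool's actual implementation):

```python
import re


def keyword_filter(text: str, keywords: list[str],
                   context_sentences: int = 2) -> str:
    """Keep sentences mentioning a keyword, plus up to context_sentences
    neighbors on each side, in document order."""
    # Naive split: break after sentence-ending punctuation followed by space
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    lowered = [s.lower() for s in sentences]
    keep: set[int] = set()
    for i, s in enumerate(lowered):
        if any(kw.lower() in s for kw in keywords):
            lo = max(0, i - context_sentences)
            hi = min(len(sentences), i + context_sentences + 1)
            keep.update(range(lo, hi))
    return " ".join(sentences[i] for i in sorted(keep))
```

With `context_sentences=0` this degenerates to keeping only matching sentences, which is roughly what the paragraph-level `_filter` stage does.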

Workflow Integration

preprocessing_workflow:
  name: corpus-preprocess
  trigger: "@corpus:ingest"

  steps:
    - id: validate_input
      action: validate_document_format
      on_error: reject_with_reason

    - id: detect_type
      action: detect_mime_type
      outputs: [mime_type]

    - id: extract
      action: extract_text
      inputs: [document, mime_type]
      outputs: [raw_text, extraction_metadata]

    - id: clean
      action: clean_text
      inputs: [raw_text, preprocessing_level]
      outputs: [cleaned_text]

    - id: extract_entities
      action: run_ner
      inputs: [cleaned_text]
      outputs: [entities]
      parallel: true

    - id: filter
      action: keyword_filter
      inputs: [cleaned_text, analysis_keywords]
      outputs: [filtered_text]
      condition: "analysis_keywords is not None"

    - id: compress
      action: extractive_summarize
      inputs: [filtered_text, entities]
      outputs: [compressed_text]
      condition: "preprocessing_level == 'aggressive'"

    - id: store
      action: store_processed_document
      inputs: [compressed_text, entities, extraction_metadata]
      outputs: [document_id]

    - id: audit
      action: create_audit_record
      inputs: [document_id, processing_metrics]

Consequences

Positive

  • 60-80% token reduction before LLM processing
  • Structured entity output for downstream use
  • Configurable aggressiveness per use case
  • Deterministic extraction (NER) alongside LLM analysis
  • Parallelizable per document

Negative

  • Additional dependencies: spaCy, OCR engines
  • Processing latency: 1-5 seconds per document
  • Configuration complexity: Choosing right level per use case

Metrics to Track

| Metric                          | Target | Alert Threshold |
| ------------------------------- | ------ | --------------- |
| Token reduction ratio           | >60%   | <40%            |
| Entity extraction recall        | >85%   | <70%            |
| Processing time per doc         | <5s    | >10s            |
| Hallucination rate (downstream) | <5%    | >10%            |
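
These thresholds can be enforced mechanically. A hypothetical sketch of an alert check (the `METRIC_RULES` table and `check_metrics` function are illustrative; values are taken from the metrics above):

```python
# (name, higher_is_better, target, alert_threshold) -- from the table above
METRIC_RULES = [
    ("token_reduction_ratio", True, 0.60, 0.40),
    ("entity_extraction_recall", True, 0.85, 0.70),
    ("processing_time_s", False, 5.0, 10.0),
    ("hallucination_rate", False, 0.05, 0.10),
]


def check_metrics(observed: dict[str, float]) -> list[str]:
    """Return the names of observed metrics past their alert threshold."""
    alerts = []
    for name, higher_is_better, _target, alert in METRIC_RULES:
        value = observed.get(name)
        if value is None:
            continue  # Metric not reported this run
        breached = value < alert if higher_is_better else value > alert
        if breached:
            alerts.append(name)
    return alerts
```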

Alternatives Considered

Alternative 1: LLM-Based Pre-Processing

  • Rejected: Defeats the purpose of token reduction
  • Learning: Use LLM for analysis, not preprocessing

Alternative 2: No Pre-Processing (RAG Only)

  • Rejected: Token costs prohibitive at scale
  • Learning: Pre-processing and RAG are complementary

Alternative 3: Single Aggressiveness Level

  • Rejected: Compliance documents need different handling than triage
  • Learning: Configurable levels essential for enterprise

References

Approval

| Role    | Name        | Date | Decision |
| ------- | ----------- | ---- | -------- |
| CTO     | Hal Casteel |      |          |
| ML Lead |             |      |          |