ADR-028: Pre-Processor Agent Pipeline
Status
PROPOSED
Date
2026-01-15
Context
Raw document ingestion into LLM pipelines is token-inefficient. Enterprise corpora contain:
- Boilerplate (headers, footers, legal disclaimers): 10-30% of tokens
- Duplicate content across documents: 5-20% of tokens
- Irrelevant sections for specific analysis: 30-60% of tokens
- Formatting artifacts from OCR/parsing: 5-15% of tokens
Pre-processing with traditional NLP/ML techniques before LLM analysis can achieve 60-80% token reduction while improving output quality by removing noise.
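The 60-80% headline can be sanity-checked with a back-of-the-envelope multiplicative model. This sketch assumes the noise categories overlap only by chance (independence), and the fractions used are illustrative midpoints of the ranges above, not measurements:

```python
def remaining_fraction(noise_fractions: list[float]) -> float:
    """Fraction of tokens surviving after stripping each noise category,
    assuming the categories overlap only by chance (independence)."""
    remaining = 1.0
    for f in noise_fractions:
        remaining *= 1 - f
    return remaining

# Illustrative midpoints of the ranges above: boilerplate 20%,
# duplicates 12%, irrelevant sections 45%, parsing artifacts 10%.
reduction = 1 - remaining_fraction([0.20, 0.12, 0.45, 0.10])
print(f"estimated reduction: {reduction:.0%}")  # roughly 65%, inside the band
```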
Key Insight from Research
Pieces.app discovered that over-preprocessing increases hallucinations:
"The more pre-processing we did, the more hallucinations were created, and the worse the final summaries."
This ADR defines a calibrated pre-processing pipeline that removes noise without introducing distortion.
Decision
Implement a staged Pre-Processor Agent pipeline with configurable aggressiveness levels.
Pipeline Architecture
```
                      PRE-PROCESSOR PIPELINE

  STAGE 1: EXTRACTION
    [PDF Parser]  [DOCX Parser]  [HTML Parser]  [Image OCR]
                            │
                            ▼
                     Raw Text + Meta
                            │
                            ▼
  STAGE 2: CLEANING (Configurable)
    [Boilerplate Removal]  [Whitespace Normalize]  [Encoding Fixup]
                            │
                            ▼
  STAGE 3: ENTITY EXTRACTION
    [NER (spaCy)]  [Date/Time Parsing]  [Domain Entities]
                            │
                            ▼
  STAGE 4: FILTERING (Query-Dependent)
    [Keyword Match]  [Section Headers]  [Relevance Scoring]
                            │
                            ▼
  STAGE 5: COMPRESSION (Optional)
    [Extractive Summary]  [Sentence Scoring]  [TF-IDF Ranking]
```
Aggressiveness Levels
```yaml
preprocessing_levels:
  minimal:
    description: "Safe defaults, preserve maximum fidelity"
    stages: [extraction, cleaning]
    token_reduction: "20-30%"
    hallucination_risk: "minimal"
    use_case: "Legal documents, contracts, compliance"
  standard:
    description: "Balanced noise removal"
    stages: [extraction, cleaning, entity_extraction, filtering]
    token_reduction: "50-60%"
    hallucination_risk: "low"
    use_case: "General business documents, reports"
  aggressive:
    description: "Maximum compression for high-volume"
    stages: [extraction, cleaning, entity_extraction, filtering, compression]
    token_reduction: "70-85%"
    hallucination_risk: "medium"
    use_case: "Large corpora, initial triage, non-critical"
```
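A minimal sketch of how a pipeline runner might resolve a level name into its stage list. The `STAGE_SETS` table mirrors the YAML above; the function name is an assumption:

```python
# Mirrors the stages lists in preprocessing_levels above.
STAGE_SETS = {
    "minimal": ["extraction", "cleaning"],
    "standard": ["extraction", "cleaning", "entity_extraction", "filtering"],
    "aggressive": ["extraction", "cleaning", "entity_extraction",
                   "filtering", "compression"],
}

def stages_for(level: str) -> list[str]:
    """Return the pipeline stages enabled for a preprocessing level."""
    try:
        return STAGE_SETS[level]
    except KeyError:
        raise ValueError(f"unknown preprocessing level: {level!r}") from None

print(stages_for("standard"))
```

Failing loudly on an unknown level keeps a typo in a workflow config from silently falling back to a different aggressiveness.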
Agent Definition
```python
from dataclasses import dataclass, field
from typing import Dict, List, Literal, Optional

@dataclass
class PreProcessorAgentConfig:
    """Configuration for the pre-processor agent."""
    # Processing level
    level: Literal["minimal", "standard", "aggressive"] = "standard"

    # Extraction settings
    ocr_engine: str = "tesseract"  # or "azure", "google"
    preserve_layout: bool = True
    extract_tables: bool = True
    extract_images: bool = False

    # Cleaning settings
    remove_headers_footers: bool = True
    normalize_whitespace: bool = True
    fix_encoding: bool = True
    remove_boilerplate_patterns: List[str] = field(default_factory=list)

    # Entity extraction
    ner_model: str = "en_core_web_lg"
    custom_entity_patterns: Dict[str, str] = field(default_factory=dict)
    extract_dates: bool = True
    extract_amounts: bool = True

    # Filtering (query-dependent)
    keyword_filter: Optional[List[str]] = None
    section_filter: Optional[List[str]] = None
    min_relevance_score: float = 0.3

    # Compression
    extractive_summary_sentences: int = 20
    use_tfidf_ranking: bool = True
    preserve_entities: bool = True  # Never compress away extracted entities
```
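For illustration, a hypothetical `custom_entity_patterns` value for financial documents. The labels and regexes below are assumptions, not part of the agent; they only show the label-to-regex shape the config expects:

```python
import re

# Hypothetical domain patterns: each entry maps an entity label to a
# regex, matching the shape of the custom_entity_patterns field above.
FINANCE_PATTERNS = {
    "INVOICE_ID": r"\bINV-\d{4,8}\b",
    "USD_AMOUNT": r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?",
}

text = "Invoice INV-00412 totals $1,250.00, due 2026-02-01."
for label, pattern in FINANCE_PATTERNS.items():
    for m in re.finditer(pattern, text):
        print(label, m.group(), m.start(), m.end())
```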
```python
import re

import spacy

class PreProcessorAgent:
    """Agent for document pre-processing."""

    def __init__(self, config: PreProcessorAgentConfig):
        self.config = config
        self.nlp = spacy.load(config.ner_model)
        self.metrics = PreProcessorMetrics()

    async def process(self, document: RawDocument) -> ProcessedDocument:
        """Execute the pre-processing pipeline."""
        original_tokens = self._count_tokens(document.content)

        # Stage 1: Extraction
        extracted = await self._extract(document)

        # Stage 2: Cleaning
        cleaned = await self._clean(extracted)

        # Stage 3: Entity extraction (always run; used for grounding)
        entities = await self._extract_entities(cleaned)

        # Stage 4: Filtering (if keywords provided)
        if self.config.keyword_filter:
            filtered = await self._filter(cleaned, self.config.keyword_filter)
        else:
            filtered = cleaned

        # Stage 5: Compression (aggressive level only)
        if self.config.level == "aggressive":
            compressed = await self._compress(filtered, entities)
        else:
            compressed = filtered

        final_tokens = self._count_tokens(compressed)
        return ProcessedDocument(
            content=compressed,
            entities=entities,
            metadata=document.metadata,
            metrics=ProcessingMetrics(
                original_tokens=original_tokens,
                final_tokens=final_tokens,
                reduction_ratio=(
                    1 - (final_tokens / original_tokens) if original_tokens else 0.0
                ),
                entities_extracted=len(entities),
            ),
        )

    async def _extract(self, document: RawDocument) -> str:
        """Stage 1: Extract text from the document."""
        if document.mime_type == "application/pdf":
            return await self._extract_pdf(document)
        elif document.mime_type == (
            "application/vnd.openxmlformats-officedocument"
            ".wordprocessingml.document"
        ):
            return await self._extract_docx(document)
        elif document.mime_type == "text/html":
            return await self._extract_html(document)
        elif document.mime_type.startswith("image/"):
            return await self._extract_ocr(document)
        else:
            return document.content  # Assume plain text

    async def _clean(self, text: str) -> str:
        """Stage 2: Clean extracted text."""
        result = text
        if self.config.normalize_whitespace:
            # Collapse runs of spaces/tabs first, then cap blank lines.
            # Collapsing all \s+ would destroy the newlines that both the
            # next substitution and paragraph-based filtering depend on.
            result = re.sub(r'[ \t]+', ' ', result)
            result = re.sub(r'\n{3,}', '\n\n', result)
        if self.config.fix_encoding:
            # Drops characters (e.g. lone surrogates) that cannot survive
            # a round-trip through UTF-8.
            result = result.encode('utf-8', errors='ignore').decode('utf-8')
        if self.config.remove_headers_footers:
            result = self._remove_headers_footers(result)
        for pattern in self.config.remove_boilerplate_patterns:
            result = re.sub(pattern, '', result, flags=re.IGNORECASE)
        return result.strip()

    async def _extract_entities(self, text: str) -> List[Entity]:
        """Stage 3: Named entity recognition."""
        doc = self.nlp(text)
        entities = []
        for ent in doc.ents:
            entities.append(Entity(
                text=ent.text,
                label=ent.label_,
                start=ent.start_char,
                end=ent.end_char,
                confidence=1.0,  # spaCy's NER doesn't expose per-entity confidence
            ))
        # Custom regex patterns from config
        for label, pattern in self.config.custom_entity_patterns.items():
            for match in re.finditer(pattern, text):
                entities.append(Entity(
                    text=match.group(),
                    label=label,
                    start=match.start(),
                    end=match.end(),
                    confidence=0.9,
                ))
        return entities

    async def _filter(self, text: str, keywords: List[str]) -> str:
        """Stage 4: Filter to relevant sections."""
        paragraphs = text.split('\n\n')
        relevant = []
        for para in paragraphs:
            para_lower = para.lower()
            if any(kw.lower() in para_lower for kw in keywords):
                relevant.append(para)
        return '\n\n'.join(relevant)

    async def _compress(self, text: str, entities: List[Entity]) -> str:
        """Stage 5: Extractive compression."""
        # Sentence tokenization
        doc = self.nlp(text)
        sentences = list(doc.sents)

        # TF-IDF scoring: maps sentence index -> score
        if self.config.use_tfidf_ranking:
            scores = self._tfidf_scores(sentences)
        else:
            scores = {i: 1.0 for i in range(len(sentences))}

        # Boost sentences containing extracted entities
        if self.config.preserve_entities:
            entity_texts = {e.text.lower() for e in entities}
            for i, sent in enumerate(sentences):
                if any(et in sent.text.lower() for et in entity_texts):
                    scores[i] *= 2.0

        # Select top-scoring sentences, restored to document order
        ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
        top_indices = sorted(
            idx for idx, _ in ranked[:self.config.extractive_summary_sentences]
        )
        selected = [sentences[i].text for i in top_indices]
        return ' '.join(selected)
```
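`_tfidf_scores` is referenced above but not defined. A minimal, dependency-free sketch of one way it could work; the function name, the length-normalization choices, and the index-keyed return shape are assumptions inferred from how `_compress` consumes it:

```python
import math
import re
from collections import Counter

def tfidf_sentence_scores(sentences: list[str]) -> dict[int, float]:
    """Score each sentence by the average TF-IDF weight of its terms,
    treating each sentence as a document. Returns {sentence_index: score}."""
    tokenized = [re.findall(r"[a-z0-9']+", s.lower()) for s in sentences]
    n = len(tokenized)
    # Document frequency: in how many sentences does each term appear?
    df = Counter(term for toks in tokenized for term in set(toks))
    scores: dict[int, float] = {}
    for i, toks in enumerate(tokenized):
        if not toks:
            scores[i] = 0.0
            continue
        tf = Counter(toks)
        total = sum(
            (count / len(toks)) * math.log((1 + n) / (1 + df[term]))
            for term, count in tf.items()
        )
        scores[i] = total / len(tf)
    return scores
```

Sentences full of corpus-wide common terms score low; sentences with distinctive vocabulary score high, which is what the extractive selection in `_compress` relies on.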
Tools Provided
```yaml
preprocessor_tools:
  - name: ocr_extract
    description: Extract text from images using OCR
    parameters:
      - image_path: string
      - engine: enum[tesseract, azure, google]
      - language: string (default: "eng")
    returns: ExtractedText
  - name: entity_recognize
    description: Run NER on text
    parameters:
      - text: string
      - model: string (default: "en_core_web_lg")
      - custom_patterns: Dict[str, str]
    returns: List[Entity]
  - name: keyword_filter
    description: Filter text to sections containing keywords
    parameters:
      - text: string
      - keywords: List[string]
      - context_sentences: int (default: 2)
    returns: FilteredText
  - name: text_clean
    description: Clean and normalize text
    parameters:
      - text: string
      - remove_boilerplate: bool
      - normalize_whitespace: bool
      - fix_encoding: bool
    returns: CleanedText
  - name: extractive_summarize
    description: Select key sentences using TF-IDF
    parameters:
      - text: string
      - num_sentences: int
      - preserve_entities: bool
    returns: CompressedText
```
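The `keyword_filter` tool advertises a `context_sentences` parameter that the paragraph-level `_filter` method does not implement. A sentence-level sketch of that behavior, using a naive regex sentence splitter for self-containment (a real implementation would reuse the spaCy pipeline):

```python
import re

def keyword_filter(text: str, keywords: list[str],
                   context_sentences: int = 2) -> str:
    """Keep sentences mentioning any keyword, plus context_sentences of
    surrounding sentences on each side, in document order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    lowered = [s.lower() for s in sentences]
    keep: set[int] = set()
    for i, sent in enumerate(lowered):
        if any(kw.lower() in sent for kw in keywords):
            lo = max(0, i - context_sentences)
            hi = min(len(sentences), i + context_sentences + 1)
            keep.update(range(lo, hi))
    return " ".join(sentences[i] for i in sorted(keep))
```

Carrying a little context around each hit is what keeps the downstream LLM from seeing keyword matches stripped of the sentences that qualify them.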
Workflow Integration
```yaml
preprocessing_workflow:
  name: corpus-preprocess
  trigger: "@corpus:ingest"
  steps:
    - id: validate_input
      action: validate_document_format
      on_error: reject_with_reason
    - id: detect_type
      action: detect_mime_type
      outputs: [mime_type]
    - id: extract
      action: extract_text
      inputs: [document, mime_type]
      outputs: [raw_text, extraction_metadata]
    - id: clean
      action: clean_text
      inputs: [raw_text, preprocessing_level]
      outputs: [cleaned_text]
    - id: extract_entities
      action: run_ner
      inputs: [cleaned_text]
      outputs: [entities]
      parallel: true
    - id: filter
      action: keyword_filter
      inputs: [cleaned_text, analysis_keywords]
      outputs: [filtered_text]
      condition: "analysis_keywords is not None"
    - id: compress
      action: extractive_summarize
      inputs: [filtered_text, entities]
      outputs: [compressed_text]
      condition: "preprocessing_level == 'aggressive'"
    - id: store
      action: store_processed_document
      inputs: [compressed_text, entities, extraction_metadata]
      outputs: [document_id]
    - id: audit
      action: create_audit_record
      inputs: [document_id, processing_metrics]
```
Consequences
Positive
- 60-80% token reduction before LLM processing
- Structured entity output for downstream use
- Configurable aggressiveness per use case
- Deterministic extraction (NER) alongside LLM analysis
- Parallelizable per document
Negative
- Additional dependencies: spaCy, OCR engines
- Processing latency: 1-5 seconds per document
- Configuration complexity: Choosing right level per use case
Metrics to Track
| Metric | Target | Alert Threshold |
|---|---|---|
| Token reduction ratio | >60% | <40% |
| Entity extraction recall | >85% | <70% |
| Processing time per doc | <5s | >10s |
| Hallucination rate (downstream) | <5% | >10% |
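A hypothetical monitoring check for the first row; the thresholds come from the table, but the function name and tri-state return values are assumptions about how an alerting hook might classify a measured ratio:

```python
def reduction_status(ratio: float) -> str:
    """Classify a measured token-reduction ratio against the table above:
    target >60%, alert below 40%."""
    if ratio < 0.40:
        return "alert"
    if ratio < 0.60:
        return "below-target"
    return "ok"

print(reduction_status(0.72))  # ok
```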
Alternatives Considered
Alternative 1: LLM-Based Pre-Processing
- Rejected: Defeats purpose of token reduction
- Learning: Use LLM for analysis, not preprocessing
Alternative 2: No Pre-Processing (RAG Only)
- Rejected: Token costs prohibitive at scale
- Learning: Pre-processing and RAG are complementary
Alternative 3: Single Aggressiveness Level
- Rejected: Compliance documents need different handling than triage
- Learning: Configurable levels essential for enterprise
References
- ADR-027: Corpus Processing Subsystem Architecture
- spaCy NER Documentation
- Pieces.app Hallucination Research
- Text Preprocessing Best Practices
Approval
| Role | Name | Date | Decision |
|---|---|---|---|
| CTO | Hal Casteel | | |
| ML Lead | | | |