
ADR-028: Pre-Processor Agent Pipeline

Status

PROPOSED

Date

2026-01-15

Context

Raw document ingestion into LLM pipelines is token-inefficient. Enterprise corpora contain:

  • Boilerplate (headers, footers, legal disclaimers): 10-30% of tokens
  • Duplicate content across documents: 5-20% of tokens
  • Irrelevant sections for specific analysis: 30-60% of tokens
  • Formatting artifacts from OCR/parsing: 5-15% of tokens

Pre-processing with traditional NLP/ML techniques before LLM analysis can achieve 60-80% token reduction while improving output quality by removing noise.

Key Insight from Research

Pieces.app discovered that over-preprocessing increases hallucinations:

"The more pre-processing we did, the more hallucinations were created, and the worse the final summaries."

This ADR defines a calibrated pre-processing pipeline that removes noise without introducing distortion.

Decision

Implement a staged Pre-Processor Agent pipeline with configurable aggressiveness levels.

Pipeline Architecture

                        PRE-PROCESSOR PIPELINE

STAGE 1: EXTRACTION
┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐
│   PDF   │  │  DOCX   │  │  HTML   │  │  Image  │
│ Parser  │  │ Parser  │  │ Parser  │  │  OCR    │
└────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘
     └────────────┴─────┬──────┴────────────┘
                        ▼
              ┌─────────────────────┐
              │  Raw Text + Meta    │
              └──────────┬──────────┘
                        ▼
STAGE 2: CLEANING (Configurable)
┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│ Boilerplate │  │ Whitespace  │  │  Encoding   │
│   Removal   │  │  Normalize  │  │    Fixup    │
└─────────────┘  └─────────────┘  └─────────────┘
                        │
                        ▼
STAGE 3: ENTITY EXTRACTION
┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│     NER     │  │  Date/Time  │  │   Domain    │
│   (spaCy)   │  │   Parsing   │  │  Entities   │
└─────────────┘  └─────────────┘  └─────────────┘
                        │
                        ▼
STAGE 4: FILTERING (Query-Dependent)
┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│   Keyword   │  │   Section   │  │  Relevance  │
│    Match    │  │   Headers   │  │   Scoring   │
└─────────────┘  └─────────────┘  └─────────────┘
                        │
                        ▼
STAGE 5: COMPRESSION (Optional)
┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│ Extractive  │  │  Sentence   │  │   TF-IDF    │
│   Summary   │  │   Scoring   │  │   Ranking   │
└─────────────┘  └─────────────┘  └─────────────┘

Aggressiveness Levels

preprocessing_levels:
  minimal:
    description: "Safe defaults, preserve maximum fidelity"
    stages: [extraction, cleaning]
    token_reduction: "20-30%"
    hallucination_risk: "minimal"
    use_case: "Legal documents, contracts, compliance"

  standard:
    description: "Balanced noise removal"
    stages: [extraction, cleaning, entity_extraction, filtering]
    token_reduction: "50-60%"
    hallucination_risk: "low"
    use_case: "General business documents, reports"

  aggressive:
    description: "Maximum compression for high-volume"
    stages: [extraction, cleaning, entity_extraction, filtering, compression]
    token_reduction: "70-85%"
    hallucination_risk: "medium"
    use_case: "Large corpora, initial triage, non-critical"
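
The level-to-stage mapping above can be made explicit in code. The following is an illustrative sketch only; the `STAGES_BY_LEVEL` mapping and `stages_for` helper are hypothetical names, not part of the agent's API:

```python
# Illustrative only: maps each aggressiveness level to the pipeline
# stages it enables, mirroring the preprocessing_levels table above.
STAGES_BY_LEVEL = {
    "minimal": ["extraction", "cleaning"],
    "standard": ["extraction", "cleaning", "entity_extraction", "filtering"],
    "aggressive": ["extraction", "cleaning", "entity_extraction",
                   "filtering", "compression"],
}


def stages_for(level: str) -> list[str]:
    """Return the enabled stages for a level, failing loudly on typos."""
    try:
        return STAGES_BY_LEVEL[level]
    except KeyError:
        raise ValueError(f"unknown preprocessing level: {level!r}") from None
```

Making the mapping data-driven keeps level definitions in one place, so adding a fourth level does not require touching the pipeline code.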

Agent Definition

from dataclasses import dataclass, field
from typing import Dict, List, Literal, Optional


@dataclass
class PreProcessorAgentConfig:
    """Configuration for the pre-processor agent"""

    # Processing level
    level: Literal["minimal", "standard", "aggressive"] = "standard"

    # Extraction settings
    ocr_engine: str = "tesseract"  # or "azure", "google"
    preserve_layout: bool = True
    extract_tables: bool = True
    extract_images: bool = False

    # Cleaning settings
    remove_headers_footers: bool = True
    normalize_whitespace: bool = True
    fix_encoding: bool = True
    remove_boilerplate_patterns: List[str] = field(default_factory=list)

    # Entity extraction
    ner_model: str = "en_core_web_lg"
    custom_entity_patterns: Dict[str, str] = field(default_factory=dict)
    extract_dates: bool = True
    extract_amounts: bool = True

    # Filtering (query-dependent)
    keyword_filter: Optional[List[str]] = None
    section_filter: Optional[List[str]] = None
    min_relevance_score: float = 0.3

    # Compression
    extractive_summary_sentences: int = 20
    use_tfidf_ranking: bool = True
    preserve_entities: bool = True  # Never compress away extracted entities


import re

import spacy


class PreProcessorAgent:
    """Agent for document pre-processing"""

    def __init__(self, config: PreProcessorAgentConfig):
        self.config = config
        self.nlp = spacy.load(config.ner_model)
        self.metrics = PreProcessorMetrics()

    async def process(self, document: RawDocument) -> ProcessedDocument:
        """Execute the pre-processing pipeline"""

        original_tokens = self._count_tokens(document.content)

        # Stage 1: Extraction
        extracted = await self._extract(document)

        # Stage 2: Cleaning
        cleaned = await self._clean(extracted)

        # Stage 3: Entity extraction (always run; entities are used for grounding)
        entities = await self._extract_entities(cleaned)

        # Stage 4: Filtering (only if keywords were provided)
        if self.config.keyword_filter:
            filtered = await self._filter(cleaned, self.config.keyword_filter)
        else:
            filtered = cleaned

        # Stage 5: Compression (only at the aggressive level)
        if self.config.level == "aggressive":
            compressed = await self._compress(filtered, entities)
        else:
            compressed = filtered

        final_tokens = self._count_tokens(compressed)

        return ProcessedDocument(
            content=compressed,
            entities=entities,
            metadata=document.metadata,
            metrics=ProcessingMetrics(
                original_tokens=original_tokens,
                final_tokens=final_tokens,
                reduction_ratio=(
                    1 - (final_tokens / original_tokens) if original_tokens else 0.0
                ),
                entities_extracted=len(entities),
            ),
        )

    async def _extract(self, document: RawDocument) -> str:
        """Stage 1: Extract text from the document"""

        if document.mime_type == "application/pdf":
            return await self._extract_pdf(document)
        elif document.mime_type == (
            "application/vnd.openxmlformats-officedocument"
            ".wordprocessingml.document"
        ):
            return await self._extract_docx(document)
        elif document.mime_type == "text/html":
            return await self._extract_html(document)
        elif document.mime_type.startswith("image/"):
            return await self._extract_ocr(document)
        else:
            return document.content  # Assume plain text

    async def _clean(self, text: str) -> str:
        """Stage 2: Clean extracted text"""

        result = text

        if self.config.normalize_whitespace:
            # Collapse horizontal whitespace only, so paragraph breaks
            # ('\n\n') survive for the paragraph-based filtering stage.
            result = re.sub(r'[ \t]+', ' ', result)
            result = re.sub(r'\n{3,}', '\n\n', result)

        if self.config.fix_encoding:
            # Drop code points that cannot be encoded as UTF-8
            # (e.g. unpaired surrogates left over from a bad decode).
            result = result.encode('utf-8', errors='ignore').decode('utf-8')

        if self.config.remove_headers_footers:
            result = self._remove_headers_footers(result)

        for pattern in self.config.remove_boilerplate_patterns:
            result = re.sub(pattern, '', result, flags=re.IGNORECASE)

        return result.strip()

    async def _extract_entities(self, text: str) -> List[Entity]:
        """Stage 3: Named entity recognition"""

        doc = self.nlp(text)
        entities = []

        for ent in doc.ents:
            entities.append(Entity(
                text=ent.text,
                label=ent.label_,
                start=ent.start_char,
                end=ent.end_char,
                confidence=1.0,  # spaCy's NER exposes no per-entity score
            ))

        # Custom regex patterns for domain entities
        for label, pattern in self.config.custom_entity_patterns.items():
            for match in re.finditer(pattern, text):
                entities.append(Entity(
                    text=match.group(),
                    label=label,
                    start=match.start(),
                    end=match.end(),
                    confidence=0.9,
                ))

        return entities

    async def _filter(self, text: str, keywords: List[str]) -> str:
        """Stage 4: Filter to relevant paragraphs"""

        paragraphs = text.split('\n\n')
        relevant = []

        for para in paragraphs:
            para_lower = para.lower()
            if any(kw.lower() in para_lower for kw in keywords):
                relevant.append(para)

        return '\n\n'.join(relevant)

    async def _compress(
        self,
        text: str,
        entities: List[Entity],
    ) -> str:
        """Stage 5: Extractive compression"""

        # Sentence segmentation
        doc = self.nlp(text)
        sentences = list(doc.sents)

        # TF-IDF scoring (uniform scores when ranking is disabled)
        if self.config.use_tfidf_ranking:
            scores = self._tfidf_scores(sentences)
        else:
            scores = {i: 1.0 for i in range(len(sentences))}

        # Boost sentences containing extracted entities
        if self.config.preserve_entities:
            entity_texts = {e.text.lower() for e in entities}
            for i, sent in enumerate(sentences):
                if any(et in sent.text.lower() for et in entity_texts):
                    scores[i] *= 2.0

        # Select the top-ranked sentences, restoring document order
        ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
        top_indices = sorted(
            idx for idx, _ in ranked[:self.config.extractive_summary_sentences]
        )

        selected = [sentences[i].text for i in top_indices]
        return ' '.join(selected)
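
The `_tfidf_scores` helper referenced above is not defined in this ADR. A minimal stdlib-only sketch, operating on plain sentence strings rather than spaCy spans and using a smoothed IDF (no stemming or stop-word removal), might look like:

```python
import math
import re
from collections import Counter


def tfidf_scores(sentences: list[str]) -> dict[int, float]:
    """Score each sentence by the summed TF-IDF weight of its terms.

    Sketch only: treats each sentence as a 'document', lowercases,
    and tokenizes on word characters.
    """
    token_lists = [re.findall(r"\w+", s.lower()) for s in sentences]
    n = len(token_lists)

    # Document frequency: number of sentences containing each term
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))

    scores = {}
    for i, tokens in enumerate(token_lists):
        if not tokens:
            scores[i] = 0.0
            continue
        tf = Counter(tokens)
        # Smoothed IDF keeps terms that appear everywhere from zeroing out
        scores[i] = sum(
            (count / len(tokens)) * math.log((1 + n) / (1 + df[t]) + 1)
            for t, count in tf.items()
        )
    return scores
```

Sentences full of corpus-wide terms score low, while sentences with distinctive vocabulary score high, which is exactly the signal the compression stage ranks on.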

Tools Provided

preprocessor_tools:
  - name: ocr_extract
    description: Extract text from images using OCR
    parameters:
      - image_path: string
      - engine: enum[tesseract, azure, google]
      - language: string (default: "eng")
    returns: ExtractedText

  - name: entity_recognize
    description: Run NER on text
    parameters:
      - text: string
      - model: string (default: "en_core_web_lg")
      - custom_patterns: Dict[str, str]
    returns: List[Entity]

  - name: keyword_filter
    description: Filter text to sections containing keywords
    parameters:
      - text: string
      - keywords: List[string]
      - context_sentences: int (default: 2)
    returns: FilteredText

  - name: text_clean
    description: Clean and normalize text
    parameters:
      - text: string
      - remove_boilerplate: bool
      - normalize_whitespace: bool
      - fix_encoding: bool
    returns: CleanedText

  - name: extractive_summarize
    description: Select key sentences using TF-IDF
    parameters:
      - text: string
      - num_sentences: int
      - preserve_entities: bool
    returns: CompressedText
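
The `keyword_filter` tool's `context_sentences` parameter keeps neighbors of matching sentences so matches are not stripped of surrounding context. A minimal sketch of that behavior, assuming naive regex sentence splitting (the standalone function here is illustrative, not the tool's actual implementation):

```python
import re


def keyword_filter(text: str, keywords: list[str],
                   context_sentences: int = 2) -> str:
    """Keep sentences mentioning a keyword, plus up to context_sentences
    neighbors on each side, in document order."""
    # Naive split: break after sentence-ending punctuation followed by space
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    lowered = [s.lower() for s in sentences]
    keep: set[int] = set()
    for i, s in enumerate(lowered):
        if any(kw.lower() in s for kw in keywords):
            lo = max(0, i - context_sentences)
            hi = min(len(sentences), i + context_sentences + 1)
            keep.update(range(lo, hi))
    return " ".join(sentences[i] for i in sorted(keep))
```

With `context_sentences=0` this degenerates to keeping only matching sentences, which is roughly what the paragraph-level `_filter` stage does.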

Workflow Integration

preprocessing_workflow:
  name: corpus-preprocess
  trigger: "@corpus:ingest"

  steps:
    - id: validate_input
      action: validate_document_format
      on_error: reject_with_reason

    - id: detect_type
      action: detect_mime_type
      outputs: [mime_type]

    - id: extract
      action: extract_text
      inputs: [document, mime_type]
      outputs: [raw_text, extraction_metadata]

    - id: clean
      action: clean_text
      inputs: [raw_text, preprocessing_level]
      outputs: [cleaned_text]

    - id: extract_entities
      action: run_ner
      inputs: [cleaned_text]
      outputs: [entities]
      parallel: true

    - id: filter
      action: keyword_filter
      inputs: [cleaned_text, analysis_keywords]
      outputs: [filtered_text]
      condition: "analysis_keywords is not None"

    - id: compress
      action: extractive_summarize
      inputs: [filtered_text, entities]
      outputs: [compressed_text]
      condition: "preprocessing_level == 'aggressive'"

    - id: store
      action: store_processed_document
      inputs: [compressed_text, entities, extraction_metadata]
      outputs: [document_id]

    - id: audit
      action: create_audit_record
      inputs: [document_id, processing_metrics]

Consequences

Positive

  • 60-80% token reduction before LLM processing
  • Structured entity output for downstream use
  • Configurable aggressiveness per use case
  • Deterministic extraction (NER) alongside LLM analysis
  • Parallelizable per document

Negative

  • Additional dependencies: spaCy, OCR engines
  • Processing latency: 1-5 seconds per document
  • Configuration complexity: Choosing right level per use case

Metrics to Track

| Metric                          | Target | Alert Threshold |
| ------------------------------- | ------ | --------------- |
| Token reduction ratio           | >60%   | <40%            |
| Entity extraction recall        | >85%   | <70%            |
| Processing time per doc         | <5s    | >10s            |
| Hallucination rate (downstream) | <5%    | >10%            |
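
These thresholds can be enforced mechanically. A hypothetical sketch of an alert check (the `METRIC_RULES` table and `check_metrics` function are illustrative; values are taken from the metrics above):

```python
# (name, higher_is_better, target, alert_threshold) -- from the table above
METRIC_RULES = [
    ("token_reduction_ratio", True, 0.60, 0.40),
    ("entity_extraction_recall", True, 0.85, 0.70),
    ("processing_time_s", False, 5.0, 10.0),
    ("hallucination_rate", False, 0.05, 0.10),
]


def check_metrics(observed: dict[str, float]) -> list[str]:
    """Return the names of observed metrics past their alert threshold."""
    alerts = []
    for name, higher_is_better, _target, alert in METRIC_RULES:
        value = observed.get(name)
        if value is None:
            continue  # Metric not reported this run
        breached = value < alert if higher_is_better else value > alert
        if breached:
            alerts.append(name)
    return alerts
```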

Alternatives Considered

Alternative 1: LLM-Based Pre-Processing

  • Rejected: Defeats the purpose of token reduction
  • Learning: Use LLM for analysis, not preprocessing

Alternative 2: No Pre-Processing (RAG Only)

  • Rejected: Token costs prohibitive at scale
  • Learning: Pre-processing and RAG are complementary

Alternative 3: Single Aggressiveness Level

  • Rejected: Compliance documents need different handling than triage
  • Learning: Configurable levels essential for enterprise

References

Approval

| Role    | Name        | Date | Decision |
| ------- | ----------- | ---- | -------- |
| CTO     | Hal Casteel |      |          |
| ML Lead |             |      |          |