ADR-003: 9-Dimension Quality Scoring System

Status

Accepted

Context

CODITECT operates in regulated industries where document quality is a compliance requirement, not an option. A quality scoring system is needed that: (a) quantifies extraction fidelity across all document aspects, (b) provides a go/no-go gate for agent consumption, (c) generates validation evidence (IQ/OQ/PQ) for FDA 21 CFR Part 11 and similar frameworks, and (d) detects quality regressions when upstream tools (Docling, ar5iv) change.

Decision

Implement a 9-dimension quality scoring system that evaluates the following dimensions (weights in parentheses): structure (0.12), tables (0.12), math (0.15), citations (0.10), images (0.08), content density (0.10), LaTeX residual (0.13), heading hierarchy (0.10), and bibliography (0.10). The weights sum to 1.00, and the weighted sum of the dimension scores produces the overall score. Grade A (≥0.85) is required for agent consumption in regulated workflows. Grade B (≥0.70 and <0.85) triggers a retry. Grade C (<0.70) triggers a human checkpoint.
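The weighted sum and grade gate can be sketched in Python as follows. The weights and thresholds are taken from this decision; the function names, the assumption that each dimension score lies in [0, 1], the rounding step, and the action labels are illustrative and not the actual CODITECT implementation.

```python
# Weights as stated in the Decision section; they sum to 1.00.
WEIGHTS = {
    "structure": 0.12,
    "tables": 0.12,
    "math": 0.15,
    "citations": 0.10,
    "images": 0.08,
    "content_density": 0.10,
    "latex_residual": 0.13,
    "heading_hierarchy": 0.10,
    "bibliography": 0.10,
}

GRADE_A_MIN = 0.85  # required for agent consumption in regulated workflows
GRADE_B_MIN = 0.70  # below Grade A but at or above this value: retry

# Hypothetical action labels, one per grade.
ACTIONS = {
    "A": "release_to_agent",
    "B": "retry_extraction",
    "C": "human_checkpoint",
}


def overall_score(scores, weights=WEIGHTS):
    """Return the weighted sum of per-dimension scores.

    Assumes every dimension score is a float in [0, 1].
    """
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in weights:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"score for {name!r} is outside [0, 1]")
    total = sum(weights[name] * scores[name] for name in weights)
    # Assumption: round to 6 decimals so floating-point noise cannot
    # push a score just below a grade threshold.
    return round(total, 6)


def grade(score):
    """Map an overall score to grade A, B, or C."""
    if score >= GRADE_A_MIN:
        return "A"
    if score >= GRADE_B_MIN:
        return "B"
    return "C"
```

A caller would compute `grade(overall_score(scores))` and look up the result in `ACTIONS` to decide whether to release, retry, or escalate. Keeping the two functions pure (no I/O, no randomness) makes the result reproducible for the same inputs.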

Math (0.15) and LaTeX residual (0.13) together carry the highest combined weight (0.28) because mathematical content is the most common failure mode and the most critical for scientific agent reasoning.
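To make the effect of these weights concrete, here is a short worked calculation. It assumes each dimension score s_i lies in [0, 1] and that every dimension other than the ones named scores a perfect 1; both are assumptions for illustration, not statements about real extraction results.

```latex
% Overall score as a weighted sum over the nine dimensions
S = \sum_{i=1}^{9} w_i \, s_i, \qquad \sum_{i=1}^{9} w_i = 1, \qquad s_i \in [0, 1]

% Math and LaTeX residual both score 0, all other dimensions score 1
S = 1 - (0.15 + 0.13) = 0.72 \quad \text{(Grade B: retry)}

% Only math scores 0, all other dimensions score 1
S = 1 - 0.15 = 0.85 \quad \text{(exactly at the Grade A threshold)}
```

Under these assumptions, a complete failure of both math-related dimensions is enough on its own to drop an otherwise perfect document out of Grade A.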

Consequences

Positive:

  • Provides quantitative quality evidence for IQ/OQ/PQ validation documentation
  • Enables automated regression detection when upstream tools change
  • Grade gate prevents low-quality content from reaching regulated agent workflows
  • Dimension breakdown pinpoints exactly what failed for targeted fixes
  • Tenant-configurable weights allow domain-specific quality priorities
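As an illustration of how the dimension breakdown supports regression detection, the sketch below compares per-dimension scores from a baseline run against a run made after an upstream tool change. The function name, the 0.02 tolerance, and the handling of missing dimensions are hypothetical choices, not part of the decision.

```python
def find_regressions(baseline, current, tolerance=0.02):
    """Return {dimension: drop} for dimensions that got worse.

    `baseline` and `current` map dimension names to scores in [0, 1].
    A dimension is reported when its score fell by more than `tolerance`.
    Assumption: a dimension missing from `current` is treated as score 0.
    """
    regressions = {}
    for name, old_score in baseline.items():
        new_score = current.get(name, 0.0)
        drop = old_score - new_score
        if drop > tolerance:
            # Round to keep the report free of floating-point noise.
            regressions[name] = round(drop, 6)
    return regressions


# Example with made-up scores: math dropped after a tool upgrade.
baseline_scores = {"math": 0.95, "tables": 0.90}
current_scores = {"math": 0.80, "tables": 0.89}
report = find_regressions(baseline_scores, current_scores)
```

Reporting the drop per dimension, rather than only the change in the overall score, shows directly which part of the extraction degraded.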

Negative:

  • Nine dimensions require maintenance: each dimension's scoring logic must be updated when new failure patterns emerge
  • Weights are hand-tuned and may need recalibration as the paper corpus becomes more diverse
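Because weights can be recalibrated or overridden per tenant, it may help to validate any new weight set before use. The sketch below merges tenant overrides into the default weights and renormalizes so the total stays at 1, which keeps the grade thresholds comparable across tenants. The function name and the renormalization policy are assumptions; the decision does not specify how overrides are applied.

```python
# Default weights as stated in the Decision section.
DEFAULT_WEIGHTS = {
    "structure": 0.12,
    "tables": 0.12,
    "math": 0.15,
    "citations": 0.10,
    "images": 0.08,
    "content_density": 0.10,
    "latex_residual": 0.13,
    "heading_hierarchy": 0.10,
    "bibliography": 0.10,
}


def tenant_weights(overrides, defaults=DEFAULT_WEIGHTS):
    """Merge tenant overrides into the defaults and renormalize to sum to 1."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    merged = {**defaults, **overrides}
    if any(weight < 0 for weight in merged.values()):
        raise ValueError("weights must be non-negative")
    total = sum(merged.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    # Hypothetical policy: scale all weights so they sum to 1 again.
    return {name: weight / total for name, weight in merged.items()}


# Example: a tenant that cares more about tables doubles that weight.
table_heavy = tenant_weights({"tables": 0.24})
```

An alternative policy would be to reject any override set that does not already sum to 1; renormalizing is used here only because it is simpler for callers.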

Neutral:

  • Quality scoring adds roughly 0.1–0.3 s per paper, which is negligible relative to extraction time

Alternatives Considered

  1. Binary pass/fail on key metrics only (math valid, tables intact): Too coarse; it misses subtle degradation patterns. Rejected.

  2. LLM-based quality assessment (ask Claude to evaluate extraction quality): Non-deterministic, expensive (LLM tokens per paper), and not auditable for compliance. Rejected.

  3. Single composite score without dimension breakdown: Provides no actionable information when quality drops. Rejected because the purpose of the system is to identify which dimension failed.