ADR-013: Mixture of Experts (MoE) Judges Framework

Document: ADR-013-moe-judges-framework
Version: 1.0.0
Purpose: Document architectural decisions for multi-agent evaluation with calibrated grading
Audience: Framework contributors, developers, AI agents
Date Created: 2025-12-19
Status: APPROVED
Depends On:
- ADR-011-uncertainty-quantification-framework
- ADR-012-moe-analysis-framework
Related ADRs:
- ADR-010-autonomous-orchestration-system
Related Components:
- commands/moe-judge.md
- agents/uncertainty-orchestrator.md
- skills/uncertainty-quantification/SKILL.md
- skills/evaluation-framework/SKILL.md
Research Foundation:
- docs/09-research-analysis/ACADEMIC-RESEARCH-REFERENCES-UQ-MOE-2024-2025.md

Context and Problem Statement

The Dual Evaluation Problem

LLM-based systems require evaluation of both:

Outbound (Prompt Quality):

  • Prompt ambiguity leading to misinterpretation
  • Missing context causing hallucinations
  • Unclear constraints resulting in off-target outputs
  • No explicit uncertainty handling instructions

Inbound (Response Quality):

  • Factual correctness of generated claims
  • Reasoning quality and logical consistency
  • Appropriate confidence calibration
  • Safety and constraint adherence

Research Foundation

This ADR is supported by peer-reviewed research from 2024-2025:

| Research | Venue | Contribution | Certainty |
|---|---|---|---|
| G-Eval | EMNLP 2023 | 0.514 Spearman correlation | 95% |
| LLM-Rubric | ACL 2024 | Multi-dimensional calibrated evaluation | 92% |
| ChatEval | ICLR 2024 | Multi-agent referee teams | 91% |
| Agent-as-a-Judge | arXiv 2024 | ~70% human judge alignment | 88% |
| RAGAS | Industry | 95% faithfulness agreement | 90% |
| DeepEval | Open-source | Comprehensive metrics suite | 88% |

Full citations: See docs/09-research-analysis/ACADEMIC-RESEARCH-REFERENCES-UQ-MOE-2024-2025.md

Decision Drivers

  1. Calibrated Confidence - Scores must indicate reliability
  2. Multi-Dimensional - Separate assessment per quality dimension
  3. Bias Detection - Identify and report judge biases
  4. Actionable Feedback - Provide improvement recommendations
  5. Research-Backed Rubrics - Grading criteria from validated frameworks

Considered Options

Option A: Single LLM-as-Judge

  • One model evaluates all outputs
  • Rejected: Single-judge bias, no cross-validation, limited perspective

Option B: Human Evaluation Only

  • Manual review of all outputs
  • Rejected: Not scalable, expensive, inconsistent

Option C: MoE Judges Framework with Calibrated Rubrics (Selected)

  • Multiple specialized judge agents
  • Calibrated rubric-based scoring per dimension
  • Bias detection and reporting
  • Consensus scoring with confidence
  • Selected: Addresses all decision drivers with research backing

Option D: Automated Metrics Only (BLEU, ROUGE)

  • Lexical similarity metrics
  • Rejected: Poor correlation with human judgment, no reasoning assessment

Decision

Implement Option C: MoE Judges Framework with the following architecture:

1. Dual Evaluation Scope

Part A: Prompt Quality Assessment (Outbound)

Dimensions (20% weight each):

| Dimension | Description | Score 5 | Score 3 | Score 1 |
|---|---|---|---|---|
| Clarity & Specificity | Unambiguous requirements | Single interpretation | Minor ambiguities | Vague, multiple interpretations |
| Completeness | All context provided | Full context | Most context | Critical context missing |
| Constraint Quality | Safety/formatting explicit | All constraints stated | Most constraints | Constraints unclear |
| Domain Alignment | Appropriate for domain | Perfect fit | Reasonable fit | Poor alignment |
| Uncertainty Handling | Instructions for uncertainty | Clear instructions | Implicit handling | No guidance |

Research Basis: G-Eval (EMNLP 2023) achieves 0.514 Spearman correlation using structured dimension scoring.

Part B: Response Quality Assessment (Inbound)

Dimensions (weighted):

| Dimension | Weight | Score 5 | Score 3 | Score 1 |
|---|---|---|---|---|
| Factual Correctness | 25% | 100% verifiable | 70%+ verifiable | <50% verifiable |
| Coverage & Completeness | 20% | All requirements | Main points | Major gaps |
| Reasoning Quality | 20% | Clear step-by-step | Mostly logical | Inconsistent |
| Uncertainty Expression | 20% | Calibrated confidence | Some uncertainty | Overconfident |
| Safety & Compliance | 15% | All constraints | Minor violations | Major violations |

Research Basis: LLM-Rubric (ACL 2024) demonstrates multi-dimensional calibrated evaluation.
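The weighted aggregation above reduces to a dot product of the rubric weights and per-dimension scores. A minimal sketch in Python, using the weights from the table; the dictionary keys and function name are illustrative, not part of the framework's API:

```python
# Weights from the response-quality rubric above (they sum to 1.0).
RESPONSE_WEIGHTS = {
    "factual_correctness": 0.25,
    "coverage_completeness": 0.20,
    "reasoning_quality": 0.20,
    "uncertainty_expression": 0.20,
    "safety_compliance": 0.15,
}

def aggregate_response_score(dimension_scores: dict) -> float:
    """Weighted mean of per-dimension scores (each on the 0-5 scale)."""
    return sum(RESPONSE_WEIGHTS[dim] * score
               for dim, score in dimension_scores.items())

# Example: strong factuality and safety, weaker uncertainty expression.
scores = {
    "factual_correctness": 4.5,
    "coverage_completeness": 4.0,
    "reasoning_quality": 4.0,
    "uncertainty_expression": 3.5,
    "safety_compliance": 5.0,
}
```

The same pattern applies to Part A with five equal 20% weights.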

2. Judge Panel Composition

Recommended Panel (4-6 judges):

judges = [
    {"type": "qa-reviewer", "role": "Clarity Judge",
     "focus": "Prompt clarity and specificity"},
    {"type": "thoughts-analyzer", "role": "Ambiguity Judge",
     "focus": "Hallucination risk and vague quantifiers"},
    {"type": "codebase-analyzer", "role": "Factuality Judge",
     "focus": "Claim verification and technical accuracy"},
    {"type": "web-search-researcher", "role": "Grounding Judge",
     "focus": "Evidence verification against sources"},
    {"type": "code-reviewer", "role": "Reasoning Judge",
     "focus": "Logic consistency and completeness"},
    {"type": "security-specialist", "role": "Safety Judge",
     "focus": "Constraint adherence and safety"}
]

Research Basis: ChatEval (ICLR 2024) demonstrates multi-agent referee teams achieve superior accuracy vs. single-agent evaluation.
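Operationally, the panel is a fan-out-and-collect loop over the judge list. A minimal sketch; the `evaluate` callable stands in for whatever agent-invocation mechanism the framework provides and is an assumption, as is the verdict shape:

```python
def dispatch_judges(judges, artifact, evaluate):
    """Fan the artifact out to each judge and collect structured verdicts.

    `evaluate` is a placeholder for the real agent invocation; here it is
    assumed to return a dict with at least "score" and "confidence" keys.
    """
    results = []
    for judge in judges:
        verdict = evaluate(judge, artifact)
        results.append({
            "role": judge["role"],
            "focus": judge["focus"],
            **verdict,
        })
    return results

# Example with a stub evaluator in place of real judge agents:
judges = [
    {"type": "qa-reviewer", "role": "Clarity Judge", "focus": "Prompt clarity"},
    {"type": "code-reviewer", "role": "Reasoning Judge", "focus": "Logic"},
]
stub = lambda judge, artifact: {"score": 4.0, "confidence": 0.8}
panel = dispatch_judges(judges, "example prompt", stub)
```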

3. Confidence Calibration

Calibrated Confidence Formula:

def calibrated_confidence(
    self_certainty: float,        # Judge's internal certainty (0-1)
    evidence_quality: float,      # Quality of supporting evidence (0-1)
    ground_truth_available: bool  # Is reference answer available?
) -> float:
    base_confidence = self_certainty * 0.5

    if ground_truth_available:
        return min(base_confidence + 0.4, 0.95)
    elif evidence_quality > 0.7:
        return min(base_confidence + 0.3, 0.85)
    else:
        return min(base_confidence + 0.1, 0.60)

Confidence Interpretation:

| Range | Meaning | Action |
|---|---|---|
| 0.85-1.0 | High - strong evidence, reference available | Trust judgment |
| 0.65-0.84 | Moderate - good evidence, some uncertainty | Note limitations |
| 0.40-0.64 | Low - limited evidence | Seek additional input |
| 0.0-0.39 | Very low - speculative | Flag for human review |

Research Basis: VOCAL research (2024) demonstrates verbal uncertainty calibration improves hallucination detection by ~30%.
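The interpretation table maps directly onto a threshold lookup. A minimal sketch (the function name and action labels are illustrative):

```python
def confidence_action(confidence: float) -> str:
    """Map a calibrated confidence (0-1) to the recommended action,
    using the thresholds from the interpretation table."""
    if confidence >= 0.85:
        return "trust_judgment"         # High: strong evidence
    elif confidence >= 0.65:
        return "note_limitations"       # Moderate: some uncertainty
    elif confidence >= 0.40:
        return "seek_additional_input"  # Low: limited evidence
    else:
        return "flag_for_human_review"  # Very low: speculative
```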

4. Bias Detection Protocol

Required Bias Checks:

| Bias Type | Detection Method | Mitigation |
|---|---|---|
| Verbosity Bias | Score vs. response length correlation | Normalize for length |
| Position Bias | Score variance across response order | Swap-order evaluation |
| Self-Enhancement | Compare judge style to output style | Cross-model judging |
| Recency Bias | Score vs. information date | Weight recency appropriately |
| Confirmation Bias | Compare to contrary evidence | Devil's advocate judge |

Implementation:

detected_biases = []

# Verbosity bias detection
if response_length > 1000 and score > 4:
    detected_biases.append({
        "type": "verbosity_bias_possible",
        "indicator": "High score with long response",
        "mitigation": "Verify substance matches length"
    })

# Score variance detection
if score_variance_across_judges < 0.5:
    detected_biases.append({
        "type": "undifferentiated_scoring",
        "indicator": "Low variance across judges",
        "mitigation": "Review for groupthink"
    })

Research Basis: LLM-as-Judge research (2024) identifies 10-15% accuracy loss from unmitigated biases.

5. Correctness Determination

Final Verdict Labels:

| Label | Score Range | Meaning |
|---|---|---|
| mostly_correct | 4.0-5.0 | High accuracy, minor issues |
| mixed | 2.5-3.9 | Significant correct and incorrect elements |
| mostly_incorrect | 1.0-2.4 | Majority of content incorrect |
| unsafe | Any | Safety/compliance violations detected |
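The label table can be applied mechanically, with a detected safety violation overriding any score-based label. A minimal sketch (the function name is illustrative):

```python
def verdict_label(score: float, safety_violation: bool) -> str:
    """Derive the final verdict label from the aggregate score (0-5);
    safety/compliance violations override the score-based ranges."""
    if safety_violation:
        return "unsafe"
    if score >= 4.0:
        return "mostly_correct"
    elif score >= 2.5:
        return "mixed"
    else:
        return "mostly_incorrect"
```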

Consensus Calculation:

import statistics
from typing import List

def calculate_consensus(judge_scores: List[float]) -> dict:
    # Requires at least two judge scores for a standard deviation.
    mean_score = statistics.mean(judge_scores)
    std_dev = statistics.stdev(judge_scores)

    # Max possible stdev on the 0-5 scale is 2.5, so this normalizes to 0-1.
    consensus = 1 - (std_dev / 2.5)

    return {
        "aggregate_score": mean_score,
        "consensus_level": consensus,
        "variance": std_dev,
        "requires_investigation": std_dev > 1.5
    }

Research Basis: Agent-as-a-Judge (arXiv 2024) achieves ~70% alignment with human judge consensus, higher than individual humans.

Architecture

Workflow Phases

┌─────────────────────────────────────────────────────────────┐
│ Phase 1: Judge Dispatch │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Clarity │ │Ambiguity │ │Factuality│ │ Grounding│ │
│ │ Judge │ │ Judge │ │ Judge │ │ Judge │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ │ │ │ │ │
│ └────────────┴─────┬──────┴────────────┘ │
│ ▼ │
├─────────────────────────────────────────────────────────────┤
│ Phase 2: Score Collection │
│ Each judge provides: │
│ - Dimension scores (0-5 scale) │
│ - Confidence levels (0-1) │
│ - Rationale (2-5 sentences) │
│ - Detected biases │
│ - Specific issues found │
│ │ │
│ ▼ │
├─────────────────────────────────────────────────────────────┤
│ Phase 3: Aggregation │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ - Calculate weighted aggregate scores │ │
│ │ - Measure consensus/variance │ │
│ │ - Aggregate detected biases │ │
│ │ - Identify disagreements for investigation │ │
│ └─────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
├─────────────────────────────────────────────────────────────┤
│ Phase 4: Meta-Evaluation │
│ (Optional: --with-meta-judge) │
│ - Evaluate judge panel performance │
│ - Detect systematic biases │
│ - Validate scoring consistency │
│ │ │
│ ▼ │
├─────────────────────────────────────────────────────────────┤
│ Phase 5: Report Generation │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ - Correctness verdict with justification │ │
│ │ - Dimension-by-dimension breakdown │ │
│ │ - Improvement recommendations │ │
│ │ - Bias report │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘

Output Schema

{
  "evaluation_id": "judge-[timestamp]",
  "prompt_evaluation": {
    "scores_by_dimension": {
      "clarity_specificity": {
        "score": 4.0,
        "confidence": 0.82,
        "rationale": "Prompt clearly defines output format...",
        "detected_biases": [],
        "specific_issues": ["Term 'high-quality' undefined"]
      },
      "completeness": {...},
      "constraint_quality": {...},
      "domain_alignment": {...},
      "uncertainty_handling": {...}
    },
    "overall_prompt_score": 3.8,
    "overall_prompt_confidence": 0.78,
    "key_prompt_issues": [...],
    "improvement_suggestions": [...]
  },
  "response_evaluation": [
    {
      "agent_id": "analyst_1",
      "scores_by_dimension": {
        "factual_correctness": {
          "score": 4.5,
          "confidence": 0.88,
          "rationale": "Claims verified against sources...",
          "detected_biases": []
        },
        "coverage": {...},
        "reasoning_quality": {...},
        "uncertainty_expression": {...},
        "safety_compliance": {...}
      },
      "overall_response_score": 4.2,
      "overall_response_confidence": 0.83,
      "strengths": [...],
      "weaknesses": [...],
      "correctness_judgement": {
        "label": "mostly_correct",
        "justification": "Technical content accurate, minor coverage gaps"
      }
    }
  ],
  "meta_evaluation": {
    "judge_consensus": 0.85,
    "score_variance": 0.3,
    "potential_blind_spots": [...],
    "recommended_additional_judges": [...]
  }
}
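Since the Consequences section notes that sophisticated output parsing is required, a downstream consumer might first verify the report's top-level shape before processing it. A minimal sketch; the function name is illustrative and it checks only the top-level keys shown in the schema above:

```python
def check_report_shape(report: dict) -> list:
    """Return a list of missing top-level keys from a judge report."""
    required = ("evaluation_id", "prompt_evaluation",
                "response_evaluation", "meta_evaluation")
    return [key for key in required if key not in report]

# Example: a report that omits the optional meta_evaluation section.
issues = check_report_shape({
    "evaluation_id": "judge-20251219",
    "prompt_evaluation": {},
    "response_evaluation": [],
})
```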

Consequences

Positive

  • Calibrated Trust - Confidence scores indicate judgment reliability
  • Multi-Dimensional Insight - Specific areas for improvement identified
  • Bias Transparency - Known biases documented and mitigated
  • Actionable Recommendations - Clear improvement paths
  • Quality Assurance - Systematic evaluation of both prompts and responses

Negative

  • Evaluation Overhead - ~2000-3000 tokens per full evaluation
  • Latency - Multi-judge coordination adds processing time
  • Complexity - Sophisticated output parsing required
  • Judge Selection - Appropriate judges must be chosen for domain

Neutral

  • Shifts from subjective to structured evaluation
  • Requires calibration for new domains

Implementation

Phase 1: Core Components (Week 1-2)

  • Create moe-judge command specification
  • Define prompt quality rubric
  • Define response quality rubric
  • Implement confidence calibration
  • Create bias detection functions

Phase 2: Integration (Week 3-4)

  • Integrate with uncertainty-orchestrator
  • Connect to existing reviewer agents
  • Implement consensus calculation
  • Create output formatting

Phase 3: Validation (Week 5-6)

  • Validate against human judgments
  • Calibrate confidence thresholds
  • Test bias detection accuracy
  • Document edge cases

Validation Criteria

| Metric | Target | Measurement |
|---|---|---|
| Human Agreement | >70% | Match with expert human judgments |
| Confidence Calibration | <0.1 ECE | Expected Calibration Error |
| Bias Detection Rate | >80% | Known biases correctly identified |
| Consensus Accuracy | >85% | Consensus correct when high |
| Improvement Relevance | >90% | Suggestions rated as actionable |
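The <0.1 ECE target can be measured with the standard binned Expected Calibration Error. A minimal sketch, assuming per-judgment confidences in [0, 1] and binary correctness outcomes, with 10 equal-width bins:

```python
def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Binned ECE: the bin-weighted mean of |accuracy - mean confidence|."""
    assert len(confidences) == len(outcomes) and confidences
    total = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Half-open bins (lo, hi], with 0.0 assigned to the first bin.
        in_bin = [(c, o) for c, o in zip(confidences, outcomes)
                  if lo < c <= hi or (b == 0 and c == 0.0)]
        if not in_bin:
            continue
        avg_conf = sum(c for c, _ in in_bin) / len(in_bin)
        accuracy = sum(o for _, o in in_bin) / len(in_bin)
        ece += (len(in_bin) / total) * abs(accuracy - avg_conf)
    return ece
```

A perfectly calibrated judge (e.g. 0.8 confidence, correct 80% of the time) yields an ECE of 0; a judge that is always 0.9 confident but never correct yields 0.9.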

References

Primary Research (Tier 1: 95%+ Certainty)

  1. G-Eval - Microsoft Research, EMNLP 2023

  2. LLM-Rubric - ACL 2024

  3. ChatEval - ICLR 2024

  4. RAGAS Framework - Explodinggradients

Secondary Research (Tier 2: 85-94% Certainty)

  1. Agent-as-a-Judge - arXiv 2024

  2. DeepEval - Confident AI

  3. LLM-as-Judge Guide - EvidentlyAI

CODITECT Components

  • commands/moe-judge.md - Command specification
  • agents/uncertainty-orchestrator.md - Orchestration agent
  • skills/uncertainty-quantification/SKILL.md - Confidence calibration
  • skills/evaluation-framework/SKILL.md - LLM-as-judge patterns
  • docs/09-research-analysis/ACADEMIC-RESEARCH-REFERENCES-UQ-MOE-2024-2025.md - Full research catalog

Document Version: 1.0.0 Last Updated: 2025-12-19 Author: CODITECT Research Team Status: APPROVED