ADR-013: Mixture of Experts (MoE) Judges Framework
Document: ADR-013-moe-judges-framework
Version: 1.0.0
Purpose: Document architectural decisions for multi-agent evaluation with calibrated grading
Audience: Framework contributors, developers, AI agents
Date Created: 2025-12-19
Status: APPROVED
Depends On:
- ADR-011-uncertainty-quantification-framework
- ADR-012-moe-analysis-framework
Related ADRs:
- ADR-010-autonomous-orchestration-system
Related Components:
- commands/moe-judge.md
- agents/uncertainty-orchestrator.md
- skills/uncertainty-quantification/SKILL.md
- skills/evaluation-framework/SKILL.md
Research Foundation:
- docs/09-research-analysis/ACADEMIC-RESEARCH-REFERENCES-UQ-MOE-2024-2025.md
Context and Problem Statement
The Dual Evaluation Problem
LLM-based systems require evaluation of both:
Outbound (Prompt Quality):
- Prompt ambiguity leading to misinterpretation
- Missing context causing hallucinations
- Unclear constraints resulting in off-target outputs
- No explicit uncertainty handling instructions
Inbound (Response Quality):
- Factual correctness of generated claims
- Reasoning quality and logical consistency
- Appropriate confidence calibration
- Safety and constraint adherence
Research Foundation
This ADR is supported by peer-reviewed research from 2024-2025:
| Research | Venue | Contribution | Certainty |
|---|---|---|---|
| G-Eval | EMNLP 2023 | 0.514 Spearman correlation | 95% |
| LLM-Rubric | ACL 2024 | Multi-dimensional calibrated evaluation | 92% |
| ChatEval | ICLR 2024 | Multi-agent referee teams | 91% |
| Agent-as-a-Judge | arXiv 2024 | ~70% human judge alignment | 88% |
| RAGAS | Industry | 95% faithfulness agreement | 90% |
| DeepEval | Open-source | Comprehensive metrics suite | 88% |
Full citations: See docs/09-research-analysis/ACADEMIC-RESEARCH-REFERENCES-UQ-MOE-2024-2025.md
Decision Drivers
- Calibrated Confidence - Scores must indicate reliability
- Multi-Dimensional - Separate assessment per quality dimension
- Bias Detection - Identify and report judge biases
- Actionable Feedback - Provide improvement recommendations
- Research-Backed Rubrics - Grading criteria from validated frameworks
Considered Options
Option A: Single LLM-as-Judge
- One model evaluates all outputs
- Rejected: Single-judge bias, no cross-validation, limited perspective
Option B: Human Evaluation Only
- Manual review of all outputs
- Rejected: Not scalable, expensive, inconsistent
Option C: MoE Judges Framework with Calibrated Rubrics (Selected)
- Multiple specialized judge agents
- Calibrated rubric-based scoring per dimension
- Bias detection and reporting
- Consensus scoring with confidence
- Selected: Addresses all decision drivers with research backing
Option D: Automated Metrics Only (BLEU, ROUGE)
- Lexical similarity metrics
- Rejected: Poor correlation with human judgment, no reasoning assessment
Decision
Implement Option C: MoE Judges Framework with the following architecture:
1. Dual Evaluation Scope
Part A: Prompt Quality Assessment (Outbound)
Dimensions (20% weight each):
| Dimension | Description | Score 5 | Score 3 | Score 1 |
|---|---|---|---|---|
| Clarity & Specificity | Unambiguous requirements | Single interpretation | Minor ambiguities | Vague, multiple interpretations |
| Completeness | All context provided | Full context | Most context | Critical context missing |
| Constraint Quality | Safety/formatting explicit | All constraints stated | Most constraints | Constraints unclear |
| Domain Alignment | Appropriate for domain | Perfect fit | Reasonable fit | Poor alignment |
| Uncertainty Handling | Instructions for uncertainty | Clear instructions | Implicit handling | No guidance |
Research Basis: G-Eval (EMNLP 2023) achieves 0.514 Spearman correlation using structured dimension scoring.
Part B: Response Quality Assessment (Inbound)
Dimensions (weighted):
| Dimension | Weight | Score 5 | Score 3 | Score 1 |
|---|---|---|---|---|
| Factual Correctness | 25% | 100% verifiable | 70%+ verifiable | <50% verifiable |
| Coverage & Completeness | 20% | All requirements | Main points | Major gaps |
| Reasoning Quality | 20% | Clear step-by-step | Mostly logical | Inconsistent |
| Uncertainty Expression | 20% | Calibrated confidence | Some uncertainty | Overconfident |
| Safety & Compliance | 15% | All constraints | Minor violations | Major violations |
Research Basis: LLM-Rubric (ACL 2024) demonstrates multi-dimensional calibrated evaluation.
2. Judge Panel Composition
Recommended Panel (4-6 judges):
judges = [
{"type": "qa-reviewer", "role": "Clarity Judge",
"focus": "Prompt clarity and specificity"},
{"type": "thoughts-analyzer", "role": "Ambiguity Judge",
"focus": "Hallucination risk and vague quantifiers"},
{"type": "codebase-analyzer", "role": "Factuality Judge",
"focus": "Claim verification and technical accuracy"},
{"type": "web-search-researcher", "role": "Grounding Judge",
"focus": "Evidence verification against sources"},
{"type": "code-reviewer", "role": "Reasoning Judge",
"focus": "Logic consistency and completeness"},
{"type": "security-specialist", "role": "Safety Judge",
"focus": "Constraint adherence and safety"}
]
Research Basis: ChatEval (ICLR 2024) demonstrates multi-agent referee teams achieve superior accuracy vs. single-agent evaluation.
3. Confidence Calibration
Calibrated Confidence Formula:
def calibrated_confidence(
self_certainty: float, # Judge's internal certainty (0-1)
evidence_quality: float, # Quality of supporting evidence (0-1)
ground_truth_available: bool # Is reference answer available?
) -> float:
base_confidence = self_certainty * 0.5
if ground_truth_available:
return min(base_confidence + 0.4, 0.95)
elif evidence_quality > 0.7:
return min(base_confidence + 0.3, 0.85)
else:
return min(base_confidence + 0.1, 0.60)
Confidence Interpretation:
| Range | Meaning | Action |
|---|---|---|
| 0.85-1.0 | High - strong evidence, reference available | Trust judgment |
| 0.65-0.84 | Moderate - good evidence, some uncertainty | Note limitations |
| 0.40-0.64 | Low - limited evidence | Seek additional input |
| 0.0-0.39 | Very low - speculative | Flag for human review |
Research Basis: VOCAL research (2024) demonstrates verbal uncertainty calibration improves hallucination detection by ~30%.
4. Bias Detection Protocol
Required Bias Checks:
| Bias Type | Detection Method | Mitigation |
|---|---|---|
| Verbosity Bias | Score vs. response length correlation | Normalize for length |
| Position Bias | Score variance across response order | Swap-order evaluation |
| Self-Enhancement | Compare judge style to output style | Cross-model judging |
| Recency Bias | Score vs. information date | Weight recency appropriately |
| Confirmation Bias | Compare to contrary evidence | Devil's advocate judge |
Implementation:
detected_biases = []
# Verbosity bias detection
if response_length > 1000 and score > 4:
detected_biases.append({
"type": "verbosity_bias_possible",
"indicator": "High score with long response",
"mitigation": "Verify substance matches length"
})
# Score variance detection
if score_variance_across_judges < 0.5:
detected_biases.append({
"type": "undifferentiated_scoring",
"indicator": "Low variance across judges",
"mitigation": "Review for groupthink"
})
Research Basis: LLM-as-Judge research (2024) identifies 10-15% accuracy loss from unmitigated biases.
5. Correctness Determination
Final Verdict Labels:
| Label | Score Range | Meaning |
|---|---|---|
| mostly_correct | 4.0-5.0 | High accuracy, minor issues |
| mixed | 2.5-3.9 | Significant correct and incorrect elements |
| mostly_incorrect | 1.0-2.4 | Majority of content incorrect |
| unsafe | Any | Safety/compliance violations detected |
Consensus Calculation:
def calculate_consensus(judge_scores: List[float]) -> dict:
mean_score = statistics.mean(judge_scores)
std_dev = statistics.stdev(judge_scores)
consensus = 1 - (std_dev / 2.5) # Normalize to 0-1
return {
"aggregate_score": mean_score,
"consensus_level": consensus,
"variance": std_dev,
"requires_investigation": std_dev > 1.5
}
Research Basis: Agent-as-a-Judge (arXiv 2024) achieves ~70% alignment with human judge consensus, higher than individual humans.
Architecture
Workflow Phases
┌─────────────────────────────────────────────────────────────┐
│ Phase 1: Judge Dispatch │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Clarity │ │Ambiguity │ │Factuality│ │ Grounding│ │
│ │ Judge │ │ Judge │ │ Judge │ │ Judge │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ │ │ │ │ │
│ └────────────┴─────┬──────┴────────────┘ │
│ ▼ │
├─────────────────────────────────────────────────────────────┤
│ Phase 2: Score Collection │
│ Each judge provides: │
│ - Dimension scores (0-5 scale) │
│ - Confidence levels (0-1) │
│ - Rationale (2-5 sentences) │
│ - Detected biases │
│ - Specific issues found │
│ │ │
│ ▼ │
├─────────────────────────────────────────────────────────────┤
│ Phase 3: Aggregation │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ - Calculate weighted aggregate scores │ │
│ │ - Measure consensus/variance │ │
│ │ - Aggregate detected biases │ │
│ │ - Identify disagreements for investigation │ │
│ └─────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
├─────────────────────────────────────────────────────────────┤
│ Phase 4: Meta-Evaluation │
│ (Optional: --with-meta-judge) │
│ - Evaluate judge panel performance │
│ - Detect systematic biases │
│ - Validate scoring consistency │
│ │ │
│ ▼ │
├─────────────────────────────────────────────────────────────┤
│ Phase 5: Report Generation │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ - Correctness verdict with justification │ │
│ │ - Dimension-by-dimension breakdown │ │
│ │ - Improvement recommendations │ │
│ │ - Bias report │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Output Schema
{
"evaluation_id": "judge-[timestamp]",
"prompt_evaluation": {
"scores_by_dimension": {
"clarity_specificity": {
"score": 4.0,
"confidence": 0.82,
"rationale": "Prompt clearly defines output format...",
"detected_biases": [],
"specific_issues": ["Term 'high-quality' undefined"]
},
"completeness": {...},
"constraint_quality": {...},
"domain_alignment": {...},
"uncertainty_handling": {...}
},
"overall_prompt_score": 3.8,
"overall_prompt_confidence": 0.78,
"key_prompt_issues": [...],
"improvement_suggestions": [...]
},
"response_evaluation": [
{
"agent_id": "analyst_1",
"scores_by_dimension": {
"factual_correctness": {
"score": 4.5,
"confidence": 0.88,
"rationale": "Claims verified against sources...",
"detected_biases": []
},
"coverage": {...},
"reasoning_quality": {...},
"uncertainty_expression": {...},
"safety_compliance": {...}
},
"overall_response_score": 4.2,
"overall_response_confidence": 0.83,
"strengths": [...],
"weaknesses": [...],
"correctness_judgement": {
"label": "mostly_correct",
"justification": "Technical content accurate, minor coverage gaps"
}
}
],
"meta_evaluation": {
"judge_consensus": 0.85,
"score_variance": 0.3,
"potential_blind_spots": [...],
"recommended_additional_judges": [...]
}
}
Consequences
Positive
- Calibrated Trust - Confidence scores indicate judgment reliability
- Multi-Dimensional Insight - Specific areas for improvement identified
- Bias Transparency - Known biases documented and mitigated
- Actionable Recommendations - Clear improvement paths
- Quality Assurance - Systematic evaluation of both prompts and responses
Negative
- Evaluation Overhead - ~2000-3000 tokens per full evaluation
- Latency - Multi-judge coordination adds processing time
- Complexity - Sophisticated output parsing required
- Judge Selection - Appropriate judges must be chosen for domain
Neutral
- Shifts from subjective to structured evaluation
- Requires calibration for new domains
Implementation
Phase 1: Core Components (Week 1-2)
- Create
moe-judgecommand specification - Define prompt quality rubric
- Define response quality rubric
- Implement confidence calibration
- Create bias detection functions
Phase 2: Integration (Week 3-4)
- Integrate with uncertainty-orchestrator
- Connect to existing reviewer agents
- Implement consensus calculation
- Create output formatting
Phase 3: Validation (Week 5-6)
- Validate against human judgments
- Calibrate confidence thresholds
- Test bias detection accuracy
- Document edge cases
Validation Criteria
| Metric | Target | Measurement |
|---|---|---|
| Human Agreement | >70% | Match with expert human judgments |
| Confidence Calibration | <0.1 ECE | Expected Calibration Error |
| Bias Detection Rate | >80% | Known biases correctly identified |
| Consensus Accuracy | >85% | Consensus correct when high |
| Improvement Relevance | >90% | Suggestions rated as actionable |
References
Primary Research (Tier 1: 95%+ Certainty)
-
G-Eval - Microsoft Research, EMNLP 2023
- URL: https://deepeval.com/docs/metrics-llm-evals
- Contribution: Chain-of-thought evaluation methodology
-
LLM-Rubric - ACL 2024
- URL: https://arxiv.org/abs/2501.00274
- Contribution: Multi-dimensional calibrated rubrics
-
ChatEval - ICLR 2024
- URL: https://openreview.net/forum?id=FQepisCUWu
- Contribution: Multi-agent referee team pattern
-
RAGAS Framework - Explodinggradients
- URL: https://docs.ragas.io
- Contribution: Faithfulness and relevancy metrics
Secondary Research (Tier 2: 85-94% Certainty)
-
Agent-as-a-Judge - arXiv 2024
- URL: https://arxiv.org/html/2410.10934v2
- Contribution: Agentic evaluation patterns
-
DeepEval - Confident AI
- URL: https://github.com/confident-ai/deepeval
- Contribution: Comprehensive metrics implementation
-
LLM-as-Judge Guide - EvidentlyAI
- URL: https://www.evidentlyai.com/llm-guide/llm-as-a-judge
- Contribution: Bias identification and mitigation
CODITECT Components
commands/moe-judge.md- Command specificationagents/uncertainty-orchestrator.md- Orchestration agentskills/uncertainty-quantification/SKILL.md- Confidence calibrationskills/evaluation-framework/SKILL.md- LLM-as-judge patternsdocs/09-research-analysis/ACADEMIC-RESEARCH-REFERENCES-UQ-MOE-2024-2025.md- Full research catalog
Document Version: 1.0.0 Last Updated: 2025-12-19 Author: CODITECT Research Team Status: APPROVED