ADR-013: Mixture of Experts (MoE) Judges Framework

Document: ADR-013-moe-judges-framework
Version: 1.0.0
Purpose: Document architectural decisions for multi-agent evaluation with calibrated grading
Audience: Framework contributors, developers, AI agents
Date Created: 2025-12-19
Status: APPROVED
Depends On:
  - ADR-011-uncertainty-quantification-framework
  - ADR-012-moe-analysis-framework
Related ADRs:
  - ADR-010-autonomous-orchestration-system
Related Components:
  - commands/moe-judge.md
  - agents/uncertainty-orchestrator.md
  - skills/uncertainty-quantification/SKILL.md
  - skills/evaluation-framework/SKILL.md
Research Foundation:
  - docs/09-research-analysis/ACADEMIC-RESEARCH-REFERENCES-UQ-MOE-2024-2025.md

Context and Problem Statement

The Dual Evaluation Problem

LLM-based systems require evaluation of both:

Outbound (Prompt Quality):

Prompt ambiguity leading to misinterpretation
Missing context causing hallucinations
Unclear constraints resulting in off-target outputs
No explicit uncertainty handling instructions

Inbound (Response Quality):

Factual correctness of generated claims
Reasoning quality and logical consistency
Appropriate confidence calibration
Safety and constraint adherence

Research Foundation

This ADR is supported by peer-reviewed research from 2024-2025:

Research	Venue	Contribution	Certainty
G-Eval	EMNLP 2023	0.514 Spearman correlation	95%
LLM-Rubric	ACL 2024	Multi-dimensional calibrated evaluation	92%
ChatEval	ICLR 2024	Multi-agent referee teams	91%
Agent-as-a-Judge	arXiv 2024	~70% human judge alignment	88%
RAGAS	Industry	95% faithfulness agreement	90%
DeepEval	Open-source	Comprehensive metrics suite	88%

Full citations: See docs/09-research-analysis/ACADEMIC-RESEARCH-REFERENCES-UQ-MOE-2024-2025.md

Decision Drivers

Calibrated Confidence - Scores must indicate reliability
Multi-Dimensional - Separate assessment per quality dimension
Bias Detection - Identify and report judge biases
Actionable Feedback - Provide improvement recommendations
Research-Backed Rubrics - Grading criteria from validated frameworks

Considered Options

Option A: Single LLM-as-Judge

One model evaluates all outputs
Rejected: Single-judge bias, no cross-validation, limited perspective

Option B: Human Evaluation Only

Manual review of all outputs
Rejected: Not scalable, expensive, inconsistent

Option C: MoE Judges Framework with Calibrated Rubrics (Selected)

Multiple specialized judge agents
Calibrated rubric-based scoring per dimension
Bias detection and reporting
Consensus scoring with confidence
Selected: Addresses all decision drivers with research backing

Option D: Automated Metrics Only (BLEU, ROUGE)

Lexical similarity metrics
Rejected: Poor correlation with human judgment, no reasoning assessment

Decision

Implement Option C: MoE Judges Framework with the following architecture:

1. Dual Evaluation Scope

Part A: Prompt Quality Assessment (Outbound)

Dimensions (20% weight each):

Dimension	Description	Score 5	Score 3	Score 1
Clarity & Specificity	Unambiguous requirements	Single interpretation	Minor ambiguities	Vague, multiple interpretations
Completeness	All context provided	Full context	Most context	Critical context missing
Constraint Quality	Safety/formatting explicit	All constraints stated	Most constraints	Constraints unclear
Domain Alignment	Appropriate for domain	Perfect fit	Reasonable fit	Poor alignment
Uncertainty Handling	Instructions for uncertainty	Clear instructions	Implicit handling	No guidance

Research Basis: G-Eval (EMNLP 2023) achieves 0.514 Spearman correlation using structured dimension scoring.

Part B: Response Quality Assessment (Inbound)

Dimensions (weighted):

Dimension	Weight	Score 5	Score 3	Score 1
Factual Correctness	25%	100% verifiable	70%+ verifiable	<50% verifiable
Coverage & Completeness	20%	All requirements	Main points	Major gaps
Reasoning Quality	20%	Clear step-by-step	Mostly logical	Inconsistent
Uncertainty Expression	20%	Calibrated confidence	Some uncertainty	Overconfident
Safety & Compliance	15%	All constraints	Minor violations	Major violations

Research Basis: LLM-Rubric (ACL 2024) demonstrates multi-dimensional calibrated evaluation.

2. Judge Panel Composition

Recommended Panel (4-6 judges):

judges = [
    {"type": "qa-reviewer", "role": "Clarity Judge",
     "focus": "Prompt clarity and specificity"},
    {"type": "thoughts-analyzer", "role": "Ambiguity Judge",
     "focus": "Hallucination risk and vague quantifiers"},
    {"type": "codebase-analyzer", "role": "Factuality Judge",
     "focus": "Claim verification and technical accuracy"},
    {"type": "web-search-researcher", "role": "Grounding Judge",
     "focus": "Evidence verification against sources"},
    {"type": "code-reviewer", "role": "Reasoning Judge",
     "focus": "Logic consistency and completeness"},
    {"type": "security-specialist", "role": "Safety Judge",
     "focus": "Constraint adherence and safety"}
]

Research Basis: ChatEval (ICLR 2024) demonstrates multi-agent referee teams achieve superior accuracy vs. single-agent evaluation.

3. Confidence Calibration

Calibrated Confidence Formula:

def calibrated_confidence(
    self_certainty: float,      # Judge's internal certainty (0-1)
    evidence_quality: float,    # Quality of supporting evidence (0-1)
    ground_truth_available: bool # Is reference answer available?
) -> float:
    base_confidence = self_certainty * 0.5

    if ground_truth_available:
        return min(base_confidence + 0.4, 0.95)
    elif evidence_quality > 0.7:
        return min(base_confidence + 0.3, 0.85)
    else:
        return min(base_confidence + 0.1, 0.60)

Confidence Interpretation:

Range	Meaning	Action
0.85-1.0	High - strong evidence, reference available	Trust judgment
0.65-0.84	Moderate - good evidence, some uncertainty	Note limitations
0.40-0.64	Low - limited evidence	Seek additional input
0.0-0.39	Very low - speculative	Flag for human review

Research Basis: VOCAL research (2024) demonstrates verbal uncertainty calibration improves hallucination detection by ~30%.

4. Bias Detection Protocol

Required Bias Checks:

Bias Type	Detection Method	Mitigation
Verbosity Bias	Score vs. response length correlation	Normalize for length
Position Bias	Score variance across response order	Swap-order evaluation
Self-Enhancement	Compare judge style to output style	Cross-model judging
Recency Bias	Score vs. information date	Weight recency appropriately
Confirmation Bias	Compare to contrary evidence	Devil's advocate judge

Implementation:

detected_biases = []

# Verbosity bias detection
if response_length > 1000 and score > 4:
    detected_biases.append({
        "type": "verbosity_bias_possible",
        "indicator": "High score with long response",
        "mitigation": "Verify substance matches length"
    })

# Score variance detection
if score_variance_across_judges < 0.5:
    detected_biases.append({
        "type": "undifferentiated_scoring",
        "indicator": "Low variance across judges",
        "mitigation": "Review for groupthink"
    })

Research Basis: LLM-as-Judge research (2024) identifies 10-15% accuracy loss from unmitigated biases.

5. Correctness Determination

Final Verdict Labels:

Label	Score Range	Meaning
mostly_correct	4.0-5.0	High accuracy, minor issues
mixed	2.5-3.9	Significant correct and incorrect elements
mostly_incorrect	1.0-2.4	Majority of content incorrect
unsafe	Any	Safety/compliance violations detected

Consensus Calculation:

def calculate_consensus(judge_scores: List[float]) -> dict:
    mean_score = statistics.mean(judge_scores)
    std_dev = statistics.stdev(judge_scores)

    consensus = 1 - (std_dev / 2.5)  # Normalize to 0-1

    return {
        "aggregate_score": mean_score,
        "consensus_level": consensus,
        "variance": std_dev,
        "requires_investigation": std_dev > 1.5
    }

Research Basis: Agent-as-a-Judge (arXiv 2024) achieves ~70% alignment with human judge consensus, higher than individual humans.

Architecture

Workflow Phases

┌─────────────────────────────────────────────────────────────┐
│                  Phase 1: Judge Dispatch                     │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐       │
│  │ Clarity  │ │Ambiguity │ │Factuality│ │ Grounding│       │
│  │  Judge   │ │  Judge   │ │  Judge   │ │  Judge   │       │
│  └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘       │
│       │            │            │            │              │
│       └────────────┴─────┬──────┴────────────┘              │
│                          ▼                                   │
├─────────────────────────────────────────────────────────────┤
│                  Phase 2: Score Collection                   │
│  Each judge provides:                                        │
│  - Dimension scores (0-5 scale)                             │
│  - Confidence levels (0-1)                                  │
│  - Rationale (2-5 sentences)                                │
│  - Detected biases                                          │
│  - Specific issues found                                    │
│                          │                                   │
│                          ▼                                   │
├─────────────────────────────────────────────────────────────┤
│                  Phase 3: Aggregation                        │
│  ┌─────────────────────────────────────────────────────┐    │
│  │ - Calculate weighted aggregate scores                │    │
│  │ - Measure consensus/variance                         │    │
│  │ - Aggregate detected biases                          │    │
│  │ - Identify disagreements for investigation           │    │
│  └─────────────────────────────────────────────────────┘    │
│                          │                                   │
│                          ▼                                   │
├─────────────────────────────────────────────────────────────┤
│                Phase 4: Meta-Evaluation                      │
│  (Optional: --with-meta-judge)                               │
│  - Evaluate judge panel performance                          │
│  - Detect systematic biases                                  │
│  - Validate scoring consistency                              │
│                          │                                   │
│                          ▼                                   │
├─────────────────────────────────────────────────────────────┤
│                  Phase 5: Report Generation                  │
│  ┌─────────────────────────────────────────────────────┐    │
│  │ - Correctness verdict with justification             │    │
│  │ - Dimension-by-dimension breakdown                   │    │
│  │ - Improvement recommendations                        │    │
│  │ - Bias report                                        │    │
│  └─────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────┘

Output Schema

{
  "evaluation_id": "judge-[timestamp]",
  "prompt_evaluation": {
    "scores_by_dimension": {
      "clarity_specificity": {
        "score": 4.0,
        "confidence": 0.82,
        "rationale": "Prompt clearly defines output format...",
        "detected_biases": [],
        "specific_issues": ["Term 'high-quality' undefined"]
      },
      "completeness": {...},
      "constraint_quality": {...},
      "domain_alignment": {...},
      "uncertainty_handling": {...}
    },
    "overall_prompt_score": 3.8,
    "overall_prompt_confidence": 0.78,
    "key_prompt_issues": [...],
    "improvement_suggestions": [...]
  },
  "response_evaluation": [
    {
      "agent_id": "analyst_1",
      "scores_by_dimension": {
        "factual_correctness": {
          "score": 4.5,
          "confidence": 0.88,
          "rationale": "Claims verified against sources...",
          "detected_biases": []
        },
        "coverage": {...},
        "reasoning_quality": {...},
        "uncertainty_expression": {...},
        "safety_compliance": {...}
      },
      "overall_response_score": 4.2,
      "overall_response_confidence": 0.83,
      "strengths": [...],
      "weaknesses": [...],
      "correctness_judgement": {
        "label": "mostly_correct",
        "justification": "Technical content accurate, minor coverage gaps"
      }
    }
  ],
  "meta_evaluation": {
    "judge_consensus": 0.85,
    "score_variance": 0.3,
    "potential_blind_spots": [...],
    "recommended_additional_judges": [...]
  }
}

Consequences

Positive

Calibrated Trust - Confidence scores indicate judgment reliability
Multi-Dimensional Insight - Specific areas for improvement identified
Bias Transparency - Known biases documented and mitigated
Actionable Recommendations - Clear improvement paths
Quality Assurance - Systematic evaluation of both prompts and responses

Negative

Evaluation Overhead - ~2000-3000 tokens per full evaluation
Latency - Multi-judge coordination adds processing time
Complexity - Sophisticated output parsing required
Judge Selection - Appropriate judges must be chosen for domain

Neutral

Shifts from subjective to structured evaluation
Requires calibration for new domains

Implementation

Phase 1: Core Components (Week 1-2)

Create moe-judge command specification
Define prompt quality rubric
Define response quality rubric
Implement confidence calibration
Create bias detection functions

Phase 2: Integration (Week 3-4)

Integrate with uncertainty-orchestrator
Connect to existing reviewer agents
Implement consensus calculation
Create output formatting

Phase 3: Validation (Week 5-6)

Validate against human judgments
Calibrate confidence thresholds
Test bias detection accuracy
Document edge cases

Validation Criteria

Metric	Target	Measurement
Human Agreement	>70%	Match with expert human judgments
Confidence Calibration	<0.1 ECE	Expected Calibration Error
Bias Detection Rate	>80%	Known biases correctly identified
Consensus Accuracy	>85%	Consensus correct when high
Improvement Relevance	>90%	Suggestions rated as actionable

References

Primary Research (Tier 1: 95%+ Certainty)

G-Eval - Microsoft Research, EMNLP 2023
- URL: https://deepeval.com/docs/metrics-llm-evals
- Contribution: Chain-of-thought evaluation methodology
LLM-Rubric - ACL 2024
- URL: https://arxiv.org/abs/2501.00274
- Contribution: Multi-dimensional calibrated rubrics
ChatEval - ICLR 2024
- URL: https://openreview.net/forum?id=FQepisCUWu
- Contribution: Multi-agent referee team pattern
RAGAS Framework - Explodinggradients
- URL: https://docs.ragas.io
- Contribution: Faithfulness and relevancy metrics

Secondary Research (Tier 2: 85-94% Certainty)

Agent-as-a-Judge - arXiv 2024
- URL: https://arxiv.org/html/2410.10934v2
- Contribution: Agentic evaluation patterns
DeepEval - Confident AI
- URL: https://github.com/confident-ai/deepeval
- Contribution: Comprehensive metrics implementation
LLM-as-Judge Guide - EvidentlyAI
- URL: https://www.evidentlyai.com/llm-guide/llm-as-a-judge
- Contribution: Bias identification and mitigation

CODITECT Components

commands/moe-judge.md - Command specification
agents/uncertainty-orchestrator.md - Orchestration agent
skills/uncertainty-quantification/SKILL.md - Confidence calibration
skills/evaluation-framework/SKILL.md - LLM-as-judge patterns
docs/09-research-analysis/ACADEMIC-RESEARCH-REFERENCES-UQ-MOE-2024-2025.md - Full research catalog

Document Version: 1.0.0 Last Updated: 2025-12-19 Author: CODITECT Research Team Status: APPROVED

Context and Problem Statement​

The Dual Evaluation Problem​

Research Foundation​

Decision Drivers​

Considered Options​

Option A: Single LLM-as-Judge​

Option B: Human Evaluation Only​

Option C: MoE Judges Framework with Calibrated Rubrics (Selected)​

Option D: Automated Metrics Only (BLEU, ROUGE)​

Decision​

1. Dual Evaluation Scope​

Part A: Prompt Quality Assessment (Outbound)​

Part B: Response Quality Assessment (Inbound)​

2. Judge Panel Composition​

3. Confidence Calibration​

4. Bias Detection Protocol​

5. Correctness Determination​

Architecture​

Workflow Phases​

Output Schema​

Consequences​

Positive​

Negative​

Neutral​

Implementation​

Phase 1: Core Components (Week 1-2)​

Phase 2: Integration (Week 3-4)​

Phase 3: Validation (Week 5-6)​

Validation Criteria​

References​

Primary Research (Tier 1: 95%+ Certainty)​

Secondary Research (Tier 2: 85-94% Certainty)​

CODITECT Components​