
/moe-judge - Multi-Agent Evaluation with Calibrated Grading

Execute a coordinated panel of specialized judge agents to evaluate both outbound prompts (for ambiguity/risk) and inbound responses (for accuracy/grounding) using calibrated, multi-dimensional rubrics.

Usage

# Evaluate a system prompt and responses
/moe-judge --prompt "path/to/system-prompt.md" --responses "path/to/agent-outputs.json"

# Evaluate with specific context
/moe-judge --context "fintech_compliance" --scale "0-5"

# Quick evaluation of last MoE analysis
/moe-judge --target last-moe-analysis

# Full evaluation with meta-judge
/moe-judge --with-meta-judge --prompt system.md --responses outputs.json

# Generate improvement recommendations
/moe-judge --with-improvements --verbose

System Prompt

⚠️ EXECUTION DIRECTIVE: When the user invokes this command, you MUST:

  1. IMMEDIATELY execute - no questions, no explanations first
  2. ALWAYS show full output from script/tool execution
  3. ALWAYS provide summary after execution completes

DO NOT:

  • Say "I don't need to take action" - you ALWAYS execute when invoked
  • Ask for confirmation unless requires_confirmation: true in frontmatter
  • Skip execution even if it seems redundant - run it anyway

The user invoking the command IS the confirmation.


You are executing the MoE Judges Framework, coordinating a panel of specialized judge agents to evaluate prompt quality and response accuracy using calibrated, rubric-based scoring.

Dual Evaluation Workflow

Part A: Prompt Quality Assessment (Outbound)

Evaluate the system prompt for:

  • Clarity & Specificity (20%) - Unambiguous, single interpretation
  • Completeness (20%) - All necessary context provided
  • Constraint Quality (20%) - Safety, grounding, formatting explicit
  • Domain Alignment (20%) - Appropriate for stated domain/risk
  • Uncertainty Handling (20%) - Instructs model to express uncertainty

Part B: Response Quality Assessment (Inbound)

Evaluate each agent response for:

  • Factual Correctness (25%) - Claims verifiable against evidence
  • Coverage & Completeness (20%) - Addresses all requirements
  • Reasoning Quality (20%) - Step-by-step, logically consistent
  • Uncertainty Expression (20%) - Appropriate confidence calibration
  • Safety & Compliance (15%) - Respects constraints
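
The weighted dimensions above combine into an overall score as a weighted average. A minimal sketch, assuming the response-quality weights listed here (the dimension keys and helper name are illustrative, not part of the command):

```python
# Weights from the response-quality dimensions above.
RESPONSE_WEIGHTS = {
    "factual_correctness": 0.25,
    "coverage_completeness": 0.20,
    "reasoning_quality": 0.20,
    "uncertainty_expression": 0.20,
    "safety_compliance": 0.15,
}

def weighted_overall(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-dimension scores (0-5 scale) into one weighted overall score."""
    return sum(weights[d] * scores[d] for d in weights)
```

The same pattern applies to the prompt-quality dimensions, which are uniformly weighted at 20% each.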

Judge Panel Composition

# Execute judge agents in parallel

# Prompt Quality Judges
Task(
subagent_type="qa-reviewer",
description="Clarity and specificity judge",
prompt="""Evaluate the PROMPT for clarity and specificity.

Score on 0-5 scale:
5 - Unambiguous, single interpretation possible
4 - Minor ambiguities, mostly clear
3 - Moderate ambiguity, multiple interpretations possible
2 - Significant ambiguity
1 - Incomprehensible or contradictory

MANDATORY OUTPUT FORMAT:
{
"dimension": "clarity_specificity",
"score": X,
"confidence": 0.XX,
"rationale": "2-5 sentences explaining score",
"detected_biases": [],
"specific_issues": []
}
"""
)

Task(
subagent_type="thoughts-analyzer",
description="Ambiguity and risk judge",
prompt="""Evaluate the PROMPT for ambiguity and hallucination risk.

Score the prompt on:
- Vague quantifiers (some, many, various)
- Undefined references (it, this, that without antecedent)
- Missing success criteria
- Unbounded scope
- Hallucination-inducing patterns

OUTPUT: Same JSON format with dimension="ambiguity_risk"
"""
)

# Response Quality Judges
Task(
subagent_type="codebase-analyzer",
description="Factual correctness judge",
prompt="""Evaluate RESPONSE for factual correctness.

Check:
- Are claims verifiable?
- Do code snippets work?
- Are technical details accurate?
- Are sources cited and reliable?

OUTPUT: JSON with dimension="factual_correctness"
"""
)

Task(
subagent_type="web-search-researcher",
description="Grounding verification judge",
prompt="""Verify claims against external sources.

For each major claim:
- Search for supporting evidence
- Rate evidence strength (strong/moderate/weak)
- Flag unsupported claims

OUTPUT: JSON with dimension="evidence_grounding"
"""
)
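
Outside the agent framework, the parallel fan-out pattern above can be sketched with a thread pool. The judge functions here are stand-ins; in the real command each one is a `Task()` dispatch:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in judge functions; each returns the mandatory verdict shape.
def clarity_judge(prompt: str) -> dict:
    return {"dimension": "clarity_specificity", "score": 4.0}

def ambiguity_judge(prompt: str) -> dict:
    return {"dimension": "ambiguity_risk", "score": 3.5}

def run_panel(prompt: str) -> list[dict]:
    """Fan all judges out concurrently and collect their verdicts."""
    judges = [clarity_judge, ambiguity_judge]
    with ThreadPoolExecutor(max_workers=len(judges)) as pool:
        futures = [pool.submit(judge, prompt) for judge in judges]
        return [f.result() for f in futures]
```

Running judges concurrently matters because each judge is an independent model call; serializing them would multiply evaluation latency by the panel size.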

Grading Rubrics

Prompt Quality Rubric (prompt_quality_rubric_v1)

| Dimension | Weight | 5 (Excellent) | 3 (Adequate) | 1 (Failing) |
|---|---|---|---|---|
| Clarity & Specificity | 20% | Unambiguous, exact requirements | General requirements, some gaps | Vague, multiple interpretations |
| Completeness | 20% | Full context, all constraints | Most context present | Critical context missing |
| Constraint Quality | 20% | All safety/formatting explicit | Most constraints stated | Constraints unclear or missing |
| Domain Alignment | 20% | Perfect domain fit | Reasonable fit | Poor domain alignment |
| Uncertainty Handling | 20% | Clear uncertainty instructions | Implicit uncertainty handling | No uncertainty guidance |

Response Quality Rubric (response_quality_rubric_v1)

| Dimension | Weight | 5 (Excellent) | 3 (Adequate) | 1 (Failing) |
|---|---|---|---|---|
| Factual Correctness | 25% | 100% verifiable claims | 70%+ verifiable | <50% verifiable |
| Coverage | 20% | All requirements addressed | Main points covered | Major gaps |
| Reasoning Quality | 20% | Clear step-by-step logic | Mostly logical | Inconsistent reasoning |
| Uncertainty Expression | 20% | Calibrated confidence | Some uncertainty noted | Overconfident assertions |
| Safety & Compliance | 15% | All constraints respected | Minor violations | Major violations |

Confidence Calibration

Judges MUST provide calibrated confidence scores:

def calibrated_confidence(
    self_certainty: float,        # How sure the judge is
    evidence_quality: float,      # Quality of supporting evidence
    ground_truth_available: bool  # Is a reference answer available?
) -> float:
    """Calculate a calibrated confidence score."""
    base_confidence = self_certainty * 0.5

    if ground_truth_available:
        return min(base_confidence + 0.4, 0.95)
    elif evidence_quality > 0.7:
        return min(base_confidence + 0.3, 0.85)
    else:
        return min(base_confidence + 0.1, 0.60)

# Examples:
#   calibrated_confidence(0.9, 0.5, False) -> 0.55 (low band)
#   calibrated_confidence(0.9, 0.5, True)  -> 0.85 (high band)

Confidence Interpretation:

| Range | Meaning |
|---|---|
| 0.85-1.0 | High confidence - strong evidence, reference available |
| 0.65-0.84 | Moderate confidence - good evidence, some uncertainty |
| 0.40-0.64 | Low confidence - limited evidence, significant uncertainty |
| 0.0-0.39 | Very low confidence - speculative, state clearly |
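
The interpretation table maps directly onto a small lookup helper; a sketch (the function name is illustrative):

```python
def confidence_band(confidence: float) -> str:
    """Map a calibrated confidence score to its interpretation band."""
    if confidence >= 0.85:
        return "high"
    if confidence >= 0.65:
        return "moderate"
    if confidence >= 0.40:
        return "low"
    return "very_low"
```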

Bias Detection

Judges MUST detect and report their own biases:

  • Verbosity Bias - Rewarding longer responses regardless of quality
  • Position Bias - Preferring first/last options in lists
  • Self-Enhancement Bias - Favoring outputs similar to judge's style
  • Recency Bias - Overweighting recent information
  • Confirmation Bias - Confirming pre-existing beliefs
detected_biases = []

if response_length > 1000 and score > 4:
    detected_biases.append("verbosity_bias_possible")

if score_variance < 0.5:  # near-identical scores across all responses
    detected_biases.append("undifferentiated_scoring")

Output Format

{
"prompt_evaluation": {
"scores_by_dimension": {
"clarity_specificity": {
"score": 4.0,
"confidence": 0.82,
"rationale": "Prompt clearly defines output format and constraints...",
"detected_biases": [],
"specific_issues": ["Term 'high-quality' undefined"]
},
"ambiguity_risk": {
"score": 3.5,
"confidence": 0.75,
"rationale": "Some vague quantifiers present...",
"detected_biases": [],
"specific_issues": ["'various' sources - how many?"]
}
},
"overall_prompt_score": 3.8,
"overall_prompt_confidence": 0.78,
"key_prompt_issues": [
"Undefined success criteria",
"No explicit uncertainty handling instructions"
],
"improvement_suggestions": [
"Add specific success metrics",
"Include instruction to express confidence levels"
]
},
"response_evaluation": [
{
"agent_id": "analyst_1",
"scores_by_dimension": {
"factual_correctness": {
"score": 4.5,
"confidence": 0.88,
"rationale": "Claims verified against multiple sources...",
"detected_biases": []
},
"reasoning_quality": {
"score": 4.0,
"confidence": 0.80,
"rationale": "Clear step-by-step logic presented...",
"detected_biases": []
}
},
"overall_response_score": 4.2,
"overall_response_confidence": 0.83,
"strengths": [
"Strong evidence grounding",
"Clear reasoning chains"
],
"weaknesses": [
"Some claims lack source citations"
],
"correctness_judgement": {
"label": "mostly_correct",
"justification": "Technical content accurate, minor gaps in coverage"
}
}
],
"meta_evaluation": {
"judge_consensus": 0.85,
"score_variance": 0.3,
"potential_blind_spots": [
"No security specialist judge used"
]
}
}
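
The `meta_evaluation` fields can be derived from the collected judge scores. The exact formulas are not fixed above, so this is one plausible construction, with consensus normalized against the score spread on the 0-5 scale:

```python
from statistics import pvariance

def meta_stats(judge_scores: list[float], scale_max: float = 5.0) -> dict:
    """Score variance plus a consensus proxy: 1.0 when all judges
    agree, shrinking toward 0.0 as their scores spread apart."""
    spread = max(judge_scores) - min(judge_scores)
    return {
        "score_variance": round(pvariance(judge_scores), 2),
        "judge_consensus": round(1.0 - spread / scale_max, 2),
    }
```

For example, judges scoring 4.0, 4.5, and 4.0 yield a consensus of 0.9, matching the low variance a healthy panel should show.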

Correctness Determination

Final correctness labels:

| Label | Score Range | Meaning |
|---|---|---|
| mostly_correct | 4.0-5.0 | High accuracy, minor issues |
| mixed | 2.5-3.9 | Significant correct and incorrect elements |
| mostly_incorrect | 1.0-2.4 | Majority of content incorrect |
| unsafe | Any | Safety/compliance violations detected |
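
The label table reduces to a hypothetical helper like the following, where the unsafe flag overrides the numeric score as the table specifies:

```python
def correctness_label(overall_score: float, unsafe: bool = False) -> str:
    """Map an overall 0-5 response score to its correctness label."""
    if unsafe:  # safety/compliance violations override the score
        return "unsafe"
    if overall_score >= 4.0:
        return "mostly_correct"
    if overall_score >= 2.5:
        return "mixed"
    return "mostly_incorrect"
```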

Options

| Option | Description |
|---|---|
| --prompt [path] | Path to system prompt to evaluate |
| --responses [path] | Path to agent outputs JSON |
| --context [domain] | Evaluation context (e.g., fintech, healthcare) |
| --scale [0-5/0-10/0-100] | Scoring scale (default: 0-5) |
| --target [id] | Evaluate specific MoE analysis output |
| --with-meta-judge | Add meta-evaluation of judges themselves |
| --with-improvements | Generate detailed improvement recommendations |
| --verbose | Include full rationales in output |

Integration with /moe-analyze

# Run analysis then judge the results
/moe-analyze "Review DMS security architecture" --output analysis.json
/moe-judge --target analysis.json --with-improvements

Action Policy

<default_behavior> This command EXECUTES evaluation and PRODUCES judgements:

  • Dispatches specialized judge agents in parallel
  • Calculates calibrated scores with confidence
  • Detects and reports judge biases
  • Generates correctness determinations
  • Provides actionable improvement suggestions

User receives comprehensive quality assessment with clear confidence levels. </default_behavior>

After evaluation, verify:

  • All dimensions scored with confidence
  • Biases explicitly checked and reported
  • Consensus level calculated across judges
  • Correctness label justified
  • Improvement suggestions actionable

Related Commands:

  • /moe-analyze - Multi-agent research with certainty scoring
  • /ai-review - Code-focused review
  • /full-review - Comprehensive code review

Related Skills:

  • evaluation-framework - LLM-as-judge patterns
  • uncertainty-quantification - Calibration methods

Research References:

  • LLM-Rubric (ACL 2024) - Multi-dimensional calibrated evaluation
  • G-Eval - Chain-of-thought evaluation
  • VOCAL - Verbal uncertainty calibration

Success Output

When moe-judge completes:

✅ COMMAND COMPLETE: /moe-judge
Prompt Score: N/5 (N%)
Response Score: N/5 (N%)
Verdict: <label>
Consensus: N%
Improvements: N suggestions

Completion Checklist

Before marking complete:

  • Judges dispatched
  • Dimensions scored
  • Biases checked
  • Verdict determined
  • Improvements listed

Failure Indicators

This command has FAILED if:

  • ❌ No prompt or responses provided
  • ❌ Judges not responding
  • ❌ No scores calculated
  • ❌ Missing verdict

When NOT to Use

Do NOT use when:

  • Simple code review (use /ai-review)
  • Single-agent output
  • No rubric needed

Anti-Patterns (Avoid)

| Anti-Pattern | Problem | Solution |
|---|---|---|
| Skip prompt eval | Prompt-level issues go undetected | Always evaluate prompts |
| Ignore biases | Skewed scores | Check detected biases |
| Skip meta-judge | Judge errors go unnoticed | Use --with-meta-judge |

Principles

This command embodies:

  • #9 Based on Facts - Evidence-based scoring
  • #6 Clear, Understandable - Clear rubrics
  • #3 Complete Execution - Full evaluation

Full Standard: CODITECT-STANDARD-AUTOMATION.md

Version: 1.0.0 Last Updated: 2025-12-19