/moe-judges - Mixture of Expert Judges

Assembles a panel of specialized judges to evaluate, critique, and score output from multiple expert perspectives.

System Prompt

⚠️ EXECUTION DIRECTIVE: When the user invokes this command, you MUST:

  1. IMMEDIATELY execute - no questions, no explanations first
  2. ALWAYS show full output from script/tool execution
  3. ALWAYS provide summary after execution completes

DO NOT:

  • Say "I don't need to take action" - you ALWAYS execute when invoked
  • Ask for confirmation unless requires_confirmation: true in frontmatter
  • Skip execution even if it seems redundant - run it anyway

The user invoking the command IS the confirmation.


Usage

/moe-judges <what-to-evaluate>

# Project-scoped evaluation (ADR-159)
/moe-judges <what-to-evaluate> --project PILOT

ADR-159 Project Scoping: When --project is provided (or auto-detected from $CODITECT_PROJECT), rubrics are scoped to the project's track configuration and evaluation results are attributed to the project.

How It Works

When you run /moe-judges, I will:

  1. Analyze the Deliverable - Understand what's being evaluated
  2. Select Judge Panel - Choose 3-5 judges with relevant expertise
  3. Define Evaluation Criteria - Establish scoring dimensions
  4. Execute Reviews - Each judge evaluates independently (multi-model)
  5. Synthesize Verdicts - Combine scores and feedback
  6. Provide Recommendations - Actionable improvements

Multi-Model Judge Panel (H.3.5)

The judge panel supports multiple LLM providers for diverse, independent evaluations:

Supported Providers

| Provider | Models | Use Case |
|----------|--------|----------|
| Anthropic | claude-opus-4.5, claude-sonnet-4, claude-haiku-4.5 | Deep reasoning, nuanced evaluation |
| OpenAI | gpt-4o, gpt-4o-mini, gpt-4-turbo | General-purpose analysis |
| DeepSeek | deepseek-v3, deepseek-chat | Cost-effective analysis |
| Alibaba | qwen2.5-72b, qwen-max | Multilingual evaluation |
| Meta | llama-3.3-70b, llama-3.1-405b | Open-source perspective |
| Google | gemini-2.0-flash, gemini-1.5-pro | Multimodal capability |

Model Diversity Requirements

To ensure independent perspectives, the judge panel must meet:

| Requirement | Threshold | Purpose |
|-------------|-----------|---------|
| Min Model Families | 3+ | Prevent single-vendor bias |
| Max Single Model Weight | 40% | Balanced influence |
| Min Judges | 3 | Consensus reliability |
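
The diversity thresholds can be enforced with a small validation pass before the panel runs. The sketch below is illustrative: the model-to-family mapping and the function name are assumptions, not the shipped implementation.

```python
from collections import Counter

# Hypothetical model-to-family mapping (assumption for illustration).
MODEL_FAMILIES = {
    "claude-sonnet-4": "anthropic",
    "gpt-4o": "openai",
    "deepseek-v3": "deepseek",
}

def check_diversity(judge_models, min_families=3, max_single_weight=0.40):
    """Return a list of violated diversity requirements (empty = panel OK)."""
    violations = []
    if len(judge_models) < 3:
        violations.append("min_judges")
    families = {MODEL_FAMILIES.get(m, m) for m in judge_models}
    if len(families) < min_families:
        violations.append("min_model_families")
    counts = Counter(judge_models)
    if counts and max(counts.values()) / len(judge_models) > max_single_weight:
        violations.append("max_single_model_weight")
    return violations
```

A panel of three judges on three different families passes; three judges all on `gpt-4o` violates both the family minimum and the single-model weight cap.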

Automatic Fallback

If a judge's primary model fails, the system automatically:

  1. Retries with exponential backoff (up to 2 retries)
  2. Falls back to backup model if configured
  3. Records all attempts in provenance chain
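
The retry-then-fallback loop above can be sketched as follows. This is a minimal illustration, assuming a `call_model` callable that wraps the real provider API; the function and field names are placeholders, not the actual implementation.

```python
import time

def evaluate_with_fallback(call_model, primary, backup=None,
                           max_retries=2, base_delay=1.0):
    """Call the primary model with retries, then the backup; record every attempt."""
    provenance = []  # every attempt lands in the provenance chain
    models = [primary] + ([backup] if backup else [])
    for model in models:
        for attempt in range(max_retries + 1):
            try:
                result = call_model(model)
                provenance.append({"model": model, "attempt": attempt, "ok": True})
                return result, provenance
            except Exception as exc:
                provenance.append({"model": model, "attempt": attempt,
                                   "ok": False, "error": str(exc)})
                if attempt < max_retries:
                    time.sleep(base_delay * 2 ** attempt)  # exponential backoff
    raise RuntimeError(f"All models failed: {provenance}")
```

Both successes and failures are appended to the provenance list, so the chain shows exactly which model answered and how many attempts it took.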

Provenance Tracking

Each judge decision includes:

  • model_used: Which LLM rendered the decision
  • token_usage: Input + output tokens consumed
  • latency_ms: Response time
  • timestamp: When evaluation was recorded
  • dimension_scores: Per-dimension scoring

Configuration

Model routing is configured in config/judge-model-routing.json:

{
  "routing": {
    "technical_architect": {
      "primary_model": "claude-sonnet-4",
      "backup_model": "gpt-4o"
    },
    "security_auditor": {
      "primary_model": "claude-opus-4.5",
      "backup_model": "deepseek-v3"
    }
  },
  "diversity_requirements": {
    "min_model_families": 3,
    "max_single_model_weight": 0.40
  }
}
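
A loader for this file might look like the following. The path matches the one named above, but the schema check is a sketch, not the real validator.

```python
import json

def load_routing(path="config/judge-model-routing.json"):
    """Load judge-model routing config and sanity-check each entry."""
    with open(path) as f:
        cfg = json.load(f)
    for judge, models in cfg["routing"].items():
        # every judge needs at least a primary model; backup is optional
        assert "primary_model" in models, f"{judge} missing primary_model"
    return cfg
```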

Two-Stage Review Mode (ADR-076)

The two-stage review pattern separates spec compliance from code quality, preventing "well-written wrong code" - implementations that are technically excellent but don't meet requirements.

Usage

/moe-judges --two-stage <implementation-file> --spec <spec-file>

The Two Stages

┌─────────────────────────────────────────────────────────────────┐
│ IMPLEMENTATION │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐│
│ │ STAGE 1: SPEC COMPLIANCE (Blocking) ││
│ │ • Does it match requirements? ││
│ │ • Any missing features? ││
│ │ • Any scope creep? ││
│ │ ││
│ │ Judges: domain_expert, compliance_auditor, security_analyst ││
│ │ Verdict: PASS or FAIL ││
│ │ ││
│ │ If FAIL → Return to implementer (Stage 2 BLOCKED) ││
│ └─────────────────────────────────────────────────────────────┘│
│ │ │
│ ▼ (Only if PASS) │
│ ┌─────────────────────────────────────────────────────────────┐│
│ │ STAGE 2: CODE QUALITY (Non-blocking) ││
│ │ • Is it well-architected? ││
│ │ • Adequate test coverage? ││
│ │ • Performance acceptable? ││
│ │ ││
│ │ Judges: technical_architect, qa_evaluator, ai_ethics_reviewer│
│ │ Verdict: APPROVE or REQUEST_CHANGES ││
│ │ ││
│ │ If REQUEST_CHANGES → Return with feedback ││
│ └─────────────────────────────────────────────────────────────┘│
│ │ │
│ ▼ │
│ APPROVED (ready to merge) │
└─────────────────────────────────────────────────────────────────┘

Stage 1: Spec Compliance

Focus: "Is it the RIGHT thing?"

| Judge | Model | Reviews |
|-------|-------|---------|
| domain_expert | claude-sonnet-4-5 | Requirement coverage, business logic, edge cases |
| compliance_auditor | gpt-4o | Regulatory compliance, standards adherence |
| security_analyst | deepseek-v3 | Security requirements, auth/authz, data protection |

Verdicts:

  • PASS: All requirements satisfied → Proceed to Stage 2
  • FAIL: Requirements not met → Return to implementer (Stage 2 blocked)

Stage 2: Code Quality

Focus: "Is it BUILT well?"

Prerequisite: Stage 1 must PASS

| Judge | Model | Reviews |
|-------|-------|---------|
| technical_architect | claude-sonnet-4-5 | Architecture patterns, performance, maintainability |
| qa_evaluator | claude-haiku-4-5 | Test coverage (>80%), test quality, integration tests |
| ai_ethics_reviewer | gpt-4o | AI safety, bias potential, transparency, human oversight |

Verdicts:

  • APPROVE: Quality standards met → Ready to merge
  • REQUEST_CHANGES: Improvements needed → Return with feedback

Final Verdicts

| Final Verdict | Meaning |
|---------------|---------|
| APPROVED | Stage 1 PASS + Stage 2 APPROVE |
| SPEC_VIOLATION | Stage 1 FAIL (Stage 2 blocked) |
| QUALITY_IMPROVEMENTS_NEEDED | Stage 1 PASS + Stage 2 REQUEST_CHANGES |
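
The mapping from stage outcomes to final verdicts is mechanical and can be written directly; this is a sketch of that logic, not the actual implementation.

```python
def final_verdict(stage1_pass: bool, stage2_approve=None) -> str:
    """Combine the two stage outcomes into a final verdict.

    stage2_approve is None when Stage 2 was blocked by a Stage 1 failure.
    """
    if not stage1_pass:
        return "SPEC_VIOLATION"              # Stage 2 blocked
    if stage2_approve:
        return "APPROVED"                    # both stages clean
    return "QUALITY_IMPROVEMENTS_NEEDED"     # spec OK, quality feedback pending
```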

Majority Voting

Both stages use 2/3 majority voting (threshold: 0.66):

  • 3 judges × 0.66 = 1.98 → requires 2+ agreeing judges to pass
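
The threshold arithmetic above can be expressed as a small function; with 3 judges and a 0.66 threshold, `ceil(1.98) = 2` agreeing judges are required. The function name and labels below are illustrative.

```python
import math

def stage_verdict(votes, threshold=0.66, pass_label="PASS", fail_label="FAIL"):
    """Majority vote: passes when agreeing judges >= ceil(n * threshold)."""
    needed = math.ceil(len(votes) * threshold)   # 3 judges -> ceil(1.98) = 2
    agree = sum(1 for v in votes if v == pass_label)
    return pass_label if agree >= needed else fail_label
```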

Example: Two-Stage Review

/moe-judges --two-stage src/auth/login.py --spec specs/AUTH-001.md

Stage 1 Output:

┌─────────────────────────────────────────────────────────────────┐
│ STAGE 1: SPEC COMPLIANCE │
├─────────────────────────────────────────────────────────────────┤
│ Verdict: PASS (3/3 judges) │
├─────────────────────────────────────────────────────────────────┤
│ domain_expert: PASS - All requirements covered │
│ compliance_auditor: PASS - HIPAA requirements met │
│ security_analyst: PASS - Auth implementation secure │
├─────────────────────────────────────────────────────────────────┤
│ → Proceeding to Stage 2: Code Quality │
└─────────────────────────────────────────────────────────────────┘

Stage 2 Output:

┌─────────────────────────────────────────────────────────────────┐
│ STAGE 2: CODE QUALITY │
├─────────────────────────────────────────────────────────────────┤
│ Verdict: REQUEST_CHANGES (1/3 approve) │
├─────────────────────────────────────────────────────────────────┤
│ technical_architect: APPROVE - Clean architecture │
│ qa_evaluator: REQUEST_CHANGES - Test coverage 68% │
│ ai_ethics_reviewer: REQUEST_CHANGES - Add rate limiting logs │
├─────────────────────────────────────────────────────────────────┤
│ REQUIRED ACTIONS: │
│ 1. Increase test coverage to >80% │
│ 2. Add rate limiting audit logs │
└─────────────────────────────────────────────────────────────────┘

FINAL VERDICT: QUALITY_IMPROVEMENTS_NEEDED

Programmatic Usage

from scripts.core.two_stage_review import TwoStageReviewer

reviewer = TwoStageReviewer(orchestrator)
result = await reviewer.review(
    implementation="...",
    spec="...",
    plan_task_id="A.1.1"
)

if result.final_verdict == "SPEC_VIOLATION":
    # Address spec issues first
    print(result.stage1.issues)
elif result.final_verdict == "QUALITY_IMPROVEMENTS_NEEDED":
    # Address quality feedback
    print(result.stage2.issues)
else:
    # APPROVED - ready to merge
    pass

Configuration

Judge assignments for two-stage review are configured in config/review-judge-assignment.json.


Judge Roles

Quality Assurance Judges

| Judge Role | Agent | Evaluates |
|------------|-------|-----------|
| QA Lead | codi-qa-specialist | Overall quality, test coverage |
| Code Quality | code-reviewer | Code standards, best practices |
| Component QA | component-qa-reviewer | Component compliance |
| Comprehensive | comprehensive-review | Full-spectrum review |

Architecture Judges

| Judge Role | Agent | Evaluates |
|------------|-------|-----------|
| Architect | architect-review | Design patterns, scalability |
| Senior Review | senior-architect | Strategic decisions, trade-offs |
| ADR Compliance | adr-compliance-specialist | Architecture decision compliance |

Security Judges

| Judge Role | Agent | Evaluates |
|------------|-------|-----------|
| Security Auditor | security-auditor | Vulnerabilities, risks |
| Security Specialist | security-specialist | Security best practices |
| Compliance | compliance-checker-agent | Regulatory compliance |

Documentation Judges

| Judge Role | Agent | Evaluates |
|------------|-------|-----------|
| Doc Quality | documentation-quality-agent | Clarity, completeness |
| QA Reviewer | qa-reviewer | Standards compliance |

Business Judges

| Judge Role | Agent | Evaluates |
|------------|-------|-----------|
| Business Analyst | business-intelligence-analyst | Business viability |
| Market Analyst | competitive-market-analyst | Market fit |
| VC Perspective | venture-capital-business-analyst | Investment worthiness |

Examples

Example 1: Judge API Design

/moe-judges evaluate the API design for production readiness

Judge Panel Assembled:

| Judge | Perspective | Weight |
|-------|-------------|--------|
| architect-review | Architecture quality | 25% |
| backend-api-security | Security posture | 25% |
| code-reviewer | Code quality | 20% |
| codi-qa-specialist | Test coverage | 15% |
| documentation-quality-agent | API docs quality | 15% |

Evaluation Criteria:

  • RESTful design compliance (1-10)
  • Security (authentication, authorization, input validation) (1-10)
  • Error handling and resilience (1-10)
  • Documentation completeness (1-10)
  • Test coverage (1-10)

Verdict Format:

┌─────────────────────────────────────────────────────┐
│ API DESIGN EVALUATION │
├─────────────────────────────────────────────────────┤
│ Overall Score: 7.8/10 │
│ Verdict: APPROVED WITH RECOMMENDATIONS │
├─────────────────────────────────────────────────────┤
│ Architecture: 8/10 - Clean RESTful design │
│ Security: 7/10 - Add rate limiting │
│ Code Quality: 8/10 - Well-structured │
│ Test Coverage: 7/10 - Add integration tests │
│ Documentation: 9/10 - Comprehensive │
├─────────────────────────────────────────────────────┤
│ REQUIRED ACTIONS: │
│ 1. Implement rate limiting on all endpoints │
│ 2. Add integration tests for auth flow │
│ 3. Add request validation middleware │
└─────────────────────────────────────────────────────┘
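
The 7.8/10 overall in the box above is the weight-averaged dimension score. A minimal sketch of that synthesis, using the panel weights from this example (dimension keys are illustrative):

```python
def weighted_score(scores, weights):
    """Weighted overall score on a 1-10 scale; weights must sum to 1.0."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(scores[dim] * weights[dim] for dim in scores)

# Reproducing the example: 8*0.25 + 7*0.25 + 8*0.20 + 7*0.15 + 9*0.15
scores = {"architecture": 8, "security": 7, "code_quality": 8,
          "test_coverage": 7, "documentation": 9}
weights = {"architecture": 0.25, "security": 0.25, "code_quality": 0.20,
           "test_coverage": 0.15, "documentation": 0.15}
overall = weighted_score(scores, weights)  # ~7.75, displayed as 7.8
```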

Example 2: Judge Business Plan

/moe-judges evaluate business plan for investor readiness

Judge Panel Assembled:

| Judge | Perspective | Weight |
|-------|-------------|--------|
| venture-capital-business-analyst | Investment lens | 30% |
| business-intelligence-analyst | Financial rigor | 25% |
| competitive-market-analyst | Market positioning | 25% |
| market-researcher | Market validation | 20% |

Evaluation Criteria:

  • Market opportunity (TAM/SAM/SOM clarity) (1-10)
  • Competitive differentiation (1-10)
  • Financial projections (realistic, detailed) (1-10)
  • Go-to-market strategy (1-10)
  • Team and execution capability (1-10)

Example 3: Judge Security Implementation

/moe-judges evaluate authentication system for production

Judge Panel Assembled:

| Judge | Perspective | Weight |
|-------|-------------|--------|
| security-specialist | Overall security | 30% |
| penetration-testing-agent | Vulnerability assessment | 25% |
| backend-api-security | API security | 25% |
| compliance-checker-agent | Compliance | 20% |

Example 4: Judge Documentation

/moe-judges evaluate developer documentation for completeness

Judge Panel Assembled:

| Judge | Perspective | Weight |
|-------|-------------|--------|
| documentation-quality-agent | Quality standards | 35% |
| qa-reviewer | Standards compliance | 25% |
| codi-documentation-writer | Best practices | 25% |
| documentation-librarian | Organization | 15% |

Verdict Categories

| Score | Verdict | Meaning |
|-------|---------|---------|
| 9-10 | EXCELLENT | Production ready, exceeds standards |
| 7-8 | APPROVED | Ready with minor improvements |
| 5-6 | CONDITIONAL | Needs improvements before approval |
| 3-4 | REVISION REQUIRED | Significant issues to address |
| 1-2 | REJECTED | Does not meet minimum standards |
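
The score bands map to verdict labels as follows; this is a direct sketch of the table, with band edges assumed to round down to the lower verdict (an assumption, since the table lists only integer ranges).

```python
def verdict_label(score: float) -> str:
    """Map a 1-10 overall score to its verdict band."""
    if score >= 9:
        return "EXCELLENT"
    if score >= 7:
        return "APPROVED"
    if score >= 5:
        return "CONDITIONAL"
    if score >= 3:
        return "REVISION REQUIRED"
    return "REJECTED"
```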

Execution

# Assemble judges and evaluate
/agent orchestrator "Coordinate judge panel evaluation:
1. architect-review evaluates design patterns
2. security-specialist evaluates security posture
3. code-reviewer evaluates code quality
4. codi-qa-specialist evaluates test coverage

Each judge provides:
- Score (1-10)
- Strengths identified
- Issues found
- Required actions

Synthesize into unified verdict with weighted scores."

Combining with /moe-agents

Use experts to create, judges to evaluate:

# Create with experts
/moe-agents build authentication system

# Evaluate with judges
/moe-judges evaluate authentication system for production

# Or use combined workflow
/moe-workflow build and evaluate authentication system

Related commands:

  • /moe-agents <task> - Assemble expert team
  • /moe-workflow <task> - Combined expert + judge workflow
  • /which <task> - Quick agent recommendation

Action Policy

<default_behavior> This command assembles and executes a judge panel. It provides:

  • Judge panel composition
  • Evaluation criteria
  • Weighted scores
  • Synthesized verdict

User receives actionable feedback. </default_behavior>

After evaluation, verify:

  • All judges scored
  • Weights applied
  • Verdict determined
  • Actions listed

Success Output

When moe-judges completes:

✅ COMMAND COMPLETE: /moe-judges
Target: <what-was-evaluated>
Judges: N panelists
Score: N/10
Verdict: <verdict-label>
Actions: N required

Completion Checklist

Before marking complete:

  • Deliverable analyzed
  • Judges selected
  • Criteria defined
  • Reviews executed
  • Verdict synthesized

Failure Indicators

This command has FAILED if:

  • ❌ Nothing to evaluate
  • ❌ No judges selected
  • ❌ No scores returned
  • ❌ Missing verdict

When NOT to Use

Do NOT use when:

  • Creating deliverable (use /moe-agents)
  • Single review sufficient
  • No criteria needed

Anti-Patterns (Avoid)

Anti-PatternProblemSolution
Wrong judgesPoor evaluationMatch judges to domain
Equal weightsMissing prioritiesWeight by importance
Skip actionsNo improvementAlways list actions

Principles

This command embodies:

  • #9 Based on Facts - Evidence-based scoring
  • #6 Clear, Understandable - Clear verdicts
  • #3 Complete Execution - Full evaluation

Full Standard: CODITECT-STANDARD-AUTOMATION.md


Version: 1.0.0 | Created: 2025-12-22 | Author: CODITECT Team