/moe-judges - Mixture of Expert Judges
Assembles a panel of specialized judges to evaluate, critique, and score output from multiple expert perspectives.
System Prompt
⚠️ EXECUTION DIRECTIVE: When the user invokes this command, you MUST:
- IMMEDIATELY execute - no questions, no explanations first
- ALWAYS show full output from script/tool execution
- ALWAYS provide summary after execution completes
DO NOT:
- Say "I don't need to take action" - you ALWAYS execute when invoked
- Ask for confirmation unless requires_confirmation: true is set in frontmatter
- Skip execution even if it seems redundant - run it anyway
The user invoking the command IS the confirmation.
Usage
/moe-judges <what-to-evaluate>
# Project-scoped evaluation (ADR-159)
/moe-judges <what-to-evaluate> --project PILOT
ADR-159 Project Scoping: When --project is provided (or auto-detected from $CODITECT_PROJECT), rubrics are scoped to the project's track configuration and evaluation results are attributed to the project.
How It Works
When you run /moe-judges, I will:
- Analyze the Deliverable - Understand what's being evaluated
- Select Judge Panel - Choose 3-5 judges with relevant expertise
- Define Evaluation Criteria - Establish scoring dimensions
- Execute Reviews - Each judge evaluates independently (multi-model)
- Synthesize Verdicts - Combine scores and feedback
- Provide Recommendations - Actionable improvements
Multi-Model Judge Panel (H.3.5)
The judge panel supports multiple LLM providers for diverse, independent evaluations:
Supported Providers
| Provider | Models | Use Case |
|---|---|---|
| Anthropic | claude-opus-4.5, claude-sonnet-4, claude-haiku-4.5 | Deep reasoning, nuanced evaluation |
| OpenAI | gpt-4o, gpt-4o-mini, gpt-4-turbo | General-purpose analysis |
| DeepSeek | deepseek-v3, deepseek-chat | Cost-effective analysis |
| Alibaba | qwen2.5-72b, qwen-max | Multilingual evaluation |
| Meta | llama-3.3-70b, llama-3.1-405b | Open-source perspective |
| Google | gemini-2.0-flash, gemini-1.5-pro | Multimodal capability |
Model Diversity Requirements
To ensure independent perspectives, the judge panel must meet:
| Requirement | Threshold | Purpose |
|---|---|---|
| Min Model Families | 3+ | Prevent single-vendor bias |
| Max Single Model Weight | 40% | Balanced influence |
| Min Judges | 3 | Consensus reliability |
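These thresholds can be checked mechanically. A minimal sketch, assuming a panel is a list of (family, model, weight) tuples; the names and shape are illustrative, not the real panel API:

```python
# Sketch of the diversity checks above; panel entries are illustrative.
def check_diversity(panel, min_families=3, max_weight=0.40, min_judges=3):
    """panel: list of (model_family, model, weight) tuples."""
    if len(panel) < min_judges:
        return False                        # Min Judges
    families = {family for family, _, _ in panel}
    if len(families) < min_families:
        return False                        # Min Model Families
    # Max Single Model Weight: no single model may dominate the verdict
    return all(weight <= max_weight for _, _, weight in panel)

panel = [("anthropic", "claude-sonnet-4", 0.35),
         ("openai", "gpt-4o", 0.35),
         ("deepseek", "deepseek-v3", 0.30)]
print(check_diversity(panel))  # True
```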
Automatic Fallback
If a judge's primary model fails, the system automatically:
- Retries with exponential backoff (up to 2 retries)
- Falls back to backup model if configured
- Records all attempts in provenance chain
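The fallback flow above can be sketched as follows; `call_model`, the provenance field names, and the delay base are assumptions, not the real API:

```python
import time

def evaluate_with_fallback(call_model, primary, backup=None,
                           max_retries=2, base_delay=1.0):
    """Retry the primary model with exponential backoff, then try the
    backup; every attempt is recorded in a provenance list."""
    provenance = []
    for model in filter(None, [primary, backup]):
        for attempt in range(max_retries + 1):
            try:
                result = call_model(model)
                provenance.append({"model": model, "attempt": attempt, "ok": True})
                return result, provenance
            except Exception as exc:
                provenance.append({"model": model, "attempt": attempt,
                                   "ok": False, "error": str(exc)})
                time.sleep(base_delay * 2 ** attempt)  # exponential backoff
    raise RuntimeError("all models failed", provenance)
```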
Provenance Tracking
Each judge decision includes:
- model_used: Which LLM rendered the decision
- token_usage: Input + output tokens consumed
- latency_ms: Response time
- timestamp: When evaluation was recorded
- dimension_scores: Per-dimension scoring
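As an illustration, a provenance record with these fields might look like the following; the field names follow the list above, but the real schema may differ:

```python
from dataclasses import dataclass, field
import time

# Illustrative shape of a judge provenance record (assumed, not the real schema).
@dataclass
class JudgeProvenance:
    model_used: str                 # which LLM rendered the decision
    token_usage: int                # input + output tokens consumed
    latency_ms: float               # response time
    dimension_scores: dict = field(default_factory=dict)
    timestamp: float = field(default_factory=time.time)

rec = JudgeProvenance("claude-sonnet-4", 1850, 940.2,
                      dimension_scores={"security": 7, "architecture": 8})
print(rec.model_used, rec.dimension_scores)
```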
Configuration
Model routing is configured in config/judge-model-routing.json:
{
"routing": {
"technical_architect": {
"primary_model": "claude-sonnet-4",
"backup_model": "gpt-4o"
},
"security_auditor": {
"primary_model": "claude-opus-4.5",
"backup_model": "deepseek-v3"
}
},
"diversity_requirements": {
"min_model_families": 3,
"max_single_model_weight": 0.40
}
}
Two-Stage Review Mode (ADR-076)
The two-stage review pattern separates spec compliance from code quality, preventing "well-written wrong code" - implementations that are technically excellent but don't meet requirements.
Usage
/moe-judges --two-stage <implementation-file> --spec <spec-file>
The Two Stages
┌─────────────────────────────────────────────────────────────────┐
│ IMPLEMENTATION │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐│
│ │ STAGE 1: SPEC COMPLIANCE (Blocking) ││
│ │ • Does it match requirements? ││
│ │ • Any missing features? ││
│ │ • Any scope creep? ││
│ │ ││
│ │ Judges: domain_expert, compliance_auditor, security_analyst ││
│ │ Verdict: PASS or FAIL ││
│ │ ││
│ │ If FAIL → Return to implementer (Stage 2 BLOCKED) ││
│ └─────────────────────────────────────────────────────────────┘│
│ │ │
│ ▼ (Only if PASS) │
│ ┌─────────────────────────────────────────────────────────────┐│
│ │ STAGE 2: CODE QUALITY (Non-blocking) ││
│ │ • Is it well-architected? ││
│ │ • Adequate test coverage? ││
│ │ • Performance acceptable? ││
│ │ ││
│ │ Judges: technical_architect, qa_evaluator, ai_ethics_reviewer│
│ │ Verdict: APPROVE or REQUEST_CHANGES ││
│ │ ││
│ │ If REQUEST_CHANGES → Return with feedback ││
│ └─────────────────────────────────────────────────────────────┘│
│ │ │
│ ▼ │
│ APPROVED (ready to merge) │
└─────────────────────────────────────────────────────────────────┘
Stage 1: Spec Compliance
Focus: "Is it the RIGHT thing?"
| Judge | Model | Reviews |
|---|---|---|
| domain_expert | claude-sonnet-4-5 | Requirement coverage, business logic, edge cases |
| compliance_auditor | gpt-4o | Regulatory compliance, standards adherence |
| security_analyst | deepseek-v3 | Security requirements, auth/authz, data protection |
Verdicts:
- PASS: All requirements satisfied → Proceed to Stage 2
- FAIL: Requirements not met → Return to implementer (Stage 2 blocked)
Stage 2: Code Quality
Focus: "Is it BUILT well?"
Prerequisite: Stage 1 must PASS
| Judge | Model | Reviews |
|---|---|---|
| technical_architect | claude-sonnet-4-5 | Architecture patterns, performance, maintainability |
| qa_evaluator | claude-haiku-4-5 | Test coverage (>80%), test quality, integration tests |
| ai_ethics_reviewer | gpt-4o | AI safety, bias potential, transparency, human oversight |
Verdicts:
- APPROVE: Quality standards met → Ready to merge
- REQUEST_CHANGES: Improvements needed → Return with feedback
Final Verdicts
| Final Verdict | Meaning |
|---|---|
| APPROVED | Stage 1 PASS + Stage 2 APPROVE |
| SPEC_VIOLATION | Stage 1 FAIL (Stage 2 blocked) |
| QUALITY_IMPROVEMENTS_NEEDED | Stage 1 PASS + Stage 2 REQUEST_CHANGES |
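The table above reduces to a small mapping. A sketch, assuming string verdicts from each stage:

```python
# Minimal mapping from stage outcomes to the final verdicts above.
def final_verdict(stage1, stage2=None):
    if stage1 == "FAIL":
        return "SPEC_VIOLATION"               # Stage 2 blocked
    if stage2 == "APPROVE":
        return "APPROVED"
    return "QUALITY_IMPROVEMENTS_NEEDED"      # PASS + REQUEST_CHANGES

print(final_verdict("PASS", "REQUEST_CHANGES"))  # QUALITY_IMPROVEMENTS_NEEDED
```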
Majority Voting
Both stages use 2/3 majority voting (threshold: 0.66):
- 3 judges × 0.66 = 1.98, rounded up → at least 2 judges must agree to pass
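The threshold arithmetic above can be sketched as a round-up check (assumed semantics):

```python
import math

# 2/3 majority: a stage passes when passing judges meet ceil(n * threshold).
def majority_passes(votes, threshold=0.66):
    needed = math.ceil(len(votes) * threshold)   # 3 x 0.66 = 1.98 -> 2
    return sum(1 for v in votes if v == "PASS") >= needed

print(majority_passes(["PASS", "PASS", "FAIL"]))  # True
print(majority_passes(["PASS", "FAIL", "FAIL"]))  # False
```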
Example: Two-Stage Review
/moe-judges --two-stage src/auth/login.py --spec specs/AUTH-001.md
Stage 1 Output:
┌─────────────────────────────────────────────────────────────────┐
│ STAGE 1: SPEC COMPLIANCE │
├─────────────────────────────────────────────────────────────────┤
│ Verdict: PASS (3/3 judges) │
├─────────────────────────────────────────────────────────────────┤
│ domain_expert: PASS - All requirements covered │
│ compliance_auditor: PASS - HIPAA requirements met │
│ security_analyst: PASS - Auth implementation secure │
├─────────────────────────────────────────────────────────────────┤
│ → Proceeding to Stage 2: Code Quality │
└─────────────────────────────────────────────────────────────────┘
Stage 2 Output:
┌─────────────────────────────────────────────────────────────────┐
│ STAGE 2: CODE QUALITY │
├─────────────────────────────────────────────────────────────────┤
│ Verdict: REQUEST_CHANGES (1/3 approve) │
├─────────────────────────────────────────────────────────────────┤
│ technical_architect: APPROVE - Clean architecture │
│ qa_evaluator: REQUEST_CHANGES - Test coverage 68% │
│ ai_ethics_reviewer: REQUEST_CHANGES - Add rate limiting logs │
├─────────────────────────────────────────────────────────────────┤
│ REQUIRED ACTIONS: │
│ 1. Increase test coverage to >80% │
│ 2. Add rate limiting audit logs │
└─────────────────────────────────────────────────────────────────┘
FINAL VERDICT: QUALITY_IMPROVEMENTS_NEEDED
Programmatic Usage
from scripts.core.two_stage_review import TwoStageReviewer

# `orchestrator` is the judge-panel orchestrator instance from your setup
reviewer = TwoStageReviewer(orchestrator)
result = await reviewer.review(
    implementation="...",
    spec="...",
    plan_task_id="A.1.1"
)

if result.final_verdict == "SPEC_VIOLATION":
    # Address spec issues first
    print(result.stage1.issues)
elif result.final_verdict == "QUALITY_IMPROVEMENTS_NEEDED":
    # Address quality feedback
    print(result.stage2.issues)
else:
    # APPROVED - ready to merge
    pass
Configuration
Judge assignments for two-stage review are configured in config/review-judge-assignment.json.
Related
- ADR: ADR-076
- Skill: two-stage-review
- Source: Superpowers subagent-driven-development pattern
Judge Roles
Quality Assurance Judges
| Judge Role | Agent | Evaluates |
|---|---|---|
| QA Lead | codi-qa-specialist | Overall quality, test coverage |
| Code Quality | code-reviewer | Code standards, best practices |
| Component QA | component-qa-reviewer | Component compliance |
| Comprehensive | comprehensive-review | Full-spectrum review |
Architecture Judges
| Judge Role | Agent | Evaluates |
|---|---|---|
| Architect | architect-review | Design patterns, scalability |
| Senior Review | senior-architect | Strategic decisions, trade-offs |
| ADR Compliance | adr-compliance-specialist | Architecture decision compliance |
Security Judges
| Judge Role | Agent | Evaluates |
|---|---|---|
| Security Auditor | security-auditor | Vulnerabilities, risks |
| Security Specialist | security-specialist | Security best practices |
| Compliance | compliance-checker-agent | Regulatory compliance |
Documentation Judges
| Judge Role | Agent | Evaluates |
|---|---|---|
| Doc Quality | documentation-quality-agent | Clarity, completeness |
| QA Reviewer | qa-reviewer | Standards compliance |
Business Judges
| Judge Role | Agent | Evaluates |
|---|---|---|
| Business Analyst | business-intelligence-analyst | Business viability |
| Market Analyst | competitive-market-analyst | Market fit |
| VC Perspective | venture-capital-business-analyst | Investment worthiness |
Examples
Example 1: Judge API Design
/moe-judges evaluate the API design for production readiness
Judge Panel Assembled:
| Judge | Perspective | Weight |
|---|---|---|
| architect-review | Architecture quality | 25% |
| backend-api-security | Security posture | 25% |
| code-reviewer | Code quality | 20% |
| codi-qa-specialist | Test coverage | 15% |
| documentation-quality-agent | API docs quality | 15% |
Evaluation Criteria:
- RESTful design compliance (1-10)
- Security (authentication, authorization, input validation) (1-10)
- Error handling and resilience (1-10)
- Documentation completeness (1-10)
- Test coverage (1-10)
Verdict Format:
┌─────────────────────────────────────────────────────┐
│ API DESIGN EVALUATION │
├─────────────────────────────────────────────────────┤
│ Overall Score: 7.8/10 │
│ Verdict: APPROVED WITH RECOMMENDATIONS │
├─────────────────────────────────────────────────────┤
│ Architecture: 8/10 - Clean RESTful design │
│ Security: 7/10 - Add rate limiting │
│ Code Quality: 8/10 - Well-structured │
│ Test Coverage: 7/10 - Add integration tests │
│ Documentation: 9/10 - Comprehensive │
├─────────────────────────────────────────────────────┤
│ REQUIRED ACTIONS: │
│ 1. Implement rate limiting on all endpoints │
│ 2. Add integration tests for auth flow │
│ 3. Add request validation middleware │
└─────────────────────────────────────────────────────┘
Example 2: Judge Business Plan
/moe-judges evaluate business plan for investor readiness
Judge Panel Assembled:
| Judge | Perspective | Weight |
|---|---|---|
| venture-capital-business-analyst | Investment lens | 30% |
| business-intelligence-analyst | Financial rigor | 25% |
| competitive-market-analyst | Market positioning | 25% |
| market-researcher | Market validation | 20% |
Evaluation Criteria:
- Market opportunity (TAM/SAM/SOM clarity) (1-10)
- Competitive differentiation (1-10)
- Financial projections (realistic, detailed) (1-10)
- Go-to-market strategy (1-10)
- Team and execution capability (1-10)
Example 3: Judge Security Implementation
/moe-judges evaluate authentication system for production
Judge Panel Assembled:
| Judge | Perspective | Weight |
|---|---|---|
| security-specialist | Overall security | 30% |
| penetration-testing-agent | Vulnerability assessment | 25% |
| backend-api-security | API security | 25% |
| compliance-checker-agent | Compliance | 20% |
Example 4: Judge Documentation
/moe-judges evaluate developer documentation for completeness
Judge Panel Assembled:
| Judge | Perspective | Weight |
|---|---|---|
| documentation-quality-agent | Quality standards | 35% |
| qa-reviewer | Standards compliance | 25% |
| codi-documentation-writer | Best practices | 25% |
| documentation-librarian | Organization | 15% |
Verdict Categories
| Score | Verdict | Meaning |
|---|---|---|
| 9-10 | EXCELLENT | Production ready, exceeds standards |
| 7-8 | APPROVED | Ready with minor improvements |
| 5-6 | CONDITIONAL | Needs improvements before approval |
| 3-4 | REVISION REQUIRED | Significant issues to address |
| 1-2 | REJECTED | Does not meet minimum standards |
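A minimal mapping from an overall score to these verdict bands, assuming inclusive lower bounds for each band:

```python
# Threshold mapping for the verdict categories above (band boundaries assumed).
def verdict_for(score):
    if score >= 9:
        return "EXCELLENT"
    if score >= 7:
        return "APPROVED"
    if score >= 5:
        return "CONDITIONAL"
    if score >= 3:
        return "REVISION REQUIRED"
    return "REJECTED"

print(verdict_for(7.8))  # APPROVED
```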
Execution
# Assemble judges and evaluate
/agent orchestrator "Coordinate judge panel evaluation:
1. architect-review evaluates design patterns
2. security-specialist evaluates security posture
3. code-reviewer evaluates code quality
4. codi-qa-specialist evaluates test coverage
Each judge provides:
- Score (1-10)
- Strengths identified
- Issues found
- Required actions
Synthesize into unified verdict with weighted scores."
Combining with /moe-agents
Use experts to create, judges to evaluate:
# Create with experts
/moe-agents build authentication system
# Evaluate with judges
/moe-judges evaluate authentication system for production
# Or use combined workflow
/moe-workflow build and evaluate authentication system
Related Commands
- /moe-agents <task> - Assemble expert team
- /moe-workflow <task> - Combined expert + judge workflow
- /which <task> - Quick agent recommendation
Action Policy
<default_behavior> This command assembles and executes judges. Provides:
- Judge panel composition
- Evaluation criteria
- Weighted scores
- Synthesized verdict
User receives actionable feedback. </default_behavior>
Success Output
When moe-judges completes:
✅ COMMAND COMPLETE: /moe-judges
Target: <what-was-evaluated>
Judges: N panelists
Score: N/10
Verdict: <verdict-label>
Actions: N required
Completion Checklist
Before marking complete:
- Deliverable analyzed
- Judges selected
- Criteria defined
- Reviews executed
- Verdict synthesized
Failure Indicators
This command has FAILED if:
- ❌ Nothing to evaluate
- ❌ No judges selected
- ❌ No scores returned
- ❌ Missing verdict
When NOT to Use
Do NOT use when:
- Creating deliverable (use /moe-agents)
- Single review sufficient
- No criteria needed
Anti-Patterns (Avoid)
| Anti-Pattern | Problem | Solution |
|---|---|---|
| Wrong judges | Poor evaluation | Match judges to domain |
| Equal weights | Missing priorities | Weight by importance |
| Skip actions | No improvement | Always list actions |
Principles
This command embodies:
- #9 Based on Facts - Evidence-based scoring
- #6 Clear, Understandable - Clear verdicts
- #3 Complete Execution - Full evaluation
Full Standard: CODITECT-STANDARD-AUTOMATION.md
Version: 1.0.0 Created: 2025-12-22 Author: CODITECT Team