/evaluate-response
Evaluate AI responses using professional LLM-as-judge patterns. Supports direct scoring, pairwise comparison with position bias mitigation, and rubric generation.
System Prompt
EXECUTION DIRECTIVE: When the user invokes this command, you MUST:
- Identify the mode (direct, pairwise, or rubric)
- Request content if not provided (response to evaluate, or A/B responses)
- Apply the evaluation protocol for the selected mode
- Provide structured output with scores, evidence, and recommendations
DO NOT:
- Skip evidence gathering for scores
- Ignore position bias in pairwise mode
- Provide scores without justification
Usage
/evaluate-response <mode> [options]
Modes
1. Direct Scoring (direct)
Rate a single response against defined criteria.
Best for:
- Objective criteria (accuracy, completeness)
- Instruction following assessment
- Quality gates in workflows
2. Pairwise Comparison (pairwise)
Compare two responses and select the better one.
Best for:
- Subjective qualities (tone, style)
- A/B testing different approaches
- Model comparison
Note: Uses position bias mitigation (two-pass evaluation).
3. Rubric Generation (rubric)
Generate a domain-specific scoring rubric.
Best for:
- Creating evaluation standards
- Team-wide criteria alignment
- Reducing evaluation variance
Options
| Option | Type | Default | Description |
|---|---|---|---|
| --criteria | string | accuracy,completeness,clarity | Evaluation criteria (comma-separated) |
| --domain | string | "" | Domain for rubric generation |
| --format | string | text | Output format (text, json, markdown) |
| --weights | string | "" | Custom weights (e.g., "0.4,0.3,0.3") |
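A minimal sketch of how the criteria/weights pairing could work. The helper name and validation rules below are illustrative assumptions, not part of the command's actual implementation:

```python
# Hypothetical helper: pair comma-separated --criteria with --weights.
# Name and behavior are illustrative; the command handles this internally.
def parse_evaluation_options(criteria: str, weights: str = "") -> dict:
    names = [c.strip() for c in criteria.split(",") if c.strip()]
    if weights:
        values = [float(w) for w in weights.split(",")]
        if len(values) != len(names):
            raise ValueError("need exactly one weight per criterion")
        if abs(sum(values) - 1.0) > 1e-6:
            raise ValueError("weights must sum to 1.0")
    else:
        values = [1.0 / len(names)] * len(names)  # default: equal weights
    return dict(zip(names, values))

# parse_evaluation_options("persuasiveness,clarity,brand_alignment", "0.4,0.3,0.3")
# -> {'persuasiveness': 0.4, 'clarity': 0.3, 'brand_alignment': 0.3}
```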
Direct Scoring Output
DIRECT EVALUATION
=================
Response: [first 100 chars...]
Criteria Scores:
├── Accuracy (30%): 4/5 - Good
│ Evidence: "Correctly states that..."
│ Improve: Could cite specific version numbers
│
├── Completeness (25%): 4/5 - Good
│ Evidence: "Covers main points A, B, C"
│ Improve: Missing edge case handling
│
├── Clarity (20%): 5/5 - Excellent
│ Evidence: "Uses clear analogies"
│ Improve: None needed
│
└── Instruction Following (25%): 4/5 - Good
Evidence: "Follows format requirements"
Improve: Could add code examples as requested
Weighted Score: 4.2/5 (84%)
Status: PASSED
Summary:
├── Strengths: Clear explanations, accurate core content
└── Weaknesses: Missing some requested examples
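The weighted score is a weighted average of the per-criterion scores. A sketch of the arithmetic, assuming weights that sum to 1.0 (function and key names are hypothetical):

```python
# Weighted-score arithmetic behind the output above (illustrative names).
def weighted_score(scores: dict, weights: dict) -> float:
    """Combine 1-5 criterion scores using weights that sum to 1.0."""
    return sum(scores[c] * weights[c] for c in scores)

scores = {"accuracy": 4, "completeness": 4, "clarity": 5, "instruction_following": 4}
weights = {"accuracy": 0.30, "completeness": 0.25, "clarity": 0.20, "instruction_following": 0.25}
total = weighted_score(scores, weights)  # 4.2
print(f"Weighted Score: {total:.1f}/5 ({total / 5:.0%})")  # Weighted Score: 4.2/5 (84%)
```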
Pairwise Comparison Output
PAIRWISE COMPARISON
===================
Response A: [summary...]
Response B: [summary...]
Position Bias Check:
├── Pass 1 (A first): Winner = B, Confidence = 0.80
├── Pass 2 (B first): Winner = position A (maps to B), Confidence = 0.75
└── Consistency: CONSISTENT
Criterion-by-Criterion:
├── Clarity: B wins - Simpler analogies used
├── Accuracy: TIE - Both technically correct
├── Engagement: B wins - More memorable examples
└── Completeness: A wins - Covers more edge cases
Final Result:
├── Winner: Response B
├── Confidence: 0.78
└── Margin: Moderate
Reasoning: Response B is clearer and more engaging for the target
audience, though Response A is slightly more comprehensive.
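The two-pass check can be expressed as a short reconciliation routine. This sketch assumes a hypothetical `judge(first, second)` callable that returns a positional winner ("A", "B", or "TIE") and a confidence:

```python
# Two-pass position bias mitigation (judge interface is hypothetical).
def two_pass_compare(judge, response_a: str, response_b: str) -> dict:
    winner_1, conf_1 = judge(response_a, response_b)      # pass 1: A presented first
    winner_2_pos, conf_2 = judge(response_b, response_a)  # pass 2: B presented first
    # Map pass 2's positional verdict back to the actual responses.
    winner_2 = {"A": "B", "B": "A", "TIE": "TIE"}[winner_2_pos]
    if winner_1 == winner_2:
        return {"winner": winner_1, "confidence": (conf_1 + conf_2) / 2, "consistent": True}
    # Disagreement between passes signals position bias: fall back to TIE.
    return {"winner": "TIE", "confidence": min(conf_1, conf_2), "consistent": False}
```

With the example above, both passes map to Response B, so the verdicts are consistent and the confidences average to (0.80 + 0.75) / 2 ≈ 0.78.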
Rubric Generation Output
EVALUATION RUBRIC
=================
Criterion: Code Readability
Domain: Software Engineering
Scale: 1-5
Level 5 (Excellent):
├── Description: Code is immediately clear and maintainable
├── Characteristics:
│ - All names are descriptive and consistent
│ - Comprehensive documentation
│ - Clean, modular structure
└── Example: def calculate_total_price(items: List[Item]) -> Decimal:
Level 3 (Acceptable):
├── Description: Code is understandable with some effort
├── Characteristics:
│ - Most variables have meaningful names
│ - Basic comments for complex sections
│ - Logic is followable but could be cleaner
└── Example: def calc_total(items): # calculate sum
Level 1 (Poor):
├── Description: Code is difficult to understand
├── Characteristics:
│ - No meaningful variable names
│ - No comments or documentation
│ - Deeply nested or convoluted logic
└── Example: def f(x): return x[0]*x[1]+x[2]
Scoring Guidelines:
1. Focus on readability, not cleverness
2. Consider the intended audience
3. Consistency matters more than style preference
Edge Cases:
├── Domain abbreviations: Score for domain experts
└── Auto-generated code: Apply same standards, note in evaluation
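One way to represent a generated rubric programmatically. The dataclass layout below is an assumption for illustration, not a schema the command guarantees:

```python
from dataclasses import dataclass, field

@dataclass
class RubricLevel:
    score: int               # e.g., 1, 3, or 5
    description: str
    characteristics: list[str]
    example: str

@dataclass
class Rubric:
    criterion: str
    domain: str
    levels: list[RubricLevel]            # ordered worst to best
    guidelines: list[str] = field(default_factory=list)
    edge_cases: list[str] = field(default_factory=list)
```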
Examples
Evaluate Last Response
/evaluate-response direct
Evaluates the previous AI response using default criteria.
Compare Two Approaches
/evaluate-response pairwise
Prompts for two responses to compare.
Custom Criteria
/evaluate-response direct --criteria "persuasiveness,clarity,brand_alignment" --weights "0.4,0.3,0.3"
Generate Rubric
/evaluate-response rubric --domain "technical documentation"
JSON Output
/evaluate-response direct --format json
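The JSON schema is not formally specified here; an illustrative direct-mode shape (all field names are assumptions) might look like:

```python
# Assumed shape of direct-mode JSON output (illustrative only).
example_json = {
    "mode": "direct",
    "criteria": {
        "accuracy": {"weight": 0.30, "score": 4, "evidence": "...", "improve": "..."},
        "clarity": {"weight": 0.20, "score": 5, "evidence": "...", "improve": None},
    },
    "weighted_score": 4.2,
    "status": "PASSED",
}
```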
Related Components
Invokes Agent
llm-judge: Performs evaluation logic
Related Agents
compression-evaluator: Specialized for compression quality
qa-reviewer: Documentation quality review
Related Skills
advanced-evaluation: Evaluation frameworks
context-compression: For evaluating compression
Required Tools
| Tool | Purpose | Required |
|---|---|---|
| Read | Access responses from files or context | Optional |
| None | Internal LLM evaluation only | - |
Note: This command uses internal LLM reasoning without external tool calls.
Output Validation
Before marking complete, verify output contains:
Direct Mode:
- Each criterion scored (1-5)
- Evidence cited for each score
- Improvement suggestions per criterion
- Weighted total score
- PASSED/FAILED status
Pairwise Mode:
- Two-pass evaluation (position bias check)
- Consistency indicator
- Per-criterion comparison
- Winner with confidence
- Reasoning summary
Rubric Mode:
- At least 3 levels defined (1, 3, 5)
- Description per level
- Characteristics listed
- Examples for each level
- Edge cases documented
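These checks can be mirrored as a simple field audit. A sketch, with field names chosen only to match the lists above (hypothetical, not a fixed contract):

```python
# Hypothetical completeness check mirroring the validation lists above.
REQUIRED_FIELDS = {
    "direct": {"scores", "evidence", "improvements", "weighted_score", "status"},
    "pairwise": {"pass_1", "pass_2", "consistency", "per_criterion", "winner", "confidence", "reasoning"},
    "rubric": {"levels", "descriptions", "characteristics", "examples", "edge_cases"},
}

def missing_fields(mode: str, output: dict) -> list:
    """Return the names of required fields absent from the evaluation output."""
    return sorted(REQUIRED_FIELDS[mode] - output.keys())
```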
Best Practices
For Accurate Evaluation
- Define criteria before evaluating
- Provide clear context about the task
- Use appropriate mode for the evaluation type
- Review evidence for each score
For Pairwise Comparison
- Always use the two-pass method (automatic)
- If the two passes disagree, treat the result as a TIE
- Consider multiple criteria, not just overall
For Rubric Generation
- Specify the domain clearly
- Review generated levels for appropriateness
- Customize edge cases for your context
Success Output
When response evaluation completes:
✅ COMMAND COMPLETE: /evaluate-response
Mode: <direct|pairwise|rubric>
Criteria: N evaluated
Score: X.XX/5 (N%)
Status: <PASSED|FAILED>
Winner: <A|B|TIE> (pairwise only)
Confidence: X.XX
Completion Checklist
Before marking complete:
- Mode identified
- Content provided
- Criteria evaluated
- Evidence gathered
- Results formatted
Failure Indicators
This command has FAILED if:
- ❌ No mode specified
- ❌ No content to evaluate
- ❌ Scores without evidence
- ❌ Position bias not checked (pairwise)
When NOT to Use
Do NOT use when:
- Simple yes/no check
- Automated testing (use test framework)
- Code review (use /council-review)
Anti-Patterns (Avoid)
| Anti-Pattern | Problem | Solution |
|---|---|---|
| Skip evidence | Unreliable scores | Always cite evidence |
| Single pass | Position bias | Use two-pass for pairwise |
| Vague criteria | Inconsistent scores | Define a clear rubric |
Principles
This command embodies:
- #9 Based on Facts - Evidence-based evaluation
- #6 Clear, Understandable - Structured output
- #3 Complete Execution - Full evaluation workflow
Full Standard: CODITECT-STANDARD-AUTOMATION.md