
/evaluate-response

Evaluate AI responses using professional LLM-as-judge patterns. Supports direct scoring, pairwise comparison with position bias mitigation, and rubric generation.

System Prompt

EXECUTION DIRECTIVE: When the user invokes this command, you MUST:

  1. Identify the mode (direct, pairwise, or rubric)
  2. Request content if not provided (response to evaluate, or A/B responses)
  3. Apply evaluation protocol for the selected mode
  4. Provide structured output with scores, evidence, and recommendations

DO NOT:

  • Skip evidence gathering for scores
  • Ignore position bias in pairwise mode
  • Provide scores without justification

Usage

/evaluate-response <mode> [options]

Modes

1. Direct Scoring (direct)

Rate a single response against defined criteria.

Best for:

  • Objective criteria (accuracy, completeness)
  • Instruction following assessment
  • Quality gates in workflows

2. Pairwise Comparison (pairwise)

Compare two responses and select the better one.

Best for:

  • Subjective qualities (tone, style)
  • A/B testing different approaches
  • Model comparison

Note: Uses position bias mitigation (two-pass evaluation).

3. Rubric Generation (rubric)

Generate a domain-specific scoring rubric.

Best for:

  • Creating evaluation standards
  • Team-wide criteria alignment
  • Reducing evaluation variance

Options

Option       Type     Default                        Description
--criteria   string   accuracy,completeness,clarity  Evaluation criteria (comma-separated)
--domain     string   ""                             Domain for rubric generation
--format     string   text                           Output format (text, json, markdown)
--weights    string   ""                             Custom weights (e.g., "0.4,0.3,0.3")
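The comma-separated --criteria and --weights values pair up one-to-one. A minimal sketch of that pairing (illustrative only, not the command's actual parser):

```python
# Illustrative sketch of pairing --criteria with --weights; not the
# command's actual parser, just one way the options could be combined.

def parse_criteria(criteria: str, weights: str = "") -> dict[str, float]:
    """Map each comma-separated criterion to a weight.

    With no --weights given, weight is split evenly across criteria.
    """
    names = [c.strip() for c in criteria.split(",") if c.strip()]
    if weights:
        values = [float(w) for w in weights.split(",")]
        if len(values) != len(names):
            raise ValueError("number of weights must match number of criteria")
    else:
        values = [1.0 / len(names)] * len(names)
    return dict(zip(names, values))
```

For example, `parse_criteria("persuasiveness,clarity,brand_alignment", "0.4,0.3,0.3")` yields `{"persuasiveness": 0.4, "clarity": 0.3, "brand_alignment": 0.3}`.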

Direct Scoring Output

DIRECT EVALUATION
=================
Response: [first 100 chars...]

Criteria Scores:
├── Accuracy (30%): 4/5 - Good
│ Evidence: "Correctly states that..."
│ Improve: Could cite specific version numbers

├── Completeness (25%): 4/5 - Good
│ Evidence: "Covers main points A, B, C"
│ Improve: Missing edge case handling

├── Clarity (20%): 5/5 - Excellent
│ Evidence: "Uses clear analogies"
│ Improve: None needed

└── Instruction Following (25%): 4/5 - Good
    Evidence: "Follows format requirements"
    Improve: Could add code examples as requested

Weighted Score: 4.20/5 (84%)
Status: PASSED

Summary:
├── Strengths: Clear explanations, accurate core content
└── Weaknesses: Missing some requested examples
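The "Weighted Score" line is a weighted average of the per-criterion scores. A minimal sketch, using illustrative criteria and an assumed 4.0 pass threshold (the threshold is not part of the command spec):

```python
# Minimal sketch of the weighted-score calculation. The criteria,
# weights, and the 4.0 pass threshold are illustrative assumptions.

def weighted_score(scores: dict[str, int], weights: dict[str, float]) -> float:
    """Weighted average of 1-5 scores, normalized in case the
    weights do not sum exactly to 1."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight

scores = {"accuracy": 4, "completeness": 4, "clarity": 5}
weights = {"accuracy": 0.4, "completeness": 0.3, "clarity": 0.3}
score = weighted_score(scores, weights)            # ≈ 4.3
status = "PASSED" if score >= 4.0 else "FAILED"
```

Normalizing by the total weight keeps the score on the 1-5 scale even when custom weights do not sum exactly to 1.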

Pairwise Comparison Output

PAIRWISE COMPARISON
===================
Response A: [summary...]
Response B: [summary...]

Position Bias Check:
├── Pass 1 (A first): Winner = B, Confidence = 0.80
├── Pass 2 (B first): Winner = A (mapped: B), Confidence = 0.75
└── Consistency: CONSISTENT

Criterion-by-Criterion:
├── Clarity: B wins - Simpler analogies used
├── Accuracy: TIE - Both technically correct
├── Engagement: B wins - More memorable examples
└── Completeness: A wins - Covers more edge cases

Final Result:
├── Winner: Response B
├── Confidence: 0.78
└── Margin: Moderate

Reasoning: Response B is clearer and more engaging for the target
audience, though Response A is slightly more comprehensive.
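The two-pass position bias check above can be sketched as follows. Here `judge` is a placeholder for the underlying LLM call; its name and signature are assumptions, not part of the command.

```python
# Hedged sketch of two-pass position bias mitigation. `judge(x, y)` is
# a placeholder for the underlying LLM call; it returns which of the two
# presented responses won ("first" or "second") plus a confidence.

def compare_with_bias_check(a: str, b: str, judge) -> dict:
    # Pass 1: A is shown first, so "first" maps to A.
    winner1, conf1 = judge(a, b)
    pass1 = "A" if winner1 == "first" else "B"
    # Pass 2: order swapped, so "first" now maps to B.
    winner2, conf2 = judge(b, a)
    pass2 = "B" if winner2 == "first" else "A"
    if pass1 == pass2:
        return {"winner": pass1,
                "confidence": (conf1 + conf2) / 2,
                "consistency": "CONSISTENT"}
    # Disagreement between passes signals position bias: fall back to TIE.
    return {"winner": "TIE",
            "confidence": min(conf1, conf2),
            "consistency": "INCONSISTENT"}
```

A judge that prefers the same response regardless of presentation order yields CONSISTENT; one swayed by ordering disagrees with itself across passes and is reported as a TIE.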

Rubric Generation Output

EVALUATION RUBRIC
=================
Criterion: Code Readability
Domain: Software Engineering
Scale: 1-5

Level 5 (Excellent):
├── Description: Code is immediately clear and maintainable
├── Characteristics:
│ - All names are descriptive and consistent
│ - Comprehensive documentation
│ - Clean, modular structure
└── Example: def calculate_total_price(items: List[Item]) -> Decimal:

Level 3 (Acceptable):
├── Description: Code is understandable with some effort
├── Characteristics:
│ - Most variables have meaningful names
│ - Basic comments for complex sections
│ - Logic is followable but could be cleaner
└── Example: def calc_total(items): # calculate sum

Level 1 (Poor):
├── Description: Code is difficult to understand
├── Characteristics:
│ - No meaningful variable names
│ - No comments or documentation
│ - Deeply nested or convoluted logic
└── Example: def f(x): return x[0]*x[1]+x[2]

Scoring Guidelines:
1. Focus on readability, not cleverness
2. Consider the intended audience
3. Consistency matters more than style preference

Edge Cases:
├── Domain abbreviations: Score for domain experts
└── Auto-generated code: Apply same standards, note in evaluation
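A rubric like the one above maps naturally onto a small data structure. The class names below are illustrative assumptions, not a format the command mandates:

```python
# Illustrative container for a generated rubric; names are assumptions.
from dataclasses import dataclass, field

@dataclass
class RubricLevel:
    score: int                  # 1-5
    description: str
    characteristics: list[str]
    example: str

@dataclass
class Rubric:
    criterion: str
    domain: str
    levels: list[RubricLevel] = field(default_factory=list)
    scoring_guidelines: list[str] = field(default_factory=list)
    edge_cases: list[str] = field(default_factory=list)

readability = Rubric(
    criterion="Code Readability",
    domain="Software Engineering",
    levels=[RubricLevel(5, "Code is immediately clear and maintainable",
                        ["Descriptive, consistent names"],
                        "def calculate_total_price(items): ...")],
)
```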

Examples

Evaluate Last Response

/evaluate-response direct

Evaluates the previous AI response using default criteria.

Compare Two Approaches

/evaluate-response pairwise

Prompts for two responses to compare.

Custom Criteria

/evaluate-response direct --criteria "persuasiveness,clarity,brand_alignment" --weights "0.4,0.3,0.3"

Generate Rubric

/evaluate-response rubric --domain "technical documentation"

JSON Output

/evaluate-response direct --format json

Invokes Agent

  • llm-judge: Performs evaluation logic
  • compression-evaluator: Specialized for compression quality
  • qa-reviewer: Documentation quality review
  • advanced-evaluation: Evaluation frameworks
  • context-compression: For evaluating compression

Required Tools

Tool   Purpose                                 Required
Read   Access responses from files or context  Optional
None   Internal LLM evaluation only            -

Note: This command uses internal LLM reasoning without external tool calls.

Output Validation

Before marking complete, verify output contains:

Direct Mode:

  • Each criterion scored (1-5)
  • Evidence cited for each score
  • Improvement suggestions per criterion
  • Weighted total score
  • PASSED/FAILED status

Pairwise Mode:

  • Two-pass evaluation (position bias check)
  • Consistency indicator
  • Per-criterion comparison
  • Winner with confidence
  • Reasoning summary

Rubric Mode:

  • At least 3 levels defined (1, 3, 5)
  • Description per level
  • Characteristics listed
  • Examples for each level
  • Edge cases documented
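The rubric-mode checklist above could be mechanized along these lines (an illustrative sketch; the dict layout is an assumption):

```python
# Illustrative validator for the rubric-mode checklist; the expected
# dict layout ("levels" keyed by score, "edge_cases" list) is assumed.

REQUIRED_LEVELS = {1, 3, 5}
REQUIRED_FIELDS = ("description", "characteristics", "example")

def validate_rubric(rubric: dict) -> list[str]:
    """Return a list of problems; an empty list means the rubric passes."""
    problems = []
    levels = rubric.get("levels", {})
    missing = REQUIRED_LEVELS - set(levels)
    if missing:
        problems.append(f"missing levels: {sorted(missing)}")
    for score, level in levels.items():
        for key in REQUIRED_FIELDS:
            if not level.get(key):
                problems.append(f"level {score}: missing {key}")
    if not rubric.get("edge_cases"):
        problems.append("no edge cases documented")
    return problems
```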

Best Practices

For Accurate Evaluation

  1. Define criteria before evaluating
  2. Provide clear context about the task
  3. Use appropriate mode for the evaluation type
  4. Review evidence for each score

For Pairwise Comparison

  1. Always use the two-pass method (automatic)
  2. If results are inconsistent across passes, report a TIE
  3. Consider multiple criteria, not just overall

For Rubric Generation

  1. Specify the domain clearly
  2. Review generated levels for appropriateness
  3. Customize edge cases for your context

Success Output

When response evaluation completes:

✅ COMMAND COMPLETE: /evaluate-response
Mode: <direct|pairwise|rubric>
Criteria: N evaluated
Score: X.XX/5 (N%)
Status: <PASSED|FAILED>
Winner: <A|B|TIE> (pairwise only)
Confidence: X.XX

Completion Checklist

Before marking complete:

  • Mode identified
  • Content provided
  • Criteria evaluated
  • Evidence gathered
  • Results formatted

Failure Indicators

This command has FAILED if:

  • ❌ No mode specified
  • ❌ No content to evaluate
  • ❌ Scores without evidence
  • ❌ Position bias not checked (pairwise)

When NOT to Use

Do NOT use when:

  • The task is a simple yes/no check
  • You need automated testing (use a test framework)
  • You need code review (use /council-review)

Anti-Patterns (Avoid)

Anti-Pattern     Problem               Solution
Skip evidence    Unreliable scores     Always cite evidence
Single pass      Position bias         Use two-pass for pairwise
Vague criteria   Inconsistent scoring  Define clear rubric

Principles

This command embodies:

  • #9 Based on Facts - Evidence-based evaluation
  • #6 Clear, Understandable - Structured output
  • #3 Complete Execution - Full evaluation workflow

Full Standard: CODITECT-STANDARD-AUTOMATION.md