
/evaluate-response

Evaluate AI responses using professional LLM-as-judge patterns. Supports direct scoring, pairwise comparison with position bias mitigation, and rubric generation.

System Prompt

EXECUTION DIRECTIVE: When the user invokes this command, you MUST:

  1. Identify the mode (direct, pairwise, or rubric)
  2. Request content if not provided (response to evaluate, or A/B responses)
  3. Apply evaluation protocol for the selected mode
  4. Provide structured output with scores, evidence, and recommendations

DO NOT:

  • Skip evidence gathering for scores
  • Ignore position bias in pairwise mode
  • Provide scores without justification

Usage

/evaluate-response <mode> [options]

Modes

1. Direct Scoring (direct)

Rate a single response against defined criteria.

Best for:

  • Objective criteria (accuracy, completeness)
  • Instruction following assessment
  • Quality gates in workflows

2. Pairwise Comparison (pairwise)

Compare two responses and select the better one.

Best for:

  • Subjective qualities (tone, style)
  • A/B testing different approaches
  • Model comparison

Note: Uses position bias mitigation (two-pass evaluation).

3. Rubric Generation (rubric)

Generate a domain-specific scoring rubric.

Best for:

  • Creating evaluation standards
  • Team-wide criteria alignment
  • Reducing evaluation variance

Options

Option       Type     Default                        Description
--criteria   string   accuracy,completeness,clarity  Evaluation criteria (comma-separated)
--domain     string   ""                             Domain for rubric generation
--format     string   text                           Output format (text, json, markdown)
--weights    string   ""                             Custom weights (e.g., "0.4,0.3,0.3")
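The comma-separated --criteria and --weights values pair up one-to-one. A minimal sketch of that pairing (illustrative only, not the command's actual parser):

```python
# Illustrative sketch of pairing --criteria with --weights; not the
# command's actual parser, just one way the options could be combined.

def parse_criteria(criteria: str, weights: str = "") -> dict[str, float]:
    """Map each comma-separated criterion to a weight.

    With no --weights given, weight is split evenly across criteria.
    """
    names = [c.strip() for c in criteria.split(",") if c.strip()]
    if weights:
        values = [float(w) for w in weights.split(",")]
        if len(values) != len(names):
            raise ValueError("number of weights must match number of criteria")
    else:
        values = [1.0 / len(names)] * len(names)
    return dict(zip(names, values))
```

For example, `parse_criteria("persuasiveness,clarity,brand_alignment", "0.4,0.3,0.3")` yields `{"persuasiveness": 0.4, "clarity": 0.3, "brand_alignment": 0.3}`.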

Direct Scoring Output

DIRECT EVALUATION
=================
Response: [first 100 chars...]

Criteria Scores:
├── Accuracy (30%): 4/5 - Good
│ Evidence: "Correctly states that..."
│ Improve: Could cite specific version numbers

├── Completeness (25%): 4/5 - Good
│ Evidence: "Covers main points A, B, C"
│ Improve: Missing edge case handling

├── Clarity (20%): 5/5 - Excellent
│ Evidence: "Uses clear analogies"
│ Improve: None needed

└── Instruction Following (25%): 4/5 - Good
    Evidence: "Follows format requirements"
    Improve: Could add code examples as requested

Weighted Score: 4.20/5 (84%)
Status: PASSED

Summary:
├── Strengths: Clear explanations, accurate core content
└── Weaknesses: Missing some requested examples
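The "Weighted Score" line is a weighted average of the per-criterion scores. A minimal sketch, using illustrative criteria and an assumed 4.0 pass threshold (the threshold is not part of the command spec):

```python
# Minimal sketch of the weighted-score calculation. The criteria,
# weights, and the 4.0 pass threshold are illustrative assumptions.

def weighted_score(scores: dict[str, int], weights: dict[str, float]) -> float:
    """Weighted average of 1-5 scores, normalized in case the
    weights do not sum exactly to 1."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight

scores = {"accuracy": 4, "completeness": 4, "clarity": 5}
weights = {"accuracy": 0.4, "completeness": 0.3, "clarity": 0.3}
score = weighted_score(scores, weights)            # ≈ 4.3
status = "PASSED" if score >= 4.0 else "FAILED"
```

Normalizing by the total weight keeps the score on the 1-5 scale even when custom weights do not sum exactly to 1.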

Pairwise Comparison Output

PAIRWISE COMPARISON
===================
Response A: [summary...]
Response B: [summary...]

Position Bias Check:
├── Pass 1 (A first): Winner = B, Confidence = 0.80
├── Pass 2 (B first): Winner = A (mapped: B), Confidence = 0.75
└── Consistency: CONSISTENT

Criterion-by-Criterion:
├── Clarity: B wins - Simpler analogies used
├── Accuracy: TIE - Both technically correct
├── Engagement: B wins - More memorable examples
└── Completeness: A wins - Covers more edge cases

Final Result:
├── Winner: Response B
├── Confidence: 0.78
└── Margin: Moderate

Reasoning: Response B is clearer and more engaging for the target
audience, though Response A is slightly more comprehensive.
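The two-pass position bias check above can be sketched as follows. Here `judge` is a placeholder for the underlying LLM call; its name and signature are assumptions, not part of the command.

```python
# Hedged sketch of two-pass position bias mitigation. `judge(x, y)` is
# a placeholder for the underlying LLM call; it returns which of the two
# presented responses won ("first" or "second") plus a confidence.

def compare_with_bias_check(a: str, b: str, judge) -> dict:
    # Pass 1: A is shown first, so "first" maps to A.
    winner1, conf1 = judge(a, b)
    pass1 = "A" if winner1 == "first" else "B"
    # Pass 2: order swapped, so "first" now maps to B.
    winner2, conf2 = judge(b, a)
    pass2 = "B" if winner2 == "first" else "A"
    if pass1 == pass2:
        return {"winner": pass1,
                "confidence": (conf1 + conf2) / 2,
                "consistency": "CONSISTENT"}
    # Disagreement between passes signals position bias: fall back to TIE.
    return {"winner": "TIE",
            "confidence": min(conf1, conf2),
            "consistency": "INCONSISTENT"}
```

A judge that prefers the same response regardless of presentation order yields CONSISTENT; one swayed by ordering disagrees with itself across passes and is reported as a TIE.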

Rubric Generation Output

EVALUATION RUBRIC
=================
Criterion: Code Readability
Domain: Software Engineering
Scale: 1-5

Level 5 (Excellent):
├── Description: Code is immediately clear and maintainable
├── Characteristics:
│ - All names are descriptive and consistent
│ - Comprehensive documentation
│ - Clean, modular structure
└── Example: def calculate_total_price(items: List[Item]) -> Decimal:

Level 3 (Acceptable):
├── Description: Code is understandable with some effort
├── Characteristics:
│ - Most variables have meaningful names
│ - Basic comments for complex sections
│ - Logic is followable but could be cleaner
└── Example: def calc_total(items): # calculate sum

Level 1 (Poor):
├── Description: Code is difficult to understand
├── Characteristics:
│ - No meaningful variable names
│ - No comments or documentation
│ - Deeply nested or convoluted logic
└── Example: def f(x): return x[0]*x[1]+x[2]

Scoring Guidelines:
1. Focus on readability, not cleverness
2. Consider the intended audience
3. Consistency matters more than style preference

Edge Cases:
├── Domain abbreviations: Score for domain experts
└── Auto-generated code: Apply same standards, note in evaluation
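A rubric like the one above maps naturally onto a small data structure. The class names below are illustrative assumptions, not a format the command mandates:

```python
# Illustrative container for a generated rubric; names are assumptions.
from dataclasses import dataclass, field

@dataclass
class RubricLevel:
    score: int                  # 1-5
    description: str
    characteristics: list[str]
    example: str

@dataclass
class Rubric:
    criterion: str
    domain: str
    levels: list[RubricLevel] = field(default_factory=list)
    scoring_guidelines: list[str] = field(default_factory=list)
    edge_cases: list[str] = field(default_factory=list)

readability = Rubric(
    criterion="Code Readability",
    domain="Software Engineering",
    levels=[RubricLevel(5, "Code is immediately clear and maintainable",
                        ["Descriptive, consistent names"],
                        "def calculate_total_price(items): ...")],
)
```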

Examples

Evaluate Last Response

/evaluate-response direct

Evaluates the previous AI response using default criteria.

Compare Two Approaches

/evaluate-response pairwise

Prompts for two responses to compare.

Custom Criteria

/evaluate-response direct --criteria "persuasiveness,clarity,brand_alignment" --weights "0.4,0.3,0.3"

Generate Rubric

/evaluate-response rubric --domain "technical documentation"

JSON Output

/evaluate-response direct --format json

Invokes Agent

  • llm-judge: Performs evaluation logic
  • compression-evaluator: Specialized for compression quality
  • qa-reviewer: Documentation quality review
  • advanced-evaluation: Evaluation frameworks
  • context-compression: For evaluating compression

Required Tools

Tool   Purpose                                 Required
Read   Access responses from files or context  Optional
None   Internal LLM evaluation only            -

Note: This command uses internal LLM reasoning without external tool calls.

Output Validation

Before marking complete, verify output contains:

Direct Mode:

  • Each criterion scored (1-5)
  • Evidence cited for each score
  • Improvement suggestions per criterion
  • Weighted total score
  • PASSED/FAILED status

Pairwise Mode:

  • Two-pass evaluation (position bias check)
  • Consistency indicator
  • Per-criterion comparison
  • Winner with confidence
  • Reasoning summary

Rubric Mode:

  • At least 3 levels defined (1, 3, 5)
  • Description per level
  • Characteristics listed
  • Examples for each level
  • Edge cases documented
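The rubric-mode checklist above could be mechanized along these lines (an illustrative sketch; the dict layout is an assumption):

```python
# Illustrative validator for the rubric-mode checklist; the expected
# dict layout ("levels" keyed by score, "edge_cases" list) is assumed.

REQUIRED_LEVELS = {1, 3, 5}
REQUIRED_FIELDS = ("description", "characteristics", "example")

def validate_rubric(rubric: dict) -> list[str]:
    """Return a list of problems; an empty list means the rubric passes."""
    problems = []
    levels = rubric.get("levels", {})
    missing = REQUIRED_LEVELS - set(levels)
    if missing:
        problems.append(f"missing levels: {sorted(missing)}")
    for score, level in levels.items():
        for key in REQUIRED_FIELDS:
            if not level.get(key):
                problems.append(f"level {score}: missing {key}")
    if not rubric.get("edge_cases"):
        problems.append("no edge cases documented")
    return problems
```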

Best Practices

For Accurate Evaluation

  1. Define criteria before evaluating
  2. Provide clear context about the task
  3. Use appropriate mode for the evaluation type
  4. Review evidence for each score

For Pairwise Comparison

  1. Always use the two-pass method (automatic)
  2. If results are inconsistent across passes, report a TIE
  3. Consider multiple criteria, not just overall

For Rubric Generation

  1. Specify the domain clearly
  2. Review generated levels for appropriateness
  3. Customize edge cases for your context

Success Output

When response evaluation completes:

✅ COMMAND COMPLETE: /evaluate-response
Mode: <direct|pairwise|rubric>
Criteria: N evaluated
Score: X.XX/5 (N%)
Status: <PASSED|FAILED>
Winner: <A|B|TIE> (pairwise only)
Confidence: X.XX

Completion Checklist

Before marking complete:

  • Mode identified
  • Content provided
  • Criteria evaluated
  • Evidence gathered
  • Results formatted

Failure Indicators

This command has FAILED if:

  • ❌ No mode specified
  • ❌ No content to evaluate
  • ❌ Scores without evidence
  • ❌ Position bias not checked (pairwise)

When NOT to Use

Do NOT use when:

  • The task is a simple yes/no check
  • You need automated testing (use a test framework)
  • You need code review (use /council-review)

Anti-Patterns (Avoid)

Anti-Pattern     Problem               Solution
Skip evidence    Unreliable scores     Always cite evidence
Single pass      Position bias         Use two-pass for pairwise
Vague criteria   Inconsistent scoring  Define clear rubric

Principles

This command embodies:

  • #9 Based on Facts - Evidence-based evaluation
  • #6 Clear, Understandable - Structured output
  • #3 Complete Execution - Full evaluation workflow

Full Standard: CODITECT-STANDARD-AUTOMATION.md