Advanced LLM Evaluation
This skill covers production-grade techniques for evaluating LLM outputs using LLMs as judges. Key insight: LLM-as-a-Judge is not a single technique but a family of approaches, each suited to different evaluation contexts.
When to Use
✅ Use this skill when:
- Building automated evaluation pipelines for LLM outputs
- Comparing multiple model responses to select the best one
- Establishing consistent quality standards across evaluation teams
- Debugging evaluation systems with inconsistent results
- Designing A/B tests for prompt or model changes
- Creating rubrics for human or automated evaluation
- Analyzing correlation between automated and human judgments
❌ Don't use this skill when:
- Simple pass/fail validation
- Real-time validation where latency matters
- Tasks with deterministic correct answers (use exact matching)
The Evaluation Taxonomy
| Approach | Best For | Reliability | Failure Mode |
|---|---|---|---|
| Direct Scoring | Objective criteria (accuracy, format) | Moderate-High | Calibration drift |
| Pairwise Comparison | Subjective preferences (tone, style) | Higher | Position/length bias |
Research finding (MT-Bench, Zheng et al., 2023): Pairwise comparison achieves higher agreement with human judges than direct scoring for preference-based evaluation.
The Bias Landscape
LLM judges exhibit systematic biases that must be actively mitigated:
| Bias | Description | Mitigation |
|---|---|---|
| Position Bias | First-position preference in pairwise | Swap positions, majority vote |
| Length Bias | Longer responses rated higher | Explicit prompting, normalization |
| Self-Enhancement | Models rate own outputs higher | Different gen/eval models |
| Verbosity Bias | Detail preferred regardless of need | Criteria-specific rubrics |
| Authority Bias | Confident tone rated higher | Require evidence citation |
Direct Scoring Implementation
Direct scoring requires: clear criteria, calibrated scale, and structured output.
Scale Selection
| Scale | Use Case | Reliability |
|---|---|---|
| 1-3 | Binary with neutral | Highest (low cognitive load) |
| 1-5 | Standard Likert | Good balance |
| 1-10 | High granularity | Only with detailed rubrics |
Prompt Structure
```
You are an expert evaluator assessing response quality.

## Task
Evaluate the following response against each criterion.

## Original Prompt
{prompt}

## Response to Evaluate
{response}

## Criteria
{for each criterion: name, description, weight}

## Instructions
For each criterion:
1. Find specific evidence in the response
2. Score according to the rubric (1-5 scale)
3. Justify your score with evidence
4. Suggest one specific improvement

## Output Format
Respond with JSON containing scores, justifications, and summary.
```
Critical: Always require justification before scores. Chain-of-thought prompting improves reliability by 15-25%.
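The prompt structure and parsing step can be sketched in a few lines. This is a minimal, illustrative sketch: `SCORING_PROMPT` condenses the template above, and the rejection of justification-free replies enforces the "justification before scores" rule; names and the criteria tuple shape are assumptions, not a fixed API.

```python
import json

# Condensed form of the direct-scoring prompt template above (illustrative).
SCORING_PROMPT = """You are an expert evaluator assessing response quality.

## Original Prompt
{prompt}

## Response to Evaluate
{response}

## Criteria
{criteria}

## Instructions
For each criterion: cite specific evidence, score 1-5 per the rubric, justify with evidence.

## Output Format
Respond with JSON: {{"scores": {{...}}, "justifications": {{...}}, "summary": "..."}}"""


def build_scoring_prompt(prompt, response, criteria):
    """Render the direct-scoring prompt; criteria is a list of (name, description, weight)."""
    criteria_text = "\n".join(
        f"- {name} (weight {weight}): {desc}" for name, desc, weight in criteria
    )
    return SCORING_PROMPT.format(prompt=prompt, response=response, criteria=criteria_text)


def parse_judge_output(raw):
    """Parse the judge's JSON reply; reject replies that give scores without justifications."""
    result = json.loads(raw)
    if result.get("scores") and not result.get("justifications"):
        raise ValueError("Judge returned scores without justifications")
    return result
```

Rejecting score-only replies at parse time makes the chain-of-thought requirement enforceable rather than merely suggested.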
Pairwise Comparison Implementation
Position Bias Mitigation Protocol
- First pass: Response A in first position, Response B in second
- Second pass: Response B first, Response A second
- Consistency check: If passes disagree → TIE with reduced confidence
- Final verdict: Consistent winner with averaged confidence
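The two-pass protocol above can be sketched as a driver around any judge callable. This is a sketch under the assumption that `judge(prompt, first, second)` returns a positional verdict (`"FIRST"`, `"SECOND"`, or `"TIE"`) and a confidence; the function names are illustrative.

```python
def pairwise_with_position_swap(judge, prompt, response_a, response_b):
    """Run both passes of the position-swap protocol and reconcile the verdicts.

    judge(prompt, first, second) -> (winner, confidence), where winner is
    "FIRST", "SECOND", or "TIE" relative to presentation order (assumed interface).
    """
    # Pass 1: A in first position. Map the positional verdict back to a label.
    w1, c1 = judge(prompt, response_a, response_b)
    label1 = {"FIRST": "A", "SECOND": "B"}.get(w1, "TIE")
    # Pass 2: positions swapped, B first.
    w2, c2 = judge(prompt, response_b, response_a)
    label2 = {"FIRST": "B", "SECOND": "A"}.get(w2, "TIE")
    if label1 == label2:
        return label1, (c1 + c2) / 2   # Consistent verdict, averaged confidence
    return "TIE", 0.5                  # Disagreement -> low-confidence tie
```

A purely position-biased judge (one that always prefers whichever response comes first) disagrees with itself across the two passes and collapses to a low-confidence tie, which is exactly the intended behavior.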
Prompt Structure
```
You are an expert evaluator comparing two AI responses.

## Critical Instructions
- Do NOT prefer responses because they are longer
- Do NOT prefer responses based on position (first vs second)
- Focus ONLY on quality according to specified criteria
- Ties are acceptable when responses are genuinely equivalent

## Original Prompt
{prompt}

## Response A
{response_a}

## Response B
{response_b}

## Comparison Criteria
{criteria list}

## Output Format
JSON with per-criterion comparison, overall winner, confidence (0-1), reasoning.
```
Confidence Calibration
```python
def calculate_final_confidence(pass1_winner, pass1_conf, pass2_winner, pass2_conf):
    if pass1_winner == pass2_winner:
        return (pass1_conf + pass2_conf) / 2, pass1_winner
    else:
        return 0.5, "TIE"  # Disagreement -> low-confidence tie
```
Rubric Generation
Well-defined rubrics reduce evaluation variance by 40-60%.
Rubric Components
- Level descriptions: Clear boundaries for each score level
- Characteristics: Observable features defining each level
- Examples: Representative text (optional but valuable)
- Edge cases: Guidance for ambiguous situations
- Scoring guidelines: General principles for consistency
Example: Code Readability Rubric
```json
{
  "levels": [
    {
      "score": 1,
      "label": "Poor",
      "description": "Code is difficult to understand without significant effort",
      "characteristics": [
        "No meaningful variable or function names",
        "No comments or documentation",
        "Deeply nested or convoluted logic"
      ]
    },
    {
      "score": 3,
      "label": "Adequate",
      "description": "Code is understandable with some effort",
      "characteristics": [
        "Most variables have meaningful names",
        "Basic comments for complex sections",
        "Logic is followable but could be cleaner"
      ]
    },
    {
      "score": 5,
      "label": "Excellent",
      "description": "Code is immediately clear and maintainable",
      "characteristics": [
        "All names are descriptive and consistent",
        "Comprehensive documentation",
        "Clean, modular structure"
      ]
    }
  ],
  "edgeCases": [
    {
      "situation": "Domain-specific abbreviations used",
      "guidance": "Score based on readability for domain experts"
    }
  ]
}
```
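A rubric stored this way can be flattened into prompt-ready text before evaluation. The sketch below assumes the JSON shape shown above (`levels` with `score`/`label`/`description`/`characteristics`, optional `edgeCases`); the function name is illustrative.

```python
def render_rubric(rubric):
    """Flatten a rubric dict (levels plus edge cases) into prompt-ready text."""
    lines = []
    for level in rubric["levels"]:
        lines.append(f'Score {level["score"]} ({level["label"]}): {level["description"]}')
        for characteristic in level["characteristics"]:
            lines.append(f"  - {characteristic}")
    for case in rubric.get("edgeCases", []):
        lines.append(f'Edge case: {case["situation"]} -> {case["guidance"]}')
    return "\n".join(lines)
```

Rendering the rubric from a single source of truth keeps the judge prompt and any human-facing documentation in sync.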
Decision Framework: Direct vs. Pairwise
```
Is there an objective ground truth?
├── Yes → Direct Scoring
│   └── Examples: factual accuracy, format compliance
│
└── No → Is it a preference or quality judgment?
    ├── Yes → Pairwise Comparison
    │   └── Examples: tone, style, creativity
    │
    └── No → Reference-based evaluation
        └── Examples: summarization, translation
```
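The decision tree above reduces to a two-question selector. A minimal sketch (function and return names are illustrative):

```python
def select_method(has_ground_truth, is_preference_judgment):
    """Map the decision framework onto an evaluation method name."""
    if has_ground_truth:
        return "direct_scoring"       # e.g. factual accuracy, format compliance
    if is_preference_judgment:
        return "pairwise_comparison"  # e.g. tone, style, creativity
    return "reference_based"         # e.g. summarization, translation
```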
Scaling Evaluation
Panel of LLMs (PoLL)
Use multiple models as judges, aggregate votes:
- Reduces individual model bias
- More expensive but more reliable for high-stakes decisions
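Vote aggregation for a panel can be as simple as a strict majority with a tie fallback; more elaborate schemes (weighted votes, confidence-weighted averaging) are possible. A minimal sketch, assuming each judge emits "A", "B", or "TIE":

```python
from collections import Counter

def poll_verdict(votes):
    """Aggregate per-judge verdicts by strict majority; otherwise fall back to TIE."""
    counts = Counter(votes)
    winner, top = counts.most_common(1)[0]
    if top > len(votes) / 2:
        return winner
    return "TIE"
```

Requiring a strict majority (rather than a plurality) means a split panel surfaces as a tie, which can then be escalated rather than silently resolved.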
Hierarchical Evaluation
```
Fast cheap model → Screening
       │
       ▼
 Low confidence?
       │
       ▼
Expensive model → Final verdict
```
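The escalation step can be sketched as a small cascade. This assumes each judge callable returns `(verdict, confidence)`; the threshold and function names are illustrative.

```python
def hierarchical_evaluate(cheap_judge, expensive_judge, item, threshold=0.7):
    """Screen with the cheap judge; escalate low-confidence items to the expensive one.

    Returns (verdict, confidence, which_judge_decided).
    """
    verdict, confidence = cheap_judge(item)
    if confidence >= threshold:
        return verdict, confidence, "cheap"
    # Below threshold: pay for the stronger judge.
    return (*expensive_judge(item), "expensive")
```

In practice the threshold should be tuned against measured agreement between the two judges, so that escalation happens exactly where the cheap judge is unreliable.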
Human-in-the-Loop
- Automated evaluation for clear cases
- Human review for low-confidence results
- Design feedback loop to improve automated evaluation
Common Anti-Patterns
| Anti-Pattern | Problem | Solution |
|---|---|---|
| Score without justification | Scores lack grounding | Require evidence first |
| Single-pass pairwise | Position bias corruption | Always swap positions |
| Overloaded criteria | Unreliable measurement | One criterion = one aspect |
| Missing edge cases | Inconsistent handling | Include guidance in rubrics |
| Ignoring confidence | Confidently wrong verdicts are worse than uncertain ones | Calibrate to consistency |
Guidelines
- Always require justification before scores - 15-25% reliability improvement
- Always swap positions in pairwise comparison - Position bias mitigation
- Match scale granularity to rubric specificity - Don't use 1-10 without details
- Separate objective and subjective criteria - Use appropriate method
- Include confidence scores - Calibrate to position consistency
- Define edge cases explicitly - Reduce evaluation variance
- Use domain-specific rubrics - Generic = less useful
- Validate against human judgments - Ground truth correlation
- Monitor for systematic bias - Track disagreement patterns
- Design for iteration - Feedback loops improve systems
Evaluation Criteria Template
Use this template when defining new evaluation criteria:
```yaml
# evaluation-criteria-template.yaml
criteria:
  - name: "{Criterion Name}"
    id: "EVAL-{ID}"
    type: "{objective|subjective}"  # Determines method selection
    description: "{What this criterion measures}"

    # Scoring rubric (1-5 scale recommended)
    rubric:
      1:
        label: "Poor"
        description: "{Clear definition of score 1}"
        characteristics:
          - "{Observable characteristic 1}"
          - "{Observable characteristic 2}"
      3:
        label: "Adequate"
        description: "{Clear definition of score 3}"
        characteristics:
          - "{Observable characteristic 1}"
          - "{Observable characteristic 2}"
      5:
        label: "Excellent"
        description: "{Clear definition of score 5}"
        characteristics:
          - "{Observable characteristic 1}"
          - "{Observable characteristic 2}"

    # Edge cases with specific guidance
    edge_cases:
      - situation: "{Ambiguous situation description}"
        guidance: "{How to score in this case}"

    # Evaluation method configuration
    method:
      primary: "{direct_scoring|pairwise_comparison}"
      bias_mitigations:
        - "{position_swap|length_normalization|etc}"
      confidence_threshold: 0.7  # Below this, escalate to human

    # Weight in overall score (optional)
    weight: 1.0
```
Example instantiation:
```yaml
criteria:
  - name: "Code Correctness"
    id: "EVAL-001"
    type: "objective"
    description: "Whether the code produces correct output for given inputs"
    rubric:
      1:
        label: "Incorrect"
        description: "Code fails for most or all test cases"
        characteristics:
          - "Runtime errors on execution"
          - "Wrong output for standard inputs"
      3:
        label: "Partially Correct"
        description: "Code works for common cases but fails edge cases"
        characteristics:
          - "Passes basic test cases"
          - "Fails on boundary conditions"
      5:
        label: "Fully Correct"
        description: "Code handles all cases including edge cases"
        characteristics:
          - "Passes all test cases"
          - "Handles empty/null inputs gracefully"
    edge_cases:
      - situation: "Code is correct but uses deprecated API"
        guidance: "Score correctness separately; note deprecation in feedback"
    method:
      primary: "direct_scoring"
      bias_mitigations:
        - "require_justification_first"
      confidence_threshold: 0.8
    weight: 2.0  # Weighted higher than style criteria
```
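Once a criterion file is parsed into a dict, a structural check catches malformed definitions before they silently degrade evaluation quality. The field names below follow the template; the specific validation rules are illustrative assumptions, not part of the spec.

```python
# Required top-level fields per the criteria template (assumed minimal set).
REQUIRED_KEYS = {"name", "id", "type", "description", "rubric", "method"}

def validate_criterion(criterion):
    """Check a parsed criterion dict against the template's required structure.

    Returns a list of problem strings; an empty list means the criterion is valid.
    """
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS - criterion.keys()]
    if criterion.get("type") not in ("objective", "subjective"):
        problems.append("type must be 'objective' or 'subjective'")
    for score, level in criterion.get("rubric", {}).items():
        if "description" not in level:
            problems.append(f"rubric level {score} missing description")
    return problems
```

Running this check at load time turns a vague "judge behaves oddly" debugging session into an immediate, named configuration error.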
Related Components
Skills
- context-fundamentals - Context structure for evaluation prompts
- tool-design - Building evaluation tools
- evaluation-framework - Foundational evaluation concepts
Agents
- llm-judge - LLM-as-judge evaluation agent
- compression-evaluator - Probe-based compression evaluation
Commands
- /evaluate-response - Run LLM-as-judge evaluation
Scripts
- external/Agent-Skills-for-Context-Engineering/skills/advanced-evaluation/scripts/evaluation_example.py - LLM-as-judge implementation patterns
Success Output
When this skill is successfully applied, output:
✅ SKILL COMPLETE: advanced-evaluation
Completed:
- [x] Evaluation method selected (direct scoring / pairwise / hybrid)
- [x] Rubric created with clear level descriptions
- [x] Bias mitigation protocols implemented
- [x] Evaluation prompts tested with sample data
- [x] Confidence calibration verified
- [x] Inter-rater reliability measured (if human baseline available)
Outputs:
- evaluation_rubric.json (scoring criteria and levels)
- evaluation_prompt.md (LLM judge prompt)
- evaluation_results.json (scores, justifications, confidence)
- bias_analysis_report.md (position bias, length bias metrics)
- Correlation with human judgments: X% (if baseline available)
Completion Checklist
Before marking this skill as complete, verify:
- Evaluation criteria clearly defined and measurable
- Rubric includes level descriptions with characteristics
- Prompt requires justification before scores (chain-of-thought)
- Bias mitigation implemented (position swap for pairwise)
- Confidence scores calibrated and validated
- Test evaluations run on sample data
- Edge cases identified and documented in rubric
- Output format structured (JSON/markdown)
- Evaluation results reproducible
- Documentation includes when to use each method
Failure Indicators
This skill has FAILED if:
- ❌ Evaluation criteria vague or unmeasurable
- ❌ LLM judge produces scores without justification
- ❌ Position bias not mitigated in pairwise comparison
- ❌ Confidence scores don't correlate with consistency
- ❌ Rubric missing level boundaries or characteristics
- ❌ Direct scoring used for subjective preferences
- ❌ Pairwise comparison used for objective criteria
- ❌ No edge case guidance in rubric
- ❌ Evaluation results inconsistent across runs
- ❌ Length bias favoring verbose responses unchecked
When NOT to Use
Do NOT use this skill when:
- Deterministic correct answer exists (use exact matching or regex instead)
- Real-time validation required (LLM-as-judge has latency)
- Simple pass/fail validation sufficient (use rule-based checks)
- No clear evaluation criteria can be defined
- Evaluation cost exceeds value of quality improvement
- Human judgment required for legal/ethical reasons
- Evaluation domain requires specialized expertise LLM lacks
- Ground truth available and preferable (use reference-based evaluation)
Use alternatives:
- exact-matching - Deterministic validation
- rule-based-validation - Heuristic checks
- reference-based-evaluation - Compare to gold standard
- human-in-the-loop - Critical decisions requiring human judgment
Anti-Patterns (Avoid)
| Anti-Pattern | Problem | Solution |
|---|---|---|
| Score without justification | Unreliable, no grounding | Always require chain-of-thought reasoning first |
| Single-pass pairwise | Position bias corrupts results | Always swap positions and check consistency |
| Overloaded criteria | Can't reliably measure multiple aspects | One criterion = one measurable aspect |
| Generic rubrics | Low discrimination, inconsistent | Create domain-specific rubrics with examples |
| Ignoring confidence | Confidently wrong verdicts mislead downstream decisions | Calibrate confidence to consistency metrics |
| No edge cases | Inconsistent handling of ambiguous inputs | Document edge cases with specific guidance |
| Wrong scale granularity | 1-10 scale without detailed rubric | Match scale to rubric specificity (use 1-5 for most) |
| Self-enhancement bias | Model rates own outputs higher | Use different models for generation and evaluation |
| Blind to length bias | Longer always rated higher | Explicitly prompt to ignore length, normalize scores |
Principles
This skill embodies CODITECT core principles:
#5 Eliminate Ambiguity
- Clear rubric with level boundaries eliminates subjective interpretation
- Explicit bias mitigation protocols remove systematic errors
- Confidence scores communicate uncertainty
#6 Clear, Understandable, Explainable
- Chain-of-thought justification makes scores explainable
- Rubric characteristics provide concrete criteria
- Decision framework guides method selection
#8 No Assumptions
- Always validate against human judgments when available
- Test for systematic biases (position, length, verbosity)
- Verify consistency across multiple runs
#9 Evidence-Based Decisions
- Evaluation method backed by research (MT-Bench, Zheng et al.)
- Bias mitigation protocols proven to improve reliability
- Confidence calibration grounded in position swap consistency
#10 Research When in Doubt
- LLM-as-judge techniques evolving rapidly
- New bias types discovered regularly
- Correlation with human judgment varies by domain
Full Standard: CODITECT-STANDARD-AUTOMATION.md