
Advanced LLM Evaluation

This skill covers production-grade techniques for evaluating LLM outputs using LLMs as judges. Key insight: LLM-as-a-Judge is not a single technique but a family of approaches, each suited to different evaluation contexts.

When to Use

Use this skill when:

  • Building automated evaluation pipelines for LLM outputs
  • Comparing multiple model responses to select the best one
  • Establishing consistent quality standards across evaluation teams
  • Debugging evaluation systems with inconsistent results
  • Designing A/B tests for prompt or model changes
  • Creating rubrics for human or automated evaluation
  • Analyzing correlation between automated and human judgments

Don't use this skill when:

  • Simple pass/fail validation
  • Real-time validation where latency matters
  • Tasks with deterministic correct answers (use exact matching)

The Evaluation Taxonomy

| Approach | Best For | Reliability | Failure Mode |
|---|---|---|---|
| Direct Scoring | Objective criteria (accuracy, format) | Moderate-High | Calibration drift |
| Pairwise Comparison | Subjective preferences (tone, style) | Higher | Position/length bias |

Research finding (MT-Bench, Zheng et al.): Pairwise comparison achieves higher agreement with human judges than direct scoring for preference-based evaluation.

The Bias Landscape

LLM judges exhibit systematic biases that must be actively mitigated:

| Bias | Description | Mitigation |
|---|---|---|
| Position Bias | First-position preference in pairwise | Swap positions, majority vote |
| Length Bias | Longer responses rated higher | Explicit prompting, normalization |
| Self-Enhancement | Models rate own outputs higher | Different gen/eval models |
| Verbosity Bias | Detail preferred regardless of need | Criteria-specific rubrics |
| Authority Bias | Confident tone rated higher | Require evidence citation |
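One way to audit an existing judge for length bias is to correlate response length with assigned score; a strong positive correlation suggests the judge rewards verbosity. A minimal sketch (the `lengths`/`scores` data and the 0.5 threshold are hypothetical):

```python
def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

# Hypothetical audit data: response lengths (tokens) vs. judge scores
lengths = [120, 340, 80, 510, 220]
scores = [3, 4, 3, 5, 4]
flagged = pearson_r(lengths, scores) > 0.5  # strong correlation → possible length bias
```

This is a coarse screen, not a proof of bias: longer responses may genuinely be better, so flagged cases warrant a controlled check (e.g. padded vs. unpadded variants of the same answer).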

Direct Scoring Implementation

Direct scoring requires: clear criteria, calibrated scale, and structured output.

Scale Selection

| Scale | Use Case | Reliability |
|---|---|---|
| 1-3 | Binary with neutral | Highest (low cognitive load) |
| 1-5 | Standard Likert | Good balance |
| 1-10 | High granularity | Only with detailed rubrics |

Prompt Structure

You are an expert evaluator assessing response quality.

## Task
Evaluate the following response against each criterion.

## Original Prompt
{prompt}

## Response to Evaluate
{response}

## Criteria
{for each criterion: name, description, weight}

## Instructions
For each criterion:
1. Find specific evidence in the response
2. Score according to the rubric (1-5 scale)
3. Justify your score with evidence
4. Suggest one specific improvement

## Output Format
Respond with JSON containing scores, justifications, and summary.

Critical: Always require justification before scores. Chain-of-thought prompting improves reliability by 15-25%.
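A minimal sketch of enforcing that rule on the judge's reply, assuming the judge returns the JSON shape requested in the output format above (the `parse_judgment` helper and the example reply are illustrative, not a fixed schema):

```python
import json

def parse_judgment(raw_json: str) -> dict:
    """Parse a judge's JSON reply, rejecting scores without justification."""
    data = json.loads(raw_json)
    for criterion, result in data["scores"].items():
        if not result.get("justification", "").strip():
            raise ValueError(f"Score for {criterion!r} lacks justification")
        if not 1 <= result["score"] <= 5:
            raise ValueError(f"Score for {criterion!r} outside 1-5 scale")
    return data

# Hypothetical reply shaped like the output format above
raw = '''{"scores": {"accuracy": {"score": 4,
          "justification": "Cites the correct figure in paragraph 2."}},
          "summary": "Mostly accurate."}'''
judgment = parse_judgment(raw)
```

Rejecting unjustified scores at parse time keeps ungrounded verdicts out of downstream aggregation instead of silently averaging them in.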

Pairwise Comparison Implementation

Position Bias Mitigation Protocol

  1. First pass: Response A in first position, Response B in second
  2. Second pass: Response B first, Response A second
  3. Consistency check: If passes disagree → TIE with reduced confidence
  4. Final verdict: Consistent winner with averaged confidence
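The four steps above can be sketched as follows, with a hypothetical `compare(prompt, first, second)` judge call returning `("FIRST" | "SECOND", confidence)`:

```python
def position_swapped_verdict(compare, prompt, resp_a, resp_b):
    """Run a pairwise comparison twice with positions swapped.

    `compare(prompt, first, second)` is a hypothetical judge call
    returning ("FIRST" | "SECOND", confidence in [0, 1]).
    """
    w1, c1 = compare(prompt, resp_a, resp_b)   # pass 1: A first
    w2, c2 = compare(prompt, resp_b, resp_a)   # pass 2: B first
    pick1 = "A" if w1 == "FIRST" else "B"
    pick2 = "B" if w2 == "FIRST" else "A"      # positions were swapped
    if pick1 == pick2:
        return pick1, (c1 + c2) / 2            # consistent winner, averaged confidence
    return "TIE", 0.5                          # disagreement → low-confidence tie

# A judge with pure position bias always picks whatever comes first;
# the swap protocol converts its self-disagreement into a TIE:
biased = lambda p, f, s: ("FIRST", 0.9)
verdict, conf = position_swapped_verdict(biased, "q", "answer-a", "answer-b")
```

A judge whose preference tracks content rather than position picks the same response in both passes and keeps its averaged confidence.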

Prompt Structure

You are an expert evaluator comparing two AI responses.

## Critical Instructions
- Do NOT prefer responses because they are longer
- Do NOT prefer responses based on position (first vs second)
- Focus ONLY on quality according to specified criteria
- Ties are acceptable when responses are genuinely equivalent

## Original Prompt
{prompt}

## Response A
{response_a}

## Response B
{response_b}

## Comparison Criteria
{criteria list}

## Output Format
JSON with per-criterion comparison, overall winner, confidence (0-1), reasoning.

Confidence Calibration

def calculate_final_confidence(pass1_winner, pass1_conf, pass2_winner, pass2_conf):
    if pass1_winner == pass2_winner:
        return (pass1_conf + pass2_conf) / 2, pass1_winner
    else:
        return 0.5, "TIE"  # Disagreement → low-confidence tie

Rubric Generation

Well-defined rubrics reduce evaluation variance by 40-60%.

Rubric Components

  1. Level descriptions: Clear boundaries for each score level
  2. Characteristics: Observable features defining each level
  3. Examples: Representative text (optional but valuable)
  4. Edge cases: Guidance for ambiguous situations
  5. Scoring guidelines: General principles for consistency

Example: Code Readability Rubric

{
  "levels": [
    {
      "score": 1,
      "label": "Poor",
      "description": "Code is difficult to understand without significant effort",
      "characteristics": [
        "No meaningful variable or function names",
        "No comments or documentation",
        "Deeply nested or convoluted logic"
      ]
    },
    {
      "score": 3,
      "label": "Adequate",
      "description": "Code is understandable with some effort",
      "characteristics": [
        "Most variables have meaningful names",
        "Basic comments for complex sections",
        "Logic is followable but could be cleaner"
      ]
    },
    {
      "score": 5,
      "label": "Excellent",
      "description": "Code is immediately clear and maintainable",
      "characteristics": [
        "All names are descriptive and consistent",
        "Comprehensive documentation",
        "Clean, modular structure"
      ]
    }
  ],
  "edgeCases": [
    {
      "situation": "Domain-specific abbreviations used",
      "guidance": "Score based on readability for domain experts"
    }
  ]
}

Decision Framework: Direct vs. Pairwise

Is there an objective ground truth?
├── Yes → Direct Scoring
│         └── Examples: factual accuracy, format compliance
└── No → Is it a preference or quality judgment?
    ├── Yes → Pairwise Comparison
    │         └── Examples: tone, style, creativity
    └── No → Reference-based evaluation
              └── Examples: summarization, translation
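The decision tree can be encoded as a small helper (method names are illustrative):

```python
def select_method(has_ground_truth: bool, is_preference: bool) -> str:
    """Map the decision framework above to an evaluation method."""
    if has_ground_truth:
        return "direct_scoring"       # objective: factual accuracy, format compliance
    if is_preference:
        return "pairwise_comparison"  # subjective: tone, style, creativity
    return "reference_based"          # e.g. summarization, translation
```

Encoding the choice as a function makes the routing decision auditable per evaluation item rather than implicit in pipeline configuration.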

Scaling Evaluation

Panel of LLMs (PoLL)

Use multiple models as judges, aggregate votes:

  • Reduces individual model bias
  • More expensive but more reliable for high-stakes decisions
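Vote aggregation across a panel can be sketched as a majority vote with a tie fallback (the `min_margin` knob is an assumption, not a fixed PoLL parameter):

```python
from collections import Counter

def poll_verdict(verdicts, min_margin=1):
    """Aggregate panel-of-judges verdicts by majority vote.

    Returns (winner, agreement_ratio); a winning margin below
    `min_margin` (e.g. a split panel) falls back to a TIE.
    """
    counts = Counter(verdicts)
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < min_margin:
        return "TIE", ranked[0][1] / len(verdicts)
    return ranked[0][0], ranked[0][1] / len(verdicts)

winner, agreement = poll_verdict(["A", "A", "B"])  # majority for A, 2/3 agreement
```

The agreement ratio doubles as a cheap confidence signal: unanimous panels can be accepted automatically while split panels are escalated.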

Hierarchical Evaluation

Fast, cheap model → Screening
        ↓  low confidence?
Expensive model → Final verdict
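The cascade can be sketched with hypothetical `cheap_judge` and `strong_judge` callables, each returning `(score, confidence)`:

```python
def cascade_evaluate(cheap_judge, strong_judge, item, threshold=0.7):
    """Screen with a cheap judge; escalate low-confidence cases."""
    score, conf = cheap_judge(item)
    if conf >= threshold:
        return score, conf, "cheap"
    score, conf = strong_judge(item)  # escalate only when uncertain
    return score, conf, "strong"

# Hypothetical judges: the cheap one is only confident on short items
cheap = lambda item: (4, 0.9) if len(item) < 50 else (3, 0.4)
strong = lambda item: (5, 0.95)
kept = cascade_evaluate(cheap, strong, "short response")  # stays on cheap path
escalated = cascade_evaluate(cheap, strong, "x" * 100)    # escalates to strong
```

Because most items clear the threshold, the expensive model is only paid for on the hard tail, which is where its extra reliability matters.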

Human-in-the-Loop

  • Automated evaluation for clear cases
  • Human review for low-confidence results
  • Design feedback loop to improve automated evaluation

Common Anti-Patterns

| Anti-Pattern | Problem | Solution |
|---|---|---|
| Score without justification | Scores lack grounding | Require evidence first |
| Single-pass pairwise | Position bias corruption | Always swap positions |
| Overloaded criteria | Unreliable measurement | One criterion = one aspect |
| Missing edge cases | Inconsistent handling | Include guidance in rubrics |
| Ignoring confidence | Confidently wrong verdicts are worse | Calibrate to consistency |

Guidelines

  1. Always require justification before scores - 15-25% reliability improvement
  2. Always swap positions in pairwise comparison - Position bias mitigation
  3. Match scale granularity to rubric specificity - Don't use 1-10 without details
  4. Separate objective and subjective criteria - Use appropriate method
  5. Include confidence scores - Calibrate to position consistency
  6. Define edge cases explicitly - Reduce evaluation variance
  7. Use domain-specific rubrics - Generic = less useful
  8. Validate against human judgments - Ground truth correlation
  9. Monitor for systematic bias - Track disagreement patterns
  10. Design for iteration - Feedback loops improve systems

Evaluation Criteria Template

Use this template when defining new evaluation criteria:

# evaluation-criteria-template.yaml
criteria:
  - name: "{Criterion Name}"
    id: "EVAL-{ID}"
    type: "{objective|subjective}"  # Determines method selection
    description: "{What this criterion measures}"

    # Scoring rubric (1-5 scale recommended)
    rubric:
      1:
        label: "Poor"
        description: "{Clear definition of score 1}"
        characteristics:
          - "{Observable characteristic 1}"
          - "{Observable characteristic 2}"
      3:
        label: "Adequate"
        description: "{Clear definition of score 3}"
        characteristics:
          - "{Observable characteristic 1}"
          - "{Observable characteristic 2}"
      5:
        label: "Excellent"
        description: "{Clear definition of score 5}"
        characteristics:
          - "{Observable characteristic 1}"
          - "{Observable characteristic 2}"

    # Edge cases with specific guidance
    edge_cases:
      - situation: "{Ambiguous situation description}"
        guidance: "{How to score in this case}"

    # Evaluation method configuration
    method:
      primary: "{direct_scoring|pairwise_comparison}"
      bias_mitigations:
        - "{position_swap|length_normalization|etc}"
      confidence_threshold: 0.7  # Below this, escalate to human

    # Weight in overall score (optional)
    weight: 1.0

Example instantiation:

criteria:
  - name: "Code Correctness"
    id: "EVAL-001"
    type: "objective"
    description: "Whether the code produces correct output for given inputs"

    rubric:
      1:
        label: "Incorrect"
        description: "Code fails for most or all test cases"
        characteristics:
          - "Runtime errors on execution"
          - "Wrong output for standard inputs"
      3:
        label: "Partially Correct"
        description: "Code works for common cases but fails edge cases"
        characteristics:
          - "Passes basic test cases"
          - "Fails on boundary conditions"
      5:
        label: "Fully Correct"
        description: "Code handles all cases including edge cases"
        characteristics:
          - "Passes all test cases"
          - "Handles empty/null inputs gracefully"

    edge_cases:
      - situation: "Code is correct but uses deprecated API"
        guidance: "Score correctness separately; note deprecation in feedback"

    method:
      primary: "direct_scoring"
      bias_mitigations:
        - "require_justification_first"
      confidence_threshold: 0.8

    weight: 2.0  # Weighted higher than style criteria
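Given per-criterion scores and the `weight` fields from the template, the overall score is a weight-normalized average. A minimal sketch (the `EVAL-002` style criterion is hypothetical):

```python
def weighted_overall(results):
    """Weight-normalized average of per-criterion scores.

    `results` maps criterion id → {"score": 1-5, "weight": float}.
    """
    total_weight = sum(r["weight"] for r in results.values())
    return sum(r["score"] * r["weight"] for r in results.values()) / total_weight

overall = weighted_overall({
    "EVAL-001": {"score": 5, "weight": 2.0},  # correctness, weighted higher
    "EVAL-002": {"score": 3, "weight": 1.0},  # hypothetical style criterion
})
# (5*2 + 3*1) / (2 + 1) = 13/3 ≈ 4.33
```

Normalizing by total weight keeps the overall score on the same 1-5 scale as the individual criteria regardless of how weights are chosen.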

Skills

  • context-fundamentals - Context structure for evaluation prompts
  • tool-design - Building evaluation tools
  • evaluation-framework - Foundational evaluation concepts

Agents

  • llm-judge - LLM-as-judge evaluation agent
  • compression-evaluator - Probe-based compression evaluation

Commands

  • /evaluate-response - Run LLM-as-judge evaluation

Scripts

  • external/Agent-Skills-for-Context-Engineering/skills/advanced-evaluation/scripts/evaluation_example.py - LLM-as-judge implementation patterns

Success Output

When this skill is successfully applied, output:

✅ SKILL COMPLETE: advanced-evaluation

Completed:
- [x] Evaluation method selected (direct scoring / pairwise / hybrid)
- [x] Rubric created with clear level descriptions
- [x] Bias mitigation protocols implemented
- [x] Evaluation prompts tested with sample data
- [x] Confidence calibration verified
- [x] Inter-rater reliability measured (if human baseline available)

Outputs:
- evaluation_rubric.json (scoring criteria and levels)
- evaluation_prompt.md (LLM judge prompt)
- evaluation_results.json (scores, justifications, confidence)
- bias_analysis_report.md (position bias, length bias metrics)
- Correlation with human judgments: X% (if baseline available)

Completion Checklist

Before marking this skill as complete, verify:

  • Evaluation criteria clearly defined and measurable
  • Rubric includes level descriptions with characteristics
  • Prompt requires justification before scores (chain-of-thought)
  • Bias mitigation implemented (position swap for pairwise)
  • Confidence scores calibrated and validated
  • Test evaluations run on sample data
  • Edge cases identified and documented in rubric
  • Output format structured (JSON/markdown)
  • Evaluation results reproducible
  • Documentation includes when to use each method

Failure Indicators

This skill has FAILED if:

  • ❌ Evaluation criteria vague or unmeasurable
  • ❌ LLM judge produces scores without justification
  • ❌ Position bias not mitigated in pairwise comparison
  • ❌ Confidence scores don't correlate with consistency
  • ❌ Rubric missing level boundaries or characteristics
  • ❌ Direct scoring used for subjective preferences
  • ❌ Pairwise comparison used for objective criteria
  • ❌ No edge case guidance in rubric
  • ❌ Evaluation results inconsistent across runs
  • ❌ Length bias favoring verbose responses unchecked

When NOT to Use

Do NOT use this skill when:

  • Deterministic correct answer exists (use exact matching or regex instead)
  • Real-time validation required (LLM-as-judge has latency)
  • Simple pass/fail validation sufficient (use rule-based checks)
  • No clear evaluation criteria can be defined
  • Evaluation cost exceeds value of quality improvement
  • Human judgment required for legal/ethical reasons
  • Evaluation domain requires specialized expertise LLM lacks
  • Ground truth available and preferable (use reference-based evaluation)

Use alternatives:

  • exact-matching - Deterministic validation
  • rule-based-validation - Heuristic checks
  • reference-based-evaluation - Compare to gold standard
  • human-in-the-loop - Critical decisions requiring human judgment

Anti-Patterns (Avoid)

| Anti-Pattern | Problem | Solution |
|---|---|---|
| Score without justification | Unreliable, no grounding | Always require chain-of-thought reasoning first |
| Single-pass pairwise | Position bias corrupts results | Always swap positions and check consistency |
| Overloaded criteria | Can't reliably measure multiple aspects | One criterion = one measurable aspect |
| Generic rubrics | Low discrimination, inconsistent | Create domain-specific rubrics with examples |
| Ignoring confidence | Confidently wrong verdicts are worse | Calibrate confidence to consistency metrics |
| No edge cases | Inconsistent handling of ambiguous inputs | Document edge cases with specific guidance |
| Wrong scale granularity | 1-10 scale without detailed rubric | Match scale to rubric specificity (use 1-5 for most) |
| Self-enhancement bias | Model rates own outputs higher | Use different models for generation and evaluation |
| Blind to length bias | Longer responses always rated higher | Explicitly prompt to ignore length, normalize scores |

Principles

This skill embodies CODITECT core principles:

#5 Eliminate Ambiguity

  • Clear rubric with level boundaries eliminates subjective interpretation
  • Explicit bias mitigation protocols remove systematic errors
  • Confidence scores communicate uncertainty

#6 Clear, Understandable, Explainable

  • Chain-of-thought justification makes scores explainable
  • Rubric characteristics provide concrete criteria
  • Decision framework guides method selection

#8 No Assumptions

  • Always validate against human judgments when available
  • Test for systematic biases (position, length, verbosity)
  • Verify consistency across multiple runs

#9 Evidence-Based Decisions

  • Evaluation method backed by research (MT-Bench, Zheng et al.)
  • Bias mitigation protocols proven to improve reliability
  • Confidence calibration grounded in position swap consistency

#10 Research When in Doubt

  • LLM-as-judge techniques evolving rapidly
  • New bias types discovered regularly
  • Correlation with human judgment varies by domain

Full Standard: CODITECT-STANDARD-AUTOMATION.md