Advanced LLM Evaluation
This skill covers production-grade techniques for evaluating LLM outputs using LLMs as judges. Key insight: LLM-as-a-Judge is not a single technique but a family of approaches, each suited to different evaluation contexts.
When to Use
✅ Use this skill when:
- Building automated evaluation pipelines for LLM outputs
- Comparing multiple model responses to select the best one
- Establishing consistent quality standards across evaluation teams
- Debugging evaluation systems with inconsistent results
- Designing A/B tests for prompt or model changes
- Creating rubrics for human or automated evaluation
- Analyzing correlation between automated and human judgments
❌ Don't use this skill when:
- Simple pass/fail validation
- Real-time validation where latency matters
- Tasks with deterministic correct answers (use exact matching)
The Evaluation Taxonomy
| Approach | Best For | Reliability | Failure Mode |
|---|---|---|---|
| Direct Scoring | Objective criteria (accuracy, format) | Moderate-High | Calibration drift |
| Pairwise Comparison | Subjective preferences (tone, style) | Higher | Position/length bias |
Research finding (MT-Bench, Zheng et al., 2023): Pairwise comparison achieves higher agreement with human judges than direct scoring for preference-based evaluation.
The Bias Landscape
LLM judges exhibit systematic biases that must be actively mitigated:
| Bias | Description | Mitigation |
|---|---|---|
| Position Bias | First-position preference in pairwise | Swap positions, majority vote |
| Length Bias | Longer responses rated higher | Explicit prompting, normalization |
| Self-Enhancement | Models rate own outputs higher | Different gen/eval models |
| Verbosity Bias | Detail preferred regardless of need | Criteria-specific rubrics |
| Authority Bias | Confident tone rated higher | Require evidence citation |
Direct Scoring Implementation
Direct scoring requires: clear criteria, calibrated scale, and structured output.
Scale Selection
| Scale | Use Case | Reliability |
|---|---|---|
| 1-3 | Binary with neutral | Highest (low cognitive load) |
| 1-5 | Standard Likert | Good balance |
| 1-10 | High granularity | Only with detailed rubrics |
Prompt Structure
```
You are an expert evaluator assessing response quality.

## Task
Evaluate the following response against each criterion.

## Original Prompt
{prompt}

## Response to Evaluate
{response}

## Criteria
{for each criterion: name, description, weight}

## Instructions
For each criterion:
1. Find specific evidence in the response
2. Score according to the rubric (1-5 scale)
3. Justify your score with evidence
4. Suggest one specific improvement

## Output Format
Respond with JSON containing scores, justifications, and summary.
```
Critical: Always require justification before scores. Chain-of-thought prompting improves reliability by 15-25%.
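The prompt structure and parsing step can be sketched in a few lines. This is a minimal, illustrative sketch: `SCORING_PROMPT` condenses the template above, and the rejection of justification-free replies enforces the "justification before scores" rule; names and the criteria tuple shape are assumptions, not a fixed API.

```python
import json

# Condensed form of the direct-scoring prompt template above (illustrative).
SCORING_PROMPT = """You are an expert evaluator assessing response quality.

## Original Prompt
{prompt}

## Response to Evaluate
{response}

## Criteria
{criteria}

## Instructions
For each criterion: cite specific evidence, score 1-5 per the rubric, justify with evidence.

## Output Format
Respond with JSON: {{"scores": {{...}}, "justifications": {{...}}, "summary": "..."}}"""


def build_scoring_prompt(prompt, response, criteria):
    """Render the direct-scoring prompt; criteria is a list of (name, description, weight)."""
    criteria_text = "\n".join(
        f"- {name} (weight {weight}): {desc}" for name, desc, weight in criteria
    )
    return SCORING_PROMPT.format(prompt=prompt, response=response, criteria=criteria_text)


def parse_judge_output(raw):
    """Parse the judge's JSON reply; reject replies that give scores without justifications."""
    result = json.loads(raw)
    if result.get("scores") and not result.get("justifications"):
        raise ValueError("Judge returned scores without justifications")
    return result
```

Rejecting score-only replies at parse time makes the chain-of-thought requirement enforceable rather than merely suggested.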
Pairwise Comparison Implementation
Position Bias Mitigation Protocol
- First pass: Response A in first position, Response B in second
- Second pass: Response B first, Response A second
- Consistency check: If passes disagree → TIE with reduced confidence
- Final verdict: Consistent winner with averaged confidence
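The two-pass protocol above can be sketched as a driver around any judge callable. This is a sketch under the assumption that `judge(prompt, first, second)` returns a positional verdict (`"FIRST"`, `"SECOND"`, or `"TIE"`) and a confidence; the function names are illustrative.

```python
def pairwise_with_position_swap(judge, prompt, response_a, response_b):
    """Run both passes of the position-swap protocol and reconcile the verdicts.

    judge(prompt, first, second) -> (winner, confidence), where winner is
    "FIRST", "SECOND", or "TIE" relative to presentation order (assumed interface).
    """
    # Pass 1: A in first position. Map the positional verdict back to a label.
    w1, c1 = judge(prompt, response_a, response_b)
    label1 = {"FIRST": "A", "SECOND": "B"}.get(w1, "TIE")
    # Pass 2: positions swapped, B first.
    w2, c2 = judge(prompt, response_b, response_a)
    label2 = {"FIRST": "B", "SECOND": "A"}.get(w2, "TIE")
    if label1 == label2:
        return label1, (c1 + c2) / 2   # Consistent verdict, averaged confidence
    return "TIE", 0.5                  # Disagreement -> low-confidence tie
```

A purely position-biased judge (one that always prefers whichever response comes first) disagrees with itself across the two passes and collapses to a low-confidence tie, which is exactly the intended behavior.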
Prompt Structure
```
You are an expert evaluator comparing two AI responses.

## Critical Instructions
- Do NOT prefer responses because they are longer
- Do NOT prefer responses based on position (first vs second)
- Focus ONLY on quality according to specified criteria
- Ties are acceptable when responses are genuinely equivalent

## Original Prompt
{prompt}

## Response A
{response_a}

## Response B
{response_b}

## Comparison Criteria
{criteria list}

## Output Format
JSON with per-criterion comparison, overall winner, confidence (0-1), reasoning.
```
Confidence Calibration
```python
def calculate_final_confidence(pass1_winner, pass1_conf, pass2_winner, pass2_conf):
    if pass1_winner == pass2_winner:
        return (pass1_conf + pass2_conf) / 2, pass1_winner
    else:
        return 0.5, "TIE"  # Disagreement -> low-confidence tie
```
Rubric Generation
Well-defined rubrics reduce evaluation variance by 40-60%.
Rubric Components
- Level descriptions: Clear boundaries for each score level
- Characteristics: Observable features defining each level
- Examples: Representative text (optional but valuable)
- Edge cases: Guidance for ambiguous situations
- Scoring guidelines: General principles for consistency
Example: Code Readability Rubric
```json
{
  "levels": [
    {
      "score": 1,
      "label": "Poor",
      "description": "Code is difficult to understand without significant effort",
      "characteristics": [
        "No meaningful variable or function names",
        "No comments or documentation",
        "Deeply nested or convoluted logic"
      ]
    },
    {
      "score": 3,
      "label": "Adequate",
      "description": "Code is understandable with some effort",
      "characteristics": [
        "Most variables have meaningful names",
        "Basic comments for complex sections",
        "Logic is followable but could be cleaner"
      ]
    },
    {
      "score": 5,
      "label": "Excellent",
      "description": "Code is immediately clear and maintainable",
      "characteristics": [
        "All names are descriptive and consistent",
        "Comprehensive documentation",
        "Clean, modular structure"
      ]
    }
  ],
  "edgeCases": [
    {
      "situation": "Domain-specific abbreviations used",
      "guidance": "Score based on readability for domain experts"
    }
  ]
}
```
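A rubric stored this way can be flattened into prompt-ready text before evaluation. The sketch below assumes the JSON shape shown above (`levels` with `score`/`label`/`description`/`characteristics`, optional `edgeCases`); the function name is illustrative.

```python
def render_rubric(rubric):
    """Flatten a rubric dict (levels plus edge cases) into prompt-ready text."""
    lines = []
    for level in rubric["levels"]:
        lines.append(f'Score {level["score"]} ({level["label"]}): {level["description"]}')
        for characteristic in level["characteristics"]:
            lines.append(f"  - {characteristic}")
    for case in rubric.get("edgeCases", []):
        lines.append(f'Edge case: {case["situation"]} -> {case["guidance"]}')
    return "\n".join(lines)
```

Rendering the rubric from a single source of truth keeps the judge prompt and any human-facing documentation in sync.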
Decision Framework: Direct vs. Pairwise
```
Is there an objective ground truth?
├── Yes → Direct Scoring
│   └── Examples: factual accuracy, format compliance
│
└── No → Is it a preference or quality judgment?
    ├── Yes → Pairwise Comparison
    │   └── Examples: tone, style, creativity
    │
    └── No → Reference-based evaluation
        └── Examples: summarization, translation
```
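The decision tree above reduces to a two-question selector. A minimal sketch (function and return names are illustrative):

```python
def select_method(has_ground_truth, is_preference_judgment):
    """Map the decision framework onto an evaluation method name."""
    if has_ground_truth:
        return "direct_scoring"       # e.g. factual accuracy, format compliance
    if is_preference_judgment:
        return "pairwise_comparison"  # e.g. tone, style, creativity
    return "reference_based"         # e.g. summarization, translation
```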
Scaling Evaluation
Panel of LLMs (PoLL)
Use multiple models as judges, aggregate votes:
- Reduces individual model bias
- More expensive but more reliable for high-stakes decisions
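Vote aggregation for a panel can be as simple as a strict majority with a tie fallback; more elaborate schemes (weighted votes, confidence-weighted averaging) are possible. A minimal sketch, assuming each judge emits "A", "B", or "TIE":

```python
from collections import Counter

def poll_verdict(votes):
    """Aggregate per-judge verdicts by strict majority; otherwise fall back to TIE."""
    counts = Counter(votes)
    winner, top = counts.most_common(1)[0]
    if top > len(votes) / 2:
        return winner
    return "TIE"
```

Requiring a strict majority (rather than a plurality) means a split panel surfaces as a tie, which can then be escalated rather than silently resolved.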
Hierarchical Evaluation
```
Fast cheap model → Screening
       │
       ▼
 Low confidence?
       │
       ▼
Expensive model → Final verdict
```
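The escalation step can be sketched as a small cascade. This assumes each judge callable returns `(verdict, confidence)`; the threshold and function names are illustrative.

```python
def hierarchical_evaluate(cheap_judge, expensive_judge, item, threshold=0.7):
    """Screen with the cheap judge; escalate low-confidence items to the expensive one.

    Returns (verdict, confidence, which_judge_decided).
    """
    verdict, confidence = cheap_judge(item)
    if confidence >= threshold:
        return verdict, confidence, "cheap"
    # Below threshold: pay for the stronger judge.
    return (*expensive_judge(item), "expensive")
```

In practice the threshold should be tuned against measured agreement between the two judges, so that escalation happens exactly where the cheap judge is unreliable.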
Human-in-the-Loop
- Automated evaluation for clear cases
- Human review for low-confidence results
- Design feedback loop to improve automated evaluation
Common Anti-Patterns
| Anti-Pattern | Problem | Solution |
|---|---|---|
| Score without justification | Scores lack grounding | Require evidence first |
| Single-pass pairwise | Position bias corruption | Always swap positions |
| Overloaded criteria | Unreliable measurement | One criterion = one aspect |
| Missing edge cases | Inconsistent handling | Include guidance in rubrics |
| Ignoring confidence | Confidently wrong verdicts are worse than uncertain ones | Calibrate to consistency |
Guidelines
- Always require justification before scores - 15-25% reliability improvement
- Always swap positions in pairwise comparison - Position bias mitigation
- Match scale granularity to rubric specificity - Don't use 1-10 without details
- Separate objective and subjective criteria - Use appropriate method
- Include confidence scores - Calibrate to position consistency
- Define edge cases explicitly - Reduce evaluation variance
- Use domain-specific rubrics - Generic = less useful
- Validate against human judgments - Ground truth correlation
- Monitor for systematic bias - Track disagreement patterns
- Design for iteration - Feedback loops improve systems
Evaluation Criteria Template
Use this template when defining new evaluation criteria:
```yaml
# evaluation-criteria-template.yaml
criteria:
  - name: "{Criterion Name}"
    id: "EVAL-{ID}"
    type: "{objective|subjective}"  # Determines method selection
    description: "{What this criterion measures}"

    # Scoring rubric (1-5 scale recommended)
    rubric:
      1:
        label: "Poor"
        description: "{Clear definition of score 1}"
        characteristics:
          - "{Observable characteristic 1}"
          - "{Observable characteristic 2}"
      3:
        label: "Adequate"
        description: "{Clear definition of score 3}"
        characteristics:
          - "{Observable characteristic 1}"
          - "{Observable characteristic 2}"
      5:
        label: "Excellent"
        description: "{Clear definition of score 5}"
        characteristics:
          - "{Observable characteristic 1}"
          - "{Observable characteristic 2}"

    # Edge cases with specific guidance
    edge_cases:
      - situation: "{Ambiguous situation description}"
        guidance: "{How to score in this case}"

    # Evaluation method configuration
    method:
      primary: "{direct_scoring|pairwise_comparison}"
      bias_mitigations:
        - "{position_swap|length_normalization|etc}"
      confidence_threshold: 0.7  # Below this, escalate to human

    # Weight in overall score (optional)
    weight: 1.0
```
Example instantiation:
```yaml
criteria:
  - name: "Code Correctness"
    id: "EVAL-001"
    type: "objective"
    description: "Whether the code produces correct output for given inputs"
    rubric:
      1:
        label: "Incorrect"
        description: "Code fails for most or all test cases"
        characteristics:
          - "Runtime errors on execution"
          - "Wrong output for standard inputs"
      3:
        label: "Partially Correct"
        description: "Code works for common cases but fails edge cases"
        characteristics:
          - "Passes basic test cases"
          - "Fails on boundary conditions"
      5:
        label: "Fully Correct"
        description: "Code handles all cases including edge cases"
        characteristics:
          - "Passes all test cases"
          - "Handles empty/null inputs gracefully"
    edge_cases:
      - situation: "Code is correct but uses deprecated API"
        guidance: "Score correctness separately; note deprecation in feedback"
    method:
      primary: "direct_scoring"
      bias_mitigations:
        - "require_justification_first"
      confidence_threshold: 0.8
    weight: 2.0  # Weighted higher than style criteria
```
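Once a criterion file is parsed into a dict, a structural check catches malformed definitions before they silently degrade evaluation quality. The field names below follow the template; the specific validation rules are illustrative assumptions, not part of the spec.

```python
# Required top-level fields per the criteria template (assumed minimal set).
REQUIRED_KEYS = {"name", "id", "type", "description", "rubric", "method"}

def validate_criterion(criterion):
    """Check a parsed criterion dict against the template's required structure.

    Returns a list of problem strings; an empty list means the criterion is valid.
    """
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS - criterion.keys()]
    if criterion.get("type") not in ("objective", "subjective"):
        problems.append("type must be 'objective' or 'subjective'")
    for score, level in criterion.get("rubric", {}).items():
        if "description" not in level:
            problems.append(f"rubric level {score} missing description")
    return problems
```

Running this check at load time turns a vague "judge behaves oddly" debugging session into an immediate, named configuration error.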
Related Components
Skills
- context-fundamentals - Context structure for evaluation prompts
- tool-design - Building evaluation tools
- evaluation-framework - Foundational evaluation concepts
Agents
- llm-judge - LLM-as-judge evaluation agent
- compression-evaluator - Probe-based compression evaluation
Commands
- /evaluate-response - Run LLM-as-judge evaluation
Scripts
- external/Agent-Skills-for-Context-Engineering/skills/advanced-evaluation/scripts/evaluation_example.py - LLM-as-judge implementation patterns
Success Output
When this skill is successfully applied, output:
✅ SKILL COMPLETE: advanced-evaluation
Completed:
- [x] Evaluation method selected (direct scoring / pairwise / hybrid)
- [x] Rubric created with clear level descriptions
- [x] Bias mitigation protocols implemented
- [x] Evaluation prompts tested with sample data
- [x] Confidence calibration verified
- [x] Inter-rater reliability measured (if human baseline available)
Outputs:
- evaluation_rubric.json (scoring criteria and levels)
- evaluation_prompt.md (LLM judge prompt)
- evaluation_results.json (scores, justifications, confidence)
- bias_analysis_report.md (position bias, length bias metrics)
- Correlation with human judgments: X% (if baseline available)
Completion Checklist
Before marking this skill as complete, verify:
- Evaluation criteria clearly defined and measurable
- Rubric includes level descriptions with characteristics
- Prompt requires justification before scores (chain-of-thought)
- Bias mitigation implemented (position swap for pairwise)
- Confidence scores calibrated and validated
- Test evaluations run on sample data
- Edge cases identified and documented in rubric
- Output format structured (JSON/markdown)
- Evaluation results reproducible
- Documentation includes when to use each method
Failure Indicators
This skill has FAILED if:
- ❌ Evaluation criteria vague or unmeasurable
- ❌ LLM judge produces scores without justification
- ❌ Position bias not mitigated in pairwise comparison
- ❌ Confidence scores don't correlate with consistency
- ❌ Rubric missing level boundaries or characteristics
- ❌ Direct scoring used for subjective preferences
- ❌ Pairwise comparison used for objective criteria
- ❌ No edge case guidance in rubric
- ❌ Evaluation results inconsistent across runs
- ❌ Length bias favoring verbose responses unchecked
When NOT to Use
Do NOT use this skill when:
- Deterministic correct answer exists (use exact matching or regex instead)
- Real-time validation required (LLM-as-judge has latency)
- Simple pass/fail validation sufficient (use rule-based checks)
- No clear evaluation criteria can be defined
- Evaluation cost exceeds value of quality improvement
- Human judgment required for legal/ethical reasons
- Evaluation domain requires specialized expertise LLM lacks
- Ground truth available and preferable (use reference-based evaluation)
Use alternatives:
- exact-matching - Deterministic validation
- rule-based-validation - Heuristic checks
- reference-based-evaluation - Compare to gold standard
- human-in-the-loop - Critical decisions requiring human judgment
Anti-Patterns (Avoid)
| Anti-Pattern | Problem | Solution |
|---|---|---|
| Score without justification | Unreliable, no grounding | Always require chain-of-thought reasoning first |
| Single-pass pairwise | Position bias corrupts results | Always swap positions and check consistency |
| Overloaded criteria | Can't reliably measure multiple aspects | One criterion = one measurable aspect |
| Generic rubrics | Low discrimination, inconsistent | Create domain-specific rubrics with examples |
| Ignoring confidence | Confidently wrong verdicts mislead downstream decisions | Calibrate confidence to consistency metrics |
| No edge cases | Inconsistent handling of ambiguous inputs | Document edge cases with specific guidance |
| Wrong scale granularity | 1-10 scale without detailed rubric | Match scale to rubric specificity (use 1-5 for most) |
| Self-enhancement bias | Model rates own outputs higher | Use different models for generation and evaluation |
| Blind to length bias | Longer always rated higher | Explicitly prompt to ignore length, normalize scores |
Principles
This skill embodies CODITECT core principles:
#5 Eliminate Ambiguity
- Clear rubric with level boundaries eliminates subjective interpretation
- Explicit bias mitigation protocols remove systematic errors
- Confidence scores communicate uncertainty
#6 Clear, Understandable, Explainable
- Chain-of-thought justification makes scores explainable
- Rubric characteristics provide concrete criteria
- Decision framework guides method selection
#8 No Assumptions
- Always validate against human judgments when available
- Test for systematic biases (position, length, verbosity)
- Verify consistency across multiple runs
#9 Evidence-Based Decisions
- Evaluation method backed by research (MT-Bench, Zheng et al.)
- Bias mitigation protocols proven to improve reliability
- Confidence calibration grounded in position swap consistency
#10 Research When in Doubt
- LLM-as-judge techniques evolving rapidly
- New bias types discovered regularly
- Correlation with human judgment varies by domain
Full Standard: CODITECT-STANDARD-AUTOMATION.md