Evaluation Framework

How to Use This Skill

  1. Review the patterns and examples below
  2. Apply the relevant patterns to your implementation
  3. Follow the best practices outlined in this skill

Expert skill for creating evaluation rubrics, implementing LLM-as-judge patterns, and assessing quality.

When to Use

Use this skill when:

  • Creating evaluation rubrics for code/outputs (consistent scoring)
  • Implementing LLM-as-judge patterns (automated review)
  • Quality assessment frameworks (standardized criteria)
  • Automated code review systems (scalable evaluation)
  • Output validation and scoring (objective assessment)
  • Creating grading criteria (5-level scoring guides)
  • Repeated evaluations where time savings matter (75% faster: 20 → 5 min per evaluation)

Don't use this skill when:

  • Simple pass/fail checks (use basic validation instead)
  • Subjective aesthetic judgments (rubrics work best for objective criteria)
  • Real-time interactive reviews (LLM-as-judge is async)
  • Single evaluation (not worth rubric setup overhead)

LLM-as-Judge Pattern

Core Concept

Use an LLM to evaluate outputs based on defined criteria, providing:

  • Structured scoring
  • Consistent evaluation
  • Detailed feedback
  • Comparative analysis

Evaluation Template

from dataclasses import dataclass
from typing import Dict, List
from enum import Enum

class ScoreLevel(Enum):
    """Standardized score levels"""
    EXCELLENT = 5  # Exceeds all criteria
    GOOD = 4       # Meets all criteria well
    ADEQUATE = 3   # Meets minimum criteria
    POOR = 2       # Below minimum criteria
    FAILING = 1    # Does not meet criteria

@dataclass
class EvaluationCriterion:
    """Single evaluation criterion"""
    name: str
    description: str
    weight: float  # 0.0 - 1.0
    scoring_guide: Dict[ScoreLevel, str]

@dataclass
class EvaluationResult:
    """Result of evaluating one criterion"""
    criterion: str
    score: ScoreLevel
    justification: str
    examples: List[str]
    improvement_suggestions: List[str]

@dataclass
class OverallEvaluation:
    """Complete evaluation"""
    individual_scores: List[EvaluationResult]
    weighted_average: float
    overall_assessment: str
    strengths: List[str]
    weaknesses: List[str]
    actionable_feedback: List[str]

LLM-as-Judge Prompt Template

# Evaluation Task

Evaluate the following OUTPUT against the specified CRITERIA.

## Output to Evaluate

{output_text}


## Evaluation Criteria

{criterion_1_name} (Weight: {weight}%)
- Excellent (5): {excellent_description}
- Good (4): {good_description}
- Adequate (3): {adequate_description}
- Poor (2): {poor_description}
- Failing (1): {failing_description}

{criterion_2_name} (Weight: {weight}%)
[...repeat for all criteria...]

## Required Response Format

For EACH criterion, provide:

1. **Score**: {1-5}
2. **Justification**: {detailed explanation referencing specific parts of output}
3. **Evidence**: {quote specific examples from output}
4. **Improvement Suggestions**: {actionable recommendations}

## Final Summary

- **Weighted Average Score**: {calculated from individual scores}
- **Overall Assessment**: {holistic evaluation}
- **Top 3 Strengths**: {bullet list}
- **Top 3 Weaknesses**: {bullet list}
- **Priority Improvements**: {ranked list of most impactful changes}

## Important
- Be objective and evidence-based
- Quote specific examples
- Provide actionable feedback
- Consider context and constraints
- Be consistent across evaluations
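The Weighted Average Score requested in the Final Summary is the sum of each criterion's score times its weight. A minimal sketch of that calculation (the function name and dict-based signature are illustrative, not part of the skill's core templates):

```python
def weighted_average(scores: dict, weights: dict) -> float:
    """Combine per-criterion scores (1-5) using criterion weights.

    Weights are fractions that should sum to 1.0, so the result stays
    on the same 1-5 scale as the individual scores.
    """
    total_weight = sum(weights.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError(f"weights must sum to 1.0, got {total_weight}")
    return sum(scores[name] * weights[name] for name in scores)

# Example: correctness scores 5 at 30% weight, structure scores 3 at 70%:
# 5 * 0.3 + 3 * 0.7 = 3.6
```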

Code Quality Rubric

Criteria

1. Correctness (Weight: 30%)

  • Excellent (5): Handles all cases including edge cases, no bugs
  • Good (4): Handles main cases correctly, minor edge case issues
  • Adequate (3): Core functionality works, some edge case bugs
  • Poor (2): Core functionality has bugs
  • Failing (1): Does not work as intended

2. Code Structure (Weight: 20%)

  • Excellent (5): Well-organized, clear separation of concerns, DRY
  • Good (4): Organized, minor repetition
  • Adequate (3): Functional organization, some repetition
  • Poor (2): Poor organization, significant repetition
  • Failing (1): Unstructured, unmaintainable

3. Error Handling (Weight: 15%)

  • Excellent (5): Comprehensive error handling with recovery, detailed messages
  • Good (4): Good error handling, clear messages
  • Adequate (3): Basic error handling present
  • Poor (2): Minimal error handling
  • Failing (1): No error handling

4. Documentation (Weight: 10%)

  • Excellent (5): Comprehensive docs, examples, edge cases documented
  • Good (4): Good documentation coverage
  • Adequate (3): Basic documentation present
  • Poor (2): Minimal documentation
  • Failing (1): No documentation

5. Type Safety (Weight: 10%)

  • Excellent (5): Full type hints, passes strict type checking
  • Good (4): Good type coverage (>80%)
  • Adequate (3): Basic type hints (>50%)
  • Poor (2): Minimal type hints (<50%)
  • Failing (1): No type hints

6. Performance (Weight: 10%)

  • Excellent (5): Optimal algorithms, efficient implementation
  • Good (4): Good performance, room for minor optimization
  • Adequate (3): Acceptable performance
  • Poor (2): Performance issues
  • Failing (1): Unacceptable performance

7. Security (Weight: 5%)

  • Excellent (5): Security best practices, input validation, no vulnerabilities
  • Good (4): Good security practices
  • Adequate (3): Basic security measures
  • Poor (2): Security concerns present
  • Failing (1): Critical security issues

Architecture Quality Rubric

1. Scalability (Weight: 25%)

  • Excellent (5): Scales horizontally, handles 10x growth
  • Good (4): Scales with minor modifications
  • Adequate (3): Handles current load
  • Poor (2): Scaling issues likely
  • Failing (1): Cannot scale

2. Maintainability (Weight: 20%)

  • Excellent (5): Clear boundaries, easy to modify, well-tested
  • Good (4): Generally maintainable
  • Adequate (3): Can be maintained with effort
  • Poor (2): Difficult to maintain
  • Failing (1): Unmaintainable

3. Observability (Weight: 15%)

  • Excellent (5): Comprehensive metrics, logging, tracing
  • Good (4): Good observability coverage
  • Adequate (3): Basic logging/metrics
  • Poor (2): Minimal observability
  • Failing (1): No observability

4. Fault Tolerance (Weight: 15%)

  • Excellent (5): Circuit breakers, retries, graceful degradation
  • Good (4): Good error recovery
  • Adequate (3): Basic error handling
  • Poor (2): Poor fault tolerance
  • Failing (1): No fault tolerance

5. Security (Weight: 15%)

  • Excellent (5): Defense in depth, least privilege, validated inputs
  • Good (4): Good security practices
  • Adequate (3): Basic security
  • Poor (2): Security gaps
  • Failing (1): Critical vulnerabilities

6. Documentation (Weight: 10%)

  • Excellent (5): Architecture diagrams, ADRs, runbooks
  • Good (4): Good documentation
  • Adequate (3): Basic documentation
  • Poor (2): Minimal documentation
  • Failing (1): No documentation

Multi-Agent System Rubric

1. Coordination Efficiency (Weight: 25%)

  • Excellent (5): Minimal coordination overhead, async patterns
  • Good (4): Efficient coordination
  • Adequate (3): Acceptable coordination
  • Poor (2): High coordination overhead
  • Failing (1): Coordination bottleneck

2. Error Cascade Prevention (Weight: 20%)

  • Excellent (5): Circuit breakers, bulkheads, timeouts everywhere
  • Good (4): Good isolation
  • Adequate (3): Basic isolation
  • Poor (2): Error cascade risk
  • Failing (1): No isolation

3. Token Economics (Weight: 15%)

  • Excellent (5): Optimized token usage, checkpointing, compression
  • Good (4): Good token management
  • Adequate (3): Acceptable token usage
  • Poor (2): High token consumption
  • Failing (1): Excessive token waste

4. Observability (Weight: 15%)

  • Excellent (5): Full tracing, agent state visibility, debug tools
  • Good (4): Good observability
  • Adequate (3): Basic logging
  • Poor (2): Limited visibility
  • Failing (1): No observability

5. Delegation Clarity (Weight: 15%)

  • Excellent (5): Clear responsibilities, typed interfaces, boundaries
  • Good (4): Clear delegation
  • Adequate (3): Understandable delegation
  • Poor (2): Unclear responsibilities
  • Failing (1): Chaotic delegation

6. Checkpoint/Resume (Weight: 10%)

  • Excellent (5): Comprehensive checkpointing, resume from any state
  • Good (4): Good checkpoint coverage
  • Adequate (3): Basic checkpointing
  • Poor (2): Limited checkpointing
  • Failing (1): No checkpointing

Evaluation Process

Step 1: Define Criteria

code_quality_criteria = [
    EvaluationCriterion(
        name="Correctness",
        description="Code works as intended, handles edge cases",
        weight=0.30,
        scoring_guide={
            ScoreLevel.EXCELLENT: "Handles all cases including edge cases, no bugs",
            ScoreLevel.GOOD: "Handles main cases correctly, minor edge case issues",
            ScoreLevel.ADEQUATE: "Core functionality works, some edge case bugs",
            ScoreLevel.POOR: "Core functionality has bugs",
            ScoreLevel.FAILING: "Does not work as intended",
        },
    ),
    # ... more criteria
]

Step 2: Generate Evaluation Prompt

def generate_evaluation_prompt(
    output: str,
    criteria: List[EvaluationCriterion],
) -> str:
    """Generate LLM-as-judge prompt"""
    prompt = f"""# Evaluation Task

Evaluate the following OUTPUT against the specified CRITERIA.

## Output to Evaluate

{output}

## Evaluation Criteria
"""
    for criterion in criteria:
        # Format the weight as a whole percentage to avoid float
        # artifacts such as "30.000000000000004%"
        prompt += f"\n{criterion.name} (Weight: {criterion.weight * 100:.0f}%)\n"
        for level, description in criterion.scoring_guide.items():
            prompt += f"- {level.name.title()} ({level.value}): {description}\n"

    prompt += """
## Required Response Format

[Format instructions...]
"""
    return prompt

Step 3: Parse Evaluation Response

def parse_evaluation_response(
    response: str,
    criteria: List[EvaluationCriterion],
) -> OverallEvaluation:
    """Parse LLM evaluation response"""
    # Extract individual scores
    # Calculate weighted average
    # Generate overall assessment
    pass
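The score-extraction part of that stub can be filled in minimally with a regular expression over the judge's response, assuming the response follows the Required Response Format above (one Score line per criterion, in order). This is a hedged sketch; production code would validate the score count against the criteria list and handle malformed responses:

```python
import re

def extract_scores(response: str) -> list:
    """Pull the numeric scores (1-5) out of lines like '**Score**: 4'.

    Matches 'Score: 4' with or without surrounding markdown bold markers.
    """
    return [int(m) for m in
            re.findall(r"\*{0,2}Score\*{0,2}\s*:\s*([1-5])", response)]

# Illustrative judge response fragment:
sample = """
### Correctness
**Score**: 4
**Justification**: handles main cases

### Documentation
**Score**: 3
"""
# extract_scores(sample) returns [4, 3]
```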

Step 4: Generate Report

def generate_evaluation_report(evaluation: OverallEvaluation) -> str:
    """Generate human-readable report"""
    # Parameter named `evaluation` to avoid shadowing the built-in `eval`
    report = f"""
# Evaluation Report

**Overall Score**: {evaluation.weighted_average:.2f}/5.0

## Strengths
{chr(10).join(f"- {s}" for s in evaluation.strengths)}

## Weaknesses
{chr(10).join(f"- {w}" for w in evaluation.weaknesses)}

## Detailed Scores
"""
    for result in evaluation.individual_scores:
        report += f"""
### {result.criterion}
**Score**: {result.score.value}/5 ({result.score.name})

**Justification**: {result.justification}

**Examples**:
{chr(10).join(f"- {ex}" for ex in result.examples)}

**Improvements**:
{chr(10).join(f"- {imp}" for imp in result.improvement_suggestions)}
"""
    return report

Comparative Evaluation

For comparing multiple implementations:

@dataclass
class ComparativeEvaluation:
    """Compare multiple outputs"""
    outputs: List[str]
    criteria: List[EvaluationCriterion]
    individual_evaluations: List[OverallEvaluation]
    rankings: Dict[str, int]  # output_id -> rank
    best_practices: List[str]
    common_issues: List[str]
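The rankings field can be derived by sorting per-output weighted averages. A sketch (output identifiers and tie handling are naive here, as an assumption):

```python
def rank_outputs(averages: dict) -> dict:
    """Map output_id -> rank (1 = best), ordering by weighted average descending."""
    ordered = sorted(averages, key=averages.get, reverse=True)
    return {output_id: rank for rank, output_id in enumerate(ordered, start=1)}

# Example: rank_outputs({"a": 3.2, "b": 4.1, "c": 2.8})
# returns {"b": 1, "a": 2, "c": 3}
```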

Executable Scripts

See core/llm_as_judge.py for the LLM-as-judge implementation and core/rubric_generator.py for rubric generation utilities.

Best Practices

✅ DO

  • Define clear criteria - Specific, measurable, actionable
  • Use weighted scoring - Reflect importance of criteria
  • Provide scoring guides - Clear descriptions for each level
  • Require evidence - Quote specific examples
  • Give actionable feedback - Specific improvement suggestions
  • Be consistent - Apply same standards across evaluations

❌ DON'T

  • Don't use vague criteria - "Good code" is not measurable
  • Don't skip justifications - Always explain scores
  • Don't ignore context - Consider constraints and requirements
  • Don't be subjective - Base on evidence, not preference
  • Don't provide only scores - Include improvement guidance

Integration with T2

Use cases in T2:

  • Code review automation (evaluate PRs)
  • Agent output validation (ensure quality)
  • Architecture assessment (evaluate designs)
  • Documentation quality checks

Example integration:

// Evaluate agent output before accepting
let evaluation = evaluate_agent_output(
    agent_output,
    &evaluation_criteria,
    llm_service,
).await?;

if evaluation.weighted_average < 3.0 {
    // Reject and request improvements
    return Err(AgentError::OutputQualityTooLow {
        score: evaluation.weighted_average,
        issues: evaluation.weaknesses,
    });
}

Templates

See templates/evaluation_rubrics.md for pre-built rubrics.

Rubric Selection Guide

Choose the right rubric based on what you're evaluating:

| Evaluating | Recommended Rubric | Key Criteria | Weight Distribution |
|---|---|---|---|
| Code (functions/modules) | Code Quality Rubric | Correctness, Structure, Error Handling | Correctness 30%, Structure 20% |
| System architecture | Architecture Quality Rubric | Scalability, Maintainability, Fault Tolerance | Scalability 25%, Maintainability 20% |
| Multi-agent output | Multi-Agent System Rubric | Coordination, Token Economics, Delegation | Coordination 25%, Error Cascade 20% |
| Documentation | Documentation Rubric | Completeness, Accuracy, Clarity | Completeness 35%, Accuracy 30% |
| API design | API Quality Rubric | Consistency, Usability, Performance | Consistency 30%, Usability 25% |
| Test coverage | Testing Rubric | Coverage, Edge Cases, Maintainability | Coverage 35%, Edge Cases 25% |

Rubric Customization Decision Tree:

Start with standard rubric
│
├── Domain-specific requirements?
│   └── Yes → Add domain criteria (security, compliance, etc.)
│
├── Team has specific quality gates?
│   └── Yes → Adjust weights to match gates
│
├── Comparing implementations?
│   └── Yes → Use ComparativeEvaluation with identical criteria
│
└── Single evaluation or batch?
    ├── Single → Full rubric, detailed feedback
    └── Batch → Simplified rubric, aggregate scoring

Minimum Viable Rubric (3 criteria):

  1. Correctness (40%) - Does it work as intended?
  2. Quality (35%) - Is it well-structured and maintainable?
  3. Completeness (25%) - Does it cover all requirements?
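Expressed with the EvaluationCriterion dataclass from the template above, the minimum viable rubric might look like the following. The scoring-guide text is illustrative, and only the extreme levels are shown for brevity:

```python
from dataclasses import dataclass
from enum import Enum

class ScoreLevel(Enum):
    EXCELLENT = 5
    GOOD = 4
    ADEQUATE = 3
    POOR = 2
    FAILING = 1

@dataclass
class EvaluationCriterion:
    name: str
    description: str
    weight: float          # fraction of total score; all weights sum to 1.0
    scoring_guide: dict    # ScoreLevel -> description

# Minimum viable rubric: three criteria, weights 40/35/25
minimum_viable_rubric = [
    EvaluationCriterion(
        name="Correctness",
        description="Does it work as intended?",
        weight=0.40,
        scoring_guide={ScoreLevel.EXCELLENT: "All cases handled, no bugs",
                       ScoreLevel.FAILING: "Does not work as intended"},
    ),
    EvaluationCriterion(
        name="Quality",
        description="Is it well-structured and maintainable?",
        weight=0.35,
        scoring_guide={ScoreLevel.EXCELLENT: "Clean structure, DRY",
                       ScoreLevel.FAILING: "Unmaintainable"},
    ),
    EvaluationCriterion(
        name="Completeness",
        description="Does it cover all requirements?",
        weight=0.25,
        scoring_guide={ScoreLevel.EXCELLENT: "All requirements covered",
                       ScoreLevel.FAILING: "Major requirements missing"},
    ),
]

# Sanity check: weights must sum to 100%
assert abs(sum(c.weight for c in minimum_viable_rubric) - 1.0) < 1e-9
```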

Multi-Context Window Support

This skill supports long-running evaluation workflows across multiple context windows using Claude 4.5's enhanced state management capabilities.

State Tracking

Evaluation Progress State (JSON):

{
  "checkpoint_id": "ckpt_20251129_153000",
  "evaluations_completed": [
    {"target": "code_quality", "score": 4.2, "status": "complete"},
    {"target": "architecture", "score": 3.8, "status": "complete"},
    {"target": "multi_agent", "score": 0.0, "status": "pending"}
  ],
  "rubrics_created": ["code_quality", "architecture", "multi_agent_system"],
  "llm_judge_results": {
    "total_evaluations": 15,
    "average_score": 4.0,
    "improvement_suggestions": 42
  },
  "token_usage": 12000,
  "created_at": "2025-11-29T15:30:00Z"
}
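State with this shape can be persisted and queried with the standard library alone. A sketch, assuming the field names shown above:

```python
import json
from pathlib import Path

def save_checkpoint(state: dict, path: str) -> None:
    """Persist evaluation state as pretty-printed JSON, creating parent dirs."""
    p = Path(path)
    p.parent.mkdir(parents=True, exist_ok=True)
    p.write_text(json.dumps(state, indent=2))

def load_pending(path: str) -> list:
    """Return the evaluations still marked 'pending' in a checkpoint file."""
    state = json.loads(Path(path).read_text())
    return [e for e in state["evaluations_completed"] if e["status"] == "pending"]
```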

Progress Notes (Markdown):

# Evaluation Framework Progress - 2025-11-29

## Completed
- Code quality rubric created with 7 criteria
- 15 evaluations run with LLM-as-judge
- Architecture assessment rubric defined

## In Progress
- Running multi-agent system evaluations
- Generating improvement reports

## Next Actions
- Complete multi-agent evaluations (3 remaining)
- Generate consolidated improvement report
- Create comparative analysis across evaluations

Session Recovery

When starting a fresh context window after evaluation work:

  1. Load Checkpoint State: Read .coditect/checkpoints/evaluation-latest.json
  2. Review Progress Notes: Check evaluation-progress.md for context
  3. Verify Rubrics: Review created rubrics for completeness
  4. Check Evaluation Results: Load JSON results from completed evaluations
  5. Resume Evaluations: Continue from last pending evaluation

Recovery Commands:

# 1. Check latest checkpoint
jq '.evaluations_completed' .coditect/checkpoints/evaluation-latest.json

# 2. Review progress
tail -30 evaluation-progress.md

# 3. Check evaluation results
jq '.llm_judge_results' evaluation-results.json

# 4. List pending evaluations
jq '.evaluations_completed[] | select(.status == "pending")' .coditect/checkpoints/evaluation-latest.json

# 5. Resume from the next pending evaluation
# Continue the evaluation workflow

State Management Best Practices

Checkpoint Files (JSON Schema):

  • Store in .coditect/checkpoints/evaluation-{timestamp}.json
  • Include evaluation results with scores and justifications
  • Track rubric definitions and criteria weights
  • Record improvement suggestions generated

Progress Tracking (Markdown Narrative):

  • Maintain evaluation-progress.md with evaluation status
  • Document rubric design decisions
  • Note unexpected evaluation results for review
  • List next evaluations to run

Git Integration:

  • Create checkpoint after major rubric creation
  • Commit results with: docs(eval): Add code quality evaluation results
  • Tag evaluation milestones: git tag eval-batch-1-complete

Progress Checkpoints

Natural Breaking Points:

  1. After each rubric created and validated
  2. After batch of 5-10 evaluations completed
  3. After improvement reports generated
  4. Before comparative analysis phase
  5. After all evaluations validated by human reviewer

Checkpoint Creation Pattern:

# Automatic checkpoint after an evaluation batch
if evaluations_complete >= 10 or improvement_suggestions > 30:
    create_checkpoint({
        "evaluations": completed_evaluations,
        "rubrics": created_rubrics,
        "results": llm_judge_output,
        "tokens": current_tokens,
    })
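A hypothetical create_checkpoint helper consistent with the pattern above, writing both a timestamped file and the evaluation-latest.json alias used during session recovery (the exact directory layout and field names are assumptions):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def create_checkpoint(state: dict, base_dir: str = ".coditect/checkpoints") -> Path:
    """Write state to evaluation-{timestamp}.json and refresh the 'latest' alias."""
    now = datetime.now(timezone.utc)
    ts = now.strftime("%Y%m%d_%H%M%S")
    record = {"checkpoint_id": f"ckpt_{ts}", **state,
              "created_at": now.isoformat()}
    directory = Path(base_dir)
    directory.mkdir(parents=True, exist_ok=True)
    payload = json.dumps(record, indent=2)
    path = directory / f"evaluation-{ts}.json"
    path.write_text(payload)
    # Stable alias so session recovery never has to guess the newest file
    (directory / "evaluation-latest.json").write_text(payload)
    return path
```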

Example: Multi-Context Evaluation Workflow

Context Window 1: Rubric Creation & Initial Evaluations

{
  "checkpoint_id": "ckpt_eval_batch1",
  "phase": "initial_evaluations_complete",
  "rubrics": ["code_quality", "architecture"],
  "evaluations": 10,
  "next_action": "Run multi-agent evaluations",
  "token_usage": 7500
}

Context Window 2: Remaining Evaluations & Reports

# Load checkpoint
cat .coditect/checkpoints/ckpt_eval_batch1.json

# Continue with multi-agent evaluations
# Token savings: ~8000 tokens (rubrics already defined, results cached)

Token Savings Analysis:

  • Without checkpoint: 12000 tokens (re-create rubrics + re-run evaluations)
  • With checkpoint: 7500 tokens (resume from cached results)
  • Savings: 38% reduction (12000 → 7500 tokens)

Success Output

When successful, this skill MUST output:

✅ SKILL COMPLETE: evaluation-framework

Completed:
- [x] Code quality rubric created with 7 weighted criteria
- [x] Architecture quality rubric with scalability focus defined
- [x] LLM-as-judge prompt templates operational
- [x] Evaluation pipeline with automated scoring deployed
- [x] Comparative analysis framework for multiple implementations functional
- [x] 15 code evaluations completed with actionable feedback

Outputs:
- rubrics/code-quality-rubric.json (7 criteria with 5-level scoring)
- rubrics/architecture-rubric.json (6 criteria with scalability focus)
- prompts/llm-judge-template.md (Structured evaluation prompt)
- src/evaluation/llm_judge.py (LLM-as-judge implementation)
- src/evaluation/rubric_generator.py (Rubric creation utilities)
- reports/evaluation-results.json (15 evaluations with scores)

Evaluation Metrics:
- Average evaluation score: 4.0/5.0 (80% quality threshold met)
- Consistency: 92% (scores within ±0.3 across similar code)
- Actionable feedback rate: 95% (concrete improvement suggestions)
- Time savings: 75% faster than manual review (20min → 5min)

Completion Checklist

Before marking this skill as complete, verify:

  • Rubrics define 5-level scoring (Excellent, Good, Adequate, Poor, Failing)
  • Each criterion has weight (sum to 100%)
  • Scoring guide provides specific, measurable descriptions per level
  • LLM-as-judge prompt includes output format requirements
  • Prompt requires justification with evidence/examples
  • Evaluation parser extracts scores from LLM response
  • Weighted average calculation correct (score * weight summed)
  • Individual criterion results include justification and improvement suggestions
  • Comparative evaluation ranks multiple outputs correctly
  • Evaluation reports human-readable (markdown format)
  • Quality gates reject outputs below threshold (e.g., <3.0/5.0)
  • All outputs exist at expected locations and pass validation

Failure Indicators

This skill has FAILED if:

  • ❌ Rubric criteria vague or non-measurable ("good code" without specifics)
  • ❌ Scoring levels overlap or unclear (cannot distinguish Adequate from Good)
  • ❌ LLM-as-judge provides scores without justification
  • ❌ Evaluation results inconsistent (same code scores differently on re-evaluation)
  • ❌ No evidence/examples quoted from evaluated output
  • ❌ Improvement suggestions generic ("make it better") instead of actionable
  • ❌ Weighted average calculation incorrect
  • ❌ Parser fails to extract scores from LLM response
  • ❌ Quality gates not enforced (low-quality outputs accepted)
  • ❌ Comparative evaluation ranks incorrectly
  • ❌ No verification of evaluation accuracy (spot-checking by human)

When NOT to Use

Do NOT use this skill when:

  • Simple pass/fail check sufficient (unit tests, linters)
  • Subjective aesthetic judgments required (UI design preferences)
  • Real-time interactive review needed (pair programming, live code review)
  • Single one-off evaluation (rubric setup overhead not justified)
  • Evaluation criteria cannot be objectified (art, creative writing)
  • Deterministic scoring required (use static analysis tools instead)
  • Human expert review mandated (security audits, legal compliance)

Alternative approaches:

  • Pass/fail only: Use automated tests, linters, type checkers
  • Subjective review: Human code review with qualitative feedback
  • Real-time: Live pairing session, interactive Q&A
  • Deterministic: Static analysis tools (SonarQube, ESLint)
  • Single evaluation: Manual review without rubric formalization

Anti-Patterns (Avoid)

| Anti-Pattern | Problem | Solution |
|---|---|---|
| Vague criteria ("good quality") | Not measurable, inconsistent scoring | Define specific, objective criteria with examples |
| No scoring guide | LLM assigns arbitrary scores | Provide 5-level descriptions for each criterion |
| Missing justification requirement | Scores lack credibility | Mandate evidence and quotes from evaluated output |
| Equal weighting for all criteria | Misrepresents importance | Weight criteria by significance (e.g., correctness 30%, docs 10%) |
| Generic improvement suggestions | Not actionable | Require specific recommendations with examples |
| No consistency checks | Same code scores differently | Run evaluation multiple times, verify variance < threshold |
| Ignoring context/constraints | Unfair evaluation | Include context (time limits, requirements) in prompt |
| No human validation | LLM errors go undetected | Spot-check 10% of evaluations manually |
| Returning only scores | Loses learning opportunity | Always include strengths, weaknesses, and improvements |
| Evaluation without baseline | Cannot measure improvement | Establish baseline score, track progress over time |
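The "No consistency checks" row can be made concrete with a variance guard: run the judge several times on the same output and require the scores to agree within a tolerance. A sketch; the 0.3 threshold mirrors the ±0.3 consistency figure used elsewhere in this document:

```python
from statistics import pstdev

def is_consistent(scores: list, max_stddev: float = 0.3) -> bool:
    """True if repeated evaluations of the same output agree within tolerance.

    Uses population standard deviation over the repeated runs; tighten or
    loosen max_stddev to match your quality gates.
    """
    return pstdev(scores) <= max_stddev

# Three runs of the judge on the same code:
# is_consistent([4.0, 4.2, 3.9]) -> True (tight agreement)
# is_consistent([2.0, 4.5, 3.0]) -> False (judge is unstable on this input)
```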

Principles

This skill embodies:

  • #1 Evidence-Based Assessment - Every score backed by quoted examples from evaluated output
  • #2 Consistency Through Structure - Rubrics ensure repeatable, objective evaluations
  • #5 Eliminate Ambiguity - Clear 5-level scoring guides remove subjective interpretation
  • #6 Clear, Understandable, Explainable - Justifications explain reasoning behind scores
  • #7 Optimize for Context - Evaluation criteria adapt to domain (code vs architecture vs multi-agent)
  • #8 No Assumptions - Verify evaluation accuracy through spot-checking and variance analysis
  • #10 Automation First - LLM-as-judge automates 75% of review time
  • #11 Continuous Improvement - Actionable feedback drives iterative enhancement

Full Standard: CODITECT-STANDARD-AUTOMATION.md