Evaluation Framework

How to Use This Skill

  1. Review the patterns and examples below
  2. Apply the relevant patterns to your implementation
  3. Follow the best practices outlined in this skill

Expert skill for creating evaluation rubrics, implementing LLM-as-judge patterns, and assessing quality.

When to Use

Use this skill when:

  • Creating evaluation rubrics for code/outputs (consistent scoring)
  • Implementing LLM-as-judge patterns (automated review)
  • Quality assessment frameworks (standardized criteria)
  • Automated code review systems (scalable evaluation)
  • Output validation and scoring (objective assessment)
  • Creating grading criteria (5-level scoring guides)
  • Repeated evaluations where time savings matter (75% faster: 20 → 5 min per evaluation)

Don't use this skill when:

  • Simple pass/fail checks (use basic validation instead)
  • Subjective aesthetic judgments (rubrics work best for objective criteria)
  • Real-time interactive reviews (LLM-as-judge is async)
  • Single evaluation (not worth rubric setup overhead)

LLM-as-Judge Pattern

Core Concept

Use an LLM to evaluate outputs based on defined criteria, providing:

  • Structured scoring
  • Consistent evaluation
  • Detailed feedback
  • Comparative analysis

Evaluation Template

from dataclasses import dataclass
from typing import Dict, List
from enum import Enum

class ScoreLevel(Enum):
    """Standardized score levels"""
    EXCELLENT = 5  # Exceeds all criteria
    GOOD = 4       # Meets all criteria well
    ADEQUATE = 3   # Meets minimum criteria
    POOR = 2       # Below minimum criteria
    FAILING = 1    # Does not meet criteria

@dataclass
class EvaluationCriterion:
    """Single evaluation criterion"""
    name: str
    description: str
    weight: float  # 0.0 - 1.0
    scoring_guide: Dict[ScoreLevel, str]

@dataclass
class EvaluationResult:
    """Result of evaluating one criterion"""
    criterion: str
    score: ScoreLevel
    justification: str
    examples: List[str]
    improvement_suggestions: List[str]

@dataclass
class OverallEvaluation:
    """Complete evaluation"""
    individual_scores: List[EvaluationResult]
    weighted_average: float
    overall_assessment: str
    strengths: List[str]
    weaknesses: List[str]
    actionable_feedback: List[str]

LLM-as-Judge Prompt Template

# Evaluation Task

Evaluate the following OUTPUT against the specified CRITERIA.

## Output to Evaluate

{output_text}


## Evaluation Criteria

{criterion_1_name} (Weight: {weight}%)
- Excellent (5): {excellent_description}
- Good (4): {good_description}
- Adequate (3): {adequate_description}
- Poor (2): {poor_description}
- Failing (1): {failing_description}

{criterion_2_name} (Weight: {weight}%)
[...repeat for all criteria...]

## Required Response Format

For EACH criterion, provide:

1. **Score**: {1-5}
2. **Justification**: {detailed explanation referencing specific parts of output}
3. **Evidence**: {quote specific examples from output}
4. **Improvement Suggestions**: {actionable recommendations}

## Final Summary

- **Weighted Average Score**: {calculated from individual scores}
- **Overall Assessment**: {holistic evaluation}
- **Top 3 Strengths**: {bullet list}
- **Top 3 Weaknesses**: {bullet list}
- **Priority Improvements**: {ranked list of most impactful changes}

## Important
- Be objective and evidence-based
- Quote specific examples
- Provide actionable feedback
- Consider context and constraints
- Be consistent across evaluations
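The Weighted Average Score requested in the Final Summary is the sum of each criterion's score times its weight. A minimal sketch of that calculation (the function name and dict-based signature are illustrative, not part of the skill's core templates):

```python
def weighted_average(scores: dict, weights: dict) -> float:
    """Combine per-criterion scores (1-5) using criterion weights.

    Weights are fractions that should sum to 1.0, so the result stays
    on the same 1-5 scale as the individual scores.
    """
    total_weight = sum(weights.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError(f"weights must sum to 1.0, got {total_weight}")
    return sum(scores[name] * weights[name] for name in scores)

# Example: correctness scores 5 at 30% weight, structure scores 3 at 70%:
# 5 * 0.3 + 3 * 0.7 = 3.6
```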

Code Quality Rubric

Criteria

1. Correctness (Weight: 30%)

  • Excellent (5): Handles all cases including edge cases, no bugs
  • Good (4): Handles main cases correctly, minor edge case issues
  • Adequate (3): Core functionality works, some edge case bugs
  • Poor (2): Core functionality has bugs
  • Failing (1): Does not work as intended

2. Code Structure (Weight: 20%)

  • Excellent (5): Well-organized, clear separation of concerns, DRY
  • Good (4): Organized, minor repetition
  • Adequate (3): Functional organization, some repetition
  • Poor (2): Poor organization, significant repetition
  • Failing (1): Unstructured, unmaintainable

3. Error Handling (Weight: 15%)

  • Excellent (5): Comprehensive error handling with recovery, detailed messages
  • Good (4): Good error handling, clear messages
  • Adequate (3): Basic error handling present
  • Poor (2): Minimal error handling
  • Failing (1): No error handling

4. Documentation (Weight: 10%)

  • Excellent (5): Comprehensive docs, examples, edge cases documented
  • Good (4): Good documentation coverage
  • Adequate (3): Basic documentation present
  • Poor (2): Minimal documentation
  • Failing (1): No documentation

5. Type Safety (Weight: 10%)

  • Excellent (5): Full type hints, passes strict type checking
  • Good (4): Good type coverage (>80%)
  • Adequate (3): Basic type hints (>50%)
  • Poor (2): Minimal type hints (<50%)
  • Failing (1): No type hints

6. Performance (Weight: 10%)

  • Excellent (5): Optimal algorithms, efficient implementation
  • Good (4): Good performance, room for minor optimization
  • Adequate (3): Acceptable performance
  • Poor (2): Performance issues
  • Failing (1): Unacceptable performance

7. Security (Weight: 5%)

  • Excellent (5): Security best practices, input validation, no vulnerabilities
  • Good (4): Good security practices
  • Adequate (3): Basic security measures
  • Poor (2): Security concerns present
  • Failing (1): Critical security issues

Architecture Quality Rubric

1. Scalability (Weight: 25%)

  • Excellent (5): Scales horizontally, handles 10x growth
  • Good (4): Scales with minor modifications
  • Adequate (3): Handles current load
  • Poor (2): Scaling issues likely
  • Failing (1): Cannot scale

2. Maintainability (Weight: 20%)

  • Excellent (5): Clear boundaries, easy to modify, well-tested
  • Good (4): Generally maintainable
  • Adequate (3): Can be maintained with effort
  • Poor (2): Difficult to maintain
  • Failing (1): Unmaintainable

3. Observability (Weight: 15%)

  • Excellent (5): Comprehensive metrics, logging, tracing
  • Good (4): Good observability coverage
  • Adequate (3): Basic logging/metrics
  • Poor (2): Minimal observability
  • Failing (1): No observability

4. Fault Tolerance (Weight: 15%)

  • Excellent (5): Circuit breakers, retries, graceful degradation
  • Good (4): Good error recovery
  • Adequate (3): Basic error handling
  • Poor (2): Poor fault tolerance
  • Failing (1): No fault tolerance

5. Security (Weight: 15%)

  • Excellent (5): Defense in depth, least privilege, validated inputs
  • Good (4): Good security practices
  • Adequate (3): Basic security
  • Poor (2): Security gaps
  • Failing (1): Critical vulnerabilities

6. Documentation (Weight: 10%)

  • Excellent (5): Architecture diagrams, ADRs, runbooks
  • Good (4): Good documentation
  • Adequate (3): Basic documentation
  • Poor (2): Minimal documentation
  • Failing (1): No documentation

Multi-Agent System Rubric

1. Coordination Efficiency (Weight: 25%)

  • Excellent (5): Minimal coordination overhead, async patterns
  • Good (4): Efficient coordination
  • Adequate (3): Acceptable coordination
  • Poor (2): High coordination overhead
  • Failing (1): Coordination bottleneck

2. Error Cascade Prevention (Weight: 20%)

  • Excellent (5): Circuit breakers, bulkheads, timeouts everywhere
  • Good (4): Good isolation
  • Adequate (3): Basic isolation
  • Poor (2): Error cascade risk
  • Failing (1): No isolation

3. Token Economics (Weight: 15%)

  • Excellent (5): Optimized token usage, checkpointing, compression
  • Good (4): Good token management
  • Adequate (3): Acceptable token usage
  • Poor (2): High token consumption
  • Failing (1): Excessive token waste

4. Observability (Weight: 15%)

  • Excellent (5): Full tracing, agent state visibility, debug tools
  • Good (4): Good observability
  • Adequate (3): Basic logging
  • Poor (2): Limited visibility
  • Failing (1): No observability

5. Delegation Clarity (Weight: 15%)

  • Excellent (5): Clear responsibilities, typed interfaces, boundaries
  • Good (4): Clear delegation
  • Adequate (3): Understandable delegation
  • Poor (2): Unclear responsibilities
  • Failing (1): Chaotic delegation

6. Checkpoint/Resume (Weight: 10%)

  • Excellent (5): Comprehensive checkpointing, resume from any state
  • Good (4): Good checkpoint coverage
  • Adequate (3): Basic checkpointing
  • Poor (2): Limited checkpointing
  • Failing (1): No checkpointing

Evaluation Process

Step 1: Define Criteria

code_quality_criteria = [
    EvaluationCriterion(
        name="Correctness",
        description="Code works as intended, handles edge cases",
        weight=0.30,
        scoring_guide={
            ScoreLevel.EXCELLENT: "Handles all cases including edge cases, no bugs",
            ScoreLevel.GOOD: "Handles main cases correctly, minor edge case issues",
            ScoreLevel.ADEQUATE: "Core functionality works, some edge case bugs",
            ScoreLevel.POOR: "Core functionality has bugs",
            ScoreLevel.FAILING: "Does not work as intended",
        },
    ),
    # ... more criteria
]

Step 2: Generate Evaluation Prompt

def generate_evaluation_prompt(
    output: str,
    criteria: List[EvaluationCriterion],
) -> str:
    """Generate LLM-as-judge prompt"""
    prompt = f"""# Evaluation Task

Evaluate the following OUTPUT against the specified CRITERIA.

## Output to Evaluate

{output}

## Evaluation Criteria
"""
    for criterion in criteria:
        # Format the weight as a whole percentage to avoid float
        # artifacts such as "30.000000000000004%"
        prompt += f"\n{criterion.name} (Weight: {criterion.weight * 100:.0f}%)\n"
        for level, description in criterion.scoring_guide.items():
            prompt += f"- {level.name.title()} ({level.value}): {description}\n"

    prompt += """
## Required Response Format

[Format instructions...]
"""
    return prompt

Step 3: Parse Evaluation Response

def parse_evaluation_response(
    response: str,
    criteria: List[EvaluationCriterion],
) -> OverallEvaluation:
    """Parse LLM evaluation response"""
    # Extract individual scores
    # Calculate weighted average
    # Generate overall assessment
    pass
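The score-extraction part of that stub can be filled in minimally with a regular expression over the judge's response, assuming the response follows the Required Response Format above (one Score line per criterion, in order). This is a hedged sketch; production code would validate the score count against the criteria list and handle malformed responses:

```python
import re

def extract_scores(response: str) -> list:
    """Pull the numeric scores (1-5) out of lines like '**Score**: 4'.

    Matches 'Score: 4' with or without surrounding markdown bold markers.
    """
    return [int(m) for m in
            re.findall(r"\*{0,2}Score\*{0,2}\s*:\s*([1-5])", response)]

# Illustrative judge response fragment:
sample = """
### Correctness
**Score**: 4
**Justification**: handles main cases

### Documentation
**Score**: 3
"""
# extract_scores(sample) returns [4, 3]
```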

Step 4: Generate Report

def generate_evaluation_report(evaluation: OverallEvaluation) -> str:
    """Generate human-readable report"""
    # Parameter named `evaluation` to avoid shadowing the built-in `eval`
    report = f"""
# Evaluation Report

**Overall Score**: {evaluation.weighted_average:.2f}/5.0

## Strengths
{chr(10).join(f"- {s}" for s in evaluation.strengths)}

## Weaknesses
{chr(10).join(f"- {w}" for w in evaluation.weaknesses)}

## Detailed Scores
"""
    for result in evaluation.individual_scores:
        report += f"""
### {result.criterion}
**Score**: {result.score.value}/5 ({result.score.name})

**Justification**: {result.justification}

**Examples**:
{chr(10).join(f"- {ex}" for ex in result.examples)}

**Improvements**:
{chr(10).join(f"- {imp}" for imp in result.improvement_suggestions)}
"""
    return report

Comparative Evaluation

For comparing multiple implementations:

@dataclass
class ComparativeEvaluation:
    """Compare multiple outputs"""
    outputs: List[str]
    criteria: List[EvaluationCriterion]
    individual_evaluations: List[OverallEvaluation]
    rankings: Dict[str, int]  # output_id -> rank
    best_practices: List[str]
    common_issues: List[str]
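The rankings field can be derived by sorting per-output weighted averages. A sketch (output identifiers and tie handling are naive here, as an assumption):

```python
def rank_outputs(averages: dict) -> dict:
    """Map output_id -> rank (1 = best), ordering by weighted average descending."""
    ordered = sorted(averages, key=averages.get, reverse=True)
    return {output_id: rank for rank, output_id in enumerate(ordered, start=1)}

# Example: rank_outputs({"a": 3.2, "b": 4.1, "c": 2.8})
# returns {"b": 1, "a": 2, "c": 3}
```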

Executable Scripts

See core/llm_as_judge.py for the LLM-as-judge implementation and core/rubric_generator.py for rubric generation utilities.

Best Practices

✅ DO

  • Define clear criteria - Specific, measurable, actionable
  • Use weighted scoring - Reflect importance of criteria
  • Provide scoring guides - Clear descriptions for each level
  • Require evidence - Quote specific examples
  • Give actionable feedback - Specific improvement suggestions
  • Be consistent - Apply same standards across evaluations

❌ DON'T

  • Don't use vague criteria - "Good code" is not measurable
  • Don't skip justifications - Always explain scores
  • Don't ignore context - Consider constraints and requirements
  • Don't be subjective - Base on evidence, not preference
  • Don't provide only scores - Include improvement guidance

Integration with T2

Use cases in T2:

  • Code review automation (evaluate PRs)
  • Agent output validation (ensure quality)
  • Architecture assessment (evaluate designs)
  • Documentation quality checks

Example integration:

// Evaluate agent output before accepting
let evaluation = evaluate_agent_output(
    agent_output,
    &evaluation_criteria,
    llm_service,
).await?;

if evaluation.weighted_average < 3.0 {
    // Reject and request improvements
    return Err(AgentError::OutputQualityTooLow {
        score: evaluation.weighted_average,
        issues: evaluation.weaknesses,
    });
}

Templates

See templates/evaluation_rubrics.md for pre-built rubrics.

Rubric Selection Guide

Choose the right rubric based on what you're evaluating:

| Evaluating | Recommended Rubric | Key Criteria | Weight Distribution |
|---|---|---|---|
| Code (functions/modules) | Code Quality Rubric | Correctness, Structure, Error Handling | Correctness 30%, Structure 20% |
| System architecture | Architecture Quality Rubric | Scalability, Maintainability, Fault Tolerance | Scalability 25%, Maintainability 20% |
| Multi-agent output | Multi-Agent System Rubric | Coordination, Token Economics, Delegation | Coordination 25%, Error Cascade 20% |
| Documentation | Documentation Rubric | Completeness, Accuracy, Clarity | Completeness 35%, Accuracy 30% |
| API design | API Quality Rubric | Consistency, Usability, Performance | Consistency 30%, Usability 25% |
| Test coverage | Testing Rubric | Coverage, Edge Cases, Maintainability | Coverage 35%, Edge Cases 25% |

Rubric Customization Decision Tree:

Start with standard rubric
│
├── Domain-specific requirements?
│   └── Yes → Add domain criteria (security, compliance, etc.)
│
├── Team has specific quality gates?
│   └── Yes → Adjust weights to match gates
│
├── Comparing implementations?
│   └── Yes → Use ComparativeEvaluation with identical criteria
│
└── Single evaluation or batch?
    ├── Single → Full rubric, detailed feedback
    └── Batch → Simplified rubric, aggregate scoring

Minimum Viable Rubric (3 criteria):

  1. Correctness (40%) - Does it work as intended?
  2. Quality (35%) - Is it well-structured and maintainable?
  3. Completeness (25%) - Does it cover all requirements?
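Expressed with the EvaluationCriterion dataclass from the template above, the minimum viable rubric might look like the following. The scoring-guide text is illustrative, and only the extreme levels are shown for brevity:

```python
from dataclasses import dataclass
from enum import Enum

class ScoreLevel(Enum):
    EXCELLENT = 5
    GOOD = 4
    ADEQUATE = 3
    POOR = 2
    FAILING = 1

@dataclass
class EvaluationCriterion:
    name: str
    description: str
    weight: float          # fraction of total score; all weights sum to 1.0
    scoring_guide: dict    # ScoreLevel -> description

# Minimum viable rubric: three criteria, weights 40/35/25
minimum_viable_rubric = [
    EvaluationCriterion(
        name="Correctness",
        description="Does it work as intended?",
        weight=0.40,
        scoring_guide={ScoreLevel.EXCELLENT: "All cases handled, no bugs",
                       ScoreLevel.FAILING: "Does not work as intended"},
    ),
    EvaluationCriterion(
        name="Quality",
        description="Is it well-structured and maintainable?",
        weight=0.35,
        scoring_guide={ScoreLevel.EXCELLENT: "Clean structure, DRY",
                       ScoreLevel.FAILING: "Unmaintainable"},
    ),
    EvaluationCriterion(
        name="Completeness",
        description="Does it cover all requirements?",
        weight=0.25,
        scoring_guide={ScoreLevel.EXCELLENT: "All requirements covered",
                       ScoreLevel.FAILING: "Major requirements missing"},
    ),
]

# Sanity check: weights must sum to 100%
assert abs(sum(c.weight for c in minimum_viable_rubric) - 1.0) < 1e-9
```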

Multi-Context Window Support

This skill supports long-running evaluation workflows across multiple context windows using Claude 4.5's enhanced state management capabilities.

State Tracking

Evaluation Progress State (JSON):

{
  "checkpoint_id": "ckpt_20251129_153000",
  "evaluations_completed": [
    {"target": "code_quality", "score": 4.2, "status": "complete"},
    {"target": "architecture", "score": 3.8, "status": "complete"},
    {"target": "multi_agent", "score": 0.0, "status": "pending"}
  ],
  "rubrics_created": ["code_quality", "architecture", "multi_agent_system"],
  "llm_judge_results": {
    "total_evaluations": 15,
    "average_score": 4.0,
    "improvement_suggestions": 42
  },
  "token_usage": 12000,
  "created_at": "2025-11-29T15:30:00Z"
}
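State with this shape can be persisted and queried with the standard library alone. A sketch, assuming the field names shown above:

```python
import json
from pathlib import Path

def save_checkpoint(state: dict, path: str) -> None:
    """Persist evaluation state as pretty-printed JSON, creating parent dirs."""
    p = Path(path)
    p.parent.mkdir(parents=True, exist_ok=True)
    p.write_text(json.dumps(state, indent=2))

def load_pending(path: str) -> list:
    """Return the evaluations still marked 'pending' in a checkpoint file."""
    state = json.loads(Path(path).read_text())
    return [e for e in state["evaluations_completed"] if e["status"] == "pending"]
```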

Progress Notes (Markdown):

# Evaluation Framework Progress - 2025-11-29

## Completed
- Code quality rubric created with 7 criteria
- 15 evaluations run with LLM-as-judge
- Architecture assessment rubric defined

## In Progress
- Running multi-agent system evaluations
- Generating improvement reports

## Next Actions
- Complete multi-agent evaluations (3 remaining)
- Generate consolidated improvement report
- Create comparative analysis across evaluations

Session Recovery

When starting a fresh context window after evaluation work:

  1. Load Checkpoint State: Read .coditect/checkpoints/evaluation-latest.json
  2. Review Progress Notes: Check evaluation-progress.md for context
  3. Verify Rubrics: Review created rubrics for completeness
  4. Check Evaluation Results: Load JSON results from completed evaluations
  5. Resume Evaluations: Continue from last pending evaluation

Recovery Commands:

# 1. Check latest checkpoint
jq '.evaluations_completed' .coditect/checkpoints/evaluation-latest.json

# 2. Review progress
tail -30 evaluation-progress.md

# 3. Check evaluation results
jq '.llm_judge_results' evaluation-results.json

# 4. List pending evaluations
jq '.evaluations_completed[] | select(.status == "pending")' .coditect/checkpoints/evaluation-latest.json

# 5. Resume from the next pending evaluation
# Continue the evaluation workflow

State Management Best Practices

Checkpoint Files (JSON Schema):

  • Store in .coditect/checkpoints/evaluation-{timestamp}.json
  • Include evaluation results with scores and justifications
  • Track rubric definitions and criteria weights
  • Record improvement suggestions generated

Progress Tracking (Markdown Narrative):

  • Maintain evaluation-progress.md with evaluation status
  • Document rubric design decisions
  • Note unexpected evaluation results for review
  • List next evaluations to run

Git Integration:

  • Create checkpoint after major rubric creation
  • Commit results with: docs(eval): Add code quality evaluation results
  • Tag evaluation milestones: git tag eval-batch-1-complete

Progress Checkpoints

Natural Breaking Points:

  1. After each rubric created and validated
  2. After batch of 5-10 evaluations completed
  3. After improvement reports generated
  4. Before comparative analysis phase
  5. After all evaluations validated by human reviewer

Checkpoint Creation Pattern:

# Automatic checkpoint after an evaluation batch
if evaluations_complete >= 10 or improvement_suggestions > 30:
    create_checkpoint({
        "evaluations": completed_evaluations,
        "rubrics": created_rubrics,
        "results": llm_judge_output,
        "tokens": current_tokens,
    })
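A hypothetical create_checkpoint helper consistent with the pattern above, writing both a timestamped file and the evaluation-latest.json alias used during session recovery (the exact directory layout and field names are assumptions):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def create_checkpoint(state: dict, base_dir: str = ".coditect/checkpoints") -> Path:
    """Write state to evaluation-{timestamp}.json and refresh the 'latest' alias."""
    now = datetime.now(timezone.utc)
    ts = now.strftime("%Y%m%d_%H%M%S")
    record = {"checkpoint_id": f"ckpt_{ts}", **state,
              "created_at": now.isoformat()}
    directory = Path(base_dir)
    directory.mkdir(parents=True, exist_ok=True)
    payload = json.dumps(record, indent=2)
    path = directory / f"evaluation-{ts}.json"
    path.write_text(payload)
    # Stable alias so session recovery never has to guess the newest file
    (directory / "evaluation-latest.json").write_text(payload)
    return path
```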

Example: Multi-Context Evaluation Workflow

Context Window 1: Rubric Creation & Initial Evaluations

{
  "checkpoint_id": "ckpt_eval_batch1",
  "phase": "initial_evaluations_complete",
  "rubrics": ["code_quality", "architecture"],
  "evaluations": 10,
  "next_action": "Run multi-agent evaluations",
  "token_usage": 7500
}

Context Window 2: Remaining Evaluations & Reports

# Load checkpoint
cat .coditect/checkpoints/ckpt_eval_batch1.json

# Continue with multi-agent evaluations
# Token savings: ~8000 tokens (rubrics already defined, results cached)

Token Savings Analysis:

  • Without checkpoint: 12000 tokens (re-create rubrics + re-run evaluations)
  • With checkpoint: 7500 tokens (resume from cached results)
  • Savings: 38% reduction (12000 → 7500 tokens)

Success Output

When successful, this skill MUST output:

✅ SKILL COMPLETE: evaluation-framework

Completed:
- [x] Code quality rubric created with 7 weighted criteria
- [x] Architecture quality rubric with scalability focus defined
- [x] LLM-as-judge prompt templates operational
- [x] Evaluation pipeline with automated scoring deployed
- [x] Comparative analysis framework for multiple implementations functional
- [x] 15 code evaluations completed with actionable feedback

Outputs:
- rubrics/code-quality-rubric.json (7 criteria with 5-level scoring)
- rubrics/architecture-rubric.json (6 criteria with scalability focus)
- prompts/llm-judge-template.md (Structured evaluation prompt)
- src/evaluation/llm_judge.py (LLM-as-judge implementation)
- src/evaluation/rubric_generator.py (Rubric creation utilities)
- reports/evaluation-results.json (15 evaluations with scores)

Evaluation Metrics:
- Average evaluation score: 4.0/5.0 (80% quality threshold met)
- Consistency: 92% (scores within ±0.3 across similar code)
- Actionable feedback rate: 95% (concrete improvement suggestions)
- Time savings: 75% faster than manual review (20min → 5min)

Completion Checklist

Before marking this skill as complete, verify:

  • Rubrics define 5-level scoring (Excellent, Good, Adequate, Poor, Failing)
  • Each criterion has weight (sum to 100%)
  • Scoring guide provides specific, measurable descriptions per level
  • LLM-as-judge prompt includes output format requirements
  • Prompt requires justification with evidence/examples
  • Evaluation parser extracts scores from LLM response
  • Weighted average calculation correct (score * weight summed)
  • Individual criterion results include justification and improvement suggestions
  • Comparative evaluation ranks multiple outputs correctly
  • Evaluation reports human-readable (markdown format)
  • Quality gates reject outputs below threshold (e.g., <3.0/5.0)
  • All outputs exist at expected locations and pass validation

Failure Indicators

This skill has FAILED if:

  • ❌ Rubric criteria vague or non-measurable ("good code" without specifics)
  • ❌ Scoring levels overlap or unclear (cannot distinguish Adequate from Good)
  • ❌ LLM-as-judge provides scores without justification
  • ❌ Evaluation results inconsistent (same code scores differently on re-evaluation)
  • ❌ No evidence/examples quoted from evaluated output
  • ❌ Improvement suggestions generic ("make it better") instead of actionable
  • ❌ Weighted average calculation incorrect
  • ❌ Parser fails to extract scores from LLM response
  • ❌ Quality gates not enforced (low-quality outputs accepted)
  • ❌ Comparative evaluation ranks incorrectly
  • ❌ No verification of evaluation accuracy (spot-checking by human)

When NOT to Use

Do NOT use this skill when:

  • Simple pass/fail check sufficient (unit tests, linters)
  • Subjective aesthetic judgments required (UI design preferences)
  • Real-time interactive review needed (pair programming, live code review)
  • Single one-off evaluation (rubric setup overhead not justified)
  • Evaluation criteria cannot be objectified (art, creative writing)
  • Deterministic scoring required (use static analysis tools instead)
  • Human expert review mandated (security audits, legal compliance)

Alternative approaches:

  • Pass/fail only: Use automated tests, linters, type checkers
  • Subjective review: Human code review with qualitative feedback
  • Real-time: Live pairing session, interactive Q&A
  • Deterministic: Static analysis tools (SonarQube, ESLint)
  • Single evaluation: Manual review without rubric formalization

Anti-Patterns (Avoid)

| Anti-Pattern | Problem | Solution |
|---|---|---|
| Vague criteria ("good quality") | Not measurable, inconsistent scoring | Define specific, objective criteria with examples |
| No scoring guide | LLM assigns arbitrary scores | Provide 5-level descriptions for each criterion |
| Missing justification requirement | Scores lack credibility | Mandate evidence and quotes from evaluated output |
| Equal weighting for all criteria | Misrepresents importance | Weight criteria by significance (e.g., correctness 30%, docs 10%) |
| Generic improvement suggestions | Not actionable | Require specific recommendations with examples |
| No consistency checks | Same code scores differently | Run evaluation multiple times, verify variance < threshold |
| Ignoring context/constraints | Unfair evaluation | Include context (time limits, requirements) in prompt |
| No human validation | LLM errors go undetected | Spot-check 10% of evaluations manually |
| Returning only scores | Loses learning opportunity | Always include strengths, weaknesses, and improvements |
| Evaluation without baseline | Cannot measure improvement | Establish baseline score, track progress over time |
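The "No consistency checks" row can be made concrete with a variance guard: run the judge several times on the same output and require the scores to agree within a tolerance. A sketch; the 0.3 threshold mirrors the ±0.3 consistency figure used elsewhere in this document:

```python
from statistics import pstdev

def is_consistent(scores: list, max_stddev: float = 0.3) -> bool:
    """True if repeated evaluations of the same output agree within tolerance.

    Uses population standard deviation over the repeated runs; tighten or
    loosen max_stddev to match your quality gates.
    """
    return pstdev(scores) <= max_stddev

# Three runs of the judge on the same code:
# is_consistent([4.0, 4.2, 3.9]) -> True (tight agreement)
# is_consistent([2.0, 4.5, 3.0]) -> False (judge is unstable on this input)
```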

Principles

This skill embodies:

  • #1 Evidence-Based Assessment - Every score backed by quoted examples from evaluated output
  • #2 Consistency Through Structure - Rubrics ensure repeatable, objective evaluations
  • #5 Eliminate Ambiguity - Clear 5-level scoring guides remove subjective interpretation
  • #6 Clear, Understandable, Explainable - Justifications explain reasoning behind scores
  • #7 Optimize for Context - Evaluation criteria adapt to domain (code vs architecture vs multi-agent)
  • #8 No Assumptions - Verify evaluation accuracy through spot-checking and variance analysis
  • #10 Automation First - LLM-as-judge automates 75% of review time
  • #11 Continuous Improvement - Actionable feedback drives iterative enhancement

Full Standard: CODITECT-STANDARD-AUTOMATION.md