Evaluation Framework
How to Use This Skill
- Review the patterns and examples below
- Apply the relevant patterns to your implementation
- Follow the best practices outlined in this skill
Expert skill for creating evaluation rubrics, implementing LLM-as-judge patterns, and assessing quality.
When to Use
✅ Use this skill when:
- Creating evaluation rubrics for code/outputs (consistent scoring)
- Implementing LLM-as-judge patterns (automated review)
- Quality assessment frameworks (standardized criteria)
- Automated code review systems (scalable evaluation)
- Output validation and scoring (objective assessment)
- Creating grading criteria (5-level scoring guides)
- Time savings: 75% faster reviews (20→5 min per evaluation)
❌ Don't use this skill when:
- Simple pass/fail checks (use basic validation instead)
- Subjective aesthetic judgments (rubrics work best for objective criteria)
- Real-time interactive reviews (LLM-as-judge is async)
- Single evaluation (not worth rubric setup overhead)
LLM-as-Judge Pattern
Core Concept
Use an LLM to evaluate outputs based on defined criteria, providing:
- Structured scoring
- Consistent evaluation
- Detailed feedback
- Comparative analysis
Evaluation Template
from dataclasses import dataclass
from typing import List, Dict, Optional
from enum import Enum
class ScoreLevel(Enum):
    """Standardized score levels"""
    EXCELLENT = 5  # Exceeds all criteria
    GOOD = 4       # Meets all criteria well
    ADEQUATE = 3   # Meets minimum criteria
    POOR = 2       # Below minimum criteria
    FAILING = 1    # Does not meet criteria

@dataclass
class EvaluationCriterion:
    """Single evaluation criterion"""
    name: str
    description: str
    weight: float  # 0.0 - 1.0
    scoring_guide: Dict[ScoreLevel, str]

@dataclass
class EvaluationResult:
    """Result of evaluating one criterion"""
    criterion: str
    score: ScoreLevel
    justification: str
    examples: List[str]
    improvement_suggestions: List[str]

@dataclass
class OverallEvaluation:
    """Complete evaluation"""
    individual_scores: List[EvaluationResult]
    weighted_average: float
    overall_assessment: str
    strengths: List[str]
    weaknesses: List[str]
    actionable_feedback: List[str]
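The `weighted_average` field of `OverallEvaluation` combines per-criterion scores into one number. A minimal sketch of that calculation (plain integers stand in for `ScoreLevel` values; weights are assumed to sum to 1.0, with defensive normalization in case they do not):

```python
def weighted_average(scores: list, weights: list) -> float:
    """Combine per-criterion scores (1-5) into one weighted number."""
    total_weight = sum(weights)  # normalize defensively
    return sum(s * w for s, w in zip(scores, weights)) / total_weight

# Correctness=4 at 30%, Structure=5 at 20%, Documentation=3 at 50%
avg = weighted_average([4, 5, 3], [0.3, 0.2, 0.5])
print(round(avg, 2))  # 3.7
```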
LLM-as-Judge Prompt Template
# Evaluation Task
Evaluate the following OUTPUT against the specified CRITERIA.
## Output to Evaluate
{output_text}
## Evaluation Criteria
{criterion_1_name} (Weight: {weight}%)
- Excellent (5): {excellent_description}
- Good (4): {good_description}
- Adequate (3): {adequate_description}
- Poor (2): {poor_description}
- Failing (1): {failing_description}
{criterion_2_name} (Weight: {weight}%)
[...repeat for all criteria...]
## Required Response Format
For EACH criterion, provide:
1. **Score**: {1-5}
2. **Justification**: {detailed explanation referencing specific parts of output}
3. **Evidence**: {quote specific examples from output}
4. **Improvement Suggestions**: {actionable recommendations}
## Final Summary
- **Weighted Average Score**: {calculated from individual scores}
- **Overall Assessment**: {holistic evaluation}
- **Top 3 Strengths**: {bullet list}
- **Top 3 Weaknesses**: {bullet list}
- **Priority Improvements**: {ranked list of most impactful changes}
## Important
- Be objective and evidence-based
- Quote specific examples
- Provide actionable feedback
- Consider context and constraints
- Be consistent across evaluations
Code Quality Rubric
Criteria
1. Correctness (Weight: 30%)
- Excellent (5): Handles all cases including edge cases, no bugs
- Good (4): Handles main cases correctly, minor edge case issues
- Adequate (3): Core functionality works, some edge case bugs
- Poor (2): Core functionality has bugs
- Failing (1): Does not work as intended
2. Code Structure (Weight: 20%)
- Excellent (5): Well-organized, clear separation of concerns, DRY
- Good (4): Organized, minor repetition
- Adequate (3): Functional organization, some repetition
- Poor (2): Poor organization, significant repetition
- Failing (1): Unstructured, unmaintainable
3. Error Handling (Weight: 15%)
- Excellent (5): Comprehensive error handling with recovery, detailed messages
- Good (4): Good error handling, clear messages
- Adequate (3): Basic error handling present
- Poor (2): Minimal error handling
- Failing (1): No error handling
4. Documentation (Weight: 10%)
- Excellent (5): Comprehensive docs, examples, edge cases documented
- Good (4): Good documentation coverage
- Adequate (3): Basic documentation present
- Poor (2): Minimal documentation
- Failing (1): No documentation
5. Type Safety (Weight: 10%)
- Excellent (5): Full type hints, passes strict type checking
- Good (4): Good type coverage (>80%)
- Adequate (3): Basic type hints (>50%)
- Poor (2): Minimal type hints (<50%)
- Failing (1): No type hints
6. Performance (Weight: 10%)
- Excellent (5): Optimal algorithms, efficient implementation
- Good (4): Good performance, room for minor optimization
- Adequate (3): Acceptable performance
- Poor (2): Performance issues
- Failing (1): Unacceptable performance
7. Security (Weight: 5%)
- Excellent (5): Security best practices, input validation, no vulnerabilities
- Good (4): Good security practices
- Adequate (3): Basic security measures
- Poor (2): Security concerns present
- Failing (1): Critical security issues
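These seven weights sum to 100%. Before running evaluations it is worth checking that programmatically; a small sketch, with weights mirroring the rubric above and a hypothetical `validate_weights` helper:

```python
# Weights below mirror the Code Quality Rubric above.
code_quality_weights = {
    "Correctness": 0.30,
    "Code Structure": 0.20,
    "Error Handling": 0.15,
    "Documentation": 0.10,
    "Type Safety": 0.10,
    "Performance": 0.10,
    "Security": 0.05,
}

def validate_weights(weights: dict) -> float:
    """Raise if weights do not sum to 1.0; return the total otherwise."""
    total = sum(weights.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"Rubric weights sum to {total:.2f}, expected 1.00")
    return total

validate_weights(code_quality_weights)  # passes: weights sum to 100%
```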
Architecture Quality Rubric
1. Scalability (Weight: 25%)
- Excellent (5): Scales horizontally, handles 10x growth
- Good (4): Scales with minor modifications
- Adequate (3): Handles current load
- Poor (2): Scaling issues likely
- Failing (1): Cannot scale
2. Maintainability (Weight: 20%)
- Excellent (5): Clear boundaries, easy to modify, well-tested
- Good (4): Generally maintainable
- Adequate (3): Can be maintained with effort
- Poor (2): Difficult to maintain
- Failing (1): Unmaintainable
3. Observability (Weight: 15%)
- Excellent (5): Comprehensive metrics, logging, tracing
- Good (4): Good observability coverage
- Adequate (3): Basic logging/metrics
- Poor (2): Minimal observability
- Failing (1): No observability
4. Fault Tolerance (Weight: 15%)
- Excellent (5): Circuit breakers, retries, graceful degradation
- Good (4): Good error recovery
- Adequate (3): Basic error handling
- Poor (2): Poor fault tolerance
- Failing (1): No fault tolerance
5. Security (Weight: 15%)
- Excellent (5): Defense in depth, least privilege, validated inputs
- Good (4): Good security practices
- Adequate (3): Basic security
- Poor (2): Security gaps
- Failing (1): Critical vulnerabilities
6. Documentation (Weight: 10%)
- Excellent (5): Architecture diagrams, ADRs, runbooks
- Good (4): Good documentation
- Adequate (3): Basic documentation
- Poor (2): Minimal documentation
- Failing (1): No documentation
Multi-Agent System Rubric
1. Coordination Efficiency (Weight: 25%)
- Excellent (5): Minimal coordination overhead, async patterns
- Good (4): Efficient coordination
- Adequate (3): Acceptable coordination
- Poor (2): High coordination overhead
- Failing (1): Coordination bottleneck
2. Error Cascade Prevention (Weight: 20%)
- Excellent (5): Circuit breakers, bulkheads, timeouts everywhere
- Good (4): Good isolation
- Adequate (3): Basic isolation
- Poor (2): Error cascade risk
- Failing (1): No isolation
3. Token Economics (Weight: 15%)
- Excellent (5): Optimized token usage, checkpointing, compression
- Good (4): Good token management
- Adequate (3): Acceptable token usage
- Poor (2): High token consumption
- Failing (1): Excessive token waste
4. Observability (Weight: 15%)
- Excellent (5): Full tracing, agent state visibility, debug tools
- Good (4): Good observability
- Adequate (3): Basic logging
- Poor (2): Limited visibility
- Failing (1): No observability
5. Delegation Clarity (Weight: 15%)
- Excellent (5): Clear responsibilities, typed interfaces, boundaries
- Good (4): Clear delegation
- Adequate (3): Understandable delegation
- Poor (2): Unclear responsibilities
- Failing (1): Chaotic delegation
6. Checkpoint/Resume (Weight: 10%)
- Excellent (5): Comprehensive checkpointing, resume from any state
- Good (4): Good checkpoint coverage
- Adequate (3): Basic checkpointing
- Poor (2): Limited checkpointing
- Failing (1): No checkpointing
Evaluation Process
Step 1: Define Criteria
code_quality_criteria = [
    EvaluationCriterion(
        name="Correctness",
        description="Code works as intended, handles edge cases",
        weight=0.30,
        scoring_guide={
            ScoreLevel.EXCELLENT: "Handles all cases including edge cases, no bugs",
            ScoreLevel.GOOD: "Handles main cases correctly, minor edge case issues",
            ScoreLevel.ADEQUATE: "Core functionality works, some edge case bugs",
            ScoreLevel.POOR: "Core functionality has bugs",
            ScoreLevel.FAILING: "Does not work as intended",
        },
    ),
    # ... more criteria
]
Step 2: Generate Evaluation Prompt
def generate_evaluation_prompt(
    output: str,
    criteria: List[EvaluationCriterion]
) -> str:
    """Generate LLM-as-judge prompt"""
    prompt = f"""# Evaluation Task
Evaluate the following OUTPUT against the specified CRITERIA.

## Output to Evaluate
{output}

## Evaluation Criteria
"""
    for criterion in criteria:
        # :.0% renders 0.30 as "30%" and avoids float artifacts like 30.000000000000004
        prompt += f"\n{criterion.name} (Weight: {criterion.weight:.0%})\n"
        for level, description in criterion.scoring_guide.items():
            prompt += f"- {level.name.title()} ({level.value}): {description}\n"
    prompt += """
## Required Response Format
[Format instructions...]
"""
    return prompt
Step 3: Parse Evaluation Response
def parse_evaluation_response(
    response: str,
    criteria: List[EvaluationCriterion]
) -> OverallEvaluation:
    """Parse LLM evaluation response (minimal regex-based sketch)"""
    import re
    results = []
    for criterion in criteria:
        # Find "**Score**: N" after the criterion name; default to FAILING if absent
        match = re.search(
            rf"{re.escape(criterion.name)}.*?\*\*Score\*\*:\s*([1-5])",
            response, re.DOTALL,
        )
        score = ScoreLevel(int(match.group(1))) if match else ScoreLevel.FAILING
        results.append(EvaluationResult(criterion.name, score, "", [], []))
    weighted = sum(r.score.value * c.weight for r, c in zip(results, criteria))
    return OverallEvaluation(results, weighted, "", [], [], [])
Step 4: Generate Report
def generate_evaluation_report(evaluation: OverallEvaluation) -> str:
    """Generate human-readable report"""
    # Parameter is named `evaluation` to avoid shadowing the `eval` builtin
    report = f"""
# Evaluation Report

**Overall Score**: {evaluation.weighted_average:.2f}/5.0

## Strengths
{chr(10).join(f"- {s}" for s in evaluation.strengths)}

## Weaknesses
{chr(10).join(f"- {w}" for w in evaluation.weaknesses)}

## Detailed Scores
"""
    for result in evaluation.individual_scores:
        report += f"""
### {result.criterion}
**Score**: {result.score.value}/5 ({result.score.name})
**Justification**: {result.justification}
**Examples**:
{chr(10).join(f"- {ex}" for ex in result.examples)}
**Improvements**:
{chr(10).join(f"- {imp}" for imp in result.improvement_suggestions)}
"""
    return report
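The four steps can be sketched end to end. Everything below is illustrative: `judge` is a stub standing in for a real LLM call, and the prompt builder and parser are simplified miniatures of the functions above:

```python
import re
from typing import Dict

def build_prompt(output: str, criteria: Dict[str, float]) -> str:
    """Step 2 in miniature: criteria are name -> weight."""
    lines = ["# Evaluation Task", "## Output", output, "## Criteria"]
    lines += [f"- {name} (Weight: {w:.0%})" for name, w in criteria.items()]
    return "\n".join(lines)

def judge(prompt: str) -> str:
    """Stub standing in for a real LLM call."""
    return "Correctness\n**Score**: 4\nClarity\n**Score**: 3"

def parse_scores(response: str, criteria: Dict[str, float]) -> Dict[str, int]:
    """Step 3 in miniature: extract one score per criterion."""
    scores = {}
    for name in criteria:
        m = re.search(rf"{re.escape(name)}\s*\n\*\*Score\*\*:\s*([1-5])", response)
        scores[name] = int(m.group(1)) if m else 1
    return scores

criteria = {"Correctness": 0.6, "Clarity": 0.4}
response = judge(build_prompt("def add(a, b): return a + b", criteria))
scores = parse_scores(response, criteria)
weighted = sum(scores[n] * w for n, w in criteria.items())
print(scores, round(weighted, 2))  # {'Correctness': 4, 'Clarity': 3} 3.6
```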
Comparative Evaluation
For comparing multiple implementations:
@dataclass
class ComparativeEvaluation:
    """Compare multiple outputs"""
    outputs: List[str]
    criteria: List[EvaluationCriterion]
    individual_evaluations: List[OverallEvaluation]
    rankings: Dict[str, int]  # output_id -> rank
    best_practices: List[str]
    common_issues: List[str]
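A sketch of how the `rankings` field might be populated from weighted scores (the scores themselves are illustrative):

```python
# Rank implementations by weighted score, highest first (rank 1 = best).
scores_by_output = {"impl_a": 3.4, "impl_b": 4.1, "impl_c": 3.9}

ranked = sorted(scores_by_output, key=scores_by_output.get, reverse=True)
rankings = {output_id: rank for rank, output_id in enumerate(ranked, start=1)}
print(rankings)  # {'impl_b': 1, 'impl_c': 2, 'impl_a': 3}
```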
Executable Scripts
See core/llm_as_judge.py for LLM-as-judge implementation.
See core/rubric_generator.py for rubric generation utilities.
Best Practices
✅ DO
- Define clear criteria - Specific, measurable, actionable
- Use weighted scoring - Reflect importance of criteria
- Provide scoring guides - Clear descriptions for each level
- Require evidence - Quote specific examples
- Give actionable feedback - Specific improvement suggestions
- Be consistent - Apply same standards across evaluations
❌ DON'T
- Don't use vague criteria - "Good code" is not measurable
- Don't skip justifications - Always explain scores
- Don't ignore context - Consider constraints and requirements
- Don't be subjective - Base on evidence, not preference
- Don't provide only scores - Include improvement guidance
Integration with T2
Use cases in T2:
- Code review automation (evaluate PRs)
- Agent output validation (ensure quality)
- Architecture assessment (evaluate designs)
- Documentation quality checks
Example integration:
// Evaluate agent output before accepting
let evaluation = evaluate_agent_output(
    agent_output,
    &evaluation_criteria,
    llm_service,
).await?;

if evaluation.weighted_average < 3.0 {
    // Reject and request improvements
    return Err(AgentError::OutputQualityTooLow {
        score: evaluation.weighted_average,
        issues: evaluation.weaknesses,
    });
}
Templates
See templates/evaluation_rubrics.md for pre-built rubrics.
Rubric Selection Guide
Choose the right rubric based on what you're evaluating:
| Evaluating | Recommended Rubric | Key Criteria | Weight Distribution |
|---|---|---|---|
| Code (functions/modules) | Code Quality Rubric | Correctness, Structure, Error Handling | Correctness 30%, Structure 20% |
| System architecture | Architecture Quality Rubric | Scalability, Maintainability, Fault Tolerance | Scalability 25%, Maintainability 20% |
| Multi-agent output | Multi-Agent System Rubric | Coordination, Token Economics, Delegation | Coordination 25%, Error Cascade 20% |
| Documentation | Documentation Rubric | Completeness, Accuracy, Clarity | Completeness 35%, Accuracy 30% |
| API design | API Quality Rubric | Consistency, Usability, Performance | Consistency 30%, Usability 25% |
| Test coverage | Testing Rubric | Coverage, Edge Cases, Maintainability | Coverage 35%, Edge Cases 25% |
Rubric Customization Decision Tree:
Start with standard rubric
│
├── Domain-specific requirements?
│ └── Yes → Add domain criteria (security, compliance, etc.)
│
├── Team has specific quality gates?
│ └── Yes → Adjust weights to match gates
│
├── Comparing implementations?
│ └── Yes → Use ComparativeEvaluation with identical criteria
│
└── Single evaluation or batch?
├── Single → Full rubric, detailed feedback
└── Batch → Simplified rubric, aggregate scoring
Minimum Viable Rubric (3 criteria):
- Correctness (40%) - Does it work as intended?
- Quality (35%) - Is it well-structured and maintainable?
- Completeness (25%) - Does it cover all requirements?
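The minimum viable rubric can be expressed as data; names and weights mirror the list above, and the `question` field is illustrative:

```python
minimum_viable_rubric = [
    {"name": "Correctness", "weight": 0.40,
     "question": "Does it work as intended?"},
    {"name": "Quality", "weight": 0.35,
     "question": "Is it well-structured and maintainable?"},
    {"name": "Completeness", "weight": 0.25,
     "question": "Does it cover all requirements?"},
]

# Sanity check: the three weights must cover 100%
assert abs(sum(c["weight"] for c in minimum_viable_rubric) - 1.0) < 1e-9
```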
Multi-Context Window Support
This skill supports long-running evaluation workflows across multiple context windows using Claude 4.5's enhanced state management capabilities.
State Tracking
Evaluation Progress State (JSON):
{
  "checkpoint_id": "ckpt_20251129_153000",
  "evaluations_completed": [
    {"target": "code_quality", "score": 4.2, "status": "complete"},
    {"target": "architecture", "score": 3.8, "status": "complete"},
    {"target": "multi_agent", "score": 0.0, "status": "pending"}
  ],
  "rubrics_created": ["code_quality", "architecture", "multi_agent_system"],
  "llm_judge_results": {
    "total_evaluations": 15,
    "average_score": 4.0,
    "improvement_suggestions": 42
  },
  "token_usage": 12000,
  "created_at": "2025-11-29T15:30:00Z"
}
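A sketch of saving and reloading this state; the field names follow the example above, while the temp-file path and helper names are illustrative:

```python
import json
import os
import tempfile

state = {
    "checkpoint_id": "ckpt_20251129_153000",
    "evaluations_completed": [
        {"target": "code_quality", "score": 4.2, "status": "complete"},
        {"target": "multi_agent", "score": 0.0, "status": "pending"},
    ],
}

def save_checkpoint(state: dict, path: str) -> None:
    """Write checkpoint state as pretty-printed JSON."""
    with open(path, "w") as f:
        json.dump(state, f, indent=2)

def pending(state: dict) -> list:
    """List evaluation targets that still need to run."""
    return [e["target"] for e in state["evaluations_completed"]
            if e["status"] == "pending"]

path = os.path.join(tempfile.gettempdir(), "evaluation-latest.json")
save_checkpoint(state, path)
with open(path) as f:
    loaded = json.load(f)
print(pending(loaded))  # ['multi_agent']
```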
Progress Notes (Markdown):
# Evaluation Framework Progress - 2025-11-29
## Completed
- Code quality rubric created with 7 criteria
- 15 evaluations run with LLM-as-judge
- Architecture assessment rubric defined
## In Progress
- Running multi-agent system evaluations
- Generating improvement reports
## Next Actions
- Complete multi-agent evaluations (3 remaining)
- Generate consolidated improvement report
- Create comparative analysis across evaluations
Session Recovery
When starting a fresh context window after evaluation work:
- Load Checkpoint State: Read `.coditect/checkpoints/evaluation-latest.json`
- Review Progress Notes: Check `evaluation-progress.md` for context
- Verify Rubrics: Review created rubrics for completeness
- Check Evaluation Results: Load JSON results from completed evaluations
- Resume Evaluations: Continue from last pending evaluation
Recovery Commands:
# 1. Check latest checkpoint
cat .coditect/checkpoints/evaluation-latest.json | jq '.evaluations_completed'
# 2. Review progress
tail -30 evaluation-progress.md
# 3. Check evaluation results
cat evaluation-results.json | jq '.llm_judge_results'
# 4. List pending evaluations
cat .coditect/checkpoints/evaluation-latest.json | jq '.evaluations_completed[] | select(.status=="pending")'
# 5. Resume from next pending evaluation
# Continue evaluation workflow
State Management Best Practices
Checkpoint Files (JSON Schema):
- Store in `.coditect/checkpoints/evaluation-{timestamp}.json`
- Include evaluation results with scores and justifications
- Track rubric definitions and criteria weights
- Record improvement suggestions generated
Progress Tracking (Markdown Narrative):
- Maintain `evaluation-progress.md` with evaluation status
- Document rubric design decisions
- Note unexpected evaluation results for review
- List next evaluations to run
Git Integration:
- Create checkpoint after major rubric creation
- Commit results with: `docs(eval): Add code quality evaluation results`
- Tag evaluation milestones: `git tag eval-batch-1-complete`
Progress Checkpoints
Natural Breaking Points:
- After each rubric created and validated
- After batch of 5-10 evaluations completed
- After improvement reports generated
- Before comparative analysis phase
- After all evaluations validated by human reviewer
Checkpoint Creation Pattern:
# Automatic checkpoint after an evaluation batch
if evaluations_complete >= 10 or improvement_suggestions > 30:
    create_checkpoint({
        "evaluations": completed_evaluations,
        "rubrics": created_rubrics,
        "results": llm_judge_output,
        "tokens": current_tokens,
    })
Example: Multi-Context Evaluation Workflow
Context Window 1: Rubric Creation & Initial Evaluations
{
  "checkpoint_id": "ckpt_eval_batch1",
  "phase": "initial_evaluations_complete",
  "rubrics": ["code_quality", "architecture"],
  "evaluations": 10,
  "next_action": "Run multi-agent evaluations",
  "token_usage": 7500
}
Context Window 2: Remaining Evaluations & Reports
# Load checkpoint
cat .coditect/checkpoints/ckpt_eval_batch1.json
# Continue with multi-agent evaluations
# Token savings: ~4500 tokens (rubrics already defined, results cached)
Token Savings Analysis:
- Without checkpoint: 12000 tokens (re-create rubrics + re-run evaluations)
- With checkpoint: 7500 tokens (resume from cached results)
- Savings: 38% reduction (12000 → 7500 tokens)
Success Output
When successful, this skill MUST output:
✅ SKILL COMPLETE: evaluation-framework
Completed:
- [x] Code quality rubric created with 7 weighted criteria
- [x] Architecture quality rubric with scalability focus defined
- [x] LLM-as-judge prompt templates operational
- [x] Evaluation pipeline with automated scoring deployed
- [x] Comparative analysis framework for multiple implementations functional
- [x] 15 code evaluations completed with actionable feedback
Outputs:
- rubrics/code-quality-rubric.json (7 criteria with 5-level scoring)
- rubrics/architecture-rubric.json (6 criteria with scalability focus)
- prompts/llm-judge-template.md (Structured evaluation prompt)
- src/evaluation/llm_judge.py (LLM-as-judge implementation)
- src/evaluation/rubric_generator.py (Rubric creation utilities)
- reports/evaluation-results.json (15 evaluations with scores)
Evaluation Metrics:
- Average evaluation score: 4.0/5.0 (80% quality threshold met)
- Consistency: 92% (scores within ±0.3 across similar code)
- Actionable feedback rate: 95% (concrete improvement suggestions)
- Time savings: 75% faster than manual review (20min → 5min)
Completion Checklist
Before marking this skill as complete, verify:
- Rubrics define 5-level scoring (Excellent, Good, Adequate, Poor, Failing)
- Each criterion has weight (sum to 100%)
- Scoring guide provides specific, measurable descriptions per level
- LLM-as-judge prompt includes output format requirements
- Prompt requires justification with evidence/examples
- Evaluation parser extracts scores from LLM response
- Weighted average calculation correct (score * weight summed)
- Individual criterion results include justification and improvement suggestions
- Comparative evaluation ranks multiple outputs correctly
- Evaluation reports human-readable (markdown format)
- Quality gates reject outputs below threshold (e.g., <3.0/5.0)
- All outputs exist at expected locations and pass validation
Failure Indicators
This skill has FAILED if:
- ❌ Rubric criteria vague or non-measurable ("good code" without specifics)
- ❌ Scoring levels overlap or unclear (cannot distinguish Adequate from Good)
- ❌ LLM-as-judge provides scores without justification
- ❌ Evaluation results inconsistent (same code scores differently on re-evaluation)
- ❌ No evidence/examples quoted from evaluated output
- ❌ Improvement suggestions generic ("make it better") instead of actionable
- ❌ Weighted average calculation incorrect
- ❌ Parser fails to extract scores from LLM response
- ❌ Quality gates not enforced (low-quality outputs accepted)
- ❌ Comparative evaluation ranks incorrectly
- ❌ No verification of evaluation accuracy (spot-checking by human)
When NOT to Use
Do NOT use this skill when:
- Simple pass/fail check sufficient (unit tests, linters)
- Subjective aesthetic judgments required (UI design preferences)
- Real-time interactive review needed (pair programming, live code review)
- Single one-off evaluation (rubric setup overhead not justified)
- Evaluation criteria cannot be objectified (art, creative writing)
- Deterministic scoring required (use static analysis tools instead)
- Human expert review mandated (security audits, legal compliance)
Alternative approaches:
- Pass/fail only: Use automated tests, linters, type checkers
- Subjective review: Human code review with qualitative feedback
- Real-time: Live pairing session, interactive Q&A
- Deterministic: Static analysis tools (SonarQube, ESLint)
- Single evaluation: Manual review without rubric formalization
Anti-Patterns (Avoid)
| Anti-Pattern | Problem | Solution |
|---|---|---|
| Vague criteria ("good quality") | Not measurable, inconsistent scoring | Define specific, objective criteria with examples |
| No scoring guide | LLM assigns arbitrary scores | Provide 5-level descriptions for each criterion |
| Missing justification requirement | Scores lack credibility | Mandate evidence and quotes from evaluated output |
| Equal weighting for all criteria | Misrepresents importance | Weight criteria by significance (e.g., correctness 30%, docs 10%) |
| Generic improvement suggestions | Not actionable | Require specific recommendations with examples |
| No consistency checks | Same code scores differently | Run evaluation multiple times, verify variance < threshold |
| Ignoring context/constraints | Unfair evaluation | Include context (time limits, requirements) in prompt |
| No human validation | LLM errors go undetected | Spot-check 10% of evaluations manually |
| Returning only scores | Loses learning opportunity | Always include strengths, weaknesses, and improvements |
| Evaluation without baseline | Cannot measure improvement | Establish baseline score, track progress over time |
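The consistency anti-pattern above can be checked mechanically: run the same evaluation several times and verify the spread stays within a band. A sketch, using the ±0.3 band from this document's consistency metric as a standard-deviation threshold (a simplifying assumption):

```python
import statistics

def is_consistent(scores: list, max_stdev: float = 0.3) -> bool:
    """True if repeated evaluation scores stay within the variance band."""
    return statistics.stdev(scores) <= max_stdev

print(is_consistent([4.0, 4.1, 3.9]))  # True: stdev = 0.1
print(is_consistent([2.0, 4.5, 3.0]))  # False: spread far exceeds 0.3
```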
Principles
This skill embodies:
- #1 Evidence-Based Assessment - Every score backed by quoted examples from evaluated output
- #2 Consistency Through Structure - Rubrics ensure repeatable, objective evaluations
- #5 Eliminate Ambiguity - Clear 5-level scoring guides remove subjective interpretation
- #6 Clear, Understandable, Explainable - Justifications explain reasoning behind scores
- #7 Optimize for Context - Evaluation criteria adapt to domain (code vs architecture vs multi-agent)
- #8 No Assumptions - Verify evaluation accuracy through spot-checking and variance analysis
- #10 Automation First - LLM-as-judge automates 75% of review time
- #11 Continuous Improvement - Actionable feedback drives iterative enhancement
Full Standard: CODITECT-STANDARD-AUTOMATION.md