# Self-Improving Eval Loop Skill (H.3.6)
An automated evaluation-and-improvement cycle in which Claude:
- Runs evals against current prompt/configuration
- Computes F1 scores and identifies failures
- Uses a critic agent to analyze failure patterns
- Proposes prompt and/or eval improvements
- Iterates until metrics stabilize or target reached
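The cycle above can be sketched as a small driver function. This is a minimal illustration, not the actual implementation in `scripts/eval_loop.py`; `improvement_loop`, `run_evals`, and `propose_fix` are hypothetical names:

```python
from typing import Callable

def improvement_loop(
    run_evals: Callable[[str], float],         # prompt -> F1 score (hypothetical)
    propose_fix: Callable[[str, float], str],  # critic: (prompt, f1) -> new prompt (hypothetical)
    prompt: str,
    target_f1: float = 0.90,
    max_rounds: int = 5,
) -> tuple[str, float]:
    """Iterate eval -> critique -> apply until the target F1 or max rounds."""
    f1 = run_evals(prompt)
    for _ in range(max_rounds):
        if f1 >= target_f1:
            break  # target reached, stop iterating
        prompt = propose_fix(prompt, f1)  # critic proposes an improved prompt
        f1 = run_evals(prompt)            # re-score the updated prompt
    return prompt, f1
```

The real loop also merges eval-case updates and logs each round; this sketch shows only the stopping logic.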
## When to Use This Skill
- Prompt Engineering: Iteratively improve classification prompts
- Eval Development: Expand and refine evaluation test cases
- Quality Gates: Automated quality checks in CI/CD
- Regression Testing: Catch quality degradation early
## Quick Start
```bash
# Run a single eval round (no improvement)
python skills/self-improving-eval/scripts/eval_runner.py \
  --eval-path data/evals.jsonl \
  --prompt-path prompts/classifier.txt

# Run the improvement loop
python skills/self-improving-eval/scripts/eval_loop.py \
  --eval-path data/evals.jsonl \
  --prompt-path prompts/classifier.txt \
  --rounds 5 \
  --target-f1 0.90

# CI mode (fail on low score)
python skills/self-improving-eval/scripts/run_ci.py \
  --eval-path data/evals.jsonl \
  --prompt-path prompts/classifier.txt \
  --min-f1 0.85
```
## Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│                     EVAL IMPROVEMENT LOOP                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐       │
│  │  Load Evals  │───▶│  Run Model   │───▶│   Score F1   │       │
│  │   (JSONL)    │    │ (Anthropic)  │    │ (micro/mac)  │       │
│  └──────────────┘    └──────────────┘    └──────┬───────┘       │
│                                                 │               │
│                            ┌────────────────────▼─────────┐     │
│                            │     Target F1 Reached?       │     │
│                            └────────────────┬─────────────┘     │
│                                 NO ◄────────┴───────► YES       │
│                                 │                      │        │
│                                 ▼                      ▼        │
│  ┌──────────────┐    ┌──────────────┐           ┌──────────┐    │
│  │   Applier    │◄───│    Critic    │           │   DONE   │    │
│  │  (Updates)   │    │  (Analysis)  │           │  Return  │    │
│  └──────┬───────┘    └──────────────┘           └──────────┘    │
│         │                                                       │
│         └──────────────────────┐                                │
│                                ▼                                │
│               ┌────────────────────────┐                        │
│               │   Next Round (loop)    │                        │
│               └────────────────────────┘                        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
## Components
### 1. Eval Runner (`scripts/eval_runner.py`)

- Loads JSONL eval cases with `input` and `gold` fields
- Calls the Anthropic API with the prompt template
- Computes per-case and aggregate metrics (accuracy, F1)
- Returns structured `EvalResult` objects
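To make the data flow concrete, here is a minimal sketch of what an `EvalResult` and a per-case accuracy aggregate might look like. The field names are assumptions for illustration; the actual class lives in `scripts/eval_runner.py`:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    """One scored eval case (illustrative fields, not the real schema)."""
    case_id: int
    predicted: str  # model's classification
    gold: str       # expected classification

    @property
    def correct(self) -> bool:
        return self.predicted == self.gold

def accuracy(results: list[EvalResult]) -> float:
    """Exact-match ratio over all cases."""
    return sum(r.correct for r in results) / len(results) if results else 0.0
```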
### 2. Critic Agent (`scripts/critic_agent.py`)

- Analyzes failure patterns from the worst-K cases
- Proposes structured improvements as JSON:
  - `new_prompt`: Improved prompt template
  - `updated_evals`: New/fixed eval cases
- Uses constrained JSON output to avoid drift
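"Constrained JSON output" means the critic's reply must match a fixed schema before it is applied. A minimal validator sketch (the key set matches the fields listed above; the function name is hypothetical):

```python
import json

# The two fields the critic is allowed to propose, per the schema above.
REQUIRED_KEYS = {"new_prompt", "updated_evals"}

def parse_critic_proposal(raw: str) -> dict:
    """Parse the critic's JSON reply, rejecting missing fields and schema drift."""
    proposal = json.loads(raw)
    missing = REQUIRED_KEYS - proposal.keys()
    if missing:
        raise ValueError(f"critic proposal missing keys: {missing}")
    extra = proposal.keys() - REQUIRED_KEYS
    if extra:
        raise ValueError(f"unexpected keys (schema drift): {extra}")
    return proposal
```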
### 3. Improvement Applier (`scripts/improvement_applier.py`)

- Validates proposed changes
- Applies prompt updates
- Merges eval case additions/corrections
- Maintains hold-out set integrity
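One way to maintain hold-out integrity is to refuse any critic-proposed edit that touches a hold-out case. A sketch under assumed data shapes (cases keyed by an `id` field, which is an assumption, not the real format):

```python
def merge_eval_updates(existing: list[dict], updates: list[dict],
                       holdout_ids: set) -> list[dict]:
    """Merge critic-proposed eval cases, refusing edits to hold-out cases."""
    merged = {c["id"]: c for c in existing}
    for case in updates:
        if case["id"] in holdout_ids:
            # The critic must never see or modify hold-out cases.
            raise ValueError(f"refusing to modify hold-out case {case['id']}")
        merged[case["id"]] = case  # add new case or overwrite a correction
    return list(merged.values())
```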
### 4. Eval Loop Orchestrator (`scripts/eval_loop.py`)

- Main improvement cycle
- Tracks F1 per iteration
- Stops on convergence or max rounds
- Logs all changes for auditability
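"Convergence" here means the F1 score has stopped moving meaningfully. A plausible check, with `patience` and `epsilon` as illustrative parameters (not the orchestrator's actual knobs):

```python
def converged(history: list[float], patience: int = 2,
              epsilon: float = 0.005) -> bool:
    """True when the last `patience` rounds each changed F1 by less than epsilon."""
    if len(history) <= patience:
        return False  # not enough rounds to judge
    recent = history[-(patience + 1):]
    return all(abs(b - a) < epsilon for a, b in zip(recent, recent[1:]))
```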
### 5. CI Integration (`scripts/run_ci.py`)

- Single-round eval for CI gates
- Fails the job if the score is below the threshold
- Outputs machine-readable results
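The gate reduces to an exit code plus a JSON summary on stdout. A sketch of the idea (`ci_gate` is a hypothetical name, and the output fields are illustrative):

```python
import json

def ci_gate(f1: float, min_f1: float) -> int:
    """Print a machine-readable result and return a CI exit code (0 = pass)."""
    passed = f1 >= min_f1
    print(json.dumps({"f1": round(f1, 4), "min_f1": min_f1, "passed": passed}))
    return 0 if passed else 1
```

In CI, the script would call `sys.exit(ci_gate(...))` so a failing score fails the job.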
## JSONL Format
```jsonl
{"input": {"text": "Document content here..."}, "gold": {"classification": "skill", "confidence": 0.95}}
{"input": {"text": "Another document..."}, "gold": {"classification": "command", "confidence": 0.90}}
```
## Metrics
| Metric | Description | Usage |
|---|---|---|
| Micro F1 | Global TP/FP/FN aggregation | Overall accuracy |
| Macro F1 | Per-class average | Class balance |
| Accuracy | Exact match ratio | Simple threshold |
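The micro/macro distinction in the table can be shown in a few lines. This sketch assumes single-label classification over `(gold, predicted)` pairs; note that in that setting micro F1 equals plain accuracy, while macro F1 weights every class equally:

```python
from collections import Counter

def micro_macro_f1(pairs: list[tuple[str, str]]) -> tuple[float, float]:
    """pairs: (gold, predicted) labels. Returns (micro F1, macro F1)."""
    labels = {g for g, _ in pairs} | {p for _, p in pairs}
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in pairs:
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # predicted class gets a false positive
            fn[gold] += 1  # gold class gets a false negative

    def f1(t, f_p, f_n):
        denom = 2 * t + f_p + f_n
        return 2 * t / denom if denom else 0.0

    # Micro: pool TP/FP/FN globally.  Macro: average per-class F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro
```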
## Best Practices
- Separate App vs. Critic Model: Run evals with the same (or a cheaper) model than the critic
- Hold-out Set: Keep test cases critic never sees
- Version Everything: Track prompt/eval changes in git
- Track Per-Category: Catch regressions even when global F1 improves
- Guard Against Overfitting: Check for hard-coded or brittle patterns
## Configuration
Environment variables:
- `ANTHROPIC_API_KEY`: Anthropic API key
- `OPENAI_API_KEY`: OpenAI API key (optional)
- `EVAL_MODEL`: Model to use (default: `claude-sonnet-4`)
## Related

- `/moe-judges` - Multi-model judge panel
- `persona_loader.py` - Judge personas
- `multi_model_client.py` - Multi-provider client