
Self-Improving Eval Loop Skill (H.3.6)

Automated evaluation and improvement cycle where Claude:

  1. Runs evals against current prompt/configuration
  2. Computes F1 scores and identifies failures
  3. Uses a critic agent to analyze failure patterns
  4. Proposes prompt and/or eval improvements
  5. Iterates until metrics stabilize or target reached
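The five steps above can be sketched as a single loop. This is a minimal illustration, not the actual script API: the callback names (`run_evals`, `critic`, `apply_changes`) and return shapes are assumptions.

```python
def improvement_loop(prompt, evals, run_evals, critic, apply_changes,
                     target_f1=0.90, max_rounds=5):
    """Iterate eval -> critique -> apply until target F1 or max rounds.

    run_evals(prompt, evals) -> (f1, failures)   # steps 1-2
    critic(prompt, failures) -> proposal          # step 3
    apply_changes(prompt, evals, proposal) -> (prompt, evals)  # step 4
    """
    history = []
    for _ in range(max_rounds):
        f1, failures = run_evals(prompt, evals)
        history.append(f1)
        if f1 >= target_f1:          # step 5: stop once the target is reached
            break
        proposal = critic(prompt, failures)
        prompt, evals = apply_changes(prompt, evals, proposal)
    return prompt, evals, history
```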

When to Use This Skill

  • Prompt Engineering: Iteratively improve classification prompts
  • Eval Development: Expand and refine evaluation test cases
  • Quality Gates: Automated quality checks in CI/CD
  • Regression Testing: Catch quality degradation early

Quick Start

# Run single eval round (no improvement)
python skills/self-improving-eval/scripts/eval_runner.py \
  --eval-path data/evals.jsonl \
  --prompt-path prompts/classifier.txt

# Run improvement loop
python skills/self-improving-eval/scripts/eval_loop.py \
  --eval-path data/evals.jsonl \
  --prompt-path prompts/classifier.txt \
  --rounds 5 \
  --target-f1 0.90

# CI mode (fail on low score)
python skills/self-improving-eval/scripts/run_ci.py \
  --eval-path data/evals.jsonl \
  --prompt-path prompts/classifier.txt \
  --min-f1 0.85

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    EVAL IMPROVEMENT LOOP                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌────────────┐    ┌────────────┐    ┌─────────────┐        │
│  │ Load Evals │───▶│ Run Model  │───▶│  Score F1   │        │
│  │  (JSONL)   │    │ (Anthropic)│    │(micro/macro)│        │
│  └────────────┘    └────────────┘    └──────┬──────┘        │
│                                             │               │
│               ┌─────────────▼───────────┐                   │
│               │   Target F1 Reached?    │                   │
│               └────────────┴────────────┘                   │
│                NO ◄────────┴────────► YES                   │
│                 │                      │                    │
│                 ▼                      ▼                    │
│  ┌────────────┐    ┌────────────┐         ┌──────────┐      │
│  │  Applier   │◄───│   Critic   │         │   DONE   │      │
│  │ (Updates)  │    │ (Analysis) │         │  Return  │      │
│  └──────┬─────┘    └────────────┘         └──────────┘      │
│         │                                                   │
│         ▼                                                   │
│  ┌──────────────────────┐                                   │
│  │  Next Round (loop)   │                                   │
│  └──────────────────────┘                                   │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Components

1. Eval Runner (scripts/eval_runner.py)

  • Loads JSONL eval cases with input and gold fields
  • Calls Anthropic API with prompt template
  • Computes per-case and aggregate metrics (accuracy, F1)
  • Returns structured EvalResult objects
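A plausible shape for those result objects, sketched as dataclasses. The field names here are illustrative, not the script's actual API:

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """Outcome of one eval case: what went in, what was expected, what came out."""
    input: dict
    gold: dict
    prediction: dict
    correct: bool


@dataclass
class EvalResult:
    """Aggregate over all cases in one eval round."""
    cases: list

    @property
    def accuracy(self) -> float:
        if not self.cases:
            return 0.0
        return sum(c.correct for c in self.cases) / len(self.cases)
```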

2. Critic Agent (scripts/critic_agent.py)

  • Analyzes failure patterns from worst-K cases
  • Proposes structured improvements as JSON:
    • new_prompt: Improved prompt template
    • updated_evals: New/fixed eval cases
  • Uses constrained JSON output to avoid drift
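The "constrained JSON" guard can be as simple as strict key validation on the critic's reply. A sketch, with the key names mirroring the fields above:

```python
import json

# Keys the critic is allowed (and required) to emit, per the schema above.
REQUIRED_KEYS = {"new_prompt", "updated_evals"}


def parse_proposal(raw: str) -> dict:
    """Parse the critic's JSON reply and reject any schema drift."""
    proposal = json.loads(raw)
    missing = REQUIRED_KEYS - proposal.keys()
    if missing:
        raise ValueError(f"proposal missing keys: {sorted(missing)}")
    extra = proposal.keys() - REQUIRED_KEYS
    if extra:
        raise ValueError(f"unexpected keys (schema drift): {sorted(extra)}")
    return proposal
```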

3. Improvement Applier (scripts/improvement_applier.py)

  • Validates proposed changes
  • Applies prompt updates
  • Merges eval case additions/corrections
  • Maintains hold-out set integrity
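Hold-out integrity might be enforced by refusing any merge that touches reserved cases. A sketch; keying cases by their input text is an assumption:

```python
def merge_evals(existing: list, additions: list, holdout_keys: set) -> list:
    """Merge critic-proposed eval cases, never letting them touch the hold-out set."""
    merged = list(existing)
    seen = {case["input"]["text"] for case in existing}
    for case in additions:
        key = case["input"]["text"]
        if key in holdout_keys:
            continue  # hold-out cases stay exactly as the critic never saw them
        if key not in seen:
            merged.append(case)
            seen.add(key)
    return merged
```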

4. Eval Loop Orchestrator (scripts/eval_loop.py)

  • Main improvement cycle
  • Tracks F1 per iteration
  • Stops on convergence or max rounds
  • Logs all changes for auditability
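"Stops on convergence" could mean checking whether recent F1 deltas fall under a tolerance; the window and tolerance values below are illustrative, not the orchestrator's real defaults:

```python
def has_converged(f1_history: list, window: int = 3, tol: float = 0.005) -> bool:
    """True when the last `window` F1 scores span less than `tol`."""
    if len(f1_history) < window:
        return False
    recent = f1_history[-window:]
    return max(recent) - min(recent) < tol
```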

5. CI Integration (scripts/run_ci.py)

  • Single-round eval for CI gates
  • Fails job if score below threshold
  • Outputs machine-readable results

JSONL Format

{"input": {"text": "Document content here..."}, "gold": {"classification": "skill", "confidence": 0.95}}
{"input": {"text": "Another document..."}, "gold": {"classification": "command", "confidence": 0.90}}

Metrics

Metric     Description                   Usage
Micro F1   Global TP/FP/FN aggregation   Overall accuracy
Macro F1   Per-class average             Class balance
Accuracy   Exact match ratio             Simple threshold

Best Practices

  1. Separate App vs Critic Model: Run evals with the application model (the same or a cheaper one); keep the critic a distinct model
  2. Hold-out Set: Keep test cases critic never sees
  3. Version Everything: Track prompt/eval changes in git
  4. Track Per-Category: Catch regressions even when global F1 improves
  5. Guard Against Overfitting: Check for hard-coded or brittle patterns

Configuration

Environment variables:

  • ANTHROPIC_API_KEY: Anthropic API key
  • OPENAI_API_KEY: OpenAI API key (optional)
  • EVAL_MODEL: Model to use (default: claude-sonnet-4)