
Self-Improving Eval Loop Skill (H.3.6)

Automated evaluation and improvement cycle where Claude:

  1. Runs evals against current prompt/configuration
  2. Computes F1 scores and identifies failures
  3. Uses a critic agent to analyze failure patterns
  4. Proposes prompt and/or eval improvements
  5. Iterates until metrics stabilize or target reached
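The five steps above can be sketched as a single loop. This is a minimal illustration, not the actual script API: the callback names (`run_evals`, `critic`, `apply_changes`) and return shapes are assumptions.

```python
def improvement_loop(prompt, evals, run_evals, critic, apply_changes,
                     target_f1=0.90, max_rounds=5):
    """Iterate eval -> critique -> apply until target F1 or max rounds.

    run_evals(prompt, evals) -> (f1, failures)   # steps 1-2
    critic(prompt, failures) -> proposal          # step 3
    apply_changes(prompt, evals, proposal) -> (prompt, evals)  # step 4
    """
    history = []
    for _ in range(max_rounds):
        f1, failures = run_evals(prompt, evals)
        history.append(f1)
        if f1 >= target_f1:          # step 5: stop once the target is reached
            break
        proposal = critic(prompt, failures)
        prompt, evals = apply_changes(prompt, evals, proposal)
    return prompt, evals, history
```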

When to Use This Skill

  • Prompt Engineering: Iteratively improve classification prompts
  • Eval Development: Expand and refine evaluation test cases
  • Quality Gates: Automated quality checks in CI/CD
  • Regression Testing: Catch quality degradation early

Quick Start

# Run single eval round (no improvement)
python skills/self-improving-eval/scripts/eval_runner.py \
  --eval-path data/evals.jsonl \
  --prompt-path prompts/classifier.txt

# Run improvement loop
python skills/self-improving-eval/scripts/eval_loop.py \
  --eval-path data/evals.jsonl \
  --prompt-path prompts/classifier.txt \
  --rounds 5 \
  --target-f1 0.90

# CI mode (fail on low score)
python skills/self-improving-eval/scripts/run_ci.py \
  --eval-path data/evals.jsonl \
  --prompt-path prompts/classifier.txt \
  --min-f1 0.85

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    EVAL IMPROVEMENT LOOP                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌────────────┐    ┌────────────┐    ┌─────────────┐        │
│  │ Load Evals │───▶│ Run Model  │───▶│  Score F1   │        │
│  │  (JSONL)   │    │ (Anthropic)│    │(micro/macro)│        │
│  └────────────┘    └────────────┘    └──────┬──────┘        │
│                                             │               │
│               ┌─────────────▼───────────┐                   │
│               │   Target F1 Reached?    │                   │
│               └────────────┴────────────┘                   │
│                NO ◄────────┴────────► YES                   │
│                 │                      │                    │
│                 ▼                      ▼                    │
│  ┌────────────┐    ┌────────────┐         ┌──────────┐      │
│  │  Applier   │◄───│   Critic   │         │   DONE   │      │
│  │ (Updates)  │    │ (Analysis) │         │  Return  │      │
│  └──────┬─────┘    └────────────┘         └──────────┘      │
│         │                                                   │
│         ▼                                                   │
│  ┌──────────────────────┐                                   │
│  │  Next Round (loop)   │                                   │
│  └──────────────────────┘                                   │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Components

1. Eval Runner (scripts/eval_runner.py)

  • Loads JSONL eval cases with input and gold fields
  • Calls Anthropic API with prompt template
  • Computes per-case and aggregate metrics (accuracy, F1)
  • Returns structured EvalResult objects
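A plausible shape for those result objects, sketched as dataclasses. The field names here are illustrative, not the script's actual API:

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """Outcome of one eval case: what went in, what was expected, what came out."""
    input: dict
    gold: dict
    prediction: dict
    correct: bool


@dataclass
class EvalResult:
    """Aggregate over all cases in one eval round."""
    cases: list

    @property
    def accuracy(self) -> float:
        if not self.cases:
            return 0.0
        return sum(c.correct for c in self.cases) / len(self.cases)
```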

2. Critic Agent (scripts/critic_agent.py)

  • Analyzes failure patterns from worst-K cases
  • Proposes structured improvements as JSON:
    • new_prompt: Improved prompt template
    • updated_evals: New/fixed eval cases
  • Uses constrained JSON output to avoid drift
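The "constrained JSON" guard can be as simple as strict key validation on the critic's reply. A sketch, with the key names mirroring the fields above:

```python
import json

# Keys the critic is allowed (and required) to emit, per the schema above.
REQUIRED_KEYS = {"new_prompt", "updated_evals"}


def parse_proposal(raw: str) -> dict:
    """Parse the critic's JSON reply and reject any schema drift."""
    proposal = json.loads(raw)
    missing = REQUIRED_KEYS - proposal.keys()
    if missing:
        raise ValueError(f"proposal missing keys: {sorted(missing)}")
    extra = proposal.keys() - REQUIRED_KEYS
    if extra:
        raise ValueError(f"unexpected keys (schema drift): {sorted(extra)}")
    return proposal
```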

3. Improvement Applier (scripts/improvement_applier.py)

  • Validates proposed changes
  • Applies prompt updates
  • Merges eval case additions/corrections
  • Maintains hold-out set integrity
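Hold-out integrity might be enforced by refusing any merge that touches reserved cases. A sketch; keying cases by their input text is an assumption:

```python
def merge_evals(existing: list, additions: list, holdout_keys: set) -> list:
    """Merge critic-proposed eval cases, never letting them touch the hold-out set."""
    merged = list(existing)
    seen = {case["input"]["text"] for case in existing}
    for case in additions:
        key = case["input"]["text"]
        if key in holdout_keys:
            continue  # hold-out cases stay exactly as the critic never saw them
        if key not in seen:
            merged.append(case)
            seen.add(key)
    return merged
```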

4. Eval Loop Orchestrator (scripts/eval_loop.py)

  • Main improvement cycle
  • Tracks F1 per iteration
  • Stops on convergence or max rounds
  • Logs all changes for auditability
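"Stops on convergence" could mean checking whether recent F1 deltas fall under a tolerance; the window and tolerance values below are illustrative, not the orchestrator's real defaults:

```python
def has_converged(f1_history: list, window: int = 3, tol: float = 0.005) -> bool:
    """True when the last `window` F1 scores span less than `tol`."""
    if len(f1_history) < window:
        return False
    recent = f1_history[-window:]
    return max(recent) - min(recent) < tol
```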

5. CI Integration (scripts/run_ci.py)

  • Single-round eval for CI gates
  • Fails job if score below threshold
  • Outputs machine-readable results

JSONL Format

{"input": {"text": "Document content here..."}, "gold": {"classification": "skill", "confidence": 0.95}}
{"input": {"text": "Another document..."}, "gold": {"classification": "command", "confidence": 0.90}}

Metrics

Metric     Description                   Usage
Micro F1   Global TP/FP/FN aggregation   Overall accuracy
Macro F1   Per-class average             Class balance
Accuracy   Exact match ratio             Simple threshold

Best Practices

  1. Separate App vs Critic Model: Run evals with the application model (the same or a cheaper one); keep the critic a distinct model
  2. Hold-out Set: Keep test cases critic never sees
  3. Version Everything: Track prompt/eval changes in git
  4. Track Per-Category: Catch regressions even when global F1 improves
  5. Guard Against Overfitting: Check for hard-coded or brittle patterns

Configuration

Environment variables:

  • ANTHROPIC_API_KEY: Anthropic API key
  • OPENAI_API_KEY: OpenAI API key (optional)
  • EVAL_MODEL: Model to use (default: claude-sonnet-4)