Anthropic Claude Code eval loop example
A common “Claude Code eval loop” pattern is: Claude writes evals → runs them → inspects failures → updates prompts/evals → repeats until metrics stabilize. Below is a minimal but production-adaptable example you can port into your own agentic stack.
Core eval loop idea
You want an automated loop where Claude (or your orchestrator) can:
- Generate or update eval cases for a task (e.g., classification, extraction, tool use).
- Run the evals against the current prompt / system configuration and compute scores.
- Reflect on failures and modify either the base prompt or the evals themselves, then iterate.
Pseudocode loop (Python-style SDK)
Assume:
- `run_model(prompt, input)` → calls the Anthropic API / Claude Code context.
- `score_output(gold, pred)` → returns a numeric score for a test case.
def run_eval_round(prompt_template, eval_cases):
    results = []
    for case in eval_cases:
        prompt = prompt_template.format(**case["input"])
        pred = run_model(prompt, case["input"])
        score = score_output(case["gold"], pred)
        results.append({**case, "pred": pred, "score": score})
    return results
def summarize_and_propose_changes(prompt_template, eval_results):
    # Let Claude act as a critic / editor
    system_msg = (
        "You are an evaluation critic. "
        "Given eval failures, improve either the prompt or the eval cases."
    )
    user_msg = {
        "prompt_template": prompt_template,
        "eval_results": eval_results,
    }
    # serialize the payload before sending; the reply is JSON-ish text
    suggestion = run_model(system_msg, json.dumps(user_msg))
    return json.loads(suggestion)
def eval_improvement_loop(initial_prompt, seed_evals, max_rounds=10, target_score=0.9):
    prompt = initial_prompt
    evals = seed_evals
    avg_score = 0.0  # defined up front so the return works even with max_rounds=0
    for round_idx in range(max_rounds):
        eval_results = run_eval_round(prompt, evals)
        avg_score = sum(r["score"] for r in eval_results) / len(eval_results)
        if avg_score >= target_score:
            break
        suggestion = summarize_and_propose_changes(prompt, eval_results)
        if "new_prompt" in suggestion:
            prompt = suggestion["new_prompt"]
        if "updated_evals" in suggestion:
            evals = suggestion["updated_evals"]
    return {"final_prompt": prompt, "final_evals": evals, "final_score": avg_score}
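To make the control flow concrete, here is a runnable version of the loop with stubbed `run_model` and `score_output` (hypothetical stand-ins for the real API call and scorer), where the critic step is replaced by a fixed prompt rewrite:

```python
# Stub stand-ins for the real API call and scorer (hypothetical).
def run_model(prompt: str, input_data: dict) -> str:
    # Pretend the model only uppercases once the prompt asks for it.
    return input_data["text"].upper() if "UPPERCASE" in prompt else input_data["text"]

def score_output(gold: str, pred: str) -> float:
    return 1.0 if gold == pred else 0.0

def run_eval_round(prompt_template, eval_cases):
    results = []
    for case in eval_cases:
        pred = run_model(prompt_template, case["input"])
        results.append({**case, "pred": pred, "score": score_output(case["gold"], pred)})
    return results

def eval_improvement_loop(initial_prompt, seed_evals, max_rounds=5, target_score=0.9):
    prompt, avg_score = initial_prompt, 0.0
    for _ in range(max_rounds):
        results = run_eval_round(prompt, seed_evals)
        avg_score = sum(r["score"] for r in results) / len(results)
        if avg_score >= target_score:
            break
        # Stand-in for the critic: in the real loop this is a model call.
        prompt = "UPPERCASE the input text."
    return {"final_prompt": prompt, "final_score": avg_score}

cases = [
    {"input": {"text": "abc"}, "gold": "ABC"},
    {"input": {"text": "xy"}, "gold": "XY"},
]
out = eval_improvement_loop("Echo the input.", cases)
```

The first round scores 0.0, the stubbed critic rewrites the prompt, and the second round hits the target and breaks, which is exactly the shape of the real loop.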
This matches the pattern described by practitioners using Claude Code for “self-improving eval loops”: Claude writes evals, runs them, and then edits both prompts and evals based on failures until quality converges.
Example prompt to hand Claude Code
Here is a tight prompt you can paste into Claude Code itself to bootstrap such a loop in your stack:
“You are an evaluation-orchestrator agent. Generate a minimal Python module that:
- Loads a JSONL file of eval cases (fields: `input`, `gold`).
- Calls the Anthropic API for each case with a `prompt_template` string.
- Computes per-case and average accuracy/F1.
- As a second step, given the worst 10 cases, proposes a JSON object with `new_prompt` and optionally `updated_evals`.

The code should expose a function `eval_improvement_loop(initial_prompt, eval_path, rounds, target_score)` that implements the full loop. Assume the Anthropic Python SDK is available as `anthropic`, and use type hints and a clear separation between ‘run evals’ and ‘reflect and propose changes’ functions.”
If you share your target task (e.g., code-gen, extraction, routing), a concrete example of eval_cases, and your current Anthropic client setup, a more tailored eval loop (with real SDK calls and metrics) can be sketched next.
Show a simple eval loop example for Claude Code with tests
How to build a critic agent for Claude Code evals
Example prompt and eval pair for parsing job descriptions
Best practices for automated self-improving eval loops
How to run Claude Code evals continuously in a CI pipeline
Here is a concrete, end-to-end pattern you can adapt: a minimal Python eval loop module, a Claude-based critic agent, a job‑description parsing eval pair, best practices for self-improving loops, and how to wire this into CI.
Minimal eval loop module
This is a simple, “batteries-included” eval loop that:
- Loads `input`/`gold` from JSONL.
- Calls Claude via the Anthropic Python SDK with a `prompt_template`.
- Computes accuracy and macro-F1 (binary or multi-label).
- Uses a Claude critic to suggest `new_prompt` and `updated_evals`.
from __future__ import annotations

import json
from dataclasses import dataclass
from pathlib import Path
from typing import List, Dict, Any, Tuple

import anthropic
from sklearn.metrics import f1_score  # or roll your own if you want no deps


@dataclass
class EvalCase:
    input: Dict[str, Any]
    gold: Dict[str, Any]  # e.g., {"skills": [...], "seniority": "mid"}


@dataclass
class EvalResult(EvalCase):
    pred: Dict[str, Any]
    correct: bool
    f1: float


class ClaudeEvaluator:
    def __init__(self, api_key: str, model: str = "claude-3-5-sonnet-20241022"):
        self.client = anthropic.Anthropic(api_key=api_key)
        self.model = model

    def _call_model(self, system_prompt: str, user_content: Any) -> str:
        msg = self.client.messages.create(
            model=self.model,
            max_tokens=1024,
            system=system_prompt,
            messages=[{"role": "user", "content": user_content}],
        )
        # assumes a single text block in the response
        return msg.content[0].text
    def run_eval_round(
        self,
        prompt_template: str,
        cases: List[EvalCase],
    ) -> List[EvalResult]:
        results: List[EvalResult] = []
        for case in cases:
            user_prompt = prompt_template.format(**case.input)
            raw = self._call_model(
                system_prompt=(
                    "You are a JSON-only evaluation model. "
                    "Return a valid JSON object matching the expected schema."
                ),
                user_content=user_prompt,
            )
            pred = json.loads(raw)
            correct, f1 = self._score_case(case.gold, pred)
            results.append(EvalResult(
                input=case.input,
                gold=case.gold,
                pred=pred,
                correct=correct,
                f1=f1,
            ))
        return results
    def _score_case(self, gold: Dict[str, Any], pred: Dict[str, Any]) -> Tuple[bool, float]:
        # simple example: treat skills as multi-label, seniority as exact match
        gold_skills = set(gold.get("skills", []))
        pred_skills = set(pred.get("skills", []))
        # score over the union of skills so spurious predictions also hurt precision
        all_skills = sorted(gold_skills | pred_skills)
        y_true = [1 if s in gold_skills else 0 for s in all_skills]
        y_pred = [1 if s in pred_skills else 0 for s in all_skills]
        f1 = f1_score(y_true, y_pred) if all_skills else 1.0
        correct = (gold.get("seniority") == pred.get("seniority")) and f1 == 1.0
        return correct, float(f1)
    def propose_prompt_and_eval_changes(
        self,
        prompt_template: str,
        eval_results: List[EvalResult],
        worst_k: int = 10,
    ) -> Dict[str, Any]:
        sorted_results = sorted(eval_results, key=lambda r: r.f1)
        worst = sorted_results[:worst_k]
        payload = {
            "prompt_template": prompt_template,
            "worst_examples": [
                {
                    "input": r.input,
                    "gold": r.gold,
                    "pred": r.pred,
                    "f1": r.f1,
                    "correct": r.correct,
                }
                for r in worst
            ],
        }
        critic_system = (
            "You are a critic agent for evaluation loops.\n"
            "- Inspect failure patterns.\n"
            "- Propose ONLY a JSON object with keys:\n"
            '  "new_prompt": string (optional),\n'
            '  "updated_evals": list of eval cases with fields "input" and "gold" (optional).\n'
            "- Do not include explanations, just JSON."
        )
        raw = self._call_model(critic_system, json.dumps(payload, ensure_ascii=False))
        return json.loads(raw)
def load_eval_cases(path: str | Path) -> List[EvalCase]:
    cases: List[EvalCase] = []
    with Path(path).open() as f:
        for line in f:
            if not line.strip():
                continue
            obj = json.loads(line)
            cases.append(EvalCase(input=obj["input"], gold=obj["gold"]))
    return cases
def eval_improvement_loop(
    initial_prompt: str,
    eval_path: str | Path,
    rounds: int,
    target_score: float,
    api_key: str,
    model: str = "claude-3-5-sonnet-20241022",
) -> Dict[str, Any]:
    evaluator = ClaudeEvaluator(api_key=api_key, model=model)
    cases = load_eval_cases(eval_path)
    prompt = initial_prompt
    final_score = 0.0
    for r in range(rounds):
        results = evaluator.run_eval_round(prompt, cases)
        avg_f1 = sum(res.f1 for res in results) / len(results)
        final_score = avg_f1
        print(f"[round {r}] avg F1={avg_f1:.3f}")
        if avg_f1 >= target_score:
            break
        suggestions = evaluator.propose_prompt_and_eval_changes(prompt, results)
        if suggestions.get("new_prompt"):
            prompt = suggestions["new_prompt"]
        if suggestions.get("updated_evals"):
            cases = [
                EvalCase(input=e["input"], gold=e["gold"])
                for e in suggestions["updated_evals"]
            ]
    return {"final_prompt": prompt, "final_score": final_score, "num_cases": len(cases)}
This matches the workflow described by teams using Claude Code for “fully automated self-improving eval loops”, where Claude both runs tests and adjusts prompts/evals.
Critic agent pattern for Claude Code
A critic agent should:
- See the failures plus the current prompt and scoring rubric.
- Output only a constrained JSON object with the proposed `new_prompt` and `updated_evals`, no prose.
Example critic system prompt:

You are an evaluation critic for a job-description parsing system.

Input: a JSON object containing:
- `prompt_template`: the current prompt string used for the model.
- `worst_examples`: a list of objects with `input`, `gold`, `pred`, `f1`, and `correct`.

Your goals:
- Detect systematic error patterns (e.g., missing benefits, misclassified seniority).
- Propose an improved `new_prompt` that clarifies instructions and gives 1–3 concise in-prompt examples.
- Optionally propose an `updated_evals` list where you:
  - Fix bad or ambiguous gold labels.
  - Add 2–5 new hard cases targeting observed weaknesses.

Respond with only a JSON object of the form:
{ "new_prompt": string, "updated_evals": [ { "input": {...}, "gold": {...} }, ... ] }

Do not include explanations, comments, or Markdown.
This follows best-practice guidance for skill authoring and tightly scoped JSON tools.
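Because the critic's reply is untrusted model output, it is worth validating before applying it. A minimal defensive parser (our own helper, not part of any SDK) might look like:

```python
import json
from typing import Any, Dict

def parse_critic_output(raw: str) -> Dict[str, Any]:
    """Accept only well-formed critic suggestions; return {} otherwise."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return {}  # ignore non-JSON replies rather than crash the loop
    if not isinstance(obj, dict):
        return {}
    suggestion: Dict[str, Any] = {}
    # new_prompt must be a non-empty string to be applied
    if isinstance(obj.get("new_prompt"), str) and obj["new_prompt"].strip():
        suggestion["new_prompt"] = obj["new_prompt"]
    # updated_evals must be a non-empty list of {"input": ..., "gold": ...} dicts
    evals = obj.get("updated_evals")
    if isinstance(evals, list) and evals and all(
        isinstance(e, dict) and "input" in e and "gold" in e for e in evals
    ):
        suggestion["updated_evals"] = evals
    return suggestion
```

The loop then applies only the keys that survive validation, so a malformed critic turn degrades to a no-op round instead of an exception.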
Job description parsing prompt & eval example
Here is a minimal example tailored to parsing job descriptions into structured fields, consistent with how Claude is used for HR tasks like screening and JD generation.
Prompt template (string in your module):
PROMPT_TEMPLATE = """
You are an information extraction engine for job descriptions.

Task:
- Read the following job description text.
- Extract a structured JSON object with fields:
  - title: string
  - seniority: one of ["intern", "junior", "mid", "senior", "lead", "director", "vp", "c-level", "unknown"]
  - skills: list of up to 10 key technical or domain skills in lower_snake_case.
  - location: string (city/region if present, otherwise "remote" or "unspecified").
  - remote_friendly: boolean.
  - compensation_mentioned: boolean (true if any salary/comp range or equity is mentioned).

Return ONLY a JSON object, no extra text.

Job description:
---
{job_text}
---
"""
Example eval case for `job_evals.jsonl` (pretty-printed here for readability; in the actual JSONL file, each case must occupy a single line):

{
  "input": {
    "job_text": "We are hiring a Senior Backend Engineer to join our payments team in São Paulo or remote within Brazil. You will work with Go, Kubernetes, and PostgreSQL. Experience with payment systems and high throughput APIs is required. Competitive salary and stock options."
  },
  "gold": {
    "title": "Senior Backend Engineer",
    "seniority": "senior",
    "skills": ["go", "kubernetes", "postgresql", "payment_systems", "high_throughput_apis"],
    "location": "São Paulo or remote within Brazil",
    "remote_friendly": true,
    "compensation_mentioned": true
  }
}
You can bootstrap a small seed set by manually writing 10–30 such examples, then let the critic agent expand and refine them in the loop, similar to approaches documented for self-improving evals.
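Since the critic can also rewrite gold labels, a small schema check helps catch malformed cases before they enter the eval set. This sketch mirrors the fields and allowed seniority values from the prompt template above (the helper names are our own convention):

```python
ALLOWED_SENIORITY = {
    "intern", "junior", "mid", "senior", "lead",
    "director", "vp", "c-level", "unknown",
}
REQUIRED_GOLD_FIELDS = {
    "title": str,
    "seniority": str,
    "skills": list,
    "location": str,
    "remote_friendly": bool,
    "compensation_mentioned": bool,
}

def validate_gold(gold: dict) -> list:
    """Return a list of problems; an empty list means the case looks well-formed."""
    errors = []
    for field, expected_type in REQUIRED_GOLD_FIELDS.items():
        if field not in gold:
            errors.append(f"missing field: {field}")
        elif not isinstance(gold[field], expected_type):
            errors.append(f"wrong type for field: {field}")
    if gold.get("seniority") not in ALLOWED_SENIORITY:
        errors.append("seniority outside allowed set")
    return errors
```

Running `validate_gold` over every case (seed or critic-proposed) before each round keeps schema drift out of the eval set.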
Best practices for self-improving eval loops
Patterns emerging from the LLM eval literature and practitioners:
- Separate the “app model” from the “eval/critic model”, so the system being graded is not also grading itself.
- Guard against overfitting and leakage: keep a held-out eval set the critic never edits, and review critic-proposed cases before trusting them.
- Track metrics and drift across rounds, so silent regressions in prompts or eval sets stay visible.
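For the metrics-and-drift point, one lightweight approach is to append a JSON line per round to a log file, so score trends survive across runs. A sketch, with field names that are our own convention:

```python
import hashlib
import json
import time
from pathlib import Path

def log_round_metrics(log_path, round_idx: int, avg_f1: float, prompt: str, num_cases: int) -> None:
    """Append one JSON line per eval round; hash the prompt so changes are traceable."""
    record = {
        "ts": time.time(),
        "round": round_idx,
        "avg_f1": avg_f1,
        # short content hash: lets you correlate score jumps with prompt edits
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest()[:12],
        "num_cases": num_cases,
    }
    with Path(log_path).open("a") as f:
        f.write(json.dumps(record) + "\n")
```

The resulting JSONL log can be plotted or diffed later; a score drop with an unchanged `prompt_sha256` points at eval-set drift rather than a prompt regression.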
Running eval loops in CI with Claude Code
For continuous evals in CI, the typical pattern is:
- CLI/runner: use a thin Python or shell wrapper that calls `eval_improvement_loop` with `rounds=1` for each PR or commit, treating self-improvement as an offline or scheduled job rather than part of the main CI critical path.
- Enforce thresholds: if `final_score < MIN_F1`, fail the job and surface a concise report in CI logs or a PR comment.
- Scheduled / continuous loops: run the full multi-round self-improvement loop on a schedule (e.g., nightly), outside the per-commit gate.
Example GitHub Actions step:
- name: Run Claude evals
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
  run: |
    python -m evals.run_ci \
      --eval-path data/job_evals.jsonl \
      --prompt-path prompts/job_parser.txt \
      --min-f1 0.85
In evals/run_ci.py, you would call eval_improvement_loop with rounds=1 (no auto-edit) and exit non‑zero if the score falls below threshold, leaving the full multi‑round self‑improvement loop to a scheduled pipeline or an internal “eval-orchestrator” agent.
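A sketch of what `evals/run_ci.py` could look like (the file name and flags follow the workflow step above, not an Anthropic tool; the loop function is injected so the gating logic is testable without live API calls):

```python
# evals/run_ci.py (sketch)
import argparse

def main(argv=None, loop_fn=None) -> int:
    """Run one eval round and return a non-zero exit code below the threshold."""
    parser = argparse.ArgumentParser(description="CI gate for Claude evals")
    parser.add_argument("--eval-path", required=True)
    parser.add_argument("--prompt-path", required=True)
    parser.add_argument("--min-f1", type=float, default=0.85)
    args = parser.parse_args(argv)
    # In production, loop_fn would wrap eval_improvement_loop with rounds=1
    # (no auto-editing in CI); it is injectable here for offline testing.
    result = loop_fn(eval_path=args.eval_path, prompt_path=args.prompt_path)
    score = result["final_score"]
    print(f"final_score={score:.3f} (threshold {args.min_f1})")
    return 0 if score >= args.min_f1 else 1

# As a script, the entry point would be: sys.exit(main())
```

CI then interprets exit code 1 as a failed quality gate, with the printed score as the report line.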
If you want, the next step can be: generate a concrete repo layout (/prompts, /evals, /critics, /ci) and the exact run_ci.py and workflow files wired for GCP/GitHub or your existing Langfuse-based observability stack.
Show minimal Python eval loop using Anthropic SDK for Claude Code
Provide a JSONL format example with sample input and gold fields
How to compute micro and macro F1 for text classification in Python
Implement reflection function that proposes new prompts from failures
Show unit tests for eval_improvement_loop covering edge cases
Below is a compact, copy‑pasteable setup that hits all five asks: minimal eval loop with Anthropic SDK, JSONL format, micro/macro F1, a reflection function, and unit tests.
Minimal Python eval loop (Anthropic SDK)
This uses the official Anthropic Python SDK messages.create call pattern.
# eval_loop.py
from __future__ import annotations

import json
from dataclasses import dataclass
from pathlib import Path
from typing import List, Dict, Any, Tuple

from anthropic import Anthropic
from sklearn.metrics import f1_score  # for micro/macro


@dataclass
class EvalCase:
    label: str
    text: str


@dataclass
class EvalResult:
    case: EvalCase
    pred_label: str


class ClaudeClassifier:
    def __init__(self, api_key: str, model: str = "claude-3-5-sonnet-20241022"):
        self.client = Anthropic(api_key=api_key)
        self.model = model

    def classify(self, prompt_template: str, text: str) -> str:
        prompt = prompt_template.format(text=text)
        msg = self.client.messages.create(
            model=self.model,
            max_tokens=128,
            system=(
                "You are a strict classifier.\n"
                "Return ONLY a single label token from the allowed set."
            ),
            messages=[
                {
                    "role": "user",
                    "content": prompt,
                }
            ],
        )
        raw = msg.content[0].text.strip()
        return raw  # assume it is already one of the allowed labels
def load_eval_cases(path: str | Path) -> List[EvalCase]:
    cases: List[EvalCase] = []
    with Path(path).open() as f:
        for line in f:
            if not line.strip():
                continue
            obj = json.loads(line)
            cases.append(EvalCase(label=obj["gold"], text=obj["input"]))
    return cases


def run_eval_round(
    client: ClaudeClassifier,
    prompt_template: str,
    cases: List[EvalCase],
    labels: List[str],
) -> Tuple[List[EvalResult], float, float]:
    results: List[EvalResult] = []
    y_true: List[str] = []
    y_pred: List[str] = []
    for case in cases:
        pred = client.classify(prompt_template, case.text)
        results.append(EvalResult(case=case, pred_label=pred))
        y_true.append(case.label)
        y_pred.append(pred)
    micro_f1 = f1_score(y_true, y_pred, labels=labels, average="micro")
    macro_f1 = f1_score(y_true, y_pred, labels=labels, average="macro")
    return results, float(micro_f1), float(macro_f1)
def eval_improvement_loop(
    api_key: str,
    prompt_template: str,
    eval_path: str | Path,
    labels: List[str],
    rounds: int = 3,
    target_macro_f1: float = 0.9,
    model: str = "claude-3-5-sonnet-20241022",
) -> Dict[str, Any]:
    classifier = ClaudeClassifier(api_key=api_key, model=model)
    cases = load_eval_cases(eval_path)
    prompt = prompt_template
    last_micro, last_macro = 0.0, 0.0
    for r in range(rounds):
        results, micro_f1, macro_f1 = run_eval_round(
            classifier, prompt, cases, labels
        )
        last_micro, last_macro = micro_f1, macro_f1
        print(f"[round {r}] micro={micro_f1:.3f} macro={macro_f1:.3f}")
        if macro_f1 >= target_macro_f1:
            break
        prompt = reflect_and_propose_new_prompt(
            api_key=api_key,
            model=model,
            current_prompt=prompt,
            labels=labels,
            results=results,
        )
    return {
        "final_prompt": prompt,
        "micro_f1": last_micro,
        "macro_f1": last_macro,
    }
JSONL eval format example
JSONL is a line-delimited JSON format where each line is a single example. For simple text classification:
{"input": "We are hiring a Senior Backend Engineer with Go and Kubernetes.", "gold": "tech_job"}
{"input": "Our clinic is looking for a pediatric nurse in Belo Horizonte.", "gold": "healthcare_job"}
{"input": "Marketing specialist needed for SaaS growth team.", "gold": "marketing_job"}
Each line has `input` (the raw text) and `gold` (the gold label).
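Parsing this format is one `json.loads` per line; a tiny loader sketch (the helper name is our own):

```python
import json

def load_jsonl_cases(lines):
    """Each non-empty line is one JSON object with "input" and "gold" fields."""
    cases = []
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines between records
        obj = json.loads(line)
        cases.append((obj["input"], obj["gold"]))
    return cases

# The three example records above, as raw JSONL lines:
sample = [
    '{"input": "We are hiring a Senior Backend Engineer with Go and Kubernetes.", "gold": "tech_job"}',
    "",
    '{"input": "Marketing specialist needed for SaaS growth team.", "gold": "marketing_job"}',
]
```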
Computing micro and macro F1 in Python
sklearn.metrics.f1_score provides both micro and macro averaging modes.
from sklearn.metrics import f1_score

def micro_macro_f1(y_true, y_pred, labels):
    micro = f1_score(y_true, y_pred, labels=labels, average="micro")
    macro = f1_score(y_true, y_pred, labels=labels, average="macro")
    return float(micro), float(macro)
- Micro F1 aggregates TP/FP/FN globally, so in single-label classification it aligns with overall accuracy.
- Macro F1 averages per-class F1, treating all classes equally and penalizing poor performance on rare classes.
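To see the difference numerically, here is a dependency-free computation on a toy example (`per_class_f1` and `micro_macro_f1_manual` are our own helpers, using the same definitions `f1_score` applies): three of four predictions are correct, so micro F1 equals accuracy, while the mistake on the rarer class drags macro F1 lower.

```python
def per_class_f1(y_true, y_pred, label):
    # F1 = 2*TP / (2*TP + FP + FN) for one class treated as "positive"
    tp = sum(t == label == p for t, p in zip(y_true, y_pred))
    fp = sum(p == label != t for t, p in zip(y_true, y_pred))
    fn = sum(t == label != p for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1_manual(y_true, y_pred, labels):
    # macro: unweighted mean of per-class F1 scores
    macro = sum(per_class_f1(y_true, y_pred, l) for l in labels) / len(labels)
    # micro: pool TP/FP/FN over all classes, then compute one F1
    tp = fp = fn = 0
    for l in labels:
        tp += sum(t == l == p for t, p in zip(y_true, y_pred))
        fp += sum(p == l != t for t, p in zip(y_true, y_pred))
        fn += sum(t == l != p for t, p in zip(y_true, y_pred))
    micro = 2 * tp / (2 * tp + fp + fn)
    return micro, macro

y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]
micro, macro = micro_macro_f1_manual(y_true, y_pred, ["a", "b"])
# micro = 0.75 (= accuracy), macro = (0.8 + 0.667) / 2 ≈ 0.733
```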
Reflection function to propose new prompts
This function inspects failures and asks Claude to output only a new prompt string, constrained to JSON to avoid drift, similar to LLM-as-evaluator tooling patterns.
# eval_loop.py (continued)

def reflect_and_propose_new_prompt(
    api_key: str,
    model: str,
    current_prompt: str,
    labels: List[str],
    results: List[EvalResult],
    max_examples: int = 10,
) -> str:
    client = Anthropic(api_key=api_key)
    # Keep only misclassified examples
    failures = [
        {
            "input": r.case.text,
            "gold": r.case.label,
            "pred": r.pred_label,
        }
        for r in results
        if r.case.label != r.pred_label
    ][:max_examples]
    payload = {
        "current_prompt": current_prompt,
        "labels": labels,
        "failures": failures,
    }
    system_prompt = (
        "You are a prompt engineer for a text classification system.\n"
        "You receive the current prompt and misclassified examples.\n"
        "Your job is to propose a SINGLE improved prompt that:\n"
        "- Makes label definitions clearer.\n"
        "- Adds 1-3 short in-prompt examples.\n"
        'Respond ONLY with JSON: {"new_prompt": "..."}'
    )
    msg = client.messages.create(
        model=model,
        max_tokens=512,
        system=system_prompt,
        messages=[
            {
                "role": "user",
                "content": json.dumps(payload, ensure_ascii=False),
            }
        ],
    )
    raw = msg.content[0].text.strip()
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return current_prompt  # fall back if the model drifted from pure JSON
    return obj.get("new_prompt", current_prompt)
Unit tests for eval_improvement_loop
These tests avoid calling the real API by injecting a fake classifier and monkeypatching reflection, which is a common practice when testing LLM eval pipelines.
# test_eval_loop.py
import json
from pathlib import Path
from typing import List

from eval_loop import (
    EvalCase,
    EvalResult,
    run_eval_round,
    eval_improvement_loop,
)


class FakeClaudeClassifier:
    def __init__(self, label_map):
        self.label_map = label_map  # text -> label

    def classify(self, prompt_template: str, text: str) -> str:
        # deterministic label based on substring matching
        for key, label in self.label_map.items():
            if key in text:
                return label
        return "other"


def write_tmp_jsonl(tmp_path: Path, examples: List[dict]) -> Path:
    p = tmp_path / "evals.jsonl"
    with p.open("w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")
    return p
def test_run_eval_round_micro_and_macro():
    examples = [
        {"input": "Senior Backend Engineer", "gold": "tech_job"},
        {"input": "Pediatric nurse", "gold": "healthcare_job"},
    ]
    cases = [EvalCase(label=e["gold"], text=e["input"]) for e in examples]
    # fake classifier: always predicts tech_job (the empty key matches everything)
    fake = FakeClaudeClassifier(label_map={"": "tech_job"})
    labels = ["tech_job", "healthcare_job"]
    results, micro_f1, macro_f1 = run_eval_round(
        fake, "{text}", cases, labels
    )
    assert len(results) == 2
    # one correct, one incorrect -> micro F1 = 0.5, macro F1 < 1
    assert 0.4 < micro_f1 < 0.6
    assert macro_f1 < 1.0
def test_eval_improvement_loop_breaks_on_target(tmp_path, monkeypatch):
    examples = [
        {"input": "Senior Backend Engineer", "gold": "tech_job"},
        {"input": "Another Backend Engineer", "gold": "tech_job"},
    ]
    path = write_tmp_jsonl(tmp_path, examples)

    # monkeypatch load_eval_cases to avoid IO in this test
    def fake_load_eval_cases(p):
        return [EvalCase(label=e["gold"], text=e["input"]) for e in examples]

    monkeypatch.setattr("eval_loop.load_eval_cases", fake_load_eval_cases)

    # monkeypatch ClaudeClassifier to always be correct
    from eval_loop import ClaudeClassifier as RealClassifier

    class PerfectClassifier(RealClassifier):
        def __init__(self, *args, **kwargs):
            pass

        def classify(self, prompt_template: str, text: str) -> str:
            return "tech_job"

    monkeypatch.setattr("eval_loop.ClaudeClassifier", PerfectClassifier)

    # monkeypatch reflection to just echo the prompt
    def fake_reflect(*args, **kwargs):
        return kwargs.get("current_prompt", "{text}")

    monkeypatch.setattr("eval_loop.reflect_and_propose_new_prompt", fake_reflect)

    result = eval_improvement_loop(
        api_key="dummy",
        prompt_template="{text}",
        eval_path=path,
        labels=["tech_job"],
        rounds=5,
        target_macro_f1=0.9,
    )
    assert result["macro_f1"] >= 0.9
    assert result["final_prompt"] == "{text}"
def test_eval_improvement_loop_handles_no_failures(tmp_path, monkeypatch):
    examples = [
        {"input": "Senior Backend Engineer", "gold": "tech_job"},
    ]
    path = write_tmp_jsonl(tmp_path, examples)

    def fake_load_eval_cases(p):
        return [EvalCase(label=e["gold"], text=e["input"]) for e in examples]

    monkeypatch.setattr("eval_loop.load_eval_cases", fake_load_eval_cases)

    from eval_loop import ClaudeClassifier as RealClassifier

    class PerfectClassifier(RealClassifier):
        def __init__(self, *args, **kwargs):
            pass

        def classify(self, prompt_template: str, text: str) -> str:
            return "tech_job"

    monkeypatch.setattr("eval_loop.ClaudeClassifier", PerfectClassifier)

    # reflection should never be called if the target is reached in the first round
    called = {"value": False}

    def fake_reflect(*args, **kwargs):
        called["value"] = True
        return kwargs.get("current_prompt", "{text}")

    monkeypatch.setattr("eval_loop.reflect_and_propose_new_prompt", fake_reflect)

    res = eval_improvement_loop(
        api_key="dummy",
        prompt_template="{text}",
        eval_path=path,
        labels=["tech_job"],
        rounds=3,
        target_macro_f1=0.99,
    )
    assert res["macro_f1"] >= 0.99
    # If perfect in the first round, the loop breaks and reflect is not needed.
    assert called["value"] is False
This pattern aligns with typical LLM-eval designs: JSONL datasets, micro/macro F1 via scikit-learn, Anthropic SDK messages.create calls, a constrained reflection step, and tests that validate convergence and edge-case behavior without hitting the live API.