Anthropic Claude Code eval loop example
A common “Claude Code eval loop” pattern is: Claude writes evals → runs them → inspects failures → updates prompts/evals → repeats until metrics stabilize. Below is a minimal but production-adaptable example you can port into your own agentic stack.
Core eval loop idea
You want an automated loop where Claude (or your orchestrator) can:
- Generate or update eval cases for a task (e.g., classification, extraction, tool use).
- Run the evals against the current prompt / system configuration and compute scores.
- Reflect on failures and modify either the base prompt or the evals themselves, then iterate.
Pseudocode loop (Python-style SDK)
Assume:
- `run_model(prompt, input)` → calls the Anthropic API / Claude Code context.
- `score_output(gold, pred)` → returns a numeric score for a test case.
def run_eval_round(prompt_template, eval_cases):
    results = []
    for case in eval_cases:
        prompt = prompt_template.format(**case["input"])
        pred = run_model(prompt, case["input"])
        score = score_output(case["gold"], pred)
        results.append({**case, "pred": pred, "score": score})
    return results
def summarize_and_propose_changes(prompt_template, eval_results):
    # Let Claude act as a critic / editor
    system_msg = (
        "You are an evaluation critic. "
        "Given eval failures, improve either the prompt or the eval cases."
    )
    user_msg = {
        "prompt_template": prompt_template,
        "eval_results": eval_results,
    }
    # serialize the payload before sending; the reply is JSON-ish text
    suggestion = run_model(system_msg, json.dumps(user_msg))
    return json.loads(suggestion)
def eval_improvement_loop(initial_prompt, seed_evals, max_rounds=10, target_score=0.9):
    prompt = initial_prompt
    evals = seed_evals
    avg_score = 0.0  # defined up front so the return works even with max_rounds=0
    for round_idx in range(max_rounds):
        eval_results = run_eval_round(prompt, evals)
        avg_score = sum(r["score"] for r in eval_results) / len(eval_results)
        if avg_score >= target_score:
            break
        suggestion = summarize_and_propose_changes(prompt, eval_results)
        if "new_prompt" in suggestion:
            prompt = suggestion["new_prompt"]
        if "updated_evals" in suggestion:
            evals = suggestion["updated_evals"]
    return {"final_prompt": prompt, "final_evals": evals, "final_score": avg_score}
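To make the control flow concrete, here is a runnable version of the loop with stubbed `run_model` and `score_output` (hypothetical stand-ins for the real API call and scorer), where the critic step is replaced by a fixed prompt rewrite:

```python
# Stub stand-ins for the real API call and scorer (hypothetical).
def run_model(prompt: str, input_data: dict) -> str:
    # Pretend the model only uppercases once the prompt asks for it.
    return input_data["text"].upper() if "UPPERCASE" in prompt else input_data["text"]

def score_output(gold: str, pred: str) -> float:
    return 1.0 if gold == pred else 0.0

def run_eval_round(prompt_template, eval_cases):
    results = []
    for case in eval_cases:
        pred = run_model(prompt_template, case["input"])
        results.append({**case, "pred": pred, "score": score_output(case["gold"], pred)})
    return results

def eval_improvement_loop(initial_prompt, seed_evals, max_rounds=5, target_score=0.9):
    prompt, avg_score = initial_prompt, 0.0
    for _ in range(max_rounds):
        results = run_eval_round(prompt, seed_evals)
        avg_score = sum(r["score"] for r in results) / len(results)
        if avg_score >= target_score:
            break
        # Stand-in for the critic: in the real loop this is a model call.
        prompt = "UPPERCASE the input text."
    return {"final_prompt": prompt, "final_score": avg_score}

cases = [
    {"input": {"text": "abc"}, "gold": "ABC"},
    {"input": {"text": "xy"}, "gold": "XY"},
]
out = eval_improvement_loop("Echo the input.", cases)
```

The first round scores 0.0, the stubbed critic rewrites the prompt, and the second round hits the target and breaks, which is exactly the shape of the real loop.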
This matches the pattern described by practitioners using Claude Code for “self-improving eval loops”: Claude writes evals, runs them, and then edits both prompts and evals based on failures until quality converges.
Example prompt to hand Claude Code
Here is a tight prompt you can paste into Claude Code itself to bootstrap such a loop in your stack:
“You are an evaluation-orchestrator agent. Generate a minimal Python module that:
- Loads a JSONL file of eval cases (fields: `input`, `gold`).
- Calls the Anthropic API for each case with a `prompt_template` string.
- Computes per-case and average accuracy/F1.
- As a second step, given the worst 10 cases, proposes a JSON object with `new_prompt` and optionally `updated_evals`.

The code should expose a function `eval_improvement_loop(initial_prompt, eval_path, rounds, target_score)` that implements the full loop. Assume the Anthropic Python SDK is available as `anthropic`, and use type hints and a clear separation between ‘run evals’ and ‘reflect and propose changes’ functions.”
If you share your target task (e.g., code-gen, extraction, routing), a concrete example of eval_cases, and your current Anthropic client setup, a more tailored eval loop (with real SDK calls and metrics) can be sketched next.
Show a simple eval loop example for Claude Code with tests
How to build a critic agent for Claude Code evals
Example prompt and eval pair for parsing job descriptions
Best practices for automated self-improving eval loops
How to run Claude Code evals continuously in a CI pipeline
Here is a concrete, end-to-end pattern you can adapt: a minimal Python eval loop module, a Claude-based critic agent, a job‑description parsing eval pair, best practices for self-improving loops, and how to wire this into CI.
Minimal eval loop module
This is a simple, “batteries-included” eval loop that:
- Loads `input`/`gold` from JSONL.
- Calls Claude via the Anthropic Python SDK with a `prompt_template`.
- Computes accuracy and macro-F1 (binary or multi-label).
- Uses a Claude critic to suggest `new_prompt` and `updated_evals`.
from __future__ import annotations

import json
from dataclasses import dataclass
from pathlib import Path
from typing import List, Dict, Any, Tuple

import anthropic
from sklearn.metrics import f1_score  # or roll your own if you want no deps


@dataclass
class EvalCase:
    input: Dict[str, Any]
    gold: Dict[str, Any]  # e.g., {"skills": [...], "seniority": "mid"}


@dataclass
class EvalResult(EvalCase):
    pred: Dict[str, Any]
    correct: bool
    f1: float


class ClaudeEvaluator:
    def __init__(self, api_key: str, model: str = "claude-3-5-sonnet-20241022"):
        self.client = anthropic.Anthropic(api_key=api_key)
        self.model = model

    def _call_model(self, system_prompt: str, user_content: Any) -> str:
        msg = self.client.messages.create(
            model=self.model,
            max_tokens=1024,
            system=system_prompt,
            messages=[{"role": "user", "content": user_content}],
        )
        # assumes a single text block in the response
        return msg.content[0].text
    def run_eval_round(
        self,
        prompt_template: str,
        cases: List[EvalCase],
    ) -> List[EvalResult]:
        results: List[EvalResult] = []
        for case in cases:
            user_prompt = prompt_template.format(**case.input)
            raw = self._call_model(
                system_prompt=(
                    "You are a JSON-only evaluation model. "
                    "Return a valid JSON object matching the expected schema."
                ),
                user_content=user_prompt,
            )
            pred = json.loads(raw)
            correct, f1 = self._score_case(case.gold, pred)
            results.append(EvalResult(
                input=case.input,
                gold=case.gold,
                pred=pred,
                correct=correct,
                f1=f1,
            ))
        return results
    def _score_case(self, gold: Dict[str, Any], pred: Dict[str, Any]) -> Tuple[bool, float]:
        # simple example: treat skills as multi-label, seniority as exact match
        gold_skills = set(gold.get("skills", []))
        pred_skills = set(pred.get("skills", []))
        # score over the union of skills so spurious predictions also hurt precision
        all_skills = sorted(gold_skills | pred_skills)
        y_true = [1 if s in gold_skills else 0 for s in all_skills]
        y_pred = [1 if s in pred_skills else 0 for s in all_skills]
        f1 = f1_score(y_true, y_pred) if all_skills else 1.0
        correct = (gold.get("seniority") == pred.get("seniority")) and f1 == 1.0
        return correct, float(f1)
    def propose_prompt_and_eval_changes(
        self,
        prompt_template: str,
        eval_results: List[EvalResult],
        worst_k: int = 10,
    ) -> Dict[str, Any]:
        sorted_results = sorted(eval_results, key=lambda r: r.f1)
        worst = sorted_results[:worst_k]
        payload = {
            "prompt_template": prompt_template,
            "worst_examples": [
                {
                    "input": r.input,
                    "gold": r.gold,
                    "pred": r.pred,
                    "f1": r.f1,
                    "correct": r.correct,
                }
                for r in worst
            ],
        }
        critic_system = (
            "You are a critic agent for evaluation loops.\n"
            "- Inspect failure patterns.\n"
            "- Propose ONLY a JSON object with keys:\n"
            '  "new_prompt": string (optional),\n'
            '  "updated_evals": list of eval cases with fields "input" and "gold" (optional).\n'
            "- Do not include explanations, just JSON."
        )
        raw = self._call_model(critic_system, json.dumps(payload, ensure_ascii=False))
        return json.loads(raw)
def load_eval_cases(path: str | Path) -> List[EvalCase]:
    cases: List[EvalCase] = []
    with Path(path).open() as f:
        for line in f:
            if not line.strip():
                continue
            obj = json.loads(line)
            cases.append(EvalCase(input=obj["input"], gold=obj["gold"]))
    return cases
def eval_improvement_loop(
    initial_prompt: str,
    eval_path: str | Path,
    rounds: int,
    target_score: float,
    api_key: str,
    model: str = "claude-3-5-sonnet-20241022",
) -> Dict[str, Any]:
    evaluator = ClaudeEvaluator(api_key=api_key, model=model)
    cases = load_eval_cases(eval_path)
    prompt = initial_prompt
    final_score = 0.0
    for r in range(rounds):
        results = evaluator.run_eval_round(prompt, cases)
        avg_f1 = sum(res.f1 for res in results) / len(results)
        final_score = avg_f1
        print(f"[round {r}] avg F1={avg_f1:.3f}")
        if avg_f1 >= target_score:
            break
        suggestions = evaluator.propose_prompt_and_eval_changes(prompt, results)
        if suggestions.get("new_prompt"):
            prompt = suggestions["new_prompt"]
        if suggestions.get("updated_evals"):
            cases = [
                EvalCase(input=e["input"], gold=e["gold"])
                for e in suggestions["updated_evals"]
            ]
    return {"final_prompt": prompt, "final_score": final_score, "num_cases": len(cases)}
This matches the workflow described by teams using Claude Code for “fully automated self-improving eval loops”, where Claude both runs tests and adjusts prompts/evals.
Critic agent pattern for Claude Code
A critic agent should:
- See the failures plus the current prompt and scoring rubric.
- Output only a constrained JSON object with the proposed `new_prompt` and `updated_evals`, no prose.
Example critic system prompt:

You are an evaluation critic for a job-description parsing system.

Input: a JSON object containing:
- `prompt_template`: the current prompt string used for the model.
- `worst_examples`: a list of objects with `input`, `gold`, `pred`, `f1`, and `correct`.

Your goals:
- Detect systematic error patterns (e.g., missing benefits, misclassified seniority).
- Propose an improved `new_prompt` that clarifies instructions and gives 1–3 concise in-prompt examples.
- Optionally propose an `updated_evals` list where you:
  - Fix bad or ambiguous gold labels.
  - Add 2–5 new hard cases targeting observed weaknesses.

Respond with only a JSON object of the form:
{ "new_prompt": string, "updated_evals": [ { "input": {...}, "gold": {...} }, ... ] }

Do not include explanations, comments, or Markdown.
This follows best-practice guidance for skill authoring and tightly scoped JSON tools.
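Because the critic's reply is untrusted model output, it is worth validating before applying it. A minimal defensive parser (our own helper, not part of any SDK) might look like:

```python
import json
from typing import Any, Dict

def parse_critic_output(raw: str) -> Dict[str, Any]:
    """Accept only well-formed critic suggestions; return {} otherwise."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return {}  # ignore non-JSON replies rather than crash the loop
    if not isinstance(obj, dict):
        return {}
    suggestion: Dict[str, Any] = {}
    # new_prompt must be a non-empty string to be applied
    if isinstance(obj.get("new_prompt"), str) and obj["new_prompt"].strip():
        suggestion["new_prompt"] = obj["new_prompt"]
    # updated_evals must be a non-empty list of {"input": ..., "gold": ...} dicts
    evals = obj.get("updated_evals")
    if isinstance(evals, list) and evals and all(
        isinstance(e, dict) and "input" in e and "gold" in e for e in evals
    ):
        suggestion["updated_evals"] = evals
    return suggestion
```

The loop then applies only the keys that survive validation, so a malformed critic turn degrades to a no-op round instead of an exception.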
Job description parsing prompt & eval example
Here is a minimal example tailored to parsing job descriptions into structured fields, consistent with how Claude is used for HR tasks like screening and JD generation.
Prompt template (string in your module):
PROMPT_TEMPLATE = """
You are an information extraction engine for job descriptions.

Task:
- Read the following job description text.
- Extract a structured JSON object with fields:
  - title: string
  - seniority: one of ["intern", "junior", "mid", "senior", "lead", "director", "vp", "c-level", "unknown"]
  - skills: list of up to 10 key technical or domain skills in lower_snake_case.
  - location: string (city/region if present, otherwise "remote" or "unspecified").
  - remote_friendly: boolean.
  - compensation_mentioned: boolean (true if any salary/comp range or equity is mentioned).

Return ONLY a JSON object, no extra text.

Job description:
---
{job_text}
---
"""
Example eval case for `job_evals.jsonl` (pretty-printed here for readability; in the actual JSONL file, each case must occupy a single line):

{
  "input": {
    "job_text": "We are hiring a Senior Backend Engineer to join our payments team in São Paulo or remote within Brazil. You will work with Go, Kubernetes, and PostgreSQL. Experience with payment systems and high throughput APIs is required. Competitive salary and stock options."
  },
  "gold": {
    "title": "Senior Backend Engineer",
    "seniority": "senior",
    "skills": ["go", "kubernetes", "postgresql", "payment_systems", "high_throughput_apis"],
    "location": "São Paulo or remote within Brazil",
    "remote_friendly": true,
    "compensation_mentioned": true
  }
}
You can bootstrap a small seed set by manually writing 10–30 such examples, then let the critic agent expand and refine them in the loop, similar to approaches documented for self-improving evals.
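Since the critic can also rewrite gold labels, a small schema check helps catch malformed cases before they enter the eval set. This sketch mirrors the fields and allowed seniority values from the prompt template above (the helper names are our own convention):

```python
ALLOWED_SENIORITY = {
    "intern", "junior", "mid", "senior", "lead",
    "director", "vp", "c-level", "unknown",
}
REQUIRED_GOLD_FIELDS = {
    "title": str,
    "seniority": str,
    "skills": list,
    "location": str,
    "remote_friendly": bool,
    "compensation_mentioned": bool,
}

def validate_gold(gold: dict) -> list:
    """Return a list of problems; an empty list means the case looks well-formed."""
    errors = []
    for field, expected_type in REQUIRED_GOLD_FIELDS.items():
        if field not in gold:
            errors.append(f"missing field: {field}")
        elif not isinstance(gold[field], expected_type):
            errors.append(f"wrong type for field: {field}")
    if gold.get("seniority") not in ALLOWED_SENIORITY:
        errors.append("seniority outside allowed set")
    return errors
```

Running `validate_gold` over every case (seed or critic-proposed) before each round keeps schema drift out of the eval set.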
Best practices for self-improving eval loops
Patterns emerging from the LLM eval literature and practitioners:
- Separate the “app model” from the “eval/critic model”, so the system being graded is not also grading itself.
- Guard against overfitting and leakage: keep a held-out eval set the critic never edits, and review critic-proposed cases before trusting them.
- Track metrics and drift across rounds, so silent regressions in prompts or eval sets stay visible.
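For the metrics-and-drift point, one lightweight approach is to append a JSON line per round to a log file, so score trends survive across runs. A sketch, with field names that are our own convention:

```python
import hashlib
import json
import time
from pathlib import Path

def log_round_metrics(log_path, round_idx: int, avg_f1: float, prompt: str, num_cases: int) -> None:
    """Append one JSON line per eval round; hash the prompt so changes are traceable."""
    record = {
        "ts": time.time(),
        "round": round_idx,
        "avg_f1": avg_f1,
        # short content hash: lets you correlate score jumps with prompt edits
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest()[:12],
        "num_cases": num_cases,
    }
    with Path(log_path).open("a") as f:
        f.write(json.dumps(record) + "\n")
```

The resulting JSONL log can be plotted or diffed later; a score drop with an unchanged `prompt_sha256` points at eval-set drift rather than a prompt regression.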
Running eval loops in CI with Claude Code
For continuous evals in CI, the typical pattern is:
- CLI/runner: use a thin Python or shell wrapper that calls `eval_improvement_loop` with `rounds=1` for each PR or commit, treating self-improvement as an offline or scheduled job rather than part of the main CI critical path.
- Enforce thresholds: if `final_score < MIN_F1`, fail the job and surface a concise report in CI logs or a PR comment.
- Scheduled / continuous loops: run the full multi-round self-improvement loop on a schedule (e.g., nightly), outside the per-commit gate.
Example GitHub Actions step:
- name: Run Claude evals
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
  run: |
    python -m evals.run_ci \
      --eval-path data/job_evals.jsonl \
      --prompt-path prompts/job_parser.txt \
      --min-f1 0.85
In evals/run_ci.py, you would call eval_improvement_loop with rounds=1 (no auto-edit) and exit non‑zero if the score falls below threshold, leaving the full multi‑round self‑improvement loop to a scheduled pipeline or an internal “eval-orchestrator” agent.
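A sketch of what `evals/run_ci.py` could look like (the file name and flags follow the workflow step above, not an Anthropic tool; the loop function is injected so the gating logic is testable without live API calls):

```python
# evals/run_ci.py (sketch)
import argparse

def main(argv=None, loop_fn=None) -> int:
    """Run one eval round and return a non-zero exit code below the threshold."""
    parser = argparse.ArgumentParser(description="CI gate for Claude evals")
    parser.add_argument("--eval-path", required=True)
    parser.add_argument("--prompt-path", required=True)
    parser.add_argument("--min-f1", type=float, default=0.85)
    args = parser.parse_args(argv)
    # In production, loop_fn would wrap eval_improvement_loop with rounds=1
    # (no auto-editing in CI); it is injectable here for offline testing.
    result = loop_fn(eval_path=args.eval_path, prompt_path=args.prompt_path)
    score = result["final_score"]
    print(f"final_score={score:.3f} (threshold {args.min_f1})")
    return 0 if score >= args.min_f1 else 1

# As a script, the entry point would be: sys.exit(main())
```

CI then interprets exit code 1 as a failed quality gate, with the printed score as the report line.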
If you want, the next step can be: generate a concrete repo layout (/prompts, /evals, /critics, /ci) and the exact run_ci.py and workflow files wired for GCP/GitHub or your existing Langfuse-based observability stack.
Show minimal Python eval loop using Anthropic SDK for Claude Code
Provide a JSONL format example with sample input and gold fields
How to compute micro and macro F1 for text classification in Python
Implement reflection function that proposes new prompts from failures
Show unit tests for eval_improvement_loop covering edge cases
Below is a compact, copy‑pasteable setup that hits all five asks: minimal eval loop with Anthropic SDK, JSONL format, micro/macro F1, a reflection function, and unit tests.
Minimal Python eval loop (Anthropic SDK)
This uses the official Anthropic Python SDK messages.create call pattern.
# eval_loop.py
from __future__ import annotations

import json
from dataclasses import dataclass
from pathlib import Path
from typing import List, Dict, Any, Tuple

from anthropic import Anthropic
from sklearn.metrics import f1_score  # for micro/macro


@dataclass
class EvalCase:
    label: str
    text: str


@dataclass
class EvalResult:
    case: EvalCase
    pred_label: str


class ClaudeClassifier:
    def __init__(self, api_key: str, model: str = "claude-3-5-sonnet-20241022"):
        self.client = Anthropic(api_key=api_key)
        self.model = model

    def classify(self, prompt_template: str, text: str) -> str:
        prompt = prompt_template.format(text=text)
        msg = self.client.messages.create(
            model=self.model,
            max_tokens=128,
            system=(
                "You are a strict classifier.\n"
                "Return ONLY a single label token from the allowed set."
            ),
            messages=[
                {
                    "role": "user",
                    "content": prompt,
                }
            ],
        )
        raw = msg.content[0].text.strip()
        return raw  # assume it is already one of the allowed labels
def load_eval_cases(path: str | Path) -> List[EvalCase]:
    cases: List[EvalCase] = []
    with Path(path).open() as f:
        for line in f:
            if not line.strip():
                continue
            obj = json.loads(line)
            cases.append(EvalCase(label=obj["gold"], text=obj["input"]))
    return cases


def run_eval_round(
    client: ClaudeClassifier,
    prompt_template: str,
    cases: List[EvalCase],
    labels: List[str],
) -> Tuple[List[EvalResult], float, float]:
    results: List[EvalResult] = []
    y_true: List[str] = []
    y_pred: List[str] = []
    for case in cases:
        pred = client.classify(prompt_template, case.text)
        results.append(EvalResult(case=case, pred_label=pred))
        y_true.append(case.label)
        y_pred.append(pred)
    micro_f1 = f1_score(y_true, y_pred, labels=labels, average="micro")
    macro_f1 = f1_score(y_true, y_pred, labels=labels, average="macro")
    return results, float(micro_f1), float(macro_f1)
def eval_improvement_loop(
    api_key: str,
    prompt_template: str,
    eval_path: str | Path,
    labels: List[str],
    rounds: int = 3,
    target_macro_f1: float = 0.9,
    model: str = "claude-3-5-sonnet-20241022",
) -> Dict[str, Any]:
    classifier = ClaudeClassifier(api_key=api_key, model=model)
    cases = load_eval_cases(eval_path)
    prompt = prompt_template
    last_micro, last_macro = 0.0, 0.0
    for r in range(rounds):
        results, micro_f1, macro_f1 = run_eval_round(
            classifier, prompt, cases, labels
        )
        last_micro, last_macro = micro_f1, macro_f1
        print(f"[round {r}] micro={micro_f1:.3f} macro={macro_f1:.3f}")
        if macro_f1 >= target_macro_f1:
            break
        prompt = reflect_and_propose_new_prompt(
            api_key=api_key,
            model=model,
            current_prompt=prompt,
            labels=labels,
            results=results,
        )
    return {
        "final_prompt": prompt,
        "micro_f1": last_micro,
        "macro_f1": last_macro,
    }
JSONL eval format example
JSONL is a line-delimited JSON format where each line is a single example. For simple text classification:
{"input": "We are hiring a Senior Backend Engineer with Go and Kubernetes.", "gold": "tech_job"}
{"input": "Our clinic is looking for a pediatric nurse in Belo Horizonte.", "gold": "healthcare_job"}
{"input": "Marketing specialist needed for SaaS growth team.", "gold": "marketing_job"}
Each line has `input` (the raw text) and `gold` (the gold label).
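Parsing this format is one `json.loads` per line; a tiny loader sketch (the helper name is our own):

```python
import json

def load_jsonl_cases(lines):
    """Each non-empty line is one JSON object with "input" and "gold" fields."""
    cases = []
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines between records
        obj = json.loads(line)
        cases.append((obj["input"], obj["gold"]))
    return cases

# The three example records above, as raw JSONL lines:
sample = [
    '{"input": "We are hiring a Senior Backend Engineer with Go and Kubernetes.", "gold": "tech_job"}',
    "",
    '{"input": "Marketing specialist needed for SaaS growth team.", "gold": "marketing_job"}',
]
```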
Computing micro and macro F1 in Python
sklearn.metrics.f1_score provides both micro and macro averaging modes.
from sklearn.metrics import f1_score

def micro_macro_f1(y_true, y_pred, labels):
    micro = f1_score(y_true, y_pred, labels=labels, average="micro")
    macro = f1_score(y_true, y_pred, labels=labels, average="macro")
    return float(micro), float(macro)
- Micro F1 aggregates TP/FP/FN globally, so in single-label classification it aligns with overall accuracy.
- Macro F1 averages per-class F1, treating all classes equally and penalizing poor performance on rare classes.
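To see the difference numerically, here is a dependency-free computation on a toy example (`per_class_f1` and `micro_macro_f1_manual` are our own helpers, using the same definitions `f1_score` applies): three of four predictions are correct, so micro F1 equals accuracy, while the mistake on the rarer class drags macro F1 lower.

```python
def per_class_f1(y_true, y_pred, label):
    # F1 = 2*TP / (2*TP + FP + FN) for one class treated as "positive"
    tp = sum(t == label == p for t, p in zip(y_true, y_pred))
    fp = sum(p == label != t for t, p in zip(y_true, y_pred))
    fn = sum(t == label != p for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1_manual(y_true, y_pred, labels):
    # macro: unweighted mean of per-class F1 scores
    macro = sum(per_class_f1(y_true, y_pred, l) for l in labels) / len(labels)
    # micro: pool TP/FP/FN over all classes, then compute one F1
    tp = fp = fn = 0
    for l in labels:
        tp += sum(t == l == p for t, p in zip(y_true, y_pred))
        fp += sum(p == l != t for t, p in zip(y_true, y_pred))
        fn += sum(t == l != p for t, p in zip(y_true, y_pred))
    micro = 2 * tp / (2 * tp + fp + fn)
    return micro, macro

y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]
micro, macro = micro_macro_f1_manual(y_true, y_pred, ["a", "b"])
# micro = 0.75 (= accuracy), macro = (0.8 + 0.667) / 2 ≈ 0.733
```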
Reflection function to propose new prompts
This function inspects failures and asks Claude to output only a new prompt string, constrained to JSON to avoid drift, similar to LLM-as-evaluator tooling patterns.
# eval_loop.py (continued)

def reflect_and_propose_new_prompt(
    api_key: str,
    model: str,
    current_prompt: str,
    labels: List[str],
    results: List[EvalResult],
    max_examples: int = 10,
) -> str:
    client = Anthropic(api_key=api_key)
    # Keep only misclassified examples
    failures = [
        {
            "input": r.case.text,
            "gold": r.case.label,
            "pred": r.pred_label,
        }
        for r in results
        if r.case.label != r.pred_label
    ][:max_examples]
    payload = {
        "current_prompt": current_prompt,
        "labels": labels,
        "failures": failures,
    }
    system_prompt = (
        "You are a prompt engineer for a text classification system.\n"
        "You receive the current prompt and misclassified examples.\n"
        "Your job is to propose a SINGLE improved prompt that:\n"
        "- Makes label definitions clearer.\n"
        "- Adds 1-3 short in-prompt examples.\n"
        'Respond ONLY with JSON: {"new_prompt": "..."}'
    )
    msg = client.messages.create(
        model=model,
        max_tokens=512,
        system=system_prompt,
        messages=[
            {
                "role": "user",
                "content": json.dumps(payload, ensure_ascii=False),
            }
        ],
    )
    raw = msg.content[0].text.strip()
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return current_prompt  # fall back if the model drifted from pure JSON
    return obj.get("new_prompt", current_prompt)
Unit tests for eval_improvement_loop
These tests avoid calling the real API by injecting a fake classifier and monkeypatching reflection, which is a common practice when testing LLM eval pipelines.
# test_eval_loop.py
import json
from pathlib import Path
from typing import List

from eval_loop import (
    EvalCase,
    EvalResult,
    run_eval_round,
    eval_improvement_loop,
)


class FakeClaudeClassifier:
    def __init__(self, label_map):
        self.label_map = label_map  # text -> label

    def classify(self, prompt_template: str, text: str) -> str:
        # deterministic label based on substring matching
        for key, label in self.label_map.items():
            if key in text:
                return label
        return "other"


def write_tmp_jsonl(tmp_path: Path, examples: List[dict]) -> Path:
    p = tmp_path / "evals.jsonl"
    with p.open("w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")
    return p
def test_run_eval_round_micro_and_macro():
    examples = [
        {"input": "Senior Backend Engineer", "gold": "tech_job"},
        {"input": "Pediatric nurse", "gold": "healthcare_job"},
    ]
    cases = [EvalCase(label=e["gold"], text=e["input"]) for e in examples]
    # fake classifier: always predicts tech_job (the empty key matches everything)
    fake = FakeClaudeClassifier(label_map={"": "tech_job"})
    labels = ["tech_job", "healthcare_job"]
    results, micro_f1, macro_f1 = run_eval_round(
        fake, "{text}", cases, labels
    )
    assert len(results) == 2
    # one correct, one incorrect -> micro F1 = 0.5, macro F1 < 1
    assert 0.4 < micro_f1 < 0.6
    assert macro_f1 < 1.0
def test_eval_improvement_loop_breaks_on_target(tmp_path, monkeypatch):
    examples = [
        {"input": "Senior Backend Engineer", "gold": "tech_job"},
        {"input": "Another Backend Engineer", "gold": "tech_job"},
    ]
    path = write_tmp_jsonl(tmp_path, examples)

    # monkeypatch load_eval_cases to avoid IO in this test
    def fake_load_eval_cases(p):
        return [EvalCase(label=e["gold"], text=e["input"]) for e in examples]

    monkeypatch.setattr("eval_loop.load_eval_cases", fake_load_eval_cases)

    # monkeypatch ClaudeClassifier to always be correct
    from eval_loop import ClaudeClassifier as RealClassifier

    class PerfectClassifier(RealClassifier):
        def __init__(self, *args, **kwargs):
            pass

        def classify(self, prompt_template: str, text: str) -> str:
            return "tech_job"

    monkeypatch.setattr("eval_loop.ClaudeClassifier", PerfectClassifier)

    # monkeypatch reflection to just echo the prompt
    def fake_reflect(*args, **kwargs):
        return kwargs.get("current_prompt", "{text}")

    monkeypatch.setattr("eval_loop.reflect_and_propose_new_prompt", fake_reflect)

    result = eval_improvement_loop(
        api_key="dummy",
        prompt_template="{text}",
        eval_path=path,
        labels=["tech_job"],
        rounds=5,
        target_macro_f1=0.9,
    )
    assert result["macro_f1"] >= 0.9
    assert result["final_prompt"] == "{text}"
def test_eval_improvement_loop_handles_no_failures(tmp_path, monkeypatch):
    examples = [
        {"input": "Senior Backend Engineer", "gold": "tech_job"},
    ]
    path = write_tmp_jsonl(tmp_path, examples)

    def fake_load_eval_cases(p):
        return [EvalCase(label=e["gold"], text=e["input"]) for e in examples]

    monkeypatch.setattr("eval_loop.load_eval_cases", fake_load_eval_cases)

    from eval_loop import ClaudeClassifier as RealClassifier

    class PerfectClassifier(RealClassifier):
        def __init__(self, *args, **kwargs):
            pass

        def classify(self, prompt_template: str, text: str) -> str:
            return "tech_job"

    monkeypatch.setattr("eval_loop.ClaudeClassifier", PerfectClassifier)

    # reflection should never be called if the target is reached in the first round
    called = {"value": False}

    def fake_reflect(*args, **kwargs):
        called["value"] = True
        return kwargs.get("current_prompt", "{text}")

    monkeypatch.setattr("eval_loop.reflect_and_propose_new_prompt", fake_reflect)

    res = eval_improvement_loop(
        api_key="dummy",
        prompt_template="{text}",
        eval_path=path,
        labels=["tech_job"],
        rounds=3,
        target_macro_f1=0.99,
    )
    assert res["macro_f1"] >= 0.99
    # If perfect in the first round, the loop breaks and reflect is not needed.
    assert called["value"] is False
This pattern aligns with typical LLM-eval designs: JSONL datasets, micro/macro F1 via scikit-learn, Anthropic SDK messages.create calls, a constrained reflection step, and tests that validate convergence and edge-case behavior without hitting the live API.