/adk-eval - Evaluate ADK Agents

Evaluate Google ADK agent performance against test datasets.

Usage

/adk-eval <agent-path> <dataset>    # Run evaluation
/adk-eval --create <output>         # Create eval dataset template
/adk-eval --compare <run1> <run2>   # Compare evaluation runs
/adk-eval --report <results>        # Generate HTML report

Examples

Run Evaluation

# Basic evaluation
/adk-eval agents/my_agent eval_sets/test.json

# With specific metrics
/adk-eval agents/my_agent eval_sets/test.json --metrics accuracy,latency

# Set minimum accuracy threshold
/adk-eval agents/my_agent eval_sets/test.json --min-accuracy 0.85

# Save results
/adk-eval agents/my_agent eval_sets/test.json --output results/run_001.json

Create Dataset Template

# Generate template
/adk-eval --create eval_sets/new_dataset.json

# Interactive creation
/adk-eval --create eval_sets/new_dataset.json --interactive
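
The template created by `--create` can be sketched in plain Python. The field names come from the Dataset Format section of this document; the `write_template` helper and the exact empty-value defaults are illustrative assumptions, not part of the /adk-eval CLI.

```python
import json

# Hypothetical sketch of the one-case template `--create` might emit.
# Field names follow the Dataset Format section; defaults are assumptions.
TEMPLATE = [{
    "input": "User message or query",
    "expected_output": "",
    "expected_tool_calls": [],
    "expected_tool_args": {},
    "context": {},
    "tags": [],
}]

def write_template(path: str) -> None:
    """Write the template as pretty-printed JSON so it is easy to edit by hand."""
    with open(path, "w") as f:
        json.dump(TEMPLATE, f, indent=2)
```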

Compare Runs

# Compare two evaluation runs
/adk-eval --compare results/v1.json results/v2.json

# Output:
# Metric        | v1   | v2   | Change
# accuracy      | 0.85 | 0.92 | +8.2%
# tool_accuracy | 0.90 | 0.95 | +5.6%
# avg_latency   | 1.2s | 0.9s | -25%
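
The Change column above is a signed relative difference. A minimal sketch of the computation, using the sample numbers from the table (the `percent_change` helper is illustrative, not part of the CLI):

```python
def percent_change(old: float, new: float) -> str:
    """Signed percent change, formatted like the comparison table."""
    delta = (new - old) / old * 100
    return f"{delta:+.1f}%"

# Sample metric values from the comparison output above.
v1 = {"accuracy": 0.85, "tool_accuracy": 0.90, "avg_latency": 1.2}
v2 = {"accuracy": 0.92, "tool_accuracy": 0.95, "avg_latency": 0.9}

for metric in v1:
    print(f"{metric:>13} | {v1[metric]} | {v2[metric]} | {percent_change(v1[metric], v2[metric])}")
```

Note that for latency a negative change is an improvement, so threshold logic should treat metrics directionally.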

Generate Report

# HTML report
/adk-eval --report results/run_001.json --format html

# Opens in browser with charts and details

Dataset Format

[
  {
    "input": "User message or query",
    "expected_output": "Expected agent response (optional)",
    "expected_tool_calls": ["tool1", "tool2"],
    "expected_tool_args": {
      "tool1": {"arg1": "value1"}
    },
    "context": {"key": "value"},
    "tags": ["category", "difficulty"]
  }
]
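
A dataset file in this format can be sanity-checked before a run. The sketch below is an assumption about useful validation, not behavior the CLI guarantees; it treats only "input" as required, matching the "(optional)" notes in the schema above.

```python
import json

# Field names from the Dataset Format section above.
REQUIRED = {"input"}
OPTIONAL = {"expected_output", "expected_tool_calls", "expected_tool_args",
            "context", "tags"}

def validate_dataset(path: str) -> list:
    """Load a dataset file and reject cases with missing or unknown fields."""
    with open(path) as f:
        cases = json.load(f)
    if not isinstance(cases, list):
        raise ValueError("dataset must be a JSON array of cases")
    for i, case in enumerate(cases):
        missing = REQUIRED - case.keys()
        if missing:
            raise ValueError(f"case {i}: missing fields {sorted(missing)}")
        unknown = case.keys() - REQUIRED - OPTIONAL
        if unknown:
            raise ValueError(f"case {i}: unknown fields {sorted(unknown)}")
    return cases
```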

Metrics

Metric            | Description
accuracy          | Output matches expected
tool_accuracy     | Correct tools called
tool_arg_accuracy | Correct tool arguments
latency           | Response time (ms)
token_count       | Total tokens used
cost              | Estimated API cost
safety            | Content safety score
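
As a rough sketch of how the match-based metrics could be scored per case: exact-match comparison is an assumption here (the real command may use fuzzier checks), and `score_case` is an illustrative name, not an ADK API.

```python
def score_case(case: dict, actual_output: str, actual_tools: list) -> dict:
    """Score one eval case against its expected_* fields (exact match assumed)."""
    scores = {}
    if "expected_output" in case:
        scores["accuracy"] = float(actual_output == case["expected_output"])
    if "expected_tool_calls" in case:
        # Order-sensitive comparison of the tool-call sequence.
        scores["tool_accuracy"] = float(actual_tools == case["expected_tool_calls"])
    return scores

# Dataset-level metrics would then average these per-case scores.
```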

Output

ADK Evaluation Results
======================

Agent: my_agent
Dataset: eval_sets/test.json
Cases: 50

Results:
accuracy: 92.0%
tool_accuracy: 95.0%
avg_latency: 1.2s
total_cost: $0.15

Failed Cases: 4
- Case 12: Wrong tool called (search vs lookup)
- Case 23: Missing required argument
- Case 34: Incorrect output format
- Case 45: Timeout exceeded

Saved: results/run_001.json

CI/CD Usage

# Fail if accuracy below threshold
/adk-eval agents/my_agent eval_sets/regression.json --min-accuracy 0.85 --exit-on-fail

# Exit codes:
# 0 - All thresholds passed
# 1 - Threshold violation
# 2 - Execution error
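
In a CI script, the documented exit codes can be handled explicitly. The `handle_eval_exit` function below is a hypothetical helper, not part of the CLI; it just maps the codes listed above to messages.

```shell
# Map the documented /adk-eval exit codes to CI-friendly messages.
# 0 = pass, 1 = threshold violation, 2 = execution error (all assumptions
# taken from the exit-code list above).
handle_eval_exit() {
  case "$1" in
    0) echo "eval passed" ;;
    1) echo "accuracy below threshold"; return 1 ;;
    2) echo "evaluation failed to run"; return 2 ;;
    *) echo "unknown exit code: $1"; return 2 ;;
  esac
}

# Usage in CI:
#   /adk-eval agents/my_agent eval_sets/regression.json \
#     --min-accuracy 0.85 --exit-on-fail
#   handle_eval_exit $?
```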
Related
  • Agent: adk-orchestrator
  • Skill: adk-evaluation
  • Command: /adk-run
  • Repository: google/adk-python

Version: 1.0.0 Created: 2026-01-13