# /adk-eval - Evaluate ADK Agents

Evaluate Google ADK agent performance against test datasets.
## Usage

```
/adk-eval <agent-path> <dataset>     # Run evaluation
/adk-eval --create <output>          # Create eval dataset template
/adk-eval --compare <run1> <run2>    # Compare evaluation runs
/adk-eval --report <results>         # Generate HTML report
```
## Examples

### Run Evaluation

```bash
# Basic evaluation
/adk-eval agents/my_agent eval_sets/test.json

# With specific metrics
/adk-eval agents/my_agent eval_sets/test.json --metrics accuracy,latency

# Set minimum accuracy threshold
/adk-eval agents/my_agent eval_sets/test.json --min-accuracy 0.85

# Save results
/adk-eval agents/my_agent eval_sets/test.json --output results/run_001.json
```
### Create Dataset Template

```bash
# Generate template
/adk-eval --create eval_sets/new_dataset.json

# Interactive creation
/adk-eval --create eval_sets/new_dataset.json --interactive
```
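If you want to script template creation instead of using `--create`, a minimal sketch is below. It writes empty cases using the field names from the Dataset Format section; the helper name `write_template` and the placeholder values are this sketch's own, not part of the ADK CLI.

```python
import json
import os

# One empty case mirroring the documented dataset schema.
TEMPLATE_CASE = {
    "input": "User message or query",
    "expected_output": "",
    "expected_tool_calls": [],
    "expected_tool_args": {},
    "context": {},
    "tags": [],
}

def write_template(path: str, num_cases: int = 1) -> None:
    """Write a JSON dataset template containing `num_cases` blank cases."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    cases = [dict(TEMPLATE_CASE) for _ in range(num_cases)]
    with open(path, "w") as f:
        json.dump(cases, f, indent=2)

write_template("eval_sets/new_dataset.json", num_cases=3)
```

Edit the generated cases in place, then run the evaluation against the filled-in file.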
### Compare Runs

```bash
# Compare two evaluation runs
/adk-eval --compare results/v1.json results/v2.json

# Output:
# Metric        | v1   | v2   | Change
# accuracy      | 0.85 | 0.92 | +8.2%
# tool_accuracy | 0.90 | 0.95 | +5.6%
# avg_latency   | 1.2s | 0.9s | -25%
```
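The `Change` column is the relative difference between the two runs. A sketch of that arithmetic (the formatting helper is this sketch's own, not the CLI's):

```python
def pct_change(old: float, new: float) -> str:
    """Relative change between two metric values, formatted as a signed percent."""
    delta = (new - old) / old * 100
    return f"{delta:+.1f}%"

# Values from the comparison example above:
print(pct_change(0.85, 0.92))  # +8.2%
print(pct_change(0.90, 0.95))  # +5.6%
print(pct_change(1.2, 0.9))    # -25.0%
```

Note the sign convention: a latency drop shows as negative, which is an improvement, so read the sign per metric.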
### Generate Report

```bash
# HTML report
/adk-eval --report results/run_001.json --format html
# Opens in browser with charts and details
```
Dataset Format
[
{
"input": "User message or query",
"expected_output": "Expected agent response (optional)",
"expected_tool_calls": ["tool1", "tool2"],
"expected_tool_args": {
"tool1": {"arg1": "value1"}
},
"context": {"key": "value"},
"tags": ["category", "difficulty"]
}
]
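A quick sanity check before running an evaluation can catch malformed cases early. The sketch below treats only `input` as required (the docs mark `expected_output` optional; the strictness of the other fields is an assumption), and `validate_dataset` is a hypothetical helper, not part of ADK.

```python
import json

# Field names from the schema above.
KNOWN_FIELDS = {"input", "expected_output", "expected_tool_calls",
                "expected_tool_args", "context", "tags"}

def validate_dataset(cases: list) -> list:
    """Return a list of problems found; an empty list means the dataset looks well-formed."""
    problems = []
    for i, case in enumerate(cases):
        if "input" not in case:
            problems.append(f"case {i}: missing 'input'")
        unknown = set(case) - KNOWN_FIELDS
        if unknown:
            problems.append(f"case {i}: unknown fields {sorted(unknown)}")
    return problems

cases = json.loads('[{"input": "hi", "tags": ["smoke"]}, {"expected_output": "x"}]')
print(validate_dataset(cases))  # ["case 1: missing 'input'"]
```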
## Metrics

| Metric | Description |
|---|---|
| accuracy | Output matches expected |
| tool_accuracy | Correct tools called |
| tool_arg_accuracy | Correct tool arguments |
| latency | Response time (ms) |
| token_count | Total tokens used |
| cost | Estimated API cost |
| safety | Content safety score |
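To make `tool_accuracy` concrete: one plausible definition is the fraction of expected tool calls that actually occur. ADK's exact scoring is not specified here, so treat this as an illustration, not the documented formula.

```python
from collections import Counter

def tool_accuracy(expected: list, actual: list) -> float:
    """Fraction of expected tool calls present in the actual calls (multiset overlap)."""
    if not expected:
        return 1.0  # nothing expected, nothing to get wrong
    overlap = Counter(expected) & Counter(actual)
    return sum(overlap.values()) / len(expected)

print(tool_accuracy(["search", "summarize"], ["search", "summarize"]))  # 1.0
print(tool_accuracy(["search", "summarize"], ["lookup", "summarize"]))  # 0.5
```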
## Output

```
ADK Evaluation Results
======================
Agent: my_agent
Dataset: eval_sets/test.json
Cases: 50

Results:
  accuracy: 92.0%
  tool_accuracy: 95.0%
  avg_latency: 1.2s
  total_cost: $0.15

Failed Cases: 4
  - Case 12: Wrong tool called (search vs lookup)
  - Case 23: Missing required argument
  - Case 34: Incorrect output format
  - Case 45: Timeout exceeded

Saved: results/run_001.json
```
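The headline numbers tie together: 50 cases with 4 failures is 46/50 = 92.0%. A one-liner showing that relationship (assuming accuracy here is simply passed cases over total, which matches the sample report):

```python
def summarize(total_cases: int, failed: int) -> str:
    """Accuracy line as in the sample report: passed cases over total."""
    accuracy = (total_cases - failed) / total_cases * 100
    return f"accuracy: {accuracy:.1f}%"

print(summarize(50, 4))  # accuracy: 92.0%
```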
## CI/CD Usage

```bash
# Fail if accuracy below threshold
/adk-eval agents/my_agent eval_sets/regression.json --min-accuracy 0.85 --exit-on-fail

# Exit codes:
# 0 - All thresholds passed
# 1 - Threshold violation
# 2 - Execution error
```
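A pipeline step can wrap the command and translate the exit codes above into log messages. The sketch assumes `adk-eval` is invocable as a shell command in your CI environment (an assumption; adjust the invocation to however your pipeline exposes it), and `run_gate` is a hypothetical helper.

```python
import subprocess
import sys

# Exit-code meanings from the CI/CD section above.
EXIT_MESSAGES = {
    0: "all thresholds passed",
    1: "threshold violation",
    2: "execution error",
}

def run_gate(agent: str, dataset: str, min_accuracy: float) -> int:
    """Run the evaluation and return its exit code, logging what it means."""
    result = subprocess.run([
        "adk-eval", agent, dataset,
        "--min-accuracy", str(min_accuracy), "--exit-on-fail",
    ])
    code = result.returncode
    print(f"adk-eval exited {code}: {EXIT_MESSAGES.get(code, 'unknown')}")
    return code

# Example (in CI): sys.exit(run_gate("agents/my_agent", "eval_sets/regression.json", 0.85))
```

Propagating the return code with `sys.exit` lets the CI system mark the job failed on codes 1 and 2.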
## Related

- Agent: adk-orchestrator
- Skill: adk-evaluation
- Command: /adk-run
- Repository: google/adk-python

Version: 1.0.0 | Created: 2026-01-13