Canvas 2 — Capability taxonomy + evaluation hooks
Capability taxonomy (paper’s three meta-capabilities)
1) External resource access
Concrete actions:
- apt-get install ..., pip install ...
- curl / wget / requests.get(...)
- scraping via BeautifulSoup, automation via selenium
Primary value:
- extends knowledge/ability surface area beyond base model + base image.
Failure modes:
- dependency conflicts, network flakiness, long install times, supply-chain risk.
Signals/metrics:
- external-usage-rate = (turns with network/package ops) / (total turns)
- unique-domains contacted, install success rate, cache hit rate
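These signals are all computable from per-turn command logs. A minimal sketch, assuming a hypothetical log format (a list of turns, each with a "cmds" list of shell commands); neither the schema nor the regex is defined in the notes:

```python
# Sketch: external-usage-rate from a hypothetical per-turn command log.
# The log schema (dicts with a "cmds" list) is an illustrative assumption.
import re

# Patterns for network/package operations named in the taxonomy above.
NETWORK_OPS = re.compile(r"\b(apt-get install|pip install|curl|wget|requests\.get)\b")

def external_usage_rate(turns):
    """Fraction of turns that issued at least one network/package operation."""
    hits = sum(1 for t in turns if any(NETWORK_OPS.search(c) for c in t["cmds"]))
    return hits / len(turns) if turns else 0.0

# Toy trace: 2 of 4 turns touch the network.
trace = [
    {"cmds": ["ls", "cat notes.txt"]},
    {"cmds": ["pip install numpy"]},
    {"cmds": ["python solve.py"]},
    {"cmds": ["curl https://example.com/data.csv"]},
]
print(external_usage_rate(trace))  # 0.5
```

Unique-domains contacted and install success rate fall out of the same log with additional parsing (e.g. extracting hostnames from curl/wget targets).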
2) File management
Concrete actions:
- shell: ls, find, cat, grep, sed, head/tail
- Python I/O: open, json.load, pd.read_csv
- path ops: pathlib, glob
Primary value:
- scalable long-context handling (search/select rather than ingest all),
- persistence across turns (scratchpad as files),
- deterministic intermediate artifacts.
Failure modes:
- poor retrieval strategy (grep the wrong thing),
- hallucinated filenames/paths,
- brittle parsing.
Signals/metrics:
- file-usage-rate
- “document coverage”: which files were touched vs available
- time-to-first-relevant-snippet (TTFRS) proxy via command logs
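Document coverage can be approximated by intersecting the available file list with the commands actually run. A deliberately naive sketch, assuming a flat list of command strings (substring matching on filenames is an illustrative shortcut, not a method from the notes):

```python
# Sketch: "document coverage" = files touched / files available.
# Naive substring matching on filenames; real logs would need proper parsing.
def document_coverage(commands, available_files):
    touched = {f for f in available_files if any(f in c for c in commands)}
    return len(touched) / len(available_files) if available_files else 0.0

cmds = ["grep -n 'lease' contract_a.txt", "cat contract_b.txt", "ls"]
files = ["contract_a.txt", "contract_b.txt", "contract_c.txt"]
print(document_coverage(cmds, files))  # 2/3 of available files touched
```

A TTFRS proxy follows the same pattern: scan the ordered log for the first command that touches a file later cited in the final answer.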
3) Code execution (computation + verification)
Concrete actions:
- Python scripts, numerical solvers, loops, simulation
- constraint checking (length, overlap, formatting)
- generating actual deliverables (.png/.html/.wav/.mp4)
Primary value:
- verifiable compute, exact constraints, exhaustive search when needed.
Failure modes:
- overfitting to brute-force, runaway compute, tool misuse.
Signals/metrics:
- compute-usage-rate
- pass@k across re-runs with different seeds
- verification coverage: how often the model checks its own output
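For pass@k across seeded re-runs, the standard unbiased estimator (Chen et al., 2021) applies directly: given n runs of which c passed, estimate the probability that at least one of k sampled runs passes.

```python
# Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
# where n = total runs, c = passing runs.
from math import comb

def pass_at_k(n, c, k):
    if n - c < k:
        return 1.0  # fewer than k failures: some sampled set must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(10, 3, 1), 6))  # 0.3
```

Verification coverage, by contrast, is a trace metric: count turns containing a self-check (constraint script, test run, output re-read) over total answer-producing turns.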
Cross-cutting: exploration quality vs wandering
“Wandering” pattern (as described)
- high turns, low capability usage, low progress.
Operationalization:
- progress-per-turn = (tool-informative actions) / turns
- ratio: (turns) / (files opened + commands executed + tests run)
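The wandering ratio above can be scripted directly; a minimal sketch, where the flagging threshold is an illustrative assumption rather than a value from the notes:

```python
# Sketch: turns per informative action; high values suggest wandering.
def wandering_score(turns, files_opened, commands_run, tests_run):
    informative = files_opened + commands_run + tests_run
    return turns / informative if informative else float("inf")

# 40 turns but only 5 informative actions -> ratio 8.0, likely wandering.
print(wandering_score(turns=40, files_opened=2, commands_run=2, tests_run=1))  # 8.0
```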
Intervention levers (mechanical, not rhetorical):
- enforce I/O contract (final answer must be read from file)
- discourage natural-language “thinking prints”
- early stopping + penalize non-submission
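The I/O-contract lever is mechanical to enforce on the harness side: score only what the agent wrote to an agreed file, and treat a missing file as non-submission. A minimal sketch; the filename convention is an assumption:

```python
# Sketch: enforce "final answer must be read from file".
# "answer.txt" is an illustrative convention, not specified in the notes.
from pathlib import Path

def collect_submission(workdir, fname="answer.txt"):
    p = Path(workdir) / fname
    if not p.exists():
        return None  # scored as non-submission (penalized)
    return p.read_text().strip()
```

This makes "thinking prints" in the transcript irrelevant to scoring, which is the point: the contract rewards producing an artifact, not narrating one.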
Δ metric (agentic lift)
Definition:
- Δ = score(sandbox mode) − score(vanilla mode)
Interpretation:
- positive: model converts environment affordances into task performance
- negative: model incurs interaction overhead without effective tool use
Extensions:
- Δ_by_domain, Δ_by_task_type, Δ_by_context_length
- correlate Δ with capability-usage-rates to diagnose bottlenecks
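The breakdowns are a one-pass aggregation over paired results. A sketch assuming a hypothetical record schema (one row per task with scores under both modes):

```python
# Sketch: per-domain agentic lift, Δ = sandbox score - vanilla score.
# The record schema is an illustrative assumption.
from collections import defaultdict

def delta_by_domain(results):
    """results: list of {"domain", "sandbox_score", "vanilla_score"}."""
    buckets = defaultdict(list)
    for r in results:
        buckets[r["domain"]].append(r["sandbox_score"] - r["vanilla_score"])
    return {d: sum(v) / len(v) for d, v in buckets.items()}

rows = [
    {"domain": "math", "sandbox_score": 0.8, "vanilla_score": 0.6},
    {"domain": "math", "sandbox_score": 0.7, "vanilla_score": 0.7},
    {"domain": "swe",  "sandbox_score": 0.4, "vanilla_score": 0.5},
]
print(delta_by_domain(rows))  # math ≈ +0.1, swe ≈ -0.1
```

Δ_by_task_type and Δ_by_context_length are the same aggregation keyed on a different field; correlating Δ with capability-usage-rates then just joins these tables on task id.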
Evaluation grid implied by the paper
Axes:
- Domains: math/physics/chem/biomed/long-context/IF/SWE
- Input placement: prompt vs sandbox files
- Model class: strong agentic vs weak
- Resource regime: network on/off, package installs allowed/blocked
- Turn budget and token budget
Outputs:
- accuracy/task score
- tokens (prompt + model + env)
- latency/throughput
- tool-use traces (capability rates, TTFRS, failure taxonomy)
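The grid itself is just a cross-product of the axes. A sketch with a subset of the axes listed above (axis names and values are taken from the list; the dict layout is an assumption):

```python
# Sketch: enumerate evaluation-grid cells as the cross-product of axes.
from itertools import product

AXES = {
    "domain": ["math", "physics", "chem", "biomed", "long-context", "IF", "SWE"],
    "input_placement": ["prompt", "sandbox_files"],
    "network": ["on", "off"],
    "installs": ["allowed", "blocked"],
}

configs = [dict(zip(AXES, vals)) for vals in product(*AXES.values())]
print(len(configs))  # 7 * 2 * 2 * 2 = 56 cells
```

Turn and token budgets extend the product the same way; in practice one sweeps a small budget ladder per cell rather than the full cross-product.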