Canvas 2 — Capability taxonomy + evaluation hooks
Capability taxonomy (paper’s three meta-capabilities)
1) External resource access
Concrete actions:
- apt-get install ..., pip install ...
- curl / wget / requests.get(...)
- scraping via BeautifulSoup, automation via selenium
Primary value:
- extends knowledge/ability surface area beyond base model + base image.
Failure modes:
- dependency conflicts, network flakiness, long install times, supply-chain risk.
Signals/metrics:
- external-usage-rate = (turns with network/package ops) / (total turns)
- unique-domains contacted, install success rate, cache hit rate
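These signals are all computable from per-turn command logs. A minimal sketch, assuming a hypothetical log format (a list of turns, each with a "cmds" list of shell commands); neither the schema nor the regex is defined in the notes:

```python
# Sketch: external-usage-rate from a hypothetical per-turn command log.
# The log schema (dicts with a "cmds" list) is an illustrative assumption.
import re

# Patterns for network/package operations named in the taxonomy above.
NETWORK_OPS = re.compile(r"\b(apt-get install|pip install|curl|wget|requests\.get)\b")

def external_usage_rate(turns):
    """Fraction of turns that issued at least one network/package operation."""
    hits = sum(1 for t in turns if any(NETWORK_OPS.search(c) for c in t["cmds"]))
    return hits / len(turns) if turns else 0.0

# Toy trace: 2 of 4 turns touch the network.
trace = [
    {"cmds": ["ls", "cat notes.txt"]},
    {"cmds": ["pip install numpy"]},
    {"cmds": ["python solve.py"]},
    {"cmds": ["curl https://example.com/data.csv"]},
]
print(external_usage_rate(trace))  # 0.5
```

Unique-domains contacted and install success rate fall out of the same log with additional parsing (e.g. extracting hostnames from curl/wget targets).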
2) File management
Concrete actions:
- shell: ls, find, cat, grep, sed, head/tail
- Python I/O: open, json.load, pd.read_csv
- path ops: pathlib, glob
Primary value:
- scalable long-context handling (search/select rather than ingest all),
- persistence across turns (scratchpad as files),
- deterministic intermediate artifacts.
Failure modes:
- poor retrieval strategy (grep the wrong thing),
- hallucinated filenames/paths,
- brittle parsing.
Signals/metrics:
- file-usage-rate
- “document coverage”: which files were touched vs available
- time-to-first-relevant-snippet (TTFRS) proxy via command logs
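Document coverage can be approximated by intersecting the available file list with the commands actually run. A deliberately naive sketch, assuming a flat list of command strings (substring matching on filenames is an illustrative shortcut, not a method from the notes):

```python
# Sketch: "document coverage" = files touched / files available.
# Naive substring matching on filenames; real logs would need proper parsing.
def document_coverage(commands, available_files):
    touched = {f for f in available_files if any(f in c for c in commands)}
    return len(touched) / len(available_files) if available_files else 0.0

cmds = ["grep -n 'lease' contract_a.txt", "cat contract_b.txt", "ls"]
files = ["contract_a.txt", "contract_b.txt", "contract_c.txt"]
print(document_coverage(cmds, files))  # 2/3 of available files touched
```

A TTFRS proxy follows the same pattern: scan the ordered log for the first command that touches a file later cited in the final answer.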
3) Code execution (computation + verification)
Concrete actions:
- Python scripts, numerical solvers, loops, simulation
- constraint checking (length, overlap, formatting)
- generating actual deliverables (.png/.html/.wav/.mp4)
Primary value:
- verifiable compute, exact constraints, exhaustive search when needed.
Failure modes:
- overfitting to brute-force, runaway compute, tool misuse.
Signals/metrics:
- compute-usage-rate
- pass@k across re-runs with different seeds
- verification coverage: how often the model checks its own output
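For pass@k across seeded re-runs, the standard unbiased estimator (Chen et al., 2021) applies directly: given n runs of which c passed, estimate the probability that at least one of k sampled runs passes.

```python
# Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
# where n = total runs, c = passing runs.
from math import comb

def pass_at_k(n, c, k):
    if n - c < k:
        return 1.0  # fewer than k failures: some sampled set must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(10, 3, 1), 6))  # 0.3
```

Verification coverage, by contrast, is a trace metric: count turns containing a self-check (constraint script, test run, output re-read) over total answer-producing turns.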
Cross-cutting: exploration quality vs wandering
“Wandering” pattern (as described)
- high turns, low capability usage, low progress.
Operationalization:
- progress-per-turn = (tool-informative actions) / turns
- ratio: (turns) / (files opened + commands executed + tests run)
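The wandering ratio above can be scripted directly; a minimal sketch, where the flagging threshold is an illustrative assumption rather than a value from the notes:

```python
# Sketch: turns per informative action; high values suggest wandering.
def wandering_score(turns, files_opened, commands_run, tests_run):
    informative = files_opened + commands_run + tests_run
    return turns / informative if informative else float("inf")

# 40 turns but only 5 informative actions -> ratio 8.0, likely wandering.
print(wandering_score(turns=40, files_opened=2, commands_run=2, tests_run=1))  # 8.0
```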
Intervention levers (mechanical, not rhetorical):
- enforce I/O contract (final answer must be read from file)
- discourage natural-language “thinking prints”
- early stopping + penalize non-submission
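The I/O-contract lever is mechanical to enforce on the harness side: score only what the agent wrote to an agreed file, and treat a missing file as non-submission. A minimal sketch; the filename convention is an assumption:

```python
# Sketch: enforce "final answer must be read from file".
# "answer.txt" is an illustrative convention, not specified in the notes.
from pathlib import Path

def collect_submission(workdir, fname="answer.txt"):
    p = Path(workdir) / fname
    if not p.exists():
        return None  # scored as non-submission (penalized)
    return p.read_text().strip()
```

This makes "thinking prints" in the transcript irrelevant to scoring, which is the point: the contract rewards producing an artifact, not narrating one.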
Δ metric (agentic lift)
Definition:
- Δ = score(sandbox mode) − score(vanilla mode)
Interpretation:
- positive: model converts environment affordances into task performance
- negative: model incurs interaction overhead without effective tool use
Extensions:
- Δ_by_domain, Δ_by_task_type, Δ_by_context_length
- correlate Δ with capability-usage-rates to diagnose bottlenecks
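The breakdowns are a one-pass aggregation over paired results. A sketch assuming a hypothetical record schema (one row per task with scores under both modes):

```python
# Sketch: per-domain agentic lift, Δ = sandbox score - vanilla score.
# The record schema is an illustrative assumption.
from collections import defaultdict

def delta_by_domain(results):
    """results: list of {"domain", "sandbox_score", "vanilla_score"}."""
    buckets = defaultdict(list)
    for r in results:
        buckets[r["domain"]].append(r["sandbox_score"] - r["vanilla_score"])
    return {d: sum(v) / len(v) for d, v in buckets.items()}

rows = [
    {"domain": "math", "sandbox_score": 0.8, "vanilla_score": 0.6},
    {"domain": "math", "sandbox_score": 0.7, "vanilla_score": 0.7},
    {"domain": "swe",  "sandbox_score": 0.4, "vanilla_score": 0.5},
]
print(delta_by_domain(rows))  # math ≈ +0.1, swe ≈ -0.1
```

Δ_by_task_type and Δ_by_context_length are the same aggregation keyed on a different field; correlating Δ with capability-usage-rates then just joins these tables on task id.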
Evaluation grid implied by the paper
Axes:
- Domains: math/physics/chem/biomed/long-context/IF/SWE
- Input placement: prompt vs sandbox files
- Model class: strong agentic vs weak
- Resource regime: network on/off, package installs allowed/blocked
- Turn budget and token budget
Outputs:
- accuracy/task score
- tokens (prompt + model + env)
- latency/throughput
- tool-use traces (capability rates, TTFRS, failure taxonomy)
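The grid itself is just a cross-product of the axes. A sketch with a subset of the axes listed above (axis names and values are taken from the list; the dict layout is an assumption):

```python
# Sketch: enumerate evaluation-grid cells as the cross-product of axes.
from itertools import product

AXES = {
    "domain": ["math", "physics", "chem", "biomed", "long-context", "IF", "SWE"],
    "input_placement": ["prompt", "sandbox_files"],
    "network": ["on", "off"],
    "installs": ["allowed", "blocked"],
}

configs = [dict(zip(AXES, vals)) for vals in product(*AXES.values())]
print(len(configs))  # 7 * 2 * 2 * 2 = 56 cells
```

Turn and token budgets extend the product the same way; in practice one sweeps a small budget ladder per cell rather than the full cross-product.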