Canvas 2 — Capability taxonomy + evaluation hooks

Capability taxonomy (paper’s three meta-capabilities)

1) External resource access

Concrete actions:

  • apt-get install ...
  • pip install ...
  • curl/wget/requests.get(...)
  • scraping via BeautifulSoup, automation via selenium

Primary value:

  • extends knowledge/ability surface area beyond base model + base image.

Failure modes:

  • dependency conflicts, network flakiness, long install times, supply-chain risk.

Signals/metrics:

  • external-usage-rate = (turns with network/package ops) / (total turns)
  • unique domains contacted, install success rate, cache hit rate
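The rate above can be sketched as follows; the per-turn trace format (a list of action-name sets per turn) and the set of "external" operations are assumptions for illustration, not from the paper:

```python
# Hypothetical per-turn action log; real logs would come from sandbox traces.
EXTERNAL_OPS = {"apt-get", "pip", "curl", "wget", "requests.get"}

def external_usage_rate(turns):
    """Fraction of turns containing at least one network/package operation."""
    if not turns:
        return 0.0
    hits = sum(1 for actions in turns if EXTERNAL_OPS & set(actions))
    return hits / len(turns)

trace = [
    ["ls"],
    ["pip", "python"],
    ["curl"],
    ["python"],
]
print(external_usage_rate(trace))  # 2 of 4 turns -> 0.5
```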

2) File management

Concrete actions:

  • ls, find, cat, grep, sed, head/tail
  • Python I/O: open, json.load, pd.read_csv
  • path ops: pathlib, glob

Primary value:

  • scalable long-context handling (search/select rather than ingest all),
  • persistence across turns (scratchpad as files),
  • deterministic intermediate artifacts.

Failure modes:

  • poor retrieval strategy (grep the wrong thing),
  • hallucinated filenames/paths,
  • brittle parsing.

Signals/metrics:

  • file-usage-rate
  • “document coverage”: which files were touched vs available
  • time-to-first-relevant-snippet (TTFRS) proxy via command logs
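"Document coverage" can be computed directly from command logs and a directory listing; the file names here are made up for illustration:

```python
def document_coverage(touched, available):
    """Fraction of available sandbox files the agent actually opened.

    touched:   files seen in command logs (cat/grep/open/...)
    available: files present in the sandbox
    """
    available = set(available)
    if not available:
        return 0.0
    return len(set(touched) & available) / len(available)

print(document_coverage(
    {"a.csv", "b.txt"},
    {"a.csv", "b.txt", "c.md", "d.json"},
))  # 2 of 4 files -> 0.5
```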

3) Code execution (computation + verification)

Concrete actions:

  • Python scripts, numerical solvers, loops, simulation
  • constraint checking (length, overlap, formatting)
  • generating actual deliverables (.png/.html/.wav/.mp4)

Primary value:

  • verifiable compute, exact constraints, exhaustive search when needed.

Failure modes:

  • overfitting to brute-force, runaway compute, tool misuse.

Signals/metrics:

  • compute-usage-rate
  • pass@k across re-runs with different seeds
  • verification coverage: how often the model checks its own output
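For pass@k across re-runs, one option is the standard unbiased estimator from the code-generation literature (n samples, c correct, choose k); applying it to agent re-runs with different seeds is an assumption, not something the paper specifies:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k draws
    (without replacement) from n samples with c correct is correct."""
    if n - c < k:
        return 1.0  # fewer failures than draws: a success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 3, 1))  # 1 - 7/10 = 0.3
```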

Cross-cutting: exploration quality vs wandering

“Wandering” pattern (as described)

  • high turns, low capability usage, low progress.

Operationalization:

  • progress-per-turn = (tool-informative actions) / (total turns)
  • wandering ratio: (turns) / (files opened + commands executed + tests run) — higher means more wandering
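The operationalization above can be sketched as a simple detector; the threshold and what counts as a "tool-informative action" are hypothetical knobs, not values from the paper:

```python
def progress_per_turn(informative_actions, turns):
    """(tool-informative actions) / turns; 0.0 for an empty trajectory."""
    return informative_actions / turns if turns else 0.0

def is_wandering(turns, files_opened, commands, tests_run, threshold=3.0):
    """Flag trajectories with many turns but few concrete actions.

    threshold is an illustrative cutoff on turns per action.
    """
    actions = files_opened + commands + tests_run
    if actions == 0:
        return turns > 0  # turns spent with no concrete action at all
    return turns / actions > threshold

print(is_wandering(turns=40, files_opened=2, commands=5, tests_run=1))  # 40/8 = 5 > 3 -> True
print(is_wandering(turns=10, files_opened=4, commands=6, tests_run=2))  # 10/12 < 3 -> False
```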

Intervention levers (mechanical, not rhetorical):

  • enforce I/O contract (final answer must be read from file)
  • discourage natural-language “thinking prints”
  • early stopping + penalize non-submission

Δ metric (agentic lift)

Definition:

  • Δ = score(sandbox mode) − score(vanilla mode)

Interpretation:

  • positive: model converts environment affordances into task performance
  • negative: model incurs interaction overhead without effective tool use

Extensions:

  • Δ_by_domain, Δ_by_task_type, Δ_by_context_length
  • correlate Δ with capability-usage-rates to diagnose bottlenecks
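The Δ metric and its per-group extensions are straightforward to compute from paired runs; the record format and the scores below are illustrative, not real results:

```python
def delta(sandbox_score, vanilla_score):
    """Agentic lift: score(sandbox mode) - score(vanilla mode)."""
    return sandbox_score - vanilla_score

def delta_by_group(records):
    """Mean Δ per group (domain, task type, context-length bucket, ...).

    records: iterable of (group, sandbox_score, vanilla_score) tuples.
    """
    totals = {}
    for group, s, v in records:
        acc = totals.setdefault(group, [0.0, 0])
        acc[0] += s - v
        acc[1] += 1
    return {g: total / n for g, (total, n) in totals.items()}

runs = [
    ("math", 0.8, 0.6),  # made-up paired scores
    ("math", 0.7, 0.7),
    ("swe", 0.5, 0.6),
]
print(delta_by_group(runs))  # math: ~+0.1 (positive lift), swe: ~-0.1 (overhead)
```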

Evaluation grid implied by the paper

Axes:

  1. Domains: math/physics/chem/biomed/long-context/instruction following (IF)/SWE
  2. Input placement: prompt vs sandbox files
  3. Model class: strong agentic vs weak
  4. Resource regime: network on/off, package installs allowed/blocked
  5. Turn budget and token budget
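The five axes enumerate as a full factorial grid; the axis values mirror the list above, while the budget pairs are illustrative placeholders:

```python
from itertools import product

DOMAINS = ["math", "physics", "chem", "biomed", "long-context", "IF", "SWE"]
PLACEMENT = ["prompt", "sandbox-files"]
MODEL_CLASS = ["strong-agentic", "weak"]
NETWORK = ["on", "off"]
INSTALLS = ["allowed", "blocked"]
BUDGETS = [(10, 32_000), (50, 128_000)]  # hypothetical (turn, token) budgets

grid = list(product(DOMAINS, PLACEMENT, MODEL_CLASS, NETWORK, INSTALLS, BUDGETS))
print(len(grid))  # 7 * 2 * 2 * 2 * 2 * 2 = 224 conditions
```

In practice the grid is usually pruned (e.g. installs-allowed with network off is degenerate unless a local package cache exists), but enumerating it first makes the pruning explicit.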

Outputs:

  • accuracy/task score
  • tokens (prompt + model + env)
  • latency/throughput
  • tool-use traces (capability rates, TTFRS, failure taxonomy)