Bundle — LLM-in-Sandbox analysis canvases
Canvas 1 — Paper map (LLM-in-Sandbox)
One-sentence thesis
Granting an LLM access to a general-purpose code sandbox (terminal + files + code execution + network) can measurably unlock broader task competence beyond coding, and can be reinforced via RL using non-agentic data.
Core problem
Text-only generation forces the model to:
- “simulate” computation instead of verifying it,
- carry long context inside the prompt window,
- rely on fixed tools rather than acquiring new ones.
LLM-in-Sandbox reframes the model as an operator inside a virtual computer.
The proposed paradigm
Sandbox = three meta-capabilities
- External resource access (network, package install)
- File management (persistent storage, search, parsing)
- Code execution (compute, verification, formatting constraints)
Minimal tool interface (as presented)
- `execute_bash`: run terminal commands
- `str_replace_editor`: create/view/edit files
- `submit`: end the episode and return output (e.g., `/testbed/answer.txt`)
Workflow (ReAct-style loop)
- iterate: generate action → execute in sandbox → observe → decide next action
- output is read from a file path to separate “work” from “final”.
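The loop above can be sketched in a few lines. This is an illustrative stub, not the paper's implementation: the action schema, the `run_episode` helper, and the scripted policy are all assumptions; only the shape (generate → execute → observe → submit-from-file) comes from the text.

```python
# Minimal sketch of the ReAct-style sandbox loop.
# Action schema and helper names are illustrative assumptions.
import subprocess

def execute_bash(cmd: str, timeout: int = 30) -> str:
    """Run a terminal command in the sandbox and return its combined output."""
    result = subprocess.run(cmd, shell=True, capture_output=True,
                            text=True, timeout=timeout)
    return result.stdout + result.stderr

def run_episode(policy, max_turns: int = 20) -> str:
    """Iterate: generate action -> execute in sandbox -> observe -> repeat."""
    observation = "task started"
    for _ in range(max_turns):
        action = policy(observation)  # model picks the next tool call
        if action["tool"] == "submit":
            # Final answer is read from a file, separating "work" from "final".
            with open(action["path"]) as f:
                return f.read().strip()
        observation = execute_bash(action["command"])
    return ""  # turn budget exhausted without submission
```

A scripted policy that writes `/tmp/answer_demo.txt` and then submits it exercises the whole loop without a real model.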
Claimed empirical findings (structure-level, not re-verified here)
Training-free (strong models)
- Many strong models spontaneously:
- install missing domain tools,
- offload long context to files and search it via `grep`/`sed`,
- write scripts to satisfy strict constraints.
Why long-context improves
- Store documents in sandbox files instead of prompt; use filesystem search + targeted reads.
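A minimal sketch of this store-then-search pattern, assuming an illustrative root path and helper names (the agent would do the same thing with `grep` in the shell):

```python
# Sketch: offload long context to files, then retrieve by search instead of
# ingesting everything into the prompt. Paths and helpers are illustrative.
from pathlib import Path

def store_documents(docs: dict, root: str = "/tmp/testbed/documents") -> Path:
    """Write each document to its own file under the sandbox root."""
    base = Path(root)
    base.mkdir(parents=True, exist_ok=True)
    for name, text in docs.items():
        (base / name).write_text(text)
    return base

def grep(base: Path, needle: str) -> list[str]:
    """Return matching lines in grep's file:line:text format."""
    hits = []
    for path in sorted(base.glob("*.txt")):
        for i, line in enumerate(path.read_text().splitlines()):
            if needle in line:
                hits.append(f"{path.name}:{i + 1}:{line}")
    return hits
```

The prompt then only needs the pointer (`/tmp/testbed/documents`) plus the search results, not the documents themselves.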
Failure mode (weak models)
- “Wandering” behavior:
- many turns,
- low effective tool usage,
- worse results than direct generation.
LLM-in-Sandbox-RL (post-training)
Goal
Train exploration skill without agentic/SWE datasets.
How
- Use general “context-based tasks”
- Put the context into `/testbed/documents/` (not in the prompt)
- Reward outcome correctness only (trajectory-level reward)
Claimed generalization
- Gains across multiple non-code domains; improvements also transfer back to vanilla LLM mode (more structure + more self-verification signals).
Efficiency claims (system-level)
- Long-context: token savings by moving context into files.
- Environment output is ingested as prefill (fast) rather than decoded token-by-token (slow), improving throughput in some setups.
- Lightweight shared image reduces storage overhead vs per-task images.
What’s novel vs “tool use”
- A general computer substrate, not a curated tool API set.
- Exploration is part of inference, not only training-time.
- Training leverages general data, but forces tool interaction by relocating context into the environment.
- Δ = (sandbox mode) − (LLM mode) used as an “agentic lift” metric.
Key design primitives implicitly introduced
- File-based I/O contract (`/testbed/input`, `/testbed/output`)
- Separation of workspace and final answer
- Meta-tool dominance (shell as a universal capability)
- Runtime tool acquisition (install on demand)
Canvas 2 — Capability taxonomy + evaluation hooks
Capability taxonomy (paper’s three meta-capabilities)
1) External resource access
Concrete actions:
- `apt-get install ...`, `pip install ...`
- `curl`/`wget`/`requests.get(...)`
- scraping via `BeautifulSoup`, automation via `selenium`
Primary value:
- extends knowledge/ability surface area beyond base model + base image.
Failure modes:
- dependency conflicts, network flakiness, long install times, supply-chain risk.
Signals/metrics:
- external-usage-rate = turns with network/package ops / total turns
- unique-domains contacted, install success rate, cache hit rate
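The external-usage-rate signal can be computed directly from a logged command trace. A sketch, assuming a simple list-of-commands log format and a stand-in regex classifier (neither is from the paper):

```python
# Sketch: external-usage-rate from a shell-command trace.
# The trace format and the regex heuristic are assumptions.
import re

EXTERNAL = re.compile(r"\b(apt-get|pip install|curl|wget|requests\.get)\b")

def external_usage_rate(commands: list[str]) -> float:
    """Fraction of turns that install packages or touch the network."""
    if not commands:
        return 0.0
    return sum(bool(EXTERNAL.search(c)) for c in commands) / len(commands)
```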
2) File management
Concrete actions:
- `ls`, `find`, `cat`, `grep`, `sed`, `head`/`tail`
- Python I/O: `open`, `json.load`, `pd.read_csv`
- path ops: `pathlib`, `glob`
Primary value:
- scalable long-context handling (search/select rather than ingest all),
- persistence across turns (scratchpad as files),
- deterministic intermediate artifacts.
Failure modes:
- poor retrieval strategy (grep the wrong thing),
- hallucinated filenames/paths,
- brittle parsing.
Signals/metrics:
- file-usage-rate
- “document coverage”: which files were touched vs available
- time-to-first-relevant-snippet (TTFRS) proxy via command logs
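Two of these signals reduce to small functions over the logged trace. A sketch, assuming touched-file extraction happens upstream and that "relevant" is known from task metadata:

```python
# Sketch: document coverage and a TTFRS proxy over a command trace.
# Input shapes are assumptions, not the paper's logging format.
def document_coverage(touched: set[str], available: set[str]) -> float:
    """Fraction of available files the trajectory actually touched."""
    if not available:
        return 0.0
    return len(touched & available) / len(available)

def time_to_first_relevant(commands: list[str], relevant_file: str):
    """TTFRS proxy: 1-based index of the first command naming the file."""
    for i, cmd in enumerate(commands, start=1):
        if relevant_file in cmd:
            return i
    return None  # never touched
```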
3) Code execution (computation + verification)
Concrete actions:
- Python scripts, numerical solvers, loops, simulation
- constraint checking (length, overlap, formatting)
- generating actual deliverables (.png/.html/.wav/.mp4)
Primary value:
- verifiable compute, exact constraints, exhaustive search when needed.
Failure modes:
- overfitting to brute-force, runaway compute, tool misuse.
Signals/metrics:
- compute-usage-rate
- pass@k across re-runs with different seeds
- verification coverage: how often the model checks its own output
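For the pass@k signal, the standard unbiased estimator (popularized by the Codex evaluation methodology, not specific to this paper) applies directly to re-runs with different seeds:

```python
# Standard unbiased pass@k estimator: probability that at least one of k
# samples drawn from n runs (c of them correct) passes.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = total runs, c = correct runs, k = sample budget."""
    if n - c < k:
        return 1.0  # every size-k sample must contain a correct run
    return 1.0 - comb(n - c, k) / comb(n, k)
```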
Cross-cutting: exploration quality vs wandering
“Wandering” pattern (as described)
- high turns, low capability usage, low progress.
Operationalization:
- progress-per-turn = (tool-informative actions) / turns
- ratio: (turns) / (files opened + commands executed + tests run)
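A wandering detector built on these ratios might look like the following sketch; the action schema, the set of "informative" action kinds, and both thresholds are stand-in assumptions:

```python
# Sketch: flag "wandering" trajectories (many turns, little effective tool use).
# The action schema and thresholds are illustrative assumptions.
def progress_per_turn(actions: list[dict]) -> float:
    """Fraction of turns that are tool-informative actions."""
    if not actions:
        return 0.0
    informative = sum(1 for a in actions
                      if a.get("kind") in {"open_file", "run_code", "run_test"})
    return informative / len(actions)

def is_wandering(actions: list[dict],
                 threshold: float = 0.3, min_turns: int = 10) -> bool:
    """Long trajectory + low progress-per-turn => wandering."""
    return len(actions) >= min_turns and progress_per_turn(actions) < threshold
```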
Intervention levers (mechanical, not rhetorical):
- enforce I/O contract (final answer must be read from file)
- discourage natural-language “thinking prints”
- early stopping + penalize non-submission
Δ metric (agentic lift)
Definition:
- Δ = score(sandbox mode) − score(vanilla mode)
Interpretation:
- positive: model converts environment affordances into task performance
- negative: model incurs interaction overhead without effective tool use
Extensions:
- Δ_by_domain, Δ_by_task_type, Δ_by_context_length
- correlate Δ with capability-usage-rates to diagnose bottlenecks
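The Δ metric and its per-domain extension reduce to a small aggregation; the record schema below is an assumed shape, not the paper's:

```python
# Sketch: agentic lift (Δ = sandbox - vanilla) with a per-domain breakdown.
# The record schema is an illustrative assumption.
from collections import defaultdict

def agentic_lift(records: list[dict]) -> dict[str, float]:
    """records: [{"domain": str, "sandbox": float, "vanilla": float}, ...]"""
    by_domain = defaultdict(list)
    for r in records:
        by_domain[r["domain"]].append(r["sandbox"] - r["vanilla"])
    return {d: sum(deltas) / len(deltas) for d, deltas in by_domain.items()}
```

A negative per-domain value diagnoses exactly the case above: interaction overhead without effective tool use.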
Evaluation grid implied by the paper
Axes:
- Domains: math/physics/chem/biomed/long-context/IF/SWE
- Input placement: prompt vs sandbox files
- Model class: strong agentic vs weak
- Resource regime: network on/off, package installs allowed/blocked
- Turn budget and token budget
Outputs:
- accuracy/task score
- tokens (prompt + model + env)
- latency/throughput
- tool-use traces (capability rates, TTFRS, failure taxonomy)
Canvas 3 — Impact on agentic system design (architecture implications)
1) Inference architecture shifts
From: “LLM produces answer”
To: “LLM controls a computer substrate”
Required components:
- Sandbox provisioner: ephemeral container per task or per session
- I/O contract: fixed directories + final answer file extraction
- Tool gateway: minimal, universal tools (shell + file editor + submit)
- Trajectory store: persist action/observation logs for debugging + training
Design consequence:
- the environment becomes the long-context buffer, scratchpad, and verifier.
2) Context handling: prompt budget → filesystem budget
Mechanism:
- keep prompt thin (instructions + pointers)
- place large context as files
- rely on search + selective reads
System implications:
- deterministic retrieval primitives (`grep`, `ripgrep`, `sqlite`, embeddings)
- caching of preprocessed indices for repeated queries
- document chunking as first-class preprocessing, not prompt engineering
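A sketch of the "deterministic retrieval + cached index" pair: a whitespace-token inverted index built once and reloaded from a JSON cache on repeated queries. The index format and cache location are illustrative choices, not prescribed by the paper.

```python
# Sketch: a cached inverted index so repeated queries avoid rescanning files.
# Tokenization (lowercase whitespace split) and cache format are assumptions.
import json
from pathlib import Path

def build_index(root: Path) -> dict[str, list[str]]:
    """Map each token to the list of files containing it."""
    index: dict[str, list[str]] = {}
    for path in sorted(root.glob("**/*.txt")):
        for token in set(path.read_text().lower().split()):
            index.setdefault(token, []).append(str(path))
    return index

def load_or_build(root: Path, cache: Path) -> dict[str, list[str]]:
    """Reuse the cached index when present; build and persist it otherwise."""
    if cache.exists():
        return json.loads(cache.read_text())
    index = build_index(root)
    cache.write_text(json.dumps(index))
    return index
```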
3) Tooling philosophy: “meta-tools” dominate
The shell is a universal API surface:
- install capabilities at runtime
- run domain software
- compose pipelines (grep → parse → compute → format)
System implications:
- fewer bespoke tool integrations
- higher need for policy (what is allowed) and guardrails (quotas)
4) Training pipeline implications (sandbox-native post-training)
Key idea:
- train exploration using general tasks by relocating context into the sandbox
System implications:
- training infrastructure must run many sandboxes concurrently
- reward functions remain outcome-based; no need for step-level labels
- logs become training data (trajectory replay, tool-use diagnostics)
Operational consequence:
- “agentic competence” becomes a trainable, transferable skill layer.
5) Efficiency model changes
Token accounting:
- prompt tokens down (files instead)
- multi-turn overhead up
- environment output tokens mostly “prefill”, sometimes cheaper than decoding
Engineering implications:
- throughput depends on:
- ratio of env tokens vs decoded tokens,
- turn count,
- command latency (install/network).
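This throughput model can be written down as a back-of-envelope latency formula per turn. The token rates below are illustrative placeholders, not measurements from the paper:

```python
# Sketch: per-turn latency under the prefill-vs-decode split.
# prefill_tps / decode_tps are illustrative placeholder rates.
def turn_latency(env_tokens: int, decoded_tokens: int,
                 prefill_tps: float = 5000.0, decode_tps: float = 50.0,
                 command_latency_s: float = 0.0) -> float:
    """Environment output is ingested at prefill speed; model output is decoded."""
    return (env_tokens / prefill_tps
            + decoded_tokens / decode_tps
            + command_latency_s)
```

The model makes the engineering point concrete: a turn that emits 5,000 environment tokens costs about as much as decoding 50 model tokens, so shifting tokens from "decoded" to "environment" is the dominant lever.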
Optimization surface:
- parallelize safe environment steps (e.g., pre-index files)
- package caching and pinned dependency layers
- constrain tool output size; enforce truncation strategies
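A common truncation strategy keeps the head and tail of oversized tool output and marks the elision; the limits below are arbitrary illustrative defaults:

```python
# Sketch: head+tail truncation of oversized tool output.
# max_chars/head defaults are arbitrary illustrative limits.
def truncate_output(text: str, max_chars: int = 4000, head: int = 2000) -> str:
    """Keep the start and end of long output, marking how much was dropped."""
    if len(text) <= max_chars:
        return text
    tail = max_chars - head
    omitted = len(text) - head - tail
    return text[:head] + f"\n...[{omitted} chars truncated]...\n" + text[-tail:]
```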
6) Reliability + safety envelope expands
New risk classes introduced by “computer access”:
- network exfiltration and data leakage
- supply-chain attacks via installs
- prompt injection via local files or web content
- resource exhaustion (CPU/RAM/disk), fork bombs, infinite loops
Hard controls required:
- network egress policy (allowlist/denylist), DNS control
- CPU/memory/time quotas; syscall filtering
- read-only mounts for sensitive areas; restrict host integration
- package install policy: mirror + hashes, pinned versions
Auditing requirements:
- immutable logs of commands, files created/modified, outbound requests
- reproducible runs (snapshot image + dependency lockfiles)
7) Product-level capability expansion (beyond text)
Sandbox makes file artifacts first-class outputs:
- html dashboards, posters, charts, audio, video, datasets, code repos
System design implication:
- outputs are “deliverables”, not prose.
- evaluation can become artifact-based (render/execute/test).
8) Concrete blueprint (minimal viable agentic stack)
Control plane
- Task router → sandbox allocator → run loop controller → artifact collector
Data plane
- Container image + runtime dependency cache
- `/input`, `/documents`, `/output` directory contract
- log sink (actions, observations, timestamps, resource metrics)
Policy plane
- capability toggles (network on/off; install allowed/blocked)
- quotas (turns, tokens, wall time, CPU, RAM, disk)
- content filters for outbound requests and sensitive file access
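The policy plane can be made explicit as a per-tier configuration object. All field names, limits, and tier definitions below are illustrative assumptions; enforcement would live in the sandbox provisioner:

```python
# Sketch: explicit per-tier sandbox policy. Field names, defaults, and tiers
# are illustrative assumptions; enforcement belongs to the provisioner.
from dataclasses import dataclass

@dataclass(frozen=True)
class SandboxPolicy:
    network: str = "off"          # "off" | "allowlist" | "full"
    installs_allowed: bool = False
    max_turns: int = 30
    max_wall_time_s: int = 600
    max_cpu_s: int = 120
    max_mem_mb: int = 2048
    max_disk_mb: int = 1024
    egress_allowlist: tuple = ()

OFFLINE = SandboxPolicy()
RESTRICTED = SandboxPolicy(network="allowlist", installs_allowed=True,
                           egress_allowlist=("pypi.org",
                                             "files.pythonhosted.org"))
```

Making the policy a frozen value object keeps deployment tiers auditable: a trajectory log can record exactly which policy it ran under.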
Outcome:
- a general agent runtime where “tools” are emergent from the OS substrate.
9) What changes in “agentic system design” immediately
- Treat filesystem + terminal as the default tool substrate.
- Treat context placement as a systems decision, not a prompting decision.
- Add first-class observability: capability-use rates + wandering detectors.
- Make sandbox policy explicit per deployment tier (offline, restricted net, full net).