Canvas 1 — Paper map (LLM-in-Sandbox)

One-sentence thesis

Granting an LLM access to a general-purpose code sandbox (terminal + files + code execution + network) can measurably unlock broader task competence beyond coding, and can be reinforced via RL using non-agentic data.


Core problem

Text-only generation forces the model to:

  • “simulate” computation instead of verifying it,
  • carry long context inside the prompt window,
  • rely on fixed tools rather than acquiring new ones.

LLM-in-Sandbox reframes the model as an operator inside a virtual computer.


The proposed paradigm

Sandbox = three meta-capabilities

  1. External resource access (network, package install)
  2. File management (persistent storage, search, parsing)
  3. Code execution (compute, verification, formatting constraints)

Minimal tool interface (as presented)

  • execute_bash: run terminal commands
  • str_replace_editor: create/view/edit files
  • submit: end and return output (e.g., /testbed/answer.txt)
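The three tools above can be sketched as a function-calling schema. Only the tool names (execute_bash, str_replace_editor, submit) come from the paper; the field layout and parameter names are illustrative assumptions:

```python
# Hypothetical function-calling schema for the three sandbox tools.
# Tool names are from the paper; the parameter layout is an assumption.
TOOLS = [
    {
        "name": "execute_bash",
        "description": "Run a terminal command inside the sandbox.",
        "parameters": {"command": "string"},
    },
    {
        "name": "str_replace_editor",
        "description": "Create, view, or edit files in the sandbox.",
        "parameters": {
            "command": "string",   # e.g. create / view / str_replace
            "path": "string",
            "old_str": "string",
            "new_str": "string",
        },
    },
    {
        "name": "submit",
        "description": "End the episode; the answer is read from a file path.",
        "parameters": {"path": "string"},  # e.g. /testbed/answer.txt
    },
]
```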

Workflow (ReAct-style loop)

  • iterate: generate action → execute in sandbox → observe → decide next action
  • final output is read from a file path, separating scratch “work” from the “final” answer.
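The loop above can be sketched as follows. The `model` and `sandbox` objects are hypothetical interfaces, not the paper's implementation:

```python
# Minimal ReAct-style control loop (sketch under assumed interfaces).
def run_episode(model, sandbox, task, max_turns=30):
    history = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        action = model.next_action(history)      # generate action
        if action["name"] == "submit":
            # Final answer is read from a file, separating work from output.
            return sandbox.read_file(action["args"]["path"])
        observation = sandbox.execute(action)    # execute in sandbox
        history.append({"role": "tool", "content": observation})  # observe
    return None  # turn budget exhausted without a submission
```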

Claimed empirical findings (structure-level, not re-verified here)

Training-free (strong models)

  • Many strong models spontaneously:
    • install missing domain tools,
    • offload long context to files and search via grep/sed,
    • write scripts to satisfy strict constraints.

Why long-context improves

  • Store documents in sandbox files instead of prompt; use filesystem search + targeted reads.
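A minimal sketch of this offloading pattern (the helper name and layout are assumptions; the grep-based targeted read mirrors the behavior described above):

```python
import pathlib
import subprocess

# Sketch: offload long context to sandbox files, then do targeted reads
# via filesystem search instead of holding everything in the prompt.
def offload_and_search(documents, query, workdir):
    docs = pathlib.Path(workdir) / "documents"
    docs.mkdir(parents=True, exist_ok=True)
    for name, text in documents.items():
        (docs / name).write_text(text)           # context lives on disk
    # grep returns only matching lines -- the "targeted read".
    result = subprocess.run(
        ["grep", "-r", "-i", query, str(docs)],
        capture_output=True, text=True,
    )
    return result.stdout.splitlines()
```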

Failure mode (weak models)

  • “Wandering” behavior:
    • many turns,
    • low effective tool usage,
    • worse results than direct generation.

LLM-in-Sandbox-RL (post-training)

Goal

Train exploration skills without agentic/SWE datasets.

How

  • Use general “context-based tasks”
  • Put the context into /testbed/documents/ (not in prompt)
  • Reward outcome correctness only (trajectory-level reward)
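The setup above can be sketched as follows. The /testbed/documents/ path comes from the paper; the task dict fields, helper names, and exact-match reward are illustrative assumptions:

```python
import pathlib

# Sketch: relocate a task's context into the sandbox filesystem so the
# model must use tools, and reward only final-answer correctness.
def prepare_task(task, sandbox_root):
    docs = pathlib.Path(sandbox_root) / "testbed" / "documents"
    docs.mkdir(parents=True, exist_ok=True)
    for i, doc in enumerate(task["context_docs"]):
        (docs / f"doc_{i}.txt").write_text(doc)  # context out of the prompt
    return task["question"]   # the prompt carries only the question

def trajectory_reward(final_answer, gold):
    # Outcome-only, trajectory-level reward: no per-step shaping.
    return 1.0 if final_answer.strip() == gold.strip() else 0.0
```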

Claimed generalization

  • Gains across multiple non-code domains, which also transfer back to vanilla LLM mode (more structured outputs + more self-verification signals).

Efficiency claims (system-level)

  • Long-context: token savings by moving context into files.
  • Environment observations are ingested as prefill tokens (fast) rather than decoded tokens (slow), improving throughput in some setups.
  • Lightweight shared image reduces storage overhead vs per-task images.
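A back-of-envelope model of the prefill/decode claim. The throughput numbers and token budgets are illustrative assumptions, not measurements from the paper:

```python
# Prefill tokens are processed in parallel (fast); decoded tokens are
# generated one at a time (slow). Rates below are hypothetical.
def wall_time(prefill_tokens, decode_tokens,
              prefill_tok_per_s=5000.0, decode_tok_per_s=50.0):
    return prefill_tokens / prefill_tok_per_s + decode_tokens / decode_tok_per_s

# Same 12k-token budget; sandbox mode shifts environment output into prefill.
llm_mode = wall_time(prefill_tokens=2_000, decode_tokens=10_000)
sandbox_mode = wall_time(prefill_tokens=10_000, decode_tokens=2_000)
```

Under these assumed rates, shifting most tokens from decode to prefill cuts wall time substantially even though the total token count is unchanged.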

What’s novel vs “tool use”

  1. A general computer substrate, not a curated tool API set.
  2. Exploration happens at inference time, not only during training.
  3. Training leverages general data, but forces tool interaction by relocating context into the environment.
  4. Δ = score(sandbox mode) − score(LLM mode) is used as an “agentic lift” metric.
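The Δ metric in item 4 reduces to a score difference on the same benchmark. Mean aggregation here is an assumption:

```python
# "Agentic lift": score in sandbox mode minus score in plain LLM mode,
# evaluated on the same task set. Aggregation by mean is an assumption.
def agentic_lift(sandbox_scores, llm_scores):
    mean = lambda xs: sum(xs) / len(xs)
    return mean(sandbox_scores) - mean(llm_scores)
```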

Key design primitives implicitly introduced

  • File-based I/O contract (/testbed/input, /testbed/output)
  • Separation of workspace and final answer
  • Meta-tool dominance (shell as a universal capability)
  • Runtime tool acquisition (install on demand)
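The file-based I/O contract can be sketched as follows. The /testbed/input and /testbed/output paths come from the primitives above; the helper names are assumptions:

```python
import pathlib

# Sketch of the file-based I/O contract: task input and final answer live
# at fixed paths, keeping the scratch workspace separate from the output
# that graders actually read.
def read_input(root="/testbed"):
    return (pathlib.Path(root) / "input").read_text()

def write_output(answer, root="/testbed"):
    out = pathlib.Path(root) / "output"
    out.write_text(answer)   # only this path is read as the final answer
    return out
```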