Canvas 1 — Paper map (LLM-in-Sandbox)
One-sentence thesis
Granting an LLM access to a general-purpose code sandbox (terminal + files + code execution + network) can measurably unlock broader task competence beyond coding, and can be reinforced via RL using non-agentic data.
Core problem
Text-only generation forces the model to:
- “simulate” computation instead of verifying it,
- carry long context inside the prompt window,
- rely on fixed tools rather than acquiring new ones.
LLM-in-Sandbox reframes the model as an operator inside a virtual computer.
The proposed paradigm
Sandbox = three meta-capabilities
- External resource access (network, package install)
- File management (persistent storage, search, parsing)
- Code execution (compute, verification, formatting constraints)
Minimal tool interface (as presented)
- execute_bash: run terminal commands
- str_replace_editor: create/view/edit files
- submit: end the episode and return the output (e.g., /testbed/answer.txt)
Workflow (ReAct-style loop)
- iterate: generate action → execute in sandbox → observe → decide next action
- output is read from a file path to separate “work” from “final”.
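The loop above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `run_episode` and the scripted `stub_policy` are made-up names, and only the `execute_bash`/`submit` tools from the interface above are modeled.

```python
import os
import subprocess
import tempfile

def run_episode(policy, workdir, max_turns=8):
    """Iterate: ask the policy for an action, run it in the sandbox, observe."""
    history = []
    for _ in range(max_turns):
        action = policy(history)
        if action["tool"] == "submit":
            # Final answer is read from a file path, separating work from output.
            with open(os.path.join(workdir, "answer.txt")) as f:
                return f.read().strip()
        if action["tool"] == "execute_bash":
            result = subprocess.run(
                action["command"], shell=True, cwd=workdir,
                capture_output=True, text=True, timeout=30,
            )
            history.append({"action": action,
                            "observation": result.stdout + result.stderr})
    return None  # ran out of turns without submitting

# Scripted stand-in for the LLM: do shell work, then submit.
def stub_policy(history):
    if not history:
        return {"tool": "execute_bash", "command": "echo $((6 * 7)) > answer.txt"}
    return {"tool": "submit"}

workdir = tempfile.mkdtemp()
print(run_episode(stub_policy, workdir))  # prints "42"
```

A real policy would condition on the accumulated observations; the stub only shows the control flow.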
Claimed empirical findings (structure-level, not re-verified here)
Training-free (strong models)
- Many strong models spontaneously:
  - install missing domain tools,
  - offload long context to files and search it via grep/sed,
  - write scripts to satisfy strict constraints.
Why long-context improves
- Store documents in sandbox files instead of prompt; use filesystem search + targeted reads.
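A minimal sketch of this offloading pattern, in pure Python rather than grep so it is self-contained; the function names (`store_documents`, `search_then_read`) are illustrative, not from the paper.

```python
import os
import tempfile

def store_documents(docs, root):
    """Write each document to its own file instead of the prompt."""
    for name, text in docs.items():
        with open(os.path.join(root, name), "w") as f:
            f.write(text)

def search_then_read(root, needle, window=1):
    """grep-like scan: return only the lines around matches (a targeted read)."""
    hits = []
    for name in sorted(os.listdir(root)):
        with open(os.path.join(root, name)) as f:
            lines = f.read().splitlines()
        for i, line in enumerate(lines):
            if needle in line:
                lo, hi = max(0, i - window), i + window + 1
                hits.append((name, "\n".join(lines[lo:hi])))
    return hits

root = tempfile.mkdtemp()
store_documents({"a.txt": "alpha\nthe secret code is 1234\nomega"}, root)
print(search_then_read(root, "secret"))
```

Only the matched window re-enters the context, which is the source of the token savings claimed later.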
Failure mode (weak models)
- “Wandering” behavior:
- many turns,
- low effective tool usage,
- worse results than direct generation.
LLM-in-Sandbox-RL (post-training)
Goal
Train exploration skill without agentic/SWE datasets.
How
- Use general "context-based tasks"
- Put the context into /testbed/documents/ (not in the prompt)
- Reward outcome correctness only (trajectory-level reward)
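The outcome-only reward can be sketched as a function of the final answer file alone, with no per-step shaping. The `normalize` helper and the exact-match criterion are assumptions for illustration; the paper's actual grading may differ per task.

```python
import os
import tempfile

def normalize(text):
    """Case- and whitespace-insensitive comparison key (an assumed choice)."""
    return " ".join(text.lower().split())

def trajectory_reward(answer_path, reference):
    """Binary trajectory-level reward: score only the submitted file."""
    try:
        with open(answer_path) as f:
            answer = f.read()
    except FileNotFoundError:
        return 0.0  # no submission -> zero reward
    return 1.0 if normalize(answer) == normalize(reference) else 0.0

# Demo: a whitespace/case-insensitive match earns full reward.
path = os.path.join(tempfile.mkdtemp(), "answer.txt")
with open(path, "w") as f:
    f.write("  Paris \n")
print(trajectory_reward(path, "paris"))  # prints 1.0
```

Because only the outcome is scored, the intermediate exploration (which files to read, which tools to install) is left entirely to the policy.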
Claimed generalization
- Gains across multiple non-code domains; improvements also transfer back to vanilla LLM mode (more structure + more self-verification signals).
Efficiency claims (system-level)
- Long-context: token savings by moving context into files.
- Environment output is ingested as prefill tokens (fast) rather than decoded tokens (slow), improving throughput in some setups.
- Lightweight shared image reduces storage overhead vs per-task images.
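A back-of-envelope model of the prefill-vs-decode claim. All rates below are made-up illustrative numbers, not measurements from the paper; the point is only that shifting tokens from decode to prefill reduces wall-clock time.

```python
# Assumed throughput rates for illustration only.
PREFILL_TOK_PER_S = 5000.0   # tokens ingested as prefill (cheap)
DECODE_TOK_PER_S = 50.0      # tokens generated by the model (expensive)

def episode_seconds(decoded_tokens, env_tokens):
    """Wall-clock cost of an episode under the two-rate model."""
    return decoded_tokens / DECODE_TOK_PER_S + env_tokens / PREFILL_TOK_PER_S

# Same total token budget, different split:
all_decoded = episode_seconds(decoded_tokens=10_000, env_tokens=0)
mostly_env = episode_seconds(decoded_tokens=2_000, env_tokens=8_000)
print(all_decoded, mostly_env)  # prints 200.0 41.6
```

Under these assumed rates, routing most tokens through the environment cuts episode time by roughly 5x, which is the shape of the throughput argument.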
What’s novel vs “tool use”
- A general computer substrate, not a curated tool API set.
- Exploration is part of inference, not only training-time.
- Training leverages general data, but forces tool interaction by relocating context into the environment.
- Δ = (sandbox mode) − (LLM mode) used as an “agentic lift” metric.
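The Δ metric above is just a per-task difference; a tiny sketch with made-up scores:

```python
def agentic_lift(sandbox_scores, llm_scores):
    """Per-task delta: (sandbox mode) - (LLM mode). Scores are illustrative."""
    return {task: sandbox_scores[task] - llm_scores[task]
            for task in sandbox_scores}

lift = agentic_lift({"math": 0.75, "qa": 0.7}, {"math": 0.5, "qa": 0.7})
print(lift)  # prints {'math': 0.25, 'qa': 0.0}
```

A positive entry means the sandbox added something beyond the base model's text-only competence on that task.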
Key design primitives implicitly introduced
- File-based I/O contract (/testbed/input, /testbed/output)
- Separation of workspace and final answer
- Meta-tool dominance (shell as a universal capability)
- Runtime tool acquisition (install on demand)
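The first two primitives can be shown together: input arrives at a fixed path, scratch work is ignored, and only the agreed output path is graded. The helper names are hypothetical, and the /testbed paths are rooted in a temp dir so the sketch runs anywhere.

```python
import os
import tempfile

def set_up_task(root, task_input):
    """Task harness side: place the input under the fixed path."""
    os.makedirs(os.path.join(root, "testbed"), exist_ok=True)
    with open(os.path.join(root, "testbed", "input"), "w") as f:
        f.write(task_input)

def collect_answer(root):
    """Grader side: only the output path counts; workspace files are ignored."""
    with open(os.path.join(root, "testbed", "output")) as f:
        return f.read()

root = tempfile.mkdtemp()
set_up_task(root, "2+2")

# Agent's turn (stubbed): scratch work plus the final answer file.
with open(os.path.join(root, "testbed", "scratch.txt"), "w") as f:
    f.write("working notes, never graded")
with open(os.path.join(root, "testbed", "output"), "w") as f:
    f.write("4")

print(collect_answer(root))  # prints "4"
```

The contract is what lets the grader stay oblivious to how the agent worked, which in turn is what makes the outcome-only reward well-defined.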