Canvas 1 — Paper map (LLM-in-Sandbox)

One-sentence thesis

Granting an LLM access to a general-purpose code sandbox (terminal + files + code execution + network) can measurably unlock broader task competence beyond coding, and can be reinforced via RL using non-agentic data.


Core problem

Text-only generation forces the model to:

  • “simulate” computation instead of verifying it,
  • carry long context inside the prompt window,
  • rely on fixed tools rather than acquiring new ones.

LLM-in-Sandbox reframes the model as an operator inside a virtual computer.


The proposed paradigm

Sandbox = three meta-capabilities

  1. External resource access (network, package install)
  2. File management (persistent storage, search, parsing)
  3. Code execution (compute, verification, formatting constraints)

Minimal tool interface (as presented)

  • execute_bash: run terminal commands
  • str_replace_editor: create/view/edit files
  • submit: end and return output (e.g., /testbed/answer.txt)
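The three tools above can be sketched as a function-calling schema. Only the tool names (execute_bash, str_replace_editor, submit) come from the paper; the field layout and parameter names are illustrative assumptions:

```python
# Hypothetical function-calling schema for the three sandbox tools.
# Tool names are from the paper; the parameter layout is an assumption.
TOOLS = [
    {
        "name": "execute_bash",
        "description": "Run a terminal command inside the sandbox.",
        "parameters": {"command": "string"},
    },
    {
        "name": "str_replace_editor",
        "description": "Create, view, or edit files in the sandbox.",
        "parameters": {
            "command": "string",   # e.g. create / view / str_replace
            "path": "string",
            "old_str": "string",
            "new_str": "string",
        },
    },
    {
        "name": "submit",
        "description": "End the episode; the answer is read from a file path.",
        "parameters": {"path": "string"},  # e.g. /testbed/answer.txt
    },
]
```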

Workflow (ReAct-style loop)

  • iterate: generate action → execute in sandbox → observe → decide next action
  • final output is read from a file path, separating scratch “work” from the “final” answer.
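The loop above can be sketched as follows. The `model` and `sandbox` objects are hypothetical interfaces, not the paper's implementation:

```python
# Minimal ReAct-style control loop (sketch under assumed interfaces).
def run_episode(model, sandbox, task, max_turns=30):
    history = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        action = model.next_action(history)      # generate action
        if action["name"] == "submit":
            # Final answer is read from a file, separating work from output.
            return sandbox.read_file(action["args"]["path"])
        observation = sandbox.execute(action)    # execute in sandbox
        history.append({"role": "tool", "content": observation})  # observe
    return None  # turn budget exhausted without a submission
```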

Claimed empirical findings (structure-level, not re-verified here)

Training-free (strong models)

  • Many strong models spontaneously:
    • install missing domain tools,
    • offload long context to files and search via grep/sed,
    • write scripts to satisfy strict constraints.

Why long-context improves

  • Store documents in sandbox files instead of prompt; use filesystem search + targeted reads.
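A minimal sketch of this offloading pattern (the helper name and layout are assumptions; the grep-based targeted read mirrors the behavior described above):

```python
import pathlib
import subprocess

# Sketch: offload long context to sandbox files, then do targeted reads
# via filesystem search instead of holding everything in the prompt.
def offload_and_search(documents, query, workdir):
    docs = pathlib.Path(workdir) / "documents"
    docs.mkdir(parents=True, exist_ok=True)
    for name, text in documents.items():
        (docs / name).write_text(text)           # context lives on disk
    # grep returns only matching lines -- the "targeted read".
    result = subprocess.run(
        ["grep", "-r", "-i", query, str(docs)],
        capture_output=True, text=True,
    )
    return result.stdout.splitlines()
```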

Failure mode (weak models)

  • “Wandering” behavior:
    • many turns,
    • low effective tool usage,
    • worse results than direct generation.

LLM-in-Sandbox-RL (post-training)

Goal

Train exploration skills without agentic/SWE datasets.

How

  • Use general “context-based tasks”
  • Put the context into /testbed/documents/ (not in prompt)
  • Reward outcome correctness only (trajectory-level reward)
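The setup above can be sketched as follows. The /testbed/documents/ path comes from the paper; the task dict fields, helper names, and exact-match reward are illustrative assumptions:

```python
import pathlib

# Sketch: relocate a task's context into the sandbox filesystem so the
# model must use tools, and reward only final-answer correctness.
def prepare_task(task, sandbox_root):
    docs = pathlib.Path(sandbox_root) / "testbed" / "documents"
    docs.mkdir(parents=True, exist_ok=True)
    for i, doc in enumerate(task["context_docs"]):
        (docs / f"doc_{i}.txt").write_text(doc)  # context out of the prompt
    return task["question"]   # the prompt carries only the question

def trajectory_reward(final_answer, gold):
    # Outcome-only, trajectory-level reward: no per-step shaping.
    return 1.0 if final_answer.strip() == gold.strip() else 0.0
```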

Claimed generalization

  • Gains across multiple non-code domains, which also transfer back to vanilla LLM mode (more structured outputs + more self-verification signals).

Efficiency claims (system-level)

  • Long-context: token savings by moving context into files.
  • Environment observations are ingested as prefill tokens (fast) rather than decoded tokens (slow), improving throughput in some setups.
  • Lightweight shared image reduces storage overhead vs per-task images.
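A back-of-envelope model of the prefill/decode claim. The throughput numbers and token budgets are illustrative assumptions, not measurements from the paper:

```python
# Prefill tokens are processed in parallel (fast); decoded tokens are
# generated one at a time (slow). Rates below are hypothetical.
def wall_time(prefill_tokens, decode_tokens,
              prefill_tok_per_s=5000.0, decode_tok_per_s=50.0):
    return prefill_tokens / prefill_tok_per_s + decode_tokens / decode_tok_per_s

# Same 12k-token budget; sandbox mode shifts environment output into prefill.
llm_mode = wall_time(prefill_tokens=2_000, decode_tokens=10_000)
sandbox_mode = wall_time(prefill_tokens=10_000, decode_tokens=2_000)
```

Under these assumed rates, shifting most tokens from decode to prefill cuts wall time substantially even though the total token count is unchanged.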

What’s novel vs “tool use”

  1. A general computer substrate, not a curated tool API set.
  2. Exploration happens at inference time, not only during training.
  3. Training leverages general data, but forces tool interaction by relocating context into the environment.
  4. Δ = score(sandbox mode) − score(LLM mode) is used as an “agentic lift” metric.
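The Δ metric in item 4 reduces to a score difference on the same benchmark. Mean aggregation here is an assumption:

```python
# "Agentic lift": score in sandbox mode minus score in plain LLM mode,
# evaluated on the same task set. Aggregation by mean is an assumption.
def agentic_lift(sandbox_scores, llm_scores):
    mean = lambda xs: sum(xs) / len(xs)
    return mean(sandbox_scores) - mean(llm_scores)
```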

Key design primitives implicitly introduced

  • File-based I/O contract (/testbed/input, /testbed/output)
  • Separation of workspace and final answer
  • Meta-tool dominance (shell as a universal capability)
  • Runtime tool acquisition (install on demand)
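The file-based I/O contract can be sketched as follows. The /testbed/input and /testbed/output paths come from the primitives above; the helper names are assumptions:

```python
import pathlib

# Sketch of the file-based I/O contract: task input and final answer live
# at fixed paths, keeping the scratch workspace separate from the output
# that graders actually read.
def read_input(root="/testbed"):
    return (pathlib.Path(root) / "input").read_text()

def write_output(answer, root="/testbed"):
    out = pathlib.Path(root) / "output"
    out.write_text(answer)   # only this path is read as the final answer
    return out
```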