Canvas 4 — Supportability of findings (LLM-in-Sandbox)

Executive summary (decision-grade)

The findings are supportable as a systems-level result: adding a general-purpose sandbox (filesystem, execution, optional network) to LLM inference improves measured task performance and enables capabilities unavailable to text-only generation. The evidence does not conclusively support claims of increased intrinsic “general intelligence” in the model itself.

Pros (supporting evidence)

  • Clear intervention: sandbox vs vanilla generation.
  • Mechanistic plausibility: execution, files, and retrieval explain observed gains.
  • Long-context handling via filesystem is directly verifiable.
  • Constraint satisfaction via code execution is objectively verifiable, unlike text-only generation.
  • Open-source sandbox implementation enables partial replication.
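The verifiability claims in the bullets above can be made concrete with a minimal sketch (illustrative only; the constraint set and task format are assumptions, not from the source): once a sandbox can execute code against the model's output, constraints that are hard to satisfy reliably in free-form text become mechanically checkable.

```python
def verify_constraints(text: str) -> dict:
    """Deterministically check a set of hypothetical output constraints.

    Each check is a plain boolean computed from the text itself, so the
    result is reproducible; no judge model is involved.
    """
    words = text.split()
    return {
        "word_count_ok": 10 <= len(words) <= 50,        # length bound
        "no_banned_term": "TODO" not in text,           # forbidden token
        "ends_with_period": text.rstrip().endswith("."),
    }

draft = "The sandbox executes code, reads files, and verifies outputs deterministically."
checks = verify_constraints(draft)
print(all(checks.values()))  # → True
```

The same pattern extends to filesystem-based long-context handling: a script can assert that a fact written to disk earlier in the session is retrieved verbatim later, which is directly checkable without human or model judgment.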

Cons (limitations and risks)

  • The comparison is not intelligence-fair: the sandbox adds compute, tools, and retrieval on top of the base model.
  • Some evaluations rely on LLM judges, introducing bias/variance.
  • Internet-enabled sandbox is a confound for knowledge benchmarks.
  • Multi-turn agents show instability and high variance.
  • Security and operational risks expand significantly.

Overall assessment

  • Supported: LLM+computer systems outperform LLM-only systems on many tasks.
  • Partially supported: RL training in the sandbox improves exploration and transfers skills across tasks.
  • Not established: claims of general intelligence increase independent of tools.

Next steps (concrete)

  1. Replicate using only open-weight models with pinned environments.
  2. Run capability ablations (no network / no exec / no filesystem).
  3. Replace LLM judges with deterministic verifiers where possible.
  4. Compare against strong tool-based baselines (RAG + calculators).
  5. Specify production sandbox policies (network, installs, quotas, logging).
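Step 3 above can be sketched as follows (a minimal illustration under assumed conditions: a numeric-answer task where exact checking is possible; the function name and task format are hypothetical, not from the source). The point is that a parse-and-compare check replaces a judge call wherever the task admits one, eliminating judge bias and variance for that slice of the evaluation.

```python
import math

def deterministic_verify(prediction: str, expected: float, tol: float = 1e-9) -> bool:
    """Replace an LLM judge with an exact numeric check where the task allows it.

    `prediction` is the model's raw answer string. Parsing failures count as
    wrong rather than being adjudicated by another model, so the metric is
    strict but fully reproducible.
    """
    try:
        value = float(prediction.strip())
    except ValueError:
        return False
    return math.isclose(value, expected, rel_tol=tol, abs_tol=tol)

print(deterministic_verify("3.14159", 3.14159))      # → True
print(deterministic_verify("approximately 3", 3.0))  # → False (unparseable)
```

Tasks that cannot be verified this way (open-ended writing, summarization) would still need judges; the recommendation is to shrink the judged fraction, not to claim it can reach zero.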