Canvas 4 — Supportability of findings (LLM-in-Sandbox)
Executive summary (decision-grade)
The findings are supportable as a systems-level result: adding a general-purpose sandbox (filesystem, execution, optional network) to LLM inference improves measured task performance and enables capabilities unavailable to text-only generation. The evidence does not conclusively support claims of increased intrinsic “general intelligence” in the model itself.
Pros (supporting evidence)
- Clear intervention: sandbox vs vanilla generation.
- Mechanistic plausibility: execution, files, and retrieval explain observed gains.
- Long-context handling via filesystem is directly verifiable.
- Constraint satisfaction via code execution is objectively checkable, and easier than with text-only generation.
- Open-source sandbox implementation enables partial replication.
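The constraint-satisfaction point above can be illustrated with a minimal sketch. The specific constraint (an exact word count) and the function name are hypothetical examples, not drawn from the evaluated system; the point is that executed code turns a hard output constraint into an objective pass/fail check, whereas a text-only model must satisfy it implicitly.

```python
# Hypothetical illustration: verifying a hard output constraint by execution.
# An LLM with execution access can run this check and retry on failure;
# a text-only model has no equivalent objective signal.

def meets_word_count(text: str, target: int = 50) -> bool:
    """Return True iff `text` contains exactly `target` whitespace-separated words."""
    return len(text.split()) == target

draft = " ".join(["word"] * 50)
assert meets_word_count(draft)                  # objective pass
assert not meets_word_count(draft + " extra")   # objective fail
```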
Cons (limitations and risks)
- Comparison is not intelligence-fair: sandbox adds compute, tools, and retrieval.
- Some evaluations rely on LLM judges, introducing judge bias and scoring variance.
- Internet-enabled sandbox is a confound for knowledge benchmarks.
- Multi-turn agents show instability and high variance.
- Security and operational risks expand significantly.
Overall assessment
- Supported: LLM+computer systems outperform LLM-only systems on many tasks.
- Partially supported: sandbox-RL improves exploration and transfers skills.
- Not established: claims of general intelligence increase independent of tools.
Next steps (concrete)
- Replicate using only open-weight models with pinned environments.
- Run capability ablations (no network / no exec / no filesystem).
- Replace LLM judges with deterministic verifiers where possible.
- Compare against strong tool-based baselines (RAG + calculators).
- Specify production sandbox policies (network, installs, quotas, logging).
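The ablation and verifier steps above can be sketched together. The capability flags and the exact-match verifier below are illustrative assumptions about how such a harness might be structured, not the evaluated system's actual configuration schema.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class SandboxConfig:
    """Capability toggles for one ablation arm (field names are illustrative)."""
    network: bool
    execution: bool
    filesystem: bool

# Full ablation grid: 2^3 = 8 arms, from the no-tools baseline
# (all False) up to the fully enabled sandbox (all True).
ABLATION_GRID = [SandboxConfig(*flags) for flags in product([False, True], repeat=3)]

def deterministic_verify(predicted: str, expected: str) -> bool:
    """Exact-match verifier after whitespace/case normalization.
    Usable in place of an LLM judge wherever the task has a canonical answer."""
    return predicted.strip().lower() == expected.strip().lower()
```

Running every benchmark task across the full grid isolates which capability (network, execution, filesystem) drives each measured gain, which directly addresses the intelligence-fairness confound noted in the Cons.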