Canvas 4 — Supportability of findings (LLM-in-Sandbox)
Executive summary (decision-grade)
The findings are supportable as a systems-level result: adding a general-purpose sandbox (filesystem, execution, optional network) to LLM inference improves measured task performance and enables capabilities unavailable to text-only generation. The evidence does not conclusively support claims of increased intrinsic “general intelligence” in the model itself.
Pros (supporting evidence)
- Clear intervention: sandbox vs vanilla generation.
- Mechanistic plausibility: execution, files, and retrieval explain observed gains.
- Long-context handling via filesystem is directly verifiable.
- Constraint satisfaction via code execution is objectively checkable, and easier than with text-only generation.
- Open-source sandbox implementation enables partial replication.
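The constraint-satisfaction point above can be illustrated with a minimal sketch. The specific constraint (an exact word count) and the function name are hypothetical examples, not drawn from the evaluated system; the point is that executed code turns a hard output constraint into an objective pass/fail check, whereas a text-only model must satisfy it implicitly.

```python
# Hypothetical illustration: verifying a hard output constraint by execution.
# An LLM with execution access can run this check and retry on failure;
# a text-only model has no equivalent objective signal.

def meets_word_count(text: str, target: int = 50) -> bool:
    """Return True iff `text` contains exactly `target` whitespace-separated words."""
    return len(text.split()) == target

draft = " ".join(["word"] * 50)
assert meets_word_count(draft)                  # objective pass
assert not meets_word_count(draft + " extra")   # objective fail
```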
Cons (limitations and risks)
- Comparison is not intelligence-fair: sandbox adds compute, tools, and retrieval.
- Some evaluations rely on LLM judges, introducing judge bias and scoring variance.
- Internet-enabled sandbox is a confound for knowledge benchmarks.
- Multi-turn agents show instability and high variance.
- Security and operational risks expand significantly.
Overall assessment
- Supported: LLM+computer systems outperform LLM-only systems on many tasks.
- Partially supported: sandbox-RL improves exploration and transfers skills.
- Not established: claims of general intelligence increase independent of tools.
Next steps (concrete)
- Replicate using only open-weight models with pinned environments.
- Run capability ablations (no network / no exec / no filesystem).
- Replace LLM judges with deterministic verifiers where possible.
- Compare against strong tool-based baselines (RAG + calculators).
- Specify production sandbox policies (network, installs, quotas, logging).
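The ablation and verifier steps above can be sketched together. The capability flags and the exact-match verifier below are illustrative assumptions about how such a harness might be structured, not the evaluated system's actual configuration schema.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class SandboxConfig:
    """Capability toggles for one ablation arm (field names are illustrative)."""
    network: bool
    execution: bool
    filesystem: bool

# Full ablation grid: 2^3 = 8 arms, from the no-tools baseline
# (all False) up to the fully enabled sandbox (all True).
ABLATION_GRID = [SandboxConfig(*flags) for flags in product([False, True], repeat=3)]

def deterministic_verify(predicted: str, expected: str) -> bool:
    """Exact-match verifier after whitespace/case normalization.
    Usable in place of an LLM judge wherever the task has a canonical answer."""
    return predicted.strip().lower() == expected.strip().lower()
```

Running every benchmark task across the full grid isolates which capability (network, execution, filesystem) drives each measured gain, which directly addresses the intelligence-fairness confound noted in the Cons.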