Harshil Agrawal — Why & How to Sandbox AI-Generated Code¶
Source index. ~38-min talk. Raw at harshil-agrawal-sandbox-ai-code-2026.
Thesis¶
AI-generated code is untrusted code from the internet. Apply capability-based security via sandboxing. Pick isolates vs. containers based on one question: does the code need a filesystem, processes, or packages? Then follow the 8-item universal checklist.
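A minimal sketch of that one-question decision (the flag and function names here are illustrative, not from the talk):

```typescript
// Pick the lightest runtime that still covers what the generated code needs.
// "isolate" = V8 isolate (no FS, no processes); "container" = full Linux userland.
type Runtime = "isolate" | "container";

interface CodeNeeds {
  filesystem: boolean; // reads/writes files?
  processes: boolean;  // spawns child processes?
  packages: boolean;   // installs packages at runtime?
}

function pickRuntime(needs: CodeNeeds): Runtime {
  return needs.filesystem || needs.processes || needs.packages
    ? "container"
    : "isolate";
}

// A pure data-transform snippet stays in an isolate:
pickRuntime({ filesystem: false, processes: false, packages: false }); // "isolate"
```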
Structure¶
- The trajectory — auto-complete → full code gen → autonomous agents, in 2 years
- Reframe — we're running untrusted code with production privileges
- Threat model — hallucination, "helpful" LLM, prompt injection (direct + indirect)
- Why sandboxes work — browsers, OS, mobile already solved this
- Capability-based security — default deny, explicit allow
- Spectrum: eval → isolates → containers → VMs
- Demo 1: OpenClaw-alternative using V8 isolates on Cloudflare
- Demo 2: PromptMotion (video generator) using containers
- Decision tree + trade-offs
- 8-item universal checklist
- Secrets anti-pattern + proxy-through-worker fix
- Cleanup discipline (`try/finally`, max lifetimes)
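A rough sketch of how the default-deny and cleanup items compose. The `createSandbox` API and its options are hypothetical, shown only to illustrate the shape of the pattern, not Cloudflare's actual SDK:

```typescript
// Hypothetical sandbox interface: capabilities are granted explicitly,
// everything else is denied, and the sandbox is always torn down.
interface Sandbox {
  run(code: string, input: unknown): Promise<unknown>;
  destroy(): Promise<void>;
}

declare function createSandbox(opts: {
  allowHosts: string[];     // explicit allow: the only outbound hosts permitted
  allowFilesystem: boolean; // default deny everything not listed
  maxLifetimeMs: number;    // hard cap on how long the sandbox may live
}): Promise<Sandbox>;

export async function runUntrusted(code: string, input: unknown): Promise<unknown> {
  const sandbox = await createSandbox({
    allowHosts: ["proxy.internal.example"], // your own worker, not the raw API
    allowFilesystem: false,
    maxLifetimeMs: 30_000,
  });
  try {
    return await sandbox.run(code, input);
  } finally {
    // Cleanup discipline: destroy even on error or timeout.
    await sandbox.destroy();
  }
}
```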
Concepts introduced¶
- ai-generated-code-is-untrusted — the reframe
- capability-based-security — the principle
- isolates-vs-containers — the decision framework + 8-item checklist
Entities¶
- harshil-agrawal — speaker
- cloudflare — employer; ships both demo primitives
Memorable moves¶
- "If you told someone, 'I found this code on a random website, let's run it in production,' you absolutely would not. That's security 101. But that's essentially what we're doing with LLM-generated code."
- "The over-helpful LLM is dangerous precisely because its behavior looks reasonable."
- "Don't enumerate what to block. Enumerate what to allow."
- "Would you rather give someone a master key and a list of 10,000 rooms they can't enter — or keys to the 3 rooms they actually need?"
- Secrets anti-pattern: `env.API_KEY` → sandbox. Never. Proxy through your own worker.
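A minimal sketch of that fix, assuming a Cloudflare Worker with the key as a secret binding; the upstream host is a placeholder, not the talk's demo code:

```typescript
// Worker that sits between the sandbox and the real API.
// The sandbox calls this Worker; env.API_KEY never enters the sandbox.
export interface Env {
  API_KEY: string; // secret binding configured on the Worker
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);

    // Forward only to the one upstream we intend (placeholder host),
    // attaching the secret server-side.
    const upstream = new URL(url.pathname + url.search, "https://api.example.com");
    const headers = new Headers(request.headers);
    headers.set("Authorization", `Bearer ${env.API_KEY}`);

    return fetch(upstream.toString(), {
      method: request.method,
      headers,
      body: request.method === "GET" || request.method === "HEAD" ? null : request.body,
    });
  },
};
```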
Cross-ingest links¶
- agent-security-slop — peter-steinberger's adjacent worry: AI-generated CVE reports flood maintainers, which is the supply-side of the same problem
- verifiable-systems-for-agents — eric-zakariasson's flip: agents need verifiable specs; humans need sandboxed execution; both are about bounding what an LLM can get wrong
- isolated-agent-vms — eric-zakariasson's parallel pattern: sandbox-per-async-agent (VM-level)
- parallel-agent-competitions — each parallel agent lives in its own sandbox by necessity
- animals-vs-ghosts — andrej-karpathy's frame: ghosts inherit human flaws including security ones, so trust ceiling is low by default
Open questions¶
- What's the container-per-user cost profile at 10k, 100k, 1M concurrent users? At what scale does the "one user one sandbox" rule start to buckle financially?
- How do you audit the 8-item checklist across a codebase? Is there a linter / policy-as-code for "this API endpoint exposes an unsandboxed LLM call"?
- Where do indirect prompt injections get caught — pre-sandbox (input validation) or post-sandbox (capability enforcement)? Probably both, but the split matters.
- The "code mode" pattern Agrawal mentions (for agent integration) — what's its status and how does it compare to OpenAI's code-interpreter?
Synthesis note¶
Three ingests from AI Engineer 2026 now form a tight triangle on the trust-boundary axis:
- peter-steinberger — maintainer-side: AI-generated CVE/PR slop
- mahmoud-mabrouk — judge-side: AI eval is theater unless calibrated
- harshil-agrawal — execution-side: AI code is untrusted; sandbox it
Together: don't trust what AI produces (at any layer — code, review, judgment) without verified bounds.