Harshil Agrawal — Why & How to Sandbox AI-Generated Code¶
Source index. ~38-min talk. Raw at harshil-agrawal-sandbox-ai-code-2026.
Thesis¶
AI-generated code is untrusted code from the internet. Apply capability-based security via sandboxing. Pick isolates vs. containers based on one question: does the code need a filesystem, processes, or packages? Then follow the 8-item universal checklist.
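A minimal sketch of that one-question decision (the flag and function names here are illustrative, not from the talk):

```typescript
// Pick the lightest runtime that still covers what the generated code needs.
// "isolate" = V8 isolate (no FS, no processes); "container" = full Linux userland.
type Runtime = "isolate" | "container";

interface CodeNeeds {
  filesystem: boolean; // reads/writes files?
  processes: boolean;  // spawns child processes?
  packages: boolean;   // installs packages at runtime?
}

function pickRuntime(needs: CodeNeeds): Runtime {
  return needs.filesystem || needs.processes || needs.packages
    ? "container"
    : "isolate";
}

// A pure data-transform snippet stays in an isolate:
pickRuntime({ filesystem: false, processes: false, packages: false }); // "isolate"
```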
Structure¶
- The trajectory — auto-complete → full code gen → autonomous agents, in 2 years
- Reframe — we're running untrusted code with production privileges
- Threat model — hallucination, "helpful" LLM, prompt injection (direct + indirect)
- Why sandboxes work — browsers, OS, mobile already solved this
- Capability-based security — default deny, explicit allow
- Spectrum: eval → isolates → containers → VMs
- Demo 1: OpenClaw-alternative using V8 isolates on Cloudflare
- Demo 2: PromptMotion (video generator) using containers
- Decision tree + trade-offs
- 8-item universal checklist
- Secrets anti-pattern + proxy-through-worker fix
- Cleanup discipline (`try/finally`, max lifetimes)
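A rough sketch of how the default-deny and cleanup items compose. The `createSandbox` API and its options are hypothetical, shown only to illustrate the shape of the pattern, not Cloudflare's actual SDK:

```typescript
// Hypothetical sandbox interface: capabilities are granted explicitly,
// everything else is denied, and the sandbox is always torn down.
interface Sandbox {
  run(code: string, input: unknown): Promise<unknown>;
  destroy(): Promise<void>;
}

declare function createSandbox(opts: {
  allowHosts: string[];     // explicit allow: the only outbound hosts permitted
  allowFilesystem: boolean; // default deny everything not listed
  maxLifetimeMs: number;    // hard cap on how long the sandbox may live
}): Promise<Sandbox>;

export async function runUntrusted(code: string, input: unknown): Promise<unknown> {
  const sandbox = await createSandbox({
    allowHosts: ["proxy.internal.example"], // your own worker, not the raw API
    allowFilesystem: false,
    maxLifetimeMs: 30_000,
  });
  try {
    return await sandbox.run(code, input);
  } finally {
    // Cleanup discipline: destroy even on error or timeout.
    await sandbox.destroy();
  }
}
```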
Concepts introduced¶
- ai-generated-code-is-untrusted — the reframe
- capability-based-security — the principle
- isolates-vs-containers — the decision framework + 8-item checklist
Entities¶
- harshil-agrawal — speaker
- cloudflare — employer; ships both demo primitives
Memorable moves¶
- "If you told someone, 'I found this code on a random website, let's run it in production,' you absolutely would not. That's security 101. But that's essentially what we're doing with LLM-generated code."
- "The over-helpful LLM is dangerous precisely because its behavior looks reasonable."
- "Don't enumerate what to block. Enumerate what to allow."
- "Would you rather give someone a master key and a list of 10,000 rooms they can't enter — or keys to the 3 rooms they actually need?"
- Secrets anti-pattern: `env.API_KEY` → sandbox. Never. Proxy through your own worker.
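A minimal sketch of that fix, assuming a Cloudflare Worker with the key as a secret binding; the upstream host is a placeholder, not the talk's demo code:

```typescript
// Worker that sits between the sandbox and the real API.
// The sandbox calls this Worker; env.API_KEY never enters the sandbox.
export interface Env {
  API_KEY: string; // secret binding configured on the Worker
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);

    // Forward only to the one upstream we intend (placeholder host),
    // attaching the secret server-side.
    const upstream = new URL(url.pathname + url.search, "https://api.example.com");
    const headers = new Headers(request.headers);
    headers.set("Authorization", `Bearer ${env.API_KEY}`);

    return fetch(upstream.toString(), {
      method: request.method,
      headers,
      body: request.method === "GET" || request.method === "HEAD" ? null : request.body,
    });
  },
};
```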
Cross-ingest links¶
- agent-security-slop — peter-steinberger's adjacent worry: AI-generated CVE reports flood maintainers, which is the supply-side of the same problem
- verifiable-systems-for-agents — eric-zakariasson's flip: agents need verifiable specs; humans need sandboxed execution; both are about bounding what an LLM can get wrong
- isolated-agent-vms — eric-zakariasson's parallel pattern: sandbox-per-async-agent (VM-level)
- parallel-agent-competitions — each parallel agent lives in its own sandbox by necessity
- animals-vs-ghosts — andrej-karpathy's frame: ghosts inherit human flaws including security ones, so trust ceiling is low by default
Open questions¶
- What's the container-per-user cost profile at 10k, 100k, 1M concurrent users? At what scale does the "one user one sandbox" rule start to buckle financially?
- How do you audit the 8-item checklist across a codebase? Is there a linter / policy-as-code for "this API endpoint exposes an unsandboxed LLM call"?
- Where do indirect prompt injections get caught — pre-sandbox (input validation) or post-sandbox (capability enforcement)? Probably both, but the split matters.
- The "code mode" pattern Agrawal mentions (for agent integration) — what's its status and how does it compare to OpenAI's code-interpreter?
Synthesis note¶
Three ingests from AI Engineer 2026 now form a tight triangle on the trust-boundary axis:
- peter-steinberger — maintainer-side: AI-generated CVE/PR slop
- mahmoud-mabrouk — judge-side: AI eval is theater unless calibrated
- harshil-agrawal — execution-side: AI code is untrusted; sandbox it
Together: don't trust what AI produces (at any layer — code, review, judgment) without verified bounds.