While remote coding IDEs have been around for years, a new wave of remote code execution environments has been emerging. For example, when you ask a Large Language Model (LLM) like ChatGPT to perform some numerical calculations that LLMs aren’t usually good at, like calculating the number of days across a time interval, creating charts, or anything more complex, it often writes a Python script and runs it to produce the result. Where does this code run? In a sandbox!

A sandbox is an isolated environment for running untrusted code with a strictly defined security perimeter. It contains tools (e.g., a Python interpreter) and useful modules that can be used when running user code. Per-execution isolation ensures that untrusted code from one user or request doesn’t interfere with another.

How Code Sandboxes Work

When the LLM generates code, a sandbox is created in the background; the code is uploaded and executed, and the result is returned:

[Diagram: LLM AI agent code execution sandbox flow]

Depending on the use case, the sandbox can persist across code execution calls, and you may want a persistent sandbox per user. This is useful when an AI agent executes code that needs access to the user's uploaded files.
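Per-user affinity can be as simple as a map from user to sandbox. Here's a minimal sketch, assuming a hypothetical create_sandbox() helper that provisions a sandbox and returns its ID:

user_sandboxes: dict[str, str] = {}  # user_id -> sandbox_id

def sandbox_for(user_id: str) -> str:
    # lazily create one persistent sandbox per user and reuse it afterwards
    if user_id not in user_sandboxes:
        user_sandboxes[user_id] = create_sandbox(lang="python")  # hypothetical helper
    return user_sandboxes[user_id]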

As usual, the devil’s in the details…

We could certainly shell out to Python, eval() the code, and call it a day, but it's not the '90s anymore. The fact that the code is generated by an LLM is no guarantee that it's safe. There are a couple of methods at our disposal for implementing a secure sandbox:

  • Containers: Standard Linux containers use kernel namespaces and cgroups to isolate processes, with essentially no performance penalty compared to running code outside a sandbox. This is what Docker uses out of the box (via its default runtime, runc).
  • User-mode kernels: A user-space kernel process intercepts and services the application's Linux system calls, isolating it from the host kernel.
  • Virtual machines: Lightweight hypervisors provide isolation through hardware virtualization, at a slight performance hit compared to containers.

I should also mention WebAssembly, a virtual stack machine (in the same family as the JVM and others). It only runs code specifically compiled for it. Yes, Python and Node.js can be compiled to Wasm, but they often rely on native modules that may not be compatible.

In practice, the choice boils down to specific implementations of these approaches:

Although VMs have a performance penalty due to the boundary between the guest and the host, they add an additional layer of security. You also need to think ahead when planning for capacity, since each VM needs to be assigned a dedicated amount of CPU, RAM, and a root file system. You can alleviate some issues by sharing a read-only root file system among multiple VMs and using overlay file systems for writable storage, but the orchestration gets hairy quickly.

gVisor is a user-space kernel that provides strong isolation by intercepting system calls made by processes inside the virtualized environment. It can run standalone or as a Docker runtime. For our purposes, I think it’s a perfect middle ground for running untrusted code. Integrating it only requires a simple Docker daemon configuration change.

Next, we need to decide on the sandbox environment itself, how to get the code into it, and how to extract execution results. Should the sandbox persist in the background? If so, when to terminate it?

An initial approach might involve writing small wrappers for each supported language. If the LLM generates code, it might be in a language like Python, JavaScript, TypeScript, or Bash. The result might be printed to the console or written to a file. Languages have multiple frameworks, and it would be onerous to figure out how to get code in and results out of each supported language.

Fortunately, there’s Jupyter Notebook. It lets you create interactive notebooks of code and data and has been popular for years in the data science community. It has the concept of a kernel, which executes code in the target environment (e.g., Python, Node.js, etc.). Each code cell can output text, images, or arbitrary HTML.

Instead of building what would amount to a custom REPL, we can drive Jupyter programmatically: submit code for execution via its API and collect the results. There's no need to reinvent the wheel! Thus, our sandbox environment should include a properly configured Jupyter Notebook installation with the appropriate kernels.

The Jupyter Notebook backend may not be ideal for long-running code like a web app server, but it works for most use cases. Regardless, Jupyter is an implementation detail that doesn’t need to be exposed to the consumer (LLM, agent, etc.). For that reason, we’ll need a minimal API server running in front of Jupyter Notebook, which accepts code execution requests, submits them to a Jupyter Notebook instance, and extracts the results. This way, the LLM or AI agent can consume a set of simple HTTP endpoints that handle generated code execution.

Proof of Concept

Install gVisor and configure Docker to use it by default.
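Making runsc the default runtime amounts to a small daemon.json change along these lines (assuming runsc is installed at /usr/local/bin/runsc; adjust the path for your system), followed by a Docker daemon restart:

{
  "default-runtime": "runsc",
  "runtimes": {
    "runsc": {
      "path": "/usr/local/bin/runsc"
    }
  }
}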

Let’s write a quick and dirty PoC that starts a Jupyter kernel and executes code that returns text and a plot in the form of an image:

import asyncio
from jupyter_client.manager import AsyncKernelManager

async def main():
    km = AsyncKernelManager()
    await km.start_kernel()
    kc = km.client()
    kc.start_channels()
    await kc.wait_for_ready()

    # submit the code; msg_id lets us match IOPub replies to this request
    msg_id = kc.execute("""
import matplotlib.pyplot as plt

print("hello")

fig, ax = plt.subplots()
ax.plot([1, 2])

plt.show()

print("world")
""")

    # read output from the IOPub channel until the kernel reports idle
    while True:
        reply = await kc.get_iopub_msg()
        if reply["parent_header"]["msg_id"] == msg_id:
            msg_type = reply["msg_type"]
            if msg_type == "stream":
                print(f'TEXT: {reply["content"]["text"]}')
            elif msg_type == "display_data":
                content = reply["content"]
                if "image/png" in content["data"]:
                    print(f'IMAGE: {content["data"]["image/png"]}\n')
            elif msg_type == "error":
                print(f'ERROR: {reply["content"]["traceback"][0]}')
                break
            elif msg_type == "status" and reply["content"]["execution_state"] == "idle":
                break

    kc.stop_channels()
    await km.shutdown_kernel()
    
asyncio.run(main())

Running this results in:

TEXT: hello

IMAGE: <LONG BASE64 BLOB>

TEXT: world

Beautiful.

Due to an unresolved issue, it's difficult to start a Jupyter kernel remotely, so we'll need to run the above code inside the container. The simplest way to do that is to run an HTTP server inside the container and a proxy server outside it that forwards requests to the appropriate inner server. In any case, I think this is how you would do it in production anyway.

In light of this, all we need to do now is wrap a FastAPI server around it, run it inside a Docker container via gVisor, and write another FastAPI server to orchestrate those containers!

Sandbox Server Solution

High-Level Design

[Diagram: LLM AI agent code execution sandbox architecture]

The Sandbox Manager in our design exposes a lightweight API server with HTTP endpoints for launching sandboxes, executing code, and returning results.

I don’t want to expose the code execution engine’s (Jupyter Notebook) interface directly to the consumer (e.g., an agent) because it’s just an implementation detail, and we don’t want to tie ourselves to Jupyter Notebook only.

In-Sandbox API Server

Each sandbox will run its own API server that vends a single /execute endpoint:

from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from jupyter_client.manager import AsyncKernelManager
import asyncio
import json

app = FastAPI()

async def execute_code(code: str):
    # start a fresh kernel per request: simple, at the cost of startup latency
    km = AsyncKernelManager()
    await km.start_kernel()
    kc = km.client()
    kc.start_channels()
    await kc.wait_for_ready()

    msg_id = kc.execute(code)

    async def stream_results():
        try:
            while True:
                reply = await kc.get_iopub_msg()
                msg_type = reply["msg_type"]
                if msg_type == "stream":
                    yield json.dumps({"text": reply["content"]["text"]}) + "\n"
                elif msg_type == "display_data":
                    data = reply["content"]["data"]
                    if "image/png" in data:
                        yield json.dumps({"image": data["image/png"]}) + "\n"
                elif msg_type == "error":
                    traceback = "\n".join(reply["content"]["traceback"])
                    yield json.dumps({"error": traceback}) + "\n"
                    break
                elif msg_type == "status" and reply["content"]["execution_state"] == "idle":
                    break
        except asyncio.CancelledError:
            pass
        finally:
            kc.stop_channels()
            await km.shutdown_kernel()

    return StreamingResponse(stream_results(), media_type="application/x-ndjson")

@app.post("/execute")
async def execute(request: dict):
    if "code" not in request:
        raise HTTPException(status_code=400, detail="Missing 'code' field")

    return await execute_code(request["code"])

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="127.0.0.1", port=8000)

Let’s package this up into a Docker image. We’ll use the Jupyter image as the base. Here’s the Dockerfile:

FROM jupyter/base-notebook

RUN pip install --no-cache-dir fastapi uvicorn jupyter_client
WORKDIR /app
COPY inside_server.py /app/server.py

EXPOSE 8000
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8000"]

Let’s build and run it:

docker build -t fastapi-jupyter-server .
docker run -p 8000:8000 fastapi-jupyter-server

You should see:

INFO:     Started server process [7]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
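If you didn't make runsc the Docker default earlier, you can opt in per container with the --runtime flag:

docker run --runtime=runsc -p 8000:8000 fastapi-jupyter-server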

Let’s curl it:

curl "http://localhost:8000/execute" -H "Content-Type: application/json" -d '{"code": "print(\"hello from sandbox\")"}'
{"text": "hello from sandbox\n"}

Great. But what about images?

curl "http://localhost:8000/execute" -H "Content-Type: application/json" -d \
'{"code": "import matplotlib.pyplot as plt\nfig, ax = plt.subplots()\nax.plot([1, 2])\nplt.show()"}'
...

That blew up. In the sea of VT100 escape sequences we spot the error: ModuleNotFoundError\u001b[0m: No module named 'matplotlib'. Of course, I forgot to install the dependencies! How should we do that?

We could expose another endpoint like /install that accepts a list of Python dependencies. But that's too complex for this PoC; let's just bake them into the image by adding matplotlib to the end of the pip install command in the Dockerfile:

RUN pip install --no-cache-dir fastapi uvicorn jupyter_client matplotlib

Rerunning the curl command makes everything work again.
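For the curious, a hypothetical /install endpoint in the in-sandbox server might look like the sketch below; it reuses the server's existing app, HTTPException, and asyncio imports and simply shells out to pip:

@app.post("/install")
async def install(request: dict):
    # hypothetical: install Python packages into the sandbox at runtime
    packages = request.get("packages", [])
    if not packages:
        raise HTTPException(status_code=400, detail="Missing 'packages' field")
    proc = await asyncio.create_subprocess_exec(
        "pip", "install", "--no-cache-dir", *packages,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,
    )
    output, _ = await proc.communicate()
    if proc.returncode != 0:
        raise HTTPException(status_code=500, detail=output.decode())
    return {"installed": packages}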

Sandbox Manager API Server

Now that we have the basics working, let’s think about how to build the “outer” server. It should expose a simple HTTP surface like:

GET /sandboxes
POST /sandboxes
POST /sandboxes/<id>/execute
GET /sandboxes/<id>
DELETE /sandboxes/<id>

An elementary REST API. We’ve already implemented one of those endpoints :)

As for the rest, let’s just call out directly to the Docker API. GET /sandboxes is basically a container list operation (filtered by some label, in case you have other non-sandbox containers running), and so on.

Then there’s the issue of port forwarding. Let’s use port 0 and let Docker allocate a port, and then we fish it out of the container metadata.

Ready or not, here’s the vibe-coded implementation:

from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from contextlib import asynccontextmanager
import asyncio
import time
import docker
import httpx
import uuid
import json

# configuration
IMAGE_NAME = "fastapi-jupyter-server"
CONTAINER_PREFIX = "sandbox_"
SANDBOX_PORT = 8000
IDLE_TIMEOUT = 60  # seconds of inactivity before a sandbox is reaped
CHECK_INTERVAL = 60  # seconds between reaper sweeps

client = docker.from_env()
hx = httpx.AsyncClient()
last_active = {}  # sandbox_id -> unix timestamp of last execution

async def terminate_idle_sandboxes():
    while True:
        await asyncio.sleep(CHECK_INTERVAL)
        now = time.time()

        for container in await asyncio.to_thread(list_sandboxes):
            sandbox_id = container.id
            last_time = last_active.get(sandbox_id, None)

            if last_time is None:
                print(f"Terminating untracked sandbox {sandbox_id} (server restarted?)")
                try:
                    container.stop()
                    container.remove()
                except docker.errors.NotFound:
                    pass
                continue

            if now - last_time > IDLE_TIMEOUT:
                print(f"Terminating idle sandbox {sandbox_id} (idle for {now - last_time:.1f} seconds)")
                try:
                    container.stop()
                    container.remove()
                    last_active.pop(sandbox_id, None)
                except docker.errors.NotFound:
                    last_active.pop(sandbox_id, None) 

@asynccontextmanager
async def lifespan(app: FastAPI):
    asyncio.create_task(terminate_idle_sandboxes())
    yield

app = FastAPI(lifespan=lifespan)

class CreateSandboxRequest(BaseModel):
    lang: str

class ExecuteRequest(BaseModel):
    code: str

def list_sandboxes():
    return client.containers.list(filters={"label": "sbx=1"})

@app.get("/sandboxes")
async def get_sandboxes():
    sandboxes = [
        {"id": container.id, "name": container.name, "status": container.status}
        for container in list_sandboxes()
    ]
    return {"sandboxes": sandboxes}

@app.post("/sandboxes")
async def create_sandbox(request: CreateSandboxRequest):
    if request.lang.lower() != "python":
        raise HTTPException(status_code=400, detail="Only Python sandboxes are supported.")

    container_name = CONTAINER_PREFIX + str(uuid.uuid4())[:8]
    
    try:
        container = client.containers.run(
            IMAGE_NAME,
            name=container_name,
            labels={
                "sbx": "1",
                "sbx_lang": request.lang.lower()
            },
            detach=True,
            stdin_open=False,
            tty=False,
            ports={f"{SANDBOX_PORT}/tcp": 0},  # Auto-assign a port
        )
        last_active[container.id] = time.time()
        return {"id": container.id, "name": container.name, "status": container.status}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.get("/sandboxes/{sandbox_id}")
async def get_sandbox(sandbox_id: str):
    try:
        container = client.containers.get(sandbox_id)
        if "sbx" not in container.labels:
            raise HTTPException(status_code=404, detail="Sandbox not found")

        ports = container.attrs["NetworkSettings"]["Ports"]
        port_mapping = ports.get(f"{SANDBOX_PORT}/tcp", [])
        if not port_mapping:
            raise HTTPException(status_code=500, detail="No exposed port found")

        host_port = port_mapping[0]["HostPort"]

        return {
            "id": container.id,
            "name": container.name,
            "status": container.status,
            "port": host_port,
        }
    except docker.errors.NotFound:
        raise HTTPException(status_code=404, detail="Sandbox not found")

@app.post("/sandboxes/{sandbox_id}/execute")
async def execute_code(sandbox_id: str, request: ExecuteRequest):
    if not request.code.strip():
        raise HTTPException(status_code=400, detail="Code cannot be empty.")
    try:
        container = client.containers.get(sandbox_id)
        if "sbx" not in container.labels:
            raise HTTPException(status_code=404, detail="Sandbox not found")

        ports = container.attrs["NetworkSettings"]["Ports"]
        port_mapping = ports.get(f"{SANDBOX_PORT}/tcp", [])
        if not port_mapping:
            raise HTTPException(status_code=500, detail="No exposed port found")

        host_port = port_mapping[0]["HostPort"]
        sandbox_url = f"http://localhost:{host_port}/execute"

        # validate the sandbox's response before streaming begins; raising
        # HTTPException mid-stream would be too late to change the status code
        req = hx.build_request("POST", sandbox_url, json=request.dict())
        response = await hx.send(req, stream=True)
        if not response.is_success:
            await response.aclose()
            raise HTTPException(status_code=response.status_code, detail="Execution failed")

        async def stream_response():
            try:
                async for chunk in response.aiter_bytes():
                    yield chunk
                    last_active[sandbox_id] = time.time()
            finally:
                await response.aclose()

        return StreamingResponse(stream_response(), media_type="application/x-ndjson")
    except docker.errors.NotFound:
        raise HTTPException(status_code=404, detail="Sandbox not found")

@app.delete("/sandboxes/{sandbox_id}")
async def delete_sandbox(sandbox_id: str):
    try:
        container = client.containers.get(sandbox_id)
        if "sbx" not in container.labels:
            raise HTTPException(status_code=404, detail="Sandbox not found")

        container.stop()
        container.remove()
        last_active.pop(sandbox_id, None)
        return {"message": f"Sandbox {sandbox_id} deleted"}
    except docker.errors.NotFound:
        raise HTTPException(status_code=404, detail="Sandbox not found")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="127.0.0.1", port=8000)

Let me explain how it works:

When creating the container, we set labels that we can filter when querying running sandboxes.

A background task runs every minute to check for sandboxes that are idle (i.e., no activity) and terminates them. In real-world usage, you’d obviously want to account for long-running sandboxes, etc.

I couldn’t get away with zero state in the server because container labels in Docker are static and can’t be updated dynamically, hence the last_active dict.
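A typical session against the manager then looks something like this (assuming it's listening on port 8000; substitute the id returned by the create call):

# create a Python sandbox
curl -X POST http://localhost:8000/sandboxes -H "Content-Type: application/json" -d '{"lang": "python"}'

# run code in it
curl -X POST http://localhost:8000/sandboxes/<id>/execute -H "Content-Type: application/json" -d '{"code": "print(1 + 1)"}'

# tear it down
curl -X DELETE http://localhost:8000/sandboxes/<id>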

Production Proofing

You’ll need to do quite a bit of work to productionize this, including:

Known issues:

  • error handling
  • specifying code dependencies

Security:

  • auth & authz
  • additional VM-based isolation
  • audit logging

Anti-abuse:

  • egress filtering of network traffic
  • resource limits (CPU, memory, I/O, network, etc.); see the sketch after these lists

Features:

  • file uploads and downloads
  • persistent storage
  • inbound connectivity
  • support for other languages/Jupyter Notebook kernels
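On the resource-limits front, docker-py accepts caps directly on containers.run(), so a hardened launch in create_sandbox might look roughly like this (values are illustrative; egress filtering would additionally require a dedicated Docker network and firewall rules):

container = client.containers.run(
    IMAGE_NAME,
    name=container_name,
    labels={"sbx": "1", "sbx_lang": request.lang.lower()},
    detach=True,
    mem_limit="512m",         # hard memory cap
    nano_cpus=1_000_000_000,  # one CPU's worth of time
    pids_limit=128,           # blunt protection against fork bombs
    ports={f"{SANDBOX_PORT}/tcp": 0},
)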

Nevertheless, this should be a good starting point for further exploring code sandboxes. If you use any of this, I’d love to hear about it!

Summary

By now, you should have a better grasp of the different techniques to run untrusted code that are supported on Linux, as well as a concrete implementation using Docker and gVisor. I think this is a good middle ground for security and performance, but of course, as mentioned, there are many areas of improvement before deploying it into production.

If you use any of this to build something awesome, let me know! 🚀