SafeLisp: A Language-Level Sandbox for AI-Generated Code

Mar 26, 2026

I’ve always enjoyed tinkering with experimental programming languages — I wrote an interpreted language called Meme back in 2009, and I’ve been drawn to language design ever since. When Andrej Karpathy released autoresearch, a pattern for letting AI agents run autonomous research loops — modify code, run an experiment, evaluate results, iterate — I wanted to try it on something other than ML training. Language design seemed like a natural fit: the research questions are well-defined, the experiments are deterministic, and I could let the loop explore the design space while I slept.

The research question I picked: can you push the security boundary for AI-generated code into the language itself? Infrastructure-level sandboxing — containers, microVMs, seccomp filters — can isolate a process, but inside that sandbox, the code still has ambient authority over everything the runtime exposes. The interesting security questions aren’t “can we isolate a process?” but “can we control what that process is allowed to do?”

That led to SafeLisp: a minimal, statically-typed, purely functional language that combines three ideas from programming language theory to create a sandbox where untrusted code can’t do anything it wasn’t explicitly allowed to do. The project is about 3,900 lines of Python for the language itself, backed by 315 tests, a structured experiment framework, and a 10,000-program fuzz campaign.

The source code, technical report, and language specification are on GitHub.

The Autoresearch Loop

Karpathy’s autoresearch pattern is elegantly simple: give an AI agent a controllable experiment, let it modify code, run for a fixed budget, evaluate results, and iterate. His version runs ML training experiments — the agent edits train.py, trains for 5 minutes, checks validation loss, and decides what to try next.

I adapted this for language design research. Instead of training a model, each loop iteration runs a structured experiment against the SafeLisp implementation:

  1. The agent picks a research question (e.g., “What’s the overhead of effect tracking?”)
  2. It runs the experiment cases — programs that exercise specific language features
  3. It evaluates against success criteria (did it crash? did effects leak? was it fast enough?)
  4. Results are serialized to JSON, and the next iteration picks the next question

I defined seven research questions (RQ-001 through RQ-007) and four structured experiments (EXP-001 through EXP-004) covering 22 test cases. The loop ran these automatically, producing results I could review in the morning. It’s not fully autonomous the way Karpathy’s ML loop is — I still wrote the language implementation — but it freed up the experimental validation and helped me discover things like the zero-overhead effect tracking result (more on that below).

The Design

SafeLisp’s safety model rests on three mechanisms, each addressing a different class of risk.

Static Effect Tracking

Inspired by Koka, every function’s type signature includes the effects it may perform. A function typed (-> [Int] Int <pure>) is guaranteed to have no side effects. A function typed (-> [String] Unit <console>) can print but nothing else. The type checker — a 529-line Hindley-Milner inference engine with row-polymorphic effect variables — enforces this statically, before the code ever runs.

The technical term is row-polymorphic algebraic effects. What matters in practice: you can look at a function’s type and know exactly what it can do. If AI-generated code tries to sneak in a network call where the type says <pure>, the type checker rejects it at analysis time, not at runtime. Effect rows compose cleanly — <console | $e> means “has console, plus whatever effects $e brings” — so functions can be polymorphic over effects they don’t care about.
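To make the composition rule concrete, here is a minimal Python sketch of effect-row unification (illustrative only; SafeLisp’s actual 529-line inference engine unifies full types, not just effect labels). An effect row is modeled as a set of labels plus an optional row variable, and unifying <console | $e> against a concrete row binds $e to the leftover effects:

```python
# Illustrative sketch, not SafeLisp's implementation: a row is
# (labels, row_var). Unification binds the row variable to whatever
# concrete effects the pattern didn't mention.

def unify_row(pattern, concrete_labels):
    labels, row_var = pattern
    missing = labels - concrete_labels
    if missing:
        raise TypeError(f"effects not available: {missing}")
    leftover = concrete_labels - labels
    if row_var is None:
        if leftover:
            raise TypeError(f"effects not declared: {leftover}")
        return {}
    return {row_var: frozenset(leftover)}

# <console | $e> unified against <console, network> binds $e = {network}
binding = unify_row((frozenset({"console"}), "$e"),
                    frozenset({"console", "network"}))
# binding == {"$e": frozenset({"network"})}
```

A closed row (no row variable) rejects any extra effects, which is exactly why a <pure> signature cannot hide a network call.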

Capability-Based Authority

Borrowed from the E programming language, SafeLisp starts every execution with zero capabilities. The host must explicitly grant capabilities for each effect — console, filesystem, network, etc. Capabilities are unforgeable tokens that can be attenuated (restricted further) and revoked at any time.

This means user code cannot manufacture or escalate privileges. If you grant a “read from /data/input.csv” capability, that’s all the code gets — not read access to the whole filesystem. Container sandboxes can’t express policies at this granularity. They operate at the system call level; SafeLisp operates at the semantic level.
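The attenuate-and-revoke pattern can be sketched in a few lines of Python (the class and method names here are hypothetical, not SafeLisp’s embedding API). The capability is an unforgeable object whose methods check a live policy, so the host can narrow or kill it after handing it out:

```python
# Hedged sketch of capability-style authority (illustrative names).
# A capability checks its own policy on every use, so revocation and
# attenuation take effect immediately.

class FileReadCap:
    def __init__(self, allowed_path):
        self._allowed = allowed_path
        self._revoked = False

    def read(self, path):
        if self._revoked:
            raise PermissionError("capability revoked")
        if path != self._allowed:
            raise PermissionError(f"no authority for {path}")
        return f"<contents of {path}>"  # stand-in for a real read

    def revoke(self):
        self._revoked = True

cap = FileReadCap("/data/input.csv")
cap.read("/data/input.csv")   # allowed
# cap.read("/etc/passwd")     # raises PermissionError
cap.revoke()
# cap.read("/data/input.csv") # now also raises PermissionError
```

The untrusted code never sees a path string it can rewrite into broader access; it only holds the token, and the token embodies the policy.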

Resource Limits

A configurable budget for CPU steps (default 1M), call stack depth (default 1K), and heap allocation (default 10MB) ensures that even a well-typed program can’t consume unbounded resources. Infinite loops and memory bombs are caught and terminated cleanly. This is the simplest of the three mechanisms, but it’s essential — without it, a denial-of-service is trivial even inside a capability-restricted sandbox.
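The step-budget mechanism is simple enough to sketch directly (a Python analogue of the idea, not the interpreter’s actual code): the evaluator ticks a counter on every step, and exhausting the budget raises a clean, catchable error instead of hanging the host.

```python
# Illustrative step budget: every evaluation step pays one tick;
# running out terminates the program cleanly rather than hanging.

class StepBudgetExceeded(Exception):
    pass

class Budget:
    def __init__(self, max_steps=1_000_000):
        self.remaining = max_steps

    def tick(self):
        self.remaining -= 1
        if self.remaining < 0:
            raise StepBudgetExceeded("step limit hit")

def run_untrusted_loop(budget):
    # stand-in for an infinite loop in untrusted code
    while True:
        budget.tick()

try:
    run_untrusted_loop(Budget(max_steps=10_000))
except StepBudgetExceeded:
    result = "terminated cleanly"
```

Depth and heap limits work the same way: a counter checked at function entry and at allocation sites, respectively.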

Two-Layer Enforcement

The static type checker and the runtime handler stack operate independently. The type checker catches common violations fast (sub-millisecond). The runtime handler stack provides defense-in-depth — even if the type checker has a bug, the runtime prevents effect escapes. This was a deliberate design choice: the static layer rejects only provable violations, so you don’t get false positives, while the runtime layer enforces everything at perform time, so you also don’t get escapes.
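The interplay of the two layers can be sketched as follows (function and effect names are hypothetical, purely to show the control flow): the static check runs first and cheaply, but every perform is re-checked against the live grant set, so a program whose declared effects lie still gets stopped.

```python
# Illustrative two-layer enforcement: layer 1 compares declared effects
# against grants before running anything; layer 2 re-checks each perform.

def enforce(declared_effects, granted, program):
    # Layer 1: static -- reject before execution
    undeclared = declared_effects - granted
    if undeclared:
        return ("rejected_statically", undeclared)
    # Layer 2: dynamic -- each perform consults the live grant set
    def perform(effect, payload):
        if effect not in granted:
            raise PermissionError(effect)
        return f"{effect}:{payload}"
    try:
        return ("ok", program(perform))
    except PermissionError as e:
        return ("blocked_at_runtime", str(e))

# A program whose declared (empty) effects lied about using the network:
sneaky = lambda perform: perform("network", "GET /")
enforce(frozenset(), frozenset({"console"}), sneaky)
# -> ('blocked_at_runtime', 'network')
```

In the real system layer 1 is the Hindley-Milner checker and layer 2 is the handler stack, but the failure modes compose the same way: a checker bug downgrades you to runtime blocking, never to an escape.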

[Figure: SafeLisp two-layer enforcement]

What It Looks Like

SafeLisp uses S-expression syntax with Hindley-Milner type inference, so you rarely need type annotations:

;; Pure function -- no annotations needed
(letrec [fact (fn [n] (if (= n 0) 1 (* n (fact (- n 1)))))]
  (fact 10))
;; => 3628800

;; Effect-performing code must be handled
(handle
  (perform! console.print "hello world")
  [console.print [msg] msg])

From Python, the embedding API lets you configure the sandbox and run untrusted code in a few lines:

from safelisp.sandbox import Sandbox, SandboxConfig

sb = Sandbox(SandboxConfig(max_steps=10_000))
result = sb.execute(untrusted_code)

if result.success:
    print(result.value)
else:
    print(f"{result.error_type}: {result.error}")

You can also statically analyze code without executing it — useful for pre-screening AI output before deciding whether to run it at all.

Three Backends

I built three evaluation backends, each with different trade-offs:

  1. Tree-walking interpreter (671 LOC) — the default. Uses a trampoline loop for tail-call optimization, so tail-recursive programs can run 50,000+ iterations without blowing the Python stack. Simplest to understand and debug.

  2. CPS evaluator (383 LOC) — continuation-passing style with delimited continuations. This is the backend that supports multi-shot effect handlers: a handler can call resume multiple times, enabling patterns like backtracking and nondeterminism. Architecturally the most interesting; slower in practice due to Python overhead.

  3. Bytecode compiler and VM (653 LOC) — compiles to a flat instruction stream and runs on a stack-based VM with 23 opcodes. Faster than the tree-walker on tight arithmetic (1.8x), but Python-level dispatch overhead means the tree-walker actually wins on recursive workloads. A C or Rust inner loop would flip this.
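The trampoline trick the tree-walker relies on is worth seeing in miniature (a Python analogue, not the interpreter’s code): instead of making a tail call directly, the evaluator returns a thunk, and a flat driver loop keeps invoking thunks until a real value appears. The stack never grows past one frame:

```python
# Illustrative trampoline: a tail call is returned as a zero-argument
# thunk; the driver loop calls thunks until a non-callable value remains.

def countdown(n):
    if n == 0:
        return 0
    return lambda: countdown(n - 1)  # tail call, deferred as a thunk

def trampoline(value):
    while callable(value):
        value = value()
    return value

trampoline(countdown(100_000))  # direct recursion would overflow Python's stack
```

This is how tail-recursive SafeLisp programs run 50,000+ iterations on top of a Python interpreter with a ~1,000-frame default recursion limit.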

Having three backends turned out to be more valuable than I expected — not for performance, but for correctness. The CPS evaluator uncovered a Python recursion limit issue (solved via trampolining). The bytecode VM revealed a subtle closure-capture bug in letrec where closures were capturing frames by copy instead of by reference, so mutually recursive functions couldn’t see each other. Each backend validates the others.
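The closure-capture bug class is easy to reproduce in a Python analogue (this is an illustration of the bug’s shape, not the VM’s actual code): snapshotting the environment by copy at closure-creation time means a mutually recursive binding defined a moment later is invisible, while sharing the live frame by reference works.

```python
# Illustration of the letrec capture bug: by-copy snapshots miss
# bindings added after closure creation; by-reference sharing sees them.

def letrec_by_copy():
    env = {}
    env["is_even"] = lambda n, _env=dict(env): n == 0 or _env["is_odd"](n - 1)
    env["is_odd"]  = lambda n, _env=dict(env): n != 0 and _env["is_even"](n - 1)
    return env  # each _env was snapshotted before both bindings existed

def letrec_by_reference():
    env = {}
    env["is_even"] = lambda n: n == 0 or env["is_odd"](n - 1)
    env["is_odd"]  = lambda n: n != 0 and env["is_even"](n - 1)
    return env  # closures share the single live env

letrec_by_reference()["is_even"](10)   # True
# letrec_by_copy()["is_even"](10)      # KeyError: 'is_odd'
```

A single-backend implementation can mask this if its environment representation happens to share frames by accident; compiling the same programs through a second backend with a different frame layout is what surfaced it.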

How It Holds Up

I evaluated SafeLisp against eight success criteria covering memory safety, effect isolation, termination control, correctness, auditability, performance, and AI safety. The experiments ran through the autoresearch loop and produced structured JSON results.

Structured Experiments (22 cases, 100% pass rate):

  • EXP-001 (Pure Functional Core): Fibonacci, Church encodings, higher-order functions, pattern matching — all correct. Fibonacci(15) computes in 19,730 steps.
  • EXP-002 (Safety Boundaries): Infinite loops caught at the step limit. Deep recursion caught at the depth limit. Memory bombs caught at the allocation limit. Unhandled console, filesystem, and network effects all blocked — both statically and at runtime. 8/8 cases pass.
  • EXP-003 (Effect Handling): Handlers transform results, multiple effects compose, pure code needs no handler. 4/4 cases pass.
  • EXP-004 (AI Code Safety): A pure data-processing pipeline (filter/map) succeeds. A filesystem escape attempt is blocked at the type-check stage. A resource bomb is caught. Sandboxed console output is transformed by the handler. 4/4 cases pass.

Research Questions:

  • RQ-001 (Effect overhead): 0.97x — wrapping code in an effect handler has zero measurable overhead. The handler stack is only activated when perform! is encountered; pure code inside a handle block takes neither branch. This surprised me — I expected at least a 20-30% hit.
  • RQ-006 (AI code primitives): The current 17 builtins are sufficient for four representative AI code patterns (data transformation, string processing, recursive computation, accumulator).
  • RQ-007 (Error messages): 6/6 error types produce messages descriptive enough for an AI agent to self-correct.

Fuzz Testing (10,000 programs):

The fuzz testing was the most satisfying result. I built an AST fuzzer that generates random programs — atoms, binary operators, conditionals, bindings, function applications, list operations, pattern matching, records, effects — with a max depth of 5 and tight resource limits. Out of 10,000 generated programs (seed 42 for reproducibility): 8,166 executed successfully, 1,834 produced clean errors (type errors, resource exhaustion, malformed expressions), and zero crashed the runtime or escaped the sandbox. Throughput: about 24,600 programs per second.
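The shape of that fuzz campaign can be sketched in miniature (a toy generator and evaluator for illustration, not the project’s fuzzer): generate random expressions to a bounded depth from a fixed seed, then assert that every program either evaluates or fails with a clean, expected error.

```python
# Toy AST fuzzer: random arithmetic/conditional expressions, bounded depth,
# fixed seed. The invariant under test: success or clean error, never a crash.
import random

def gen_expr(rng, depth=0, max_depth=5):
    if depth >= max_depth or rng.random() < 0.3:
        return rng.randint(-10, 10)                  # atom
    op = rng.choice(["+", "-", "*", "/", "if"])
    if op == "if":
        return ("if", gen_expr(rng, depth + 1),
                      gen_expr(rng, depth + 1),
                      gen_expr(rng, depth + 1))
    return (op, gen_expr(rng, depth + 1), gen_expr(rng, depth + 1))

def eval_expr(e):
    if isinstance(e, int):
        return e
    if e[0] == "if":
        return eval_expr(e[2]) if eval_expr(e[1]) else eval_expr(e[3])
    a, b = eval_expr(e[1]), eval_expr(e[2])
    return {"+": a + b, "-": a - b, "*": a * b, "/": a // b}[e[0]]

rng = random.Random(42)
successes = clean_errors = 0
for _ in range(1000):
    try:
        eval_expr(gen_expr(rng))
        successes += 1
    except ZeroDivisionError:        # the only "clean" failure in this toy
        clean_errors += 1
assert successes + clean_errors == 1000  # nothing escaped the two buckets
```

The real fuzzer’s grammar covers bindings, lists, records, pattern matching, and effects, and its “clean error” bucket includes type errors and resource exhaustion, but the invariant is the same: every outcome falls into success or a well-typed error, never an uncaught crash.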

Why Language-Level?

Container sandboxes are coarse-grained. They can restrict system calls, but they can’t express “this function may read from /data/input.csv but not /etc/passwd” or “this code may use the network but only for GET requests to a specific domain.” Those policies live at the wrong abstraction layer.

With algebraic effects and capabilities, SafeLisp can express and enforce fine-grained policies that match the semantics of the code. The type checker catches violations before execution, and the runtime enforces them during execution. The two layers are complementary — not a replacement for containers, but a defense-in-depth addition inside them.

What I Learned

A few things that weren’t obvious going in:

  • Effect tracking is free. I expected measurable overhead from wrapping code in effect handlers. There isn’t any. The handler stack is a try/except frame that’s never entered for pure code.
  • Multiple backends are a testing strategy, not just a performance strategy. Each backend found bugs the others missed. The bytecode VM’s closure-capture bug would have been very hard to find with a single evaluator.
  • S-expressions are LLM-friendly. AI agents generate correct SafeLisp more reliably than they generate correct code in languages with complex syntax. The grammar is unambiguous and the parser is 343 lines.
  • The autoresearch loop works for PL research. Defining structured experiments with machine-readable success criteria, then letting the loop run them, produced better coverage than my ad-hoc manual testing would have.
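The “effect tracking is free” point has a plain-Python analogue worth sanity-checking yourself: on CPython, entering a try/except whose handler never fires costs frame setup only, not per-operation work, so wrapping pure code adds essentially nothing.

```python
# Sanity check of the zero-cost-unless-raised claim for try/except on
# CPython. Exact timings vary by machine; the ratio should sit near 1.0.
import timeit

def pure(n):
    s = 0
    for i in range(n):
        s += i
    return s

def handled(n):
    try:
        return pure(n)
    except RuntimeError:   # handler is never entered for pure code
        return -1

t_pure = timeit.timeit(lambda: pure(10_000), number=50)
t_handled = timeit.timeit(lambda: handled(10_000), number=50)
ratio = t_handled / t_pure   # expect roughly 1.0
```

SafeLisp’s handler stack exploits the same asymmetry: cost is only paid when perform! actually fires.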

Try It

SafeLisp is a research prototype — about 3,900 lines of implementation, 2,100 lines of tests, and 315 passing test cases. There’s room to grow: compilation to WebAssembly (which would inherit WASI’s capability model naturally), concurrency via algebraic effects, refinement types for richer static guarantees. But the core thesis — that language-level guarantees complement infrastructure-level isolation — seems well-supported by the results.

The full technical report, language specification, and tutorial are on GitHub.