How I Built a Computer Algebra System with a Ralph Loop

Feb 14, 2026

Ralph Loop: an AI agent iterating inside a Docker sandbox with Claude Code CLI, a nanny monitor, and mathematical expressions

Background: What’s a Ralph Loop?

The Ralph Wiggum technique was created by Geoffrey Huntley and went viral in late 2025. In its purest form, it’s a bash loop:

while :; do cat PROMPT.md | claude-code ; done

The idea: feed the same prompt to an AI coding agent over and over. Progress doesn’t live in the LLM’s context window — it lives in files and git history. Each iteration gets a fresh context, reads the project state, picks up where the last one left off, and does one more piece of work. The snarktank/ralph repo and various community implementations built this into a more structured pattern with PRDs, progress tracking, and stop conditions.

I took direct inspiration from CodeBun’s Medium article for the specific implementation: the prd.json format with user stories and passes: true/false flags, the prompt.md structure with a <promise>COMPLETE</promise> stop condition, the progress.txt file for cross-iteration learning, and the starting point for loop.sh. That article used Amp as the agent. We switched to Claude Code CLI.

What we added on top: a Docker sandbox, a host-side nanny monitor, preflight checks (API, C compiler, Rust toolchain), per-iteration logging, a stricter validation gate, and a safety-net commit for tracking artifacts. This post is about what happened when we put all of that together, pointed it at a real project, ran it twice, and cleaned up the results.

The Project

rust-cas: a symbolic computer algebra system in Rust. Parsing, simplification, differentiation, integration, polynomial factoring, linear system solving, root finding, number theory primitives (primality, factorization, GCD). 34 user stories across two loop runs. I wanted to see how far the loop could get unattended and what happened when I came back to clean up after it.

Rust CAS REPL session showing differentiation, root solving, integration, primality testing, and GCD

Our Setup

Before coding, I used ChatGPT to help write a product vision and roadmap document (docs/roadmap.md) to set context for the high-level direction of the project. I hoped this would help with defining better user stories, and it did — particularly when the agent later wrote its own stories based on the roadmap’s phase structure.

I ended up with four components:

  1. prd.json — 30 user stories (later 34) with acceptance criteria and a passes: true/false flag.
  2. loop.sh — Calls Claude Code CLI each iteration, passing prompt.md. Based on the pattern from the Medium article, modified for Claude and with added logging.
  3. Docker sandbox — A locked-down container with a read-only filesystem, dropped capabilities, and resource limits.
  4. nanny.sh — A host-side monitor that watches the container and restarts it if it gets stuck. Like the sandbox, this is new on top of the original pattern, and it runs with fewer permissions than the agent inside the container.

Each iteration follows the same flow:

  1. Claude reads prd.json and picks the highest-priority incomplete story.
  2. It implements that one story.
  3. It runs the validation gate: cargo fmt --check, cargo clippy, cargo test.
  4. If everything passes, it commits, marks the story done, and logs what it learned.
  5. Next iteration picks the next story.

The prompt forbids working on multiple stories per iteration. This is important because it forces atomic, testable progress and makes it straightforward to identify which iteration introduced a problem.

The Validation Gate

The validation gate is the most important part of the prompt structure. I adjusted it several times after various iterations where the agent committed code that compiled but had clippy warnings or failing tests. This was actually our second attempt; the first prompt was supposed to require passing checks, but the agent didn’t follow through in practice.

## Validation Gate (MUST pass before commit)
Run these commands IN ORDER. ALL must succeed.
Do NOT mark a story as `passes: true` or commit
unless every command exits 0.

    cargo fmt -- --check
    cargo clippy --all-targets --all-features -- -D warnings
    cargo test

Without this, mistakes compound easily.

Building the Sandbox

Running the agent directly on the host machine with enough permissions to operate unattended raises obvious security concerns.

I wanted isolation for safety (the agent runs with --dangerously-skip-permissions) and to make the environment reproducible. This turned out to be the most time-consuming part of the project.

Take 1: Nothing Works

Our first docker-run.sh accidentally had --network none and --dangerously-allow-all (which doesn’t exist). The container would start, Claude CLI would fail to reach the API, produce zero output, and the loop would burn through iterations doing nothing.

Three problems stacked on top of each other:

  1. Wrong CLI flag. --dangerously-allow-all isn’t a real flag. The correct flag is --dangerously-skip-permissions. The initial scripts were LLM-generated, and the model hallucinated the flag name.

  2. No network. --network none blocks all outbound traffic, including the Anthropic API. The agent needs API access. Removed.

  3. No API key. The ANTHROPIC_API_KEY wasn’t being passed to the container. Added .env file loading and --env-file to the Docker run command.

Take 2: Silent Failure

Fixed the above. Container starts, looks healthy… and every iteration produces a 0-byte log file. Claude exits 0 but writes nothing.

The container runs as user ralph (UID 10001), but the --tmpfs /home/ralph mount was owned by root. Claude Code CLI needs to write config and cache files to $HOME. When it can’t, it silently produces empty output and exits successfully. No error message. Exit code 0.

The fix:

--tmpfs /home/ralph:rw,nosuid,uid=10001,gid=10001,size=128m

I also had to remove noexec from the /tmp mount because Node.js (which Claude CLI runs on) needs to execute files from temp directories.

Take 3: Stdin vs. Arguments

Still empty output. The loop script was piping the prompt via stdin (cat prompt.md | claude -p), but Claude CLI’s -p flag doesn’t reliably read from a pipe in all environments. Switching to passing the prompt as an argument fixed it:

PROMPT="$(cat "$SCRIPT_DIR/prompt.md")"
OUTPUT=$(claude -p "$PROMPT" --dangerously-skip-permissions 2>&1)

Three bugs, all in the plumbing, none in the AI. This was a recurring theme: the agent was usually fine once it could actually run, but the infrastructure needed to be right first.

The Preflight Check

After getting burned by silent failures, I added a preflight check to the container entrypoint. Before launching the main loop, it sends a trivial query to the API:

PREFLIGHT=$(claude -p "Respond with exactly: PREFLIGHT_OK" \
    --dangerously-skip-permissions 2>&1) || true

if echo "$PREFLIGHT" | grep -q "PREFLIGHT_OK"; then
    echo "[run-ralph] Preflight PASSED"
else
    echo "[run-ralph] Preflight FAILED"
    exit 1
fi

This catches API key issues, network problems, and CLI bugs before wasting any iterations.

The Nanny: Watching the Watcher

Once the loop was running, I needed visibility. Is it making progress? Is it stuck? Did the container crash?

Version 1: LLM-Driven Decisions

The first nanny used Claude to decide what to do. It collected logs, container status, and git history, sent it all to Claude, and asked for an action: STALLED, RESTART, REBUILD, or HEALTHY.

This went poorly.

The nanny would call Claude, Claude would decide the container was “stalled” (because it hadn’t committed in 3 minutes — normal, since iterations take 2-5 minutes), and kill the container while the agent inside was actively writing code. Then it would rebuild the Docker image (changing nothing), restart, and the agent would start the same story over.

I watched this happen several times before realizing the nanny was destroying progress.

Version 2: Objective Metrics Only

The rewrite removed LLM decision-making from the nanny. Instead, it uses hard signals:

is_claude_active() {
    local check
    check=$(docker exec "$CONTAINER_NAME" sh -c \
        'for p in /proc/[0-9]*/cmdline; do
            tr "\0" " " < "$p" 2>/dev/null; echo
        done' 2>/dev/null || echo "")
    if echo "$check" | grep -q "claude\|node.*anthropic"; then
        return 0  # Claude is running
    fi
    return 1
}

The rules:

  • Never kill a container with an active Claude process.
  • Track git HEAD changes and log growth between checks.
  • Only restart after 5 consecutive idle checks (10 minutes with no activity).
  • Remove the REBUILD action entirely — if the Dockerfile needs changing, a human does it.
  • Claude is only used for generating human-readable status reports, not for action decisions.

LLMs are useful for summarizing what happened. They’re not as good at making operational decisions about transient system states they can’t directly observe.

What the Agent Built

Over approximately 27 iterations, the agent implemented:

Core CAS Engine:

  • Expression tokenizer with error reporting
  • Recursive descent parser with operator precedence
  • Expression evaluator with variable environments
  • Algebraic simplifier (constant folding, identity elimination, like-term collection)
  • Symbolic differentiation with chain rule, product rule, and function derivatives
  • Polynomial expansion with distributive law
  • Polynomial factorization (common factors, GCD-based)
  • Canonical form normalization
  • Symbolic integration with pattern matching and numerical fallback
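
To make the symbolic differentiation bullet concrete: at its core, differentiation over an expression tree is a recursive match that applies the textbook rules node by node. The sketch below is a minimal, self-contained illustration of that shape — it is not rust-cas’s actual AST or API, which also handle the chain rule, built-in function derivatives, and subsequent simplification.

// Minimal illustration of recursive symbolic differentiation over an
// expression tree. Not rust-cas's actual types; just the shape of the idea.
#[derive(Clone, Debug)]
enum Expr {
    Num(f64),
    Var(String),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

fn diff(e: &Expr, x: &str) -> Expr {
    match e {
        Expr::Num(_) => Expr::Num(0.0),
        Expr::Var(v) => Expr::Num(if v == x { 1.0 } else { 0.0 }),
        // Sum rule: (f + g)' = f' + g'
        Expr::Add(f, g) => Expr::Add(Box::new(diff(f, x)), Box::new(diff(g, x))),
        // Product rule: (f * g)' = f' * g + f * g'
        Expr::Mul(f, g) => Expr::Add(
            Box::new(Expr::Mul(Box::new(diff(f, x)), g.clone())),
            Box::new(Expr::Mul(f.clone(), Box::new(diff(g, x)))),
        ),
    }
}

This is also why the simplifier matters: raw rule application turns the derivative of x*x into 1*x + x*1, and identity elimination plus like-term collection are what reduce that to 2*x.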

Advanced Features:

The additions to the PRD came from Claude itself, via a final user story that asked it to review the roadmap and update the PRD. This is where a number of fairly advanced features came from, features I hadn’t even thought to ask it to build. They were presumably influenced by the roadmap written before any coding began, which laid out high-level phases covering things like partial derivatives, equation systems, and vector calculus, and which deliberately placed other items, such as a GUI, out of scope as non-goals.

Here’s the story the agent wrote for itself:

{
  "id": "CAS-010",
  "title": "Review roadmap against current state and generate new PRD tasks",
  "acceptanceCriteria": [
    "Current implementation state is compared against roadmap phases",
    "Missing or under-specified capabilities are identified",
    "New tasks are appended to prd.json with acceptance criteria",
    "Obsolete or completed tasks are marked appropriately",
    "typecheck passes"
  ],
  "priority": 4,
  "passes": false,
  "notes": "This task keeps the roadmap alive and adaptive."
}

  • Partial derivatives for multivariable expressions
  • Gradient computation for scalar fields
  • Jacobian matrix computation
  • 2x2 and 3x3 linear system solver using Cramer’s rule
  • Polynomial GCD via Euclidean algorithm
  • Polynomial long division
  • Newton’s method and bisection root finding
  • Simpson’s rule numerical integration with error control
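
The 2x2 case of the Cramer’s rule solver listed above reduces to one determinant check and two determinant ratios. Here is a generic numeric sketch of that case, under the assumption of plain f64 coefficients; it is an illustration, not the crate’s actual solver types.

/// Solve the 2x2 system
///   a*x + b*y = e
///   c*x + d*y = f
/// with Cramer's rule. Generic sketch with f64 coefficients; the real
/// rust-cas solver also covers the 3x3 case and reports singular systems
/// (SingularSystem) rather than returning None.
fn solve_2x2(a: f64, b: f64, c: f64, d: f64, e: f64, f: f64) -> Option<(f64, f64)> {
    let det = a * d - b * c;        // determinant of the coefficient matrix
    if det.abs() < 1e-12 {
        return None;                // singular: the equations are dependent or inconsistent
    }
    let x = (e * d - b * f) / det;  // numerator: RHS swapped into column 1
    let y = (a * f - e * c) / det;  // numerator: RHS swapped into column 2
    Some((x, y))
}

The determinant check is exactly the branch the singular test case described later exercises: two equations for the same line give det = 0, so there is no unique solution to return.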

Tooling and Documentation:

  • Interactive REPL with help system
  • Transformation tracing system (records every simplification step)
  • Performance benchmarks via Criterion
  • Property-based test harness for random expression validation
  • README with usage examples
  • ~600-line architecture document
  • ~600-line developer setup guide

Test Suite (after round one):

  • 341 library unit tests, 9 binary integration tests, 10 doc-tests
  • All passing at the time (though many were never actually run — see Round Two below)

Each story was a single commit following feat: CAS-XXX - Story title.

Hitting the Quota

After 27 of 30 stories, the $25 API credit balance ran out. The remaining stories were CAS-025 (rename the crate), CAS-028 (fix all failing tests and lint warnings), and CAS-029 (test coverage audit).

Most of the human time up to this point had been spent on infrastructure — Docker, nanny script, debugging silent failures. The actual Rust code was written by the agent.

The Last Mile: Fixing Things by Hand

With the loop out of budget, I tackled CAS-025 and CAS-028 directly in Claude Code’s interactive mode.

The Rename Fallout

The crate had been renamed from cas to rust-cas (story CAS-025), but the loop had left behind 23+ compilation errors: use cas:: imports scattered across source files, doc comments, examples, and benchmarks. The rename was marked as passing in the PRD — the agent had updated Cargo.toml and the code compiled at the time. But later stories introduced new code that imported the old name, and since CAS-025 was already marked done, nothing rechecked it.

This is a real failure mode of the one-story-at-a-time approach: a cross-cutting change like a crate rename touches many files, but regressions can creep in as later stories add code referencing the old name.

The fix was mechanical: find and replace cas:: with rust_cas:: across all .rs files, update function names that had been renamed (evaluate to eval, symbolic_integration::integrate to integrate_symbolic), and remove dead code that clippy flagged.

Incorrect Test Expectations

With compilation fixed, 12 tests were failing. They fell into two categories:

Structural vs. Numerical Comparison (10 tests): Tests in factor.rs and poly_gcd.rs compared AST structures directly. For example, asserting that factoring x^2 - 1 produces exactly (x - 1) * (x + 1) as an AST node. But the simplifier might produce (x + 1) * (x - 1) (different order) or (x + (-1)) * (x + 1) (different structure, same value). The fix: compare expressions numerically by evaluating at several test points.

fn assert_numerically_equal(a: &Expr, b: &Expr, vars: &[&str]) {
    let test_values: &[f64] = &[-2.0, -0.5, 0.0, 0.5, 1.0, 2.0, 3.0];
    for &val in test_values {
        let mut env = Environment::new();
        for &v in vars { env.insert(v.to_string(), val); }
        let va = eval(a, &env);
        let vb = eval(b, &env);
        match (va, vb) {
            (Ok(a_val), Ok(b_val)) => {
                assert!((a_val - b_val).abs() < 1e-9,
                    "Mismatch at {}={}: {} vs {}", vars[0], val, a_val, b_val);
            }
            (Err(_), Err(_)) => {}
            _ => panic!("One side errored and the other didn't"),
        }
    }
}

Numerical comparison as a test oracle is a useful pattern for symbolic math systems. Two expressions are equivalent if they agree at enough points. It’s not a proof, but it catches most real bugs.
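
The same idea can be shown without any of the project’s types: two rewrites of one expression should agree at every sample point, and a buggy rewrite disagrees at almost any of them. A self-contained toy version of the oracle, which I wrote for illustration:

// Toy version of the numerical-equivalence oracle above, using closures
// instead of ASTs: two rewrites of the same expression should agree at
// every sample point within floating-point tolerance.
fn agree_on_samples(f: impl Fn(f64) -> f64, g: impl Fn(f64) -> f64) -> bool {
    [-2.0, -0.5, 0.0, 0.5, 1.0, 2.0, 3.0]
        .iter()
        .all(|&x| (f(x) - g(x)).abs() < 1e-9)
}

fn main() {
    // Factored vs expanded form of the same polynomial: agrees everywhere.
    assert!(agree_on_samples(|x| (x + 1.0) * (x - 1.0), |x| x * x - 1.0));
    // A sign slip shows up immediately at any sample point.
    assert!(!agree_on_samples(|x| (x + 1.0) * (x - 1.0), |x| x * x + 1.0));
}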

Wrong Test Expectations (2 tests): Two linear_solver tests had incorrect expected values. One system (-x + y = 1 and x - y = -1) is actually singular — both equations describe the same line. The solver correctly returned SingularSystem, but the test expected a unique solution. The other test had a 3x3 system whose actual solution is (6/5, 8/5, 3), not (1, 2, 3) as the test claimed.

Verified with numpy:

import numpy as np
A = np.array([[2,1,-1],[1,3,2],[3,-1,1]])
b = np.array([1,12,5])
print(np.linalg.solve(A, b))  # [1.2, 1.6, 3.0]

The solver code was correct; the test cases had wrong arithmetic. This is a known risk with AI-generated tests — they provide coverage structure but the specific expected values still need review.

Clippy Cleanup

13 clippy warnings, all mechanical:

  • Redundant closures (.sort_by(|a, b| cmp(a, b)) instead of .sort_by(cmp))
  • Needless reference comparisons (&a != &b instead of a != b)
  • Manual range checks instead of .contains()
  • trace.len() > 0 instead of !trace.is_empty()
  • A function with 9 arguments (the 3x3 determinant function — suppressed with #[allow(clippy::too_many_arguments)])
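
For a flavor of how mechanical these were, here are before-and-after versions of two of them. These are illustrative snippets, not the project’s exact code.

// Illustrative before/after for two of the mechanical clippy fixes
// (manual_range_contains and len_zero); not the project's exact code.
fn in_unit_interval_before(x: f64) -> bool {
    x >= 0.0 && x <= 1.0              // clippy: manual range check
}

fn in_unit_interval_after(x: f64) -> bool {
    (0.0..=1.0).contains(&x)          // what clippy suggests instead
}

fn has_steps_before(trace: &[String]) -> bool {
    trace.len() > 0                   // clippy: len_zero
}

fn has_steps_after(trace: &[String]) -> bool {
    !trace.is_empty()
}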

After all fixes: 360 tests passing, 0 clippy warnings, formatting clean.

Round Two: Running the Loop Again

With CAS-028 fixed and the validation gate enforced, I added more stories to the PRD (CAS-029 through CAS-032) and launched the loop a second time. I also added CAS-033 (GCD computation) interactively and the loop picked it up on the next iteration. The second run completed all five stories:

  • CAS-029: Test coverage audit — added 109 new tests across integration, property, evaluator, simplify, and tokenizer modules
  • CAS-030: Root-finding robustness — multiple initial guesses and automatic bracket scanning for Newton’s method fallback
  • CAS-031: Number type predicates — is_integer, is_natural, is_prime with trial division
  • CAS-032: Integer prime factorization via trial division
  • CAS-033: Greatest common divisor for 2+ values using the Euclidean algorithm
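
CAS-031 through CAS-033 are small, well-specified algorithms, which is the kind of story the loop handles best. Roughly what they involve, as a generic sketch rather than the crate’s actual function signatures: trial division up to the square root for primality, and the Euclidean algorithm folded across the inputs for multi-value GCD.

/// Trial-division primality test: check divisors up to sqrt(n).
/// Generic sketch; the crate's actual names and types may differ.
fn is_prime(n: u64) -> bool {
    if n < 2 {
        return false;
    }
    let mut d = 2;
    while d * d <= n {
        if n % d == 0 {
            return false;
        }
        d += 1;
    }
    true
}

/// Euclidean algorithm for two values.
fn gcd(a: u64, b: u64) -> u64 {
    if b == 0 { a } else { gcd(b, a % b) }
}

/// GCD of two or more values, folded pairwise.
fn gcd_many(values: &[u64]) -> u64 {
    values.iter().copied().fold(0, gcd) // gcd(0, x) = x, so 0 is the identity
}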

But this second run also revealed problems the first run had papered over.

The Validation Gate Wasn’t Actually Running

The loop’s prompt had the validation gate documented, but the agent was committing code and marking stories done without actually running cargo clippy or cargo test. The story notes for CAS-029, CAS-030, CAS-031, and CAS-032 all contained the same tell: “cargo fmt passes. Note: cargo clippy/test blocked by missing C compiler.”

The Dockerfile installed build-essential, so a C compiler should have been available inside the container. But that line was a later fix, and the image had never been rebuilt after it was added. The agent noticed the missing compiler, wrote it down in the progress notes, and carried on committing anyway. The prompt said “ALL must succeed” — the agent interpreted “blocked by infrastructure” as sufficient justification to skip the gate.

This is a significant finding: a validation gate only works if the agent actually treats failure as a hard stop. Documenting the requirement wasn’t enough. The agent found a rationalization ("infrastructure issue, not my fault") and routed around the constraint.

Ninety Pre-existing Failures

When I finally ran cargo clippy and cargo test myself after the second loop run, the codebase had accumulated roughly 90 compiler errors, clippy warnings, and test failures:

  • 20 test compilation errors in tokenizer.rs referencing Token::Operator('+') — a variant that doesn’t exist. The Token enum uses Token::Plus, Token::Minus, etc. The agent had written tests against an API it imagined rather than the one that existed.

  • 4 integration test errors calling symbolic_integration::integrate() — the function was renamed to integrate_symbolic() in an earlier story, but the test coverage story wrote new tests using the old name.

  • ~50 clippy warnings for assert_eq!(expr, true) instead of assert!(expr), redundant closures, vec![] where an array literal would do, and /// doc comments on statements.

  • 15 unused doc-comment warnings in property tests — /// comments on let bindings instead of //.

  • 3 runtime test failures: a tokenizer that rejected underscore-prefixed identifiers (_x) despite tests expecting it to work, a root-finding bracket scan that missed exact zeros, and an integration test with wrong sign expectations for -x^2.
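
The bracket-scan failure is worth spelling out: a scan that only looks for a sign change with f(a) * f(b) < 0 silently walks past a sample point where f is exactly zero. Below is a generic sketch of a scan that treats that case as a root in its own right — my reconstruction of the issue, not the project’s actual code.

/// Scan [lo, hi] in `steps` slices looking for a bracket with a sign change,
/// treating an exact zero at a sample point as a (degenerate) bracket.
/// Generic sketch of the idea, not rust-cas's implementation.
fn scan_for_bracket(f: impl Fn(f64) -> f64, lo: f64, hi: f64, steps: usize)
    -> Option<(f64, f64)>
{
    let step = (hi - lo) / steps as f64;
    let mut a = lo;
    let mut fa = f(a);
    for i in 1..=steps {
        let b = lo + step * i as f64;
        let fb = f(b);
        // An exact zero at a sample point is already a root; a strict
        // fa * fb < 0.0 test would skip it.
        if fa == 0.0 {
            return Some((a, a));
        }
        if fb == 0.0 {
            return Some((b, b));
        }
        if fa * fb < 0.0 {
            return Some((a, b));
        }
        a = b;
        fa = fb;
    }
    None
}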

All of these were fixed in a single interactive Claude Code session. The irony: the agent that wrote the code couldn’t run the checks that would have caught these issues, but a different agent (or the same model in interactive mode) fixed them quickly once pointed at the actual error output.

Uncommitted Artifacts

After the second loop run completed all stories, the nanny detected 34/34 passing and shut everything down. But git diff showed uncommitted changes to progress.txt — the agent had updated the tracking file as its last act, then emitted <promise>COMPLETE</promise>, and the loop exited before committing.

The root cause was the prompt ordering. Steps 8-10 were:

8. Commit: `feat: [ID] - [Title]`
9. Update prd.json: `passes: true`
10. Append learnings to progress.txt

Steps 9 and 10 happen after the commit. If the iteration that marks the final story done is also the one that triggers the stop condition, the tracking file updates never get committed. I fixed this with two changes:

  1. Reordered the prompt so prd.json and progress.txt are updated before the commit, not after. All file changes happen first, then one commit captures everything.

  2. Added a safety-net commit to loop.sh that checks for dirty prd.json or progress.txt after each iteration and commits them if needed. Belt and suspenders.

This is the kind of bug you only find by running the loop to completion and checking the final state. It’s also a reminder that prompt ordering matters — the agent follows instructions sequentially, and if the sequence has the commit in the wrong place, the last iteration’s bookkeeping gets lost.

What I Learned

Most of the work is infrastructure

The agent wrote working Rust code from the first successful iteration. Getting the Docker container, environment variables, filesystem permissions, and process management right took more human time than any code issue. If you’re building a Ralph loop in a container, expect to spend most of your effort on plumbing.

The validation gate must be enforced, not just documented

Our first lesson was that without cargo fmt && cargo clippy && cargo test as a hard gate, the codebase accumulates problems. Our second lesson was harder: having the gate in the prompt isn’t enough. The second loop run proved the agent will skip the gate if the toolchain is broken and rationalize the skip in its progress notes. The gate needs to be mechanically enforced — either by the loop script itself or by a pre-commit hook — not left as an instruction the agent can choose to follow.

One story per iteration is the right granularity

Multiple stories per iteration leads to tangled commits and harder debugging. One story keeps things focused and makes failures easy to trace.

Don’t use LLMs for infrastructure control flow

The nanny v1 experience was clear on this. LLMs can summarize logs and generate status reports. They shouldn’t be deciding when to restart containers or kill processes.

AI-generated tests need the same review as AI-generated code

The agent wrote hundreds of tests, and most were correct. But the linear solver tests with wrong expected values, the tokenizer tests referencing a non-existent enum variant, and the integration test with wrong sign expectations all would have been caught by running the tests. Coverage doesn’t guarantee correctness — and tests that were never actually executed provide no coverage at all.

Cross-cutting changes are a weak spot

The one-story-at-a-time approach handles isolated features well. It handles cross-cutting changes poorly. A crate rename, a function signature change, a logging convention update — anything that touches many files can regress as later stories add new code. A periodic full-project validation (not just per-story) would help.

The agent writes against its mental model, not the actual API

The most striking class of bugs from round two was the tokenizer tests. The agent wrote Token::Operator('+') — a perfectly reasonable API for a tokenizer to have, but not the one that existed. The actual enum had Token::Plus, Token::Minus, etc. The agent never checked the type definition before writing tests against it. This happened repeatedly across 20+ test assertions.
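
Seen side by side, the mismatch is small, which is part of why it slipped through twenty-plus assertions. A simplified sketch: Plus and Minus are as described above, while the Number and Identifier variants here are placeholders I added for illustration, not the real enum definition.

// Simplified sketch of the mismatch; not the real Token definition.
#[derive(Debug, PartialEq)]
enum Token {
    Number(f64),
    Identifier(String),
    Plus,
    Minus,
}

// What the generated tests asserted (does not compile: no such variant):
//
//   assert_eq!(tokens[1], Token::Operator('+'));
//
// What the actual API requires:
//
//   assert_eq!(tokens[1], Token::Plus);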

This suggests a limitation of single-shot iteration: the agent builds a plausible mental model of the codebase from the prompt and recent context, then writes code against that model. When the model diverges from reality, the errors are systematic, not random. The fix is the same as for human developers: run the code.

Prompt ordering has mechanical consequences

The uncommitted progress.txt bug was caused by the prompt listing “commit” before “update tracking files.” The agent followed the steps in order. This isn’t a reasoning failure — it’s the prompt author’s failure to think through the sequencing. In a loop where each iteration is stateless, the only contract is the prompt and the files on disk. If the prompt says to commit before updating the tracking file, the tracking file won’t be in the commit.

Human-agent collaboration is the practical sweet spot

The first loop run (27 stories, fully autonomous) produced a working CAS but left behind compilation errors and untested code. The interactive sessions that followed (CAS-028 fixes, second loop run cleanup, infrastructure hardening) caught and fixed problems the loop couldn’t. Neither mode alone was sufficient: the loop excels at grinding through well-scoped stories, while interactive mode excels at cross-cutting fixes, debugging, and infrastructure work that requires back-and-forth judgment. The most productive workflow was using the loop for volume and interactive mode for quality.

Numbers

Metric                           Value
User stories (total)             34
Completed by loop run 1          27 (CAS-001 through CAS-027)
Completed by loop run 2          5 (CAS-029 through CAS-033)
Completed with human assist      2 (CAS-025 rename, CAS-028 test fixes)
Library tests                    467
Binary (REPL) tests              31
Integration tests                31
Property tests                   14
Doc-tests                        15
Total tests passing              558
Clippy warnings                  0
API cost (loop runs)             ~$30
Human time on infrastructure     ~4 hours
Human time on code fixes         ~2 hours (CAS-028, round 2 cleanup)

Running Your Own

The infrastructure is in the repo. To run a similar loop:

  1. Write a prd.json with user stories and acceptance criteria.
  2. Write a prompt.md with iteration instructions and a validation gate.
  3. Build the Docker image: ./scripts/docker-build.sh
  4. Set your API key in .env
  5. Launch: ./scripts/docker-run.sh
  6. Optionally, run the nanny: ./scripts/nanny.sh

The pieces that matter most:

  • Clear acceptance criteria. Vague stories produce vague implementations. Specific criteria give the agent something concrete to verify.
  • A validation gate the agent can’t skip. Document it in the prompt, but also consider enforcing it mechanically in loop.sh or via a pre-commit hook. If the toolchain is broken, the agent will skip the gate and rationalize it.
  • Prompt ordering that commits last. Update all tracking files (PRD, progress log) before the commit step, not after. Otherwise the final iteration’s bookkeeping gets left uncommitted. A safety-net commit in loop.sh catches any stragglers.
  • Preflight checks. Verify the API key, network, C compiler, and toolchain work before burning iterations. Silent failures waste time and credits.
  • Isolation. Docker sandboxing lets you run --dangerously-skip-permissions without worrying about what the agent might do to the host.
  • Budget awareness. Set usage limits or monitoring on your LLM API costs. 30 stories cost us ~$30; a runaway loop with failing iterations will burn credits on retries.

Prior Art and References