microgpt in Julia: A Port with Flux.jl

Feb 27, 2026

microgpt is Andrej Karpathy’s minimal GPT implementation—a few hundred lines of Python that trains a tiny language model on name generation. It’s a great teaching tool: no framework overhead, just the essentials. I wanted to see what it would look like in Julia, and how much faster Flux.jl’s compiled BLAS and GPU support could make it.

The result is microgpt_jl: a port with the same architecture, plus extensions for a more interesting training example and production-ready tooling.

Same Architecture, Different Language

The port preserves microgpt’s design choices:

  • RMSNorm instead of LayerNorm
  • ReLU activation (no GELU)
  • No biases in linear layers
  • Separate lm_head (output projection untied from the token embedding)
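RMSNorm is the simplest of these choices to show concretely. A minimal sketch in plain Julia (illustrative names, not the port's actual code): normalize by the root mean square and scale by a learned gain, with no mean subtraction and no bias, unlike LayerNorm.

```julia
using Statistics  # for mean

# RMSNorm sketch: divide by the root-mean-square of the input, then
# scale elementwise by a learned gain vector g.
function rmsnorm(x::AbstractVector, g::AbstractVector; eps = 1f-5)
    rms = sqrt(mean(abs2, x) + eps)
    return (x ./ rms) .* g
end
```

With a gain of all ones, the output has an RMS of approximately 1 regardless of the input's scale.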

The names model is identical: 4,192 parameters and the same character-level tokenizer. The main implementation difference is that the Julia model uses batched training with explicit causal masking. The original processes one token at a time, with masking implicit in its KV-cache accumulation; the port processes full sequences in parallel with an additive T×T lower-triangular mask, the formulation needed to support mini-batching. I used Claude Code to do most of the coding and iterated until the architecture matched and the loss curves aligned with the Python reference.
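The additive causal mask itself is a one-liner. A sketch (illustrative, not the port's exact code): entry (k, q) is 0 when query position q may attend to key position k, i.e. k ≤ q, and -Inf otherwise, so adding the mask to the attention scores before softmax zeroes out all future positions.

```julia
# Additive causal mask for a sequence of length T. Rows index keys,
# columns index queries; -Inf32 entries block attention to the future.
causal_mask(T) = Float32[k <= q ? 0f0 : -Inf32 for k in 1:T, q in 1:T]
```

Each query column q then has exactly q finite entries, matching what the one-token-at-a-time KV-cache formulation produces implicitly.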

Extensions for a Richer Example

The original microgpt trains on a simple names dataset. I extended the Julia version to support tiny-shakespeare as a more engaging demo:

  • Mini-batched training — proper batching instead of single-sequence updates
  • Cosine learning rate decay — smooth schedule from initial LR to near-zero
  • Checkpoint persistence — save and resume training across runs
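The cosine schedule is a standard formula; a sketch of the kind of function involved (parameter names here are illustrative, not the port's API):

```julia
# Cosine decay: interpolate from lr_max down to lr_min over total_steps,
# following half a cosine period. Step t runs from 0 to total_steps.
function cosine_lr(t, total_steps; lr_max = 1f-3, lr_min = 0f0)
    return lr_min + 0.5f0 * (lr_max - lr_min) * (1 + cos(pi * t / total_steps))
end
```

At t = 0 this returns lr_max, at the midpoint the average of the two bounds, and at total_steps it bottoms out at lr_min.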

These additions make the Shakespeare example more realistic: you can train an 836K-parameter model, interrupt it, and pick up where you left off.
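The save-and-resume cycle can be sketched with Julia's stdlib Serialization module (the port's actual checkpoint format may differ; these helper names are hypothetical):

```julia
using Serialization

# Hypothetical checkpoint round-trip: persist model parameters together
# with the training step so a later run can resume where it left off.
function save_checkpoint(path, params, step)
    serialize(path, (params = params, step = step))
end

function load_checkpoint(path)
    return deserialize(path)  # returns the (params, step) named tuple
end
```

On resume, the step count is what lets the cosine schedule pick up at the right point rather than restarting from the initial learning rate.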

Performance

Porting to an accelerated library yields the expected gains: Flux.jl uses compiled BLAS (OpenBLAS or MKL) for linear algebra and Zygote for reverse-mode automatic differentiation, versus the Python original's hand-rolled scalar autograd. On GPU, CUDA.jl and Metal.jl provide acceleration via cuBLAS and Metal Performance Shaders, respectively.
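The BLAS difference is visible in plain Julia: the built-in `*` on matrices dispatches to an optimized BLAS gemm kernel, while a hand-written triple loop is essentially what scalar autograd code does per element. A quick sketch (correctness check only; the speed gap is what the benchmark table below reflects):

```julia
# Naive triple-loop matrix multiply, analogous to scalar Python code.
function naive_matmul(A, B)
    m, k = size(A)
    k2, n = size(B)
    @assert k == k2
    C = zeros(eltype(A), m, n)
    for j in 1:n, l in 1:k, i in 1:m
        C[i, j] += A[i, l] * B[l, j]
    end
    return C
end

A = randn(64, 64)
B = randn(64, 64)
C_blas  = A * B              # dispatches to BLAS gemm
C_naive = naive_matmul(A, B) # same result, far slower at scale
```

Both paths compute the same product; only the kernel underneath differs.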

Task                  Python (microgpt)   Julia (microgpt_jl)   Speedup
Names (1,000 steps)   89.2s               1.3s                  70×

For Shakespeare on the 836K-param model:

  • CPU: ~15 minutes
  • CUDA (RTX 3060): ~2 minutes

GPU support is automatic: the code detects NVIDIA CUDA and Apple Metal and uses whichever is available. No configuration required—just add the CUDA.jl or Metal.jl packages when you want GPU acceleration; they’re optional extras in the project.
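The detection logic can be sketched as follows (a hedged illustration, not the port's actual mechanism, which uses package extensions; `CUDA.functional()` and `Metal.functional()` report whether a working GPU stack is present):

```julia
# Hypothetical device selection: prefer CUDA, then Metal, else CPU.
# Returns a function that moves an array to the chosen device;
# `identity` leaves arrays on the CPU.
function select_device()
    if isdefined(Main, :CUDA) && Main.CUDA.functional()
        return Main.CUDA.cu       # move arrays to an NVIDIA GPU
    elseif isdefined(Main, :Metal) && Main.Metal.functional()
        return Main.Metal.mtl     # move arrays to an Apple GPU
    end
    return identity               # CPU fallback
end
```

Because the GPU packages are checked with `isdefined`, the same code runs unchanged on a machine that never installed them.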

Code Quality

The Julia version isn’t as concise as the Python original—that’s the trade-off for adding acceleration, checkpoints, and tests. But it’s still a readable, small codebase.

  • 125 tests across tokenizer, model, training, checkpoints, and integration
  • CI runs the full suite on each push
  • Documentation in the README for running both the names and Shakespeare examples

Try It

The project is on GitHub: ranton256/microgpt_jl.

If you’re curious about minimal GPT implementations or want to see Flux.jl in action on a real training loop, it’s a good starting point. Clone, run the names example in under two seconds, or fire up Shakespeare and watch it learn to generate text—on CPU or GPU.