# microgpt in Julia: A Port with Flux.jl
microgpt is Andrej Karpathy’s minimal GPT implementation—a few hundred lines of Python that trains a tiny language model on name generation. It’s a great teaching tool: no framework overhead, just the essentials. I wanted to see what it would look like in Julia, and how much faster it could get with Julia’s compiled BLAS paths and Flux.jl’s GPU support.
The result is microgpt_jl: a port with the same architecture, plus extensions for a more interesting training example and production-ready tooling.
## Same Architecture, Different Language
The port preserves microgpt’s design choices:
- RMSNorm instead of LayerNorm
- ReLU activation (no GELU)
- No biases in linear layers
- Separate lm_head (untied output projection, not shared with the embedding)
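As a quick illustration, RMSNorm is simpler than LayerNorm: it rescales by the root mean square of the features without subtracting the mean. A minimal sketch in plain Julia (omitting any learned scale, which is an implementation detail):

```julia
# RMSNorm sketch: divide a feature vector by its root mean square.
# `eps` guards against division by zero.
rmsnorm(x; eps=1e-5) = x ./ sqrt(sum(abs2, x) / length(x) + eps)

rmsnorm([3.0, 4.0])  # output vector has root-mean-square ≈ 1
```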
The names model is identical: 4,192 parameters and the same character-level tokenizer. The main implementation difference is that the Julia model uses batched training and explicit causal masking. The original processes one token at a time, with masking implicit in the KV-cache accumulation; the port processes full sequences in parallel with an additive (T×T) lower-triangular mask, which is the formulation needed to support mini-batching. I used Claude Code to do most of the coding and iterated until the architecture matched and the loss curves aligned with the Python reference.
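The additive mask described above can be sketched in a few lines; the function name is illustrative, not the port's actual API. Entries on and below the diagonal are 0, entries above are -Inf, so after the mask is added to the attention scores, softmax assigns zero weight to future positions:

```julia
# Hypothetical sketch of an additive (T×T) lower-triangular causal mask.
# Row i permits attention only to positions 1..i.
causal_mask(T) = [j <= i ? 0.0f0 : -Inf32 for i in 1:T, j in 1:T]

causal_mask(3)
```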
## Extensions for a Richer Example
The original microgpt trains on a simple names dataset. I extended the Julia version to support tiny-shakespeare as a more engaging demo:
- Mini-batched training — proper batching instead of single-sequence updates
- Cosine learning rate decay — smooth schedule from initial LR to near-zero
- Checkpoint persistence — save and resume training across runs
These additions make the Shakespeare example more realistic: you can train an 836K-parameter model, interrupt it, and pick up where you left off.
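The cosine schedule mentioned above follows a standard formula: the learning rate traces half a cosine wave from the initial value down to near zero. A sketch, with illustrative names rather than the port's actual API:

```julia
# Cosine learning-rate decay from `lr0` to ~0 over `total` steps.
cosine_lr(step, total, lr0) = lr0 * 0.5 * (1 + cos(pi * step / total))

cosine_lr(0, 1000, 1e-3)     # full lr0 at the start
cosine_lr(1000, 1000, 1e-3)  # ~0 at the end
```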
## Performance
Porting to an accelerated library yields the expected gains: Flux.jl uses compiled BLAS (OpenBLAS or MKL) and Zygote-based automatic differentiation on CPU, versus the Python original’s hand-rolled scalar autograd. On GPU, CUDA.jl and Metal.jl provide acceleration via cuBLAS and Metal Performance Shaders respectively.
| Task | Python (microgpt) | Julia (microgpt_jl) | Speedup |
|---|---|---|---|
| Names (1,000 steps) | 89.2s | 1.3s | 70× |
For Shakespeare on the 836K-param model:
- CPU: ~15 minutes
- CUDA (RTX 3060): ~2 minutes
GPU support is automatic: the code detects NVIDIA CUDA and Apple Metal and uses whichever is available. No configuration required—just add the CUDA.jl or Metal.jl packages when you want GPU acceleration; they’re optional extras in the project.
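One way such detection might look, as a sketch: check whether an optional GPU package is present in the active environment and fall back to CPU otherwise. The function name and decision logic here are illustrative assumptions; real code would also verify a working device at runtime (e.g. via `CUDA.functional()`):

```julia
# Hypothetical sketch of optional-dependency detection. Base.find_package
# returns the package's path if it is installed, or `nothing` if not.
function select_device()
    if Base.find_package("CUDA") !== nothing
        return :cuda   # real code would also check CUDA.functional()
    elseif Base.find_package("Metal") !== nothing
        return :metal  # Apple-silicon path via Metal.jl
    else
        return :cpu    # default: plain CPU arrays
    end
end

select_device()
```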
## Code Quality
The Julia version isn’t as concise as the Python original—that’s the trade-off for adding acceleration, checkpointing, and tests. But it’s still a small, readable codebase.
- 125 tests across tokenizer, model, training, checkpoints, and integration
- CI runs the full suite on each push
- Documentation in the README for running both the names and Shakespeare examples
## Try It
The project is on GitHub: ranton256/microgpt_jl.
If you’re curious about minimal GPT implementations or want to see Flux.jl in action on a real training loop, it’s a good starting point. Clone, run the names example in under two seconds, or fire up Shakespeare and watch it learn to generate text—on CPU or GPU.