microgpt in Julia: A Port with Flux.jl

Feb 27, 2026

microgpt is Andrej Karpathy’s minimal GPT implementation—a few hundred lines of Python that trains a tiny language model on name generation. It’s a great teaching tool: no framework overhead, just the essentials. I wanted to see what it would look like in Julia, and how much faster Flux.jl’s compiled BLAS and GPU support could make it.

The result is microgpt_jl: a port with the same architecture, plus extensions for a more interesting training example and production-ready tooling.

Same Architecture, Different Language

The port preserves microgpt’s design choices:

  • RMSNorm instead of LayerNorm
  • ReLU activation (no GELU)
  • No biases in linear layers
  • Separate lm_head (output projection untied from the token embedding)
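RMSNorm is the simplest of these choices to show concretely. A minimal sketch in plain Julia (illustrative names, not the port's actual code): normalize by the root mean square and scale by a learned gain, with no mean subtraction and no bias, unlike LayerNorm.

```julia
using Statistics  # for mean

# RMSNorm sketch: divide by the root-mean-square of the input, then
# scale elementwise by a learned gain vector g.
function rmsnorm(x::AbstractVector, g::AbstractVector; eps = 1f-5)
    rms = sqrt(mean(abs2, x) + eps)
    return (x ./ rms) .* g
end
```

With a gain of all ones, the output has an RMS of approximately 1 regardless of the input's scale.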

The names model is identical: 4,192 parameters and the same character-level tokenizer. The main implementation difference is that the Julia model uses batched training with explicit causal masking. The original processes one token at a time, with masking implicit in its KV-cache accumulation; the port processes full sequences in parallel with an additive T×T lower-triangular mask, the formulation needed to support mini-batching. I used Claude Code to do most of the coding and iterated until the architecture matched and the loss curves aligned with the Python reference.
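The additive causal mask itself is a one-liner. A sketch (illustrative, not the port's exact code): entry (k, q) is 0 when query position q may attend to key position k, i.e. k ≤ q, and -Inf otherwise, so adding the mask to the attention scores before softmax zeroes out all future positions.

```julia
# Additive causal mask for a sequence of length T. Rows index keys,
# columns index queries; -Inf32 entries block attention to the future.
causal_mask(T) = Float32[k <= q ? 0f0 : -Inf32 for k in 1:T, q in 1:T]
```

Each query column q then has exactly q finite entries, matching what the one-token-at-a-time KV-cache formulation produces implicitly.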

Extensions for a Richer Example

The original microgpt trains on a simple names dataset. I extended the Julia version to support tiny-shakespeare as a more engaging demo:

  • Mini-batched training — proper batching instead of single-sequence updates
  • Cosine learning rate decay — smooth schedule from initial LR to near-zero
  • Checkpoint persistence — save and resume training across runs
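The cosine schedule is a standard formula; a sketch of the kind of function involved (parameter names here are illustrative, not the port's API):

```julia
# Cosine decay: interpolate from lr_max down to lr_min over total_steps,
# following half a cosine period. Step t runs from 0 to total_steps.
function cosine_lr(t, total_steps; lr_max = 1f-3, lr_min = 0f0)
    return lr_min + 0.5f0 * (lr_max - lr_min) * (1 + cos(pi * t / total_steps))
end
```

At t = 0 this returns lr_max, at the midpoint the average of the two bounds, and at total_steps it bottoms out at lr_min.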

These additions make the Shakespeare example more realistic: you can train an 836K-parameter model, interrupt it, and pick up where you left off.
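The save-and-resume cycle can be sketched with Julia's stdlib Serialization module (the port's actual checkpoint format may differ; these helper names are hypothetical):

```julia
using Serialization

# Hypothetical checkpoint round-trip: persist model parameters together
# with the training step so a later run can resume where it left off.
function save_checkpoint(path, params, step)
    serialize(path, (params = params, step = step))
end

function load_checkpoint(path)
    return deserialize(path)  # returns the (params, step) named tuple
end
```

On resume, the step count is what lets the cosine schedule pick up at the right point rather than restarting from the initial learning rate.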

Performance

Porting to an accelerated library yields the expected gains: Flux.jl uses compiled BLAS (OpenBLAS or MKL) for linear algebra and Zygote for reverse-mode automatic differentiation, versus the Python original's hand-rolled scalar autograd. On GPU, CUDA.jl and Metal.jl provide acceleration via cuBLAS and Metal Performance Shaders, respectively.
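The BLAS difference is visible in plain Julia: the built-in `*` on matrices dispatches to an optimized BLAS gemm kernel, while a hand-written triple loop is essentially what scalar autograd code does per element. A quick sketch (correctness check only; the speed gap is what the benchmark table below reflects):

```julia
# Naive triple-loop matrix multiply, analogous to scalar Python code.
function naive_matmul(A, B)
    m, k = size(A)
    k2, n = size(B)
    @assert k == k2
    C = zeros(eltype(A), m, n)
    for j in 1:n, l in 1:k, i in 1:m
        C[i, j] += A[i, l] * B[l, j]
    end
    return C
end

A = randn(64, 64)
B = randn(64, 64)
C_blas  = A * B              # dispatches to BLAS gemm
C_naive = naive_matmul(A, B) # same result, far slower at scale
```

Both paths compute the same product; only the kernel underneath differs.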

Task                  Python (microgpt)   Julia (microgpt_jl)   Speedup
Names (1,000 steps)   89.2s               1.3s                  70×

For Shakespeare on the 836K-param model:

  • CPU: ~15 minutes
  • CUDA (RTX 3060): ~2 minutes

GPU support is automatic: the code detects NVIDIA CUDA and Apple Metal and uses whichever is available. No configuration required—just add the CUDA.jl or Metal.jl packages when you want GPU acceleration; they’re optional extras in the project.
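The detection logic can be sketched as follows (a hedged illustration, not the port's actual mechanism, which uses package extensions; `CUDA.functional()` and `Metal.functional()` report whether a working GPU stack is present):

```julia
# Hypothetical device selection: prefer CUDA, then Metal, else CPU.
# Returns a function that moves an array to the chosen device;
# `identity` leaves arrays on the CPU.
function select_device()
    if isdefined(Main, :CUDA) && Main.CUDA.functional()
        return Main.CUDA.cu       # move arrays to an NVIDIA GPU
    elseif isdefined(Main, :Metal) && Main.Metal.functional()
        return Main.Metal.mtl     # move arrays to an Apple GPU
    end
    return identity               # CPU fallback
end
```

Because the GPU packages are checked with `isdefined`, the same code runs unchanged on a machine that never installed them.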

Code Quality

The Julia version isn’t as concise as the Python original—that’s the trade-off for adding acceleration, checkpoints, and tests. But it’s still a readable, small codebase.

  • 125 tests across tokenizer, model, training, checkpoints, and integration
  • CI runs the full suite on each push
  • Documentation in the README for running both the names and Shakespeare examples

Try It

The project is on GitHub: ranton256/microgpt_jl.

If you’re curious about minimal GPT implementations or want to see Flux.jl in action on a real training loop, it’s a good starting point. Clone, run the names example in under two seconds, or fire up Shakespeare and watch it learn to generate text—on CPU or GPU.