Micrograd.jl
Context
I'm GPU poor. Since I'm bound to the CPU, eking out every last bit of performance from my toy experiments is a must. I knew I wanted to replicate Karpathy's micrograd, and I settled on doing it in Julia: partly for the performance gains, and partly because working in a language other than Python forces me to develop a better understanding of the project's core data structures. On top of the benefits to my own learning, it gave me an opportunity to test a question that's been nagging at me for a while: how slow is Python, really?
Micrograd is a useful project for measuring host-language overhead. Each training step builds a scalar computation graph, walks it, and updates parameters one value at a time. When you do this from scratch, with no NumPy, Flux, or torch to hide behind, every add, multiply, and graph traversal pays the host language's per-operation cost, so speed is almost entirely a function of the language you choose.
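To make that concrete, here's a minimal sketch of the kind of scalar node micrograd builds. The field and function names are my own illustration, not necessarily those of either implementation in the benchmark:

```julia
# Illustrative micrograd-style scalar autograd node (names are my own,
# not necessarily those of either implementation in the benchmark).
mutable struct Value
    data::Float64            # the scalar this node holds
    grad::Float64            # d(output)/d(this node), filled in by backprop
    prev::Vector{Value}      # parents in the computation graph
    backward!::Function      # pushes this node's grad onto its parents
end

Value(x::Real) = Value(Float64(x), 0.0, Value[], () -> nothing)

function Base.:+(a::Value, b::Value)
    out = Value(a.data + b.data, 0.0, [a, b], () -> nothing)
    out.backward! = () -> (a.grad += out.grad; b.grad += out.grad)
    return out
end

function Base.:*(a::Value, b::Value)
    out = Value(a.data * b.data, 0.0, [a, b], () -> nothing)
    out.backward! = () -> (a.grad += b.data * out.grad; b.grad += a.data * out.grad)
    return out
end
```

Every `+` and `*` allocates a fresh node and a closure, which is exactly the per-operation overhead this benchmark ends up measuring.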
Methodology
Both Julia and Python ran the same 24-config grid: six two-moons dataset sizes (n=100, 200, 500, 1000, 5000, and 10000) crossed with four MLP architectures ([2,8,1], [2,16,16,1], [2,32,32,1], and [2,16,32,16,1]). To keep the comparison fair, I pre-generated the datasets and initialized weights in Python to avoid RNG drift, and both implementations used the same activations, losses, learning schedule, and full-batch gradient descent loop. The one tweak I made to Karpathy's original code was replacing micrograd's recursive topological sort with an iterative DFS (sketched below) after one trial hit Python's recursion depth limit. I ran the sweeps sequentially on my M5 MacBook Pro (48GB RAM). To reiterate: none of the Python or Julia code was optimized with NumPy, Flux, or torch.
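For reference, here's a sketch of what that iterative replacement looks like in Julia, assuming the illustrative `Value` type from earlier; the actual benchmark code may differ in details:

```julia
# Iterative post-order DFS topological sort, replacing micrograd's recursive
# version, which can overflow Python's default recursion limit on deep graphs.
# Assumes the illustrative `Value` type sketched earlier.
function topo_sort(root::Value)
    order = Value[]
    visited = Set{Value}()              # mutable structs hash by identity
    stack = [(root, false)]             # (node, children_already_pushed)
    while !isempty(stack)
        node, expanded = pop!(stack)
        if expanded
            push!(order, node)          # emitted only after all of its inputs
        elseif !(node in visited)
            push!(visited, node)
            push!(stack, (node, true))  # revisit after its inputs are emitted
            for child in node.prev
                push!(stack, (child, false))
            end
        end
    end
    return order                        # inputs first, root last
end

# Backprop then walks the order in reverse, root to leaves:
# root.grad = 1.0
# for v in reverse(topo_sort(root)); v.backward!(); end
```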
Benchmark
Julia finished the full 24-config grid in 3.59h, with five trials per architecture/dataset pair. I stopped Python after 27.52h, with only 16/24 configs complete. The next Python config, [2,32,32,1] at n=5000, took 5.08h for a single trial; with four trials left, that one config alone was on pace to take another day. On the 16 configs both languages completed, Python was 70.3x slower than Julia at the median.
Python takes much longer than Julia overall, with the biggest gap appearing in the forward pass as graph size grows. The table below gives Python's slowdown relative to Julia (Python time divided by Julia time) for each stage:
| Architecture | End-to-end | Forward pass | Backward pass |
|---|---|---|---|
| [2,8,1] | 49x | ~220x | ~9.4x |
| [2,16,16,1] | 77x | ~300x | ~6.3x |
| [2,32,32,1] | 102x | ~378x | ~6.9x |
Future Work
Next, I'm curious to see if I can extract any performance gains by using Julia instead of Python as the orchestration language for an RL experiment.