Hand-rolling a transformer in Julia

Repo

The natural follow-up to Micrograd.jl was a transformer, but I wanted to scale up the difficulty rather than lean on a pre-built tensor library. I was following Karpathy's guide and decided that leaning on torch, as the guide does, would make it almost as trivial as Micrograd was, at least for someone with a little experience with transformers. So I'm writing every tensor op by hand, without Flux. I can already tell you it sucks (and yes, I did have to get a sweet treat to feel better about myself).
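To give a sense of what "by hand" means: every op has to ship with its own backward rule, since there's no Flux or Zygote underneath. A toy version of the idea, not the repo's actual types or API, looks roughly like this:

```julia
# Toy illustration of a hand-rolled tensor op carrying its own backward rule.
# Not the repo's actual types; just the flavor of doing it without Flux.
struct Tensor
    data::Matrix{Float64}
    grad::Matrix{Float64}
    backward::Function   # pushes this node's grad onto its parents;
                         # a real version walks the graph in topological order, Micrograd-style
end

Tensor(data::Matrix{Float64}) = Tensor(data, zeros(size(data)), () -> nothing)

function matmul(a::Tensor, b::Tensor)
    data = a.data * b.data
    grad = zeros(size(data))
    back = () -> begin
        # chain rule for C = A*B: dL/dA = dL/dC * B', dL/dB = A' * dL/dC
        a.grad .+= grad * b.data'
        b.grad .+= a.data' * grad
    end
    return Tensor(data, grad, back)
end
```

Same chain-rule bookkeeping as Micrograd, just over matrices instead of scalars, and multiplied by however many ops a transformer needs.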

Status

The bigram model is done. I implemented a handwritten BigramLM with a single trainable logits table, character-level tokenization, mini-batch sampling, cross-entropy loss, AdamW training, and autoregressive generation. A 100k-step tiny Shakespeare run on my own tensor stack took about 23 seconds and produced the expected Shakespeare-flavored bigram noise.
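If you squint, the whole thing boils down to something like the sketch below. This uses plain Julia arrays, full passes over the text, and vanilla SGD; the actual run uses my own tensor types, mini-batch sampling, and AdamW, and none of these names are the repo's API.

```julia
# Bigram LM boiled down to plain Julia arrays. Illustrative only: the real run
# uses my hand-rolled tensor stack, mini-batches, and AdamW instead of full-batch SGD.
text  = "to be or not to be that is the question"
chars = sort(unique(collect(text)))              # character-level vocab
stoi  = Dict(c => i for (i, c) in enumerate(chars))
itos  = Dict(i => c for (i, c) in enumerate(chars))
ids   = [stoi[c] for c in text]
V     = length(chars)

logits = zeros(V, V)     # the one trainable table: row = current char, columns = next-char logits
lr     = 1.0

softmax(v) = (e = exp.(v .- maximum(v)); e ./ sum(e))

for step in 1:2000       # full-batch SGD on the cross-entropy loss
    grad = zeros(V, V)
    for t in 1:length(ids)-1
        x, y = ids[t], ids[t+1]
        p = softmax(logits[x, :])
        # gradient of softmax + cross-entropy is p minus the one-hot target
        grad[x, :] .+= p
        grad[x, y] -= 1.0
    end
    logits .-= lr .* grad ./ (length(ids) - 1)
end

# Autoregressive generation: sample the next char from the current char's row.
function sample(p)
    r, c = rand(), 0.0
    for i in eachindex(p)
        c += p[i]
        r <= c && return i
    end
    return lastindex(p)
end

function generate(start::Char, n::Int)
    out = [start]
    for _ in 1:n
        push!(out, itos[sample(softmax(logits[stoi[out[end]], :]))])
    end
    return String(out)
end

println(generate('t', 50))
```

The 100k steps in the actual run are sampled mini-batches of (current char, next char) pairs rather than full passes, but the loss and the sampling loop are the same shape.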

No benchmarks here yet. This page is mostly going to be a personal worklog; I'll add numbers or plots if there's something interesting to visualize.