Everyone should go through this rite-of-passage work and get to the "Attention Is All You Need" implementation. It's an area where engineering and the academic papers are very close and reproducible, and working through it is a must for progressing in the field.
(See also Andrej Karpathy's Neural Networks: Zero to Hero series on YouTube; it's very good and similar to this work.)
Is this YouTube series also "from scratch (but not really)"?
Edit - it is. Not to talk down the series. I'm sure it's good, but it is actually "LLM with PyTorch".
Edit - I looked again and I was actually not correct. He does ultimately use frameworks, but gives some early discussion of how they function under the hood.
I appreciate you coming back and giving more details; it encourages me to look into it now. Maybe my expectations of the internet are just low, but I thought it was a virtuous act worth the effort. I wish more people would continue with skepticism but be willing to follow through and let their opinions change given solid evidence.
I would also recommend going through Callum McDougall/Neel Nanda's fantastic Transformer from Scratch tutorial. It takes a different approach to conceptualizing the model (or at least, it implements it in a way that emphasizes different characteristics of Transformers and self-attention), which I found deeply satisfying when I first explored it.
https://arena-ch1-transformers.streamlit.app/%5B1.1%5D_Trans...
Thanks for sharing. This is a nice resource.
That magic moment in Karpathy's first video when he gets to the loss function and calls backward for the first time - this is when it clicked for me. Highly recommended!
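For anyone curious what that moment boils down to: the idea can be sketched with a tiny micrograd-style autograd engine (a hypothetical, stripped-down version in plain Python; the `Value` class and all names here are my own illustration, not Karpathy's actual code). Each value records its inputs and a local-gradient rule, and `backward()` walks the graph in reverse topological order applying the chain rule:

```python
class Value:
    """A scalar that tracks enough history to backpropagate through itself."""

    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None   # local chain-rule step, set by each op
        self._prev = set(children)

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            # d(a+b)/da = d(a+b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # d(a*b)/da = b, d(a*b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # topologically sort the graph, then apply the chain rule
        # from this output node back toward the leaves
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

# squared-error loss for a one-parameter model y = w * x
x, w, target = Value(2.0), Value(3.0), Value(10.0)
diff = w * x + (target * -1.0)   # y - target = 6 - 10 = -4
loss = diff * diff               # (y - target)^2
loss.backward()
print(loss.data)  # 16.0
print(w.grad)     # dloss/dw = 2*(y - target)*x = 2*(-4)*2 = -16.0
```

One call to `loss.backward()` fills in `w.grad`, and a gradient-descent step would then nudge `w` opposite that gradient - the same mechanics PyTorch performs, just on tensors instead of scalars.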
+1 for Karpathy, the series is really good