llm.c – LLM training in simple, pure C/CUDA

LLM training in simple, pure C/CUDA. There is no need for 245MB of PyTorch or 107MB of cPython.
Python has been popular for this because it’s convenient to quickly hack on and experiment with, not because it’s the most efficient thing.

The overhead really isn't that bad, is it? The Python code mostly just says "multiply matrix A by matrix B", and the actual computation is then done by optimized low-level code.

For that stuff, yeah you're correct.

What I've seen is issues with the implementation of those libraries in a project.

I don't remember exactly, but I was playing with someone's wrapper for some kind of machine learning snake game and it was taking way longer than it should have on back of the napkin math.

The issue was using either a dict or a list in a hot loop and changing it to the other sped it up like 1000x.

So it's easy to think "yeah this library is optimized" but then you build something on top of it that is not obviously going to slow it down.
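
To make the dict-vs-list point concrete, here is a minimal sketch. The original project couldn't be found, so the container contents and the names `cells_list`, `cells_set` and `count_hits` are made up for illustration. A membership test on a list is a linear scan, while on a set or dict it is a hash lookup, and that difference is paid on every iteration of a hot loop.

```python
import timeit

# Hypothetical stand-in for the per-round data: about 100 grid cells.
cells_list = [(x, y) for x in range(10) for y in range(10)]
cells_set = set(cells_list)

def count_hits(container, probes):
    """Count how many probes are present in the container."""
    return sum(1 for p in probes if p in container)

# Probes that mostly miss, which is the worst case for a list scan.
probes = [(x, y) for x in range(20) for y in range(20)]

if __name__ == "__main__":
    t_list = timeit.timeit(lambda: count_hits(cells_list, probes), number=50)
    t_set = timeit.timeit(lambda: count_hits(cells_set, probes), number=50)
    print(f"list: {t_list:.4f}s  set: {t_set:.4f}s")
```

Both containers give the same answer; only the cost per lookup differs. The exact ratio depends on the container size and the machine, so treat any specific speedup figure as anecdotal.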

But, that's the Python tradeoff.

The issue was using either a dict or a list in a hot loop and changing it to the other sped it up like 1000x.

The programmer using the wrong data structure is not a problem with the language.

It really is with Python. There are simply too many containers and container-like concepts. Lists, arrays, sets, dicts...

What modern language doesn’t have those?

Go kind of cheats and has maps play double duty as sets.

Kinda. I guess my native tongue is C/C++ and I wouldn't expect such a huge performance difference when using an array vs a linked list or something.

It's not like I had millions of items in that structure either, it was like 100. I think it contained the batch training data from each round. I tried to find the project but couldn't.

I was just shocked that there was such a huge difference between primitive data structures. In that situation, I wouldn't have guessed it would make a difference.

That sounds irrelevant to Python and just a matter of slow code cropping up in libraries until someone runs a profiler.

But then again, if your program has places where choosing the right Python primitive matters for performance, then using Python is already costing you there, since even the best algorithm in Python will be slower than the equivalent C.

Most of the time it doesn't matter because there's nothing hot on the Python side, but if there is, Python is going to slow your stuff down.

I suspect that this has a high chance of running afoul of Amdahl’s Law. Even if you can parallelise the bulk of the computation, the serial parts remain single-threaded and start to dominate the total runtime.
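
For reference, the law itself is a one-liner. This is a small sketch with illustrative numbers only (the 95% parallel fraction is an assumption, not a measurement of any real training run):

```python
def amdahl_speedup(parallel_fraction: float, workers: float) -> float:
    """Overall speedup when only `parallel_fraction` of the runtime scales with `workers`."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)

# Illustrative assumption: 95% of the work parallelises perfectly.
for n in (1, 8, 64, 1024):
    print(n, round(amdahl_speedup(0.95, n), 2))
```

With 5% of the runtime serial, the speedup can never reach 1 / 0.05 = 20x, no matter how many workers are added.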

I don’t think the serial parts of ML training are Python’s fault, are they? It’s all “operation B depends on the output of operation A”.

It depends on how you define overhead. Runtime overhead and memory usage are absolutely marginal, and the tightest, most perfect implementation will have trouble beating it.

Instead people are trying to optimize install size of dependencies, which while maybe a fun hacking project...who really cares?

107MB of cPython defeated

Go to try for self

Step 1 download 2.4GB of CUDA

The size of CUDA really is astonishing. Any chance someone might figure out how to slim that down?

Talking directly to the kernel / driver / firmware.

As others have said, George Hotz is doing his best in reverse-engineering and skipping layers.

Taking a peek inside the package it seems to mostly be the libraries - CuFFT alone is about 350MB for example, twice over for the debug and release versions. I'm guessing those are probably fat binaries pre-compiled for every generation of Nvidia hardware rather than just the PTX bytecode, which would help to speed up fresh builds, at the expense of being huge.

Nvidia is the only one who could, since they own it.

I mean, being fair, the 2.4GB CUDA SDK is absolutely required for the cPython implementation as well

I don't think it's about the byte size, but the inherent complexity of the implementation. 1000 lines of C code is extremely simple by any standard. Whereas a sundry collection of Python and PyTorch libraries is anything but.

A bunch of install methods for torch via pip include ~1.5GB of lib/ because of CUDA. is like 800MB on its own

direct CUDA implementation, which will be significantly faster and probably come close to PyTorch.

It almost hurts, to read that PyTorch is faster.

But then again, with these GPU-RAM-prices, let's see how it speeds up the CPU.

We really need SO-DIMM slots on the RTX series (or AMD/Intel equivalent) so that we can expand the RAM as we need it to. Is there a technical problem to it?

Memory speed is more or less directly proportional to how close the memory is to the processor, with the fastest memory being literally inside the processor (SRAM cache), followed by memory on the same package as the processor (HBM GPUs, Apple M-series), followed by soldered down discrete memory chips (regular GPUs, games consoles), followed by socketed DIMMs in distant last place. There's not really any getting around it, the bandwidth that GPUs crave just isn't compatible with modularity.

Even CPUs are starting to move their memory closer to the core in the name of performance, as mentioned Apple is already doing it, Intel is making Xeons with on-chip memory now, and they have a version aimed at consumers on their roadmap.

That's true, it has an impact, but I think there's still space for "slightly slower with 2x memory" models. For many local uses, new cards are way past the "fast enough" line, but having 64GB on them would be really beneficial.

I'd love to see some experiments / different SKUs in this area, given people are already DIY-ing extra memory on NVIDIA cards (there were stable experiments later on, but I don't have a link now).

Graphics card manufacturers believe that selling high-memory consumer graphics cards will affect the market for commercial computing cards, so they will not do so, that's all.

Nice room for a new player to disrupt then

Problem is, making a board design using an existing GPU chip and sticking more RAM into it is (relatively) simple but of course none of the GPU chip makers would allow partners to do that. Making your own GPU chip that’s competitive with Nvidia or AMD’s current offerings is a massive undertaking and pretty much impossible for a newcomer.

Just look at how much trouble Intel has had breaking into the discrete GPU market or even just how hard it’s been for AMD to compete with Nvidia even with decades of experience in the market.

And if some newcomer could make a competitive GPU with large memory capacity, they’d be crazy not to sell it at datacenter prices, maybe just undercutting the others by a few grand but still way more expensive than any consumer GPU you can buy today, even a 4090.

It's not doable at the board level. High-end GPUs use HBM in the same package, connected using a silicon interposer.

FYI, most discrete GPUs with discrete memory packages soldered to the board near the GPU are running at substantially higher memory frequencies than the on-package DRAM in Apple's chips. But running GDDR at those speeds costs a lot of power.

I watched a presentation on this today. The presenter focused on the soldering and proximity as well. Is this really the only difference or is this transistor based memory (like L1, L2, etc.)? I get the proximity factor of course (1ft / ns EE rule of thumb). In any case, soldering and proximity don't seem like breakthrough innovations (but maybe I am wrong).

GPU RAM is typically GDDR6 or GDDR6X, which is a different standard from the chips used in DDR5, for example. GPUs have terrible latency to RAM but enormous throughput, and I assume the chips are internally optimized for that. Many aspects of a design change when you choose different latency or clock speed targets, translating into different power / area calculations.

For data rates, as in bandwidth per IO pin, distance is really only a secondary factor. HBM memory, for example, runs at substantially lower data rates than GDDR, yet it sits right next to the GPU die compared to centimeters for the GDDR. And high-speed serial links run at speeds that are an order of magnitude higher than even the internal register files of a CPU.

We really need SO-DIMM slots on the RTX series (or AMD/Intel equivalent) so that we can expand the RAM as we need it to. Is there a technical problem to it?

I imagine it would incur a non trivial latency and cost penalty. The memory modules are placed pretty close to the compute die right now. Cooling would also have to change (the memory modules produce a lot of heat).

But there is also no reason for any of the GPU manufacturers to do this. A SKU with twice as much memory can go for a lot more than the difference in memory cost alone.

And especially doing "interesting" combinations of gpu and memory.

Like lower end gpu with 16 GB of VRAM, but offering just 8 / 12 GB of VRAM in the middle class and then again 16 GB in the upper class of gpu selection.

I don't disagree but (I know nothing about this btw...) would it not benefit in terms of, say, an L3 cache kind of thing?

Imagine you could stick 2 x 64GB DDR5 DIMMS on the GPU in sockets, would that not be faster to access than the motherboard DIMMS? It won't be as fast as on-die memory of course but could it not act like a sort of halfway house?

Check out PCB back drilling. It's a process where you remove a few hundred microns from the vias that are used to connect GDDR RAMs to the GPUs, to avoid reflections due to the impedance mismatch that's caused by the stub.

When you have a pulse coded signal traveling at close to 10GHz, everything becomes an antenna. The technical problem is that you can't do this with a flimsy connector like the ones used for DIMMs. The reason GDDR can have a bandwidth per pin that is 4 times higher than regular DDR is because they are soldered down on the PCB.

NVIDIA hates that trick.


Kind of amazing that something that can be expressed in ~1000 lines of code has completely turned the world on its head.

Which important concept or algorithm can't be expressed in ≤1000 lines? Seems like a pretty common theme among groundbreaking ideas.

Most modern A/V codecs won't fit in that limit by several orders of magnitude.

Even a standard-compliant JPEG decoder would be hard to squeeze in without some serious code golfing. Discarding some barely used features gets you close to that limit, though [1].

Smallest popular TCP/IP stack [2] is ~20kLoC.



A JPEG decoder or TCP stack are very clearly not individual concepts though. There's obviously some subjectivity as to what constitutes a single "concept" or "algorithm", but I'm not sure either of those two examples are in a gray area.

That's a good question. Unfortunately I think you're asking to compute the Kolmogorov complexity of every interesting concept we have that doesn't yet have an implementation less than n=1000 lines, which is equivalent to the halting problem (modulo unbounded memory).

If you could exhaustively list all the interesting algorithms (hard but feasible), you could potentially prove an upper bound on each one's complexity by writing a shorter-than-n implementation (hard, probably infeasible) and show positively that GP's prop isn't true. On the other hand, showing that it was true would require either some very clever proof which can't apply to all programs but somehow only to these interesting ones (very likely impossible), or enumerating all C^n programs, where C is the number of possible lines (something like 64^80), and showing that none of them implements at least one of the interesting algorithms (absurdly impossible).

You are right but I think that there's a more interesting question: do humans stumble upon those large interesting/great algorithms in practice?

The key point here is that we are looking at algorithms already discovered in human history rather than enumerating all possible interesting algorithms. Of course there are interesting algorithms that are very large, but humans don't discover them in practice. If you look up a list of the greatest algorithms in history, they will be rather small. Many of them can be sketched on a whiteboard.

I think that what is happening here is that our minds just can't hold billions of concepts at once. So if you have an algorithm with billions of things, it was most likely produced by a machine. Handcrafted things, on the other hand, are smaller in comparison.

Another thing is that our minds like conceptual simplicity and view simplicity as a kind of beauty. So if we have a great algorithm but it is too large, we look for ways to express it succinctly (the right abstractions can help with that, and also help with understanding the algorithm better). We end up succeeding because the algorithms themselves had low Kolmogorov complexity (and thus, if they are too large, they can probably be further compressed).

Err… and the exabytes of training data

Ah, not really

“Here’s one I trained earlier”?

Speed of hardware did. Back in the 80s they already knew the principles of LLM training. It only took one week to train 10,000 tokens.

Got a reference on that claim?

Echoes of DeCSS ;)

OT, but a question from someone curious: is CUDA still entrenched as the only option for doing AI, or is there growing support for AMD/Intel/other ways of doing AI?

He loudly gave up on AMD after they did not fix a blocker he had for 5+ months and gave him the runaround the entire time when he asked for the code to fix it himself. He is still shipping the AMD tinybox with huge warning labels.

Didn't they recently announce that everything was open sourced? Would be cool if he took another look at it once all of the source code is available (if not already).

Randomly stumbled over this post [1] by another fed-up open source contributor, prompted by several serious issues with AMD's GPU drivers and firmware that have remained unresolved for years. It also references the geohot decision you mention.

Some quotes:

I find it incredible that these companies that have large support contracts with you and have invested hundreds of thousands of dollars into your products, have been forced to turn to me, a mostly unknown self-employed hacker with very limited resources to try to work around these bugs (design faults?) in your hardware.

In the VFIO space we no longer recommend AMD GPUs at all, in every instance where people ask for which GPU to use for their new build, the advise is to use NVidia.


I'm not sure if he's "attempting to solve it" so much as he's looking for yet another way to keep himself famous.

The guy did one good jailbreak for the iPhone, and as near as I can tell, the rest of his work has been a lot of boasting, half-assed hyped-up implementations (e.g: his self-driving car), and trying to befriend other powerful people in tech (see: his promise to single-handedly fix Musk's Twitter). He might be a smart dude, but he vastly overrates his own accomplishments, and doesn't finish near anything he starts.

Modular Mojo is the most well funded and full of respectable players for making an alternative possible

Check out Hidet [1]. Not as well funded, but delivers Python based ML acceleration with GPU support (unlike Mojo).


there are obv alternatives from both intel and amd, performant blas/dnn packages, but small teams don’t use them bc cuda is easier to use and has more support, and larger teams don’t use them bc they have deals w/ nvidia or not enough GPUs are available or they’re after the absolute best performance (which is still nvidia) or bc of other stuff like unstable drivers or smth

There are a few attempts here and there in various stages of progression. But right now, nothing matches Nvidia+CUDA in speed and usability.

You can run inference today on pretty much any card.

Download Ollama on a modern MacBook and you can run 13B and even larger models (if your RAM allows) at fast speeds. People run smaller models locally on their phones.

Google has trained their latest models on their own TPUs... not using Nvidia to my knowledge.

So, no, there are alternatives. CUDA has the largest mindshare on the training side though.

There are some stirrings but don't hold your breath

Very sad, should've used an agnostic framework instead of CUDA.

Are there any strong LLMs trained without CUDA?

Yes, there are several. See this blog post from Databricks describing the landscape of LLMs trained on AMD hardware for example:

The most interesting one IMO is OLMo from AI2, which is truly open. You can read their blog post about it, but basically it is open everything: they released everything you need to reproduce their weights (training data, training code, evaluation code, and weights) with a friendly (Apache) license.

Gemini and Gemma were trained on Google's TPUs.

As far as I can tell, its optional dependency is OpenMP, not CUDA. Doesn't seem directly dependent on CUDA.

Yes, a quick skim of the code only shows an OpenMP dependency. The C/CUDA reference might have been meant to be C/OMP.

Although I wonder if it would work well with GCC PTX OMP offloading.

The plan is to eventually implement with CUDA:

"Currently, I am working on [...] direct CUDA implementation, which will be significantly faster and probably come close to PyTorch."

It's only ~1000 LoC, seems like a pretty good case study to port over to other runtimes and show they can stand up to CUDA.

Looking forward to your patches!

I've seen his nanoGPT implemented using JAX, now we have C/CUDA. I'd love to see if nanoGPT could be doable in Mojo. I took a stab at a Mojo conversion of his WaveNet project (Andrej's Zero to Hero course) and I gotta say... Python has so many nice features lol. Stating the obvious, I know, but what you see done in 6 lines of Python takes so much more work in other languages.

How in Mojo do you support GPU data parallelism and all the benefits it brings?

You don't. Mojo doesn't support GPUs at the moment, which says a lot about a language which claims to be AI first.

If you want CUDA up front go write PyTorch. No one is stopping you. Modular’s goal was to leverage MLIR first and bring GPUs in later. They’re barely a year old company.

They only made Mojo available outside the preview circle about a couple of months ago, and it is yet to run on Windows laptops of researchers.

I love the attitude of considering 0.x languages production ready for all imaginable kinds of workloads.

For a prior generation of karpathy-splaining there is this Nim port, which may be of interest if you are interested in Mojo.

Thank you!

Candle is a minimalist ML framework for Rust with a focus on performance (including GPU support) and ease of use

Candle focuses on inference though.

What is inferencing?

Inference means using the neural net, as opposed to training it.

During inference you feed an input into the NN and it passes through it in "forwards" direction (i.e. from input to output), being modified according to the "weights" that were learnt during training, to derive the output.

During training, each training sample is first fed forwards through the NN, the same way as for inference, but then the output of the model (which at the beginning of training will be random/wrong) is compared to the correct/desired output for that training sample, and a corresponding error value will then be fed backwards (from output to input) through the NN according to the "backpropagation" mechanism to update the weights.

Training is a lot more involved than inference since it involves this backpropagation step.
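
A toy sketch of that difference, using a single linear neuron instead of a real network. Everything here (the names `forward` and `train_step`, the learning rate, the data) is made up for illustration and is not taken from llm.c or Candle:

```python
def forward(w, b, x):
    """Inference: the input flows forwards through the 'network' (one linear neuron)."""
    return w * x + b

def train_step(w, b, x, target, lr=0.1):
    """Training: forward pass, compare with the target, then push the error backwards."""
    y = forward(w, b, x)
    error = y - target        # derivative of 0.5 * (y - target)**2 with respect to y
    grad_w = error * x        # chain rule: dL/dw = dL/dy * dy/dw
    grad_b = error            # dL/db = dL/dy * 1
    return w - lr * grad_w, b - lr * grad_b

# Toy data for y = 2x + 1 (made up for this sketch).
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
w, b = 0.0, 0.0               # start from "wrong" weights
for _ in range(500):
    for x, target in data:
        w, b = train_step(w, b, x, target)

print(w, b, forward(w, b, 3.0))
```

Inference only ever calls `forward`. Training calls it too, but additionally needs the gradient of every operation, which is the part that gets hand-written in a project like this one.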

Candle dev here, we also support training/backprop! We certainly focus on optimizing inference performance, but hopefully that should improve the training efficiency too.

Not nearly as minimal as Karpathy's implementation.

I wouldn't call it "minimalist" after seeing Karpathy's code.

If I was starting from scratch, what resources should I start with to build up an understanding of what this code does and how to read it? It's quite dense and my knowledge of LLMs is quite minimal. Are these terse variable names standard in LLM-land?

Terse variables are a C thing.

“What resources would I need” -> you’re literally commenting on a teacher’s content. Karpathy (the author) has a very informative YouTube channel where he goes step by step through everything. He has a ton of repos and tutorials. Dig a little.

If all else fails… Google it.

you’re literally commenting on a teacher’s content.

How am I supposed to know that?

Karpathy (the author) has a very informative YouTube channel where he goes step by step through everything.

Or that, without knowing that he's a teacher?

Terse variables are a C thing.

I didn't realize variables had to be so short in C. Glad I write C++ professionally where they've added support for longer variable names.

If all else fails… Google it.

There's a lot of LLM garbage out there. I got an answer here in a few minutes pointing to Karpathy's course which seems very high quality.

Be kinder.

How am I supposed to know that?

You’re not supposed to know that. You asked a question, and this is you being told the answer.

It’s very convenient that the author of the post is quite literally the world’s most prolific teacher on this topic. Makes it easy to find Karpathy. You shouldn’t be expected to otherwise know that (or else why ask if you knew).

I didn't realize variables had to be so short in C. Glad I write C++ professionally where they've added support for longer variable names.

This feels like a joke but old C compilers did have variable length limits. This is part of why C historically had shorter variables than other more modern languages.

Sorry if it came off rude, the internet is hard to communicate over.

Terse variables are a C thing.

They're a math / toy code thing. Large C projects have long descriptive names just like other languages.

Check out his Zero to Hero series, which builds this with Python and later PyTorch, then probably his other mini C-based projects.

As siblings have said, his video series are quite good. But if you're just looking at this repo only, you probably want to look at the python reference implementation. (The C is designed to exactly replicate its functionality.)

Is this able to replace PyTorch, ... in normal practice? No.

This is a bit of an apples-and-oranges comparison. PyTorch is a research framework, not a transformer inference library.

This post is about training, not inference. And llama.cpp has similarly simple LoRA training code. There is nothing in neural networks themselves complex enough to justify the amount of complexity the Python ML community has piled up. MLX, for instance, is a similarly general-purpose research framework that is a fraction of the size.

Sure, neural networks in and of themselves are conceptually simple, and not difficult to code. Andrew Ng's original Coursera class is all you need to go from zero knowledge to building MATLAB-based neural nets in this same hard-coded style.

However, there is a huge difference in functionality (hence complexity) in a framework such as PyTorch vs hardcoding a single NN. It's a bit like the difference between writing a toy compiler in CompSci class vs a production one that supports optimization, multiple targets, etc, etc.

The first step in convenience beyond hardcoding models, was frameworks like the original Torch, and original TensorFlow. Those frameworks let you explicitly assemble a neural net out of modular "lego blocks" (tensor operations), then just call model.forward() or model.backward() - no need to yourself write the forwards and backwards functions.

What PyTorch (successor to Torch) did was increase the complexity of the framework but bring massive ease of use to the developer, by getting rid of the explicit lego-block assembly process. Instead, the developer just writes arbitrary Python code corresponding to what they want the model to do, and PyTorch itself builds the model internally and is therefore able to infer the backward function. This extra functionality and ease of use, paid for with corresponding internal complexity, is what differentiated PyTorch from TensorFlow, made it so successful, and caused most developers to switch to it.
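
To give a feel for "write ordinary code, get the backward function for free", here is a scalar toy in the define-by-run style. It is not PyTorch's implementation; the `Value` class and all of its members are hypothetical names invented for this sketch, and it only supports `+` and `*`.

```python
class Value:
    """A scalar that remembers how it was computed, so gradients can be derived automatically."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward_fn = None  # set by the operation that produced this value

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def backward_fn():
            # d(a + b)/da = 1 and d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward_fn = backward_fn
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def backward_fn():
            # d(a * b)/da = b and d(a * b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward_fn = backward_fn
        return out

    def backward(self):
        """Walk the recorded graph in reverse topological order, applying the chain rule."""
        order, seen = [], set()
        def visit(v):
            if id(v) not in seen:
                seen.add(id(v))
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            if v._backward_fn is not None:
                v._backward_fn()

# The "model" is plain Python: no explicit graph assembly, no hand-written backward.
x, w, b = Value(3.0), Value(2.0), Value(1.0)
y = w * x + b     # the forward pass records the graph as a side effect
y.backward()      # fills in w.grad, x.grad and b.grad
print(y.data, w.grad, x.grad, b.grad)
```

A real framework does the same bookkeeping for tensors, on multiple devices, with hundreds of operations, which is where much of the extra code goes.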

There is also a lot of other functionality in PyTorch that adds to the complexity - supporting multiple back ends, custom CUDA/etc kernels beyond what is provided by cuDNN, etc, etc.

I know all these things. Again: look at MLX.

Does this show that in general the most used ML frameworks are a mess? Yes.

Not really ... there is little to no overlap with what a framework like PyTorch does. There is no tensor class, no autograd, etc. Just malloc, a bunch of hand calculated pointers into that chunk of memory, and hand written gradient functions. I assume the intent here is to be educational by stripping away the layers of abstraction to make it clearer what is going on.
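
As a rough illustration of the "one malloc plus hand-calculated pointers" style, here is the same bookkeeping sketched in Python. The tensor names and shapes are hypothetical and are not llm.c's actual layout; `numel` and `at` are helper names made up for this example.

```python
from array import array

# Hypothetical parameter shapes for a toy model (not llm.c's real layout).
shapes = {"wte": (8, 4), "ln_w": (4,), "fc_w": (4, 16)}

def numel(shape):
    """Number of elements in a tensor of the given shape."""
    n = 1
    for d in shape:
        n *= d
    return n

# One allocation for everything, like a single malloc of sum(sizes) floats.
total = sum(numel(s) for s in shapes.values())
params = array("f", [0.0] * total)

# The "pointers" are just running offsets into that one buffer.
offsets, cursor = {}, 0
for name, shape in shapes.items():
    offsets[name] = cursor
    cursor += numel(shape)

def at(name, *index):
    """Row-major flat position of element `index` inside tensor `name`."""
    flat = 0
    for i, dim in zip(index, shapes[name]):
        flat = flat * dim + i
    return offsets[name] + flat

params[at("fc_w", 2, 5)] = 1.5   # same arithmetic as fc_w[2 * 16 + 5] in C
```

A tensor class exists largely to hide exactly this index arithmetic, along with shape checking, which is why it is absent here and bulky in a framework.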

Frankly though, this code (all that pointer math!) is a mess too, maybe written this way to make it easy to port to cuDNN which is at a similarly low level (other than having tensor descriptors which make the memory layout more flexible).

If you want to write your own tensor class and reusable NN framework, then the lines of code go up very rapidly. I did one in C++ a while back, and the tensor class alone was 20K LOC.

It would be great if someone created a tutorial around this explaining exactly how it works and how to do a test training run. I’m aware it’s not feasible to train a “real” model on personal hardware but it would be nice to have a practical learning experience. I’m not sure if there are good alternatives for that.

Thank you so much for responding. I will definitely check these out and also pass it on to others who might be interested.

Thank you so much for the Zero To Hero playlist!

The author has a whole series where he does exactly that. YouTube videos, code examples, documentation, everything. Explains the math, explains how to code it, explains the architecture. Everything.

It should be rewritten in Rust. (Just joking)

I think you just cracked AI safety.

Sshh! I asked why it was written in C and got flagged.

I just pasted the code into Claude and am reading the converted Rust now; it definitely needs extra work.

I'd like to think he took the name from my llm.f90 project

It was originally based off of Karpathy's llama2.c but I renamed it when I added support for other architectures.

Probably a coincidence :)

I'll send you an email

In f90? That’s pretty cool.

On a related note, IMO it would be pretty cool if we could get an LLM implementation that provides an RCI interface like all the old computational codes used to.

I love his videos. They are dense, but I get a lot out of them.

+100 thank you karpathy!

This is an implementation of a transformer and in README it's presented as text->text. Tokens are just integers going in and out.

Is it possible to use it to train other types of LLMs(text->image, image->text, speech->text, etc.)?

The transformer itself just takes arrays of numbers and turns them into arrays of numbers. What you are interested in is the process that happens before and after the transformer.

Yes, anything can be an input token.

Patch of pixels ---> token
Fragment of input audio ---> token
etc.
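
A minimal sketch of the "patch of pixels" case, assuming a tiny grayscale image stored as a list of rows. The function name `image_to_patches` and the 4x4 image are made up for illustration; in a real vision transformer each flattened patch would additionally be projected to an embedding vector before entering the model.

```python
def image_to_patches(image, patch):
    """Split an H x W grid (list of rows) into flattened patch x patch blocks, row by row."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "sketch assumes evenly divisible sizes"
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            block = [image[top + dy][left + dx]
                     for dy in range(patch) for dx in range(patch)]
            patches.append(block)
    return patches

# A made-up 4x4 grayscale image; each 2x2 patch becomes one input "token" vector.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = image_to_patches(image, 2)
print(tokens)
```

From the transformer's point of view the result is just a sequence of vectors, the same as a sequence of embedded text tokens.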

Question, apologies if slightly off-topic, as it's something I'd like to use this project for: is there an example of how to train GPT-2 on time series, in particular with covariates?

As my understanding of LLM goes at a basic level it's predicting the next token from previous tokens, which sounds directionally similar to time series (perhaps letting aside periodicity).
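
One way to picture that analogy, as a sketch only: bucket the real values into a small integer vocabulary and then build (context, next token) pairs exactly as for text. The uniform binning, the bin range and the names `quantize` and `next_token_pairs` are assumptions made for this example, and covariates are not handled at all.

```python
def quantize(series, lo, hi, n_bins):
    """Map each real value to an integer token id in [0, n_bins - 1] by uniform binning."""
    width = (hi - lo) / n_bins
    return [min(n_bins - 1, max(0, int((v - lo) / width))) for v in series]

def next_token_pairs(tokens, context_len):
    """Build (context, next token) training pairs, as in language modelling."""
    return [(tokens[i:i + context_len], tokens[i + context_len])
            for i in range(len(tokens) - context_len)]

# Made-up series; the bin range and bin count are arbitrary choices for this sketch.
series = [0.1, 0.4, 0.35, 0.8, 0.95, 0.6]
tokens = quantize(series, lo=0.0, hi=1.0, n_bins=4)
pairs = next_token_pairs(tokens, context_len=3)
print(tokens, pairs)
```

Whether a model trained this way beats a simpler forecasting baseline is a separate question.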

Yes, there are many attempts at applying transformers to time series forecasting. For instance (but there are many more):
- TimeGPT
- Chronos

These kinds of papers often promise the world, but often lack a proper baseline model. They only compare against very simple (naive forecast) or untuned models. In my experience a gradient boosting model will probably solve 95% of your forecasting problems, and trying to get fancy with a transformer (or even just a simple neural net) is more trouble than it is worth.

Fantastic -- gotta love Andrej. I am sick of the ball and chain that is Python and all of its environment dependencies. It is nice to shed all the weight and get down to the metal.

Yeah, as long as you don’t want to change the network architecture.

Edit: and you trust that Andrej didn’t screw up anywhere while hand-rolling all the gradient calculations.

Uses Vulkan, thus is portable across GPUs and much faster. It also has more kernels. In simple C++.

Or rather GLSL... The C++ code looks like it's mostly just scaffolding to kick off the actually important GPU work, and for that it's a surprising amount of code. Quite typical both for Vulkan and C++ though ;)

Quick question, is this just pure C code that can be loaded into an Nvidia gpu and run (via the python code)? I scanned the C and didn't see anything CUDA related (maybe I missed something, I'm not a GPU programmer!). K mentions something about a direct CUDA implementation coming soon, how would that be different than what this is?

It’s not, if you look at his X account, he talks about his work adding on the CUDA parts

On one hand, really nice to see the whole thing in 1000 lines of C code.

On the other hand, that malloc function low key terrifies me. :)

Better to be explicit than hiding unsafe memory accesses under C++ stdlib classes like std::vector which don't do range checking either in operator[]. And in this sort of code, automatically injected runtime range checks would most likely hurt performance enough to matter.

I would still run the code through the Clang static analyzer and a couple of test runs in ASAN and UBSAN to be sure that nothing slipped through.

Very nice.

In my experience much of the complexity of numerical software is to enable the search for the algorithm that works well with the problem/data you have. Once you know the exact algorithm you want, it is possible to make a nice clean minimalistic implementation, but that does not mean such an implementation would have been easy at the beginning.

When Lex recently talked to Andrej, Andrej said that he gets positively obsessed with a problem and says "this must exist". I imagine this must be one of those outputs.

Another awesome project! Note that as of this moment the CUDA part is aspirational. There is no gpu code in the repo yet.

Karpathy's code, teaching and contribution to the body of knowledge in this area really is admirable.

Sadly I am a generalist, but if I were a specialist, I would hope to contribute as openly and widely as Karpathy.

Not clout chasing, click-bait, "top 5 javascript frameworks of 2023!" ... just high quality output that marks a specialist.

Sorry to gush.

very cool, also the coding style looks good.

See, C does it very well. Great stuff. Karpathy has a gift for teaching.

Wow, and this is done after a recent trip to Bhutan to clear his head! I follow karpathy on twitter and he posted that 2 weeks without constantly looking and checking his phone kind of turns off the constantly on radio in his head.

