
Llm.c – LLM training in simple, pure C/CUDA

tosh
22 replies
1d

LLM training in simple, pure C/CUDA. There is no need for 245MB of PyTorch or 107MB of cPython

api
11 replies
1d

Python has been popular for this because it’s convenient to quickly hack on and experiment with, not because it’s the most efficient thing.

im3w1l
10 replies
1d

The overhead really isn't that bad, is it? The Python code is mostly just saying "multiply matrix A with matrix B", and the actual computation is done by optimized low-level code.
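
Whichever language drives it, the heavy lifting ends up in a low-level kernel along these lines (an illustrative naive sketch in C, not code from the repo; optimized BLAS/cuBLAS kernels do the same arithmetic with blocking, vectorization and parallelism):

  #include <stddef.h>

  // Naive matmul: C = A * B, with A (M x K), B (K x N), C (M x N), row-major.
  // The Python wrapper's only job is deciding which optimized version of
  // this loop nest to call; the arithmetic itself never runs in Python.
  void matmul(const float *A, const float *B, float *C,
              size_t M, size_t K, size_t N) {
      for (size_t i = 0; i < M; i++) {
          for (size_t j = 0; j < N; j++) {
              float acc = 0.0f;
              for (size_t k = 0; k < K; k++) {
                  acc += A[i * K + k] * B[k * N + j];
              }
              C[i * N + j] = acc;
          }
      }
  }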

bongodongobob
6 replies
23h49m

For that stuff, yeah you're correct.

What I've seen is issues with the implementation of those libraries in a project.

I don't remember exactly, but I was playing with someone's wrapper for some kind of machine learning snake game and it was taking way longer than it should have on back of the napkin math.

The issue was using either a dict or a list in a hot loop and changing it to the other sped it up like 1000x.

So it's easy to think "yeah, this library is optimized" and then build something on top of it that slows it down in a way that isn't obvious.

But, that's the Python tradeoff.

hcarvalhoalves
3 replies
23h25m

The issue was using either a dict or a list in a hot loop and changing it to the other sped it up like 1000x.

The programmer using the wrong data structure is not a problem with the language.

CamperBob2
1 replies
23h0m

It really is with Python. There are simply too many containers and container-like concepts. Lists, arrays, sets, dicts...

0cf8612b2e1e
0 replies
22h38m

What modern language doesn’t have those?

Go kind of cheats and has maps play double duty as sets.

bongodongobob
0 replies
21h46m

Kinda. I guess my native tongue is C/C++ and I wouldn't expect such a huge performance difference when using an array vs a linked list or something.

It's not like I had millions of items in that structure either, it was like 100. I think it contained the batch training data from each round. I tried to find the project but couldn't.

I was just shocked that there was such a huge difference between primitive data structures. In that situation, I wouldn't have guessed it would make a difference.
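
For what it's worth, one mechanism that can plausibly produce a factor like that even with small collections is membership testing: a Python list lookup is a linear scan while a dict lookup is a hash probe, so a hot loop can silently go from O(n) to O(n^2) overall. A rough C analogy, purely illustrative and not the actual project:

  #include <stdbool.h>
  #include <stddef.h>

  // Linear scan, roughly what `key in some_list` does in Python: O(n) per query.
  bool contains_linear(const int *items, size_t n, int key) {
      for (size_t i = 0; i < n; i++) {
          if (items[i] == key) return true;
      }
      return false;
  }

  // Direct-addressed table, loosely analogous to `key in some_dict`: O(1) per query.
  // (Toy version: assumes keys are in [0, table_size); a real hash table is more involved.)
  bool contains_table(const bool *present, int key) {
      return present[key];
  }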

0cf8612b2e1e
1 replies
23h30m

That sounds irrelevant to Python and just a matter of slow code cropping up in libraries until someone runs a profiler.

littlestymaar
0 replies
22h26m

But then again, if your program has places where choosing the right Python primitive matters for performance, then Python is already costing you performance there, since even the best algorithm in Python would be slower than the equivalent C.

Most of the time it doesn't matter because there's nothing hot on the Python side, but if there is, then Python is going to slow your stuff down.

jiggawatts
1 replies
23h52m

I suspect that this has a high chance of running afoul of Amdahl's Law. Even if you can parallelise the bulk of the computation, the serial parts remain single-threaded and start to dominate the total runtime.
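
For reference, Amdahl's Law bounds the overall speedup when only a fraction p of the runtime can be parallelized and that part is sped up by a factor s (the numbers below are purely illustrative):

  S = \frac{1}{(1 - p) + p / s}

For example, with p = 0.95 and s = 100, S = 1 / (0.05 + 0.0095) ≈ 16.8x, so the remaining serial 5% dominates.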

taneq
0 replies
21h39m

I don’t think the serial parts of ML training are Python’s fault, are they? It’s all “operation B depends on the output of operation A”.

llm_nerd
0 replies
22h42m

It depends on how you define overhead. Runtime overhead and memory usage are absolutely marginal, and the tightest, most perfect implementation will have trouble beating it.

Instead, people are trying to optimize the install size of the dependencies, which, while maybe a fun hacking project... who really cares?

QuadmasterXLII
9 replies
1d

107MB of cPython defeated

Go to try for self

Step 1 download 2.4GB of CUDA

simonw
5 replies
23h59m

The size of CUDA really is astonishing. Any chance someone might figure out how to slim that down?

xiphias2
0 replies
23h48m

Talking directly to the kernel / driver / firmware.

As others have said, George Hotz is doing his best in reverse-engineering and skipping layers.

jsheard
0 replies
23h45m

Taking a peek inside the package it seems to mostly be the libraries - CuFFT alone is about 350MB for example, twice over for the debug and release versions. I'm guessing those are probably fat binaries pre-compiled for every generation of Nvidia hardware rather than just the PTX bytecode, which would help to speed up fresh builds, at the expense of being huge.

dartos
0 replies
23h57m

Nvidia is the only one who could, since they own it.

gcr
0 replies
23h31m

I mean, to be fair, the 2.4GB CUDA SDK is absolutely required for the cPython implementation as well.

fsloth
0 replies
23h27m

I don't think it's about the byte size, but the inherent complexity of the implementation. 1000 lines of C code is extremely simple by any standard. Whereas a sundry collection of Python and PyTorch libraries is anything but.

dwroberts
0 replies
23h27m

A bunch of install methods for torch via pip include ~1.5GB of lib/ because of CUDA. libtorch_cuda.so is like 800MB on its own

qwertox
16 replies
23h33m

direct CUDA implementation, which will be significantly faster and probably come close to PyTorch.

It almost hurts to read that PyTorch is faster.

But then again, with these GPU RAM prices, let's see how well it runs on the CPU.

We really need SO-DIMM slots on the RTX series (or AMD/Intel equivalent) so that we can expand the RAM as we need to. Is there a technical problem with it?

jsheard
9 replies
23h24m

Memory speed is more or less directly proportional to how close the memory is to the processor, with the fastest memory being literally inside the processor (SRAM cache), followed by memory on the same package as the processor (HBM GPUs, Apple M-series), followed by soldered down discrete memory chips (regular GPUs, games consoles), followed by socketed DIMMs in distant last place. There's not really any getting around it, the bandwidth that GPUs crave just isn't compatible with modularity.

Even CPUs are starting to move their memory closer to the core in the name of performance, as mentioned Apple is already doing it, Intel is making Xeons with on-chip memory now, and they have a version aimed at consumers on their roadmap.

viraptor
4 replies
20h55m

True, it has an impact, but I think there's still room for "slightly slower with 2x memory" models. For many local uses, new cards are way past the "fast enough" line, but having 64 GB on them would be really beneficial.

I'd love to see some experiments / different SKUs in this area, given people are already DIY-ing extra memory onto Nvidia cards. (https://hackaday.com/2021/01/29/add-an-extra-8gb-of-vram-to-... there were stable experiments later on, but I don't have a link now)

schneehertz
3 replies
20h28m

Graphics card manufacturers believe that selling high-memory consumer graphics cards would cannibalize the market for commercial compute cards, so they won't do it. That's all.

airspresso
2 replies
14h56m

Nice room for a new player to disrupt then

einsteinx2
1 replies
9h15m

Problem is, making a board design using an existing GPU chip and sticking more RAM into it is (relatively) simple but of course none of the GPU chip makers would allow partners to do that. Making your own GPU chip that’s competitive with Nvidia or AMD’s current offerings is a massive undertaking and pretty much impossible for a newcomer.

Just look at how much trouble Intel has had breaking into the discrete GPU market or even just how hard it’s been for AMD to compete with Nvidia even with decades of experience in the market.

And if some newcomer could make a competitive GPU with large memory capacity, they'd be crazy not to sell it at datacenter prices, maybe undercutting the others by a few grand but still way more expensive than any consumer GPU you can buy today, even a 4090.

tlb
0 replies
4h40m

It's not doable at the board level. High-end GPUs use HBM in the same package, connected using a silicon interposer.

wtallis
2 replies
22h47m

FYI, most discrete GPUs with discrete memory packages soldered to the board near the GPU are running at substantially higher memory frequencies than the on-package DRAM in Apple's chips. But running GDDR at those speeds costs a lot of power.

osigurdson
1 replies
19h53m

I watched a presentation on this today. The presenter focused on the soldering and proximity as well. Is this really the only difference, or is this transistor-based memory (like L1, L2, etc.)? I get the proximity factor of course (the 1 ft/ns EE rule of thumb). In any case, soldering and proximity don't seem like breakthrough innovations (but maybe I am wrong).

MobiusHorizons
0 replies
19h18m

GPU RAM is typically GDDR6 or GDDR6X, which is a different standard from the chips used for DDR5, for example. GPUs have terrible latency to RAM but enormous throughput, and I assume the chips are internally optimized for that. Many aspects of a design change when you choose different latency or clock-speed targets, translating into different power/area calculations.

tverbeure
0 replies
22h32m

For data rates, as in bandwidth per IO pin, distance is really only a secondary factor. HBM memory, for example, runs at substantially lower data rates than GDDR, yet it sits right next to the GPU die compared to centimeters for the GDDR. And high-speed serial links run at speeds that are an order of magnitude higher than even the internal register files of a CPU.

LatticeAnimal
2 replies
23h28m

We really need SO-DIMM slots on the RTX series (or AMD/Intel equivalent) so that we can expand the RAM as we need to. Is there a technical problem with it?

I imagine it would incur a non-trivial latency and cost penalty. The memory modules are placed pretty close to the compute die right now. Cooling would also have to change (the memory modules produce a lot of heat).

But there is also no reason for any of the GPU manufacturers to do this. A SKU with twice as much memory can go for a lot more than the difference in memory cost alone.

SunlitCat
0 replies
22h44m

And especially doing "interesting" combinations of GPU and memory.

Like a lower-end GPU with 16 GB of VRAM, but offering just 8/12 GB of VRAM in the mid-range, and then 16 GB again at the upper end of the GPU lineup.

ItsBob
0 replies
12h15m

I don't disagree, but (I know nothing about this, btw) would it not help as, say, an L3-cache kind of thing?

Imagine you could stick 2 x 64GB DDR5 DIMMs on the GPU in sockets; would that not be faster to access than the motherboard DIMMs? It won't be as fast as on-die memory of course, but could it not act as a sort of halfway house?

tverbeure
0 replies
23h9m

Check out PCB back drilling. It's a process where you remove a few hundred microns from the vias that are used to connect GDDR RAMs to the GPUs, to avoid reflections due to the impedance mismatch that's caused by the stub.

When you have a pulse coded signal traveling at close to 10GHz, everything becomes an antenna. The technical problem is that you can't do this with a flimsy connector like the ones used for DIMMs. The reason GDDR can have a bandwidth per pin that is 4 times higher than regular DDR is because they are soldered down on the PCB.

theGeatZhopa
0 replies
15h30m

NVIDIA hates that trick.

hahnchen
0 replies
18h17m

It almost hurts to read that PyTorch is faster.

Why?

osigurdson
13 replies
18h52m

Kind of amazing that something that can be expressed in ~1000 lines of code has completely turned the world on its head.

KeplerBoy
6 replies
15h22m

Which important concept or algorithm can't be expressed in ≤1000 lines? Seems like a pretty common theme among groundbreaking ideas.

magnat
3 replies
9h58m

Most modern A/V codecs exceed that limit by several orders of magnitude.

Even standard-compliant JPEG decoder would be hard to squeeze without some serious codegolfing. Discarding some barely used features gets you close to that limit, though [1].

Smallest popular TCP/IP stack [2] is ~20kLoC.

[1] https://github.com/richgel999/picojpeg

[2] https://savannah.nongnu.org/projects/lwip/

epr
0 replies
8h52m

A JPEG decoder or TCP stack are very clearly not individual concepts though. There's obviously some subjectivity as to what constitutes a single "concept" or "algorithm", but I'm not sure either of those two examples are in a gray area.

A single concept might be implementing just ARP or a discrete cosine transform. If you wanted to do a full TCP stack or JPEG decoder, that would make a lot more sense after building their internal components one by one.
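
As a concrete illustration of a single concept fitting comfortably under that limit, a textbook DCT-II is only a dozen or so lines (naive O(N^2) version, purely illustrative):

  #include <math.h>
  #include <stddef.h>

  // Naive DCT-II: X[k] = sum_{n=0}^{N-1} x[n] * cos(pi/N * (n + 0.5) * k)
  void dct2(const float *x, float *X, size_t N) {
      const double PI = 3.14159265358979323846;
      for (size_t k = 0; k < N; k++) {
          double acc = 0.0;
          for (size_t n = 0; n < N; n++) {
              acc += (double)x[n] * cos(PI / (double)N * ((double)n + 0.5) * (double)k);
          }
          X[k] = (float)acc;
      }
  }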

Y_Y
1 replies
14h3m

That's a good question. Unfortunately I think you're asking to compute the Kolmogorov complexity of every interesting concept we have that doesn't yet have an implementation less than n=1000 lines, which is equivalent to the halting problem (modulo unbounded memory).

If you could exhaustively list all the interesting algorithms (hard but feasible), you could potentially prove a lower bound for each one's complexity by writing a shorter-than-n implementation (hard, probably infeasible) and show positively that GP's proposition isn't true. On the other hand, showing that it was true would require either some very clever proof which can't apply to all programs, but somehow only these interesting ones (very likely impossible), or enumerating all C^n programs, where C is the number of possible lines (something like 64^80), and showing that none of them implements at least one of the interesting algorithms (absurdly impossible).

nextaccountic
0 replies
9h9m

You are right but I think that there's a more interesting question: do humans stumble upon those large interesting/great algorithms in practice?

The key point here is that we are looking at algorithms already discovered in human history rather than enumerating all possible interesting algorithms. Of course some interesting algorithm out there is very large, but humans don't discover those in practice. If you look up a list of the greatest algorithms in history, they will be rather small in length. Many of them can be sketched on a whiteboard.

I think what is happening here is that our minds just can't hold billions of concepts at once. So if you have an algorithm with billions of parts, it was most likely produced by a machine. Handcrafted things, on the other hand, are smaller in comparison.

Another thing is that our minds like conceptual simplicity and view simplicity as a kind of beauty. So if we have a great algorithm but it is too large, we look for ways to express it succinctly (the right abstractions can help with that, and also help with understanding the algorithm better). We end up succeeding because the algorithms themselves had low Kolmogorov complexity (and thus, if they are too large, they can probably be further compressed).

datascienced
2 replies
13h32m

Err… and the exabytes of training data

rnewme
1 replies
12h54m

Ah, not really

datascienced
0 replies
11h20m

“Here’s one I trained earlier”?

holoduke
1 replies
11h24m

Speed of hardware did. Back in the '80s they already knew the principles of LLM training. It just took a week to train on 10,000 tokens.

toxik
0 replies
3h56m

Got a reference on that claim?

daniel_reetz
0 replies
15h45m

Echoes of DeCSS ;)

andrewstuart
13 replies
23h57m

OT, but a question from someone curious: is CUDA still entrenched as the only option for doing AI, or is there growing support for AMD/Intel/other ways of doing it?

ZoomerCretin
3 replies
23h14m

He (George Hotz) loudly gave up on AMD after they did not fix a blocker of his for 5+ months and gave him the runaround the entire time when he asked for the code so he could fix it himself. He is still shipping the AMD tinybox with huge warning labels.

Art9681
1 replies
19h39m

Didn't they recently announce that everything was open sourced? Would be cool if he took another look at it once all of the source code is available (if not already).

magicalhippo
0 replies
17h59m

Randomly stumbled over this [1] post from another fed-up open source contributor, about several serious issues with AMD's GPU drivers and firmware that have remained unresolved for years. It also references the geohot decision you mention.

Some quotes:

I find it incredible that these companies that have large support contracts with you and have invested hundreds of thousands of dollars into your products, have been forced to turn to me, a mostly unknown self-employed hacker with very limited resources to try to work around these bugs (design faults?) in your hardware.

In the VFIO space we no longer recommend AMD GPUs at all, in every instance where people ask for which GPU to use for their new build, the advise is to use NVidia.

[1]: https://www.reddit.com/r/Amd/comments/1bsjm5a/letter_to_amd_...

fwip
0 replies
4h56m

I'm not sure if he's "attempting to solve it" so much as he's looking for yet another way to keep himself famous.

The guy did one good jailbreak for the iPhone, and as near as I can tell, the rest of his work has been a lot of boasting, half-assed hyped-up implementations (e.g. his self-driving car), and trying to befriend other powerful people in tech (see: his promise to single-handedly fix Musk's Twitter). He might be a smart dude, but he vastly overrates his own accomplishments and doesn't finish nearly anything he starts.

towelpluswater
1 replies
20h23m

Modular's Mojo is the best-funded effort, with respectable people behind it, at making an alternative possible.

pavelstoev
0 replies
18h42m

Check out Hidet [1]. Not as well funded, but delivers Python based ML acceleration with GPU support (unlike Mojo).

[1] https://github.com/hidet-org/hidet

taminka
0 replies
22h41m

There are obviously alternatives from both Intel and AMD, performant BLAS/DNN packages, but small teams don't use them because CUDA is easier to use and has more support, and larger teams don't use them because they have deals with Nvidia, or not enough GPUs are available, or they're after the absolute best performance (which is still Nvidia), or because of other stuff like unstable drivers.

sigmoid10
0 replies
23h37m

There are a few attempts here and there in various stages of progression. But right now, nothing matches Nvidia+CUDA in speed and usability.

adam_arthur
0 replies
23h7m

You can run inference today on pretty much any card.

Download Ollama on a modern MacBook and you can run 13B and even larger models (if your RAM allows) at fast speeds. People run smaller models locally on their phones.

Google has trained their latest models on their own TPUs... not using Nvidia to my knowledge.

So, no, there are alternatives. CUDA has the largest mindshare on the training side though.

WithinReason
0 replies
23h51m

There are some stirrings but don't hold your breath

brcmthrowaway
8 replies
1d

Very sad, should've used an agnostic framework instead of CUDA

robrenaud
2 replies
23h58m

Are there any strong LLMs trained without CUDA?

blackeyeblitzar
0 replies
23h49m

Yes, there are several. See this blog post from Databricks describing the landscape of LLMs trained on AMD hardware for example: https://www.databricks.com/blog/training-llms-scale-amd-mi25...

The most interesting one IMO is OLMo from AI2, which is truly open. You can read their blog post about it (https://blog.allenai.org/hello-olmo-a-truly-open-llm-43f7e73...) but basically it is open everything - they released everything you need to reproduce their weights (training data, training code, evaluation code, and weights) with a friendly (Apache) license.

ZoomerCretin
0 replies
23h12m

Gemini and Gemma were trained on Google's TPUs.

geph2021
2 replies
1d

As far as I can tell, its optional dependency is OpenMP, not CUDA. Doesn't seem directly dependent on CUDA.

gpderetta
0 replies
23h52m

Yes, a quick skim of the code shows only an OpenMP dependency. The C/CUDA reference might have been meant to be C/OMP.

Although I wonder if it would work well with GCC PTX OMP offloading.

dlazaro
0 replies
23h30m

The plan is to eventually implement with CUDA:

"Currently, I am working on [...] direct CUDA implementation, which will be significantly faster and probably come close to PyTorch."

jsheard
0 replies
1d

It's only ~1000 LoC, seems like a pretty good case study to port over to other runtimes and show they can stand up to CUDA.

exe34
0 replies
22h57m

Looking forward to your patches!

yinser
7 replies
23h47m

I've seen his nanoGPT implemented using JAX; now we have C/CUDA. I'd love to see if nanoGPT would be doable in Mojo. I took a stab at a Mojo conversion of his WaveNet project (from Andrej's Zero to Hero course) and I gotta say... Python has so many nice features, lol. Stating the obvious, I know, but what you see done in 6 lines of Python takes so much more work in other languages.

pavelstoev
3 replies
18h50m

How do you support GPU data parallelism, and all the benefits it brings, in Mojo?

KeplerBoy
2 replies
15h24m

You don't. Mojo doesn't support GPUs at the moment, which says a lot about a language that claims to be AI-first.

yinser
0 replies
5h41m

If you want CUDA up front, go write PyTorch; no one is stopping you. Modular's goal was to leverage MLIR first and bring GPUs in later. They're barely a year-old company.

pjmlp
0 replies
15h10m

They only made Mojo available outside the preview circle about a couple of months ago, and it has yet to run on researchers' Windows laptops.

I love the attitude of considering 0.x languages production-ready for all imaginable kinds of workloads.

cb321
1 replies
22h21m

For a prior generation of Karpathy-splaining there is this Nim port: https://github.com/Vindaar/llama2nim - maybe of interest if you are interested in Mojo.

yinser
0 replies
20h9m

Thank you!

convexstrictly
7 replies
18h25m

Candle is a minimalist ML framework for Rust with a focus on performance (including GPU support) and ease of use

https://github.com/huggingface/candle

imjonse
3 replies
16h31m

Candle focuses on inference though.

revskill
1 replies
10h56m

What is inferencing?

HarHarVeryFunny
0 replies
5h40m

Inference means using the neural net, as opposed to training it.

During inference you feed an input into the NN and it passes through in the "forwards" direction (i.e. from input to output), being transformed according to the "weights" that were learnt during training, to derive the output.

During training, each training sample is first fed forwards through the NN, the same way as for inference, but then the output of the model (which at the beginning of training will be random/wrong) is compared to the correct/desired output for that training sample, and a corresponding error value will then be fed backwards (from output to input) through the NN according to the "backpropagation" mechanism to update the weights.

Training is a lot more involved than inference since it involves this backpropagation step.
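
A minimal sketch of that difference in C for a single linear layer (illustrative only, not the repo's actual functions): inference needs only the forward pass, while training also needs the backward pass that turns the output error into gradients.

  // Forward: out = W * in, with W (OUT x IN) row-major. Inference stops here.
  void linear_forward(const float *W, const float *in, float *out, int IN, int OUT) {
      for (int o = 0; o < OUT; o++) {
          float acc = 0.0f;
          for (int i = 0; i < IN; i++) acc += W[o * IN + i] * in[i];
          out[o] = acc;
      }
  }

  // Backward: given dL/dout, accumulate dL/dW (assumed zero-initialized) and
  // compute dL/din so the error can keep flowing to the previous layer.
  void linear_backward(const float *W, const float *in, const float *dout,
                       float *dW, float *din, int IN, int OUT) {
      for (int i = 0; i < IN; i++) din[i] = 0.0f;
      for (int o = 0; o < OUT; o++) {
          for (int i = 0; i < IN; i++) {
              dW[o * IN + i] += dout[o] * in[i];    // gradient w.r.t. weights
              din[i] += dout[o] * W[o * IN + i];    // gradient w.r.t. input
          }
      }
  }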

l-m-z
0 replies
15h16m

Candle dev here, we also support training/backprop! We certainly focus on optimizing inference performance, but hopefully that should improve training efficiency too.

basbuller
0 replies
15h51m

Not nearly as minimal as Karpathy's implementation, though.

0xfedbee
0 replies
13h21m

I wouldn't call it "minimalist" after seeing Karpathy's code.

idkwhatimdoin
6 replies
23h12m

If I was starting from scratch, what resources should I start with to build up an understanding of what this code does and how to read it? It's quite dense and my knowledge of LLMs is quite minimal. Are these terse variable names standard in LLM-land?

vineyardmike
3 replies
21h26m

Terse variables are a C thing.

“What resources would I need” -> you’re literally commenting on a teacher's content. Karpathy (the author) has a very informative YouTube channel where he goes step by step through everything. He has a ton of repos and tutorials. Dig a little.

If all else fails… Google it.

idkwhatimdoin
1 replies
16h13m

you’re literally commenting on a teacher's content.

How am I supposed to know that?

Karpathy (the author) has a very informative YouTube channel where he goes step by step through everything.

Or that, without knowing that he's a teacher?

Terse variables are a C thing.

I didn't realize variables had to be so short in C. Glad I write C++ professionally where they've added support for longer variable names.

If all else fails… Google it.

There's a lot of LLM garbage out there. I got an answer here in a few minutes pointing to Karpathy's course which seems very high quality.

Be kinder.

vineyardmike
0 replies
14h2m

How am I supposed to know that?

You’re not supposed to know that. You asked a question, and this is you being told the answer.

It’s very convenient that the author of the post is quite literally the world’s most prolific teacher on this topic. Makes it easy to find Karpathy. You shouldn’t be expected to otherwise know that (or else why ask if you knew).

I didn't realize variables had to be so short in C. Glad I write C++ professionally where they've added support for longer variable names.

This feels like a joke, but old C compilers did have identifier length limits. This is part of why C historically had shorter variable names than more modern languages.

Sorry if it came off rude, the internet is hard to communicate over.

https://publications.gbdirect.co.uk/c_book/chapter2/keywords...

viraptor
0 replies
20h49m

Terse variables are a C thing.

They're a math / toy code thing. Large C projects have long descriptive names just like other languages.

tayo42
0 replies
22h37m

Check out his Zero to Hero series, which builds this with Python and later PyTorch, then probably his other mini C-based projects.

satokema
0 replies
20h24m

As siblings have said, his video series are quite good. But if you're just looking at this repo only, you probably want to look at the python reference implementation. (The C is designed to exactly replicate its functionality.)

antirez
5 replies
12h45m

Is this able to replace PyTorch, ... in normal practice? No.

Does this show that in general the most used ML frameworks are a mess? Yes.

bootsmann
3 replies
11h24m

This is a bit of an apples-and-oranges comparison. PyTorch is a research framework, not a transformer inference library.

antirez
2 replies
11h13m

This post is about training, not inference. And llama.cpp has similarly simple LoRA training code. There is nothing in neural networks themselves complex enough to justify the amount of complexity the Python ML community has piled up. MLX, for instance, is a similarly general-purpose research framework at a fraction of the size.

HarHarVeryFunny
1 replies
4h59m

Sure, neural networks in and of themselves are conceptually simple and not difficult to code. Andrew Ng's original Coursera class is all you need to go from zero knowledge to building MATLAB-based neural nets in this same hard-coded style.

However, there is a huge difference in functionality (hence complexity) in a framework such as PyTorch vs hardcoding a single NN. It's a bit like the difference between writing a toy compiler in CompSci class vs a production one that supports optimization, multiple targets, etc, etc.

The first step in convenience beyond hardcoding models, was frameworks like the original Torch, and original TensorFlow. Those frameworks let you explicitly assemble a neural net out of modular "lego blocks" (tensor operations), then just call model.forward() or model.backward() - no need to yourself write the forwards and backwards functions.

What PyTorch (successor to Torch) did was increase the complexity of the framework but bring massive ease of use to the developer, by getting rid of the explicit lego-block assembly process and instead letting the developer just write arbitrary Python code corresponding to what they want the model to do; PyTorch then builds the model internally and is therefore able to infer the backward function. This extra functionality and ease of use, with its corresponding internal complexity, is what differentiated PyTorch from TensorFlow, made it so successful, and caused most developers to switch to it.

There is also a lot of other functionality in PyTorch that adds to the complexity - supporting multiple back ends, custom CUDA/etc kernels beyond what is provided by cuDNN, etc, etc.

antirez
0 replies
4h15m

I know all these things. Again: look at MLX.

HarHarVeryFunny
0 replies
9h8m

Does this show that in general the most used ML frameworks are a mess? Yes.

Not really ... there is little to no overlap with what a framework like PyTorch does. There is no tensor class, no autograd, etc. Just malloc, a bunch of hand calculated pointers into that chunk of memory, and hand written gradient functions. I assume the intent here is to be educational by stripping away the layers of abstraction to make it clearer what is going on.

Frankly though, this code (all that pointer math!) is a mess too, maybe written this way to make it easy to port to cuDNN which is at a similarly low level (other than having tensor descriptors which make the memory layout more flexible).

If you want to write your own tensor class and reusable NN framework, then the lines of code go up very rapidly. I did one in C++ a while back, and the tensor class alone was 20K LOC.
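
To make the "one malloc plus hand-calculated pointers" style concrete, here is a hedged sketch; the tensor names and sizes are made up for illustration and are not the actual llm.c layout:

  #include <stdlib.h>

  // All parameters live in one contiguous block; each tensor is just a
  // pointer at a hand-computed offset into it.
  typedef struct {
      float *memory;   // the single malloc'd block
      float *wte;      // token embeddings,    V * C floats
      float *wpe;      // position embeddings, T * C floats
      float *w_attn;   // attention weights,   L * 3 * C * C floats
  } Params;

  Params params_alloc(size_t V, size_t T, size_t C, size_t L) {
      Params p;
      size_t total = V * C + T * C + L * 3 * C * C;
      p.memory = (float *)malloc(total * sizeof(float));
      float *ptr = p.memory;
      p.wte = ptr;    ptr += V * C;
      p.wpe = ptr;    ptr += T * C;
      p.w_attn = ptr; // last tensor; ptr would advance by L * 3 * C * C
      return p;
  }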

blackeyeblitzar
4 replies
23h45m

It would be great if someone created a tutorial around this explaining exactly how it works and how to do a test training run. I’m aware it’s not feasible to train a “real” model on personal hardware but it would be nice to have a practical learning experience. I’m not sure if there are good alternatives for that.

blackeyeblitzar
0 replies
21h0m

Thank you so much for responding. I will definitely check these out and also pass it on to others who might be interested.

MAMAMassakali
0 replies
12h43m

Thank you so much for the Zero To Hero playlist!

vineyardmike
0 replies
21h25m

The author has a whole series where he does exactly that. YouTube videos, code examples, documentation, everything. Explains the math, explains how to code it, explains the architecture. Everything.

fori1to10
3 replies
23h27m

It should be rewritten in Rust. (Just joking)

naruhodo
0 replies
18h35m

I think you just cracked AI safety.

eclectic29
0 replies
23h24m

Sshh! I asked why it was written in C and got flagged.

ddggdd
0 replies
20h56m

I just pasted the code into Claude and am reading the converted Rust now; it definitely needs extra work.

andy99
3 replies
22h58m

I'd like to think he took the name from my llm.f90 project https://github.com/rbitr/llm.f90

It was originally based off of Karpathy's llama2.c but I renamed it when I added support for other architectures.

Probably a coincidence :)

andy99
0 replies
7h14m

I'll send you an email

bee_rider
0 replies
5h37m

In f90? That’s pretty cool.

On a related note, IMO it would be pretty cool if we could get an LLM implementation that provides an RCI interface like all the old computational codes used to.

0cf8612b2e1e
1 replies
23h28m

I love his videos. They are dense, but I get a lot out of them.

sghiassy
0 replies
23h21m

+100 thank you karpathy!

milansuk
2 replies
12h15m

This is an implementation of a transformer, and in the README it's presented as text->text. Tokens are just integers going in and out.

Is it possible to use it to train other types of LLMs (text->image, image->text, speech->text, etc.)?

bootsmann
0 replies
12h2m

The transformer itself just takes arrays of numbers and turns them into arrays of numbers. What you are interested in is the process that happens before and after the transformer.

_giorgio_
0 replies
4h43m

Yes, anything can be an input token.

Patch of pixels -> token
Fragment of input audio -> token
etc.
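
For instance, a vision-style "token" is typically just a flattened image patch projected to the model's embedding width; a rough sketch, with the patch size and dimensions as illustrative assumptions:

  // Turn one P x P grayscale patch into a D-dimensional token embedding:
  // flatten the patch, then multiply by a learned projection matrix
  // proj of shape D x (P*P), row-major.
  void patch_to_token(const float *patch, const float *proj,
                      float *token, int P, int D) {
      int N = P * P;
      for (int d = 0; d < D; d++) {
          float acc = 0.0f;
          for (int n = 0; n < N; n++) acc += proj[d * N + n] * patch[n];
          token[d] = acc;
      }
  }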

flockonus
2 replies
22h59m

Question, apologies if slightly off-topic; it's something I'd like to use this project for: is there an example of how to train GPT-2 on time series, in particular with covariates?

As far as my understanding of LLMs goes, at a basic level it's predicting the next token from previous tokens, which sounds directionally similar to time series (perhaps leaving aside periodicity).
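
One common way to make a real-valued series look like next-token prediction is to scale and quantize the values into a fixed vocabulary of bins; a minimal hedged sketch (the value range and vocabulary size are arbitrary assumptions, and covariates are ignored):

  // Quantize a scalar into one of vocab_size integer tokens by uniform binning;
  // a GPT-style model then predicts the next bin id from the previous ones.
  int value_to_token(float x, float lo, float hi, int vocab_size) {
      if (x < lo) x = lo;
      if (x > hi) x = hi;
      float t = (x - lo) / (hi - lo);            // normalize to [0, 1]
      return (int)(t * (float)(vocab_size - 1)); // map to {0, ..., vocab_size-1}
  }

  // Map a predicted token back to a representative value in its bin.
  float token_to_value(int tok, float lo, float hi, int vocab_size) {
      return lo + (hi - lo) * ((float)tok / (float)(vocab_size - 1));
  }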

EricLeer
0 replies
31m

Yes, there are many attempts at applying transformers to time series forecasting. For instance (but there are many more):
- TimeGPT: https://arxiv.org/abs/2310.03589
- Chronos: https://github.com/amazon-science/chronos-forecasting

These kinds of papers often promise the world, but frequently lack a proper baseline model. They only compare against very simple (naive forecast) or untuned models. In my experience a gradient boosting model will solve 95% of your forecasting problems, and trying to get fancy with a transformer (or even just a simple neural net) is more trouble than it is worth.

waynecochran
1 replies
5h17m

Fantastic -- gotta love Andrej. I am sick of the ball and chain that is Python and all of its environment dependencies. It is nice to shed all the weight and get down to the metal.

toxik
0 replies
4h0m

Yeah, as long as you don’t want to change the network architecture.

Edit: and you trust that Andrej didn’t screw up anywhere while hand-rolling all the gradient calculations.

rurban
1 replies
16h51m

https://github.com/robjinman/Richard uses Vulkan, and is thus portable across GPUs and much faster. It also has more kernels. In simple C++.

flohofwoe
0 replies
9h59m

Or rather GLSL... The C++ code looks like it's mostly just scaffolding to kick off the actually important GPU work, and for that it's a surprising amount of code. Quite typical both for Vulkan and C++ though ;)

lubesGordi
1 replies
7h46m

Quick question: is this just pure C code that can be loaded onto an Nvidia GPU and run (via the Python code)? I scanned the C and didn't see anything CUDA-related (maybe I missed something, I'm not a GPU programmer!). Karpathy mentions something about a direct CUDA implementation coming soon; how would that be different from what this is?

whb07
0 replies
7h42m

It’s not. If you look at his X account, he talks about his work adding the CUDA parts.

davedx
1 replies
11h41m

On one hand, really nice to see the whole thing in 1000 lines of C code.

On the other hand, that malloc function low key terrifies me. :)

flohofwoe
0 replies
10h10m

Better to be explicit than hiding unsafe memory accesses under C++ stdlib classes like std::vector which don't do range checking either in operator[]. And in this sort of code, automatically injected runtime range checks would most likely hurt performance enough to matter.

I would still run the code through the Clang static analyzer and a couple of test runs in ASAN and UBSAN to be sure that nothing slipped through.

zzbn00
0 replies
12h5m

Very nice.

In my experience much of the complexity of numerical software is to enable the search for the algorithm that works well with the problem/data you have. Once you know the exact algorithm you want, it is possible to make a nice clean minimalistic implementation, but that does not mean such an implementation would have been easy at the beginning.

triyambakam
0 replies
23h57m

When Lex recently talked to Andrej, Andrej said that he gets positively obsessed with a problem and says "this must exist". I imagine this must be one of those outputs.

tehsauce
0 replies
20h38m

Another awesome project! Note that as of this moment the CUDA part is aspirational. There is no GPU code in the repo yet.

sirsinsalot
0 replies
11h31m

Karpathy's code, teaching and contribution to the body of knowledge in this area really is admirable.

Sadly I am a generalist, but if I were a specialist, I would hope to contribute as openly and widely as Karpathy.

Not clout chasing, click-bait, "top 5 javascript frameworks of 2023!" ... just high quality output that marks a specialist.

Sorry to gush.

robot
0 replies
16h28m

very cool, also the coding style looks good.

richrichie
0 replies
20h9m

See, C does it very well. Great stuff. Karpathy has a gift for teaching.

mrbonner
0 replies
18h47m

Wow, and this is done right after a recent trip to Bhutan to clear his head! I follow Karpathy on Twitter, and he posted that two weeks without constantly checking his phone kind of turned off the always-on radio in his head.
