
Reproducing GPT-2 in llm.c

karpathy
4 replies
22h47m

sounds good, both work, though I think HN has a bit of an anti-twitter bias.

pests
2 replies
19h37m

First, love the videos and other work you've been doing. The micrograd videos are a great way to show people this is all math in the end, and I've linked to specific timestamps in that video and others more times than I can count.

As for why I think we have an anti-Twitter bias...

Twitter doesn't show replies or any further context without being logged in. Most people will have accounts but I know a lot here deleted theirs or refuse to use it for one reason or another.

Also IMO most here are going to want to read the full source so it just cuts out the middleman. This would usually fall under the "Please submit the original source. If a post reports on something found on another site, submit the latter." guideline which is a little different since the source is yourself, but still the Twitter post doesn't add anything new or novel.

karpathy
1 replies
17h4m

fwiw I totally understand the sentiment! it's actually a bit sad to me that so much of our content is moving from the shared, open web to platforms like twitter; unfortunately there seems to be too much value-add around built-in discoverability, comments, ease of authoring, and (for many people) revenue sharing, etc.

pests
0 replies
14h16m

Yes, definitely. I had to double-check your age (apologies! feels rude somehow) and yep, we're basically the same age. The web was different back then. Maybe not better; maybe that's nostalgia. But never before have creators had as many tools and avenues to promote and monetize their work as they do now.

dang
0 replies
14h23m

I agree - Twitter is still the primary source for a lot of original work and original thoughts. Unfortunately it's gotten more complicated because (1) the threads there have gotten less accessible and (2) some people have assigned the entire site to one side of the culture war.

wrboyce
2 replies
19h20m

Could you mention what the link has been changed from, too? Sometimes it helps with context when reading the comments. Thanks!

dang
1 replies
14h25m

I agree that it helps! but I did mention it, no? Admittedly "to that from" is a bit of an awkward construction

wrboyce
0 replies
9h16m

*facepalm* I'd had a few whiskies and misread your comment. Sorry about that!

lagrange77
4 replies
1d

Thank you for the effort you put in your educational work, it helped me and others a lot! In fact, i'm training my nanoGPT version right now. :)

“Ultimately my interest in llm.c is to have a nice, clean, minimal, super dependency-light repo in direct C/CUDA implementation, which I find aesthetically pleasing.”

Also, it's awesome that you spend your time on your passion.

Any plans on making a video series on llm.c? :D

karpathy
3 replies
1d

Yes definitely. Related tweet of mine:

https://x.com/karpathy/status/1760388761349927356?lang=en

1. Build the thing

2. Build the ramp

Currently on step 1 :). It helps to build it first so you know where you are going, and then you can more easily re-build it when your vector is pointed at the end result.

lagrange77
0 replies
1d

That's fantastic. My gradient field is pointing towards it.

Thank you again!

htrp
0 replies
23h56m

Every time you take gardening leave, you build something new and interesting!

LorenzoGood
0 replies
15h25m

I love when you leave your job.

ngiyabonga
3 replies
1d1h

Hi Andrej!

First, thank you for your teaching; it has helped me a lot. I didn't think I'd ever have the chance to say thank you, but here you are and I hope this gets to you!

Question - what's a relevant (05-2024) baseline to compare the performance of the C code to? Back when you made nanoGPT you were seeing "the file train.py reproduces GPT-2 (124M) on OpenWebText, running on a single 8XA100 40GB node in about 4 days of training". So twice the memory on the C node, but I'm unsure of the data size/epochs and any other details I may be missing. I.e., what's the net uplift of running C vs "legacy" torch code?

Thanks again for everything.

karpathy
2 replies
1d1h

The baseline is definitely PyTorch (or JAX), and indeed something like nanoGPT. I just never got nanoGPT "past the finish line" of really crossing the t's and dotting the i's and reproducing the models with as much care as I did now and here in llm.c, and getting to the point where it's a single launch command that just does the thing.

I think I'll try to develop the `train_gpt2.py` inside llm.c to be that, so that we have the two implementations exactly side by side, and it's all nice and comparable.

The C/CUDA code is currently a little bit faster than PyTorch (last time I measured ~2 weeks ago it was about 6% faster), and I think we can push this further. This is done by manually hard-coding a bunch of fusions/optimizations that are non-trivial for torch.compile to find (e.g. our FusedClassifier). But PyTorch has some pending work/PRs that will also speed up their side a lot.

Ultimately my interest in llm.c is to have a nice, clean, minimal, super dependency-light repo in direct C/CUDA implementation, which I find aesthetically pleasing. And on top of that, educational, i.e. using all of the above as an endpoint of an intro LLM course.

raymond_goo
0 replies
19h39m

Maybe talk to MasterClass...

ilaksh
0 replies
19h59m

Just out of curiosity, how do you feel about Tinygrad? They just released 0.9 and are also on the HN home page today.

espadrine
3 replies
1d1h

How big of a perf improvement would result from using the architectural tweaks that Llama3 and others have put in place since GPT-2?

karpathy
2 replies
1d

My understanding and suspicion is: mostly, less than you'd think. The Llama 3 architecture has the following changes relative to GPT-2 (a rough sketch of changes 2 and 3 follows the list):

1. delete the absolute positional encoding and replace with RoPE

2. delete all biases in all layers (in LayerNorms, they turn into RMSNorm)

3. GeLU -> SwiGLU non-linearity in the MLP

4. longer context length

5. architecture hyperparameter changes, e.g. slightly different aspect ratios
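
A rough PyTorch sketch of what changes 2 and 3 look like, for intuition only; the class names, parameter names, and dimensions here are made up and this is not the actual GPT-2, Llama 3, or llm.c code:

```python
# Illustrative sketch: bias-free RMSNorm replacing LayerNorm (change 2)
# and a gated SwiGLU MLP replacing the biased GELU MLP (change 3).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))  # scale only, no bias
        self.eps = eps

    def forward(self, x):
        # normalize by the root-mean-square of the features, then rescale
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class Gpt2MLP(nn.Module):
    """GPT-2 style MLP: Linear -> GELU -> Linear, with biases."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.c_fc = nn.Linear(dim, hidden)
        self.c_proj = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.c_proj(F.gelu(self.c_fc(x)))

class SwiGluMLP(nn.Module):
    """Llama-style MLP: gated, SiLU ("Swish") activation, no biases."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```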

And there was a paper that I can't find the reference to anymore that claimed that if you train long enough, the gap becomes even lower. Possibly because the absolute positional encoding has enough time to train more fully, whereas the RoPE layer benefits from the "inductive bias" it adds in the earlier stages of training.

But I don't have full confidence in the above claim; maybe someone has tried it or has a better/concrete reference.

soraki_soladead
0 replies
14h12m

FWIW, that's SwiGLU in #3 above. Swi = Swish = SiLU; GLU is a gated linear unit, i.e. the gate construction you describe.

dekhn
3 replies
18h53m

Would you consider switching your interest to protein structure prediction? In particular, the current most advanced model is a closed-source, closed-weights system that was trained on proprietary hardware. It is intentionally kept that way for now to enable DeepMind to commercialize their product.

The goal here isn't to make the best-performing model: it's ablation. How much can we remove from protein structure prediction (such as multiple sequence alignments and molecular dynamics, which were two improvements in AF3) while still having a generalized model that can predict novel folds?

Then focus on teaching the minimal necessary math and code to reproduce the results to the larger biological community. All I can say about AF3 is that it literally taught me that everything I learned about protein structure prediction in the last 30 years was misguided, or outright wrong.

Don't worry about drug discovery or any of the hard stuff. Just continue to show that all that's required to predict novel structures is the existing PDB.

wizzwizz4
1 replies
48m

“switching your interest”

That's not usually how it works.

“Just continue to show that all that's required to predict novel structures is the existing PDB.”

Sounds like you know a lot about this topic. You should do it!

dekhn
0 replies
35m

Yes I already published several papers in the area, but I don't work on it any more.

treme
0 replies
6h35m

lol I appreciate your effort to guide his genius towards 'max human good'

363849473754
3 replies
1d

You might have covered this topic before, but I'm curious about the main performance differences between nanoGPT and llm.c. I'm planning to take your "Zero to Hero" course, and I'd like to know how capable the nanoGPT chatbot you'll build is. Is its quality comparable to GPT-2 when used as a chatbot?

karpathy
2 replies
1d

Zero To Hero doesn't make it all the way to a chatbot; it stops at pretraining, and even that at a fairly small scale, with a character-level transformer on TinyShakespeare. I think it's a good conceptual intro, but you don't get too far toward a competent chatbot. I think I should be able to improve on this soon.

maskil
0 replies
22h49m

Please do! It's a fantastic series!

363849473754
0 replies
23h24m

Thanks! So, you are considering expanding the Zero to Hero series to include building a basic GPT-2 toy chatbot? I believe you mentioned in one of the early lectures that you planned to include building a toy version of DALL-E. Do you still have plans for that as well?

bilsbie
1 replies
1d

Any tips on understanding grokking? I’m not following that paper.

sturza
0 replies
1d

The paper is "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets". In short: overfit, keep training anyway, and some new behavior (generalization) might emerge.

localhost
2 replies
22h19m

How large is the set of binaries needed to do this training job? The current pytorch + CUDA ecosystem is so incredibly gigantic and manipulating those container images is painful because they are so large. I was hopeful that this would be the beginnings of a much smaller training/fine-tuning stack?

karpathy
1 replies
21h57m

That is 100% my intention and hope, and I think we are very close to deleting all of that. Right now on master, I am already only using Python for the tokenization preprocessing. In principle the requirements for llm.c should be extremely minimal. I think this is a few days of work that is high on my mind.
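
For concreteness, here is a minimal sketch of what GPT-2 tokenization preprocessing looks like, assuming the tiktoken library; the function name and the flat uint16 output format are illustrative, not necessarily what llm.c's data scripts actually do:

```python
# Minimal, illustrative GPT-2 tokenization preprocessing sketch
# (assumes `pip install tiktoken numpy`; not the actual llm.c script).
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # GPT-2 BPE, vocab size 50257

def tokenize_to_bin(input_path: str, output_path: str) -> None:
    with open(input_path, "r", encoding="utf-8") as f:
        text = f.read()
    tokens = enc.encode_ordinary(text)        # plain BPE, no special tokens
    arr = np.array(tokens, dtype=np.uint16)   # GPT-2 token ids fit in uint16
    arr.tofile(output_path)                   # flat binary file of token ids
    print(f"wrote {len(arr)} tokens to {output_path}")

# tokenize_to_bin("input.txt", "input_tokens.bin")
```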

Biggest problem right now is finding a place that can host the 135GB of tokens for FineWeb100B. Will probably use S3 or something.

Related see: https://github.com/karpathy/llm.c/issues/482

metadat
0 replies
19h2m

Could this be a good case for a torrent?

simonw
1 replies
22h31m

“Keep in mind that here we trained for 10B tokens, while GPT-3 models were all trained for 300B tokens. [...] GPT-3 actually didn't change too much at all about the model (context size 1024 -> 2048, I think that's it?).”

Andrej, based on that do you have a rough cost estimate for what it would take to train a GPT-3 Ada (350M)? Do you plan to get there with llm.c?

karpathy
0 replies
22h26m

The 350M model I trained last night was 30B tokens, 14 hours, ~$200. Conveniently, 300B is exactly 10X the tokens, so ~$2K would be the estimate. You'd have to wait 140 hours on one box though. Getting an H100 box instead of A100 will already cut the time down probably by a factor of 2-3X, for free, even without going to fp8 (which we do plan to support).
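
The scaling in that estimate, spelled out in a couple of lines (same numbers as the comment, all approximate):

```python
# Linear scaling of the 350M run's cost and time to GPT-3-style token counts.
tokens_run, hours_run, cost_run = 30e9, 14, 200      # 30B tokens, 14 h, ~$200
tokens_target = 300e9                                # GPT-3 training budget
scale = tokens_target / tokens_run                   # 10x
print(f"~{hours_run * scale:.0f} hours, ~${cost_run * scale:,.0f}")  # ~140 h, ~$2,000
```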

So TL;DR: at this model scale, llm.c is already there functionally, I think; it's a matter of the compute resources and patience. I currently have this one box from Lambda and I have to look around for a few more boxes and merge the pending PR for multi-node training support. Getting all of this into a nice, stable state is probably a good chunk of the pending work right now.

m11a
1 replies
22h57m

Why write in CUDA and not just use PyTorch etc?

If it's for performance, how much faster is it, out of curiosity?

kgwgk
0 replies
21h21m

“Why write in CUDA and not just use PyTorch etc?”

“LLM training in simple, pure C/CUDA. There is no need for 245MB of PyTorch or 107MB of cPython. […] A few more words on what I want this repo to be: First, I want llm.c to be a place for education.”

0x1ceb00da
1 replies
10h34m

Hi. Is it possible to somehow run llm.c on an AMD GPU?

anthonix1
0 replies
2h33m

Yeah, I just reproduced the GPT-2 from-scratch results in 8.75 hours on 4x 7900 XTX. The fork is here: https://github.com/anthonix/llm.c

sytelus
0 replies
9h41m

So, nanoGPT took 1.8 days on 8x A100 for 124M model training on 30.7B tokens using flash attention. This would translate to ~14.4 hr for 10B tokens. With llm.c it is ~1.5 hr, which is almost a 10X speedup!
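
A quick back-of-the-envelope check of that scaling, using only the numbers quoted above (all figures approximate):

```python
# Rough check of the speedup estimate (numbers from the comment above).
nanogpt_hours_30_7B = 1.8 * 24                       # 1.8 days on 8x A100 = 43.2 h
nanogpt_hours_10B = nanogpt_hours_30_7B * 10 / 30.7  # scaled to 10B tokens ~ 14.1 h
llmc_hours_10B = 1.5                                 # reported llm.c time, 8x A100
speedup = nanogpt_hours_10B / llmc_hours_10B         # ~ 9.4x
print(f"{nanogpt_hours_10B:.1f} h vs {llmc_hours_10B} h -> ~{speedup:.1f}x")
```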

Does this look ballpark correct? Is there any summary of where the majority of this improvement comes from?

jonesn11
0 replies
13h28m

I like the FAQ; you correctly anticipated my questions.

1024core
0 replies
1d1h

Thank you, from an appreciative reader!

benterix
42 replies
1d1h

I just hope that in a couple of years we'll see a submission here titled "Reproduce GPT-4 on legacy RTX 4090."

Because currently even with open source (?) models we are still consumers, and the training is still the domain of the rich.

ravetcofx
18 replies
1d1h

Accessing the dataset to train from scratch will be the biggest hurdle, now that a lot of the pile has had the ladder pulled up since GPT-4.

ravetcofx
5 replies
1d

Holy crap, does Hugging Face charge for bandwidth if you're downloading 45 terabytes??

andersa
2 replies
14h7m

Fun trivia: downloading 45TB costs about $60, according to Cloudflare.

verticalscaler
1 replies
7h11m

That's what Cloudflare charges. It costs them around 6 cents.

kazanz
0 replies
1h31m

Wish I could say I'm surprised you're getting downvotes. Carrier costs are some of the lowest costs for hosting providers. Yet that fact seems to elude a majority of the community here.

drexlspivey
1 replies
23h40m

I believe they are hosting it on Cloudflare who doesn’t charge for egress

fragmede
0 replies
23h25m

More specifically, Cloudflare R2 doesn't charge for egress, and Cloudflare doesn't charge for egress to members in the Bandwidth Alliance which include Azure, Google Cloud, Oracle, Alibaba Cloud, and others, though critically not AWS.

They very much do charge egress fees elsewhere.

meiraleal
5 replies
1d1h

I'm okay with paying for datasets

CamperBob2
4 replies
1d1h

Depends on how the courts rule. If the copyright maximalists prevail, only the wealthiest entities will be able to afford to license a useful data set.

Paradoxically enough, this is the outcome that most "Hacker News" denizens seem to be rooting for.

groby_b
2 replies
22h29m

It's almost as if people believe in fairness and compensating people for their work.

Also, it's worth noting that this is only true as long as we're stuck in the "must train on the entire sum total of human output ever created" local minimum for machine learning. Given that most biological entities learn with much less data, this might well be the thing that prods ML research into an approach that isn't "IDK, buy a few containers of GPUs and half a DC of storage, and see if that makes things better".

nwsm
1 replies
3h4m

“It's almost as if people believe in fairness and compensating people for their work.”

Yet in this case we are talking about compensating the compilers/massagers/owners of the datasets, not the original authors from wherever the data was originally scraped.

wizzwizz4
0 replies
43m

Copyright is hideously broken, but in theory: the owners only own it because they compensate the authors, which they only do out of an expectation of future profit (on average).

That theory's a fantasy, because extractive systems involving gatekeepers get established, but in this specific case, enforcing copyright would make things fairer for authors. There's no extractive copyright-taking gatekeeper for websites: scrapers don't get copyright, so can't re-license the material they've scraped (unless it's permissively-licensed or something).

meiraleal
0 replies
1d

I'd still get most of my dataset from torrent but I could pay for specific things like high quality source code.

exe34
2 replies
1d1h

I suppose you wouldn't be able to use it for external services, but internally, I'm sure you can find some books that fell off the back of a truck...

HeatrayEnjoyer
1 replies
1d1h

No reason you can't go external. GPT was trained using ebook torrent sites.

artninja1988
0 replies
23h55m

OpenAI has enough money to hire lawyers to defend it until the end of time though

CamperBob2
1 replies
1d1h

Someone will come along and say "Why don't you just mirror Anna's Archive?" in 3...2...1...

sebzim4500
0 replies
8h56m

I think between Anna's Archive, FineWeb, and as many GitHub repos as you can scrape, you can get a pretty decent dataset.

I doubt Anna's Archive would produce a good model on its own though.

huac
4 replies
4h34m

25% MFU :( maybe because of the P2P nerf?

sabareesh
1 replies
2h15m

This is a much bigger model (500M), and P2P is enabled via mailbox. The lower MFU is expected because of the memory-to-compute ratio.

huac
0 replies
2h7m

can you elaborate?

anthonix1
1 replies
4h6m

Maybe get a 7900 XTX. 122 TFLOPS of BF16/FP16 for less than $1k, and I'm getting 55.4% MFU.

sabareesh
0 replies
2h14m

These are not an apples-to-apples comparison, as this is running across GPUs and is a much bigger model.

vineyardmike
4 replies
1d1h

We won't ever get there, or need to, because GPT-4 wasn't trained on one GPU; it was trained on thousands. The (most likely) biggest meaningful difference between -2 and -4 is the number of parameters and the training data/duration. I don't think you'd really learn much more.

elicksaur
3 replies
19h10m

It’s not about learning. It’s about owning. Exactly the reason OpenAI stopped being open. Having GPT-4-quality LLMs created by anyone with a gaming PC would be pretty radical.

vineyardmike
2 replies
17h1m

And you won't get there. Those models are far too large for a 2024 GPU. Llama-3 70B is arguably close to GPT-4 but is still too large for gaming GPUs (and probably will be for many years of GPU updates).

elicksaur
1 replies
14h11m

“You won’t get there” is a pretty vast statement for all of the future. Two fairly reasonable predictions: 1) the compute needed to get GPT4 performance will decrease. 2) the compute on consumer GPUs will increase.

At some point they cross, and you will be able to run a GPT4-quality LLM on a consumer GPU. At some point after that, you’ll be able to run a GPT4-quality LLM on a 2024 consumer GPU if you can find one.

Important to emphasize, I’m not saying “GPT-4”. Llama-3 was trained on 24k GPU clusters. “Able to do the exact same processing at 1/24k the compute” is different from “Able to get equivalent performance at 1/24k compute”. Even then, given a long enough time scale, the former is possible.

vineyardmike
0 replies
3h8m

“1) the compute needed to get GPT4 performance will decrease. 2) the compute on consumer GPUs will increase.”

I’m assuming we’re just talking inference here…

Sure, compute abilities for consumers will increase, but the original comment had a fixed GPU - the 4090. I can already eke out Llama-3 8B on my MacBook Air, and Apple will sell you a laptop capable of running the full-sized Llama.

There is a direct correlation between parameters and "knowledge" for an LM. There are some open questions as to density (Llama 3 specifically challenged previous assumptions), but it seems implausible to fit a model equivalent to GPT-4 into 24 GB of VRAM. Just like compression, you can't shrink forever.
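
Some rough supporting arithmetic on the 24 GB point (my own numbers, weights only; activations, KV cache, and quantization all change the picture):

```python
# Rough weights-only VRAM arithmetic (fp16/bf16 = 2 bytes per parameter).
vram_bytes = 24 * 1024**3                    # 24 GB card, e.g. an RTX 4090
bytes_per_param = 2                          # fp16 / bf16
max_params = vram_bytes / bytes_per_param
print(f"~{max_params / 1e9:.0f}B params fit as raw fp16 weights")  # ~13B

llama3_70b_gb = 70e9 * bytes_per_param / 1e9
print(f"Llama-3 70B weights alone: ~{llama3_70b_gb:.0f} GB")       # ~140 GB
```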

GPT-4 and GPT-2 are pretty similar architecturally (I assume). So if abilities don’t matter, we can already run GPT-2 so we’re basically there for 4.

anthonix1
3 replies
23h36m

FWIW, I'm seeing ~318,000 toks/sec throughput on a 4x AMD 7900 XTX machine (less than $4k worth of GPU), using the same settings as in the post (0.5M batch size etc).

pama
2 replies
20h32m

Did you reproduce the evaluation as well?

anthonix1
0 replies
4h17m

So... successfully reproduced in ~8.75 hours, taking about 18 kWh / $2.70

The first run actually failed at step 3000 or so, and I realized I had a bug in my attention / matmul kernels, but after fixing that and restarting it worked great

[1] https://github.com/anthonix/llm.c

anthonix1
0 replies
18h18m

It converges similarly on smaller datasets.

About to kick off a training from scratch run on the same fineweb-10B, which at 324k toks/sec should take about 8.6 hours. And with my kWh cost, that is about $2.50 cost to train.

Will report back tomorrow when the training has finished.

Invictus0
3 replies
1d1h

I'm not saying this to be rude, but I think you have a deep misunderstanding of how AI training works. You cannot just skip the matrix multiplications necessary to train the model, or get current hardware to do it faster.

xdavidliu
0 replies
22h46m

Was the first sentence really necessary? The second sentence seems fine by itself.

benterix
0 replies
8h10m

No offence taken! As far as my (shallow!) understanding goes, the main challenge is the need for many GPUs with huge amounts of memory, and it still takes ages to train the model. So regarding the use of consumer GPUs, some work has been done already, and I've seen some setups where people combine several of these and are successful. As for the other aspects, maybe at some point we distill what is really needed into a smaller but excellent dataset that would give similar results in the final models.

auspiv
2 replies
23h54m

Considering it takes 8x A100 GPUs (80GB VRAM) to train GPT-2, I think it'll take far more than a single 4090.

bufo
0 replies
23h40m

The RTX 4090 has about the same BF16 Tensor Core TOPS as the A100. Assuming 50% MFU (like the A100 40 GB PCIe), it would take 8x longer on 1 RTX 4090 vs 8x A100 80GB SXM, so 12 hours. Datasheet here for the TOPS: https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvid... 50% MFU should be achievable on the 4090.
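
A quick check of where the 12 hours comes from, under that comment's assumption that one 4090 delivers roughly one A100's worth of BF16 tensor-core throughput at similar MFU:

```python
# Rough scaling of the reported run time to a single RTX 4090.
a100_box_hours = 1.5   # ~90 minutes for GPT-2 (124M) on an 8x A100 node, per the post
num_a100s = 8
rtx4090_hours = a100_box_hours * num_a100s    # one 4090 ~ one A100 -> 8x longer
print(f"~{rtx4090_hours:.0f} hours on one RTX 4090")   # ~12 hours
```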

anthonix1
0 replies
4h15m

Nah, I reproduced on 4x 7900 XTX machine in 8.75 hours, so a single 7900 XTX (costs less than $1k) could do it in under 24 hours. Was hitting 55.4% MFU.

doubloon
0 replies
10h55m

Dude, I'm hoping we get rid of Nvidia completely. I can run llama.cpp inference on a 7B model on my 24-core Intel machine using just the CPU; it only uses about 4 GB of RAM and is not that slow. If we could have massively parallel Arm or even RISC-V machines without the CUDA proprietary-driver hell, it would be much more open source. And much less wonkage for the normie user.

natsucks
3 replies
21h59m

In your opinion is it important for ML engineers to know C?

esafak
0 replies
21h2m

You'd have to be deep into ML Infrastructure to use C, probably via CUDA. No-one who develops or uses ML models touches C or even C++. tinygrad and llama.cpp are exceptions.

brcmthrowaway
0 replies
21h12m

0% chance

adeptima
0 replies
10h8m

Spend one year studying multiple languages - bash, C, C++, Go, Python ... and even Mojo or Rust. 10-20 hours a week. Being able to read the top programming languages is the best investment I ever made. You will become fearless and can see the matrix ;)

indigodaddy
3 replies
1d1h

Looks like this is re: training, but I wonder how inference on this model would go on some garbage older machine with no GPU?

ryankrage77
2 replies
22h40m

Last time I tried GPT-2 on CPU (which I think was shortly before ChatGPT was launched), I was getting about 0.2 tokens/sec. CPU utilization was low though, so running inference in parallel gave better results. I was using 2 x E5-2660's.

int_19h
1 replies
22h17m

DDR5 helps a lot. You can actually run stuff like LLaMA at >1 tok/s on the CPU with high-end gaming hardware these days.

doubloon
0 replies
10h42m

I have a 24-core Intel CPU and llama.cpp runs Llama 3 surprisingly fast in surprisingly little RAM. Yes, it becomes a space heater, but there's light at the end of the CUDA-free tunnel.

ls612
1 replies
15h2m

Is this the sort of thing that a person with curiosity and a 4090 could do? It says he used 8xA100s in the cloud to do this but is it just a matter of the 4090 going 8x slower or will memory constraints kill the whole endeavour?

smaddox
0 replies
1h46m

A 4090 should have enough VRAM for 124M-param training. Even at float32 precision, with the AdamW optimizer, parameters plus optimizer state should only be ~2GB (124M params x 4 bytes per param x ~4 for optimizer overhead). So there should be plenty of remaining space for activations.
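
A quick check of that estimate (fp32 training with AdamW; activations not counted):

```python
# Memory for params, grads, and the two AdamW moments at fp32 (4 bytes each).
params = 124e6
bytes_per = 4
weights = params * bytes_per              # ~0.50 GB
grads = params * bytes_per                # ~0.50 GB
adam_moments = 2 * params * bytes_per     # ~0.99 GB (m and v)
total = weights + grads + adam_moments
print(f"~{total / 1e9:.1f} GB for params + grads + optimizer state")  # ~2.0 GB
```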

akkishore
1 replies
5h16m

Hi Andrej,

Huge fan of all the work you do. I wanted to understand something fundamental, and who better to ask than you: what's so special about the transformer architecture that it's able to predict the next token so beautifully, understanding all the intricate previous-token relationships? I understand attention, but what's so special about this architecture that no other architectures are able to "attend" appropriately to previous tokens? Being a CS guy, it's really hard for me to fathom that we have not yet created another architecture which can perform similarly.

smaddox
0 replies
1h55m

Transformers have quadratic computational complexity in sequence length, i.e. O(N^2) where N is the sequence length. RNNs, Linformer, Mamba, etc. have linear or quasi-linear computational complexity in sequence length, which often bottlenecks information movement across tokens.

In theory, if you grew the RNN's state quadratically vs sequence length, you could likely achieve comparable performance to transformers, but it would likely be less efficient than transformers.
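
A minimal illustration of where the O(N^2) comes from (plain scaled dot-product attention as a sketch, not any particular optimized implementation):

```python
# Every token attends to every other token: the score matrix is N x N.
import torch

N, d = 1024, 64                       # sequence length, head dimension
q, k, v = (torch.randn(N, d) for _ in range(3))

scores = q @ k.T / d**0.5             # (N, N): N^2 pairwise interactions
weights = torch.softmax(scores, dim=-1)
out = weights @ v                     # (N, d)
print(scores.shape)                   # torch.Size([1024, 1024])
```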

unknown2342
0 replies
3h18m

Thank you for your work Andrej! <3

notg963
0 replies
1d

Do you have plans to create videos for llm.c?

metalloid
0 replies
5h55m

This is awesome!

We need a series on how to build llm.c from scratch. Any volunteers?

:-)

celltalk
0 replies
22h52m

Time for LLM videos!

anoy8888
0 replies
23h7m

Can it be done in Rust?

aliljet
0 replies
19h20m

Is there a reason you're not trying to port this into an even more stack agnostic world without CUDA?

adeptima
0 replies
10h14m

Andrej Karpathy is a magician!

But being the coolest kid on the block with a pure C/CUDA implementation is not enough: https://github.com/karpathy/llm.c

Studying the source code of a baby Llama 2 model in pure Mojo is the next level: https://github.com/tairov/llama2.mojo https://github.com/tairov/llama2.mojo/blob/master/llama2.moj...

Mojo Lang - Tomorrow's High Performance Python? (with Chris Lattner) https://www.youtube.com/watch?v=JRcXUuQYR90

Andrej Karpathy and Chris Lattner collab is on my wishlist ;)