Hi HN, the main (more detailed) article is here: https://github.com/karpathy/llm.c/discussions/481
Happy to answer questions!
I just hope that in a couple of years we'll see a submission here titled "Reproduce GPT-4 on legacy RTX 4090."
Because currently even with open source (?) models we are still consumers, and the training is still the domain of the rich.
Accessing the dataset to train from scratch will be the biggest hurdle, now that a lot of The Pile has had the ladder pulled up since GPT-4.
https://huggingface.co/datasets/HuggingFaceFW/fineweb has 15T tokens of cleaned and deduplicated English web data.
Holy crap, does Hugging Face charge for bandwidth if you're downloading 45 terabytes??
Fun trivia: downloading 45TB costs about $60, according to Cloudflare.
That's what Cloudflare charges. It costs them around 6 cents.
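For what it's worth, here is the back-of-the-envelope implied by the figures quoted above (a quick C sketch; the $60 and 6-cent numbers are just the claims in this thread, not official pricing):

    #include <stdio.h>

    int main(void) {
        double gb = 45.0 * 1000.0;   // 45 TB dataset, in (decimal) GB
        double charged = 60.0;       // "about $60, according to Cloudflare"
        double cost = 0.06;          // "costs them around 6 cents"
        printf("implied price: $%.4f/GB\n", charged / gb); // ~$0.0013/GB
        printf("implied cost:  $%.7f/GB\n", cost / gb);    // ~$0.0000013/GB
        printf("implied bytes/token (45 TB / 15T tokens): %.1f\n", 45e12 / 15e12);
        return 0;
    }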
Wish I could say I'm surprised you're getting downvotes. Carrier costs are among the lowest expenses for hosting providers, yet that fact seems to elude much of the community here.
I believe they are hosting it on Cloudflare, which doesn't charge for egress.
More specifically, Cloudflare R2 doesn't charge for egress, and Cloudflare doesn't charge for egress to members of the Bandwidth Alliance, which includes Azure, Google Cloud, Oracle, Alibaba Cloud, and others, though critically not AWS.
They very much do charge egress fees elsewhere.
I'm okay with paying for datasets
Depends on how the courts rule. If the copyright maximalists prevail, only the wealthiest entities will be able to afford to license a useful data set.
Paradoxically enough, this is the outcome that most "Hacker News" denizens seem to be rooting for.
It's almost as if people believe in fairness and compensating people for their work.
Also, it's worth noting that this is only true as long as we're stuck in the "must train on the entire sum total of human output ever created" local minimum for machine learning. Given that most biological entities learn with much less data, this might well be the thing that prods ML research to using an approach that isn't "IDK, buy a few containers of GPUs, and half a DC of storage, see if that makes things better".
> It's almost as if people believe in fairness and compensating people for their work.
Yet in this case we are talking about compensating the compilers/massagers/owners of the datasets, not the original authors from wherever the data was originally scraped.
Copyright is hideously broken, but in theory: the owners only own it because they compensate the authors, which they only do out of an expectation of future profit (on average).
That theory's a fantasy, because extractive systems involving gatekeepers get established, but in this specific case, enforcing copyright would make things fairer for authors. There's no extractive copyright-taking gatekeeper for websites: scrapers don't get copyright, so can't re-license the material they've scraped (unless it's permissively-licensed or something).
I'd still get most of my dataset from torrents, but I could pay for specific things like high-quality source code.
I suppose you wouldn't be able to use it for external services, but internally, I'm sure you can find some books that fell off the back of a truck...
No reason you can't go external. GPT was trained using ebook torrent sites
OpenAI has enough money to hire lawyers to defend it until the end of time though
Someone will come along and say "Why don't you just mirror Anna's Archive?" in 3...2...1...
I think between Anna's Archive, fineweb and as many github repos as you can scrape you can get a pretty decent dataset.
I doubt Anna's Archive would produce a good model on its own though.
Well, here is a comment on the 4090: https://github.com/karpathy/llm.c/discussions/481#discussion...
25% MFU :( maybe because of the P2P nerf?
That is a much bigger model (500M), and P2P is enabled via Mailbox. The lower MFU is expected because of the memory-to-compute ratio.
Can you elaborate?
Maybe get a 7900 XTX. 122 TFLOPS of BF16/FP16 for less than $1k and I'm getting 55.4% MFU
That's not an apples-to-apples comparison, as this is running across GPUs on a much bigger model.
We won't ever get there, or need to, because GPT-4 wasn't trained on one GPU; it was trained on thousands. The (most likely) biggest meaningful differences between GPT-2 and GPT-4 are the number of parameters and the training data/duration. I don't think you'd really learn much more.
It’s not about learning. It’s about owning. Exactly the reason OpenAI stopped being open. Having GPT-4-quality LLMs created by anyone with a gaming PC would be pretty radical.
And you won’t get there. Those models are far too large for a 2024 GPU. Llama-3 70b is arguably close to GPT-4 but is still too large for gaming GPUs (and probably for many years of GPU updates)
“You won’t get there” is a pretty vast statement for all of the future. Two fairly reasonable predictions: 1) the compute needed to get GPT4 performance will decrease. 2) the compute on consumer GPUs will increase.
At some point they cross, and you will be able to run a GPT4-quality LLM on a consumer GPU. At some point after that, you’ll be able to run a GPT4-quality LLM on a 2024 consumer GPU if you can find one.
Important to emphasize, I’m not saying “GPT-4”. Llama-3 was trained on 24k GPU clusters. “Able to do the exact same processing at 1/24k the compute” is different from “Able to get equivalent performance at 1/24k compute”. Even then, given a long enough time scale, the former is possible.
> 1) the compute needed to get GPT4 performance will decrease. 2) the compute on consumer GPUs will increase.
I’m assuming we’re just talking inference here…
Sure, compute on consumer GPUs will increase, but the original comment had a fixed GPU: the 4090. I can already eke out Llama 3 8B on my MacBook Air, and Apple will sell you a laptop capable of running the full-sized Llama.
There is a direct correlation between parameter count and "knowledge" for an LM. There are some open questions as to density (Llama 3 specifically challenged previous assumptions), but it seems implausible to fit a model equivalent to GPT-4 into 24 GB of VRAM. Just like compression, you can't shrink forever.
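A rough capacity sanity check on the 24 GB point (a hedged C sketch; the precisions are illustrative, and it ignores KV cache and activation memory entirely):

    #include <stdio.h>

    int main(void) {
        double vram_bytes = 24e9;  // e.g. a 4090 or 7900 XTX
        const char *prec[] = {"fp32", "fp16/bf16", "int8", "~4-bit"};
        double bytes_per_param[] = {4.0, 2.0, 1.0, 0.5};
        for (int i = 0; i < 4; i++) {
            // upper bound on parameter count: weights only, nothing else resident
            printf("%-9s: at most ~%.0fB params in 24 GB\n",
                   prec[i], vram_bytes / bytes_per_param[i] / 1e9);
        }
        return 0;
    }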
GPT-4 and GPT-2 are pretty similar architecturally (I assume). So if abilities don’t matter, we can already run GPT-2 so we’re basically there for 4.
FWIW, I'm seeing ~318,000 toks/sec throughput on a 4x AMD 7900 XTX machine (less than $4k worth of GPU), using the same settings as in the post (0.5M batch size etc).
Did you reproduce the evaluation as well?
So... successfully reproduced in ~8.75 hours, taking about 18 kWh / $2.70
The first run actually failed at step 3000 or so, and I realized I had a bug in my attention / matmul kernels, but after fixing that and restarting it worked great
It converges similarly on smaller datasets.
About to kick off a from-scratch training run on the same fineweb-10B, which at 324k toks/sec should take about 8.6 hours. And at my kWh cost, that's about $2.50 to train.
Will report back tomorrow when the training has finished..
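The arithmetic behind those numbers, for anyone who wants to plug in their own rates (a small C sketch; the ~2 kW draw and ~$0.15/kWh are back-solved from the 18 kWh / $2.70 figures above, not measured separately):

    #include <stdio.h>

    int main(void) {
        double tokens      = 10e9;    // fineweb-10B
        double toks_per_s  = 324e3;   // reported throughput on 4x 7900 XTX
        double watts       = 2000.0;  // assumed whole-box draw (back-solved)
        double usd_per_kwh = 0.15;    // assumed electricity rate (back-solved)

        double hours = tokens / toks_per_s / 3600.0;
        double kwh   = hours * watts / 1000.0;
        printf("time: %.1f h, energy: %.1f kWh, cost: $%.2f\n",
               hours, kwh, kwh * usd_per_kwh);  // ~8.6 h, ~17 kWh, ~$2.6
        return 0;
    }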
I'm not saying this to be rude, but I think you have a deep misunderstanding of how AI training works. You cannot just skip the matrix multiplications necessary to train the model, or get current hardware to do it faster.
Was the first sentence really necessary? The second sentence seems fine by itself.
There's work on replacing multiplication. Here are a few examples:
https://openaccess.thecvf.com/content_CVPR_2020/papers/Chen_...
https://arxiv.org/abs/2012.03458
https://openaccess.thecvf.com/content/CVPR2021W/MAI/papers/E...
No offence taken! As far as my (shallow!) understanding goes, the main challenge is the need for many GPUs with huge amounts of memory, and it still takes ages to train the model. So regarding the use of consumer GPUs, some work has been done already, and I've seen some setups where people combine a few of these and are successful. As for the other aspects, maybe at some point we distill what is really needed into a smaller but excellent dataset that would give similar results in the final models.
Considering it takes 8x A100 GPUs (80GB VRAM) to train GPT-2, I think it'll take far more than a single 4090.
The RTX 4090 has about the same BF16 Tensor Core TOPS as the A100. Assuming 50% MFU (like the A100 40 GB PCIe), it would take 8x longer on 1 RTX 4090 vs 8x A100 80GB SXM, so about 12 hours. Datasheet here for the TOPS: https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvid... 50% MFU should be achievable on the 4090.
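Making that estimate explicit (a C sketch; the ~1.5 h for 8x A100 is the figure from the post, and the key assumption, per the comment above, is roughly equal per-GPU throughput at ~50% MFU):

    #include <stdio.h>

    int main(void) {
        double hours_8x_a100 = 1.5;  // GPT-2 (124M) on 10B tokens, from the post
        int    num_gpus      = 8;
        // assumption: one 4090 at ~50% MFU roughly matches one A100
        double hours_1x_4090 = hours_8x_a100 * num_gpus;
        printf("estimated single-4090 time: ~%.0f hours\n", hours_1x_4090); // ~12
        return 0;
    }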
Nah, I reproduced on a 4x 7900 XTX machine in 8.75 hours, so a single 7900 XTX (costs less than $1k) could do it in under 24 hours. I was hitting 55.4% MFU.
Dude, I'm hoping we get rid of Nvidia completely. I can run llama.cpp inference on a 7B model on my 24-core Intel machine using just the CPU; it only uses about 4 GB of RAM and is not that slow. If we could have massively parallel ARM or even RISC-V machines without the CUDA proprietary-driver hell, it would be much more open source, and much less wonkage for the normie user.
In your opinion is it important for ML engineers to know C?
You'd have to be deep into ML infrastructure to use C, probably via CUDA. No one who develops or uses ML models touches C or even C++; tinygrad and llama.cpp are exceptions.
0% chance
Spend one year studying multiple languages: bash, C, C++, Go, Python... and even Mojo or Rust. 10-20 hours a week. Being able to read the top programming languages is the best investment I ever made. You will become fearless and can see the matrix ;)
Looks like this is re: training, but I wonder how inference with this model would go on some garbage older machine with no GPU?
Last time I tried GPT-2 on CPU (which I think was shortly before ChatGPT was launched), I was getting about 0.2 tokens/sec. CPU utilization was low though, so running inference in parallel gave better results. I was using 2x E5-2660s.
DDR5 helps a lot. You can actually run stuff like LLaMA at >1 tok/s on the CPU with high-end gaming hardware these days.
I have a 24-core Intel CPU and llama.cpp runs Llama 3 surprisingly fast in surprisingly little RAM. Yes, it becomes a space heater, but there's light at the end of the CUDA-free tunnel.
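Rough intuition for why memory speed dominates CPU inference: each generated token streams essentially all of the weights through RAM once, so tokens/sec is bounded by memory bandwidth divided by model size. A sketch with assumed bandwidth figures (these vary a lot by platform):

    #include <stdio.h>

    int main(void) {
        double model_bytes = 7e9 * 0.5;  // ~7B model at ~4-bit quantization (assumed)
        double ddr4_bw = 50e9;           // assumed dual-channel DDR4, bytes/s
        double ddr5_bw = 90e9;           // assumed dual-channel DDR5, bytes/s
        // upper bounds: the full weights are read once per generated token
        printf("DDR4: ~%.0f tok/s peak\n", ddr4_bw / model_bytes);
        printf("DDR5: ~%.0f tok/s peak\n", ddr5_bw / model_bytes);
        return 0;
    }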
Is this the sort of thing that a person with curiosity and a 4090 could do? It says he used 8xA100s in the cloud to do this but is it just a matter of the 4090 going 8x slower or will memory constraints kill the whole endeavour?
A 4090 should have enough VRAM for 124M-param training. Even at float32 precision with the AdamW optimizer, the parameters and optimizer state should only be ~2GB (124M params x 4 bytes per param x ~4 for optimizer weight overhead), so there should be plenty of remaining space for activations.
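Spelling that estimate out (a sketch; activation memory depends on batch size and sequence length, so it isn't counted here):

    #include <stdio.h>

    int main(void) {
        double params = 124e6;
        double bytes_per_value = 4.0;  // fp32
        double copies = 4.0;           // weights + grads + AdamW m and v moments
        printf("weights + grads + optimizer state: ~%.1f GB\n",
               params * bytes_per_value * copies / 1e9);  // ~2 GB
        return 0;
    }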
Hi Andrej,
Huge fan of all the work you do. I wanted to understand something fundamental, and who better to ask than you: what's so special about the transformer architecture that it's able to predict the next token so beautifully, understanding all the intricate previous-token relationships? I understand attention, but what's so special about this architecture that no other architectures are able to "attend" appropriately to previous tokens? Being a CS guy, it's really hard for me to fathom that we have not yet created another architecture which can perform similarly.
Transformers have quadratic computational complexity in sequence length, i.e. O(N^2) where N is the sequence length. RNNs, Linformer, Mamba, etc. have linear or quasi-linear computational complexity in sequence length, which often bottlenecks information movement across tokens.
In theory, if you grew the RNN's state quadratically vs sequence length, you could likely achieve comparable performance to transformers, but it would likely be less efficient than transformers.
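To see where the quadratic term comes from: every query position attends to every key position, so the score matrix alone is N x N. A minimal single-head, no-batching C sketch of just that part:

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    // scores[i][j] = q_i . k_j / sqrt(d): the N*N work that makes attention O(N^2)
    void attention_scores(const float *Q, const float *K, float *scores, int N, int d) {
        float scale = 1.0f / sqrtf((float)d);
        for (int i = 0; i < N; i++) {        // N queries...
            for (int j = 0; j < N; j++) {    // ...times N keys
                float dot = 0.0f;
                for (int c = 0; c < d; c++) dot += Q[i * d + c] * K[j * d + c];
                scores[i * N + j] = dot * scale;
            }
        }
    }

    int main(void) {
        int N = 4, d = 8;
        float *Q = calloc(N * d, sizeof(float));
        float *K = calloc(N * d, sizeof(float));
        float *S = calloc(N * N, sizeof(float));
        Q[0] = K[0] = 1.0f;  // toy values
        attention_scores(Q, K, S, N, d);
        printf("score[0][0] = %.3f\n", S[0]);  // 1/sqrt(8) ~= 0.354
        free(Q); free(K); free(S);
        return 0;
    }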
the code works well on H100: https://x.com/Yuchenj_UW/status/1795554739633221804
Thank you for your work Andrej! <3
Do you have plans to create videos for the llm.c?
This is awesome!
We need a series on how to build llm.c from scratch. Any volunteers?
:-)
Time for llm videos!
Can it be done in Rust?
Is there a reason you're not trying to port this into an even more stack agnostic world without CUDA?
Andrej Karpathy is a magician!
But being the coolest kid on the block with a pure C/CUDA implementation is not enough: https://github.com/karpathy/llm.c
Studying the source code of a baby Llama 2 model in pure Mojo is the next level: https://github.com/tairov/llama2.mojo https://github.com/tairov/llama2.mojo/blob/master/llama2.moj...
Mojo Lang - Tomorrow's High Performance Python? (with Chris Lattner) https://www.youtube.com/watch?v=JRcXUuQYR90
Andrej Karpathy and Chris Lattner collab is on my wishlist ;)
Ok, we've changed the URL to that from https://twitter.com/karpathy/status/1795484547267834137 above. Thanks!
Sounds good. Both work, though I think HN has a bit of an anti-Twitter bias.
First, love the videos and other work you've been doing. The micrograd videos are a great way to show people this is all math in the end, and I've linked to specific timestamps in that video and others more times than I can count.
For why I think we have an anti-Twitter bias...
Twitter doesn't show replies or any further context without being logged in. Most people will have accounts but I know a lot here deleted theirs or refuse to use it for one reason or another.
Also IMO most here are going to want to read the full source so it just cuts out the middleman. This would usually fall under the "Please submit the original source. If a post reports on something found on another site, submit the latter." guideline which is a little different since the source is yourself, but still the Twitter post doesn't add anything new or novel.
FWIW I totally understand the sentiment! It's actually a bit sad to me that so much of our content is moving from the shared, open web to platforms like Twitter. Unfortunately there seems to be too much value-add around built-in discoverability, comments, ease of authoring, and (for many people) revenue sharing, etc.
Yes, definitely. I had to double check your age (apologies! feels rude somehow) and yep, we're basically the same age. The web was different back then. Maybe not better; maybe that's nostalgia. But never before have creators had as many tools and avenues to promote and monetize their work as they do now.
I agree - Twitter is still the primary source for a lot of original work and original thoughts. Unfortunately it's gotten more complicated because (1) the threads there have gotten less accessible and (2) some people have assigned the entire site to one side of the culture war.
Could you mention what the link has been changed from, too? Sometimes it helps with context when reading the comments. Thanks!
I agree that it helps! But I did mention it, no? Admittedly, "to that from" is a bit of an awkward construction.
*facepalm* I'd had a few whiskies and misread your comment. Sorry about that!
Thank you for the effort you put into your educational work, it helped me and others a lot! In fact, I'm training my nanoGPT version right now. :)
Also, it's awesome that you spend your time on your passion.
Any plans on making a video series on llm.c? :D
Yes definitely. Related tweet of mine:
https://x.com/karpathy/status/1760388761349927356?lang=en
1. Build the thing
2. Build the ramp
Currently on step 1 :). It helps to build it first so you know where you are going, and then you can more easily re-build it once your vector is pointed at the end result.
That's fantastic. My gradient field is pointing towards it.
Thank you again!
Every time you take gardening leave, you build something new and interesting!
I love when you leave your job.
Hi Andrej!
First, thank you for your teaching, it has helped me a lot, didn't think I'd ever have the chance to say thank you, but here you are and I hope this gets to you!
Question: what's a relevant (05-2024) baseline to compare the performance of the C code to? Back when you made nanoGPT you were seeing "the file train.py reproduces GPT-2 (124M) on OpenWebText, running on a single 8XA100 40GB node in about 4 days of training". So twice the memory on the C node, but I'm unsure of the data size/epochs and any other details I may be missing. I.e., what's the net uplift of running C vs "legacy" torch code?
Thanks again for everything.
The baseline is definitely PyTorch (or JAX), and indeed something like nanoGPT. I just never got nanoGPT "past the finish line" of really crossing the t's and dotting the i's and reproducing the models with as much care as I did now and here in llm.c, and getting to the point where it's a single launch command that just does the thing.
I think I'll try to develop the `train_gpt2.py` inside llm.c to be that, so that we have the two implementations exactly side by side, and it's all nice and comparable.
The C/CUDA code is currently a little bit faster than PyTorch (last time I measured ~2 weeks ago it was about 6% faster), and I think we can push this further. This is done by manually hard-coding a bunch of fusions/optimizations that are non-trivial for torch.compile to find (e.g. our FusedClassifier). But PyTorch has some pending work/PRs that will also speed up their side a lot.
Ultimately my interest in llm.c is to have a nice, clean, minimal, super dependency-light repo in direct C/CUDA implementation, which I find aesthetically pleasing. And on top of that, educational, i.e. using all of the above as an endpoint of an intro LLM course.
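For a sense of what a fusion like the FusedClassifier buys: instead of materializing the full softmax probabilities, writing them to memory, and reading them back for the loss and the gradient, all three can be done in one pass over the logits. A simplified single-threaded CPU sketch of the idea (not llm.c's actual GPU kernel):

    #include <stdio.h>
    #include <math.h>

    // One row of logits -> cross-entropy loss and dloss/dlogits,
    // without storing a separate probability tensor between "kernels".
    float fused_softmax_xent(const float *logits, int V, int target, float *dlogits) {
        float maxv = logits[0];
        for (int i = 1; i < V; i++) if (logits[i] > maxv) maxv = logits[i];
        float sum = 0.0f;
        for (int i = 0; i < V; i++) sum += expf(logits[i] - maxv);
        float loss = logf(sum) - (logits[target] - maxv);  // -log softmax[target]
        for (int i = 0; i < V; i++) {
            float p = expf(logits[i] - maxv) / sum;
            dlogits[i] = p - (i == target ? 1.0f : 0.0f);  // softmax - one_hot
        }
        return loss;
    }

    int main(void) {
        float logits[4] = {2.0f, 1.0f, 0.5f, -1.0f};
        float dlogits[4];
        float loss = fused_softmax_xent(logits, 4, 0, dlogits);
        printf("loss = %.4f, dlogits[0] = %.4f\n", loss, dlogits[0]);
        return 0;
    }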
Maybe talk to MasterClass...
Just out of curiosity, how do you feel about Tinygrad? They just released 0.9 and are also on the HN home page today.
How big of a perf improvement would result from using the architectural tweaks that Llama3 and others have put in place since GPT-2?
My understanding and suspicion is that it's mostly less than you'd think. The Llama 3 architecture has the following changes on top of GPT-2:
1. delete the absolute positional encoding and replace with RoPE
2. delete all biases in all layers (in LayerNorms, they turn into RMSNorm)
3. GeLU -> SwiGLU non-linearity in the MLP
4. longer context length
5. architecture hyperparameter changes, e.g. slightly different aspect ratios
And there was a paper that I can't find the reference to anymore that claimed that if you train long enough, the gap becomes even smaller. Possibly because the absolute positional encoding has enough time to train more fully, whereas the RoPE layer benefits from the "inductive bias" it adds in the earlier stages of training.
But I don't have full confidence on the above claim, maybe someone has tried or has better/concrete reference.
Note that Llama's feed-forward is a bit different too:
I.e. the nonlinearity is a gate: https://github.com/meta-llama/llama3/blob/14aab0428d3ec3a959...
FWIW, that's SwiGLU in #3 above. Swi = Swish = SiLU; GLU is a gated linear unit, the gate construction you describe.
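Concretely, a per-position SwiGLU MLP looks roughly like the sketch below (a hedged toy in C; dimensions and weight layout are illustrative, and the Llama 3 code linked above is the actual reference):

    #include <stdio.h>
    #include <math.h>

    static float silu(float x) { return x / (1.0f + expf(-x)); }  // Swish, beta = 1

    // GPT-2 style MLP:  y = W2 * gelu(W1 * x)
    // SwiGLU MLP:       y = W2 * ( silu(W1 * x) * (W3 * x) )   <- elementwise gate
    void swiglu_mlp(const float *x, int d, int h,
                    const float *W1, const float *W3, const float *W2, float *y) {
        float hid[8];  // toy fixed buffer, assumes h <= 8
        for (int i = 0; i < h; i++) {
            float a = 0.0f, b = 0.0f;
            for (int j = 0; j < d; j++) {
                a += W1[i * d + j] * x[j];  // path that goes through silu
                b += W3[i * d + j] * x[j];  // linear path
            }
            hid[i] = silu(a) * b;           // the elementwise product is the "gate"
        }
        for (int o = 0; o < d; o++) {
            float acc = 0.0f;
            for (int i = 0; i < h; i++) acc += W2[o * h + i] * hid[i];
            y[o] = acc;                     // project back down to model dim
        }
    }

    int main(void) {
        int d = 2, h = 4;
        float x[2]  = {1.0f, -0.5f};
        float W1[8] = {0.1f, 0.2f, -0.3f, 0.4f, 0.5f, -0.1f, 0.2f, 0.3f};
        float W3[8] = {0.2f, -0.2f, 0.1f, 0.1f, -0.4f, 0.3f, 0.0f, 0.2f};
        float W2[8] = {0.3f, -0.1f, 0.2f, 0.1f, -0.2f, 0.4f, 0.1f, -0.3f};
        float y[2];
        swiglu_mlp(x, d, h, W1, W3, W2, y);
        printf("y = [%.4f, %.4f]\n", y[0], y[1]);
        return 0;
    }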
Would you consider switching your interest to protein structure prediction? In particular, the current most advanced model is a closed-source, closed-weights system that was trained on proprietary hardware. It is intentionally kept that way for now to enable DeepMind to commercialize their product.
The goal here isn't to make the best-performing model: it's ablation. How much can we remove from protein structure prediction (such as multiple sequence alignments and molecular dynamics, which were two improvements in AF3) while still having a generalized model that can predict novel folds?
Then focus on teaching the minimal necessary math and code to reproduce the results to the larger biological community. All I can say about AF3 is that it literally taught me that everything I learned about protein structure prediction in the last 30 years was misguided, or outright wrong.
Don't worry about drug discovery or any of the hard stuff. Just continue to show that all that's required to predict novel structures is the existing PDB.
That's not usually how it works.
Sounds like you know a lot about this topic. You should do it!
Yes I already published several papers in the area, but I don't work on it any more.
lol I appreciate your effort to guide his genius towards 'max human good'
You might have covered this topic before, but I'm curious about the main performance differences between nanoGPT and llm.c. I'm planning to take your "Zero to Hero" course, and I'd like to know how capable the nanoGPT chatbot you'll build is. Is its quality comparable to GPT-2 when used as a chatbot?
Zero To Hero doesn't make it all the way to a chatbot; it stops at pretraining, and even that only at a fairly small scale, with a character-level transformer on TinyShakespeare. I think it's a good conceptual intro but you don't get too far as a competent chatbot. I think I should be able to improve on this soon.
Please do! It's a fantastic series!
Thanks! So, you are considering expanding the Zero to Hero series to include building a basic GPT-2 toy chatbot? I believe you mentioned in one of the early lectures that you planned to include building a toy version of Dalle. Do you still have plans for that as well?
Do you think grokking leads to proper generalized reasoning? https://arxiv.org/abs/2405.15071
Any tips on understanding grokking? I’m not following that paper.
Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets. Keep training well past the point of overfitting, stay patient, and some new generalization behavior might emerge.
How large is the set of binaries needed to do this training job? The current pytorch + CUDA ecosystem is so incredibly gigantic and manipulating those container images is painful because they are so large. I was hopeful that this would be the beginnings of a much smaller training/fine-tuning stack?
That is 100% my intention and hope, and I think we are very close to deleting all of that. Right now on master, I am already only using Python for the tokenization preprocessing. In principle the requirements for llm.c should be extremely minimal. I think this is a few days of work that is high on my mind.
Biggest problem right now is finding a place that can host the 135GB of tokens for FineWeb100B. Will probably use S3 or something.
Related see: https://github.com/karpathy/llm.c/issues/482
Could this be a good case for a torrent?
Andrej, based on that, do you have a rough cost estimate for what it would take to train a GPT-3 Ada (350M)? Do you plan to get there with llm.c?
The 350M model I trained last night was 30B tokens, 14 hours, ~$200. Conveniently, 300B is exactly 10X the tokens so ~$2K would be the estimate. You'd have to wait 140 hours on one box though. Getting an H100 box instead of A100 will already cut the time latency down probably by a factor of 2-3X, for free, even without going to fp8 (which we do plan to support).
So TLDR at this model scale, llm.c is already there functionally, I think, it's a matter of the compute resources and patience. I currently have this one box from Lambda and I have to look around for a few more boxes and merge the pending PR for multi-node training support. Getting all of this into a nice, stable state is probably a good chunk of the pending work right now.
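In code, the scaling estimate above (the measured numbers are from the reply; the H100 factor is just the rough 2-3x guess, taken at its midpoint):

    #include <stdio.h>

    int main(void) {
        // measured: GPT-2 350M, 30B tokens, one 8x A100 box
        double base_tokens = 30e9, base_hours = 14.0, base_usd = 200.0;
        double target_tokens = 300e9;  // GPT-3 Ada-style token budget
        double scale = target_tokens / base_tokens;  // 10x

        printf("A100 box: ~%.0f h, ~$%.0f\n", base_hours * scale, base_usd * scale);
        double h100_speedup = 2.5;     // assumed midpoint of the "2-3x" guess
        printf("H100 box: ~%.0f h (rough)\n", base_hours * scale / h100_speedup);
        return 0;
    }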
Why write in CUDA and not just use PyTorch etc?
If performance, how much faster is it, out of curiosity?
“LLM training in simple, pure C/CUDA. There is no need for 245MB of PyTorch or 107MB of cPython. […] A few more words on what I want this repo to be: First, I want llm.c to be a place for education.”
Hi. Is it possible to somehow run llm.c on an AMD GPU?
Yeah, I just reproduced the GPT-2 from-scratch results in 8.75 hours on 4x 7900 XTX. The fork is here: https://github.com/anthonix/llm.c
So, nanoGPT took 1.8 days on 8x A100 for 124M model training on 30.7B tokens using flash attention. That would translate to ~14.4 hr for 10B tokens. With llm.c it is ~1.5 hr, which is almost a 10X speedup!
Does this look ballpark correct? Is there any summary of where the majority of this improvement comes from?
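Checking that arithmetic with just the numbers quoted in this thread (a C sketch; note it ignores hardware and batch-size differences between the two runs):

    #include <stdio.h>

    int main(void) {
        double nano_hours_30B = 1.8 * 24.0;                      // 1.8 days on 8x A100
        double nano_hours_10B = nano_hours_30B * (10.0 / 30.7);  // ~14 h for 10B tokens
        double llmc_hours_10B = 1.5;                             // quoted for llm.c
        printf("nanoGPT ~%.1f h vs llm.c ~%.1f h -> ~%.1fx\n",
               nano_hours_10B, llmc_hours_10B, nano_hours_10B / llmc_hours_10B);
        return 0;
    }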
Like the FAQ, you correctly anticipated my questions.
Thank you, from an appreciative reader!