HN comments for: Reproducing GPT-2 in llm.c

karpathy

45 replies

1d1h

2024-05-28 16:44:43 UTC

Hi HN the main (more detailed) article is here https://github.com/karpathy/llm.c/discussions/481

Happy to answer questions!

dang

8 replies

23h7m

2024-05-28 19:21:01 UTC

Ok, we've changed the URL to that from https://twitter.com/karpathy/status/1795484547267834137 above. Thanks!

karpathy

4 replies

22h47m

2024-05-28 19:41:14 UTC

sounds good. both work, (though) I think HN has a bit of an anti-twitter bias.

pests

2 replies

19h37m

2024-05-28 22:51:09 UTC

First, love the videos and other work you've been doing. The micrograd videos are a great way to show people this is all math in the end, and I've linked to specific timestamps in that video and others more times than I can count.

For why I think we have a anti-twitter bias...

Twitter doesn't show replies or any further context without being logged in. Most people will have accounts but I know a lot here deleted theirs or refuse to use it for one reason or another.

Also IMO most here are going to want to read the full source so it just cuts out the middleman. This would usually fall under the "Please submit the original source. If a post reports on something found on another site, submit the latter." guideline which is a little different since the source is yourself, but still the Twitter post doesn't add anything new or novel.

karpathy

1 replies

17h4m

2024-05-29 01:23:53 UTC

fwiw I totally understand the sentiment! it's actually a bit sad to me that so much of our content is moving from the shared, open web to platforms like twitter, unfortunately there seems to be too much value add around built-in discoverability, comments, ease of authoring, for many people revenue sharing, etc.

pests

0 replies

14h16m

2024-05-29 04:12:25 UTC

Yes, definitely. I had to double check your age (apologies! feels rude somehow) and yep, we're basically the same age. The web was different back then. Maybe not better; maybe that's nostalgia. But never before has more creators had as many tools and avenues to promote and monotonize their work as they do now.

dang

0 replies

14h23m

2024-05-29 04:04:56 UTC

I agree - Twitter is still the primary source for a lot of original work and original thoughts. Unfortunately it's gotten more complicated because (1) the threads there have gotten less accessible and (2) some people have assigned the entire site to one side of the culture war.

wrboyce

2 replies

19h20m

2024-05-28 23:08:01 UTC

Could you mention what the link has been changed from too? Sometimes it helps with context when reading the comments. Thanks!

dang

1 replies

14h25m

2024-05-29 04:03:00 UTC

I agree that it helps! but I did mention it, no? Admittedly "to that from" is a bit of an awkward construction

wrboyce

0 replies

9h16m

2024-05-29 09:12:10 UTC

facepalm I’d had a few whiskies and misread your comment. Sorry about that!

lagrange77

4 replies

2024-05-28 17:44:20 UTC

Thank you for the effort you put in your educational work, it helped me and others a lot! In fact, i'm training my nanoGPT version right now. :)

Ultimately my interest in llm.c is to have a nice, clean, minimal, super dependency-light repo in direct C/CUDA implementation, which I find aesthetically pleasing.

Also, it's awesome that you spend your time on your passion.

Any plans on making a video series on llm.c? :D

karpathy

3 replies

2024-05-28 17:56:46 UTC

Yes definitely. Related tweet of mine:

https://x.com/karpathy/status/1760388761349927356?lang=en

1. Build the thing

2. Build the ramp

Currently on step 1 :). It helps to build it first so you know where you are going, and then you can more easily re-build it when you're vector pointed at the end result.

lagrange77

0 replies

2024-05-28 18:15:28 UTC

That's fantastic. My gradient field is pointing towards it.

Thank you again!

htrp

0 replies

23h56m

2024-05-28 18:32:37 UTC

Everytime you take gardening leave, you build something new and interesting!

LorenzoGood

0 replies

15h25m

2024-05-29 03:03:25 UTC

I love when you leave your job.

ngiyabonga

3 replies

1d1h

2024-05-28 17:01:52 UTC

Hi Andrej!

First, thank you for your teaching, it has helped me a lot, didn't think I'd ever have the chance to say thank you, but here you are and I hope this gets to you!

Question - what's a relevant (05-2024) baseline to compare the performance of c code to? Back when you made nanoGPT you were seeing "the file train.py reproduces GPT-2 (124M) on OpenWebText, running on a single 8XA100 40GB node in about 4 days of training". So twice the memory on the c node, but unsure of data size /epochs, any other details I may be missing. I.e. what's the net uplift of running c vs "legacy" torch code?

Thanks again for everything.

karpathy

2 replies

1d1h

2024-05-28 17:18:26 UTC

The baseline is definitely PyTorch (or JAX), and indeed something like nanoGPT. I just never got nanoGPT "past the finish line" of really crossing the t's and dotting the i's and reproducing the models with as much care as I did now and here in llm.c, and getting to the point where it's a single launch command that just does the thing.

I think I'll try to develop the `train_gpt2.py` inside llm.c to be that, so that we have the two implementations exactly side by side, and it's all nice and comparable.

The C/CUDA code is currently a little bit faster than PyTorch (last time I measured ~2 weeks ago it was about 6% faster), and I think we can push this further. This is done by manually hard-coding a bunch of fusions/optimizations that are non-trivial for torch.compile to find (e.g. our FusedClassifier). But PyTorch has some pending work/PRs that will also speed up their side a lot.

Ultimately my interest in llm.c is to have a nice, clean, minimal, super dependency-light repo in direct C/CUDA implementation, which I find aesthetically pleasing. And on top of that, educational, i.e. using all of the above as an endpoint of an intro LLM course.

raymond_goo

0 replies

19h39m

2024-05-28 22:49:14 UTC

Maybe talk to MasterClass...

ilaksh

0 replies

19h59m

2024-05-28 22:29:00 UTC

Just out of curiosity, how do you feel about Tinygrad? They just released 0.9 and are also on the HN home page today.

espadrine

3 replies

1d1h

2024-05-28 17:19:31 UTC

How big of a perf improvement would result from using the architectural tweaks that Llama3 and others have put in place since GPT-2?

karpathy

2 replies

2024-05-28 17:41:35 UTC

My understanding and suspicion is mostly less than you think. Llama 3 architecture has the following changes on GPT-2:

1. delete the absolute positional encoding and replace with RoPE

2. delete all biases in all layers (in LayerNorms, they turn into RMSNorm)

3. GeLU -> SwiGLU non-linearity in the MLP

4. longer context length

5. architecture hyperparameter changes, e.g. slightly different aspect ratios

And there was a paper that I can't find the reference to anymore that claimed that if you train long enough, the gap becomes even lower. Possibly because the absolutely positional encoding has enough time to train more fully, where as the RoPE layer benefits from the "inductive bias" it adds in the earlier stages of training.

But I don't have full confidence on the above claim, maybe someone has tried or has better/concrete reference.

jorlow

1 replies

22h38m

2024-05-28 19:50:26 UTC

Note llama's feed forward is a bit different too:

  self.w2(F.silu(self.w1(x)) * self.w3(x))

I.e. the nonlinearity is a gate.

https://github.com/meta-llama/llama3/blob/14aab0428d3ec3a959...

soraki_soladead

0 replies

14h12m

2024-05-29 04:15:39 UTC

Fwiw, that's SwiGLU in #3 above. Swi = Swish = silu. GLU is gated linear unit; the gate construction you describe.

dekhn

3 replies

18h53m

2024-05-28 23:35:22 UTC

Would you consider switching your interest to protein structure prediction? In particular, the current most advanced model is a closed-source, closed-weights system that was trained on a proprietary hardware. It is intentionally kept that way for now to enable deepmind to commercialize their product.

The goal here isn't to make the best performing model: it's ablation. How much can we remove from protein structure prediction (such as multiple sequence alignments and molecular dynamics, which were two improvements in AF3), while still having a generalized model that can predict novel folds.

Then focus on teaching the minimal necessary math and code to reproduce the results to the larger biological community. All I can say about AF3 is that it literally taught me that everything I learned about protein structure prediction in the last 30 years was misguided, or outright wrong.

Don't worry about drug discovery or any of the hard stuff. Just continue to show that all that's required to predict novel structures is the existing PDB.

wizzwizz4

1 replies

48m

2024-05-29 17:39:57 UTC

switching your interest

That's not usually how it works.

Just continue to show that all that's required to predict novel structures is the existing PDB.

Sounds like you know a lot about this topic. You should do it!

dekhn

0 replies

35m

2024-05-29 17:52:57 UTC

Yes I already published several papers in the area, but I don't work on it any more.

treme

0 replies

6h35m

2024-05-29 11:53:03 UTC

lol I appreciate your effort to guide his genius towards 'max human good'

363849473754

3 replies

2024-05-28 17:57:18 UTC

You might have covered this topic before, but I'm curious about the main performance differences between nanoGPT and llm.c. I'm planning to take your "Zero to Hero" course, and I'd like to know how capable the nanoGPT chatbot you'll build is. Is its quality comparable to GPT-2 when used as a chatbot?

karpathy

2 replies

2024-05-28 18:09:27 UTC

Zero To Hero doesn't make it all the way to a chatbot, it stops at pretraining, and even that at a fairly small scale or character-level transformer on TinyShakespeare. I think it's a good conceptual intro but you don't get too too far as a competent chatbot. I think I should be able to improve on this soon.

maskil

0 replies

22h49m

2024-05-28 19:39:13 UTC

Please do! It's a fantastic series!

363849473754

0 replies

23h24m

2024-05-28 19:04:09 UTC

Thanks! So, you are considering expanding the Zero to Hero series to include building a basic GPT-2 toy chatbot? I believe you mentioned in one of the early lectures that you planned to include building a toy version of Dalle. Do you still have plans for that as well?

sturza

2 replies

1d1h

2024-05-28 17:07:57 UTC

Do you think grokking leads to proper generalized reasoning? https://arxiv.org/abs/2405.15071

bilsbie

1 replies

2024-05-28 18:04:59 UTC

Any tips on understanding grokking? I’m not following that paper.

sturza

0 replies

2024-05-28 18:24:16 UTC

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets. Overfitting and being cool about it and some new behavior might emerge.

localhost

2 replies

22h19m

2024-05-28 20:08:41 UTC

How large is the set of binaries needed to do this training job? The current pytorch + CUDA ecosystem is so incredibly gigantic and manipulating those container images is painful because they are so large. I was hopeful that this would be the beginnings of a much smaller training/fine-tuning stack?

karpathy

1 replies

21h57m

2024-05-28 20:31:22 UTC

That is 100% my intention and hope and I think we are very close to deleting all of that. Right now on master, I am already only using Python for the tokenization preprocessing. In principle the requirements for llm.c should be extremely minimal. I think this a few days of work that is high on my mind.

Biggest problem right now is finding a place that can host the 135GB of tokens for FineWeb100B. Will probably use S3 or something.

metadat

0 replies

19h2m

2024-05-28 23:26:00 UTC

Could this be a good case for a torrent?

simonw

1 replies

22h31m

2024-05-28 19:57:22 UTC

Keep in mind that here we trained for 10B tokens, while GPT-3 models were all trained for 300B tokens. [...] GPT-3 actually didn't change too much at all about the model (context size 1024 -> 2048, I think that's it?).

Andrej, based on that do you have a rough cost estimate for what it would take to train a GPT-3 Ada (350M)? Do you plan to get there with llm.c ?

karpathy

0 replies

22h26m

2024-05-28 20:02:18 UTC

The 350M model I trained last night was 30B tokens, 14 hours, ~$200. Conveniently, 300B is exactly 10X the tokens so ~$2K would be the estimate. You'd have to wait 140 hours on one box though. Getting an H100 box instead of A100 will already cut the time latency down probably by a factor of 2-3X, for free, even without going to fp8 (which we do plan to support).

So TLDR at this model scale, llm.c is already there functionally, I think, it's a matter of the compute resources and patience. I currently have this one box from Lambda and I have to look around for a few more boxes and merge the pending PR for multi-node training support. Getting all of this into a nice, stable state is probably a good chunk of the pending work right now.

m11a

1 replies

22h57m

2024-05-28 19:31:16 UTC

Why write in CUDA and not just use PyTorch etc?

if performance, how much faster is it, out of curiosity?

kgwgk

0 replies

21h21m

2024-05-28 21:07:06 UTC

Why write in CUDA and not just use PyTorch etc?

“LLM training in simple, pure C/CUDA. There is no need for 245MB of PyTorch or 107MB of cPython. […] A few more words on what I want this repo to be: First, I want llm.c to be a place for education.”

0x1ceb00da

1 replies

10h34m

2024-05-29 07:53:47 UTC

Hi. Is it possible to somehow run llm.c on an amd gpu?

anthonix1

0 replies

2h33m

2024-05-29 15:55:21 UTC

Yeah, I just reproduced the GPT2 from scratch results in 8.75 hours on 4x 7900 XTX. The fork is here: https://github.com/anthonix/llm.c

sytelus

0 replies

9h41m

2024-05-29 08:46:46 UTC

So, NanoGPT took 1.8 days on 8xA100 for 124M model training on 30.7B tokens using flash attention. This would translate to 14.4hr for 10B tokens. With llm.c it is ~1.5 hr which is almost 10X speedup!

Does this look ballpark correct? Is there any summary of where majority of this improvement comes from?

jonesn11

0 replies

13h28m

2024-05-29 04:59:51 UTC

Like the FAQ, you correctly anticipated my questions.

1024core

0 replies

1d1h

2024-05-28 16:57:17 UTC

Thank you, from an appreciative reader!

benterix

42 replies

1d1h

2024-05-28 16:40:19 UTC

I just hope than in a couple of years we'll see a submission here titled "Reproduce GPT-4 on legacy RTX 4090."

Because currently even with open source (?) models we are still consumers, and the training is still the domain of the rich.

ravetcofx

18 replies

1d1h

2024-05-28 16:55:32 UTC

Accessing the dataset to train from scratch will be the biggest hurdle, now a lot of the pile has had ladder pulled since GPT-4

GaggiX

6 replies

2024-05-28 17:55:56 UTC

https://huggingface.co/datasets/HuggingFaceFW/fineweb has 15T cleaned and deduplicated english web data tokens.

ravetcofx

5 replies

2024-05-28 18:27:54 UTC

Holy crap, Does huggingface charge for bandwidth if you're downloading 45 terabytes??

andersa

2 replies

14h7m

2024-05-29 04:21:25 UTC

Fun trivia: downloading 45TB costs about $60, according to Cloudflare.

verticalscaler

1 replies

7h11m

2024-05-29 11:17:29 UTC

That's what Cloudflare charges. It costs them around 6 cents.

kazanz

0 replies

1h31m

2024-05-29 16:56:40 UTC

Wish I could say I'm surprised you're getting downvotes. Carrier costs are some of the lowest costs for hosting providers. Yet that fact seems to elude a majority of the community here.

drexlspivey

1 replies

23h40m

2024-05-28 18:47:58 UTC

I believe they are hosting it on Cloudflare who doesn’t charge for egress

fragmede

0 replies

23h25m

2024-05-28 19:02:55 UTC

More specifically, Cloudflare R2 doesn't charge for egress, and Cloudflare doesn't charge for egress to members in the Bandwidth Alliance which include Azure, Google Cloud, Oracle, Alibaba Cloud, and others, though critically not AWS.

They very much do charge egress fees elsewhere.

meiraleal

5 replies

1d1h

2024-05-28 17:16:15 UTC

I'm okay with paying for datasets

CamperBob2

4 replies

1d1h

2024-05-28 17:27:27 UTC

Depends on how the courts rule. If the copyright maximalists prevail, only the wealthiest entities will be able to afford to license a useful data set.

Paradoxically enough, this is the outcome that most "Hacker News" denizens seem to be rooting for.

groby_b

2 replies

22h29m

2024-05-28 19:58:48 UTC

It's almost as if people believe in fairness and compensating people for their work.

Also, it's worth noting that this is only true as long as we're stuck in the "must train on the entire sum total of human output ever created" local minimum for machine learning. Given that most biological entities learn with much less data, this might well be the thing that prods ML research to using an approach that isn't "IDK, buy a few containers of GPUs, and half a DC of storage, see if that makes things better".

nwsm

1 replies

3h4m

2024-05-29 15:24:24 UTC

It's almost as if people believe in fairness and compensating people for their work.

Yet in this case we are talking about compensating the compilers/massagers/owners of the datasets, not the original authors from wherever the data was originally scraped.

wizzwizz4

0 replies

43m

2024-05-29 17:45:13 UTC

Copyright is hideously broken, but in theory: the owners only own it because they compensate the authors, which they only do out of an expectation of future profit (on average).

That theory's a fantasy, because extractive systems involving gatekeepers get established, but in this specific case, enforcing copyright would make things fairer for authors. There's no extractive copyright-taking gatekeeper for websites: scrapers don't get copyright, so can't re-license the material they've scraped (unless it's permissively-licensed or something).

meiraleal

0 replies

2024-05-28 17:53:03 UTC

I'd still get most of my dataset from torrent but I could pay for specific things like high quality source code.

exe34

2 replies

1d1h

2024-05-28 17:15:55 UTC

i suppose you wouldn't be able to use it for external services, but internally, I'm sure you can find some books that fell off the back of a truck...

HeatrayEnjoyer

1 replies

1d1h

2024-05-28 17:20:53 UTC

No reason you can't go external. GPT was trained using ebook torrent sites

artninja1988

0 replies

23h55m

2024-05-28 18:33:06 UTC

OpenAI has enough money to hire lawyers to defend it until the end of time though

CamperBob2

1 replies

1d1h

2024-05-28 17:15:52 UTC

Someone will come along and say "Why don't you just mirror Anna's Archive?" in 3...2...1...

sebzim4500

0 replies

8h56m

2024-05-29 09:32:11 UTC

I think between Anna's Archive, fineweb and as many github repos as you can scrape you can get a pretty decent dataset.

I doubt Anna's Archive would produce a good model on its own though.

sabareesh

5 replies

21h45m

2024-05-28 20:43:21 UTC

Well here is a comment on 4090 https://github.com/karpathy/llm.c/discussions/481#discussion...

huac

4 replies

4h34m

2024-05-29 13:54:18 UTC

25% MFU :( maybe because of the P2P nerf?

sabareesh

1 replies

2h15m

2024-05-29 16:13:00 UTC

This much bigger model (500M), P2P is enabled via Mailbox. It is expected because of memory to compute ratio

huac

0 replies

2h7m

2024-05-29 16:20:52 UTC

can you elaborate?

anthonix1

1 replies

4h6m

2024-05-29 14:22:03 UTC

Maybe get a 7900 XTX. 122 TFLOPS of BF16/FP16 for less than $1k and I'm getting 55.4% MFU

sabareesh

0 replies

2h14m

2024-05-29 16:13:39 UTC

These are not apples to apple comparison, as this is running across GPU and much bigger model

vineyardmike

4 replies

1d1h

2024-05-28 17:17:00 UTC

We won’t ever get there or need to because GPT-4 wasn’t trained on one GPU it was trained on thousands. The (most likely) biggest meaningful difference between -2 and -4 is the number of parameters and the training data/duration. I don’t think you’d really learn much more.

elicksaur

3 replies

19h10m

2024-05-28 23:18:22 UTC

It’s not about learning. It’s about owning. Exactly the reason OpenAI stopped being open. Having GPT-4-quality LLMs created by anyone with a gaming PC would be pretty radical.

vineyardmike

2 replies

17h1m

2024-05-29 01:27:30 UTC

And you won’t get there. Those models are far too large for a 2024 GPU. Llama-3 70b is arguably close to GPT-4 but is still too large for gaming GPUs (and probably for many years of GPU updates)

elicksaur

1 replies

14h11m

2024-05-29 04:17:14 UTC

“You won’t get there” is a pretty vast statement for all of the future. Two fairly reasonable predictions: 1) the compute needed to get GPT4 performance will decrease. 2) the compute on consumer GPUs will increase.

At some point they cross, and you will be able to run a GPT4-quality LLM on a consumer GPU. At some point after that, you’ll be able to run a GPT4-quality LLM on a 2024 consumer GPU if you can find one.

Important to emphasize, I’m not saying “GPT-4”. Llama-3 was trained on 24k GPU clusters. “Able to do the exact same processing at 1/24k the compute” is different from “Able to get equivalent performance at 1/24k compute”. Even then, given a long enough time scale, the former is possible.

vineyardmike

0 replies

3h8m

2024-05-29 15:20:17 UTC

1) the compute needed to get GPT4 performance will decrease. 2) the compute on consumer GPUs will increase.

I’m assuming we’re just talking inference here…

Sure compute abilities for consumers will increase but the original comment had a fixed GPU - the 4090. I can already eke out LLama3:8b on my MacBook Air, and Apple will sell you a laptop capable of running the full sized LLama.

There is a direct correlation between parameters and “knowledge” for an LM. There’s some open questions as to density (LLaMa3 specifically challenged previous assumptions) but it seems implausible to fit an equivalent model as GPT4 into 24gb vram. Just like compression, you can’t shrink forever.

GPT-4 and GPT-2 are pretty similar architecturally (I assume). So if abilities don’t matter, we can already run GPT-2 so we’re basically there for 4.

anthonix1

3 replies

23h36m

2024-05-28 18:52:14 UTC

FWIW, I'm seeing ~318,000 toks/sec throughput on a 4x AMD 7900 XTX machine (less than $4k worth of GPU), using the same settings as in the post (0.5M batch size etc).

pama

2 replies

20h32m

2024-05-28 21:56:11 UTC

Did you reproduce the evaluation as well?

anthonix1

0 replies

4h17m

2024-05-29 14:10:44 UTC

So... successfully reproduced in ~8.75 hours, taking about 18 kWh / $2.70

The first run actually failed at step 3000 or so, and I realized I had a bug in my attention / matmul kernels, but after fixing that and restarting it worked great

[1] https://github.com/anthonix/llm.c

anthonix1

0 replies

18h18m

2024-05-29 00:10:23 UTC

It converges similarly on smaller datasets.

About to kick off a training from scratch run on the same fineweb-10B, which at 324k toks/sec should take about 8.6 hours. And with my kWh cost, that is about $2.50 cost to train.

Will report back tomorrow when the training has finished..

Invictus0

3 replies

1d1h

2024-05-28 17:21:24 UTC

I'm not saying this to be rude, but I think you have a deep misunderstanding of how AI training works. You cannot just skip the matrix multiplications necessary to train the model, or get current hardware to do it faster.

xdavidliu

0 replies

22h46m

2024-05-28 19:41:45 UTC

was the first sentence really necessary? The second sentence seems fine by itself.

nickpsecurity

0 replies

19h12m

2024-05-28 23:16:32 UTC

There's work on replacing multiplication. Here's four examples:

https://openaccess.thecvf.com/content_CVPR_2020/papers/Chen_...

https://arxiv.org/abs/2012.03458

https://openaccess.thecvf.com/content/CVPR2021W/MAI/papers/E...

https://arxiv.org/pdf/2106.10860

benterix

0 replies

8h10m

2024-05-29 10:17:42 UTC

No offence taken! As far as my (shallow!) understanding goes, the main challenge is the need for many GPUs with huge amounts of memory, and it still takes ages to train the model. So regarding the use of consumer GPUs, some work has been done already, and I've seen some setups where people combine of these and are successful. As for the the other aspects, maybe at some point we distill what is really needed to a smaller but excellent dataset that would give similar results in the final models.

auspiv

2 replies

23h54m

2024-05-28 18:34:30 UTC

Considering it takes 8x A100 GPUs (80GB VRAM) to train GPT-2, I think it'll take far more than a single 4090.

bufo

0 replies

23h40m

2024-05-28 18:48:36 UTC

The RTX 4090 has about the same BF16 Tensor Core TOPs than the A100, assuming 50% MFU (like the A100 40 GB PCIe) it would take 8x longer on 1 RTX 4090 vs 8x A100 80GB SXM, so 12 hours. Datasheet here for the TOPs https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvid... 50% MFU should be achievable on the 4090.

anthonix1

0 replies

4h15m

2024-05-29 14:12:53 UTC

Nah, I reproduced on 4x 7900 XTX machine in 8.75 hours, so a single 7900 XTX (costs less than $1k) could do it in under 24 hours. Was hitting 55.4% MFU.

doubloon

0 replies

10h55m

2024-05-29 07:33:25 UTC

Dude im hoping we get rid of Nvidia completely. I can run llama.cpp inference on a 7B model on my 24 core cpu intel machine using just CPU and it only uses about 4gb of ram and is not that slow. If we could have massive parallel arm core or even riscv machines without the Cuda issues with proprietary driver hell it would be much more open source. And much less wonkage for the normie user

natsucks

3 replies

21h59m

2024-05-28 20:29:25 UTC

In your opinion is it important for ML engineers to know C?

esafak

0 replies

21h2m

2024-05-28 21:26:19 UTC

You'd have to be deep into ML Infrastructure to use C, probably via CUDA. No-one who develops or uses ML models touches C or even C++. tinygrad and llama.cpp are exceptions.

brcmthrowaway

0 replies

21h12m

2024-05-28 21:15:58 UTC

0% chance

adeptima

0 replies

10h8m

2024-05-29 08:20:06 UTC

Spend one year to study multiple languages - bash, C, C++, Go, Python ... and even Mojo or Rust. 10-20 hours a week. Being able to read top programming languages is the best investment I ever made. You will become fearless and can see the matrix ;)

indigodaddy

3 replies

1d1h

2024-05-28 16:40:03 UTC

Looks like this is re: training, but wonder how inference would be on some garbage older machine with no GPU on this model?

ryankrage77

2 replies

22h40m

2024-05-28 19:48:02 UTC

Last time I tried GPT-2 on CPU (which I think was shortly before chatGPT was launched), I was getting about 0.2 tokens/sec. CPU utilization was low though, so running inference in parralel gave better results. I was using 2 x E5-2660's.

int_19h

1 replies

22h17m

2024-05-28 20:11:34 UTC

DDR5 helps a lot. You can actually run stuff like LLaMA at >1 tok/s on the CPU with high-end gaming hardware these days.

doubloon

0 replies

10h42m

2024-05-29 07:46:26 UTC

I have a 24 core Intel cpu and llama3.cpp runs llama3 surprisingly fast in surprisingly little RAM. Yes it becomes a space heater but theres light at the end of the cuda free tunnel

ls612

1 replies

15h2m

2024-05-29 03:26:31 UTC

Is this the sort of thing that a person with curiosity and a 4090 could do? It says he used 8xA100s in the cloud to do this but is it just a matter of the 4090 going 8x slower or will memory constraints kill the whole endeavour?

smaddox

0 replies

1h46m

2024-05-29 16:42:37 UTC

4090 should have enough VRAM for 124M param training. Even at float32 precision, with AdamW optimizer, parameters should only be ~2GB (124M params x 4 bytes per param x ~4 for optimizer weight overhead). So there should be plenty of remaining space for activations.

akkishore

1 replies

5h16m

2024-05-29 13:11:52 UTC

Hi Andrej,

Huge fan of all the work you do. Wanted to understand something fundamental and whom better to ask than you: Whats so special about the transformer architecture that its able to predict the next token so beautifully understanding all the intricate previous token relationships? I understand Attention but what so special about this architecture that no other architectures are able to "attend" appropriately to previous tokens? Being a CS guy, its really hard for me to fathom that we have not yet created another architecture which can perform similarly.

smaddox

0 replies

1h55m

2024-05-29 16:32:39 UTC

Transformers have quadratic computational complexity in sequence length, i.e. O(N^2) where N is the sequence length. RNNs, Linformer, Mamba, etc. have linear or quasi-linear computational complexity in sequence length, which often bottlenecks information movement across tokens.

In theory, if you grew the RNN's state quadratically vs sequence length, you could likely achieve comparable performance to transformers, but it would likely be less efficient than transformers.

zimabluerain

0 replies

19h3m

2024-05-28 23:25:14 UTC

the code works well on H100: https://x.com/Yuchenj_UW/status/1795554739633221804

unknown2342

0 replies

3h18m

2024-05-29 15:10:17 UTC

Thank you for your work Andrej! <3

notg963

0 replies

2024-05-28 18:28:20 UTC

Do you have plans to create videos for the llm.c?

metalloid

0 replies

5h55m

2024-05-29 12:32:48 UTC

This awesome!

We need a series on how to build the llm.c from the scratch. Any volunteer?

:-)

celltalk

0 replies

22h52m

2024-05-28 19:36:25 UTC

Time for llm videos!

anoy8888

0 replies

23h7m

2024-05-28 19:21:38 UTC

Can it be done in rust ?

aliljet

0 replies

19h20m

2024-05-28 23:07:39 UTC

Is there a reason you're not trying to port this into an even more stack agnostic world without CUDA?

adeptima

0 replies

10h14m

2024-05-29 08:14:09 UTC

Andrej Karpathy karpathy is a magician!

But being the coolest kid on the block with pure C/CUDA implementation is not enough https://github.com/karpathy/llm.c

Studying a baby Llama 2 model source code in pure Mojo is the next level https://github.com/tairov/llama2.mojo https://github.com/tairov/llama2.mojo/blob/master/llama2.moj...

Mojo Lang - Tomorrow's High Performance Python? (with Chris Lattner) https://www.youtube.com/watch?v=JRcXUuQYR90

Andrej Karpathy and Chris Lattner collab is on my wishlist ;)