Using PyTorch is not "LLMs from the ground up".
It's a fine PyTorch tutorial but let's not pretend it's something low level.
It may be unreasonable, but I have a default negativity toward anything that uses the word "coding" instead of programming or development.
This is more a European thing
I am from Europe and I am not completely sure about that to be honest. I also prefer programming.
I also dislike "software development" as it reminds me of developing a photographic negative – like "oh, let's check out how the software we developed came out".
It should be "software engineering", and it should be held to a similar standard as other engineering fields when it's done in a professional context.
The word "development" can mean several things. I don't think "software development" sounds bad when grouped with a phrase like "urban development". It describes growing and tuning software for, well, working better, solving more needs, and with fewer failure modes.
I do agree that a "coder" creates code, and a programmer creates programs. I expect more of a complete program than of a bunch of code. If a text says "coder", it does set an expectation about the professionalism of the text. And I expect even more from a software solution created by a software engineer. At least a specification!
Still, I, a professional software engineer and programmer, also write "code" for throwaway scripts, or just for myself, or that never gets completed. Or for fun. I will read articles by and for coders too.
The word is a signal. It's neither good nor bad, but if that's not the signal the author wants to send, they should work on their communication.
If that's not the signal the author wants to send
You can't use a language that will be taken by everyone the same way. The public is heterogeneous - its subsets will use different "codes".
software development
Wrong angle. There is a problem, your consideration of the problem, the refinement of your solution to the problem: the solution gradually unfolds - it is developed.
As a European: my language doesn't even have a proper equivalent to "coding", only a direct translation of "programming".
I'm from Europe and my language doesn't have an equivalent to "coding" either, but I've still been using the English words "coder" and "coding" for decades. In my case I learned them from the demoscene, where they have been used for programmers since the 80s. FWIW, the demoscene is (or at least was) largely a European thing (groups outside Europe did exist, but the majority of both groups and demoparties were, and I think still are, in Europe), so perhaps there is some truth to "coding" being a European thing (e.g. it sounded OK in some languages and spread from there).
Also, to my ears "coder" always sounded cooler than "programmer", and it wasn't until a few years ago that I first heard it has negative connotations for some people. Too late to change though; it still sounds cooler to me :-P.
Probably now an unpopular view (as is any opinion perceived as 'judgemental' or 'gatekeeping'), but I agree.
I fully agree. We had a discussion about this one year ago: https://news.ycombinator.com/item?id=36924239
Quite a cry, on a submission page from one of the most language-"obsessed" people in this community.
Now: "code" is something you establish – as in the content of the codex medium (see https://en.wikipedia.org/wiki/Codex for its history); from the field of law, a set of rules, exported in use to other domains since at least the mid-16th century in English.
"Program" is something you publish, with the implied content of a set of intentions ("first we play Bach then Mozart" - the use postdates "code"-as-"set of rules" by centuries).
"Develop" is something you unfold - good, but it does not imply "rules" or "[sequential] process" like the other two terms.
Love stuff like this. Tangentially I'm working on useful language models without taking the LLM approach:
Next-token prediction: https://github.com/bennyschmidt/next-token-prediction
Good for auto-complete, spellcheck, etc.
AI chatbot: https://github.com/bennyschmidt/llimo
Good for domain-specific conversational chat with instant responses that doesn't hallucinate.
Why do you call your language model “transformer”?
Language is the language model that extends Transformer. Transformer is a base model for any kind of token (words, pixels, etc.).
However, currently there is some language-specific stuff in Transformer that should be moved to Language :) I'm focusing first on language models, and getting into image generation next.
No, I mean, a transformer is a very specific model architecture, and your simple language model has nothing to do with that architecture. Unless I’m missing something.
For a century, transformer meant a very different thing. Power systems people are justifiably amused.
And it means something else in Hollywood. But we are discussing language models here, aren’t we?
I took a very cursory look at the code, and it looks like this is just a standard Markov chain. Is it doing something different?
Simpler take on embeddings (just bigrams stored in JSON format)
So Markov chains
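For anyone wondering what "just bigrams stored in JSON" amounts to in practice, here is a minimal first-order Markov chain in plain Python (function names are my own illustration, not taken from the linked repo):

```python
# A minimal bigram (first-order Markov chain) text model:
# count word-to-next-word frequencies, then sample from them.
import random
from collections import defaultdict

def train_bigrams(text):
    """Count how often each word follows each other word."""
    counts = defaultdict(lambda: defaultdict(int))
    words = text.split()
    for a, b in zip(words, words[1:]):
        counts[a][b] += 1
    return counts

def generate(counts, start, length=10):
    """Walk the chain, sampling each next word by bigram frequency."""
    out = [start]
    for _ in range(length):
        nexts = counts.get(out[-1])
        if not nexts:  # dead end: no recorded successor
            break
        words = list(nexts)
        weights = [nexts[w] for w in words]
        out.append(random.choices(words, weights=weights)[0])
    return " ".join(out)
```

The counts dict serializes straight to JSON, which is presumably what the "embeddings" in the repo boil down to.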
I’m not sure why you’d want to build an LLM these days - you won’t be able to train it anyway. It’d make a lot of sense to teach people how to build stuff with LLMs, not LLMs themselves.
This has been said about pretty much every subject: writing your own browsers, compilers, cryptography, etc. But at least for me, even if nothing comes of it, just knowing how it really works and what steps are involved is part of using these things properly. Some people are perfectly happy using a black box, but without knowing how it's made, how do we know its limits? How will the next generation of LLMs happen if nobody can get excited about the internal workings?
You don’t need to write your own LLM to know how it works. And unlike, say, a browser it doesn’t really do anything even remotely impressive unless you have at least a few tens of thousands of dollars to spend on training. Source: my day job is to do precisely what I’m telling you not to bother doing, but I do have access to a large pool of GPUs. If I didn’t, I’d be doing what I suggest above.
Good points. For learning purposes, just understanding what a neural network is and how it works covers it all.
But people can always rent GPUs too, and they're getting pretty ubiquitous as the AI hype ramps up. I'm just an IT monkey at the moment, and even I have on-demand access to a server with something like 4x 192GB GPUs at work.
It's possible to train useful LLMs on affordable hardware. It depends on what kind of LLM you want. Sure, you won't build the next ChatGPT, but not every language task requires a universal general-purpose LLM with billions of parameters.
It's so fun! And for me at least, it sparks a lot of curiosity to learn the theory behind them, so I would imagine it is similar for others. And some of that theory will likely cross over to the next AI breakthrough. So I think this is a fun and interesting vehicle for a lot of useful knowledge. It's not like building compilers is still super relevant for most of us, but many people still learn to do it!
This is excellent. Thanks for sharing. It's always good to go back to the fundamentals. There's another resource that is also quite good: https://jaykmody.com/blog/gpt-from-scratch/
Not true.
Your resource is really bad.
"We'll then load the trained GPT-2 model weights released by OpenAI into our implementation and generate some text."
Your resource is really bad.
What a bad take. That resource is awesome. Sure, it is about inference, not training, but why is that a bad thing?
This is not “building from the ground up”
Why is that bad?
Excuse my ignorance, is this different from Andrej Karpathy https://www.youtube.com/watch?v=kCc8FmEb1nY
Anyway I will watch it tonight before bed. Thank you for sharing.
Andrej's series is excellent, Sebastian's book + this video are excellent. There's a lot of overlap but they go into more detail on different topics or focus on different things. Andrej's entire series is absolutely worth watching, his upcoming Eureka Labs stuff is looking extremely good too. Sebastian's blog and book are definitely worth the time and money IMO.
what book
Most likely this one.
https://www.manning.com/books/build-a-large-language-model-f...
(I've taken it from the footnotes on the article)
That's the one! High enough quality that I would guess it would highly convert from torrents to purchases. Hypothetically, of course.
This page is just a container for a youtube video. I suggest updating this HN link to point to the video directly, which contains the same links as the page in its description.
Why not support the author's own website? It looks like a nice website
On the contrary, I saved you that extra step of looking for Sebastian Raschka's repository of writings.
He shares a ton of videos and code. His material is really valuable. Just support him?
Nice write-up Sebastian, looking forward to the book. There are lots of details on the LLM and how it's composed; it would be great if you could expand on how Llama and OpenAI could be cleaning and structuring their training data, given this seems to be where the battle is heading in the long run.
how Llama and OpenAI could be cleaning and structuring their training data
If you're interested in this, there are several sections in the Llama paper you will likely enjoy: https://ai.meta.com/research/publications/the-llama-3-herd-o...
But isn't it the beauty of LLMs that they need comparatively little preparation (unstructured text as input) and pick the features on their own, so to speak?
edit: grammar
Yes. Would love to read that.
This is great! Hope it works on a Windows 11 machine too (I often find that when Windows isn't explicitly mentioned, the code isn't tested on it and usually fails to work due to random issues).
When it does not work on Windows 11 -- what about trying it out in WSL (Windows Subsystem for Linux)?
This should work perfectly fine in WSL2 as it has access to a GPU. Do remember to install the Cuda toolkit, NVidia has one for WSL2 specifically.
https://developer.nvidia.com/cuda-downloads?target_os=Linux&...
I wrote a practical guide on how to train nanoGPT from scratch on Azure a while ago. It's pretty hands-on and easy to follow:
https://16x.engineer/2023/12/29/nanoGPT-azure-T4-ubuntu-guid...
This is great. Just yesterday I was wondering how exactly transformers/attention and LLMs work. I'd worked through how back-propagation works in a deep RNN a long while ago and thought it would be interesting to see the rest.
Sebastian, you are a god among mortals. Thank you.
Is this a joke? Can’t tell. OpenAI uses PyTorch to build LLMs
People think of the Karpathy tutorials which do indeed build LLMs from the ground up, starting with Python dictionaries.
"From scratch" is relative. To a Python programmer, from scratch may mean starting with dictionaries, but a non-programmer will have to learn what Python dicts are first.
To someone who already knows Excel, "from scratch" with Excel sheets instead of Python may work for them.
For the record, if you do not know what a dict actually is, and how it works, it is impossible to use it effectively.
Although if your claim is then that most programmers do not care about being effective, that I would tend to agree with given the 64 gigs of ram my basic text editors need these days.
You could always go deeper and from some points of view, it's not "from the ground up" enough unless you build your own autograd and tensors from plain numpy arrays.
Numpy sounds like cheating on the backs of others. Going to need your own hand crafted linear algebra routines.
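For scale, a micrograd-style scalar autograd needs no numpy at all and fits in a few dozen lines. A sketch (class and method names are illustrative, not any particular library's API):

```python
# Toy scalar autograd: each Value records its parents and a closure
# that propagates gradients backward through the chain rule.
class Value:
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = backward
        return out

    def backward(self):
        # Topologically order the graph, then apply the chain rule in reverse.
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()
```

With only `+` and `*` you can already differentiate polynomials; real engines add more ops and swap scalars for tensors, which is where the hand-crafted linear algebra would come in.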
No, it is not. "From scratch" has a meaning. To me it means: in a way that lets you understand the important details, e.g. using a programming language without major dependencies.
Calling that from scratch is like saying "Just go to the store and tell them what you want" in a series called: "How to make sausage from scratch".
When I want to know how to do X from scratch I am not interested in "how to get X the fastest way possible", to be frank I am not even interested in "How to get X in the way others typically get it", what I am interested in is learning how to do all the stuff that is normally hidden away in dependencies or frameworks myself — or, you know, from scratch. And considering the comments here I am not alone in that reading.
Your definition doesn’t match mine. My definition is fuzzier. It is “building something using no more than the common tools of the trade”. The term “common” is very era dependent.
For example, building a web server from scratch - I’d probably assume the presence of a sockets library or at the very least networking card driver support. For logging and configuration I’d assume standard I/o support.
It probably comes down to what you think makes LLMs interesting as programs.
Source please?
Learn to play Bach: start with making your own piano.
Bach (Johann Sebastian .. there were many musical Bachs in the family) owned and wrote for harpsichords, lute-harpsichords, violin, viola, cellos, a viola da gamba, lute and spinet.
Never had a piano, not even a fortepiano .. though reportedly he played one once.
We know what he meant.
Yes, I know, but that’s irrelevant. You can replace the word piano in my comment with harpsichord if it makes you happy.
He had to improvise on the Hammerklavier when visiting Frederick the Great in Potsdam. That (improvising for Frederick) is also the starting point for the later creation of https://en.wikipedia.org/wiki/The_Musical_Offering .
Pianos are not proprietary in that they all have the same interface. This is like a web development tutorial in ColdFusion.
Are you implying that PyTorch is proprietary?
We’re digressing way off the whole point of the comment, but to address yours: piano design has actually been an area of great innovation over the centuries, with different companies doing it in considerably different ways.
I really like Sebastian's content but I do agree with you. I didn't get into deep learning until starting with Karpathy's series, which starts by creating an autograd engine from scratch. Before that I tried learning with fast.ai, which dives immediately into building networks with Pytorch, but I noped out of there quickly. It felt about as fun as learning Java in high school. I need to understand what I'm working with!
Maybe it's just different learning styles. Some people, me included, like to start getting some immediate real world results to keep it relevant and form some kind of intuition, then start peeling back the layers to understand the underlying principles. With fastAI you are already doing this by the 3rd lecture.
Like driving a car, you don't need to understand what's under the hood you start driving, but eventually understanding it makes you a better driver.
For sure! In both cases I imagine it is a conscious choice where the teachers thought about the trade-offs of each option. Both have their merits. Whenever you write learning material you have to decide where to draw the line of how far you want to break down the subject matter. You have to think quite hard about exactly who you are writing for. It's really hard to do!
You seem to be implying that the top-down approach is a trade off that involves not breaking down the subject matter into as lower level details. I think the opposite is true - when you go top down you can keep teaching lower and lower layers all the way down to physics if you like!
fast.ai also does autograd from scratch - and goes further than Karpathy since it even does matrix multiplication from scratch.
But it doesn’t start there. It uses top-down pedagogy, instead of bottom up.
Oh that’s interesting to know! I guess I gel better with bottom up. As soon as I start seeing API functions I don’t understand I immediately want to know how they work!
Low level by what standards? Is writing an IRC client in Python using only the socket API also not "from scratch"?
Considering I seem to be in the minority here, based on all the other responses to the message you replied to, the answer I'd give is "by mine, I guess".
At least when I saw "Building LLMs from the Ground Up", what I expected was someone opening vim, emacs or their favorite text editor and starting to write some C code (or something around that level) to implement, well, everything from the "ground" (the operating system's user space, which in most OSes is around the level of C) and "up".
The problem with this line of thinking is that 1) it's all relative anyway, and 2) The notion of "ground" is completely different depending on which perspective you have.
To a statistician or a practitioner approaching machine learning from a mathematical perspective, the computational details are a distraction.
Yes, these models would not be possible without automatic differentiation and massively parallel computing. But there is a lot of rich detail to consider in building up the model from first mathematical principles: motivating design choices with prior art from natural language processing, various topics related to how input data is represented and loss is evaluated, data processing considerations, putting things into the context of machine learning more broadly, etc. You could fill half a book chapter with that kind of content (and people do) without ever talking about computational details beyond a passing mention.
In my personal opinion, fussing over manual memory management is far afield from anything useful unless you want to actually work on hardware or core library implementations like Pytorch. Nobody else in industry is doing that.
Gluing together premade components is not “from the ground up” by most people’s definition.
People are looking at the ground up for a clear picture of what the thing is actually doing, so masking the important part of what is actually happening, then calling it “ground up” is disingenuous.
Yes, but "what the thing is actually doing" is different depending on what your perspective is on what "the thing" and what "actually" consists of.
If you are interested in how the model works conceptually, how training works, how it represents text semantically, etc., then I maintain that computational details are an irrelevant distraction, not an essential foundation.
How about another analogy? Is SICP not a good foundation for learning about language design because it uses Scheme and not assembly or C?
I'm still waiting for an assembly language model tutorial, but apparently there are no real engineers out there anymore, only torch script kiddies /s
Automotive actually uses ML in plain C with some inline assembly sprinkled on top to run models on embedded devices.
It's definitely out there and in production use.
Which engines in particular? I never found especially flexible ones.
Ironically, slippery slope argumentation is a favourite style of kids.
Unfortunately, your argument is a well known fallacy and carries no weight.
Pfft. Assembly. I'm waiting for the real low level tutorial based on quantum electrodynamics.
If you want to make an apple pie from scratch, first you have to invent the universe.
After watching the Karpathy videos on the subject, of course.
#378
I'll write a guide "no-code LLMs in CUDA".
Your comment is one of the most pompous that I've ever read.
NVIDIA's value lies only in the PyTorch and CUDA optimizations relative to a pure C implementation, so saying you need to go lower level than CUDA or PyTorch simply means reinventing NVIDIA. Good luck with that.
1. I only said the meaning of the title is wrong, and I praised the content
2. I didn't say CUDA wouldn't be ground up or low level (please re-read). (I do mention in another comment a no-code guide with CUDA, but that's obviously a joke.)
3. And finally, I think your comment comes across as holier-than-thou and finger-pointing, making a huge deal out of a minor semantic observation.
Wanted to say the same thing. As an educator who once gave a course on a similar topic for non-programmers: you need to start way, way earlier.
E.g.
1. Programming basics
2. How to manipulate text using programs (reading, writing, tokenization, counting words, randomization, case conversion, ...)
3. How to extract statistical properties from texts (ngrams, etc, ...)
4. How to generate crude text using markov chains
5. Improving on markov chains and thinking about/trying out different topologies
Etc.
Sure, Markov chains are not exactly LLMs, but they are a good starting point to build an intuition for how programs can extract statistical properties from text and generate new text based on them. It also gives you a feeling for how programs can work on text.
If you start directly with a framework there is some essential understanding missing.
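To make steps 2-4 above concrete, here is roughly what such a course exercise could look like: tokenize, count trigrams, generate. Using a two-word context is already one of the "improvements" of step 5 over a plain bigram chain. (All names here are my own illustration.)

```python
# Steps 2-4 sketched: tokenize text, count trigram statistics,
# and generate new text from a two-word (second-order) context.
import random
from collections import Counter, defaultdict

def tokenize(text):
    """Step 2: crude text manipulation - lowercase and split on whitespace."""
    return text.lower().split()

def count_trigrams(tokens):
    """Step 3: map each two-word context to a Counter of following words."""
    model = defaultdict(Counter)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        model[(a, b)][c] += 1
    return model

def generate(model, context, n=20):
    """Step 4: extend the context by sampling from the counted statistics."""
    out = list(context)
    for _ in range(n):
        followers = model.get(tuple(out[-2:]))
        if not followers:  # unseen context: stop
            break
        choices, weights = zip(*followers.items())
        out.append(random.choices(choices, weights=weights)[0])
    return " ".join(out)
```

On a toy corpus this mostly parrots the input back, which is itself a useful lesson: the jump from here to LLMs is largely about replacing the lookup table with a learned, generalizing model.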