
Let's Build the GPT Tokenizer [video]

sabareesh
16 replies
1d3h

Even if you pay, it is hard to get such high-quality content!

progbits
14 replies
1d2h

I've been learning a few new CS things recently, and honestly I mostly find an inverse correlation between cost and quality.

There are books from O'Reilly and paid MOOC courses that are just padded with lots of unnecessary text or silly "concept definition" quizzes to make them seem worth the price.

And there are excellent free YT video lectures, free books or blog posts.

Andrej's YT videos are one great example. https://course.fast.ai is another.

simmanian
7 replies
1d2h

Do you have recommendations for other high quality courses teaching CS things?

diimdeep
1 replies
1d

Build an 8-bit computer from scratch https://eater.net/8bit/ https://www.youtube.com/playlist?list=PLowKtXNTBypGqImE405J2...

Andreas Kling. OS hacking: Making the system boot with 256MB RAM https://www.youtube.com/watch?v=rapB5s0W5uk

MIT 6.006 Introduction to Algorithms, Spring 2020 https://www.youtube.com/playlist?list=PLUl4u3cNGP63EdVPNLG3T...

MIT 6.824: Distributed Systems https://www.youtube.com/@6.824

MIT 6.172 Performance Engineering of Software Systems, Fall 2018 https://www.youtube.com/playlist?list=PLUl4u3cNGP63VIBQVWguX...

Caltech CS 124 Operating Systems https://duckduckgo.com/?t=ffab&q=caltech+cs124&ia=web

Try searching HN for recommendations: https://hn.algolia.com

nojvek
0 replies
23h20m

Thank you a ton for the links.

GaneshSuriya
1 replies
1d1h

- Operating Systems: Three Easy Pieces (https://pages.cs.wisc.edu/~remzi/OSTEP) is incredible for learning OS internals

- Beej's networking guide is the best thing for network-layer stuff: https://beej.us/guide/

- Explained from First Principles is great too: https://explained-from-first-principles.com/

- Pintos from Stanford: https://web.stanford.edu/class/cs140/projects/pintos/pintos_...

vb234
0 replies
22h11m

Wow. Thanks for sharing. I had no idea that Professor Remzi and his wife Andrea wrote a book on Operating Systems. I loved his class (took it almost 22 years ago). Will have to check his book out.

davidbarker
0 replies
1d1h

I can highly recommend CS50 from Harvard (https://www.youtube.com/@cs50). Even after being involved in tech for 25+ years, I learnt a lot from just the first lecture alone.

Disclosure: Professor Malan is a friend of mine, but I was a fan of CS50 long before that!

codelobe
0 replies
22h26m

Replying to bookmark (hoard) all the thread links for later.

Fellow hackers might also enjoy:

https://www.nand2tetris.org/

Tomte
0 replies
15h24m

nand2tetris: https://www.nand2tetris.org/

I like the book better than the online course.

thfuran
1 replies
1d1h

  And there are excellent free YT video lectures, free books or blog posts.

There's also a tremendous amount of extremely low quality YouTube and blog content.

progbits
0 replies
1d1h

Sure. I don't claim the free content is all good.

But from my limited sample size, the best free content is better than the best paid content.

GaneshSuriya
1 replies
1d1h

It's not only about the cost, though. There's an inverse correlation with the glossiness of the content as well.

If the web page/content is too polished, it's most likely optimized for wooing users.

Unlike a lot of the examples I gave in the sibling comments, where the optimization comes only from love for the topic being discussed.

rahimnathwani
0 replies
1d

  There's an inverse correlation with the glossiness of the content as well.
This is probably due to survivorship bias. Sites that have poor content and poor visual appeal (glossiness) never get on your radar.

i.e. Berkson's Paradox: https://en.wikipedia.org/wiki/Berkson%27s_paradox

lynx23
0 replies
11h30m

Full ACK. I have also grown weary of paid course offerings, because many I have checked out were basically low quality or shallow.

danielmarkbruce
0 replies
1d

There are some extremely good CS textbooks which cost money. That being said, many good ML/AI texts are free. But it's not easy reading.

3abiton
0 replies
1d2h

His previous video on LLM transformer foundations is extremely useful.

mynameisure
6 replies
16h18m

A noob question: do you all intend to work on LLMs, or are you watching the content for the curious mind? I am asking how anyone like me, a software generalist, can make use of this amazing content. Anyone with insights on how to transition from a generalist backend engineer to an AI engineer? Or is it a niche where the only path is the PhD route?

chasd00
2 replies
8h0m

There’s a (or soon to be) market for software people that can evaluate a use case and apply an LLM if warranted. You don’t need a PhD but do need a good working knowledge of the nuts/bolts to speak truth to hype. Karpathy has a YouTube titles something like “a busy persons guide to LLM” and in it he describes the model as an operating system kernel with tools and utilities surrounding it. You can build and understand those valuable tools and utilities without having a PhD in AI. I think that’s the way to break into the AI market as a traditional developer.

A good example is llangchain

seanbethard
1 replies
3h20m

2009: Porter stemming with NLTK

2013: LDA with MALLET

2015: spaCy

2018: BERT

2023: GPT-4

2024: every person is an NLP expert in four lines of LangChain code

amelius
0 replies
1h42m

It's how we went from mathematics, theory of computation, algorithms, and programming language research to everybody being a web developer.

lb4r
0 replies
14h12m

Speaking for myself, and aside from just being curious, it's mostly for similar reasons to why you'd want to read, for example, CLRS, even though you'll probably never implement an algorithm like that in a real production environment yourself. It's not so much about learning how, but rather why, because it'll help you answer your whys in the future (not that the how can't also be important, of course).

elbear
0 replies
14h57m

Just a guess, but understanding how LLMs are built may also help you if you want to fine-tune a model. Someone who knows more may confirm or contradict this.

brainless
0 replies
13h59m

I was not really interested in LLMs till a month back. I had an earlier product where I wanted a no-code app for business insights on any data source. Plug in MySQL, PostgreSQL, APIs like Stripe, Salesforce, Shopify, even CSV files, and it would be able to generate queries from users' GUI interactions. Like Airtable, but for your own data sources. I was generating SQL, including JOINs, or HTTPS API calls.

Then I abandoned it in 2021. This year, it struck me that LLMs would be great for inferring business insights from the schema. I could create reports and dashboards automatically, and surface critical action points straight from the schema/data and from users chatting with the app.

So for the last couple of weeks, I have been building it, running tests on LLMs (CodeLlama, Zephyr, Mistral, Llama 2, Claude and ChatGPT). The results are quite good. There is a lot of tech that I need to handle: schema analysis, SQL or API calls, and the whole UI. But without LLMs, there was no clear way for me to infer business insights from schema + user chats.

To me, this is not a niche anymore now that I have found a problem I wanted to tackle already.
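
As a rough sketch of that schema-to-insights step, something like the following could work; the `call_llm` helper, the example schema, and the prompt wording are hypothetical stand-ins for whichever model API is used:

  # Hypothetical sketch: ask a model for dashboard-worthy insights plus the
  # SQL that computes them, given only the schema.
  SCHEMA = """
  CREATE TABLE customers (id INT, name TEXT, country TEXT, signed_up_at TIMESTAMP);
  CREATE TABLE orders (id INT, customer_id INT, total NUMERIC, created_at TIMESTAMP);
  """

  PROMPT = f"""You are a data analyst. Given this PostgreSQL schema:
  {SCHEMA}
  Suggest three business insights worth showing on a dashboard, and for each
  one write the SQL query (JOINs included) that computes it."""

  def call_llm(prompt: str) -> str:
      # Placeholder so the sketch stays model-agnostic; swap in a real client
      # (OpenAI, Anthropic, a local Llama server, ...).
      raise NotImplementedError("plug in your model client here")

  if __name__ == "__main__":
      print(PROMPT)  # inspect the prompt; replace with call_llm(PROMPT) once wired up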

seanbethard
4 replies
11h58m

It’s pretty wild how little discussion there's been about the core feature of these models. It's as if this aspect of their development has been solved. Basically all NLP publications today take these BPE tokens as a starting point and if they are mentioned at all they’re mentioned in passing.

https://blog.seanbethard.net/meanings-are-tiktokens-in-space
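
For context, a minimal sketch of the BPE merge loop those tokens come from (byte-level only, with no regex pre-splitting or special tokens, so it's a simplification of what GPT-2/tiktoken-style tokenizers actually do):

  # Greedy byte-pair encoding: repeatedly merge the most frequent adjacent pair.
  def get_stats(ids):
      counts = {}
      for a, b in zip(ids, ids[1:]):
          counts[(a, b)] = counts.get((a, b), 0) + 1
      return counts

  def merge(ids, pair, new_id):
      out, i = [], 0
      while i < len(ids):
          if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
              out.append(new_id)
              i += 2
          else:
              out.append(ids[i])
              i += 1
      return out

  text = "aaabdaaabac"
  ids = list(text.encode("utf-8"))   # start from raw bytes (0..255)
  merges = {}
  for step in range(3):
      stats = get_stats(ids)
      pair = max(stats, key=stats.get)   # most frequent adjacent pair
      new_id = 256 + step                # new token ids start after the byte range
      merges[pair] = new_id
      ids = merge(ids, pair, new_id)
  print(merges)   # learned merge rules
  print(ids)      # text re-encoded with the merged tokens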

lordswork
1 replies
10h7m

It's kind of like lexers and parsers for compilers. It's largely a solved problem, so it doesn't get much attention.

seanbethard
0 replies
4h27m

Thanks for your reply.

It's exactly like lexers for compilers. This parsing strategy, coupled with the decision to then map the results into an embedding space of arbitrary dimensionality, is why these models don't work and cannot be said to understand language. They cannot reliably handle fundamental aspects of meaning. They aren't equipped for it.

They're pretty good at coming up with well-formed sentences of English, though. They ought to be, given the excessive amounts of data they've seen.

PeterisP
1 replies
6h39m

It makes sense: publications write about the things they added, changed, or evaluated, not about all the (many!) things they do exactly as everyone else does; so tokenization would be mentioned only if the publication is explicitly about a different tokenization.

And while it's a core feature, it's a fairly robust one: you can get some targeted improvements, but the default option(s) are good enough and you won't improve much over them.

seanbethard
0 replies
3h41m

Thanks for your reply.

That's my first point. In 10 years we have word2vec, GloVe, GPT-2 and... tiktoken. lol. It's as if directional, numeric magnitudes in an embedding space of arbitrary dimensionality have magically captured, or will magically capture, the nuances and expressivity of language. Optimization techniques and new strategies for domain adaptation are what matter, particularly for mobile devices, on-device ASR and short-form videos.

I don't think robust is a good characterization of clusters of semantic attributes in space or a distributional semantics of language. I'd say crude and without understanding are more accurate descriptions. Capturing semantic properties sometimes is not the same thing as having a semantics.

By targeted improvements you must be referring to domain adaptation, and by the default option you must be referring to attention over BPE tokens? You can move directional quantities around in directional-quantity space all day. If it results in expected behavior for your application that you weren't getting before, that's great. If that's all you want to get out of these models, then indeed there's nothing to do here. I'm not after improvements so much as I'm after something that works.

mikewarot
3 replies
11h31m

No wonder GPT does so horribly on anything involving spelling, or the exact specifications of letters.

To fix it, I'd throw a few gigabytes of synthetic data into the training mix before fine-tuning, covering the alphabets of all the relevant languages; things like:

  A is an upper case a
  a is a lower case A
  the sequence of numbers is 0 1 2 3 4 5 6 7 8 9 10 11 12
  0 + 1 = 1
  1 + 1 = 2
etc.
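
A quick sketch of how that kind of synthetic data could be generated (the file name, word list, line formats, and counts here are purely illustrative, not anything a lab is known to use):

  import random
  import string

  lines = []
  # letter identity facts
  for ch in string.ascii_uppercase:
      lines.append(f"{ch} is an upper case {ch.lower()}")
      lines.append(f"{ch.lower()} is a lower case {ch}")
  # counting and simple arithmetic
  lines.append("the sequence of numbers is " + " ".join(str(n) for n in range(13)))
  for _ in range(100_000):
      a, b = random.randint(0, 99), random.randint(0, 99)
      lines.append(f"{a} + {b} = {a + b}")
  # spelling words out letter by letter
  for word in ["tokenizer", "language", "spelling"]:
      lines.append(f"{word} is spelled " + " ".join(word))

  with open("synthetic_spelling.txt", "w") as f:
      f.write("\n".join(lines))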

It still amazes me that Word2Vec is as useful as it is, let alone LLMs. The structure inherent in language really does convey far more meaning than we assume. We're like fish, not being aware of water, when we use language.

sebzim4500
0 replies
11h29m

We know OpenAI trains on significant amounts of synthetic data; they probably have something like this.

pyinstallwoes
0 replies
7h33m

Humans, a host form. Language, a life form.

CGamesPlay
0 replies
8h7m

I am curious whether an LLM trained using a tokenizer that renders words into the IPA alphabet would make a better bot for creative writing, especially things like rhyming, assonance, puns, and other sound-based word games. It might also do better on "fringe" languages, where the corpus in the language is small but the words might have cognates in more widely known languages.

threesevenths
2 replies
1d3h

Andrej's video on building nanoGPT is an excellent tutorial on all of the steps involved in a modern LLM.

pests
0 replies
18h11m

His earlier videos on micrograd and makemore are a gold mine as well.

mrtksn
2 replies
1d3h

I can't recommend the whole Zero to Hero series enough: https://karpathy.ai/zero-to-hero.html

No metaphors trying to explain "complex" ideas, making them scary and seem overly complex. Instead, hands-on implementations with analogy explainers, where you can actually understand the ideas and see how simple they are.

The learning curve is steeper at first, but it is much more satisfying, and you actually earn the ability to reason about this stuff instead of writing over-the-top influencer BS.

yen223
0 replies
21h52m

One thing I like about that zero-to-hero series is how he almost never hand-waves over seemingly minor details.

Definitely recommend watching those videos and doing the exercises, if you have any interest in how LLMs work.

MPSimmons
0 replies
1d2h

Thanks for this link - I have some free time coming up, and this seems like a great use of it!

ShamelessC
1 replies
20h47m

Had to double check my playback speed - he talks like a 1.25x playback speaker sounds.

coolThingsFirst
0 replies
11h6m

Let him teach; we don't need one more rapper.

unknown2342
0 replies
11h19m

Very grateful that he puts out this kind of education. The one nit I have is that he didn't explain all the abstract questions at the beginning, which left a bit of a bad taste, I guess. I hope I am not being disrespectful.

timzaman
0 replies
16h6m

The best thing is that I know Andrej reads all these comments. Hi Andrej! This is your calling. Miss you though!

theptrk
0 replies
17h26m

There should be awards for this type of content, with the Andrew Ng series and the Karpathy series as the first inductees into the hall of fame.

sorenjan
0 replies
9h15m

I would love a video series from him where he makes a text2img diffusion model. I found the fast.ai course a bit unfocused and annoying.

moffkalast
0 replies
23h36m

  you see when it's a space egg, it's a single token

I'm not sure if the crew of the Nostromo would agree ;)

albert_e
0 replies
17h0m

His video on backpropagation was a revelation to me:

https://www.youtube.com/watch?v=q8SA3rM6ckI

098799
0 replies
13h26m

Probably more coming soon, given he just left OpenAI to pursue other things.