
Visualizing Attention, a Transformer's Heart [video]

seydor
62 replies
13h34m

I have found the youtube videos by CodeEmporium to be simpler to follow https://www.youtube.com/watch?v=Nw_PJdmydZY

The transformer is hard to describe with analogies, and TBF there is no good explanation why it works, so it may be better to just present the mechanism, "leaving the interpretation to the viewer". Also, it's simpler to describe dot products as vectors projecting onto one another.

mjburgess
58 replies
11h9m

The explanation is just that NNs are a stat fitting alg learning a conditional probability distribution, P(next_word|previous_words). Their weights are a model of this distribution. LLMs are a hardware innovation: they make it possible for GPUs to compute this at scale across TBs of data.

Why does 'mat' follow from 'the cat sat on the ...'? Because 'mat' is the most frequent continuation of that phrase in the dataset, and the NN is a model of those frequencies.

Why is 'London in UK' "known" but 'London in France' isn't? Just because 'UK' occurs much more frequently in the dataset.

The algorithm isn't doing anything other than aligning computation to hardware; the computation isn't doing anything interesting. The value comes from the conditional probability structure in the data, and that comes from people arranging words usefully, because they're communicating information with one another.
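
To make the frequency picture concrete, here's a toy bigram counter -- purely an illustrative sketch of the conditional-frequency idea being described, nothing like what an LLM actually stores:

    from collections import Counter, defaultdict

    # Toy illustration of P(next_word | previous_words) as raw frequencies,
    # reduced to a bigram model for brevity.
    corpus = "the cat sat on the mat . the dog sat on the rug .".split()

    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1

    def p_next(word):
        # relative frequency of each continuation after `word`
        total = sum(counts[word].values())
        return {w: c / total for w, c in counts[word].items()}

    print(p_next("the"))  # {'cat': 0.25, 'mat': 0.25, 'dog': 0.25, 'rug': 0.25}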

nerdponx
19 replies
11h0m

I think you're downplaying the importance of the attention/transformer architecture here. If it was "just" a matter of throwing compute at probabilities, then we wouldn't need any special architecture at all.

P(next_word|previous_words) is ridiculously hard to estimate in a way that is actually useful. Remember how bad text generation used to be before GPT? There is innovation in discovering an architecture that makes it possible to learn P(next_word|previous_words), in addition to the computing techniques and hardware improvements required to make it work.

mjburgess
16 replies
10h16m

Yes, it's really hard -- the innovation is aligning the really basic dot-product similarity mechanism to hardware. You can use basically any NN structure to do the same task; the issue is that they're untrainable because they aren't parallelizable.

There is no innovation here in the sense of a brand new algorithm for modelling conditional probabilities -- the innovation is in adapting the algorithm for GPU training on text/etc.
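
To illustrate the hardware-alignment point, a minimal numpy sketch: every pairwise query-key dot product for a sequence collapses into a single matrix multiply, which is exactly the kind of operation GPUs parallelize well (illustrative only):

    import numpy as np

    # All pairwise query-key dot products computed as one matrix multiply.
    seq_len, d_k = 8, 64
    rng = np.random.default_rng(0)
    Q = rng.standard_normal((seq_len, d_k))   # one query vector per position
    K = rng.standard_normal((seq_len, d_k))   # one key vector per position

    scores = Q @ K.T                          # (seq_len, seq_len) similarities
    assert scores.shape == (seq_len, seq_len)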

bruce343434
9 replies
9h40m

I don't know why you seem to have such a bone to pick with transformers but imo it's still interesting to learn about it, and reading your dismissively toned drivel of "just" and "simply" makes me tired. You're barking up the wrong tree man, what are you on about.

mjburgess
7 replies
9h18m

No issue with transformers -- the entire field of statistical learning, from decision trees to NNs, does the same thing... there's no mystery here. No person with any formal training in mathematical finance, applied statistics, hard experimental sciences on complex domains, etc., would be taken in here.

I'm trying my best to inform people who are interested in being informed, against an entire media ecosystem being played like a puppet-on-a-string by ad companies. The strategy of these companies is to exploit how easy it is to strap anthropomorphic interfaces over models of word frequencies and have everyone lose their minds.

Present the same models as a statistical dashboard, and few would be so adamant that their sci-fi fantasy is the reality.

fellendrone
2 replies
4h18m

> models of word frequencies

Ironically, your best effort to inform people seems to be misinformed.

You're talking about a Markov model, not a language model with trained attention mechanisms. For a start, transformers can consider the entire context (which could be millions of tokens) rather than simple state to state probabilities.

No wonder you believe people are being 'taken in' and 'played by the ad companies'; your own understanding seems to be fundamentally misplaced.

saeranv
1 replies
40m

I think they are accounting for the entire context, they specifically write out:

> P(next_word|previous_words)

So the "next_word" is conditioned on "previous_words" (plural), which I took to mean the joint distribution of all previous words.

But, I think even that's too reductive. The transformer is specifically not a function acting as some incredibly high-dimensional lookup table of token conditional probabilities. It's learning a (relatively) small number of parameters to compress those learned conditional probabilities into a radically lower-dimensional embedding.

Maybe you could describe this as a discriminative model of conditional probability, but at some point, we start describing that kind of information compression as semantic understanding, right?

nerdponx
0 replies
36m

It's reductive because it obscures just how complicated that `P(next_word|previous_words)` is, and it obscures the fact that "previous_words" is itself a carefully-constructed (tokenized & vectorized) representation of a huge amount of text. One individual "state" in this Markov-esque chain is on the order of an entire book, in the bigger models.
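
A rough sketch of what that "state" looks like in practice. The numbers and the whitespace tokenizer here are hypothetical stand-ins (real models use subword tokenizers over far messier text); the point is just the sheer size of the conditioning context:

    # Toy illustration only: the conditioning "state" is a long sequence of
    # token IDs, not a short fixed window as in a classical n-gram model.
    book_text = "call me ishmael " * 50_000        # stand-in for a whole book

    vocab = {w: i for i, w in enumerate(sorted(set(book_text.split())))}
    token_ids = [vocab[w] for w in book_text.split()]

    # "previous_words" in P(next_word | previous_words) is this entire list.
    print(len(token_ids))                          # 150000 conditioning tokens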

divan
1 replies
8h53m

Do you have blog or anything to follow?

mjburgess
0 replies
8h28m

I may start publishing academic papers in XAI as part of a PhD; if I do, I'll share somehow. The problem is the asymmetry of bullshit: the size of paper necessary for academics to feel that claims have been evidenced is book-length for critique but 2pg for "novel contributions".

jameshart
0 replies
6h3m

“There’s no mystery here”

Nobody’s claiming there’s ‘mystery’. Transformers are a well known, publicly documented architecture. This thread is about a video explaining exactly how they work - that they are a highly parallelizable approach that lends itself to scaling back propagation training.

“No person with … formal training … would be taken in here”

All of a sudden you’re accusing someone of perpetuating a fraud - I’m not sure who though. “Ad companies”?

Are you seriously claiming that there hasn’t been a qualitative improvement in the results of language generation tasks as a result of applying transformers in the large language model approach? Word frequencies turn out to be a powerful thing to model!

It’s ALL just hype, none of the work being done in the field has produced any value, and everyone should… use ‘statistical dashboards’ (whatever those are)?

eutectic
0 replies
7h30m

Different models have different inductive biases. There is no way you could build GPT4 with decision trees.

kordlessagain
0 replies
5h44m

Somebody's judgment weights need to be updated to include emoji embeddings.

YetAnotherNick
4 replies
8h47m

No, this is blatantly false. The belief that recurrent models can't be scaled is untrue: people have recently trained Mamba with billions of parameters. The fundamental reason why transformers changed the field is that they are a lot more scalable context-length-wise, and LSTMs, LRUs, etc. don't come close.

mjburgess
1 replies
7h57m

> they are a lot more scalable context-length-wise

Sure, we're agreeing. I'm just being less specific.

YetAnotherNick
0 replies
6h34m

Scalable as in loss wise scalable, not compute wise.

HarHarVeryFunny
1 replies
4h31m

Yes, but pure Mamba doesn't perform as well as a transformer (and neither did LSTMs). This is why you see hybrid architectures like Jamba = Mamba + transformer. The ability to attend to specific tokens is really key, and it's what is lost in recurrent models, where the sequence history is munged into a single state.

YetAnotherNick
0 replies
1h51m

That's my point. It doesn't perform in terms of loss, even though it performs well enough in terms of compute

HarHarVeryFunny
0 replies
4h36m

> Yes, it's really hard -- the innovation is aligning the really basic dot-product similarity mechanism to hardware. You can use basically any NN structure to do the same task; the issue is that they're untrainable because they aren't parallelizable.

This is only partially true. I wouldn't say you could use *any* NN architecture for sequence-to-sequence prediction. You either have to model them as a potentially infinite sequence with an RNN of some sort (e.g. LSTM), or, depending on the sequence type, model them as a hierarchy of sub-sequences, using something like a multi-layered convolution or transformer.

The transformer is certainly well suited to current massively parallel hardware architectures, and this was also a large part of the motivation for the design.

While the transformer isn't the only way to do seq-2-seq with neural nets, I think the reason it is so successful is more than simply being scalable and well matched to the available training hardware. Other techniques just don't work as well. From the mechanistic interpretability work that has been done so far, it seems that learnt "induction heads", utilizing the key-based attention, and layered architecture, are what give transformers their power.

JeremyNT
1 replies
4h50m

> There is innovation in discovering an architecture that makes it possible to learn P(next_word|previous_words), in addition to the computing techniques and hardware improvements required to make it work.

Isn't that essentially what mjburgess said in the parent post?

> LLMs are a hardware innovation: they make it possible for GPUs to compute this at scale across TBs of data... The algorithm isn't doing anything other than aligning computation to hardware

nerdponx
0 replies
2h19m

Not really, and no. Torch and CUDA align computation to hardware.

If it were just a matter of doing that, we would be fine with fully-connected MLP. And maybe that would work with orders of magnitude more data and compute than we currently throw at these models. But we are already pushing the cutting edge of those things to get useful results out of the specialized architecture.

Choosing the right NN architecture is like feature engineering: the exact details don't matter that much, but getting the right overall structure can be the difference between learning a working model and failing to learn a working model, from the same source data with the same information content. Clearly our choice of inductive bias matters, and the transformer architecture is clearly an improvement over other designs.

Surely you wouldn't argue that a CNN is "just" aligning computation to hardware, right? Transformers are clearly showing themselves as a reliably effective model architecture for text in the same way that CNNs are reliably effective for images.

IanCal
12 replies
9h25m

This is wrong, or at least a simplification to the point of removing any value.

> NNs are a stat fitting alg learning a conditional probability distribution, P(next_word|previous_words).

They are trained to maximise this, yes.

> Their weights are a model of this distribution.

That doesn't really follow, but let's leave that.

> Why does 'mat' follow from 'the cat sat on the ...'? Because 'mat' is the most frequent continuation of that phrase in the dataset, and the NN is a model of those frequencies.

Here's the rub: if what you describe is all they're doing, then a sequence of never-before-seen words would have no valid response. All words would be equally likely. It would mean that a single brand-new word would result in absolute gibberish following it, as there's nothing to go on.

Let's try:

Input: I have one kjsdhlisrnj and I add another kjsdhlisrnj, tell me how many kjsdhlisrnj I now have.

Result: You now have two kjsdhlisrnj.

I would wager a solid amount that kjsdhlisrnj never appears in the input data. If it does, pick another one; it doesn't matter.

So we are learning something more general than the frequencies of sequences of tokens.

I always end up pointing to this but OthelloGPT is very interesting https://thegradient.pub/othello/

While it's trained on sequences of moves, what it does is more than just "sequence a,b,c is followed by d most often"

mjburgess
10 replies
9h14m

Any NN "trained on" data sampled from an abstract complete outcome space (eg., a game with formal rules; mathematical sequences, etc) can often represent that space completely. It comes down to whether you can form conditional probability models of the rules, and that's usually possible because that's what abstract rules are.

> I have one kjsdhlisrnj and I add another kjsdhlisrnj, tell me how many kjsdhlisrnj I now have.

1. P(number-word|tell me how many...) > P(other-kinds-of-words|tell me how many...)

2. P(two|I have one ... I add another ...) > P(one|...) > P(three|...) > others

This is trivial.

IanCal
9 replies
9h6m

Right, learning more abstract rules about how things work is the goal and where the value comes in. Not all algorithms are able to do this, even if they can do what you describe in your first comment.

That's why they're interesting: OthelloGPT is interesting because it builds a world model.

mjburgess
8 replies
8h56m

It builds a model of a "world" whose structure is conditional probabilities; this is circular. It's like saying you can use a lego model to build a model of another lego model. All the papers which "show" NNs building "world" models aren't using any world. It's lego modelling lego.

The lack of a world model only matters when the data NNs are trained on aren't valid measures of the world that data is taken to model. All the moves of a chess game are a complete model of chess. All the books ever written aren't a model of, well, anything -- the structure of the universe isn't the structure of text tokens.

The only reason all statistical algorithms, including NNs, appear to model the actual world is because patterns in data give this appearance: P(The Sun is Hot) > P(The Sun is Cold) -- there is no model of the sun here.

The reason P("The Sun is Hot") seems to model the sun, is because we can read the english words "sun" and "hot" -- it is we who think the machine which generates this text does so semantically.. but the people who wrote that phrase in the dataset did so; the machine is just generating "hot" because of that dataset.

IanCal
7 replies
8h48m

Othellogpt is fed only moves and builds a model of the current board state in its activations. It never sees a board.

> It's like saying you can use a lego model to build a model of another lego model.

No, it's like using a description of piece placements and having a picture in mind of what the current board looks like.

mjburgess
6 replies
8h38m

The "board" is abstract. Any game of this sort is defined by a series of conditional probabilities:

{P(Pawn_on_square_blah|previous_moves) ... etc.}

What all statistical learning algorithms model is sets of conditional probabilities. So any stat alg is a model of a set of these rules... that's the "clay" of these models.

The problem is the physical world isn't anything like this. The reason I say "I liked that TV show" is because I had a series of mental states caused by the TV show over time (and so on). This isn't representable as a set of conditional probs in the same way.

You could imagine, at the end of history, there being a total set of all possible conditional probabilities: P(I liked show|my_mental_states, time, person, location, etc.) -- this would be uncomputable, but it could be supposed.

If you had that dataset then yes, NNs would learn the entire structure of the world, because that's the dataset. The problem is that the world cannot be represented in this fashion, not that NNs couldn't model it if it could be; a decision tree could too.

P(I liked the TV show) doesn't follow from any dataset ever collected. It follows from my mental states. So no NN can ever model it. They can model frequency associations of these phrases in historical text documents: this isn't a model of the world.

IanCal
5 replies
8h25m

> Any game of this sort is defined by a series of conditional probabilities: {P(Pawn_on_square_blah|previous_moves) ... etc.}

That would always be 1 or 0, but also that data is not fed into OthelloGPT. That is not the dataset. It is not fed board states at all.

It learns it, but it is not the dataset.

mjburgess
4 replies
8h15m

It is the dataset. When you're dealing with abstract objects (i.e., mathematical spaces), they are all isomorphic.

It doesn't matter if you "feed in" 1+1+1+1 or 2+2 or sqrt(16).

The rules of chess are encoded either by explicit rules or by contrast classes of valid/invalid games. These are equivalent formulations.

When you're dealing with text tokens it does matter whether "Hot" frequently comes after "The Sun is...", because reality isn't an abstract space, and text tokens aren't measures of it.

IanCal
3 replies
6h59m

> It is the dataset.

No. A series of moves alone provides strictly less information than a board state or state + list of rules.

mjburgess
2 replies
6h45m

If the NN learns the game, that is itself an existence proof of the opposite, (by obvious information-theoretic arguments).

Training is supervised, so you don't need bare sets of moves to encode the rules; you just need a way of subsetting the space into contrast classes of valid/invalid.

It's a lie to say the "data" is the moves; the data is the full outcome space: ({legal moves}, {illegal moves}), where the moves are indexed by the board structure (necessarily, since moves are defined by the board structure -- it's an abstract game). So there are two deceptions here: (1) supervision structures the training space; and (2) the individual training rows have sequential structure which maps to board structure.

Complete information about the game is provided to the NN.

But let's be clear, OthelloGPT still generates illegal moves -- showing that it does not learn the binary conditional structure of the actual game.

The deceptiveness of training a NN on a game whose rules are conditional probability structures and then claiming the very-good-quality conditional probability structures it finds are "World Models" is... maddening.

This is all just fraud to me; frauds dressing up other frauds in transparent clothing. LLMs trained on the internet are being sold as approximating the actual world, not 8x8 boardgames. I have nothing polite to say about any of this

IanCal
1 replies
6h28m

> It's a lie to say the "data" is the moves; the data is the full outcome space: ({legal moves}, {illegal moves})

There is nothing about illegal moves provided to othellogpt as far as I'm aware.

> Complete information about the game is provided to the NN.

That is not true. Where is the information that there are two players provided? Or that there are two colours? Or how the colours change? Where is the information about invalid moves provided?

> But let's be clear, OthelloGPT still generates illegal moves -- showing that it does not learn the binary conditional structure of the actual game.

Not perfectly, no. But that's not at all required for my point, though it is relevant if you try to use the fact that it learns to play the game as proof that moves provide all information about legal board states.

mjburgess
0 replies
5h52m

How do you think the moves are represented?

All abstract games of this sort are just sequences of bit patterns, each pattern related to the full legal space by a conditional probability structure (or, equivalently, as set ratios).

Strip away all the NN b/s and anthropomorphic language and just represent it to yourself using bit sets.

Then ask: how hard is it to approximate the space from which these bit sets are drawn using arbitrarily deep conditional probability structures?

it's trivial

the problem the author sets up, about causal structures in the world, cannot be represented as a finite sample of bit-set sequences -- and even if it could, that isn't the data being used

the author hasn't understood the basics of what the 'world model' problem even is

pas
0 replies
2h58m

how does it work underneath?

"kjsdhlisrnj" is in the context, it gets tokenized, and now when the LLM is asked to predict/generate next-token sequences somehow "kjsdhlisrnj" is there too. it learns patterns. okay sure, they ger encoded somehow, but during infernce how does this lead to application of a recalled pattern on the right token(s)?

also, can it invent new words?

albertzeyer
11 replies
10h43m

You are speaking more about n-gram models here. NNs do far more than that.

Or if you just want to say that NNs are used as a statistical model here: Well, yea, but that doesn't really tell you anything. Everything can be a statistical model.

E.g., you could also say "this is exactly the way the human brain works", but that doesn't really tell you anything about how it really works.

mjburgess
8 replies
10h18m

My description is true of any statistical learning algorithm.

The thing that people are looking to for answers, the NN itself, does not have them. That's like looking to Newton's compass to understand his general law of gravitation.

The reason that LLMs trained on the internet and every ebook have the structure of human communication is because the dataset has that structure. Why does the data have that structure? That requires science; there is no explanation "in the compass".

NNs are statistical models trained on data -- drawing analogies to animals is a mystification that causes people's ability to think clearly to jump out the window. No one compares stock price models to the human brain; no banking regulator says, "well your volatility estimates were off because your machines had the wrong thoughts". This is pseudoscience.

Animals are not statistical learning algorithms, so the reason that's uninformative is because it's false. Animals are in direct causal contact with the world and uncover its structure through interventional action and counterfactual reasoning. The structure of animal bodies and their general learning strategies are well known, and have nothing to do with LLMs/NNs.

The reason that I know "The cup is in my hand" is not because P("The cup is in my hand"|HistoricalTexts) > P(not "The cup is in my hand"|HistoricalTexts)

vineyardmike
4 replies
9h22m

> The reason that I know "The cup is in my hand" is not because P("The cup is in my hand"|HistoricalTexts) > P(not "The cup is in my hand"|HistoricalTexts)

I mostly agree with your points, but I still disagree with this premise. Humans (and other animals) absolutely are statistical reasoning machines. They're just advanced ones which can process more than "text" - they're multi-modal.

As a super dumb-simple set of examples: Think about the origin of the phrase "Cargo Cult" and similar religious activities - people will absolutely draw conclusions about the world based on their learned observations. Intellectual "reasoning" (science!) really just relies on more probabilities or correlations.

The reason you know the cup is in your hand is because P("I see a cup and a hand"|HistoryOfEyesight) + P("I feel a cylinder shape"|HistoryOfTactileFeeling) + .... > P(Inverse). You can pretend it's because humans are intelligent beings with deep reasoning skills (not trying to challenge your smarts here!), but humans learn through trial and error just like a NN with reinforcement learning.

Close your eyes and ask a person to randomly place either a cup from your kitchen or a different object in your hand. You can probably tell which one it is. Why? Because you have learned what it feels like, from countless examples of different cups, over years of passive practice. That's basically deep learning.

mjburgess
3 replies
9h4m

I mean something specific by "statistics": modelling frequency associations in static ensembles of data.

Having a body which changes over time that interacts with a world that changes over time makes animal learning not statistical (call it, say, experimental). That animals fall into skinner-box irrational behaviour can be modelled as a kind of statistical learning, but it actually isn't.

It's a failure of ecological salience mechanisms in regulating the "experimental learning" that animals engage in. E.g., with the cargo cults, the reason they adopted that view was because their society had a "big man" value system based on material acquisition, and Western warring powers seemed Very Big and so were humiliating. In order to retain their status they adopted (apparently irrational) theories of how the world worked (gods, etc.).

From the outside this process might seem statistical, but it's the opposite. Their value system made material wealth have a different causal salience which was useful in their original ecology (a small island with small resources), but it went haywire when faced with the whole world.

Eventually these mechanisms update with this new information, or the tribe dies off -- but what's going wrong here is that the very very non-statistical learning ends up describable that way.

This is, indeed, why we should be very concerned about people skinner-boxing themselves with LLMs.

vineyardmike
1 replies
8h26m

> Having a body which changes over time that interacts with a world that changes over time makes animal learning not statistical (call it, say, experimental).

The "experiment" of life is what defines the statical values! Experimentation is just learning what the statistical output of something is.

If I hand you a few dice, you'd probably be able to guess the statistical probability of every number for a given roll. Because you've learned that through years of observation, building a mental model. If I hand you a weighted die, suddenly your mental model is gone, and you can re-learn experimentally by rolling it a bunch. How can you explain experimental learning except "statistically"?

> they adopted (apparently irrational) theories of how the world worked (gods, etc.)

They can be wrong without being irrational. Building an airport doesn't make planes show up, but planes won't show up without an airport. If you're an island nation with little understanding of the global geopolitical environment of WWII, you'd have no idea why planes started showing up on your island, but they keep showing up, and only at an airport. It seems rational to assume they'd continue showing up to airports.

> that animals fall into skinner-box irrational behaviour can be modelled as a kind of statistical learning, but it actually isn't

What is it if not statistical?

Also, skinner boxes are, in a way, perfectly rational. There's no way to understand the environment, and if pushing a button feeds you, then rationally you should push the button when hungry. Humans like to think we're smart because we've invented deductive reasoning, and we quote "correlation is not causation", as if we're not just learning to predict the world around us from past experiences.

mjburgess
0 replies
8h19m

For dice the ensemble average is the time-average: if you roll the dice 1000 times the probability of getting a different result doesn't change.

For almost everything in the world, action on it changes it. There are vanishingly few areas where this isn't the case (most physics, most chemistry, etc.).

Imagine trying to do statistics but every time you sampled from reality the distribution of your sample changes not due to randomness, but because reality has changed. Now, can you do statistics? No.

It makes all the difference in the world to have a body and hold the thing you're studying. Statistics is trying to guess the shape of the ice cube from the puddle; animal learning is making ice cubes.

data_maan
0 replies
2h25m

> Having a body which changes over time that interacts with a world that changes over time makes animal learning not statistical (call it, say, experimental). That animals fall into skinner-box irrational behaviour can be modelled as a kind of statistical learning, but it actually isn't.

RL is doing just this, simulating an environment. And we can have an agent "learn" in that environment.

I think tying learning to a body is too restrictive. The

You strongly rely on the assumption that "something else" generates the statistics we observe, but scientifically, there exists little evidence that this "something else" exists (see e.g. the Bayesian brain).

Demlolomot
2 replies
9h37m

If learning in real life over 5-20 years shows the same result as an LLM being trained on billions of tokens, then yes, it can be compared.

And there are a lot of people out there who do not do a lot of reasoning.

After all, optical illusions exist; our brain generalizes.

The same thing happens with words, like the riddle about the doctor operating on a child, where we discover that the doctor is actually a woman.

And while LLMs only use text, we can already see how multimodal models become better, architectures get better, and hardware does too.

mjburgess
1 replies
9h32m

I don't know what your motivation in comparison is; mine is science, i.e., explanation.

I'm not interested that your best friend emits the same words in the same order as an LLM; I'm more interested that he does so because he enjoys your company, whereas the LLM does not.

Engineers overstep their mission when they assume that because you can substitute one thing for another, and sell a product in doing so, this is informative. It isn't. I'm not interested in whether you can replace the sky with a skybox and have no one notice -- who cares? What might fool an ape is everything; what it means for science is nothing.

Demlolomot
0 replies
4h57m

My thinking is highly influenced by brain research.

We don't just talk about an LLM, we talk about a neural network architecture.

There is a direct link to us (neural networks)

cornholio
1 replies
9h25m

"this is exactly the way the human brain works"

I'm always puzzled by such assertions. A cursory look at the technical aspects of an iterated attention-perceptron transformation clearly shows it's just a convoluted and powerful way to query the training data, a "fancy" Markov chain. The only rationality it can exhibit is that which is already embedded in the dataset. If trained on nonsensical data it would generate nonsense, and if trained with a partially nonsensical dataset it will generate an average between truth and nonsense that maximizes some abstract algorithmic goal.

There is no knowledge generation going on, no rational examination of the dataset through the lens of an internal model of reality that allows the rejection of invalid premises. The intellectual food is already chewed and digested in the form of the training weights, with the model just mechanically extracting the nutrients, as opposed to venturing into the outside world to hunt.

So if it works "just like the human brain", it does so in a very remote sense, just like a basic neural net works "just like the human brain", i.e individual biological neurons can be said to be somewhat similar.

pas
0 replies
3h15m

If a human spends the first 30 years of their life in a cult, they will also be speaking a lot of nonsense - from our point of view.

Sure, we have a nice inner loop, we do some pruning, picking and choosing, updating, weighting things based on emotions, goals, etc.

Who knows how complicated those things will prove to model/implement...

michaelt
4 replies
9h32m

That's not really an explanation that tells people all that much, though.

I can explain that car engines 'just' convert gasoline into forward motion. But if the person hearing the explanation is hoping to learn what a cam belt or a gearbox is, or why cars are more reliable now than they were in the 1970s, or what premium gas is for, or whether helicopter engines work on the same principle - they're going to need a more detailed explanation.

mjburgess
3 replies
9h26m

It explains the LLM/NN. If you want to explain why it emits words in a certain order you need to explain how reality generated the dataset, i.e., you need to explain how people communicate (and so on).

There is no mystery why an NN trained on the night sky would generate nightsky-like photos; the mystery is why those photos have those patterns... solving that is called astrophysics.

Why do people, in reasoning through physics problems, write symbols in a certain order? Well, explain physics, reasoning, mathematical notation, and so on. The ordering of the symbols gives rise to a certain utility in imitating that order -- but it isn't explained by that order. That's circular: "LLMs generate text in the order they do, because that's the order of the text they were given".

michaelt
2 replies
8h12m

That leaves loads of stuff unexplained.

If the LLM is capable of rewording the MIT license into a set of hard-hitting rap battle lyrics, but the training dataset didn't contain any examples of anyone doing that, is the LLM therefore capable of producing output beyond the limits of its training data set?

Is an LLM inherently constrained to mediocrity? If an LLM were writing a novel, does its design force it to produce cliche characters and predictable plotlines? If applied in science, are they inherently incapable of advancing the boundaries of human knowledge?

Why transformers instead of, say, LSTMs?

Must attention be multi-headed? Why can't the model have a simpler architecture, allowing such implementation details to emerge from the training data?

Must they be so big that leading performance is only in the hands of multi-billion-dollar corporations?

What's going on with language handling? Are facts learned in an abstract enough way that they can cross language barriers? Should a model produce different statements of fact when questioned in different languages? Does France need a French-language LLM?

Is it reasonable to expect models to perform basic arithmetic accurately? What about summarising long documents?

Why is it that I can ask questions with misspellings, but get answers with largely correct spelling? If misspellings were in the training data, why aren't they in the output? Does the cleverness that stops LLMs from learning misspellings from the training data also stop them from learning other common mistakes?

If LLMs can be trained to be polite despite having examples of impoliteness in their training data, can they also be trained to not be racist, despite having examples of racism in their training data?

Can a model learn a fact that is very rarely present in the training data - like an interesting result in an obscure academic paper? Or must a fact be widely known and oft-repeated in order to be learned?

Merely saying "it predicts the next word" doesn't really explain much at all.

mjburgess
1 replies
8h5m

Which conditional probability sequences can be exploited for engineering utility cannot be known ahead of time; nor is it explained by the NN. It's explained by investigating how the data was created by people.

Train a NN to generate pictures of the nightsky: which can be used for navigation? Who knows, ahead of time. The only way of knowing is to have an explanation of how the solar system works and then check the pictures are accurate enough.

The NN which generates photos of the nightsky has nothing in it that explains the solar system, nor does any aspect of an NN model the solar system. The photos it was trained on happened to have their pixels arranged in that order.

Why those arrangements occur is explained by astrophysics.

If you want to understand what ChatGPT can do, you need to ask OpenAI for their training data and then perform scientific investigations of its structure and how that structure came to be.

Talking in terms of the NN model is propaganda and pseudoscience: the NN didn't arrange the pixels, gravity did. Likewise, the NN isn't arranging rap lyrics in that order because it's rapping: singers are.

There is no actual mystery here. It's just that we are prevented from accessing the data by OpenAI, and struggle to explain the reality which generated that data -- which requires years of actual science.

pas
0 replies
2h49m

It has a lot of things already encoded regarding the solar system, but it cannot really access it, it cannot - as far as I know - run functions on its own internal encoded data, right? If it does something like that, it's because it learned that higher-level pattern based on training data.

The problem with NN arrangements in general is that we don't know if it's actually pulling out some exact training data (or a useful so-far-unseen pattern from the data!) or it's some distorted confabulation. (Clever Hans all over again. If I ask ChatGPT to code me a Node.js IMAP backup program it does, but the package it gleefully imports/require()s is made up.)

And while the typical artsy arts have loose rules, where making up new shit based on what people wish for is basically the only one, in other contexts that's a hard no-no.

seydor
2 replies
10h50m

People specifically would like to know what the attention calculations add to this learning of the distribution

ffwd
1 replies
10h31m

Just speculating, but I think attention enables differentiation of semantic concepts for a word or sentence within a particular context. For any total set of training data you have a smaller number of semantic concepts (say you have 10,000 words, they might contain 2,000 semantic concepts, and those concepts are defined by the sentence structure and surrounding words, which is why they have a particular meaning), and attention allows the model to differentiate those different contexts at different levels (words, etc.). Also, the fact that you can do this attention at runtime/inference means you can generate the context from the prompt, which enables the flexibility of variable prompt/variable output, but you lose the precision of giving an exact prompt and getting an exact answer.

ffwd
0 replies
8h55m

I'm not one to whine about downvotes but I just have to say, it's a bad feeling when I can't even respond to the negative feedback because there is no accompanying comment. Did I misinterpret something? Did you? Who will ever know when there is no information. :L

forrestthewoods
1 replies
10h49m

I find this take super weak sauce and shallow.

This recent $10,000 challenge is super super interesting imho. https://twitter.com/VictorTaelin/status/1778100581837480178

State of the art models are doing more than “just” predicting the probability of the next symbol.

mjburgess
0 replies
10h11m

You underestimate the properties of the sequential-conditional structure of human communication.

Consider how a clever 6yo could fake being a physicist with access to a library of physics textbooks and a shredder. All the work is done for them. You'd need to be a physicist to spot them faking it.

Of course, LLMs are in a much better position than having shredded physics textbooks -- they have shreddings of all books. So you actually have to try to expose this process, rather than just gullibly prompt using confirmation bias. It's trivial to show they work this way, both formally and practically.

The issue is, practically, gullible people aren't trying.

sirsinsalot
0 replies
6h32m

It isn't some kind of Markov chain situation. Attention cross-links the abstract meaning of words, subtle implications based on context and so on.

So, "mat" follows "the cat sat on the" where we understand the entire worldview of the dataset used for training; not just the next-word probability based on one or more previous words ... it's based on all previous meaning probability, and those meaning probablility and so on.

nextaccountic
0 replies
10h43m

> Why does 'mat' follow from 'the cat sat on the ...'? Because 'mat' is the most frequent continuation of that phrase in the dataset, and the NN is a model of those frequencies.

What about cases that are not present in the dataset?

The model must be doing something besides storing raw probabilities to avoid overfitting and enable generalization (imagine that you could have a very performant model - when it works - but it sometimes would spew "Invalid input, this was not in the dataset so I don't have a conditional probability and I will bail out")

fellendrone
0 replies
5h18m

> Why does 'mat' follow from 'the cat sat on the ...'

You're confidently incorrect by oversimplifying all LLMs to a base model performing a completion from a trivial context of 5 words.

This is tantamount to a straw man. Not only do few people use untuned base models, it completely ignores in-context learning that allows the model to build complex semantic structures from the relationships learnt from its training data.

Unlike base models, instruct and chat fine-tuning teaches models to 'reason' (or rather, perform semantic calculations in abstract latent spaces) with their "conditional probability structure", as you call it, to varying extents. The model must learn to use its 'facts', understand semantics, and perform abstractions in order to follow arbitrary instructions.

You're also conflating the training metric of "predicting tokens" with the mechanisms required to satisfy this metric for complex instructions. It's like saying "animals are just performing survival of the fittest". While technically correct, complex behaviours evolve to satisfy this 'survival' metric.

You could argue they're "just stitching together phrases", but then you would be varying degrees of wrong:

For one, this assumes phrases are compressed into semantically addressable units, which is already a form of abstraction ripe for allowing reasoning beyond 'stochastic parroting'.

For two, it's well known that the first layers perform basic structural analysis such as grammar, and later layers perform increasing levels of abstract processing.

For three, it shows a lack of understanding in how transformers perform semantic computation in-context from the relationships learnt by the feed-forward layers. If you're genuinely interested in understanding the computation model of transformers and how attention can perform semantic computation, take a look here: https://srush.github.io/raspy/

For a practical example of 'understanding' (to use the term loosely), give an instruct/chat tuned model the text of an article and ask it something like "What questions should this article answer, but doesn't?" This requires not just extracting phrases from a source, but understanding the context of the article on several levels, then reasoning about what the context is not asserting. Even comparatively simple 4x7B MoE models are able to do this effectively.

nerdponx
2 replies
11h11m

> TBF there is no good explanation why it works

My mental justification for attention has always been that the output of the transformer is a sequence of new token vectors such that each individual output token vector incorporates contextual information from the surrounding input token vectors. I know it's incomplete, but it's better than nothing at all.

rcarmo
0 replies
10h4m

You're effectively steering the predictions based on adjacent vectors (and precursors from the prompt). That mental model works fine.

eurekin
0 replies
9h38m

> TBF there is no good explanation why it works

I thought the general consensus was: "transformers allow neural networks to have adaptive weights".

As opposed to the previous architectures, where every edge connecting two neurons always has the same weight.

EDIT: a good video, where it's actually explained better: https://youtu.be/OFS90-FX6pg?t=750&si=A_HrX1P3TEfFvLay

bilsbie
14 replies
17h9m

I finally understand this! Why did every other video make it so confusing!

chrishare
2 replies
16h51m

It is confusing, 3b1b is just that good.

visarga
1 replies
12h30m

At the same time it feels extremely simple

attention(Q, K, V) = softmax(Q K^T / √d_k) @ V

is just half a row; the multi-head, masking and positional stuff are just toppings

we have many basic algorithms in CS that are more involved, it's amazing we get language understanding from such simple math
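
Written out as runnable numpy, a bare sketch of that one line (real implementations add the multi-head, masking and positional toppings mentioned above):

    import numpy as np

    def attention(Q, K, V):
        # softmax(Q K^T / sqrt(d_k)) V -- the "half a row" above
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
        return weights @ V

    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((5, 16)) for _ in range(3))
    out = attention(Q, K, V)   # (5, 16): one context-mixed vector per position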

bilsbie
0 replies
6h39m

For me I never had too much trouble understanding the algorithm. But this is the first time I can see why it works.

Solvency
2 replies
16h50m

Because:

1. good communication requires an intelligence that most people sadly lack

2. because the type of people who are smart enough to invent transformers have zero incentive to make them easily understandable.

most documents are written by authors subconsciously desperate to mentally flex on their peers.

penguin_booze
0 replies
14h39m

Pedagogy requires empathy: knowing what it's like to not know something. Good teachers will often draw on experiences the listener is already familiar with, and then bridge the gap. This skill is orthogonal to mastery of the subject itself, which I think is the reason most descriptions sound confusing, inadequate, and/or incomprehensible.

Often, the disseminating medium is one-sided, like a video or a blog post, which doesn't help either. A conversational interaction would help the expert sense why someone outside the domain finds the subject confusing ("ah, I see what you mean"...), discuss common pitfalls ("you might think it's like this... but no, it's more like this..."), etc.

WithinReason
0 replies
10h13m

2. It's not malice. The longer you have understood something the harder it is to explain it, since you already forgot what it was like to not understand it.

Al-Khwarizmi
2 replies
10h33m

Not sure if you mean it as a rhetorical question, but I think it's an interesting question. I think there are at least three factors why most people are confused about Transformers:

1. The standard terminology is "meh" at best. The word "attention" itself is just barely intuitive, "self-attention" is worse, and don't get me started about "key" and "value".

2. The key papers (Attention is All You Need, the BERT paper, etc.) are badly written. This is probably an unpopular opinion. But note that I'm not diminishing their merits. It's perfectly compatible to write a hugely impactful, transformative paper describing an amazing breakthrough, but just don't explain it very well. And that's exactly what happened, IMO.

3. The way in which these architectures were discovered was largely by throwing things at the wall and seeing what stuck. There was no reflection process that ended in a prediction that such an architecture would work well, which was then empirically verified. It's empirical all the way through. This means that we don't have a full understanding of why it works so well; all explanations are post hoc rationalizations (in fact, lately there is some work implying that other architectures may work equally well if tweaked enough). It's hard to explain something that you don't even fully understand.

Everyone who is trying to explain transformers has to overcome these three disadvantages... so most explanations are confusing.

maleldil
0 replies
8h4m

> This is probably an unpopular opinion

There's a reason The Illustrated Transformer[1] was/is so popular: it made the original paper much more digestible.

[1] https://jalammar.github.io/illustrated-transformer/

cmplxconjugate
0 replies
10h9m

> This is probably an unpopular opinion.

I wouldn't say so. Historically it's quite common. Maxwell's EM papers used such convoluted notation that they are quite difficult to read. It wasn't until they were reformulated in vector calculus that they became infinitely more digestible.

I think though your third point is the most important; right now people are focused on results.

ur-whale
1 replies
13h57m

> Why did every other video make it so confusing!

In my experience, with very few notable exceptions (e.g. Feynman), researchers are the worst when it comes to clearly explaining to others what they're doing.

I'm at the point where I'm starting to believe that pedagogy and research generally are mutually exclusive skills.

namaria
0 replies
9h35m

It's extraordinarily difficult to imagine how it feels not to understand something. Great educators can bridge that gap. I don't think it's correlated with research ability in any way. It's just a very rare skill set, to be able to empathize with people who don't understand what you do.

thomasahle
1 replies
15h35m

I'm someone who would love to get better at making educational videos/content. 3b1b is obviously the gold standard here.

I'm curious what things other videos did worse compared to 3b1b?

bilsbie
0 replies
15h19m

I think he had a good, intuitive understanding that he wanted to communicate and he made it come through.

I like how he was able to avoid going into the weeds and stay focused on leading you to understanding. I remember another video where I got really hung up on positional encoding and I felt like I couldn't continue until I understood that. Or other videos that overfocus on matrix operations or softmax, etc.

thinkingtoilet
0 replies
5h59m

Grant has a gift of explaining complicated things very clearly. There's a good reason his channel is so popular.

Xcelerate
12 replies
8h57m

As someone with a background in quantum chemistry and some types of machine learning (but not neural networks so much) it was a bit striking while watching this video to see the parallels between the transformer model and quantum mechanics.

In quantum mechanics, the state of your entire physical system is encoded as a very high dimensional normalized vector (i.e., a ray in a Hilbert space). The evolution of this vector through time is given by the time-translation operator for the system, which can loosely be thought of as a unitary matrix U (i.e., a probability preserving linear transformation) equal to exp(-iHt), where H is the Hamiltonian matrix of the system that captures its “energy dynamics”.

From the video, the author states that the prediction of the next token in the sequence is determined by computing the next context-aware embedding vector from the last context-aware embedding vector alone. Our prediction is therefore the result of a linear state function applied to a high dimensional vector. This seems a lot to me like we have produced a Hamiltonian of our overall system (generated offline via the training data), then we reparameterize our particular subsystem (the context window) to put it into an appropriate basis congruent with the Hamiltonian of the system, then we apply a one step time translation, and finally transform the resulting vector back into its original basis.

IDK, when your background involves research in a certain field, every problem looks like a nail for that particular hammer. Does anyone else see parallels here or is this a bit of a stretch?

bdjsiqoocwk
4 replies
8h43m

I think you're just describing a state machine, no? The fact that you encode the state in a vector and steps by matrices is an implementation detail...?

Xcelerate
3 replies
8h9m

Perhaps a probabilistic FSM describes the actual computational process better since we don’t have a concept equivalent to superposition with transformers (I think?), but the framework of a FSM alone doesn’t seem to capture the specifics of where the model/machine comes from (what I’m calling the Hamiltonian), nor how a given context window (the subsystem) relates to it. The change of basis that involves the attention mechanism (to achieve context-awareness) seems to align better with existing concepts in QM.

One might model the human brain as a FSM as well, but I’m not sure I’d call the predictive ability of the brain an implementation detail.

BoGoToTo
2 replies
7h46m

> context window

I actually just asked a question on the physics stack exchange that is semi relevant to this. https://physics.stackexchange.com/questions/810429/functiona...

In my question I was asking about a hypothetical time-evolution operator that includes an analog of a light cone that you could think of as a context window. If you had a quantum state that was evolved through time by this operator then I think you could think of the speed of light being a byproduct of the width of the context window of some operator that progresses the quantum state forward by some time interval.

Note I am very much hobbyist-tier with physics so I could also be way off base and this could all be nonsense.

ricardobeat
1 replies
7h28m

I’m way out of my depth here, but wouldn’t such a function have to encode an amount of information/state orders of magnitude larger than the definition of the function itself?

If this turns out to be possible, we will have found the solution to the Sloot mystery :D

https://en.m.wikipedia.org/wiki/Sloot_Digital_Coding_System

DaiPlusPlus
0 replies
3h34m

The article references patent “1009908C2” but I can’t find it in the Dutch patent site, nor Google Patent search.

The rest of the article has "crank" written all over it; almost certainly investor fraud too - it'd be straightforward to fake the claimed smartcard video thing to a nontechnical observer - though not quite as egregious as Steorn Orbo or Theranos.

BoGoToTo
3 replies
7h56m

I've been thinking about this a bit lately. If time is non-continuous, could you model the time evolution of the universe as some operator recursively applied to the quantum state of the universe? If each application of the operator progresses the state of the universe by a single Planck time, could we even observe a difference between that and a universe where time is continuous?

tweezy
0 replies
3h42m

So one of the most "out there" non-fiction books I've read recently is called "Alien Information Theory". It's a wild ride and there's a lot of flat-out crazy stuff in it but it's a really engaging read. It's written by a computational neuroscientist who's obsessed with DMT. The DMT parts are pretty wild, but the computational neuroscience stuff is intriguing.

In one part he talks about a thought experiment modeling the universe as a multidimensional cellular automaton, where fundamental particles are nothing more than the information they contain, and particles colliding is a computation that tells that node and the adjacent nodes how to update their state.

Way out there, and I'm not saying there's any truth to it. But it was a really interesting and fun concept to chew on.

pas
0 replies
3h23m

This sounds like the Bohmian pilot wave theory (which is a global formulation of QM). ... Which might not be that crazy, since spooky action at a distance is already a given. And in cosmology (or quantum gravity) some models describe a region of space based only on its surface. So in some sense the universe is much less information-dense than we think.

https://en.m.wikipedia.org/wiki/Holographic_principle

BobbyTables2
0 replies
4h50m

I think Wolfram made news proposing something roughly along these lines.

Either way, I find Planck time/energy to be a very spooky concept.

https://wolframphysics.org/

lagrange77
0 replies
8h30m

I only understand half of it, but it sounds very interesting. I've always wondered, if the principle of stationary action could be of any help with machine learning, e.g. provide an alternative point of view / formulation.

francasso
0 replies
7h1m

I don't think the analogy holds: even if you forget all the preceding non linear steps, you are still left with just a linear dynamical system. It's neither complex nor unitary, which are two fundamental characteristics of quantum mechanics.

cmgbhm
0 replies
1h6m

Not a direct comment on the question, but I had a math PhD as an intern before. One of his comments was that all this high-dimensional linear algebra stuff was super advanced in the 1900s and has plenty of room for new CS discovery.

Didn’t make the “what was going on then in physics “ connection until now.

rollinDyno
6 replies
14h48m

Hold on, every predicted token is only a function of the previous token's vector? I must have something wrong. This would mean a whole novel's worth of context would have to fit within the embedding of "was", which is of length 12,288 in this example. Is it really possible that this space is so rich as to have a single point in it encapsulate a whole novel?

vanjajaja1
1 replies
14h28m

at that point what it has is not a representation of the input, it's a representation of what the next output could be. I.e., it's a lossy process and you can't extract what came in the past, only the details relevant to next-word prediction

(is my understanding)

rollinDyno
0 replies
12h27m

If the point were the representation of only the next token, and predicted tokens were a function of only the preceding token, then the vector of the new token wouldn't have the information to produce new tokens that kept telling the novel.

faramarz
1 replies
14h30m

it's not about a single point encapsulating a novel, but how sequences of such embeddings can represent complex ideas when processed by the model's layers.

each prediction is based on a weighted context of all previous tokens, not just the immediately preceding one.

rollinDyno
0 replies
12h32m

That weighted context is the 12,288-dimensional vector, no?

I suppose that when each element in the vector is 16 bits, the space is immense and capable of having a novel in a point.
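
Back-of-envelope on that, assuming 12,288 dimensions at 16 bits each as discussed above; this is raw capacity only and says nothing about how much of it a trained model actually uses:

    dims, bits_per_dim = 12_288, 16
    total_bits = dims * bits_per_dim
    print(total_bits)              # 196608 bits
    print(total_bits // 8)         # 24576 bytes, a few thousand words of text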

jgehring
0 replies
8h52m

That's what happens in the very last layer. But at that point the embedding for "was" got enriched multiple times, i.e., in each attention pass, with information from the whole context (which is the whole novel here). So for the example, it would contain the information to predict, let's say, the first token of the first name of the murderer.

Expanding on that, you could imagine that the intent of the sentence to complete (figuring out the murderer) would have to be captured in the first attention passes so that other layers would then be able to integrate more and more context in order to extract that information from the whole context. Also, it means that the forward passes for previous tokens need to have extracted enough salient high-level information already since you don't re-compute all attention passes for all tokens for each next token to predict.
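
To make the "enriched with context" idea concrete, here is a minimal single-head causal attention pass in NumPy (a toy sketch with random weights, not code from the video): the output row for the final token is a weighted mix of value vectors from every earlier position, which is how the last embedding can end up carrying whodunit-level information.

    import numpy as np

    def causal_self_attention(X, W_Q, W_K, W_V):
        """One single-head causal attention pass (toy sketch)."""
        S, _ = X.shape
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        scores = Q @ K.T / np.sqrt(K.shape[1])                      # pairwise query-key scores
        scores[np.triu(np.ones((S, S), dtype=bool), 1)] = -np.inf   # mask out future tokens
        w = np.exp(scores - scores.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)                           # softmax over allowed positions
        return w @ V                                                # row i mixes values from tokens 0..i

    # Toy sizes; GPT-3 uses d_model = 12,288 with many heads and many layers.
    rng = np.random.default_rng(0)
    S, d_model, d_head = 8, 16, 4
    X = rng.normal(size=(S, d_model))                   # embeddings for a short context
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    out = causal_self_attention(X, W_Q, W_K, W_V)
    print(out[-1])   # the last token's (per-head) output, built from all previous positions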

evolvingstuff
0 replies
44m

You are correct, that is an error in an otherwise great video. The (k+1)th token is not merely a function of the kth vector, but rather of all prior vectors (combined using attention). There is nothing "special" about the kth vector.

spacecadet
5 replies
16h37m

Fun video. Much of my "art" lately has been dissecting models, injecting or altering attention, and creating animated visualizations of their inner workings. Some really fun shit.

spacecadet
3 replies
15h53m

Nah someone down voted it. And yes, it looks like that + 20 others that are animated.

CamperBob2
2 replies
15h11m

Downvotes == empty boats. If "Empty Boat parable" doesn't ring a bell, Google it...

spacecadet
0 replies
8h57m

anger is a gift

globalnode
0 replies
14h42m

unless an algorithm decides to block or devalue the content, but yeah i looked it up, very interesting parable, thanks for sharing.

rayval
3 replies
12h4m

Here's a compelling visualization of the functioning of an LLM when processing a simple request: https://bbycroft.net/llm

This complements the detailed description provided by 3blue1brown

bugthe0ry
1 replies
8h57m

When visualised this way, the scale of GPT-3 is insane. I can't imagine what GPT-4 would look like here.

spi
0 replies
5h50m

IIRC, GPT-4 would actually be a bit _smaller_ to visualize than GPT-3. Details are not public, but from the leaks GPT-4 (at least, some by-now old version of it) was a mixture of experts, with each expert having around 110B parameters [1]. So, while the total number of parameters is bigger than GPT-3's (1800B vs. 175B), it is "just" 16 copies of a smaller (110B-parameter) model. So if you wanted to visualize it in any meaningful way, the plot wouldn't grow bigger - or it would, if you included all the different experts, but they are just copies of the same architecture with different parameters, which is not all that useful for visualization purposes.

[1] https://medium.com/@daniellefranca96/gpt4-all-details-leaked...

lying4fun
0 replies
15m

amazing visualisation

tylerneylon
2 replies
13h56m

Awesome video. This helps to show how the Q*K matrix multiplication is a bottleneck, because if you have sequence (context window) length S, then you need to store an SxS size matrix (the result of all queries times all keys) in memory.

One great way to improve on this bottleneck is a new-ish idea called Ring Attention. This is a good article explaining it:

https://learnandburn.ai/p/how-to-build-a-10m-token-context

(I edited that article.)
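
As a rough back-of-the-envelope illustration of that bottleneck (my numbers, not from the comment or the article): at S = 100,000 tokens with 2-byte (fp16) scores, a single head's S x S score matrix is 100,000^2 x 2 bytes, i.e. about 20 GB, which is why long-context schemes such as Ring Attention avoid materializing it on one device.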

rahimnathwani
0 replies
13h10m

He lists Ring Attention and half a dozen other techniques, but they're not within the scope of this video: https://youtu.be/eMlx5fFNoYc?t=784

danielhanchen
0 replies
12h35m

Oh, with Flash Attention you never have to construct the (S, S) matrix at all (also in the article). Since it's softmax(Q @ K^T / sqrt(d)) @ V, you can form the final output in tiles.

In Unsloth, memory usage scales linearly (not quadratically) due to Flash Attention (+ you get 2x faster finetuning, 80% less VRAM use + 2x faster inference). Still O(N^2) FLOPs though.

On that note, on long contexts, Unsloth's latest release fits 4x longer contexts than HF+FA2 with +1.9% overhead. So 228K context on H100.
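
For anyone curious what "form the final output in tiles" means, here is a minimal NumPy sketch of the online-softmax trick that Flash Attention builds on (my own illustrative code, not Unsloth's or the actual Flash Attention kernels; real implementations also tile the queries and run fused on the GPU):

    import numpy as np

    def tiled_attention(Q, K, V, block=128):
        """softmax(Q @ K.T / sqrt(d)) @ V computed block-by-block over the keys/values,
        without ever materializing the full (S, S) score matrix."""
        S, d = Q.shape
        scale = 1.0 / np.sqrt(d)
        m = np.full(S, -np.inf)      # running row-wise max of scores
        l = np.zeros(S)              # running row-wise sum of exp(score - m)
        acc = np.zeros((S, d))       # running weighted sum of value vectors
        for start in range(0, S, block):
            Kb, Vb = K[start:start + block], V[start:start + block]
            scores = (Q @ Kb.T) * scale                 # only an (S, block) tile
            m_new = np.maximum(m, scores.max(axis=1))
            corr = np.exp(m - m_new)                    # rescale old statistics to the new max
            p = np.exp(scores - m_new[:, None])
            l = l * corr + p.sum(axis=1)
            acc = acc * corr[:, None] + p @ Vb
            m = m_new
        return acc / l[:, None]

    # Sanity check against the naive version that does build the (S, S) matrix.
    rng = np.random.default_rng(0)
    S, d = 512, 64
    Q, K, V = rng.normal(size=(3, S, d))
    scores = Q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    naive = (w / w.sum(axis=1, keepdims=True)) @ V
    print(np.allclose(tiled_attention(Q, K, V), naive))   # True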

shahbazac
2 replies
6h4m

Is there a reference which describes how the current architecture evolved? Perhaps from very simple core idea to the famous “all you need paper?”

Otherwise it feels like lots of machinery created out of nowhere. Lots of calculations and very little intuition.

Jeremy Howard made a comment on Twitter that he had seen various versions of this idea come up again and again - implying that this was a natural idea. I would love to see examples of where else this has come up so I can build an intuitive understanding.

HarHarVeryFunny
0 replies
3h47m

Roughly:

1) The initial seq-2-seq approach was using LSTMs - one to encode the input sequence, and one to decode the output sequence. It's amazing that this worked at all - encode a variable length sentence into a fixed size vector, then decode it back into another sequence, usually of different length (e.g. translate from one language to another).

2) There are two weaknesses of this RNN/LSTM approach - the fixed size representation, and the corresponding lack of ability to determine which parts of the input sequence to use when generating specific parts of the output sequence. These deficiencies were addressed by Bahdanau et al in an architecture that combined encoder-decoder RNNs with an attention mechanism ("Bahdanau attention") that looked at each past state of the RNN, not just the final one.

3) RNNs are inefficient to train, so Jakob Uszkoreit was motivated to come up with an approach that better utilized available massively parallel hardware, and noted that language is as much hierarchical as sequential, suggesting a layered architecture where at each layer the tokens of the sub-sequence would be processed in parallel, while retaining a Bahdanau-type attention mechanism where these tokens would attend to each other ("self-attention") to predict the next layer of the hierarchy. Apparently in the initial implementation the idea worked, but not better than other contemporary approaches (incl. convolution); then another team member, Noam Shazeer, took the idea and developed it, coming up with an architecture (which I've never seen described) that worked much better, which was then experimentally ablated to remove unnecessary components, resulting in the original transformer. I'm not sure who came up with the specific key-based form of attention in this final architecture.

4) The original transformer, as described in the "attention is all you need paper", still had a separate encoder and decoder, copying earlier RNN based approaches, and this was used in some early models such as Google's BERT, but this is unnecessary for language models, and OpenAI's GPT just used the decoder component, which is what everyone uses today. With this decoder-only transformer architecture the input sentence is input into the bottom layer of the transformer, and transformed one step at a time as it passes through each subsequent layer, before emerging at the top. The input sequence has an end-of-sequence token appended to it, which is what gets transformed into the next-token (last token) of the output sequence.

abotsis
2 replies
13h26m

I think what made this so digestible for me were the animations. The timing, how they expand/contract and unfold while he’s speaking, is all very well done.

_delirium
1 replies
13h4m

That is definitely one of the things he does better than most. He actually wrote his own custom animation library for math animations: https://github.com/3b1b/manim

thomasahle
1 replies
15h34m

I like the way he uses a low-rank decomposition of the Value matrix instead of Value+Output matrices. Much more intuitive!

imjonse
0 replies
12h6m

This is the first time I've heard about the Value matrix being low rank, so for me this was the confusing part. Codebases I have seen also have value + output matrices, so it is clearer that Q, K, V are similar sizes and there's a separate projection matrix that adapts to the dimensions of the next network layer. UPDATE: He mentions this in the last sections of the video.
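
To spell out why the two views agree (a toy sketch with made-up sizes, not code from the video or any particular codebase): the per-head value projection followed by the output projection compose into a single d_model x d_model map whose rank is at most d_head, which is the low-rank "value matrix" framing used in the video.

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, d_head = 512, 64        # toy sizes; GPT-3 uses d_model = 12,288, d_head = 128

    W_V = rng.normal(size=(d_model, d_head))    # per-head value projection (down to d_head)
    W_O = rng.normal(size=(d_head, d_model))    # output projection (back up to d_model)
    x = rng.normal(size=(d_model,))             # an attention-weighted residual-stream vector

    two_step = (x @ W_V) @ W_O                  # the usual "value then output" implementation
    W_VO = W_V @ W_O                            # one (d_model, d_model) matrix of rank <= d_head
    one_step = x @ W_VO                         # the single low-rank "value map" view

    print(np.allclose(two_step, one_step))          # True: same linear map, two factorizations
    print(np.linalg.matrix_rank(W_VO) <= d_head)    # True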

sthatipamala
0 replies
17h23m

Sounds interesting; what else is part of those onboarding docs?

mastazi
1 replies
16h47m

That example with the "was" token at the end of a murder novel is genius (at 3:58 - 4:28 in the video); really easy for a non-technical person to understand.

hamburga
0 replies
16h12m

I think Ilya gets credit for that example — I’ve heard him use it in his interview with Jensen Huang.

justanotherjoe
1 replies
12h0m

It seems he brushes over the positional encoding, which for me was the most puzzling part of transformers. The way I understood it, positional encoding is much like dates. Just like dates, there are repeating minutes, hours, days, months... etc. Each of these values has a shorter 'wavelength' than the next. The values are then used to identify the position of each token. Like, 'oh, I'm seeing January 5th tokens. I'm January 4th. This means this is after me'. Of course the real positional encoding is much smoother and doesn't have abrupt ends like dates/times, but I think this was the original motivation for positional encodings.

nerdponx
0 replies
11h4m

That's one way to think about it.

It's a clever way to encode "position in sequence" as some kind of smooth signal that can be added to each input vector. You might appreciate this detailed explanation: https://towardsdatascience.com/master-positional-encoding-pa...

Incidentally, you can encode dates (e.g. day of week) in a model as sin(day of week) and cos(day of week) to ensure that "day 7" is mathematically adjacent to "day 1".
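
As a concrete sketch of both points (my own code, using the sinusoidal scheme from the original "Attention Is All You Need" paper rather than whatever encoding a given model actually learns): each pair of dimensions oscillates at its own 'wavelength', which is the clock/calendar analogy above, and the same sin/cos trick makes cyclic features like day-of-week wrap around so that day 7 sits next to day 1.

    import numpy as np

    def sinusoidal_positional_encoding(seq_len, d_model):
        """PE[pos, 2i] = sin(pos / 10000**(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
        pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
        two_i = np.arange(0, d_model, 2)[None, :]          # (1, d_model // 2)
        angles = pos / np.power(10000.0, two_i / d_model)  # longer wavelengths as 2i grows
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles)
        pe[:, 1::2] = np.cos(angles)
        return pe                                          # added to the token embeddings

    print(sinusoidal_positional_encoding(seq_len=16, d_model=8).shape)   # (16, 8)

    # The day-of-week idea from the comment above, made cyclic so day 7 is adjacent to day 1.
    day = np.arange(1, 8)
    day_features = np.stack([np.sin(2 * np.pi * day / 7), np.cos(2 * np.pi * day / 7)], axis=1)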

jiggawatts
1 replies
17h17m

It always blows my mind that Grant Sanderson can explain complex topics in such a clear, understandable way.

I've seen several tutorials, visualisations, and blogs explaining Transformers, but I didn't fully understand them until this video.

chrishare
0 replies
16h49m

His content and impact is phenomenal

YossarianFrPrez
1 replies
16h50m

This video (with a slightly different title on YouTube) helped me realize that the attention mechanism isn't exactly a specific function so much as it is a meta-function. If I understand it correctly, Attention + learned weights effectively enables a Transformer to learn a semi-arbitrary function, one which involves a matching mechanism (i.e., the scaled dot-product.)

hackinthebochs
0 replies
16h31m

Indeed. The power of attention is that it searches the space of functions and surfaces the best function given the constraints. This is why I think linear attention will never come close to the ability of standard attention, the quadratic term is a necessary feature of searching over all pairs of inputs and outputs.

mehulashah
0 replies
13h49m

This is one of the best explanations that I’ve seen on the topic. I wish there were more work, however, not on how Transformers work, but on why they work. We are still figuring it out, but I feel that the exploration is not at all systematic.

kordlessagain
0 replies
5h41m

What I'm now wondering about is how intuition to connect completely separate ideas works in humans. I will have very strong intuition something is true, but very little way to show it directly. Of course my feedback on that may be biased, but it does seem some people have "better" intuition than others.

cs702
0 replies
6m

Fantastic work by Grant Sanderson, as usual.

Attention has won.[a]

It deserves to be more widely understood.

---

[a] Nothing has outperformed attention so far, not even Mamba: https://arxiv.org/abs/2402.01032

bjornsing
0 replies
9h16m

This was the best explanation I’ve seen. I think it comes down to essentially two aspects: 1) he doesn’t try to hide complexity and 2) he explains what he thinks is the purpose of each computation. This really reduces the room for ambiguity that ruins so many other attempts to explain transformers.