I have found the YouTube videos by CodeEmporium to be simpler to follow: https://www.youtube.com/watch?v=Nw_PJdmydZY
Transformer is hard to describe with analogies, and TBF there is no good explanation why it works, so it may be better to just present the mechanism, "leaving the interpretation to the viewer". Also, it's simpler to describe dot products as vectors projecting on one another
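For instance, the projection view in a few lines of numpy (my own toy example, not the video's notation): the dot product a·b is |a| times the length of b's projection onto a, which is all an attention score is measuring.

```python
import numpy as np

a = np.array([2.0, 0.0])   # think: a "query" vector
b = np.array([1.0, 1.0])   # think: a "key" vector

dot = a @ b                                              # 2.0
proj_of_b_onto_a = dot / np.linalg.norm(a)               # 1.0 -- how far b reaches along a
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))   # ~0.707

print(dot, proj_of_b_onto_a, cosine)
```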
The explanation is just that NNs are a stat fitting alg learning a conditional probability distribution, P(next_word|previous_words). Their weights are a model of this distribution. LLMs are a hardware innovation: they make it possible for GPUs to compute this at scale across TBs of data.
Why does 'mat' follow from 'the cat sat on the ...'? Because 'mat' is the most frequent continuation in the dataset, and the NN is a model of those frequencies.
Why is 'London in UK' "known" but 'London in France' isn't? Just because 'UK' occurs much more frequently in the dataset.
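Concretely, a toy sketch of that frequency framing (my own illustration with a made-up three-line corpus, not how a transformer is actually implemented):

```python
from collections import Counter, defaultdict

corpus = [
    "the cat sat on the mat",
    "the cat sat on the mat",
    "the cat sat on the sofa",
]

# tabulate which word follows each 5-word context
counts = defaultdict(Counter)
for line in corpus:
    words = line.split()
    for i in range(5, len(words)):
        counts[tuple(words[i-5:i])][words[i]] += 1

ctx = ("the", "cat", "sat", "on", "the")
total = sum(counts[ctx].values())
for word, c in counts[ctx].most_common():
    print(word, c / total)   # mat 0.67, sofa 0.33 -- "mat" wins because it's most frequent
```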
The algorithm isn't doing anything other than aligning computation to hardware; the computation isn't doing anything interesting. The value comes from the conditional probability structure in the data -- and that comes from people arranging words usefully, because they're communicating information with one another.
I think you're downplaying the importance of the attention/transformer architecture here. If it was "just" a matter of throwing compute at probabilities, then we wouldn't need any special architecture at all.
P(next_word|previous_words) is ridiculously hard to estimate in a way that is actually useful. Remember how bad text generation used to be before GPT? There is innovation in discovering an architecture that makes it possible to learn P(next_word|previous_words), in addition to the computing techniques and hardware improvements required to make it work.
Yes, it's really hard -- the innovation is aligning the really basic dot-product similarity mechanism to hardware. You can use basically any NN structure to do the same task; the issue is that they're untrainable at scale because they aren't parallelizable.
There is no innovation here in the sense of a brand new algorithm for modelling conditional probabilities -- the innovation is in adapting the algorithm for GPU training on text/etc.
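To be concrete about what I mean by adapting the algorithm to GPUs (a toy numpy sketch, not a real transformer): the all-pairs dot-product scores come out of a single matrix multiply, whereas a recurrent net has to walk the sequence one step at a time.

```python
import numpy as np

T, d = 512, 64                     # sequence length, embedding width
X = np.random.randn(T, d)

# attention-style: every pairwise similarity in one matmul -- trivially parallel on a GPU
scores = X @ X.T                   # shape (T, T)

# RNN-style: an inherently sequential loop; step t cannot start before step t-1
W = np.random.randn(d, d) * 0.01
h = np.zeros(d)
for t in range(T):
    h = np.tanh(W @ h + X[t])
```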
I don't know why you seem to have such a bone to pick with transformers but imo it's still interesting to learn about it, and reading your dismissively toned drivel of "just" and "simply" makes me tired. You're barking up the wrong tree man, what are you on about.
No issue with transformers -- the entire field of statistical learning, from decision trees to NNs, does the same thing... there's no mystery here. No person with any formal training in mathematical finance, applied statistics, hard experimental sciences on complex domains... etc. would be taken in here.
I'm trying my best to inform people who are interested in being informed, against an entire media ecosystem being played like a puppet-on-a-string by ad companies. The strategy of these companies is to exploit how easy it is to strap anthropomorphic interfaces over models of word frequencies and have everyone lose their minds.
Present the same models as a statistical dashboard, and few would be so adamant that their sci-fi fantasy is the reality.
Ironically, your best effort to inform people seems to be misinformed.
You're talking about a Markov model, not a language model with trained attention mechanisms. For a start, transformers can consider the entire context (which could be millions of tokens) rather than simple state to state probabilities.
No wonder you believe people are being 'taken in' and 'played by the ad companies'; your own understanding seems to be fundamentally misplaced.
I think they are accounting for the entire context; they specifically write out `P(next_word|previous_words)`.
So the "next_word" is conditioned on "previous_words" (plural), which I took to mean the joint distribution of all previous words.
But, I think even that's too reductive. The transformer is specifically not a function acting as some incredibly high-dimensional lookup table of token conditional probabilities. It's learning a (relatively) small number of parameters that compress those learned conditional probabilities into a radically lower-dimensional embedding.
Maybe you could describe this as a discriminative model of conditional probability, but at some point, we start describing that kind of information compression as semantic understanding, right?
It's reductive because it obscures just how complicated that `P(next_word|previous_words)` is, and it obscures the fact that "previous_words" is itself a carefully-constructed (tokenized & vectorized) representation of a huge amount of text. One individual "state" in this Markov-esque chain is on the order of an entire book, in the bigger models.
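Back-of-the-envelope, with made-up but representative numbers, to show the scale of that compression:

```python
import math

vocab = 50_000      # roughly GPT-2-scale vocabulary
context = 1_000     # a modest context window, in tokens

# a literal lookup table of P(next_word | previous_words) needs a row per possible context
table_rows_log10 = context * math.log10(vocab)   # ~4699, i.e. 10^4699 rows

# versus the parameters a large model actually learns (GPT-3-scale, ~1.75e11)
params_log10 = math.log10(175e9)                 # ~11.2

print(round(table_rows_log10), round(params_log10, 1))
```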
Do you have a blog or anything to follow?
I may start publishing academic papers in XAI as part of a PhD; if I do, I'll share somehow. The problem is the asymmetry of bullshit: the size of the paper necessary for academics to feel that claims have been evidenced is book-length for critique but 2pg for "novel contributions".
“There’s no mystery here”
Nobody’s claiming there’s ‘mystery’. Transformers are a well known, publicly documented architecture. This thread is about a video explaining exactly how they work - that they are a highly parallelizable approach that lends itself to scaling back propagation training.
“No person with … formal training … would be taken in here”
All of a sudden you’re accusing someone of perpetuating a fraud - I’m not sure who though. “Ad companies”?
Are you seriously claiming that there hasn’t been a qualitative improvement in the results of language generation tasks as a result of applying transformers in the large language model approach? Word frequencies turn out to be a powerful thing to model!
It’s ALL just hype, none of the work being done in the field has produced any value, and everyone should… use ‘statistical dashboards’ (whatever those are)?
Different models have different inductive biases. There is no way you could build GPT4 with decision trees.
Somebody's judgment weights need to be updated to include emoji embeddings.
No. This is blatantly false. The belief that recurrent models can't be scaled is untrue. People have recently trained Mamba with billions of parameters. The fundamental reason why transformers changed the field is that they are a lot more scalable context-length-wise, and LSTMs, LRUs etc. don't come close.
Sure, we're agreeing. I'm just being less specific.
Scalable as in loss wise scalable, not compute wise.
Yes, but pure Mamba doesn't perform as well as a transformer (and neither did LSTMs). This is why you see hybrid architectures like Jamba = Mamba + transformer. The ability to attend to specific tokens is really key, and that is what is lost in recurrent models, where the sequence history is munged into a single state.
That's my point. It doesn't perform in terms of loss, even though it performs well enough in terms of compute
This is only partially true. I wouldn't say you could use *any* NN architecture for sequence-to-sequence prediction. You either have to model them as a potentially infinite sequence with an RNN of some sort (e.g. LSTM), or, depending on the sequence type, model them as a hierarchy of sub-sequences, using something like a multi-layered convolution or transformer.
The transformer is certainly well suited to current massively parallel hardware architectures, and this was also a large part of the motivation for the design.
While the transformer isn't the only way to do seq-2-seq with neural nets, I think the reason it is so successful is more than simply being scalable and well matched to the available training hardware. Other techniques just don't work as well. From the mechanistic interpretability work that has been done so far, it seems that learnt "induction heads", utilizing the key-based attention, and layered architecture, are what give transformers their power.
Isn't that essentially what mjburgess said in the parent post?
Not really, and no. Torch and CUDA align computation to hardware.
If it were just a matter of doing that, we would be fine with fully-connected MLP. And maybe that would work with orders of magnitude more data and compute than we currently throw at these models. But we are already pushing the cutting edge of those things to get useful results out of the specialized architecture.
Choosing the right NN architecture is like feature engineering: the exact details don't matter that much, but getting the right overall structure can be the difference between learning a working model and failing to learn a working model, from the same source data with the same information content. Clearly our choice of inductive bias matters, and the transformer architecture is clearly an improvement over other designs.
Surely you wouldn't argue that a CNN is "just" aligning computation to hardware, right? Transformers are clearly showing themselves as a reliably effective model architecture for text in the same way that CNNs are reliably effective for images.
This is wrong, or at least a simplification to the point of removing any value.
They are trained to maximise this, yes.
That doesn't really follow, but let's leave that.
Here's the rub. If how you describe them is all they're doing then a sequence of never-before-seen words would have no valid response. All words would be equally likely. It would mean that a single brand new word would result in absolute gibberish following it as there's nothing to go on.
Let's try:
Input: I have one kjsdhlisrnj and I add another kjsdhlisrnj, tell me how many kjsdhlisrnj I now have.
Result: You now have two kjsdhlisrnj.
I would wager a solid amount that kjsdhlisrnj never appears in the input data. If it does, pick another one; it doesn't matter.
So we are learning something more general than the frequencies of sequences of tokens.
I always end up pointing to this but OthelloGPT is very interesting https://thegradient.pub/othello/
While it's trained on sequences of moves, what it does is more than just "sequence a,b,c is followed by d most often"
Any NN "trained on" data sampled from an abstract complete outcome space (eg., a game with formal rules; mathematical sequences, etc) can often represent that space completely. It comes down to whether you can form conditional probability models of the rules, and that's usually possible because that's what abstract rules are.
1. P(number-word|tell me how many...) > P(other-kinds-of-words|tell me how many...)
2. P(two|I have one ... I add another ...) > P(one|...) > P(three|...) > others
This is trivial.
Right, learning more abstract rules about how things work is the goal and where the value comes in. Not all algorithms are able to do this, even if they can do what you describe in your first comment.
That's why they're interesting, othellogpt is interesting because it builds a world model.
It builds a model of a "world" whose structure is conditional probabilities; this is circular. It's like saying you can use a lego model to build a model of another lego model. All the papers which "show" NNs building "world" models aren't using any world. It's lego modelling lego.
The lack of a world model only matters when the data NNs are trained on aren't valid measures of the world that data is taken to model. All the moves of a chess game are a complete model of chess. All the books ever written aren't a model of, well, anything -- the structure of the universe isn't the structure of text tokens.
The only reason all statistical algorithms, including NNs, appear to model the actual world is because patterns in data give this appearance: P(The Sun is Hot) > P(The Sun is Cold) -- there is no model of the sun here.
The reason P("The Sun is Hot") seems to model the sun, is because we can read the english words "sun" and "hot" -- it is we who think the machine which generates this text does so semantically.. but the people who wrote that phrase in the dataset did so; the machine is just generating "hot" because of that dataset.
Othellogpt is fed only moves and builds a model of the current board state in its activations. It never sees a board.
No, it's like using a description of piece placements and having a picture in mind of what the current board looks like.
The "board" is abstract. Any game of this sort is defined by a series of conditional probabilities:
{P(Pawn_on_square_blah|previous_moves) ... etc.}
What all statistical learning algorithms model is sets of conditional probabilities. So any stat alg is a model of a set of these rules... that's the "clay" of these models.
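For instance, a toy game of this sort (my own minimal example, not Othello): the whole game just is a map from move histories to legal continuations, which is exactly the kind of conditional structure a stat alg estimates.

```python
# toy 1-D "game": a piece on squares 0..3 moves one step left (-1) or right (+1) per turn
def legal_moves(history, start=0, size=4):
    pos = start + sum(history)      # the "board state" is fully determined by the history
    moves = []
    if pos + 1 < size:
        moves.append(+1)
    if pos - 1 >= 0:
        moves.append(-1)
    return moves

# P(move | history) > 0 exactly when the move is legal -- that's the entire game
print(legal_moves([]))              # [1]      (at square 0: only right is legal)
print(legal_moves([+1, +1, +1]))    # [-1]     (at square 3: only left is legal)
print(legal_moves([+1]))            # [1, -1]  (at square 1: both are legal)
```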
The problem is the physical world isn't anything like this. The reason I say, "I liked that TV show" is because I had a series of mental states caused by the TV show over time (and so on). This isn't representable as a set of conditional probs in the same way.
You could imagine, at the end of history, there being a total set of all possible conditional probabilities: P(I liked show|my_mental_states, time, person, location, etc.) -- this would be uncomputable, but it could be supposed.
If you had that dataset then yes, NNs would learn the entire structure of the world, because that's the dataset. The problem is that the world cannot be represented in this fashion, not that NNs couldn't model it if it could be. A decision tree could.
P(I liked the TV show) doesn't follow from any dataset ever collected. It follows from my mental states. So no NN can ever model it. They can model frequency associations of these phrases in historical text documents: this isn't a model of the world.
That would always be 1 or 0, but also that data is not fed into othellogpt. That is not the dataset. It is not fed in board states at all.
It learns it, but it is not the dataset.
It is the dataset. When you're dealing with abstract objects (ie., mathematical spaces), they are all isomorphic.
It doesn't matter if you "feed in" 1+1+1+1 or 2+2 or sqrt(16).
The rules of chess are encoded either as explicit rules or by contrast classes of valid/invalid games. These are equivalent formulations.
When you're dealing with text tokens it does matter whether "Hot" frequently follows "The Sun is...", because reality isn't an abstract space, and text tokens aren't measures of it.
No. A series of moves alone provides strictly less information than a board state or state + list of rules.
If the NN learns the game, that is itself an existence proof of the opposite, (by obvious information-theoretic arguments).
Training is supervised, so you don't need bare sets of moves to encode the rules; you just need a way of subsetting the space into contrast classes of valid/invalid.
It's a lie to say the "data" is the moves; the data is the full outcome space: ({legal moves}, {illegal moves}), where the moves are indexed by the board structure (necessarily, since moves are defined by the board structure -- it's an abstract game). So there are two deceptions here: (1) supervision structures the training space; and (2) the individual training rows have sequential structure which maps to board structure.
Complete information about the game is provided to the NN.
But let's be clear: othellogpt still generates illegal moves -- showing that it does not learn the binary conditional structure of the actual game.
The deceptiveness of training a NN on a game whose rules are conditional probability structures and then claiming the very-good-quality conditional probability structures it finds are "World Models" is... maddening.
This is all just fraud to me; frauds dressing up other frauds in transparent clothing. LLMs trained on the internet are being sold as approximating the actual world, not 8x8 boardgames. I have nothing polite to say about any of this
There is nothing about illegal moves provided to othellogpt as far as I'm aware.
That is not true. Where is the information that there are two players provided? Or that there are two colours? Or how the colours change? Where is the information about invalid moves provided?
Not perfectly, no. But that's not at all required for my point, though it is relevant if you try to use the fact it learns to play the game as proof that moves provide all information about legal board states.
How do you think the moves are represented?
All abstract games of this sort are just sequences of bit patterns, each pattern related to the full legal space by a conditional probability structure (or, equivalently, as set ratios).
Strip away all the NN b/s and anthropomorphic language and just represent it to yourself using bit sets.
Then ask: how hard is it to approximate the space from which these bit sets are drawn using arbitrarily deep conditional probability structures?
it's trivial
the problem the author sets up about causal structures in the world cannot be represented as a finite sample of bit-set sequences -- and even if it could, that isn't the data being used
the author hasn't understood the basics of what the 'world model' problem even is
how does it work underneath?
"kjsdhlisrnj" is in the context, it gets tokenized, and now when the LLM is asked to predict/generate next-token sequences somehow "kjsdhlisrnj" is there too. it learns patterns. okay sure, they ger encoded somehow, but during infernce how does this lead to application of a recalled pattern on the right token(s)?
also, can it invent new words?
You are speaking more about n-gram models here. NNs do far more than that.
Or if you just want to say that NNs are used as a statistical model here: Well, yea, but that doesn't really tell you anything. Everything can be a statistical model.
E.g., you could also say "this is exactly the way the human brain works", but that doesn't really tell you anything about how it really works.
My description is true of any statistical learning algorithm.
The thing that people are looking to for answers, the NN itself, does not have them. That's like looking to Newton's compass to understand his general law of gravitation.
The reason that LLMs trained on the internet and every ebook have the structure of human communication is because the dataset has that structure. Why does the data have that structure? That requires science; there is no explanation "in the compass".
NNs are statistical models trained on data -- drawing analogies to animals is a mystification that causes people's ability to think clearly to jump out the window. No one compares stock price models to the human brain; no banking regulator says, "well, your volatility estimates were off because your machines had the wrong thoughts". This is pseudoscience.
Animals are not statistical learning algorithms, so the reason that's uninformative is because it's false. Animals are in direct causal contact with the world and uncover its structure through interventional action and counterfactual reasoning. The structure of animal bodies and the general learning strategies are well known, and have nothing to do with LLMs/NNs.
The reason that I know "The cup is in my hand" is not because P("The cup is in my hand"|HistoricalTexts) > P(not "The cup is in my hand"|HistoricalTexts)
I mostly agree with your points, but I still disagree with this premise. Humans (and other animals) absolutely are statistical reasoning machines. They're just advanced ones which can process more than "text" - they're multi-modal.
As a super dumb-simple set of examples: Think about the origin of the phrase "Cargo Cult" and similar religious activities - people will absolutely draw conclusions about the world based on their learned observations. Intellectual "reasoning" (science!) really just relies on more probabilities or correlations.
The reason you know the cup is in your hand is because P("I see a cup and a hand"|HistoryOfEyesight) + P("I feel a cylinder shape"|HistoryOfTactileFeeling) + .... > P(Inverse). You can pretend it's because humans are intelligent beings with deep reasoning skills (not trying to challenge your smarts here!), but humans learn through trial and error just like a NN with reinforcement learning.
Close your eyes and ask a person to randomly place either a cup from your kitchen in your hand or a different object. You can probably tell which one it is. Why? Because you have learned what it feels like, from countless examples of different cups over years of passive practice. That's basically deep learning.
I mean something specific by "statistics": modelling frequency associations in static ensembles of data.
Having a body which changes over time that interacts with a world that changes over time makes animal learning not statistical (call it, say, experimental). That animals fall into skinner-box irrational behaviour can be modelled as a kind of statistical learning, but it actually isn't.
It's a failure of ecological salience mechanisms in regulating the "experimental learning" that animals engage in. E.g., with the cargo cults, the reason they adopted that view was because their society had a "big man" value system based on material acquisition, and the western warring powers seemed Very Big and so were humiliating. In order to retain their status they adopted (apparently irrational) theories of how the world worked (gods, etc.).
From the outside this process might seem statistical, but it's the opposite. Their value system made material wealth have a different causal salience which was useful in their original ecology (a small island with small resources), but it went haywire when faced with the whole world.
Eventually these mechanisms update with this new information, or the tribe dies off -- but what's going wrong here is that the very very non-statistical learning ends up describable that way.
This is, indeed, why we should be very concerned about people skinner-boxing themselves with LLMs.
The "experiment" of life is what defines the statistical values! Experimentation is just learning what the statistical output of something is.
If I hand you a few dice, you'd probably be able to guess the statistical probability of every number for given roll. Because you've learned that through years of observation building a mental model. If I hand you a weighted die, suddenly your mental model is gone, and you can re-learn experimentally by rolling it a bunch. How can you explain experimental learning except "statistically"?
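Concretely, something like this toy sketch: re-learning the weighted die is nothing more than tallying observed frequencies.

```python
import random
random.seed(0)

faces   = [1, 2, 3, 4, 5, 6]
weights = [1, 1, 1, 1, 1, 5]   # a die loaded towards 6

rolls = random.choices(faces, weights=weights, k=10_000)

# the "updated mental model" is just the observed frequency of each face
for face in faces:
    print(face, rolls.count(face) / len(rolls))
```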
They can be wrong without being irrational. Building an airport doesn't make planes show up, but planes won't show up without an airport. If you're an island nation with little understanding of the global geopolitical environment of WWII, you'd have no idea why planes started showing up on your island, but they keep showing up, and only at an airport. It seems rational to assume they'd continue showing up to airports.
What is it if not statistical?
Also, skinner boxes are, in a way, perfectly rational. There's no way to understand the environment, and if pushing a button feeds you, then rationally you should push the button when hungry. Humans like to think we're smart because we've invented deductive reasoning, and we quote "correlation is not causation" as if we're not just learning to predict the world around us from past experiences.
For dice, the ensemble average is the time average: if you roll the die 1000 times, the distribution you're sampling from doesn't change.
For almost everything in the world, acting on it changes it. There are vanishingly few areas where this isn't the case (most physics, most chemistry, etc.).
Imagine trying to do statistics but every time you sampled from reality the distribution of your sample changes not due to randomness, but because reality has changed. Now, can you do statistics? No.
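As a toy sketch of that (made-up numbers, mine): estimate a fixed coin's bias and you converge; let each observation perturb the coin and the tally you collect stops being an estimate of anything stable.

```python
import random
random.seed(0)

# stationary: the bias never changes, so the observed frequency converges to it
p = 0.3
flips = [random.random() < p for _ in range(10_000)]
print(sum(flips) / len(flips))     # ~0.3

# non-stationary: every act of sampling nudges the system you're measuring
p = 0.3
flips = []
for _ in range(10_000):
    x = random.random() < p
    flips.append(x)
    p = min(1.0, p + 0.001) if x else max(0.0, p - 0.001)
print(sum(flips) / len(flips))     # no longer an estimate of the original 0.3
```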
It makes all the difference in the world to have a body and hold the thing you're studying. Statistics is trying to guess the shape of the ice cube from the puddle; animal learning is making ice cubes.
RL is doing just this, simulating an environment. And we can have an agent "learn" in that environment.
I think tying learning to a body is too restrictive.
You strongly rely on the assumption that "something else" generates the statistics we observe, but scientifically, there exists little evidence whether that "something else" exists (see eg the Bayesian brain).
If learning in real life over 5-20 years shows the same result as an LLM being trained on billions of tokens, then yes, it can be compared.
And there are a lot of people out there who don't do a lot of reasoning.
After all optical illusions exist, our brain generalizes.
The same thing happens with words, like the riddle about the doctor operating on a child where we discover that the doctor is actually a woman.
And while llms only use text, we can already see how multimodal models become better, architecture gets better and hardware too.
I don't know what your motivation in comparison is; mine is science, ie., explanation.
I'm not interested that your best friend emits the same words in the same order as an LLM; I'm more interested that he does so because he enjoys your company, whereas the LLM does not.
Engineers overstep their mission when they assume that because you can substitute one thing for another, and sell a product in doing so, that this is informative. It isn't. I'm not interested in whether you can replace the sky with a skybox and have no one notice -- who cares? What might fool an ape is everything, and what that matters for science is nothing.
My thinking is highly influenced by brain research.
We're not just talking about an LLM; we're talking about a neural network architecture.
There is a direct link to us (neural networks)
I'm always puzzled by such assertions. A cursory look at the technical aspects of an iterated attention - perceptron transformation clearly shows it's just a convoluted and powerful way to query the training data, a "fancy" Markov chain. The only rationality it can exhibit is that which is already embedded in the dataset. If trained on nonsensical data it would generate nonsense and if trained with a partially non-sensical dataset it will generate an average between truth and nonsense that maximizes some abstract algorithmic goal.
There is no knowledge generation going on, no rational examination of the dataset through the lens of an internal model of reality that allows the rejection of invalid premises. The intellectual food is already chewed and digested in the form of the training weights, with the model just mechanically extracting the nutrients, as opposed to venturing into the outside world to hunt.
So if it works "just like the human brain", it does so in a very remote sense, just like a basic neural net works "just like the human brain", i.e. individual biological neurons can be said to be somewhat similar.
If a human spends the first 30 years of their life in a cult they will be also speaking nonsense a lot - from our point of view.
Sure, we have a nice inner loop, we do some pruning, picking and choosing, updating, weighting things based on emotions, goals, etc.
Who knows how complicated those things will prove to model/implement...
That's not really an explanation that tells people all that much, though.
I can explain that car engines 'just' convert gasoline into forward motion. But if the person hearing the explanation is hoping to learn what a cam belt or a gearbox is, or why cars are more reliable now than they were in the 1970s, or what premium gas is for, or whether helicopter engines work on the same principle - they're going to need a more detailed explanation.
It explains the LLM/NN. If you want to explain why it emits words in a certain order you need to explain how reality generated the dataset, ie., you need to explain how people communicate (and so on).
There is no mystery why an NN trained on the night sky would generate nightsky-like photos; the mystery is why those photos have those patterns... solving that is called astrophysics.
Why do people, in reasoning through physics problems, write symbols in a certain order? Well, explain physics, reasoning, mathematical notation, and so on. The ordering of the symbols gives rise to a certain utility in imitating that order -- but it isn't explained by that order. That's circular: "LLMs generate text in the order they do, because that's the order of the text they were given".
That leaves loads of stuff unexplained.
If the LLM is capable of rewording the MIT license into a set of hard-hitting rap battle lyrics, but the training dataset didn't contain any examples of anyone doing that, is the LLM therefore capable of producing output beyond the limits of its training data set?
Is an LLM inherently constrained to mediocrity? If an LLM were writing a novel, does its design force it to produce cliche characters and predictable plotlines? If applied in science, are they inherently incapable of advancing the boundaries of human knowledge?
Why transformers instead of, say, LSTMs?
Must attention be multi-headed? Why can't the model have a simpler architecture, allowing such implementation details to emerge from the training data?
Must they be so big that leading performance is only in the hands of multi-billion-dollar corporations?
What's going on with language handling? Are facts learned in an abstract enough way that they can cross language barriers? Should a model produce different statements of fact when questioned in different languages? Does France need a French-language LLM?
Is it reasonable to expect models to perform basic arithmetic accurately? What about summarising long documents?
Why is it that I can ask questions with misspellings, but get answers with largely correct spelling? If misspellings were in the training data, why aren't they in the output? Does the cleverness that stops LLMs from learning misspellings from the training data also stop them from learning other common mistakes?
If LLMs can be trained to be polite despite having examples of impoliteness in their training data, can they also be trained to not be racist, despite having examples of racism in their training data?
Can a model learn a fact that is very rarely present in the training data - like an interesting result in an obscure academic paper? Or must a fact be widely known and oft-repeated in order to be learned?
Merely saying "it predicts the next word" doesn't really explain much at all.
Which conditional probability sequences can be exploited for engineering utility cannot be known ahead of time; nor is it explained by the NN. It's explained by investigating how the data was created by people.
Train a NN to generate pictures of the nightsky: which can be used for navigation? Who knows, ahead of time. The only way of knowing is to have an explanation of how the solar system works and then check the pictures are accurate enough.
The NN which generates photos of the nightsky has nothing in it that explains the solar system, nor does any aspect of an NN model the solar system. The photos it was trained on happened to have their pixels arranged in that order.
Why those arrangements occur is explained by astrophysics.
If you want to understand what ChatGPT can do, you need to ask OpenAI for their training data and then perform scientific investigations of its structure and how that structure came to be.
Talking in terms of the NN model is propaganda and pseudoscience: the NN didn't arrange the pixels, gravity did. Likewise, the NN isn't arranging rap lyrics in that order because it's rapping: singers are.
There is no actual mystery here. It's just that we are prevented from accessing the data by OpenAI, and struggle to explain the reality which generated that data -- which requires years of actual science.
It has a lot of things already encoded regarding the solar system, but it cannot really access it, it cannot - as far as I know - run functions on its own internal encoded data, right? If it does something like that, it's because it learned that higher-level pattern based on training data.
The problem with NN arrangements in general is that we don't know if it's actually pulling out some exact training data (or a useful so-far-unseen pattern from the data!) or it's some distorted confabulation. (Clever Hans all over again.) If I ask ChatGPT to code me a nodeJS IMAP backup program it does, but the package it gleefully imports/require()s is made up.
And while the typical artsy arts have loose rules, where making up new shit based on what people wish for is basically the only one, in other contexts that's a hard no-no.
People specifically would like to know what the attention calculations add to this learning of the distribution
Just speculating, but I think attention enables differentiation of the semantic concepts a word or sentence expresses within a particular context. For any total set of training data you have a smaller number of semantic concepts (say you have 10,000 words, it might contain 2,000 semantic concepts; those concepts are defined by the sentence structure and surrounding words, which is why they have a particular meaning), and attention allows the model to differentiate those contexts at different levels (words, etc.). Also, the fact that you can do this attention at runtime/inference means you can generate the context from the prompt, which enables the flexibility of variable prompt/variable output -- but you lose the precision of giving an exact prompt and getting an exact answer.
I'm not one to whine about downvotes but I just have to say, it's a bad feeling when I can't even respond to the negative feedback because there is no accompanying comment. Did I misinterpret something? Did you? Who will ever know when there is no information. :L
I find this take super weak sauce and shallow.
This recent $10,000 challenge is super super interesting imho. https://twitter.com/VictorTaelin/status/1778100581837480178
State of the art models are doing more than “just” predicting the probability of the next symbol.
You underestimate the properties of the sequential-conditional structure of human communication.
Consider how a clever 6yo could fake being a physicist with access to a library of physics textbooks and a shredder. All the work is done for them. You'd need to be a physicist to spot them faking it.
Of course, LLMs are in a much better position than having shredded physics textbooks -- they have shreddings of all books. So you actually have to try to expose this process, rather than just gullibly prompt using confirmation bias. It's trivial to show they work this way, both formally and practically.
The issue is, practically, gullible people aren't trying.
It isn't some kind of Markov chain situation. Attention cross-links the abstract meaning of words, subtle implications based on context and so on.
So, "mat" follows "the cat sat on the" where we understand the entire worldview of the dataset used for training; not just the next-word probability based on one or more previous words ... it's based on all previous meaning probability, and those meaning probablility and so on.
What about cases that are not present in the dataset?
The model must be doing something besides storing raw probabilities to avoid overfitting and enable generalization (imagine that you could have a very performant model - when it works - but it sometimes would spew "Invalid input, this was not in the dataset so I don't have a conditional probability and I will bail out")
You're confidently incorrect by oversimplifying all LLMs to a base model performing a completion from a trivial context of 5 words.
This is tantamount to a straw man. Not only do few people use untuned base models, it completely ignores in-context learning that allows the model to build complex semantic structures from the relationships learnt from its training data.
Unlike base models, instruct and chat fine-tuning teaches models to 'reason' (or rather, perform semantic calculations in abstract latent spaces) with their "conditional probability structure", as you call it, to varying extents. The model must learn to use its 'facts', understand semantics, and perform abstractions in order to follow arbitrary instructions.
You're also conflating the training metric of "predicting tokens" with the mechanisms required to satisfy this metric for complex instructions. It's like saying "animals are just performing survival of the fittest". While technically correct, complex behaviours evolve to satisfy this 'survival' metric.
You could argue they're "just stitching together phrases", but then you would be varying degrees of wrong:
For one, this assumes phrases are compressed into semantically addressable units, which is already a form of abstraction ripe for allowing reasoning beyond 'stochastic parroting'.
For two, it's well known that the first layers perform basic structural analysis such as grammar, and later layers perform increasing levels of abstract processing.
For three, it shows a lack of understanding in how transformers perform semantic computation in-context from the relationships learnt by the feed-forward layers. If you're genuinely interested in understanding the computation model of transformers and how attention can perform semantic computation, take a look here: https://srush.github.io/raspy/
For a practical example of 'understanding' (to use the term loosely), give an instruct/chat tuned model the text of an article and ask it something like "What questions should this article answer, but doesn't?" This requires not just extracting phrases from a source, but understanding the context of the article on several levels, then reasoning about what the context is not asserting. Even comparatively simple 4x7B MoE models are able to do this effectively.
My mental justification for attention has always been that the output of the transformer is a sequence of new token vectors such that each individual output token vector incorporates contextual information from the surrounding input token vectors. I know it's incomplete, but it's better than nothing at all.
You're effectively steering the predictions based on adjacent vectors (and precursors from the prompt). That mental model works fine.
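In code, that mental model is pretty much the literal computation (a toy single-head sketch with no learned projections, so not a faithful transformer layer):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

T, d = 5, 8
X = np.random.randn(T, d)          # input token vectors

scores  = X @ X.T / np.sqrt(d)     # how much each token attends to every other token
weights = softmax(scores)          # each row sums to 1
Y = weights @ X                    # each output vector is a context-weighted mix of the inputs

print(Y.shape)                     # (5, 8): one new, contextualised vector per input token
```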
I thought the general consensus was: "transformers allow neural networks to have adaptive weights".
As opposed to the previous architectures, where every edge connecting two neurons always has the same weight.
EDIT: a good video, where it's actually explained better: https://youtu.be/OFS90-FX6pg?t=750&si=A_HrX1P3TEfFvLay
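One way to see the "adaptive weights" point concretely (my own toy sketch, not the video's notation): a plain linear layer applies the same fixed matrix to every input, while the attention mixing weights are recomputed from each input at inference time.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

T, d = 4, 8
W  = np.random.randn(d, d)         # fixed after training: identical for every input
X1 = np.random.randn(T, d)
X2 = np.random.randn(T, d)

out1, out2 = X1 @ W, X2 @ W        # fixed weights: the same W mixes features for both inputs

A1 = softmax(X1 @ X1.T / np.sqrt(d))   # "adaptive" weights: the token-mixing matrix
A2 = softmax(X2 @ X2.T / np.sqrt(d))   # is computed from the input itself
print(np.allclose(A1, A2))         # False -- different inputs get different effective weights
```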