
Beyond self-attention: How a small language model predicts the next token

danielmarkbruce
29 replies
20h13m

This is a weird post. "What the transformer is actually doing"? You can just follow the code and see what it's doing. It's not doing something more or less than that. It's not doing some other thing.

gjm11
15 replies
19h55m

The post is long and complicated and I haven't read most of it, so whether it's actually any good I shan't try to decide. But the above seems like a very weird argument.

Sure, the code is doing what it's doing. But trying to understand it at that level of abstraction seems ... not at all promising.

Consider a question about psychology. Say: "What are people doing when they decide what to buy in a shop?".

If someone writes an article about this, drawing on some (necessarily simplified) model of human thinking and decision-making, and some experimental evidence about how people's purchasing decisions change in response to changes in price, different lighting conditions, mood, etc., ... would you say "You can just apply the laws of physics and see what the people are doing. They're not doing something more or less than that."?

I mean, it would be true. People, so far as we know, do in fact obey the laws of physics. You could, in principle, predict what someone will buy in a given situation by modelling their body and surroundings at the level of atoms or thereabouts (quantum physics is a thing, of course, but it seems likely that a basically-classical model could be good enough for this purpose). When we make decisions, we are obeying the laws of physics and not doing some other thing.

But this answer is completely useless for actually understanding what we do. If you're wondering "what would happen if the price were ten cents higher?" you've got no way to answer it other than running the whole simulation again. Maybe running thousands of versions of it since other factors could affect the results. If you're wondering "does the lighting make a difference, and what level of lighting in the shop will lead to people spending least or most?" then you've got no way to answer it other than running simulations with many different lighting conditions.

Whereas if you have a higher-level, less precise model that says things like "people mostly prefer to spend less" and "people try to predict quality on the basis of price, so sometimes they will spend more if it seems like they're getting something better that way" and "people like to feel that they're getting a bargain" and so on, you may be able to make predictions without running an impossibly detailed person-simulation zillions of times. You may be able to give general advice to someone with a spending problem who'd like to spend more wisely, or to a shopkeeper who wants to encourage their customers to spend more.

Similarly with language models and similar systems. Sure, you can find out what it does in some very specific situation by just running the code. But what if you have some broader question than that? Then simply knowing what the code does may not help you at all, because what the code does is gazillions of copies of "multiply these numbers together and add them".

Again, I make no claim about whether the particular thing linked here offers much real insight. But it makes zero sense, so far as I can see, to dismiss it on the grounds that all you need to do is read the code.

danielmarkbruce
7 replies
18h34m

It is very promising. In fact, in industry there are jokes about how getting rid of linguists has helped language modeling.

Trying to understand it at some level of abstraction that humans can fit in their head has been a dead end.

knightoffaith
6 replies
16h29m

Trying to build systems top-down using principles humans can fit in their head has arguably been a dead end. But this doesn't mean that we cannot try to understand parts of current AI systems at a higher level of abstraction, right? They may not have been designed top-down with human-understandable principles, but that doesn't mean that trained, human-understandable principles couldn't have emerged organically from the training process.

Evolution optimized the human brain to do things over an unbelievably long period of time. Human brains were not designed top-down with human-understandable principles. But neuroscientists, cognitive scientists, and psychologists have arguably had success with understanding the brain partially at a higher level of abstraction than just neurons, or just saying "evolution optimized these clumps of matter for spreading genes; there's nothing more to say". What do you think is the relevant difference between the human brain and current machine learning models that makes the latter just utterly incomprehensible at any higher level of abstraction, but the former worth pursuing by means of different scientific fields?

danielmarkbruce
5 replies
15h53m

I don't know neuroscience at all, so I don't know if that's a good analogy. I'll make a guess though: consider a standard RAG application. That's a system which uses at least a couple of models. A person might reasonably say "the embeddings in the db are where the system stores memories. The LLM acts as the part of the brain that reasons over whatever is in working memory plus its sort of implicit knowledge." I'd argue that's reasonable. But systems and models are different things.

People use many abstractions in AI/ML. Just look at all the functionality you get in PyTorch as an example. But they are abstractions of pieces of a model, or pieces of the training process etc. They aren't abstractions of the function the model is trying to learn.

knightoffaith
4 replies
15h37m

Right, I've used PyTorch before. I'm just trying to understand why the question of "how does a transformer work?" is only meaningfully answered by describing the mechanisms of self-attention layers, treating that as the highest admissible level of abstraction, with anything higher being nonsense. More specifically, why we should have a ban on any higher level of abstraction in this scenario when we can answer the question of "how does the human mind work?" not just at the atom level, but also at the neuroscientific or psychological level. Presumably you could say the same thing about that question: the human mind is a bunch of atoms obeying the laws of physics. That's what it's doing. It's not something else.

I understand you're emphasizing the point that the connectionist paradigm has had a lot more empirical success than the computationalist paradigm - letting AI systems learn organically, bottom-up, is more effective than trying to impose human mind-like principles top-down when we design them. But I don't understand why this means understanding bottom-up systems at higher levels of abstraction is necessarily impossible when we have a clear example of a bottom-up system that we've had some success in understanding at a high level of abstraction, viz. the human mind.

danielmarkbruce
3 replies
14h51m

It would be great if such higher-level explanations were good, but they seem to be bad; it seems that they must be bad given the dimensionality of the space, and humans latch onto simple explanations even when they are bad.

Think about MoE models. Each expert learns to be good at completing certain types of inputs. It sounds like a great explanation of how they work. Except it doesn't seem to actually work that way. The Mixtral paper showed that the activated routes seemed to follow basically no pattern. Maybe if they trained it differently it would? Who knows. It certainly isn't a good name regardless.

Many fields/things can be understood at higher and higher levels of abstraction. Computer science is full of good high level abstractions. Humans love it. It doesn't work everywhere.

knightoffaith
2 replies
14h31m

Right, of course we should validate explanations based on empirical data. We rejected the idea that there was a particular neuron that activated only when you saw your grandmother (the "grandmother neuron") after experimentation. But just because explanations have been bad doesn't mean that all future explanations must also be bad. Shouldn't we evaluate explanations on a case-by-case basis instead of dismissing them as impossible? Aren't we better off having evaluated the intuitive explanation for mixtures of experts instead of dismissing it a priori? There's a whole field - mechanistic interpretability - where researchers are working on this kind of thing. Do you think that they simply haven't realized that the models they're working on interpreting are operating in a high-dimensional space?

danielmarkbruce
1 replies
13h54m

Mechanistic interpretability studies a bunch of things though. Like, the Mixtral paper where they show the routing activations is mechanistic interpretability. That sort of feature-visualization stuff is good. I don't know what % of the field is spending their time trying to interpret the models at a higher level, in a "here's an explanation a human can hold that approximates the following code" kind of way, though. I'm certainly not the only one who thinks that's a waste of time; I don't believe anything I've said in this thread is original in any way.

I... don't know if the people involved in that specific stuff have really grokked that they are working in a high-dimensional space? A lot of otherwise smart people work in macroeconomics, where for decades they haven't really made any progress because it's so complex. It seems stupid to suggest a whole field of smart people don't realize what they are up against, but sheesh, it kinda seems that way, doesn't it? Maybe I'll be eating my words in 10 years.

knightoffaith
0 replies
13h31m

They certainly understand they're working in a high dimensional space. No question. What they deny is that this necessarily means the goal of interpretability is a futile one.

But the main thrust of what I'm saying is that we shouldn't be dismissing explanations a priori - answers to "how does a transformer work?" that go beyond descriptions of self-attention aren't necessarily nonsensical. You can think it's a waste of time (...frankly, I kind of think it's a waste of time too...), but just like any other field, it's not really fair to close our eyes and ears and dismiss proposals out of hand. I suppose "Maybe I'll be eating my words in 10 years" indicates you understand this, though.

xanderlewis
6 replies
19h43m

You’re spot on; it’s like saying you can understand the game of chess by simply reading the rules. In a certain very superficial sense, yes. But the universe isn’t so simple. It’s the same reason that even a perfect understanding of what goes on at the level of subatomic particles isn’t thought to be enough to say we ‘understand the universe’. A hell of a lot can happen in between the setting out of some basic rules and the much higher-level end result.

danielmarkbruce
5 replies
18h31m

And yet... AlphaZero.

xanderlewis
3 replies
18h27m

My entire point is that implementation isn’t sufficient for understanding. Alpha Zero is the perfect example of that; you can create an amazing chess playing machine and (potentially) learn nothing at all about how to play chess.

…so what’s your point? I’m not getting it from those two words.

danielmarkbruce
2 replies
18h3m

Understanding how the machine plays or how you should play? They aren't the same thing. And that is the point - trying to analogize to some explicit, concrete function you can describe is backwards. These models are gigantic (even the 'small' ones); they minimize a loss function by searching a space with many thousands of dimensions. It is the very opposite of something that fits in a human brain in any explicit fashion.

gjm11
1 replies
15h34m

So is what happens in an actual literal human brain.

And yet, we spend quite a lot of our time thinking about what human brains do, and sometimes it's pretty useful.

For a lot of this, we treat the actual brain as a black box and don't particularly care about how it does what it does, but knowing something about the internal workings at various levels of abstraction is useful too.

Similarly, if for whatever reason you are interested in, or spend some of your time interacting with, transformer-based language models, then you might want some intuition for what they do and how.

You'll never fit the whole thing in your brain. That's why you want simplified abstracted versions of it. Which, AIUI, is one thing that the OP is trying to do. (As I said before, I don't know how well it does it; what I'm objecting to is the idea that trying to do this is a waste of time because the only thing there is to know is that the model does what the code says it does.)

danielmarkbruce
0 replies
14h42m

Sure, good abstractions are good. But bad abstractions are worse than none. Think of all the nonsense abstractions about the weather before people understood and could simulate the underlying process. No one in modern weather forecasting suggests there is a way to understand that process at some high level of abstraction. Understand the low level, run the calcs.

FeepingCreature
0 replies
2h25m

Alpha Zero didn't read the rules; it trained within the universe of the rules for 44 million games.

nl
6 replies
19h59m

A walkthrough of what the data at each point looks like is actually pretty useful.

danielmarkbruce
5 replies
18h41m

Sure, it is. But trying to explain it as though the weights have some goal is weird. They aren't trying to do anything. You have a loss function. The optimizer keeps moving weights around in an attempt to minimize the loss function. It's not more or less than that.
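
Roughly, in PyTorch-style pseudocode (toy model and made-up names, nothing from the original post; `data_loader` is assumed to exist):

  import torch

  # Toy stand-in for a transformer; the point is only the shape of the loop.
  model = torch.nn.Linear(16, 16)
  optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
  loss_fn = torch.nn.CrossEntropyLoss()

  for inputs, targets in data_loader:     # data_loader assumed to exist
      logits = model(inputs)              # forward pass
      loss = loss_fn(logits, targets)     # scalar measure of how wrong the outputs are
      optimizer.zero_grad()
      loss.backward()                     # gradients of the loss w.r.t. every weight
      optimizer.step()                    # nudge the weights to reduce the loss

The weights never "want" anything; the update rule above does all the work.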

MarkusQ
4 replies
16h58m

This is just wrong.

First of all, you're rejecting teleological anthropomorphizing (saying it's weird to act as if "the weights have some goal") but then in the very next line you talk about the optimizer making "an attempt" to accomplish a goal. All of which misses the point, since the question is about _explanations_ not goals and intentions.

Then you reject out of hand any other level of explanation than the one you favor, saying "it's not more or less than that" when in fact it is both more and less than that; you can climb the ladder of abstraction either way to build more or less abstract explanations. We can dig down and talk about how the optimizer adjusts weights, or how tensor math works and how it's used in this case, or about how GPUs work, or gates, or transistors, etc. Or we could climb up and talk (as this article does) about what attention heads do, and why they work, when they work, when they don't, etc.

danielmarkbruce
3 replies
16h52m

The optimizer has a goal. The weights in the model do not. The optimizer isn't the model. There is no contradiction if you know how it works.

Climbing the ladder of abstraction from model weights doesn't seem to work in this field. Just saying it's so doesn't make it so.

MarkusQ
2 replies
16h46m

Just saying it's so doesn't make it so.

Does that apply to you as well?

danielmarkbruce
1 replies
16h37m

Of course. You shouldn't take my word for it. You can learn the basics of AI/ML from a number of good texts. Simon Prince just released a very approachable text, although it doesn't cover much in the way of history to see the move to "more data/more compute, less human-led abstraction". I think Norvig's book covers that but I haven't read the latest version.

MarkusQ
0 replies
16h23m

You sure like to make assumptions, don't you? :)

drdeca
2 replies
19h50m

Understanding how a given CPU (+ the other computer hardware) works, does not suffice to understand what is going on when a particular program is running. For that, you need to either read the program, or an execution trace, or both, or something along these lines, which is specific to the program being run.

danielmarkbruce
1 replies
18h39m

This is the wrong analogy. The transformer block is a bunch of code and weights. It's a set of instructions laying out which numbers to run which operations on. The optimizer changes weights to minimize a loss function during training and then the code implementing a forward pass just runs during inference. That's what it is doing. It's not doing something else.

If the argument is that a model is a function approximator, then it certainly isn't approximating some function that performs worse at the task at hand, and it certainly isn't approximating a function we can describe in a few hundred words.

FeepingCreature
0 replies
10h52m

We have no reason at all to be certain of the latter.

richardw
1 replies
18h36m

I'll go with an example to demonstrate why that's not always enough. Many people are quite keen to know what this (for example) is actually doing:

  // Quake III's fast approximate 1/sqrt(x)
  float InvSqrt(float x){
      float xhalf = 0.5f * x;
      int i = *(int*)&x;            // reinterpret the float's bits as an integer
      i = 0x5f3759df - (i >> 1);    // magic constant gives a rough first guess
      x = *(float*)&i;              // reinterpret back as a float
      x = x*(1.5f - xhalf*x*x);     // one Newton-Raphson refinement step
      return x;
  }
From https://betterexplained.com/articles/understanding-quakes-fa...

In my case I don't have a huge amount of time to chase down every rabbit hole, but I'd love to accelerate intuition for LLMs. Multiple points of intuition or comparison really help. I'm also not a Python expert - what you see and what I see from a line of code will be quite different.

danielmarkbruce
0 replies
18h10m

The author is attempting to build an explicit mental model of what a bunch of weights are "doing". It's not really the same thing. They are minimizing the loss function.

People try to (and often do) generate intuition for architectures that will work given the layout of the data. But the reason models are so big now is that trying to understand what the model is "doing" in a way humans understand didn't work out so well.

magicalhippo
0 replies
2h26m

You can just follow the code and see what it's doing. It's not doing something more or less than that.

And that's why we'll never have fun things like an obfuscated code contest. Oh wait[1]...

[1]: https://www.ioccc.org/

kmeisthax
10 replies
22h21m

I had the exact same idea after seeing Google point out that you can[0] get ChatGPT to regurgitate verbatim training data by asking it to repeat the same word over and over again[1]. I'm glad to see someone else actually bring it to fruition.

This, of course, brings two additional questions:

1. Is this "AI, hold the AI" approach more energy-efficient than having gradient descent backpropagation compress a bunch of training data into a model that can then be run on specialized AI coprocessors?

2. Will this result wind up being evidence in the ongoing lawsuits against OpenAI and Stability AI?

[0] Could. OpenAI now blocks generation if you fill the context window with a single word.

[1] https://arxiv.org/abs/2311.17035

refulgentis
2 replies
20h48m

I'm confused, you had the exact same idea that LLM output is based on probability of next token, which is based on the training data?

If that's the case, no, it's unlikely this result will end up becoming evidence; that is well known and fundamental.

The author's contribution to the discussion is showing this to a technical audience writing their own GPT; as they note, most "how to implement this?" material focuses on transformers.

kmeisthax
1 replies
18h3m

Much of the sales hype and other literature surrounding LLMs specifically obfuscates the role that training data plays in the model. Training data is "learned from", but that's implying the data goes away after the training process ends and you have a model that's solely composed of uncopyrightable knowledge about how to write or draw. If the models are actually retaining training data, and we have a way to extract that data, then the models didn't learn - legally speaking[0], they copied training set data.

The idea I had wasn't "LLMs are based on probabilities", it was "what if you benchmarked an LLM against a traditional search index over the training corpus". The linked blog post doesn't rip out the LLM entirely, just the feed-forward layer, but the result is what I thought would happen: an attention-augmented search index that produces nearly identical probability distributions to the 66% of the model that was removed.

[0] Programmers talking about copyright usually get tripped up on this, so I'll spell it out: copyright is a matter of data provenance, not bit-exactness. Just because the weights are harder to inspect does not mean no copyright infringement has occurred. Compression does not launder copyright.

refulgentis
0 replies
17h33m

Should make sure to establish this up front: People know this, it's not controversial. It's not only known to a few. It's how it works.

Also note that this example purposefully minimizes the training data down to an absurdity, so it is possible to correlate the next letter's probabilities 1:1 with the input. The key to the rest of this comment, and the discussions you reference, is the observation that this is vastly harder once the training data is measured in terabytes, to the point that the question becomes interesting.

As for the argument you're speaking of: the people you think are speaking literally are speaking figuratively; they know it reproduces _some_ training data, i.e. 2+2=4 was surely in the training data. Or cf. NY Times v. OpenAI, where they were able to get it to complete an article given the first ~5 paragraphs of the article.

The unsettled question, in US legal parlance, is if LLMs are sufficiently transformative of the training data that it becomes fair use.

Eschewing US legal parlance: where, exactly, on the spectrum from "completely original" to "photocopier with perfect recall" do LLMs fall, given we know they aren't at either of those extremes? What responsibility does that give someone operating an LLM commercially to the entities who originated the training data?

noduerme
2 replies
18h14m

re: 2... if you copyright a work, then surely you also hold rights to a zip file of that work. So why not also the probability distribution of letters in that work?

zarzavat
0 replies
15h46m

To be precise, you don’t hold rights to a zip file, copyright doesn’t know anything about files. You hold rights to a work, an abstract legal concept. Your rights to the work allow you to control the reproduction of that work, and distributing a zip file is an instance of reproducing the work.

Probability distributions don’t contain enough information to reproduce a work (since they don’t preserve order). They are not copyrightable in and of themselves, and distributing a probability distribution of a work doesn’t amount to reproduction.

kmeisthax
0 replies
17h48m

If the probability distribution is enough to reproduce a copyrighted work to the level of substantial similarity, then yes, a copy has legally been made.

However, that's not the only question involved in a copyright lawsuit[0].

So far most of the evidence of copying has been circumstantial: a regurgitated Quake lighting function here, a Getty Images watermark there, but everything else has looked like wholly original output. We know from how these models are trained that copyrighted work is involved somewhere, but a court could just say it's Fair Use to scrape data and train a model on it. However, that defense is way harder to make if we can actually open up a model and show "ok, this is where and how it's storing copied training set data". At a minimum, it takes the "how much was used" Fair Use factor from "a few watermarks" to "your honor, the entire fucking Internet".

[0] As usual we will assume jurisdiction in US court

sureglymop
1 replies
19h15m

I found it interesting that in the arxiv paper you linked they are talking about an attack, ethics and responsible disclosure.

But when it comes to scraping the entirety of the internet to train such models that's never referred to as an attack.

kmeisthax
0 replies
18h26m

Scraping the whole web isn't considered an attack because, well, that's just how search engines work. That being said, there are all sorts of norms (e.g. robots.txt) qualifying what kinds of scraping are accepted.

As far as I can tell, AI researchers assumed they could just piggyback on top of those norms to get access to large amounts of training data. The problem is that it's difficult to call copying an attack unless you go full MAFIAA-brain[0] and argue that monopoly rents on creative works are the only functional backstop to the 1st Amendment. Hell, even if you do, the EU and Japan[1] both have a statutory copyright exception explicitly legalizing AI training on other people's text. It's not even accepted dogma among copyright holders that this is an attack.

[0] Music And Film Industry Association of America, a fictional industry association purported to be the merger of the MPAA and RIAA announced on April 1st, 2006: http://mafiaa.org/

[1] Yes, the same country whose copyright laws infamously have no Fair Use equivalent. In Japan, it is illegal to review or parody a copyrighted work without a license, but it is legal to train an AI on it.

yorwba
0 replies
20h25m

This approach cannot possibly be more efficient than running the original model, because it relies on running the original model to get the activations, then searching the text corpus for strings with similar activations to compute the next-token statistics. You don't get to skip many steps, and you end up having to do a bunch of extra work.

I'd be surprised if doing this with two completely separate corpora, one for training the model and the other to search for strings with similar activations, wouldn't lead to much the same results. Because the hard part is constructing similar activations for strings with similar next-token statistics in the first place.

Note that in the per-layer weights [0.01, 0.01, 0.1, 1.5, 6, 0.01] the penultimate layer is the most important, where the input has already been transformed a lot. So you can't expect to use this to replace a transformer with a simple grep over the training data. (My guess as to why the penultimate layer has a much higher weight than the final one is that this is due to induction heads https://transformer-circuits.pub/2021/framework/index.html which implement copying repeated strings from the input, with the penultimate layer determining what to look for and the final layer doing the copying.)
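
Schematically, the lookup being described is something like this (a rough sketch, not the post's actual code; the helper names and the top-k aggregation are assumptions):

  import numpy as np
  from collections import Counter

  def cosine(a, b):
      return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

  def approx_next_token_counts(query_acts, corpus, layer_weights, k=50):
      # query_acts: per-layer activation vectors for the current input;
      # you still have to run the original transformer to get these.
      # corpus: list of (per-layer activation vectors, observed next token) pairs,
      # precomputed with the same model over the training text.
      scored = []
      for acts, next_tok in corpus:
          sim = sum(w * cosine(q, a)
                    for w, q, a in zip(layer_weights, query_acts, acts))
          scored.append((sim, next_tok))
      scored.sort(key=lambda t: t[0], reverse=True)
      return Counter(tok for _, tok in scored[:k])  # next-token stats of the k best matches

The point being that `query_acts` only exists after a forward pass of the original model, so the search is pure extra work.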

bruce343434
0 replies
11h26m

In my experience before they blocked it: it hallucinates something that looks like training data. A GitHub readme that under closer inspection doesn't actually exist and is incoherent. Some informational brochure about nothing. A random dialogue.

tysam_and
9 replies
16h31m

Some of the topics in the parent post should not be a major surprise to anyone who has read https://people.math.harvard.edu/~ctm/home/text/others/shanno... ! If we have not read the foundations of the field that we are in, we are doomed to be mystified by unexplained phenomena which arise pretty naturally as consequences of already-distilled work!

That said, the experiments seem very thorough on a first pass/initial cursory examination; I appreciate the amount of detail that seems to have gone into them.

The tradeoff between learning existing theory and attempting to re-derive it from scratch is, I think, a hard one: not having the traditional foundation allows for the discovery of new things, but having it allows for a deeper understanding of certain phenomena. There is a cost either way.

I've seen several people here in the comments seemingly shocked that a model that maximizes the log likelihood of a sequence given the data somehow does not magically deviate from that behavior when run in inference. It's a density estimation model; do you want it to magically recite Shakespeare from the void?

Please! Let's stick to the basics; it will help experiments like this make much more sense, as there already is a very clear mathematical foundation which explains it (and said emergent phenomena).

If you want more specifics, there are several layers to this; Shannon's treatment of ergodic systems is a good start (there is some minor deviation from that here, but it is likely a 'close enough' match to be properly instructive to the reader about the general dynamics of what is going on).

jackblemming
3 replies
16h15m

the topics in the parent post should not be a major surprise to anyone who has read https://people.math.harvard.edu/~ctm/home/text/others/shanno... !

which clearly explains it (and said emergent phenomena)

Very smart information theory people have looked at neural networks through the lens of information theory and published famous papers about it years ago. It couldn't explain many things about neural networks, but it was interesting nonetheless.

FWIW it's not uncommon for smart people to say "this mathematical structure looks like this other idea with [+/- some structure]!!" and that it totally explains everything... (kind of with so and so exceptions, well and also this and that and..). Truthfully, we just don't know. And I've never seen theorists in this field actually take the theory and produce something novel or make useful predictions with it. It's all try stuff and see what works, and then retroactively make up some crud on why it worked, if it did work (otherwise brush it under the rug).

There was this one posted recently on transformers being kernel smoothers: https://arxiv.org/abs/1908.11775

rrr_oh_man
0 replies
8h49m

It's all try stuff and see what works, and then retroactively make up some crud on why it worked, if it did work (otherwise brush it under the rug).

Reminds me of how my ex-client's data scientists would develop ML models.

randomNumber7
0 replies
15h54m

It's all try stuff and see what works, and then retroactively make up some crud on why it worked

People have done this in earlier days too. The theory around control systems was developed after PID controllers had been successfully used in practice.

Nevermark
0 replies
8h26m

I think there is more here than a backward look.

The article introduced a discrete, algorithmic method for approximating the gradient-trained model.

It would be interesting to optimize the discrete algorithm for both design and inference times, and see if any space or time advantages over gradient learning could be found. Or if new ideas popped as a result of optimization successes or failures.

It also might have an advantage in terms of algorithm adjustments. For instance, given the most likely responses at each step, discard the most likely whenever the runners-up are not too far below - and see if that reliably avoided copyright issues.
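
For instance (a minimal sketch of that adjustment; the names and the margin threshold are made up):

  import numpy as np

  def pick_token(probs, margin=0.05):
      # If the runner-up is within `margin` of the most likely token,
      # drop the most likely one and sample from the remaining candidates.
      order = np.argsort(probs)[::-1]           # token indices, most likely first
      if probs[order[0]] - probs[order[1]] < margin:
          candidates = order[1:]                # discard the most likely token
      else:
          candidates = order
      p = probs[candidates] / probs[candidates].sum()
      return int(np.random.choice(candidates, p=p))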

A lot easier to poke around a discrete algorithm, with zero uncertainty as to what is happening, vs. vast tensor models.

uptownfunk
1 replies
13h29m

Ok but why didn’t Shannon get us gpt

david_draco
0 replies
11h54m

He was busy getting us towards wifi first.

supriyo-biswas
0 replies
10h5m

In an adjacent thread, people are talking about the copyright implications of a neural network fitting its training data to within some error margin.

Many textbooks on information theory already call out the content-addressable nature of such networks[1], and they're even used in applications like compression for this reason[2][3], so it's no surprise that the NYT prompting OpenAI models with a few paragraphs of their articles reproduced them nearly verbatim.

[1] https://www.inference.org.uk/itprnn/book.pdf

[2] https://bellard.org/nncp/

[3] https://pub.towardsai.net/stable-diffusion-based-image-compr...

patcon
0 replies
15h31m

I appreciate what you're saying, but convergence (via alternative paths, of various depths) is its own signal. Repeated rediscovery perhaps isn't necessarily wastefulness, but affirmation and validation of deep truth for which there are multiple paths of arrival :)

3abiton
0 replies
16h18m

Kudos for plugging Shannon's masterpiece.

quag
5 replies
18h16m

Is the author claiming that LLMs are Markov Chain text generators? That is, the probability distribution of the next token generated is the same as the probability of those token sequences in the training data?

If so, does it suggest we could “just” build a Markov Chain using the original training data and get similar performance to the LLM?

golol
2 replies
9h9m

LLMs are Markov chains in the following sense: states are vectors of context-length many tokens. The model then describes a transition matrix: for a given context-length-sized vector of tokens, it gives you the probabilities for the next context-length-sized vector of tokens.
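
As a tiny sketch (illustrative only; `model` and `sample` are placeholders):

  def step(state, model, sample):
      # state: tuple of the last context-length tokens (the Markov "state").
      # model(state) gives probabilities for the single next token; that is what
      # induces the transition over whole states, because the next state is just
      # the same window shifted along by one.
      next_token = sample(model(state))
      return state[1:] + (next_token,)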

rrr_oh_man
1 replies
8h44m

Could you elaborate what context length means in this context? Maybe an example?

danbruc
0 replies
7h29m

The length of the input in tokens. For the simple case of tokens just being characters, an LLM does nothing but take a string of length n, the context length, and calculate for each character in the alphabet the probability that this character is the next character following the input. Then it picks one character at random according to that distribution, outputs it as the first character of the response, appends it to the input, discards the first character of the input to get it back to length n, and then repeats the entire process to produce the next character of the response.
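
In code, roughly (a character-level toy sketch; `model` here is any function from an n-character string to one probability per character of `alphabet`):

  import random

  def generate(model, prompt, alphabet, n, length):
      # model(context) -> list of probabilities, one per character in `alphabet`,
      # for the character that follows the n-character string `context`.
      context = prompt[-n:]
      out = []
      for _ in range(length):
          probs = model(context)
          ch = random.choices(alphabet, weights=probs)[0]   # sample one character
          out.append(ch)
          context = (context + ch)[-n:]                     # append it, drop the oldest
      return "".join(out)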

two_in_one
1 replies
12h19m

From the post:

I implemented imperative code that does what I’m proposing the transformer is doing. It produces outputs very similar to the transformer.

This means there is probably a way to bypass transformers and get the same results. Would be interesting if it's more efficient. Like, given a foundation model, train something else and run it on a much smaller device.

yorwba
0 replies
11h38m

I explained that it's not bypassing transformers and not more efficient in another comment: https://news.ycombinator.com/item?id=39254966

awwaiid
3 replies
20h29m

A thousand hands on a Ouija board.

hnfong
2 replies
12h43m

Is that an analogy? If so, it is an extremely interesting one, and I would like to know where it comes from :D

two_in_one
1 replies
12h7m
ramblerman
0 replies
6h37m

I think OP was asking about the expression "a thousand hands on a ouija board", not what a ouija board is.

robrenaud
2 replies
14h51m

Is the behavior that the attention + FF displacements tend to point in the same direction known? I am kind of surprised they are even in the same latent space across layers. The FF network could be doing arbitrary rotations, right? I suspect I misunderstand what is going on.

yorwba
0 replies
11h32m

It's a 2D representation of very high-dimensional vectors. Something has to be left out and accurately depicting arbitrary rotations in the high-dimensional space is one of those things.

mirekrusin
0 replies
14h32m

Best to replace attention addition with scaling and see.

kgeist
1 replies
19h40m

I trained a small (~10 million parameter) transformer following Andrej Karpathy’s excellent tutorial, Let’s build GPT: from scratch, in code, spelled out

As soon as I learned about Andrej Karpathy's NanoGPT, I trained it on War and Peace (in Russian), and what I found interesting is that it almost grokked Russian grammar despite being just a 3 MB model. The Russian language has a complex synthetic-inflectional structure. For example, the preposition "na" ("upon") requires the following noun to be in the accusative case, which is manifested as the ending -a for animate masculine nouns, but as a null ending for inanimate nouns, or as -ia for nouns which end in a "soft consonant", -u for feminine nouns, etc. etc. Or the verb "to use" requires the following noun to be in the instrumental case if it's used as a tool.

Although it's not perfect and had mistakes, I found it interesting that NanoGPT was able to infer certain complex rules in just 3 minutes of training - and I searched in the texts for the exact examples it generated and found nothing verbatim.

However, despite understanding the grammar more or less, it was semantically complete nonsense.

lingeringdoubts
0 replies
17h8m

Not too surprising, since the inflections would be among the most common tokens in the training text.

jimmySixDOF
1 replies
20h24m

This was a good 3D visualization of the same systems, and they should probably be read together for maximum effect:

LLM Visualization (https://bbycroft.net/llm) https://news.ycombinator.com/item?id=38505211

tysam_and
0 replies
16h29m

I appreciate the effort that went into this visualization; however, as someone who has worked with neural networks for 9 years, I found it far more confusing than helpful. I believe it was due to trying to present all items at once instead of deferring to abstract concepts, but I am not entirely sure of that. <3 :'))))

empiko
1 replies
20h2m

Nice project, but the model being studied is really just a toy model (both in size and training data). As such, it can indeed be approximated by simpler models (I would suspect even by n-gram LMs), but it might not be representative of how larger LMs work.

Closi
0 replies
19h3m

This is probably true - i.e. you could make an even smaller model and then likely come up with an even-simpler explanation for how it worked.

robblbobbl
0 replies
53m

First of all, thank you for the insights. Could you please provide a PDF version of your article with proper formatting so it can also be read offline?

Imnimo
0 replies
16h8m

I'm having a very hard time understanding exactly what the author is claiming to show. I've read the "Interpretation: Why Does the Approximation Work?" section a few times, but it feels like it's just a mechanical description of the steps of a transformer. What's the core claim?