
LLMs use a surprisingly simple mechanism to retrieve some stored knowledge

retrofrost
50 replies
1d

This is amazing work, but to me it highlights some of the biggest problems in the current AI zeitgeist: we are not really trying to work on any neuron or ruleset that isn't much different from the perceptron thats just a sumnation function. Is it really that surprising that we just see this same structure repeated in the models? Just because feedforward topologies with single neuron steps are the easiest to train and run on graphics cards does that really make them the actual best at accomplishing tasks? We have all sorts of unique training methods and encoding schemes that never get used because the big libraries don't support them. Until we start seeing real variation in the fundamental rulesets of neural nets, we are always just going to be fighting against the fact that these are just perceptrons with extra steps.

visarga
41 replies
1d

Just because feedforward topologies with single neuron steps are the easiest to train and run on graphics cards does that really make them the actual best at accomplishing tasks?

You are ignoring a mountain of papers trying all conceivable approaches to create models. It is evolution by selection, in the end transformers won.

retrofrost
24 replies
1d

Just because papers are getting published doesn't mean it's actually gaining any traction. I mean, we have known that the time series of signals received plays a huge role in how bio neurons functionally operate, and yet we have nearly no examples of spiking networks being pushed beyond basic academic exploration. We have known glial cells play a critical role in biological neural systems, and yet you can probably count the number of papers that examine using an abstraction of that activity in a neural net on both your hands and toes. Neuroevolution using genetic algorithms has been basically looking for a big break since NEAT. It's the height of hubris to say that we have peaked with transformers when the entire field is based on not getting trapped in local maxima's. Sorry to be snippy, but there is so much uncovered ground it's not even funny.

gwervc
20 replies
1d

"We" are not forbidding you to open a computer, start experimenting and publishing some new method. If you're so convinced that "we" are stuck in a local maxima, you can do some of the work you are advocating instead of asking other to do it for you.

Kerb_
16 replies
23h49m

You can think chemotherapy is a local maxima for cancer treatment and hope medical research seeks out other options without having the resources to do it yourself. Not all of us have access to the tools and resources to start experimenting as casually as we wish we could.

erisinger
13 replies
23h16m

Not a single one of you bigbrains used the word "maxima" correctly and it's driving me crazy.

vlovich123
10 replies
23h6m

As I understand it a local maxima means you’re at a local peak but there may be higher maximums elsewhere. As I read it, transformers are a local maximum in the sense of outperforming all other ML techniques as the AI technique that gets the closest to human intelligence.

Can you help my little brain understand the problem by elaborating?

Also you may want to chill with the personal attacks.

erisinger
9 replies
22h53m

Not a personal attack. These posters are smarter than I am, just ribbing them about misusing the terminology.

"Maxima" is plural, "maximum" is singular. So you would say "a local maximum," or "several local maxima." Not "a local maxima" or, the one that really got me, "getting trapped in local maxima's."

As for the rest of it, carry on. Good discussion.

gyrovagueGeist
6 replies
21h0m

While "local maximas" is wrong, I think "a local maxima" is a valid way to say "a member of the set of local maxima" regardless of the number of elements in the set. It could even be a singleton.

Tijdreiziger
3 replies
20h30m

You can't have one maxima in the same way you can't have one pencils. That's just how English works.

pixl97
2 replies
19h23m

You can't have one local maxima, it would be the global maxima. So by saying local maxima you're assuming the local is just a piece of a larger whole, even if that global state is otherwise undefined.

reverius42
0 replies
14h56m

No, you can’t have one local maxima, or one global maxima, because it’s plural. You can have one local or global maximum, or two (or more) local or global maxima.

folli
0 replies
14h4m

"You can't have one local pencils, it would be the global pencils"

dragonwriter
1 replies
13h51m

No, a member of the set of local maxima is a local maximum, just like a member of the set of people is a person, because it is a definite singular.

The plural is also used for indefinite number, so “the set of local maxima” remains correct even if the set has cardinality 1, but a member of the set has definite singular number irrespective of the cardinality of the set.

gyrovagueGeist
0 replies
4h5m

I've been convinced, thanks!

FeepingCreature
1 replies
22h39m

A local maxima, that is, /usr/bin/wxmaxima...

erisinger
0 replies
22h20m

Touché...

tschwimmer
0 replies
21h17m

yeah, not a Nissan in sight

antonvs
0 replies
17h34m

“Maxima” sounds fancy, making it catnip for people trying to sound smart.

mikewarot
1 replies
23h8m

MNIST and other small, easy-to-train-against datasets are widely available. You can try out anything you like even with a cheap laptop these days, thanks to a few decades of Moore's law.

It is definitely NOT out of your reach to try any ideas you have. Kaggle and other sites exist to make it easy.

Good luck! 8)

retrofrost
0 replies
22h17m

My pet project has been trying to use Elixir with NEAT or HyperNEAT to try and make a spiking network, then when that's working decently, drop in some glial interactions I saw in a paper. It would be kinda bad at purely functional stuff, but idk, seems fun. The biggest problems are time and having to do a lot of both the evolutionary stuff and the network stuff. But yeah, the ubiquity of free datasets does make it easy to train.

haltIncomplete
2 replies
23h4m

All we’re doing is engineering new data compression and retrieval techniques: https://arxiv.org/abs/2309.10668

Are we sure there’s anything “net new” to find within the same old x86 machines, within the same old axiomatic systems of the past?

Math is a few operations applied to carving up stuff, and we believe we can do that infinitely in theory. So "all math that abides by our axiomatic underpinnings" is valid regardless of whether we "prove it" or not.

Physical space we can exist in, a middle ground of reality we evolved just so to exist in, seems to be finite; I can’t just up and move to Titan or Mars. So our computers are coupled to the same constraints of observation and understanding as us.

What about daily life will be upended by reconfirming decades-old experiments? How is this not living in a sunk cost fallacy?

When all you have is a hammer…

I’m reminded of Einstein’s quote about insanity.

samus
0 replies
18h55m

If you abstract far enough then yes, everything we are doing is somehow akin to what we have done before. But that then also applies to what Einstein did.

aldousd666
0 replies
16h15m

Einstein didn't say that about insanity, but... systems exist and are consistently described by particular equations at particular scales. Sure, we can say everything is quantum mechanics; even classical physics can technically be translated into a series of wave functions that explain the same behaviors we observe, if we could measure it... But it's impractical, and some of the concepts we think of as fundamental at certain scales, like nucleons, didn't exist at others, like the equations that describe the energy of empty space. So it's maybe not quite a fallacy to point out that not every concept we find to be useful, like deep learning inference, cogently encapsulates every rule at every scale that we know about down to the electrons. Because none of our theories do that, and even if they did, we couldn't measure or process all the things needed to check and see if we're even right. So we use models that differ from each other, but that emerge from each other, but only when we cross certain scale thresholds.

typon
0 replies
23h44m

Do you really think that transformers came to us from God? They're built on the corpses of millions of models that never went anywhere. I spent an entire year trying to scale up a stupid RNN back in 2014. Never went anywhere, because it didn't work. I am sure we are stuck in a local minima now - but it's able to solve problems that were previously impossible. So we will use it until we are impossibly stuck again. Currently, however, we have barely begun to scratch the surface of what's possible with these models.

samus
0 replies
18h53m

Who said that we peaked with transformers? I sure hope we did not. The current focus on them is just institutional inertia. Worst case another AI winter comes, at the end of which a newer, more promising technology would manage to attract funding anew.

leoc
0 replies
22h56m

(The singulars are ‘maximum’ and ‘minimum’; ‘maxima’ and ‘minima’ are the plurals.)

foobiekr
6 replies
22h48m

"won"

They barely work for a lot of cases (i.e., anything where accuracy matters, despite the bubble's wishful thinking). It's likely that something will sunset them in the next few years.

victorbjorklund
3 replies
22h26m

That is how evolution works. Something wins until something else comes along and wins. And so on forever.

Retric
2 replies
18h59m

Evolution generally favors multiple winners in different roles over a single dominant strategy.

People tend to favor single winners.

advael
1 replies
15h38m

I think this is both a really astute and important observation and also one that's more true locally than of people broadly. Modern neoliberal business culture generally, and the consolidated current incarnation of the tech industry in particular, have strong "tunnel vision" and a belief in chasing optimality compared to many other cultures, both extant and past.

imtringued
0 replies
10h28m

In neoclassical economics, there are no local maxima, because it would make the math intractable and expose how much of a load of bullshit most of it is.

refulgentis
1 replies
19h33m

It seems cloyingly performative grumpy-old-man once you're at "it barely works and it's a bubble and blah blah" in response to a discussion about their comparative advantage (yeah, they won, and absolutely convincingly so).

wizzwizz4
0 replies
18h3m

That's like saying Bitcoin won cryptography.

dartos
3 replies
1d

I mean RWKV seems promising and isn’t a transformer model.

Transformers have first mover advantage. They were the first models that scaled to large parameter counts.

That doesn’t mean they’re the best or that they’ve won, just that they were the first to get big (literally and metaphorically)

tkellogg
1 replies
1d

Yeah, I'd argue that transformers created such capital saturation that there's a ton of opportunity for alternative approaches to emerge.

dartos
0 replies
22h58m

Speak of the devil. Jamba just hit the front page.

refulgentis
0 replies
19h29m

It doesn't seem promising; a one-man band has been on a quixotic quest based on intuition and it's gotten ~nowhere, and it's not for lack of interest in alternatives. There's never been a better time to have a different approach. Is your metric "times I've seen it on HN with a convincing argument for it being promising"? I'm not embarrassed to admit that is/was mine, but alternatively, you're aware of recent breakthroughs I haven't seen.

nicklecompte
1 replies
23h24m

His point is that "evolution by selection" also includes that transformers are easy to implement with modern linear algebra libraries and cheap to scale on current silicon, both of which are engineering details with no direct relationship to their innate efficacy at learning (though indirectly it means you scale up the training data for more inefficient learning).

wanderingbort
0 replies
23h7m

I think it is correct to include practical implementation costs in the selection.

Theoretical efficacy doesn’t guarantee real world efficacy.

I accept that this is self reinforcing but I favor real gains today over potentially larger gains in a potentially achievable future.

I also think we are learning practical lessons on the periphery of any application of AI that will apply if a mold-breaking solution becomes compelling.

szundi
0 replies
18h13m

“end”

jjtheblunt
0 replies
12h19m

in the end transformers won

we're at the end?

antonvs
0 replies
17h42m

I’d say it’s more that transformers are in the lead at the moment, for general applications. There’s no rigorous reason afaik that it should stay that way.

ldjkfkdsjnv
4 replies
23h37m

Cannot understand people claiming we are in a local maxima, when we literally had an AI scientific breakthrough only in the last two years.

xanderlewis
3 replies
22h19m

Which breakthrough in the last two years are you referring to?

6gvONxR4sf7o
1 replies
19h6m

If you had to reduce it to one thing, it's probably that language models are capable few-shot and zero-shot learners. In other words, by training a model to simply predict the next word on naturally occurring text, you end up with a tool you can use for generic tasks, roughly speaking.

xyzzy_plugh
0 replies
16h11m

It turns out a lot of tasks are predictable. Go figure.

ldjkfkdsjnv
0 replies
21h38m

the LLM scaling law

posix86
0 replies
21h58m

I don't understand enough about the subject to say, but to me it seemed like yes, other models have better metrics at equal model size in terms of number of neurons or asymptotic runtime, but the most important metric will always be accuracy/precision/etc. for money spent... or in other words, if GPT requires 10x the number of neurons to reach the same performance, but buying compute & memory for those neurons is cheaper, then GPT is a better means to an end.

ikkiew
0 replies
22h47m

the perceptron thats just a sumnation[sic] function

What would you suggest?

My understanding of part of the whole NP-Complete thing is that any algorithm in the complexity class can be reduced to, among other things, a 'summation function'.

MuffinFlavored
19 replies
1d2h

I don't understand how a "CSV file/database/model" of 70,000,000,000 (70B) "parameters" of 4-bit weights (a 4-bit value can be 1 of 16 unique numbers) gets us an interactive LLM/GPT that is near-all-knowledgeable on all topics/subjects.

edit: did research, the 4-bit is just a "compression method", the model ends up seeing f32?

Quantization is the process of mapping 32-bit floating-point numbers (which are the weights in the neural network) to a much smaller bit representation, like 4-bit values, for storage and memory efficiency.

Dequantization happens when the model is used (during inference or even training, if applicable). The 4-bit quantized weights are converted back into floating-point numbers that the model's computations are actually performed with. This is done using the scale and zero-point determined during the initial quantization, or through more sophisticated mapping functions that aim to preserve as much information as possible despite the reduced precision.
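
For intuition, here's a minimal sketch of the scale/zero-point scheme described above (a toy example in numpy; real quantizers typically work per group or per channel and may use fancier mappings):

  import numpy as np

  def quantize_4bit(w):
      # Map float weights onto the 16 integer levels 0..15.
      lo, hi = float(w.min()), float(w.max())
      scale = (hi - lo) / 15.0
      zero_point = lo
      q = np.round((w - zero_point) / scale).astype(np.uint8)
      return q, scale, zero_point

  def dequantize_4bit(q, scale, zero_point):
      # Recover approximate float weights for the actual computation.
      return q.astype(np.float32) * scale + zero_point

  w = np.random.randn(8).astype(np.float32)
  q, s, z = quantize_4bit(w)
  print(np.abs(w - dequantize_4bit(q, s, z)).max())  # error is bounded by scale/2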

so what is the relationship between "parameters" and "# of unique tokens the model knows about (vocabulary size)"?

At first glance, LLAMa only has a 32,000 vocabulary size and 65B parameters as compared to GPT-3,

The 65 billion parameters in a model like LLAMA (or any large language model) essentially function as a highly intricate mapping system that determines how to respond to a given input based on the learned relationships between tokens in its training data.

Filligree
17 replies
1d2h

It doesn't, is the simple answer.

The slightly more complicated one is that a compressed text dump of Wikipedia isn't even 70GB, and this is lossy compression of the internet.

MuffinFlavored
11 replies
1d2h

say the average LLM these days has a unique token (vocabulary) size of ~32,000 (not its context size; the # of unique tokens it can pick between in a response: English words, punctuation, math, code, etc.)

the 60-70B parameters of models is basically like... just stored patterns of "if these 10 tokens in a row input, then these 10 tokens in a row output score the highest"

Is that a good summary?

The model uses its learned statistical patterns to predict the probability of what comes next in a sequence of text.

based on what inputs?

1. previous tokens in the sequence from immediate context

2. tokens summarizing the overall topic/subject matter from the extended context

3. scoring of learned patterns from training

4. what else?

numeri
6 replies
1d2h

Your suggested scheme (assuming a mapping from 10 tokens to 10 tokens, with each token taking 2 bytes to store) would take (32000 * 20) * 2 bytes = 2.3e78 TiB of storage, or about 250 MiB per atom in the observable universe (1e82), prior to compression.

I think it's more likely that LLMs are actually learning and understanding concepts as well as memorizing useful facts, than that LLMs have discovered a compression method with that high of a compression ratio, haha.

mjburgess
4 replies
1d2h

LLMs cannot determine the physical location of any atoms. They cannot plan movement, and so on.

LLMs are just completing patterns of text that they have been given before. 'Everything ever written' is both a lot for any individual person to read, but also almost nothing, in that properly describing even a table requires more information.

Text is itself an extremely compressed medium which lacks almost any information about the world; it succeeds in being useful to generate because we have that information and are able to map the text back onto it.

numeri
3 replies
1d2h

I didn't imply that they know anything about where atoms are, I was just pointing out the sheer absurdity of that volume of data.

I should make it clear that my comparison there is unfair and mostly just funny – you don't need to store every possible combination of 10 tokens, because most of them will be nonsense, so you wouldn't actually need that much storage. That being said, it's been fairly solidly proven that LLMs aren't just lookup tables/stochastic parrots.

mjburgess
2 replies
1d1h

fairly solidly proven that LLMs aren't just lookup tables/stochastic parrots

Well, I'd strongly disagree. I see no evidence of this; I am quite well acquainted with the literature.

All empirical statistical AI is just a means of approximating an empirical distribution. The problem with NLP is that there is no empirical function from text tokens to meanings; just as there is no function from sets of 2D images to a 3D structure.

We know before we start that the distributions of text tokens are only coincidentally related to the distributions of meanings. The question is just how much value that coincidence has in any given task.

(Consider, e.g., that if I ask, "do you like what I'm wearing?" there is no distribution of responses which is correct. I do not want you to say "yes" 99/100, or even 100/100, times, etc. What I want you to say is a word caused by a mental state you have: that of (dis)liking what I'm wearing.

Since no statistical AI systems generate outputs based on causal features of reality, we know a priori that almost all possible questions that can be asked cannot be answered by LLMs.

They are only useful where questions have canonical answers; and only because "canonical" means that a text->text function is likely to be coincidentally indistinguishable from the meaning->meaning function we're interested in.)

tel
1 replies
17h50m

That suggests that no statistical method could ever recover hidden representations though. And that’s patently untrue. Taken to its greatest extreme you shouldn’t even be able to guess between two mixed distributions even when they have wildly non-overlapping ranges. Or put another way, all of statistical testing in science is flawed.

I’m not saying you believe that, but I fail to see how that situation is structurally different from what you claim. If it’s a matter of degree, how do you feel things change as the situation becomes more complex?

mjburgess
0 replies
9h33m

Yes, I think most statistical testing in science is flawed.

But, to be clear, the reason it could ever work at all has nothing to do with the methods or the data itself; it has to do with the properties of the data generating process (i.e., reality, i.e., what's being measured).

You can never build representations from measurement data; this is called inductivism and it's pretty clearly false: no representation is obtained from just characterising measurement data. There are no cases I can think of where this would work -- temperature isn't patterns in thermometers; gravity isn't patterns in the positions of stars; and so on.

Rather, you can decide between competing representations using stats in a few special cases. Stats never uncovers hidden representations; it can decide between different formal models which include such representations.

E.g., if you characterise some system as having a power-law data generating process (e.g., social network friendships), then you can measure some parameters of that process.

Or, e.g., if you arrange all the data to already follow a law you know (e.g., F=Gmm/r^2), then you can find G, 'statistically'.

This has caused a lot of confusion historically: it seems G is 'induced over cases', but all the representation work has already been done. Stats/induction just plays the role of fine-tuning known representations. It never builds any.

pk-protect-ai
0 replies
1d1h

There is something wrong with this arithmetic: "(32000 * 20) * 2 bytes = 2.3e78 TiB of storage" ... The factorial is missing somewhere in there ...

HarHarVeryFunny
2 replies
1d

the 60-70B parameters of models is basically like... just stored patterns of "if these 10 tokens in a row input, then these 10 tokens in a row output score the highest"

Is that a good summary?

No - there's a lot more going on. It's not just mapping input patterns to output patterns.

A good starting point for understanding it is linguists' sentence-structure trees (and these were the inspiration for the "transformer" design of these LLMs).

https://www.nltk.org/book/ch08.html

Note how there are multiple levels of nodes/branches to these trees, from the top node representing the sentence as a whole, to the words themselves which are all the way at the bottom.

An LLM like ChatGPT is made out of multiple layers (e.g. 96 layers for GPT-3) of transformer blocks, stacked on top of each other. When you feed an input sentence into an LLM, the sentence will first be turned into a sequence of token embeddings, then passed through each of these 96 layers in turn, each of which changes ("transforms") it a little bit, until it comes out the top of the stack as the predicted output sentence (or something that can be decoded into the output sentence). We only use the last word of the output sentence which is the "next word" it has predicted.

You can think of these 96 transformer layers as a bit like the levels in one of those linguistic sentence-structure trees. At the bottom level/layer are the words themselves, and at each successive higher level/layer are higher-and-higher level representations of the sentence structure.

In order to understand this a little better, you need to understand what these token "embeddings" are, which is the form in which the sentence is passed through, and transformed by, these stacked transformer layers.

To keep it simple, think of a token as a word, and say the model has a vocabulary of 32,000 words. You might perhaps expect that each word is represented by a number in the range 1-32000, but that is not the way it works! Instead, each word is mapped (aka "embedded") to a point in a high dimensional space (e.g. 4096-D for LLaMA 7B), meaning that it is represented by a vector of 4096 numbers (cf a point in 3-D space represented as (x,y,z)).

These 4096-element "embeddings" are what actually pass through the LLM and get transformed by it. Having so many dimensions gives the LLM a huge space in which it can represent a very rich variety of concepts, not just words. At the first layer of the transformer stack these embeddings do just represent words, the same as the nodes do at the bottom layer of the sentence-structure tree, but more information is gradually added to the embeddings by each layer, augmenting and transforming what they mean. For example, maybe the first transformer layer adds "part of speech" information so that each embedded word is now also tagged as a noun or verb, etc. At the next layer up, the words comprising a noun phrase or verb phrase may get additionally tagged as such, and so on as each transformer layer adds more information.

This just gives a flavor of what is happening, but basically by the time the sentence has reached the top layer of the transformer, the model has been able to see the entire tree structure of the sentence, and only then does it "understand" it well enough to predict a grammatically and semantically "correct" continuation.
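
To make the data flow concrete, here is a shape-level sketch (toy sizes, numpy, and a placeholder in place of the real attention+MLP block; it only illustrates the embed -> N layers -> next-token pipeline described above):

  import numpy as np

  vocab_size, d_model, n_layers = 1000, 64, 4   # toy sizes; e.g. LLaMA 7B uses d_model=4096, GPT-3 uses 96 layers
  rng = np.random.default_rng(0)
  embed = rng.normal(0, 0.02, (vocab_size, d_model))
  unembed = rng.normal(0, 0.02, (d_model, vocab_size))

  def block(x):
      # Placeholder for one transformer block (really attention + MLP);
      # it just nudges each position's embedding a little.
      return x + 0.1 * np.tanh(x @ rng.normal(0, 0.1, (d_model, d_model)))

  token_ids = np.array([17, 42, 99])   # a tokenized input sentence
  x = embed[token_ids]                 # (seq_len, d_model) embeddings
  for _ in range(n_layers):
      x = block(x)                     # each layer transforms the whole sequence
  logits = x[-1] @ unembed             # last position predicts the next token
  print(int(np.argmax(logits)))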

MichaelZuo
1 replies
20h26m

Thanks for the explanation.

Since Unicode has well over 64,000 symbols, does that imply models trained on a large corpus must necessarily have at least 64,000 ‘branches’ at the bottom layer?

HarHarVeryFunny
0 replies
16h18m

The size of the character set (Unicode) doesn't really factor into this. Input words are broken down into multi-character tokens (some words will be one token, some split into two, etc.), then these tokens are mapped into the embedding vectors, which are what the model operates on.

The linguistic sentence structure tree for any input sentence is a useful way to think about what is happening as the input sentence is fed into the model and processed through it layer by layer, but doesn't have any direct correspondence to the model. The model has a fixed number of layers of fixed max-tokens width, so nothing changes according to the sentence passing through it.

Note that the bottom level of the sentence structure tree is just the words of the sentence, so the number of branches is just the length of the sentence. The model doesn't actually represent these branches though - just the embeddings corresponding to the input, which are transformed from input to output as they are passed through the model and each layer does its transformer thing.

wongarsu
0 replies
1d1h

That would be equivalent to a hidden Markov chain. Those have been around for decades, but we have only managed to make them coherent for very short outputs. Even GPT-2 beats any Markov chain, so there has to be more going on.

Modern LLMs are able to transfer knowledge between different languages, so it's fair to assume that some mapping between human language and a more abstract internal representation happens at the input and output, instead of the model "operating" on English or Chinese or whatever language you talk with it in. And once this exists, an internal "world model" (as in: a collection of facts and implications) isn't far off, and seems to indeed be something most LLMs do. The reasoning on top of that world model is still very spotty, though.

ramses0
4 replies
1d2h

Is there some sort of "LLM-on-Wikipedia" competition?

ie: given "just wikipedia" what's the best score people can get on however these models are evaluated.

I know that all the commercial ventures have a voracious data-input set, but it seems like there's room for dictionary.llm + wikipedia.llm + linux-kernel.llm and some sort of judging / bake-off for their different performance capabilities.

Or does the training truly _NEED_ every book ever written + the entire internet + all knowledge ever known by mankind to have an effective outcome?

ramses0
1 replies
1d

Not exactly, because LLMs seem to be exhibiting value via "lossy knowledge response" vs. "exact reproduction measured in bytes", but close.

AnotherGoodName
0 replies
14h27m

Lossy and lossless are more interchangeable in computer science than people give them credit for, so I wouldn't dwell on that too much. You can optimally convert one into the other with arithmetic coding. In fact, the actual best-in-class algorithms that have won the Hutter Prize are all lossy behind the scenes. They make a prediction on the next data using a model (often AI based), which is a lossy process, and with arithmetic coding they losslessly encode the next data with bits proportional to how correct the prediction was. In fact, the reason the Hutter Prize is lossless compression is exactly because converting lossy to lossless with arithmetic coding is a way to score how correct a lossy prediction is.

CraigJPerry
0 replies
1d1h

> Or does the training truly _NEED_ every book ever written + the entire internet + all knowledge ever known by mankind to have an effective outcome?

I have the same question.

Peter Norvig’s GOFAI Shakespeare generator example[1] (which is not an LLM) gets impressive results with little input data to go on. Does the leap to LLM preclude that kind of small input approach?

[1] link should be here because I assumed as I wrote the above that I would just turn it up with a quick google. Alas t’was not to be. Take my word for it, somewhere on t’internet is an excellent write up by Peter Norvig on LLM vs GOFAI (good old fashioned artificial intelligence)

Acumen321
0 replies
1d1h

Quantization in this context is the precision of each value in the vector or matrix/tensor.

If the model in question has a token embedding length of 1024, even with 1-bit quantization each token embedding has 2^1024 possible values.

If the context length is 32,000 tokens, there are (2^1024)^32,000 possible inputs.

whatever1
11 replies
1d2h

LLMs seem like a good compression mechanism.

It blows my mind that I can have a copy of Llama locally on my PC and have access to virtually the entire internet.

krainboltgreene
4 replies
1d1h

have access to virtually the entire internet

It isn't even close to 1% of the internet, much less virtually the entire internet. According to the latest dump, Common Crawl has 4.3B pages, but Google in 2016 estimated there are 130T pages. The difference between 130T and 4.3B is about 130T. Even if you narrow it down to Google's searchable text index it's "100's of billions of pages" and roughly 100P compared to CommonCrawl's 400T.

fspeech
2 replies
1d

130T unique pages? That seems highly unlikely, as that averages to over 10,000 pages for each human being alive. If GP merely wants texts of interest to themselves, as opposed to an accurate snapshot, it seems LLMs should be quite capable, one day.

darby_eight
0 replies
22h38m

It doesn't seem that hard to believe given how much automatically generated "content" (mostly garbage) there is.

I think a more interesting question is how much information there is on the internet, especially after optimal compression. I'm guessing this is a very difficult question to answer, but also much higher than LLMs currently store.

cypress66
0 replies
6h10m

Is it? Every user profile in every website is a page. Every single tweet is a page.

whatever1
0 replies
13h23m

The internet to me and to most of the people is the 10 first search results for the various terms we search for.

Culonavirus
4 replies
1d1h

Yea except it's a lossy compression. With the lost part being hallucinated in at inference time.

Kuinox
2 replies
1d1h

If you've read the article, the LLM hallucinations aren't due to the model not knowing the information, but to a function that chooses to remember the wrong thing.

sinemetu11
0 replies
1d

From the paper:

Finally, we use our dataset and LRE-estimating method to build a visualization tool we call an attribute lens. Instead of showing the next token distribution like Logit Lens (nostalgebraist, 2020) the attribute lens shows the object-token distribution at each layer for a given relation. This lets us visualize where and when the LM finishes retrieving knowledge about a specific relation, and can reveal the presence of knowledge about attributes even when that knowledge does not reach the output.

They're just looking at what lights up in the embedding when they feed something in, and whatever lights up is "knowing" about that topic. The function is an approximation they added on top of the model. It's important to not conflate this with the actual weights of the model.

You can't separate the hallucinations from the model -- they exist precisely because of the lossy compression.

ewild
0 replies
1d

even this place has people not reading the articles. we are doomed

AnotherGoodName
0 replies
16h13m

Lossy and lossless are way more transferable than people give them credit for.

Long-winded explanation, as best I can do in an HN comment. Essentially, for state-of-the-art compression both the encoder and the decoder have the same algorithm. They look at the bits encoded/decoded so far, and they both run exactly the same prediction on those bits using some model that predicts based on past data (AI is fantastic for this). If the prediction was 99% likely that the next bit is a '1', the encoder only writes a fraction of a bit to represent that (assuming the prediction is correct); on the other side, the decoder will have the same prediction at that point and will either read the next large number of bits to correct, or it will be able to simply write '1' to the output and start on the prediction of the next bit given that now-written '1'.

Essentially, lossy predictions of the next data are great tools for losslessly compressing data, as those predictions of the next bit/byte/word minimize the data needed to losslessly encode that next bit/byte/word. Likewise, you can trivially make a lossy compressor out of a lossless one. Lossy and lossless just aren't that different.
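
To see the cost relationship being described, the ideal (Shannon) code length for the next symbol is just -log2 of the probability the predictor assigned to it; the arithmetic coder is the bookkeeping that actually gets (almost) that close. A tiny sketch:

  import math

  def code_length_bits(p):
      # Ideal cost in bits of losslessly encoding a symbol the model
      # predicted with probability p.
      return -math.log2(p)

  print(code_length_bits(0.99))  # ~0.014 bits: a confident, correct prediction is nearly free
  print(code_length_bits(0.01))  # ~6.6 bits: a badly wrong prediction costs a lot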

The longstanding Hutter Prize for AI in fact judges the AI on how well it can compress data. http://prize.hutter1.net/ This is based on the fact that what we think of as AI and compression are quite interchangeable. There's a whole bunch of papers out on this.

http://prize.hutter1.net/hfaq.htm#compai

I have nothing to do with Hutter, but I know all about AI and data compression and their relation.

nyrikki
0 replies
2h39m

PAC learning is compression.

PAC learnable, Finite VC dimensionality, and the following form of compression are fully equivalent.

https://arxiv.org/abs/1610.03592

Basically each individual neuron/perceptron just splits a space into two subspaces.
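
As a concrete illustration of that last point, a single perceptron is just a weighted sum and a threshold, i.e. a hyperplane dividing the input space into two half-spaces:

  import numpy as np

  def perceptron(x, w, b):
      # The hyperplane w.x + b = 0 splits the input space in two.
      return 1 if np.dot(w, x) + b > 0 else 0

  w, b = np.array([1.0, -2.0]), 0.5
  print(perceptron(np.array([3.0, 1.0]), w, b))  # 1: one side of the hyperplane
  print(perceptron(np.array([0.0, 1.0]), w, b))  # 0: the other side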

mike_hearn
7 replies
1d2h

This is really cool. My mind goes immediately to what sort of functions are being used to encode programming knowledge, and, if they are also simple linear functions, whether the standard library or other libraries could be directly uploaded into an LLM's brain as it evolves, without needing to go through costly training or a performance-destroying fine-tune. That's still a sci-fi ability today, but it seems to be getting closer.

Animats
5 replies
1d1h

That's a good point. It may be possible to directly upload predicate-type info into a LLM. This could be especially useful if you need to encode tabular data. Somewhere, someone probably read this and is thinking about how to export Excel or databases to an LLM.

It's encouraging to see people looking inside the black box successfully. The other big result in this area was the paper which found a representation of a game board inside an LLM after the LLM had been trained to play a game. Any other good results in that area?

The authors point out that LLMs are doing more than encoding predicate-type info. That's just part of what they are doing.

wongarsu
3 replies
1d

The opposite is also exciting: build a loss function that punishes models for storing knowledge. One of the issues with current models is that they seem to favor lookup over reasoning. If we can punish models (during training) for remembering, that might cause them to become better at inference and logic instead.

qlk1123
0 replies
17h25m

I believe it will add some spice to the model, but you shouldn't go too far in that direction. Any social system has a rule set, which has to be learnt and remembered, not inferred.

Two examples. (1) Grammars in natural languages. You can see how another commenter here uses "a local maxima", and then how people react to that. I didn't even notice, because English grammar has never been native to me. (2) Mostly, prepositions between two languages, no matter how close they are, don't have a direct mapping. The learner just has to remember them.

kossTKR
0 replies
22h10m

Interesting. Reminds me of a sci-fi short I read years ago where AIs "went insane" when they had too much knowledge, because they'd spent too much time looking through data and got a buffer overflow.

I know some of the smaller models like PHI-2 are trained specifically for reasoning by training on question-answer sets, though this seems like the opposite to me.

azinman2
0 replies
17h4m

But how do you do that when pretraining is basically predicting the next token?

AaronFriel
0 replies
1d1h

It indeed is. An attention mechanism's key and value matrices grow linearly with context length. With PagedAttention[1], we could imagine an external service providing context. The hard part is the how, of course. We can't load our entire database in every conversation, and I suspect there are also challenges around training (perhaps addressed via LandmarkAttention[2]) and around building a service to efficiently retrieve additional key-value matrices.

The external vector database service may require the tight timings necessary to avoid stalling LLMs. To manage 20-50 tokens/sec, answers must arrive within 50-20 ms.

And we cannot do this in real time: pausing the transformer when a layer produces a query vector stalls the batch, so we need a way to predict queries (or embeddings) several tokens ahead of where they'd be useful, inject the context when it's needed, and know when to page it out.

[1] https://arxiv.org/abs/2309.06180

[2] https://arxiv.org/abs/2305.16300

politician
0 replies
1d1h

Hah! Maybe Neo was an LLM. "I know kung-fu."

derefr
7 replies
1d2h

Help me understand: when they say that the facts are stored as a linear function… are they saying that the LLM has a sort of N-dimensional “fact space” encoded into the model in some manner, where facts are embedded into the space as (points / hyperspheres / Voronoi manifolds / etc); and where recalling a fact is — at least in an abstract sense — the NN computing / remembering a key to use, and then doing a key-value lookup in this space?

If so: how do you embed a KV-store into an edge-propagated graphical model? Are there even any well-known techniques for doing that “by hand” right now?

(Also, fun tangent: isn't the "memory palace" memory technique, an example of human brains embedding facts into a linear function for easier retrieval?)

jacobn
3 replies
1d2h

The fundamental operation done by the transformer, softmax(Q.K^T).V, is essentially a KV-store lookup.

The Query is dotted with the Key, then you take the softmax to pick mostly one winning Key (the Key closest to the Query basically), and then use the corresponding Value.

That is really, really close to a KV lookup, except it's a little soft (i.e. can hit multiple Keys), and it can be optimized using gradient descent style methods to find the suitable QKV mappings.
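
A minimal numpy sketch of that soft lookup (leaving out the learned projections that produce Q, K, and V from the layer input):

  import numpy as np

  def softmax(z):
      e = np.exp(z - z.max())
      return e / e.sum()

  def attend(q, K, V):
      # softmax(q.K^T) mostly picks out the key closest to the query;
      # the result is the corresponding (softly mixed) value.
      weights = softmax(q @ K.T / np.sqrt(len(q)))
      return weights @ V

  K = np.array([[1.0, 0.0], [0.0, 1.0]])    # two stored keys
  V = np.array([[10.0, 0.0], [0.0, 20.0]])  # their associated values
  q = np.array([5.0, 0.0])                  # a query close to the first key
  print(attend(q, K, V))                    # ~[9.7, 0.6]: mostly the first value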

naveen99
2 replies
1d2h

Not sure there is any real lookup happening. Q,K are the same and sometimes even v is the same…

toxik
1 replies
1d

Q, K, V are not the same. In self-attention, they are all computed by separate linear transformation of the same input (ie the previous layer’s output). In cross-attention even this is not true, then K and V are computed by linear transformation of whatever is cross-attended, and Q is computed by linear transformation of the input as before.

ewild
0 replies
1d

Yeah, a common misconception: because the input is the same, people forget that there is a pre-attention linear transformation for Q, K, and V (using the decoder-only version; obviously V is different with encoder-decoder, BERT style).

thfuran
0 replies
1d2h

isn't the "memory palace" memory technique, an example of human brains embedding facts into a linear function for easier retrieval?

I'm not sure I see how that's a linear function.

samus
0 replies
18h29m

The memory palace is a hack that works because in an evolutionary sense our brain's purpose is to help us navigate our world and be effective in it. To do that, it has to be really good at remembering locations, to plot paths through and between them, and to translate that into speech or motion.

bionhoward
0 replies
1d2h

[Layer] Normalization constrains huge vectors representing tokens (input fragments) to positions on a unit ball (I think), and the attention mechanism operates by rotating the unconstrained ones based on the sum of their angles relative to all the others.

I only skimmed the paper but believe the point here is that there are relatively simple functions hiding in or recoverable from the bigger network which specifically address certain categories of relationships between concepts.

Since it would, in theory, be possible to optimize such functions more directly if they are possible to isolate, could this enable advances in the way such models are trained? Absolutely.

After all, one of the best criticisms of “modern” AI is the notion we’re just mixing around a soup of linear algebra. Allowing some sense of modularity (reductionism) could make them less of a black box and more of a component driven approach (in the lagging concept space and not just the leading layer space)

vsnf
6 replies
1d2h

Linear functions, equations with only two variables and no exponents, capture the straightforward, straight-line relationship between two variables

Is this definition considering the output to be included in the set of variables? What a strange way to phrase it. Under this definition, I wonder what an equation with one variable is. Is a single constant an equation?

olejorgenb
1 replies
1d2h

I would think `x = 4` is considered an equation, yes?

pessimizer
0 replies
1d1h

And linear at that: x = 0y + 4

pb060
0 replies
1d2h

Aren’t functions and equations two different things?

ksenzee
0 replies
1d2h

I think they're trying to say "equations in the form y = mx + b" without getting too technical.

hansvm
0 replies
1d2h

It's just a change in perspective. Consider a vertical line. To have an "output" variable you have to switch the ordinary `y=mx+b` formulation to `x=c`. The generalization `ax+by=c` accommodates any shifted line you can draw. Adding more variables increases the dimension of the space in consideration (`ax+by+cz=d` could potentially define a plane). Adding more equations potentially reduces the size of the space in consideration (e.g., if `x+y=1` then also knowing `2x+2y=2` wouldn't reduce the solution space, but `x-y=0` would, and would imply `x=y=1/2`, and further adding `x+2y=12` would imply a lack of solutions).

Mind you, the "two variable" statement in this news piece is a red-herring. The paper describes higher-dimension linear relationships, of the form `Mv=c` for some constant matrix `M`, some constant vector `c`, and some variable vector `v`.

On some level, the result isn't _that_ surprising. The paper only examines one layer (not the whole network), after the network has done a huge amount of embedding work. In that layer, they find that under half the time they're able to get over 60% of the way there with a linear approximation. Another interpretation is that the single layer does some linear work and shoves it through some nonlinear transformations, and more than half the time that nonlinearity does something very meaningful (and even in that under half the time where the linear approximation is "okay", the metrics are still bad).

I'm not super impressed, but I don't have time to fully parse the thing right now. It is a bit surprising; if memory serves, one of the authors on this paper had a much better result in terms of neural network fact editing in the last year or two. This looks like a solid research idea and solid work, it didn't pan out, and to get it published they heavily overstated the conclusions (and then the university press release obviously bragged as much as it could).

01HNNWZ0MV43FF
0 replies
1d2h

Yeah I guess they mean one independent variable and one dependent variable

It rarely matters because if you had 2 dependent variables, you can just express that as 2 equations, so you might as well assume there's exactly 1 dependent and then only discuss the number of independent variables.

estebarb
4 replies
1d2h

I find this similar to what relation vectors do in word2vec: you can add a vector of "X of" and often get the correct answer. It could be that the principle is still the same, and transformers "just" build a better mapping of entities into the embedding space?
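
For anyone who hasn't seen the word2vec trick, a toy illustration of relations as vector offsets (hand-made vectors here, purely to show the arithmetic; real embeddings learn this kind of structure from co-occurrence statistics):

  import numpy as np

  emb = {
      "france": np.array([1.0, 0.0, 0.0]),
      "paris":  np.array([1.0, 0.0, 1.0]),
      "italy":  np.array([0.0, 1.0, 0.0]),
      "rome":   np.array([0.0, 1.0, 1.0]),
  }

  def nearest(v, exclude=()):
      cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
      return max((w for w in emb if w not in exclude), key=lambda w: cos(emb[w], v))

  # "capital of" behaves like a constant offset: paris - france + italy ~ rome
  v = emb["paris"] - emb["france"] + emb["italy"]
  print(nearest(v, exclude={"paris", "france", "italy"}))  # rome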

PaulHoule
3 replies
1d1h

I think so. It’s hard for me to believe that the decision surfaces inside those models are really curved enough (like the folds of your brain) to really take advantage of FP32 numbers inside vectors: that is, I just don’t believe it is

  x = 0 means “fly”
  x = 0.01 means “drive”
  x = 0.02 means “purple”
but rather more like

  x < 1.5 means “cold”
  x > 1.5 means “hot”
which is one reason why quantization (often 1 bit) works. Also it is a reason why you can often get great results feeding text or images through a BERT or CLIP-type model and then applying classical ML models that frequently involve linear decision surfaces.

taneq
2 replies
1d1h

Are you conflating nonlinear embedding spaces with the physical curvature of the cerebellum? I don't think there's a direct mapping.

PaulHoule
1 replies
1d1h

My mental picture is that violently curved decision surfaces could look like the convolutions of the brain even though they have nothing to do with how the brain actually works.

I think of how tSNE and other algorithms sometimes produce projections that sometimes look like that (maybe that’s just what you get when you have to bend something complicated to fit into a 2-d space) and frequently show cusps that to me look like a sign of trouble (took me a while in my PhD work to realize how Poincaré sections from 4 or 6 dimensions can look messed up when a part of the energy surface tilts perpendicularly to the projection surface.)

I still find it hard to believe that dense vectors are the right way to deal with text, despite the fact that they work so well. For images it is one thing, because changing one pixel a little doesn’t change the meaning of an image, but changing a single character of a text can completely change its meaning. Also, there’s the reality that if you randomly stick tokens together you get something meaningless, so it seems almost all of the representation space covers ill-formed texts and only a low-dimensional manifold holds the well-formed texts. Now the decision surfaces really have to be nonlinear and crumpled overall, but I think there’s definitely a limit on how crumpled those surfaces can be.

Y_Y
0 replies
1d

This is interesting. It makes me think of an "immersion"[0], as in a generalization of the concept of "embedding" in differential geometry.

I share your uneasiness about mapping words to vectors and agree that it feels as if we're shoehorning some more complex space into a computationally convenient one.

[0] https://en.wikipedia.org/wiki/Immersion_(mathematics)

mikewarot
3 replies
22h41m

I wonder if this relation still holds with newer models that have even more compute thrown at them?

My intuition is that the structure inherent to language makes Word2Vec possible. Then training on terabytes of human text encoded with Word2Vec + Positional Encoding makes it possible to predict the next encoding at superhuman levels of cognition (while training!).

It's my sense that the bag of words (as input/output method) combined with limited context windows (to make Positional Encoding work) is a huge impedance mismatch to the internal cognitive structure.

Thus I think that given the orders of magnitude more compute thrown at GPT-4 et al, it's entirely possible new forms of representation evolved and remain to be discovered by humans probing through all the weights.

I also think that MemGPT could, eventually, become an AGI because of the unlimited long term memory. More likely, though, I think it would be like the protagonist in Memento[1].

[1] https://en.wikipedia.org/wiki/Memento_(film)

[edit - revise to address question]

autokad
2 replies
21h34m

Sorry if I misread your comment, but you seem to be indicating that LLMs such as ChatGPT (which uses GPT-3+) are bag-of-words models? They are sequence models.

mikewarot
1 replies
21h0m

I edited my response... I hope it helps... my understanding is that the output gives probabilities for all the words, then one is chosen with some randomness thrown in (via the temperature), then fed back in... which to me seems to equate to bag of words. Perhaps I misunderstood the term.

smaddox
0 replies
20h53m

Bag-of-words models use a context that is a "bag" (i.e. an unordered map from elements to their counts) of words/tokens. GPTs use a context that is a sequence (i.e. an ordered list) of words/tokens.
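
A two-line illustration of the difference (just to pin down the terminology):

  from collections import Counter

  tokens = ["the", "dog", "bit", "the", "man"]
  bag = Counter(tokens)    # bag of words: order is thrown away
  seq = list(tokens)       # what a GPT-style model sees: order preserved

  # Same bag, different sequence - which is why order matters to a GPT.
  print(bag == Counter(["the", "man", "bit", "the", "dog"]))  # True
  print(seq == ["the", "man", "bit", "the", "dog"])           # False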

leobg
3 replies
1d3h

In one experiment, they started with the prompt “Bill Bradley was a” and used the decoding functions for “plays sports” and “attended university” to see if the model knows that Sen. Bradley was a basketball player who attended Princeton.

Why not just change the prompt?

  Name, University attended, Sport played
  Bill Bradley,

numeri
2 replies
1d2h

This is research, trying to understand the fundamentals of how these models work. They weren't actually trying to find out where Bill Bradley went to university.

leobg
1 replies
22h46m

Of course. But weren’t they trying to find out whether or not that fact was represented in the model’s parameters?

wnoise
0 replies
21h44m

No, they were trying to figure out if they had isolated where facts like that were represented.

uoaei
1 replies
22h28m

This is the "random linear projections as memorization technique" perspective on Transformers. It's not a new idea per se, but nice to see it fleshed out.

If you dig into this perspective, it does temper any claims of "cognitive behavior" quite strongly, if only because Transformers have such a large capacity for these kinds of "memories".

tel
0 replies
17h48m

Do you have a reference on “random linear projections as memorization”? I know random projections quite well but haven’t seen that connection.

i5heu
1 replies
1d2h

So it is entirely possible to decouple the reasoning part from the information part?

This is like absolutely mind blowing if this is true.

learned
0 replies
1d1h

A big caveat mentioned in the article is that this experiment was done with a small set (N=47) of specific questions that they expected to have relatively simple relational answers:

The researchers developed a method to estimate these simple functions, and then computed functions for 47 different relations, such as “capital city of a country” and “lead singer of a band.” While there could be an infinite number of possible relations, the researchers chose to study this specific subset because they are representative of the kinds of facts that can be written in this way.

About 60% of these relations were retrieved using a linear function in the model. The remainder appeared to have nonlinear retrieval and is still a subject of investigation:

Functions retrieved the correct information more than 60 percent of the time, showing that some information in a transformer is encoded and retrieved in this way. “But not everything is linearly encoded. For some facts, even though the model knows them and will predict text that is consistent with these facts, we can’t find linear functions for them. This suggests that the model is doing something more intricate to store that information,” he says.
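
For a sense of what "computing a function for a relation" can look like, here is a rough sketch of the general idea: fit a linear map o ~ W s + b from subject representations to object representations and check how well it predicts. (The data below is synthetic; in the paper the representations are hidden states read from the LM, and the authors estimate the map from the model itself rather than fitting it like this.)

  import numpy as np

  d, n = 16, 50
  rng = np.random.default_rng(0)
  S = rng.normal(size=(n, d))                    # stand-ins for subject hidden states
  W_true = rng.normal(size=(d, d))
  b_true = rng.normal(size=d)
  O = S @ W_true.T + b_true + 0.01 * rng.normal(size=(n, d))   # a nearly linear relation

  # Least-squares fit of o ~ W s + b over the (subject, object) pairs.
  S1 = np.hstack([S, np.ones((n, 1))])
  coef, *_ = np.linalg.lstsq(S1, O, rcond=None)
  W_hat, b_hat = coef[:d].T, coef[d]

  # If the relation really is (close to) linear, the fit recovers it well.
  print(np.abs(S @ W_hat.T + b_hat - O).max())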

zyklonix
0 replies
23h47m

This reminds me of the famous "King - Man + Woman = Queen" embedding example. The fact that embeddings have semantic properties in them explains why simple linear functions would work as well.

wslh
0 replies
1d2h

Can we roughly say that LLMs produce (in training mode) a lot of IF-THENs in an automatic way from a vast quantity of information (nor techniques) that was not available before?

seydor
0 replies
21h38m

Does this point to a way to compress entire LLMs by selecting a set of relations?

robertclaus
0 replies
23h32m

I think this paper is cool and I love that they ran these experiments to validate these ideas. However, I'm having trouble reconciling the novelty of the ideas themselves. Isn't this result expected, given that LLMs naturally learn simple statistical trends between words? To me it's way cooler that they clearly demonstrated that not all LLM behavior can be explained this simply.

aia24Q1
0 replies
1d2h

I thought "fact" means truth.