
Better and Faster Large Language Models via Multi-Token Prediction

deskamess
27 replies
1d6h

Side track: There is so much going on in this space. I wish there were a chronological flow of a machine learning scenario/story with all the terms being introduced as we meet them (data, pre-training, training, inference, mixture of experts, RAG). Like someone walking me through a factory explaining what happens at each stage (like Mr Rogers used to do). Most of the time I do not know where the terms fit in the big picture. When I first came across pre-training I thought it was something done to the data before training happened, but it was actually just another round of training.

jstummbillig
10 replies
1d5h

People waste too much time building out stuff that is really bad in AI right now.

Of course, everything is, but instead of taking on the task of patching that up, the better approach would be to pretend there will be something that is a lot better than GPT-4 in the near future (because there will be) and design a differentiated product under that premise.

sunir
6 replies
1d4h

That’s an interesting idea. What do you mean?

I understand that prompts-as-a-service is short term… but what long-term product do you see?

redblacktree
3 replies
1d4h

I may be misunderstanding your meaning, but I'm not convinced that "prompts as a service" is short term. I think we'll see a number of apps pop up that will be essentially that, i.e. powered by a generative AI, but with a great UX. Not everyone is good at prompting, and although it is a skill many will develop, packaging up great prompts in niche problem areas still looks like an area of opportunity to me. I'm not talking necessarily about chat experiences, but apps that can, as an example, maintain task lists for you after consuming your incoming communications.

objektif
2 replies
1d

How many times did you have issues communicating with your spouse because of prompting issues? And how did you resolve it?

Why would you need an extra layer here?

redblacktree
0 replies
1h31m

I don't understand your comment. I was talking about apps built on LLMs where the prompts aren't given by the users, but the LLM is still an important part of the functionality.

malnourish
0 replies
21h38m

Why do PR firms and copy editors exist?

jstummbillig
1 replies
1d4h

I think a general way to answer this is by considering for any domain you know: What would you pay a human to do right now, that LLMs frustratingly can't, but should in theory, if only they were a bit better and more consistent?

This could mean: Instead of diving into langchain and trying to program your way out of a bad model, or trying to do weird prompts, just write a super clear set of instructions and wait for a model that is capable of understanding clear instructions, because that is an obvious goal of everyone working on models right now and they are going to solve this better than your custom workaround can.

This is not a rigid rule, just a matter of proportions. For example, you should probably be willing to try a few weird intermediary prompt hacks, if you want to get going with AI dev right now. But if most of what most people do will probably be solved by a somewhat better model, that's probably a cause for pause.

mercer
0 replies
23h33m

I suppose with an eye on open-source, an interesting 'rule' would be to set a cut-off point for models that can run locally, and/or are considered to be feasible locally soon.

JKCalhoun
2 replies
23h47m

Can I assume AI are continuing their training as they interact with people when deployed? Are ChatGPT, Claude, learning from my interactions with them? I do, BTW, correct them when they unknowingly (I assume) steer me wrong.

One wonders, if that's the case, how quickly an AI might improve if it has something close to Google's search site throughput. I mean fielding several billion queries a day, for a year — that would be some pretty stellar training right there I would think.

jstummbillig
0 replies
21h18m

> Can I assume AI are continuing their training as they interact with people when deployed?

Yes, you can. Some of the big providers are fairly clear on where in their products this happens, and all offer a way out (mostly when paying for API access).

> One wonders, if that's the case, how quickly an AI might improve if it has something close to Google's search site throughput

Indeed. Another possibility is that user input will turn out to be increasingly less important for upcoming state of the art models.

astrange
0 replies
19h58m

They don't train as they go. Training is incredibly expensive.

They do take your feedback and presumably do something with it. Your actual queries are only indirectly useful since they might have private info in them.

highwaylights
7 replies
1d6h

I feel your pain and excitement in equal measure.

It can be hard to know where to start with some of these concepts, especially since a lot of recent developments (e.g. RAG) are moving so rapidly that there's unlikely to be a current reference book you could turn to anytime soon.

That said, I do find that documentation is getting better depending on where you look. The documentation for higher level tools like LlamaIndex is a good starting point for understanding the concepts (not so much for explaining the concepts as for showing where they fit into the overall picture; you can then deep-dive elsewhere on the different parts).

YouTube has always been a mixed bag of very little solid information in a sea of non-experts trying to attract clicks for the latest trends, so it’s not a great starting point IMHO.

phkahler
3 replies
1d1h

> YouTube has always been a mixed bag

As an outsider but avid reader of this stuff linked from HN, I would recommend the channel 3blue1brown. He's got several NN and AI related videos, and the couple I've seen were pretty good.

https://www.youtube.com/watch?v=aircAruvnKk

renonce
2 replies
1d1h

Yeah, but the other side of the coin is that they only explain the very basic concepts that have already been settled for several years, not any of these "latest trends".

objektif
1 replies
1d

Which latest trends do you mean?

renonce
0 replies
10h18m

Anything that hasn't been settled for several years, like papers published in the last year or so: RingAttention, quantization/pruning, rotary embeddings, distillation, RLHF, L2 regularization, multimodal, MoE, etc.

objektif
0 replies
1d

LlamaIndex docs are absolutely terrible IMO. I have gone through them so many times but still do not understand the terms and organization. A router for querying a router query engine?

mercer
0 replies
23h42m

How would you rate Yannic Kilcher, if you're aware of him?

deskamess
0 replies
17h14m

The RAG explanation was pretty good. It had the right level of description and it explored the sub-topics and defined them.

berkes
2 replies
1d6h

> Most of the time I do not know where the terms fit in the big picture.

Nor do the majority of "AI" experts and consultants that I see on LinkedIn, Twitter or in podcasts.

The S/N ratio is very low in this field. Just pick some documentation from "industry leaders" like Langchain and see that not only is it already and always outdated, it sometimes simply contradicts itself.

In the "blockchain hype" this was similar, so I guess it's a trait of the hype train.

pixl97
0 replies
1d5h

> Nor do the majority of "AI" experts...

I mean yes, this is what a rapidly expanding field that's probing the boundaries of its problem space looks like. Kind of like following physics in the early to mid 1900s. Different classes of problems have barely been tested against each other, much less fully explored themselves.

In some ways it reminds me of the earlier days of the internet when progress was still very rapid.

highwaylights
0 replies
1d6h

Totally agree with the above, although I’m not sure that documentation on tools like Langchain is a reflection of the hype in the way social media is. I think in that case it’s just a reflection of the pace things are moving at.

saddabbas
1 replies
1d5h

I recommend checking out Machine Learning Q and AI by Sebastian Raschka

deskamess
0 replies
17h15m

Based on the TOC it seems pretty deep. At this point I am looking for something high level.

deskamess
0 replies
17h16m

As far as definitions go that was actually pretty good.

snorkel
0 replies
21h1m

Strongly recommend watching Andrej Karpathy's "Let's build GPT-2" videos on YouTube, which dive into an actual PyTorch implementation; then download the code and study it carefully. Then study "Spreadsheets Are All You Need" to see what the internal data structures look like.

throw310822
23 replies
1d9h

Apologies in advance for the super naive question; but assuming that we can create vectors that encode for the meaning of entire sentences, what prevents us from training LLMs to predict those vectors instead of single words?

faabian
9 replies
1d8h

Author here -- that's a very good point and, as I understand it, this is work in progress in different teams. Training autoencoders for language is actually super easy given the small amount of information contained in text (compared to vision/video); the hard part is making the model focus on the semantic part if all the signal we have comes from exact matches in token space. Hence Yann LeCun's ideas on joint embedding predictive architectures. Note also that there is always a trade-off with auxiliary tasks: they give more signal but shift the focus. In our case, we noticed degradation if the number of predicted tokens is too high. So latent prediction methods need to sort out what is useful.
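
For readers who want something concrete, here is a minimal sketch (not the paper's code) of a multi-token prediction objective: a shared trunk with n output heads, each trained against the token i positions ahead. Sizes and the toy GRU trunk are made up for illustration; only the +1 head is needed at inference time.

    # Hedged sketch of multi-token prediction: a shared trunk produces one hidden
    # state per position, and n independent output heads predict the tokens at
    # offsets +1 .. +n. All sizes and the toy trunk are illustrative.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    VOCAB, DIM, N_FUTURE = 1000, 64, 4

    class MultiTokenLM(nn.Module):
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(VOCAB, DIM)
            self.trunk = nn.GRU(DIM, DIM, batch_first=True)   # stand-in for a transformer trunk
            self.heads = nn.ModuleList(nn.Linear(DIM, VOCAB) for _ in range(N_FUTURE))

        def forward(self, tokens):                    # tokens: (batch, seq)
            h, _ = self.trunk(self.embed(tokens))     # h: (batch, seq, DIM)
            return [head(h) for head in self.heads]   # N_FUTURE logit tensors

    def multi_token_loss(model, tokens):
        logits = model(tokens)
        loss = 0.0
        for i, lg in enumerate(logits, start=1):      # head i predicts the token i steps ahead
            pred = lg[:, :-i, :]                      # positions that still have a target i steps ahead
            target = tokens[:, i:]
            loss = loss + F.cross_entropy(pred.reshape(-1, VOCAB), target.reshape(-1))
        return loss / len(logits)

    model = MultiTokenLM()
    batch = torch.randint(0, VOCAB, (2, 32))
    print(multi_token_loss(model, batch))             # scalar training loss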

mike_hearn
8 replies
1d8h

Aren't the models already doing this, in a way? We know they can do things like write rhyming poems and song lyrics that do make perfect sense, so at some point the activations must be encoding some sort of overall plan for the upcoming sentences, even if maybe every word isn't predicted yet.

mjburgess
3 replies
1d7h

> so at some point the activations must be encoding some sort of overall plan for the upcoming sentences

This isn't obviously the case. Compare this "intelligent designer" view with evolution: there was no prior plan for rabbits. To create the appearance of design, it's sufficient that sequential steps are simply probabilistically modulated by prior ones.

Consider a continuation of "the cat...": merely a distribution over all possible words suffices to create the illusion of a plan. Suppose "the cat sat..."; then "on...", "the...", etc. follow from the training data.

I think there's a strong argument against trying to model entire sentences exactly because the system isn't modelling semantics: one should expect accuracy to drop off a cliff if there is no actual plan. I.e., predicting "sat on the mat" from "cat" shouldn't be a valid prediction, because there is an infinite number of possible continuations, and any single whole-phrase prediction is terrible (e.g., what about "chased the mouse"?). The space of all possible sentences to continue from "the cat" is infinite, with much of that space actually useful; whereas the number of words is very small, very finite, and many of them not useful.

The only reason that "the cat sat..", "the cat sat on..." is reasonable is because each sequential word can be modulated by the prompt to seem as if planned.

edmara
2 replies
1d1h

The modelling is advanced enough that you can't fundamentally distinguish it from (lossy, limited) planning in the way you're describing.

If the KQV doesn't encode information about likely future token sequences then a transformer empirically couldn't outperform Markov text generators.

mjburgess
1 replies
1d

No one is spending $10-50mil building a Markov text model of everything ever digitised; if they did so, its performance would approach a basic LLM.

Though, more simply, you can just take any LLM and rephrase it as a Markov model. All algorithms which model conditional probability are equivalent; you can even unpack an NN as a kNN model or a decision tree.

They all model 'planning' in the same way: P(C|A, B) is a 'plan' for C following A, B. There is no model of P("A B C" | "A B"). Literally, at inference time, no computation whatsoever is performed to anticipate any future prediction -- this follows trivially from the mathematical formalism (which no one seems to want to understand), and you can also see it empirically: inference time is constant regardless of prompt/continuation.

The reason 'the cat sat...' is completed by 'on the mat' is that P(on|the cat sat...), P(the|the cat sat on...), and P(mat|the cat sat on the...) are each maximal.

Why it's maximal is not in the model at all, nor in the data. It's in the data generating process, i.e., us. It is we who arranged text by these frequencies, and we did so because the phrase is a popular one for academic demonstrations (and so on).

As ever, people attribute to "the data", or worse, to "the LLM", properties it does not have... rather it replays the data to us, and we suppose the LLM must have the property that generated this data originally. Nope.

Why did the tape recorder say, "the cat sat on the mat"? What, on the tape or in the recorder made "mat" the right word? Surely, the tape must have planned the word...

edmara
0 replies
8h3m

> Why it's maximal is not in the model at all, nor the data

> It replays the data to us and we suppose the LLM must have the property that generates this data originally.

So to clarify, what you're saying is that under the hood, an LLM is essentially just performing a search for similar strings in its training data and regurgitating the most commonly found one?

Because that is demonstrably not what's happening. If this were 2019 and we were talking about GPT-2 it would be more understandable but SoTA LLMs can in-context learn and translate entire languages which aren't in their dataset.

Also, re inference time: when you give transformers more compute for an individual token, they perform better: https://openreview.net/forum?id=ph04CRkPdC

flawsofar
1 replies
1d4h

In case you’re thinking that rhyming requires planning, that’s just as silly as a rabbit tanning.

You can make things up as you go, and the constraints emerge from the flow.

gbasin
0 replies
1d4h

great comment

faabian
1 replies
1d7h

Yes. Otherwise next-token models wouldn't be nearly as good as they are. But the question is how to train these capabilities most efficiently! We had some interesting findings on how, with increasing model/dataset scale/data quality, capabilities can move from "only learnable with multi-token prediction" to "indifferent" and "multi-token prediction actually hurts". This depends on the capability itself; induction, for example, matures way earlier in this sense than code generation capabilities.

mike_hearn
0 replies
1d6h

Is it possible that the anti-scaling effect occurs because you are removing some middle layers to free up space for the extra output heads? I only scanned the paper quickly, but what happens if you treat the technique as strictly additive and don't keep parameter sizes fixed?

marcyb5st
4 replies
1d8h

Not a stupid question in my opinion.

The problem is that once you have the vectors representing the answers, you need something like another model that goes back to a word representation of said answers. Something like a diffusion model but for text. Additionally, the function that this diffusion model would approximate won't be injective, but at best surjective and at worst not even a function (in the mathematical sense), since many textual representations are possible given an embedding, and most of those won't be valid (not grammatically valid, nonsense sentences, ...).

Finally, remember that the embeddings are a "lossy" representation of some datum, so the inverse function will lose a lot of the nuances/context/... .

LLMs avoid the problems above by predicting the next token (now the next n tokens) in a way that is self-consistent with the query and the previous n tokens, so the function they approximate should be mostly surjective.

throw310822
1 replies
1d8h

> The problem is that once you have the vectors representing the answers you need something like another model that goes back to a word representation of said answers. Something like a diffusion model but for text.

Could it be just a smaller LLM that takes as input both the semantic vector and the prompt, and is trained to predict the output tokens based on those? A model with high linguistic abilities and very little reasoning skill.

marcyb5st
0 replies
1d7h

I think what you suggest would be very similar to an encoder-decoder architecture, which has been abandoned in favor of decoder-only architectures (https://cameronrwolfe.substack.com/p/decoder-only-transforme...). So I am guessing that what you suggest has already been tried and didn't work out, but I'm not sure why (the problems I mentioned above or something else).

Sorry, that's where the limit of my knowledge is. I work on ML stuff, but mostly on "traditional" deep learning, so I am not up to speed with the genAI field (also, the sheer number of papers coming out makes it basically impossible to stay up to date if you're not in the field).

HanClinto
1 replies
1d4h

> Something like a diffusion model but for text.

This actually sounds amazingly useful.

soulofmischief
0 replies
1d1h

I've played around with the concept in the past but there are some issues yet to be solved to make them more practical than current generation LLMs.

People are working on it though: https://arxiv.org/pdf/2305.09515

yorwba
2 replies
1d8h

You still need to convert between words and sentence vectors somehow. You could try using a faster model for that, but I suspect that the output quality will suffer.

magicalhippo
1 replies
1d7h

LLMs somehow need to do this anyway, implicitly or not. Has anyone tried to do it more explicitly?

That is, to break out the idea that characters are formed into words, and words into sentences, and a sentence is a sequence of "concepts", for lack of a better description.

So have one NN which takes a sequence of tokens and predicts a moderately-dimensional "word vector", which is fed into another which predicts a high-dimensional "concept vector".

Then the "thinking layer" would map a sequence of "concept vectors" to "concept vectors", and then you'd have some layers which do the reverse of the input layers to output tokens which can be printed.

Thought being that by splitting it up like this you could swap out the decode and encode layers independently to translate, for example, and so on.

Just a shower thought.

flawsofar
0 replies
1d4h

A sentence is both a sequence and a tree of phrases. Phrases have a head and one or more valents; they’re relations.

If you wanted to create an embedding algorithm for phrases, you could and you could throw a transformer at it.

I don’t know how you get the output of higher levels to diffuse to phrase and word levels.

jerpint
1 replies
1d8h

My understanding is that tokenization is part of the bottleneck. When you break up a sentence to tokens, each token gets a vector representation. The dictionary of all tokens would be infinite if it was at the sentence level

faabian
0 replies
1d8h

Vectors can do what one-hot vectors cannot do -- no one said inputs need to be rows from a token_id -> vector embeddings map. Basically, we are doing this already by moving from one-hot vectors to n-tuples of one-hot vectors, increasing the effective vocabulary size from V to V^n.

wangii
0 replies
1d6h

The problem is then that the total amount of computation drops dramatically, which leads to much less "thinking" power. I think the idea originated from an understanding that when we write/speak, we have an overall idea. My current hypothesis is that it's probably an illusion.

You may want to search for "filler" papers to read.

everforward
0 replies
1d4h

Also a noob here: if we encoded, trained on and synthesized sentence vectors, wouldn't that move the AI's ability to create novel things up a level, from sentences to paragraphs?

I.e. we currently operate on words (roughly) so the AI can only use words it knows but can synthesize unique sentences from words. If the AI operates on sentences, wouldn’t it only be able to regurgitate sentences it has seen before? So it could synthesize novel paragraphs, but not sentences?

I’m not convinced that sentences are a useful abstraction for AI (in English, anyways). They’re barely useful to humans. Check out your average chat conversation, email, YouTube comment, etc. There’s a very good chance the sentences aren’t actually sentences, or that they haven’t even bothered to use punctuation.

I just don’t think sentences map to a semantic device. A sentence could be two words or half an English paper depending on the writer. It could traverse a half dozen ideas or a single one. Where a sentence ends generally is more about the writer than the semantics.

bjourne
0 replies
1d7h

That's hierarchical prediction. On one level you predict the style of the paragraph, on another the tone and form of the sentence, and on the third the next word. However, this form of prediction is quite difficult since predictions from the layers affect each other.

mg
21 replies
1d7h

Currently, LLMs start from scratch for each output token, right?

Lets say you ask an LLM

    What makes bananas yellow?
And it replies

    Bananas are yellow due to a pigment called bromelain.
I would think that the concepts of "pigment" and "bromelain" are already somehow activated in the neural net when it outputs "a". Because now it can't change its mind anymore and follow up with "an optical illusion that makes humans perceive every bent object as yellow". So it seems to have already planned ahead to talk about the pigment called bromelain.

Would it be possible to capitalize on the work that has already been done when the LLM outputs "a"? Could the state of the neural net be somehow preserved for the next answer?

HarHarVeryFunny
5 replies
1d4h

1. Bananas are yellow due to a biochemical process known as carotenogenesis, which involves the synthesis and accumulation of carotenoid pigments.

2. Bananas are yellow due to a specific carotenoid called beta-cryptoxanthin, which gives the fruit its characteristic yellow hue.

3. Bananas are yellow due to a gradual increase in the concentration of carotenoid pigments as the fruit ripens and chlorophyll levels decrease.

4. Bananas are yellow due to a series of enzymatic reactions that convert starch into sugars and break down the green chloroplasts, revealing the underlying yellow carotenoids.

5. Bananas are yellow due to a change in the pH levels within the fruit cells during ripening, which triggers the production of yellow carotenoid pigments.

6. Bananas are yellow due to a genetic trait inherited from their wild ancestors, which enabled the development of carotenoid pigments as a way to attract seed dispersers.

7. Bananas are yellow due to a complex interplay between various plant hormones, such as ethylene and abscisic acid, which regulate the ripening process and pigment formation.

8. Bananas are yellow due to a metabolic shift from chlorophyll synthesis to carotenoid synthesis as the fruit reaches maturity.

9. Bananas are yellow due to a natural defense mechanism that involves the production of carotenoid pigments, which protect the fruit from oxidative stress during ripening.

10. Bananas are yellow due to a evolutionary adaptation that helps the fruit stand out against the green foliage, making it more visible to potential seed dispersers.

The output of an LLM is usually randomly sampled from the top few highest probability next token/word predictions, but the model itself has no idea which word the sampler will pick. It presumably has some conceptual plan of what could follow "a", or any of its other suggestions, but any such plan (high level prediction) is then rethought from scratch once "a" is generated.

The model not only can, but has to, change its mind after each word generated, so this "planning ahead" is very ephemeral - more like a freestyle rapper making it up on the fly than someone thinking deeply about how best to reply and how to express it.
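
For the curious, the sampling step being described is roughly the following (a sketch with top-k plus temperature and made-up numbers; real inference stacks differ in the details):

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_next(logits, k=5, temperature=0.8):
        # keep only the k most likely tokens, then sample among them
        logits = np.asarray(logits, dtype=float) / temperature
        top = np.argsort(logits)[-k:]
        probs = np.exp(logits[top] - logits[top].max())
        probs /= probs.sum()
        return int(rng.choice(top, p=probs))          # the model never knows which one gets picked

    # toy next-token logits over a 10-token vocabulary
    logits = [2.0, 1.5, 0.3, -1.0, 0.1, -2.0, 1.2, 0.0, -0.5, 0.8]
    print([sample_next(logits) for _ in range(5)])    # different tokens on different calls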

scoot
2 replies
1d4h

> 10. Bananas are yellow due to a evolutionary adaptation

Did an LLM really make this basic grammatical error?

HarHarVeryFunny
1 replies
1d3h

Yes it did (free Claude Sonnet), but presumably only because it was trained on examples of us making the same mistake!

scoot
0 replies
1d3h

> presumably only because it was trained on examples of us making the same mistake

That's what makes it surprising.

Workaccount2
1 replies
1d3h

There has to be more going on somewhere in the system. You can ask GPT4 to describe something (Vermont) and ask it to end the description with a chosen word (house). It will then usually be able to write out a descriptive paragraph that ultimately lands on your chosen word pretty seamlessly - "...embodying a serene and rural charm that culminates in the warm, welcoming feel of a cozy house."

Models do struggle with these tests, for sure, but from an analytical standpoint, a "next token predictor" should not be able to ever correctly land on the right token 100 tokens in the future.

Edit: Thinking about it, I suppose it is possible that the model can encode a "destination" in the first token. Like a pool shot that is artfully bounced off many bumpers to hit a ball, perhaps the LLM can encode a "path" to a destination token in the first token generated. Which might be even crazier as it suggests that the model is playing a meta-game with being able to precisely manipulate the individual layers of output, even though those layers are disparate from token to token.

HarHarVeryFunny
0 replies
1d2h

The "next token predictor" description is a bit too literal, and anyways incomplete. A transformer has a lot of layers (e.g. 96 for GPT-2) and it's only at the input that it has pure token embeddings (ignoring positional encoding). At the output it's a built-up embedding that's decodable into a token.

One way to think of what all the intermediate layers are doing is to consider them as levels of a linguistic parse tree with the leaves (words) at the bottom and trunk ("sentence") at the top, except in the transformer the evolving embeddings at each level contain semantic as well as syntactic information. This largely hierarchical view of language was the motivation for the transformer design.

It seems we should really think of each layer of the transformer as an independent predictor, with increasingly abstract and more semantically complete information available as we ascend the transformer layers towards the output. Predict next token is only what the transformer is being trained to do at the output layer. At the inner layers (i.e. the bulk of what the transformer is doing), it will be predicting at these higher levels of representation held at those layers.

These models do struggle (although getting better) at ending on a given word rather than starting on it, and understandably so since random sampling and continual resetting after each output token (= new input sequence) means that planning ahead at level of word specificity is simply not an option. They have to continually adapt to next sampled token, and take it from there.

I'm guessing that ability to end on a chosen word is due to continued salience of that word during generation, prediction of sentence fragments using/ending with that word, and opportunistic stopping when it has been emitted and the sentence is complete. Kind of the same way you might do it yourself if you just started talking immediately without planning, while trying to end on a given word.

faabian
4 replies
1d7h

To some degree, attention is already a mechanism to make computations from previous tokens useful later. (You can think of the KV cache as a representation of the text so far and all the model's thoughts on it.) And since language models are trained on sequences end-to-end, I think this is likely to happen. Multi-token prediction encourages this behavior explicitly, but only for the small n-token window you define.

That said, there are many works attempting to increase the compute utilization of transformer language models (early exit, mixture of depths) and novel architectures (SSMs etc.).
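
For readers wondering what the KV cache looks like mechanically, here is a rough single-head sketch with made-up weights: keys and values computed for earlier tokens are kept and attended to, so each step only computes the newest token's projections.

    import numpy as np

    DIM = 8
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((DIM, DIM)) for _ in range(3))

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    class KVCache:
        def __init__(self):
            self.keys, self.values = [], []

        def step(self, x):                  # x: embedding of the newest token, shape (DIM,)
            q, k, v = x @ Wq, x @ Wk, x @ Wv
            self.keys.append(k)             # computations for earlier tokens are kept around...
            self.values.append(v)
            K, V = np.stack(self.keys), np.stack(self.values)
            attn = softmax(q @ K.T / np.sqrt(DIM))
            return attn @ V                 # ...and attended to, rather than recomputed

    cache = KVCache()
    for token_embedding in rng.standard_normal((5, DIM)):   # pretend these are 5 token embeddings
        out = cache.step(token_embedding)
    print(out.shape)                        # (8,) -- attention output for the latest token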

jacobsimon
3 replies
1d6h

Thanks for highlighting the KV cache, I’ve been wondering the same thing and hadn’t come across that or didn’t remember.

edmara
2 replies
1d2h

Transformers are still stateless; the KV cache is just a compute-saving measure (but otherwise correctly described).

jacobsimon
1 replies
1d

Oh huh. Why not make it stateful, like re-use and compute just the “diff” when you add a new token? Assuming it’s not that easy because each token can affect attention globally.

I think I’ve read something about this but I wonder if you could abstract attention to sentence/page levels and then only recalculate the parts that are relevant.

edmara
0 replies
7h37m

Because attention is all you need.

I.e., the KV cache is 'just' a time-saving measure because an LLM goes back and calculates those values anyway. (Which is why per-token compute grows quadratically with context length otherwise.)

You're not wrong that you could make an LLM more stateful. There are plenty of ideas for that but it would

a) be far more compute intensive to train and run (especially train)

b) be susceptible to all of the issues that RNNs have.

c) most importantly, it would almost certainly just converge at scale with transformers. Labs run small-scale, internal tests of architectures all the time and most of them basically come to this conclusion and abandon it.

berkes
4 replies
1d6h

I always presumed that "a pigment" is one token.

I'm a total amateur in this field though.

jasonjmcghee
1 replies
1d4h

What constitutes a token is super unintuitive.

My gut said "pig" and "ment" on this one, which happens to be right, but my gut would also say "para" and "graph"; but no, "paragraph" is a single token, which falls way outside the "normal" length I see of 3-4 characters.

In either case, I do consistently see spaces between characters included as part of the token following the space.

" paragraph" (10 characters) is the longest token I've seen- and now I wonder what the longest token is

jasonjmcghee
0 replies
1d4h

New winner " communication" at 14

sanxiyn
0 replies
1d5h

No, it isn't, at least on OpenAI tokenizer.

ralusek
0 replies
1d5h

Tokens are not multiple words, but are actually usually parts of words. The rule of thumb OpenAI uses is that there will be 100 tokens for every 75 words.

If you want to just see the tokens for yourself, though, just enter some text here:

https://platform.openai.com/tokenizer
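
If you'd rather inspect the splits programmatically, OpenAI's tiktoken package (assuming you have it installed) shows the same thing as the web tool:

    # pip install tiktoken
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")    # tokenizer family used by recent OpenAI models
    for text in ["a pigment", " paragraph", " communication"]:
        ids = enc.encode(text)
        pieces = [enc.decode_single_token_bytes(i) for i in ids]
        print(repr(text), "->", ids, pieces)      # "a pigment" comes out as more than one token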

wewtyflakes
0 replies
1d2h

I wonder if this means that these types of models would perform better (or worse?) for languages that do not have this sort of forward-looking grammar.

pbh101
0 replies
1d7h

The alternative theory is that any word starting with a vowel sound is exceedingly uncommon after ’a’ in its training set, so it doesn’t need to plan ahead, just predict the distribution of the most likely next words and choose.

Which is my understanding of how they work and the dynamic at play.

nicklecompte
0 replies
1d5h

Maybe look at it another way: ask GPT to complete the following

  Bananas are yellow due to a

  Bananas are yellow due to an
In the first case it might respond

  Bananas are yellow due to a pigment called bromelain.
In the second case it might respond

  Bananas are yellow due to an organic compound called bromelain, which is a yellow pigment.
So in either case GPT could have picked "a" or "an" without any impact on the semantic meaning of its response. In the extreme case, you could see the LLM operating according to a dumb heuristic:

  The token following "due to" is "a" with 55% probability, "an" with 45% probability.
In reality it is of course more sophisticated than this. But this dumb heuristic would explain the behavior.

And if you didn't actually include any facts about bromelain in the pretraining data, LLMs absolutely could autocomplete this with something about "an optical illusion." GPT-3 made factual mistakes like that pretty routinely, but I recall it figured out the grammatical rules of "a" and "an."

I don't think the concept actually needs to be pre-activated as you said, though I agree with faabian that this "preactivation" probably does happen in some implicit/emergent sense.

avianlyric
0 replies
1d7h

The output of most LLMs is stochastic. The core LLM is given tokens, and outputs a set of ranked tokens, with a "confidence", to go next. Then there's normally a filtering and search stage, where those ranked tokens are fed back into the LLM to get more ranked tokens and used to form a short probability tree. I.e., if we pick the top N-ranked tokens and put them back in, each of those tokens results in a new set of N-ranked tokens.

By looking at that tree some basic filtering is done, such as picking the branch that has the highest summed confidence, or the branch that has the fewest repeated tokens, or the fewest tokens that match the input tokens, or more often some combination of the above, plus a random choice weighted by summed confidences.

That's how you can give an LLM with completely fixed weights, which is all LLMs, the same input multiple times but get different outputs.

So to answer your specific question, it can “change its mind”. Every token produced creates a new opportunity for the stochastic output filters to pick a new path through all the possible outputs.

Xcelerate
19 replies
1d6h

Do LLMs not consider the probability distribution over all combinations of tokens up to a certain output length with regard to sequence prediction? I assumed they did that already.

If they don’t, I’m amazed they work as well as they do. Consider 2-bit sequence prediction with the following possible outcomes and associated probabilities:

    00: p=0.36
    01: p=0.04
    10: p=0.30
    11: p=0.30
So the most likely 2-bit sequence is 00. But on the basis of predicting the next token (bit) alone, we have:

     0: p=0.40
     1: p=0.60
which suggests that 1 is the next bit and leads to a suboptimal starting point for predicting the bit after that. The error is even more prominent with longer sequences as the joint probability distribution becomes more unfactorizable into marginal distributions (as I would expect any minimal algorithmic description of real-world data to be).
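
Worked out in code with the numbers above, the greedy next-bit choice and the whole-sequence argmax really do disagree:

    joint = {"00": 0.36, "01": 0.04, "10": 0.30, "11": 0.30}

    best_sequence = max(joint, key=joint.get)     # argmax over whole 2-bit sequences
    p_first_0 = joint["00"] + joint["01"]         # marginal distribution of the first bit
    p_first_1 = joint["10"] + joint["11"]
    greedy_first = "0" if p_first_0 > p_first_1 else "1"

    print(best_sequence)   # 00
    print(greedy_first)    # 1 -- the greedy next-token choice starts down the wrong branch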

Edit: now that I think about this a bit more, a cool research project that would be really simple to carry out might be to modify the cross-entropy loss function to consider only the nth future token in the text training data, and then plot LLM performance vs n, assuming that for all current LLM models we just have n=1.

My hypothesis is that you can mostly bypass all of the resource blow-up involved in predicting the joint probability distribution over the next 1 through n tokens (which scales as x^n) by just predicting the nth token directly, since doing so would implicitly require a better data model (at least for human-generated text; this wouldn’t be the case for all types of data).

elcomet
7 replies
1d4h

I think you're not looking at this from the right perspective. An LLM is designed to sample text that follows the training distribution. It is not designed to tell you the "most likely" text that follows, and we don't actually want that. This would mean you have no diversity in your outputs.

In your example, sampling a 0 in 40% of cases and a 1 in 60% of cases does make sense for chat applications.

For applications where we do care about the most likely sentence (e.g. question answering), then beam search helps, as others have mentioned.

Another thing to consider is that the model can "look ahead" and precompute what the future tokens might be. And it can then use this to predict the current token. In fact, some work has been investigating this, such as [1].

And a final note, predicting one token at a time is what we are doing as humans when we speak, so clearly it is not a wrong approach. We are doing this "look ahead" in our mind before speaking.

[1] https://arxiv.org/abs/2404.00859

tmoertel
4 replies
1d3h

> And a final note, predicting one token at a time is what we are doing as humans when we speak...

I wouldn't be surprised if we could predict token groups. When speaking off the cuff, people often rely on well-worn phrases and cliches.

abakker
2 replies
1d2h

Indeed. An interesting reference to this is the work Milman Parry did to describe the key phrases in the Odyssey and the queues they gave to help someone memorize the poem.

Also, this is maybe a semantic point, but I am not predicting any words I speak. Not in a statistical sense. I have intent behind my words, which means I have an abstraction of meaning that I want to convey, and I assemble the correct words to do that. No part of that is "predictive".

thwarted
1 replies
1d2h

> describe the key phrases in the Odyssey and the queues they gave to help someone memorize the poem.

Queues of words give cues to help memorize.

abakker
0 replies
1d2h

lol. Hurray for speech to text, I guess.

ivalm
0 replies
1d2h

I think that corresponds to well worn phrases being a “single token”

wantsanagent
0 replies
1d2h

"It is not designed to tell you the "most likely" text that follows,"

It is exactly designed to do that. At a temperature of 0, this is what you are approximating. The crucial point, though, is that it is the most likely next word given the preceding multi-token context, not just the previous token.

Xcelerate
0 replies
1d

> It is not designed to tell you the "most likely" text that follows, and we don't actually want that. This would mean you have no diversity in your outputs.

No, we specifically do want "most likely" to follow; the goal is to approximate Solomonoff induction as well as possible. See this recent paper by Hutter's team: https://arxiv.org/pdf/2401.14953

Quote from the paper:

"LLMs pretrained on long-range coherent documents can learn new tasks from a few examples by inferring a shared latent concept. They can do so because in-context learning does implicit Bayesian inference (in line with our CTW experiments) and builds world representations and algorithms (necessary to perform SI [Solomonoff Induction]). In fact, one could argue that the impressive in-context generalization capabilities of LLMs is a sign of a rough approximation of Solomonoff induction."

> In your example, sampling a 0 in 40% of cases and a 1 in 60% of cases does[n't] make sense for chat applications.

I didn't say anything about sampling. A sequence prediction model represents a mapping between an input sequence and a probability distribution over all possible output sequences up to a certain length.

My example uses a binary alphabet, but LLMs use an alphabet of tokens. Any chat application that expresses its output as a string of concatenated symbols from a given alphabet has a probability distribution defined over all possible output sequences. I'm simply comparing the fundamental limitations of any approach to inference that restricts its outcome space to sequences consisting of one symbol (and then layers on a meta-model to generate longer sequences by repeatedly calling the core inference capability) vs an approach that performs inference over an outcome space consisting of sequences longer than one symbol.

puttycat
2 replies
1d5h

You're mixing training loss (cross-entropy/surprisal of next token) and post-training prediction decoding (done e.g. with beam search)

Xcelerate
1 replies
1d5h

Training loss considers only the next single token, right? (I’m not up-to-date on the SOTA.)

I thought post-training prediction still only directly predicts the next token and beam search is sort of a meta-model applied over that (i.e., it is a model on top of the output of the model that performs next-token prediction—beam search considers at each iteration a subset of the current next-token predictions ranked by their probability to use as multiple starting points for predicting the next token, while keeping track of the joint probabilities to prune the set of candidate sequences at each step).

Seems like beam search would fail drastically in cases where the true (unknown) probability distribution over all sequences of tokens of length n has very low conditional probabilities for the first few tokens, each given the computed joint probability of the prior predicted tokens. That is, the true values of p(t2|t1), p(t3|t2,t1), p(t4|t3,t2,t1), ... as derived from the unknown p(t1,t2,...,tn) are very small, but very high when computed via a next-token prediction model.

I’m suggesting to modify both. Use cross-entropy of the nth token for training loss. Use cross-entropy of nth token for post-training prediction and then work backward from there to the beginning of your sequence prediction.
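
For concreteness, here is a minimal sketch of that meta-model view: beam search over the toy 2-bit distribution from upthread, with conditionals derived from the joint standing in for a next-token model.

    import math

    joint = {"00": 0.36, "01": 0.04, "10": 0.30, "11": 0.30}

    def next_token_probs(prefix):
        # conditional distribution of the next bit given the prefix, derived from the joint
        matches = {s: p for s, p in joint.items() if s.startswith(prefix)}
        total = sum(matches.values())
        probs = {}
        for s, p in matches.items():
            bit = s[len(prefix)]
            probs[bit] = probs.get(bit, 0.0) + p / total
        return probs

    def beam_search(length=2, beam_width=2):
        beams = [("", 0.0)]                       # (prefix, log probability)
        for _ in range(length):
            candidates = []
            for prefix, logp in beams:
                for bit, p in next_token_probs(prefix).items():
                    candidates.append((prefix + bit, logp + math.log(p)))
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        return beams

    print(beam_search())   # top beam is "00", the true most likely sequence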

namibj
0 replies
1d4h

The problem is that a position's probability output is conditioned via attention on all previous positions.

If you want to be better you need to switch to DDPMs for example (e.g. an encoder-only transformer to predict diffusion transition probabilities in parallel, then apply steps of denoising).

The problem is just that these don't work so well with autoregressive decoder transformers, and encoder-decoder architectures like Google's T5 have fallen out of favor since about when LLaMA dropped.

hiddencost
1 replies
1d4h

It's called the Markov assumption. It was basically the single most important piece of mathematics in the field for decades. It allowed us to solve otherwise intractable problems given the limited compute budgets of the time.

Xcelerate
0 replies
1d4h

Sure, and it's probably the wrong assumption to make in this case if our eventual goal is to capture general reasoning ability via LLMs.

sebzim4500
0 replies
1d5h

This is how they work and it's a real problem when doing prediction with low temperatures.

IIRC you see weird patterns in LLM outputs since "an" is often less likely than "a" so you end up with fewer nouns beginning with vowels than you would expect.

faabian
0 replies
1d5h

Language models factor the joint probability p(y, x) as p(y, x) = p(y|x) p(x) which is exact. I.e. if you train a language model on your distribution and sample with temperature 1, you will get the exact same distribution out. If you sample at lower temperature or even greedily, evidently, you will get other distributions.
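
A quick numerical check of this, reusing the toy 2-bit distribution from upthread: sampling bit by bit from the exact conditionals (temperature 1) reproduces the joint.

    import random
    from collections import Counter

    random.seed(0)
    joint = {"00": 0.36, "01": 0.04, "10": 0.30, "11": 0.30}

    def sample_sequence():
        seq = ""
        for _ in range(2):
            # exact conditional distribution of the next bit given the prefix
            matches = {s: p for s, p in joint.items() if s.startswith(seq)}
            total = sum(matches.values())
            bits = {}
            for s, p in matches.items():
                bit = s[len(seq)]
                bits[bit] = bits.get(bit, 0.0) + p / total
            seq += random.choices(list(bits), weights=list(bits.values()))[0]
        return seq

    counts = Counter(sample_sequence() for _ in range(100_000))
    print({s: round(c / 100_000, 3) for s, c in sorted(counts.items())})
    # close to {"00": 0.36, "01": 0.04, "10": 0.30, "11": 0.30}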

cgearhart
0 replies
1d3h

What you’ve described is basically the problem with greedy sampling in the decoder. Many other local optimization sampling strategies exist (e.g., beam search) and there’s been a lot of work on more global sampling (e.g., speculative decoding).

HanClinto
0 replies
1d4h

This is a fascinating point.

If I'm reading you right, you're saying that a simple way to do this would be to calculate logits for not just the next token, but also n+1 -- all at the same time. If one of the n+1 logits is chosen, then do an infill on the skipped token for the next step, then resume.

This could get us around the example that you gave for only a linear increase in the vocabulary size -- so looking an extra token ahead only increases vocab size by a factor of 2, and looking at a third token is a total factor of 3.

This seems really promising!

BoiledCabbage
0 replies
1d3h

> 0: p=0.40
> 1: p=0.60
> which suggests that 1 is the next bit and leads to a suboptimal starting point for predicting the bit after that. The error is even more prominent with longer sequences as the joint probability distribution becomes more unfactorizable into marginal distributions (as I would expect any minimal algorithmic description of real-world data to be).

Can someone explain this part a bit more? I'm not seeing the issue. From what I see, if the first token (t1) output is a zero, then the next token (t2) would have probabilities 0:p=.90 and 1:p=.10. (And t2 0/1:p= .50/.50 if t1=1)

Mathematically, those line up with the initial distribution, so what's the concern? That's how conditional probability works.
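
For anyone checking the arithmetic, those conditionals come straight from the joint table upthread:

    P(t2=0 | t1=0) = 0.36 / (0.36 + 0.04) = 0.90
    P(t2=1 | t1=0) = 0.04 / (0.36 + 0.04) = 0.10
    P(t2=0 | t1=1) = 0.30 / (0.30 + 0.30) = 0.50
    P(t2=1 | t1=1) = 0.30 / (0.30 + 0.30) = 0.50

Multiplying back (e.g. 0.40 * 0.90 = 0.36) recovers the joint exactly.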

albertzeyer
6 replies
1d8h

For those who know speculative decoding: This is basically self-speculative decoding. It still auto-regressively feeds the predicted label sequence through the network again, and only keeps the prediction up to the point where it matches. So it will not get worse in performance but only faster (here up to 3 times, which is normal for speculative decoding).

Due to the multi-task training, it will however also get better. (This idea is already quite old, to predict multiple targets into the future as an auxiliary loss.)

Nice work.
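
For readers unfamiliar with the mechanics, here is a rough sketch of the accept-until-mismatch loop, using greedy decoding for simplicity; draft and verify are hypothetical stand-ins for the extra heads and the full next-token model.

    def speculative_decode(prompt, draft, verify, n_steps=4, n_draft=4):
        tokens = list(prompt)
        for _ in range(n_steps):
            guesses = draft(tokens, n_draft)           # cheap guesses for the next n_draft tokens
            accepted = []
            for g in guesses:
                expected = verify(tokens + accepted)   # what the full model would have produced
                if g != expected:
                    accepted.append(expected)          # fix the first mismatch...
                    break                              # ...and throw the rest of the draft away
                accepted.append(g)
            tokens += accepted                         # identical output to plain greedy decoding
        return tokens

    # Toy stand-ins: the "model" just counts upward; the draft is wrong at offset 2.
    verify = lambda toks: toks[-1] + 1
    draft = lambda toks, n: [toks[-1] + i + 1 if i != 2 else 0 for i in range(n)]
    print(speculative_decode([0], draft, verify))      # [0, 1, 2, ..., 12]

A real implementation scores all draft positions in a single forward pass of the big model, which is where the speedup comes from; the sketch calls verify per position only to keep it short.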

techbruv
2 replies
1d4h

> So it will not get worse in performance but only faster

A bit confused by this statement. Speculative decoding does not decrease the performance of the model in terms of "accuracy" or "quality" of output. Mathematically, the altered distribution being sampled from is identical to the original distribution if you had just used regular autoregressive decoding. The only reason you get variability between autoregressive vs speculative is simply due to randomness.

Unless you meant performance as in "speed", in which case it's possible that speculative decoding could degrade speed (but on most inputs, and with a good selection of the draft model, this shouldn't be the case).

jasonjmcghee
0 replies
1d4h

I think parent is saying the same thing as you. Pointing out to folks unfamiliar, speculative decoding doesn't trade quality for speed.

albertzeyer
0 replies
1d4h

Yes that's what I mean, speculative decoding does not decrease the performance in terms of quality. I guess my wording was confusing on this.

imtringued
2 replies
1d5h

The problem with speculative decoding is that there are hardly any models that support it and adding support takes extra GPU time. If speculative decoding also improves planning performance, then it will be more readily adopted.

albertzeyer
1 replies
1d3h

What do you mean? Speculative decoding can be done with any auto-regressive model. Normally you use another much faster model to predict the next N subwords, and then you use the big model to verify whether it gets the same output, or maybe just reranked. Evaluating N subwords in one go is much faster compared to doing it subword by subword. That's why this is faster. Not all N words might match, so then you might need to redo the prediction for M < N subwords, but there are many simple cases where a faster and weaker model is still accurate enough. In the very extreme case, where N-1 subwords are always wrongly predicted, it would be slightly slower, but usually you get quite a big speedup, e.g. 3x faster or so.

The nice thing here is that you actually don't need another smaller model but the model itself already predicts the next N subwords.

Or maybe you mean it's not implemented in some of the common software? I'm not sure about that, but I thought it's a quite popular feature now.

Havoc
4 replies
1d9h

How does that still end up making grammatical sense?

If token/word +1 and +2 are predicted independently, then surely it often won't?

wongarsu
2 replies
1d8h

They just throw the predictions for +1 and +2 away, and only generate them for more efficient training.

The abstract doesn't make that clear, but from the description of figure 1: "During inference, we employ only the next-token output head. Optionally, the other three heads may be used to speed-up inference time"

Maybe you can use all three heads if you take the top prediction from all of them, but that prevents you from doing any of the common sampling strategies. I'm not sure how many people actually run an LLM with temperature 0 outside of benchmarks, unless they do something even better than applying a temperature

faabian
0 replies
1d8h

Exactly, but there is also a rejection sampling based method for speculative sampling: https://arxiv.org/abs/2302.01318

Havoc
0 replies
1d7h

Thanks for explaining

vletal
0 replies
1d9h

The n+1-th token is discarded if it is unlikely given the n-th token.

nicklecompte
2 replies
1d6h

I haven't read the paper in full detail yet, but I do have a minor editorial comment: while the appendix L.2 was satisfactory, I thought the condensed argument in 5.2 was a bit too sloppy. In particular,

  H(X) + H(Y) = H(X | Y) + 2I(X ; Y) + H(Y | X)

  By discarding H(Y | X) - which appears again when predicting at the following position - we observe that 2-token prediction increases the importance of I(X ; Y) by a factor of 2.
The argument about "discarding" was not clear to me - if you're predicting the third token Z, then shouldn't H(Y | X) be contained in the implicit context C, and therefore can't be freely discarded? I don't think this argument was clarified in the appendix. But this is mostly about presentation, I wasn't so confused as to doubt the gist of the argument.

faabian
1 replies
1d6h

Thanks for the feedback! Let me try to state it better:

In the end, we only use the next-token head for generating. So which parts of the 2-token target H(X) + H(Y) are "auxiliary" in the sense that they help learning, and which are "wasted"? H(X | Y) and I(X; Y) are useful for next-token generation while, by definition, H(Y | X) is the information quantity not related to the next token X. So we could say: "multi-token prediction trades the useful information I(X; Y) from H(Y) for the wasted computation on H(Y | X)". However, note that H(Y | X) is a next-token entropy for predicting Y from the prefix (C, X). If the attention mechanism allows the model to transfer computations already made for predicting Y|X to the next step, these computations may actually not have been wasted -- they were just pre-computation.
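
For reference, the identity being discussed is just the sum of the two standard decompositions of H(X) and H(Y):

    H(X) = H(X | Y) + I(X ; Y)
    H(Y) = H(Y | X) + I(X ; Y)
    ----------------------------------------------
    H(X) + H(Y) = H(X | Y) + 2 I(X ; Y) + H(Y | X)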

stealthcat
0 replies
1d5h

Did you have some small toy experiments to prove this?

bradley13
2 replies
1d8h

I read an article that pointed out that LLMs literally have a one-dimensional window onto the world. Everything is just a sequence of tokens.

Maybe this sort of multi-token prediction takes their view into 1.1 dimensions? In any case, there is a real argument for expanding that window, somehow, into two or more dimensions.

mike_hearn
1 replies
1d8h

Well, it feels like architecturally there's a lot of scope to do better for coding tasks specifically. Like, if you had FAIR level resources and wanted to train a really great Java coding model for example it would make sense to train the model to predict ASTs rather than tokens. You'd still need some kind of joint normal LLM for predicting comments, identifier names and so on, but you wouldn't model the program itself as a stream of tokens. Instead it would predict things like "add an if block", "add a method call block with 4 parameters" and so on.

You could also train the model to expect certain context window positions to be reserved for things like "type members at the current cursor" and then integrate the inferencing loop with IDE/LSP-style static analysis. This would allow the model to see more information than is actually contained in the text.

I think the reason we're not seeing models like this right now is the cost of doing such research, combined with the fact that AI people are all Python-heads, and Python doesn't benefit much from IDEs.

bradley13
0 replies
1d7h

That sounds right. My vague idea of a "second dimension" could well be some sort of structure - be it ASTs for programming languages or for natural language.

Another possibility would be some sort of fixed knowledge base, which could be program language documentation or "common sense" like CYC wants to provide.

ralusek
1 replies
1d5h

Given that LLMs appear to, in large part, "think" by virtue of feeding their own output back into themselves, people have consistently noticed that insisting that the model "think out loud" results in higher quality reasoning. I.e., "chain of thought" reasoning contrasts simply having the model answer a question directly with first having it write out things like:

- restating what it thinks is being asked of it

- expressing a high level strategy over what sort of information it might need in order to answer that question

- stating the information it knows

- describing how that information might inform its initial reasoning

etc...

I'd be concerned that going about this by having the model predict the next multiple tokens at any given time would essentially have the opposite effect.

Chain of thought prompting appears to indicate that a model is "smarter" when it has n + m tokens than when it just has n tokens as input. As such, getting the next 5 tokens for a given n might net worse results than getting the next 1 token at n, then the next 1 token at n + 1, and so on.

imtringued
0 replies
1d5h

If the LLM had an affordable model it would always generate enough tokens for the task at hand. The fact that this particular method would require more tokens would be irrelevant. If you don't have an affordable model, then you would always be at the mercy of the LLM being biased towards answering with an estimate instead of the actual answer.

Also, most speculative decoding strategies produce identical output compared to running the model sequentially. If the prediction is wrong, the token gets discarded and the speedup is lost.

bravura
1 replies
1d4h

I wonder if, instead of just predicting the next n tokens, it could also predict like 128, 512, 2048 etc tokens ahead. Thus learning long-term discourse structure.

HanClinto
0 replies
1d4h

Might be good to have some flexibility in where those particular tokens are placed, but yeah -- I could see value in creating a "pool" of tokens that should be used at some point in the future in the answer.

riku_iki
0 replies
1d2h

It's interesting that they got good results on the 200B, 0.8-epoch training set, but once they scaled it to 1T and 4 epochs, got degradation in the vast majority of benchmarks (Table 1).

lucidrains
0 replies
1d4h

Wow, so ProphetNet does work! I spent so much time experimenting with it back in the day, but just lacked the scale to see a positive result.

jmount
0 replies
1d1h

After inventing multi-token prediction, one then invents a useful language-oriented hierarchy (such as sections, paragraphs, sentences, and words).

hhcoder
0 replies
1d7h

I am curious what happens if the multiple tokens predicted interfere with one another. Say I ask "What are the colors of the rainbow?", if one of the tokens is a repeated color, how do we resolve that?

bjornsing
0 replies
1d7h

I’ve been thinking about this, but I’m leaning more towards letting the LLM output a small PixelCNN or similar model over the next N tokens. That way the LLM can describe conditional probabilities over the coming tokens.

WhitneyLand
0 replies
14h17m

The use of the word “head” in machine learning does not seem consistent, in case anyone else is confused by that.

There’s multihead attention and multiple output heads as a concept in the paper.

Multihead attention is about focusing on different areas of the input in transformer architectures, and the biological analogy here is head as a central processing unit.

An output head refers to the final layer of a neural network, of which you could have more than one producing different outputs based on the same previous layers. This is also a loose biological analogy, but instead of head as cpu, think more along the lines of head being on one end of the body.

In neither case is there any analogy to a tape head that reads data.