I've just started this video, but already have a question if anyone's familiar with GPT workings - I thought that these models chose the next word based on what's most likely. But if they choose based on "one of the likely" words, couldn't that (in general) lead to a situation where the predictions for the next word are all much less likely? Running possibilities of "two words together", then, would be more beneficial if computationally possible (and so on for 3, 4 and n words). Does this exist?
(I realize that choosing the most likely word wouldn't necessarily solve the issue, but choosing the most likely phrase possibly might.)
Edit, post seeing the video and comments: it's beam search, along with temperature to control these things.
The temperature setting is used to control how rare a next token can be. If set to 0, the top of the likelihood list is chosen; if set greater than 0, some lower-probability tokens may be chosen.
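Roughly, in code (a sketch only, assuming the model hands back raw logits over the vocabulary; the function name is made up):

```python
import numpy as np

def sample_next_token(logits, temperature=1.0):
    """Pick a next-token id from raw model logits.

    temperature == 0 -> greedy: always take the single most likely token.
    temperature  > 0 -> divide the logits by the temperature, softmax,
                        then draw randomly from that distribution.
    """
    logits = np.asarray(logits, dtype=np.float64)
    if temperature == 0:
        return int(np.argmax(logits))
    scaled = logits / temperature
    scaled -= scaled.max()                      # for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(np.random.choice(len(probs), p=probs))
```

Higher temperatures flatten the distribution, so the gap between the most likely token and the rest matters less.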
Can this be potentially dangerous -- e.g. if a user types "The answer to the expression 2 + 2 is", isn't there a chance it chooses an output beyond the most likely one?
Yes, although it's also possible that the most likely token is incorrect and perhaps the next 4 most likely tokens would lead to a correct answer.
For example, if you ask a model what 0^0 is, the highest-probability output may be "1", which is incorrect. The next most probable outputs may be words like "although", "because", "due to", "unfortunately", etc., as the model prepares to explain to the user that the value of the expression is undefined. Because there are many more ways to express and explain the undefined answer than there are to express a naively incorrect answer, the correct answer is split across more tokens. So even if, e.g., the softmax value of "1" is 0.1 while "although" + "because" + "due to" + "unfortunately" > 0.3, at a temperature of 0 "1" gets chosen. At slightly higher temperatures, sampling across all outputs would increase the probability of a correct answer.
So it's true that increasing the temperature increases the probability that the model outputs tokens other than the single-most-likely token, but that might be what you want. Temperature purely controls the distribution of tokens, not "answers".
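To make that concrete, here's a toy version of the 0^0 example (all the probabilities are made up for illustration):

```python
# Hypothetical next-token probabilities (post-softmax) for "what is 0^0?"
probs = {
    "1": 0.10,            # naively incorrect single-token answer
    "although": 0.08,     # each of these starts a correct explanation
    "because": 0.08,
    "due to": 0.08,
    "unfortunately": 0.07,
}

greedy_choice = max(probs, key=probs.get)
explanation_mass = sum(p for tok, p in probs.items() if tok != "1")

print(greedy_choice)      # "1" wins at temperature 0
print(explanation_mass)   # 0.31 of probability mass points toward an explanation
```

At temperature 0 you always get "1"; with sampling, the combined 0.31 of mass on explanation-starting tokens makes a correct answer more likely than the greedy pick suggests.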
Not sure if you were making a joke, but 0^0 is often defined as 1.
https://en.wikipedia.org/wiki/Zero_to_the_power_of_zero
I honestly had forgotten that, if I ever knew it. But I think the point stands that in many contexts you'd rather have the nuances of this kind of thing explained to you - able to be represented by many different sequences of tokens, each individually being low probability - instead of simply taking the single highest-probability token "1".
I'd rather it recognize that it should enter a calculator mode to evaluate the expression, and then give context with the normal GPT behavior.
perhaps a hallucination
You really couldn't come up with an actual example of something that would be dangerous? I'd appreciate that, because I'm not seeing a reason to believe that an "output beyond the most likely one" would ever end up being dangerous, as in harming someone or putting someone's life at risk.
Thanks.
That depends on how many people are putting blind faith in terrible AI. If it's your doctor or your parole board, AI making a mistake could be horrible for you.
Unless you screw something up, a different next token does not mean a wrong answer. Examples:
(80% of the time) The answer to the expression 2 + 2 is 4
(15% of the time) The answer to the expression 2 + 2 is Four
(5% of the time) The answer to the expression 2 + 2 is certainly
(95% of the time) The answer to the expression 2 + 2 is certainly Four
This is how you can ask ChatGPT the same question a few times and it can give you different words each time, and still be correct.
That assumes that the model is assigning vanishingly small weights to truly incorrect answers, which doesn't necessarily hold up in practice. So I think "Unless you screw something up" is doing a lot of work there.
I think a more correct explanation would be that increasing temperature doesn't necessarily increase the probability of a truly incorrect answer proportionately to the temperature increase (because the same correct answer could be represented by many different sequences of tokens), but if the model assigns a non-zero value to any incorrect output after applying softmax (which it most likely does), increasing the temperature does increase the probability of that incorrect output being returned.
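A toy illustration with made-up logits (the numbers aren't from any real model):

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    z = np.asarray(logits, dtype=np.float64) / temperature
    z -= z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical logits: index 0 is an "incorrect" token, the rest are acceptable.
logits = [1.0, 4.0, 3.5, 3.0]

for t in (0.5, 1.0, 2.0):
    p_wrong = softmax_with_temperature(logits, t)[0]
    print(f"temperature={t}: P(incorrect token) ~ {p_wrong:.3f}")

# The incorrect token's probability grows with temperature
# (~0.002 -> ~0.025 -> ~0.086 here), even though the ranking never changes.
```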
This is where the semi-ambiguity of human language helps a lot.
There are multiple ways to answer with "4" that are acceptable, meaning that it just needs to be close enough to the desired outcome to work. This means that there isn't a single point that needs to be precisely aimed at, but a broader region of space that's relatively easier to hit.
The hefty tolerances, redundancies, & general lossiness of human language act as a metaphorical gravity well to drag LLMs toward the most probable answer.
That’s why we use top p and top k! They limit the probability space to a certain % or number of tokens ordered by likelihood
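Roughly like this (a sketch only; the parameter names just mirror the usual convention, not any particular library):

```python
import numpy as np

def top_k_top_p_filter(probs, top_k=50, top_p=0.9):
    """Zero out unlikely tokens, then renormalize.

    top_k: keep at most the k most likely tokens.
    top_p: keep the smallest set of tokens whose cumulative probability reaches top_p.
    """
    probs = np.asarray(probs, dtype=np.float64)
    order = np.argsort(probs)[::-1]          # most likely first

    keep = np.zeros_like(probs, dtype=bool)
    cumulative = 0.0
    for rank, idx in enumerate(order):
        if rank >= top_k or cumulative >= top_p:
            break
        keep[idx] = True
        cumulative += probs[idx]

    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()         # sample from this instead of the full distribution
```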
Yes, but the chance is quite small if the gap between "4" and any other token is quite large.
Can you explain how it chooses one of the lower-probability tokens? Is it just random?
Reducing temperature reduces the impact of differences between raw output values giving a higher probability to pick other tokens.
Oops backwards. Increasing temperature...
Thanks, learnt something new today!
It's part of the softmax layer, but not all the time.
Something like this does exist; production systems rarely use greedy search but have more holistic search algorithms.
An example is Beam Search: https://www.width.ai/post/what-is-beam-search
Essentially we keep a window of probabilities of predicted tokens to improve the final quality of output.
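A rough sketch of the idea (here `next_token_probs` is a stand-in for whatever gives you the model's next-token distribution, and the token names are illustrative):

```python
import math

def beam_search(next_token_probs, start, beam_width=3, max_len=10, eos="<eos>"):
    """Minimal beam search sketch.

    Instead of keeping only the single best token at each step (greedy),
    keep the `beam_width` highest-scoring partial sequences and extend each.
    """
    # Each beam entry: (sequence_of_tokens, cumulative_log_probability)
    beams = [([start], 0.0)]

    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:                 # finished sequences carry over unchanged
                candidates.append((seq, score))
                continue
            for token, p in next_token_probs(seq).items():
                candidates.append((seq + [token], score + math.log(p)))
        # Prune: keep only the top `beam_width` partial sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]

    return beams[0][0]                         # highest-scoring sequence found
```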
Thanks, that's exactly what I was looking for! Any idea if it's possible to use beam search on local models like mistral? It sounds like the choice of beam search vs say top-p or top-k should be in the software and not embedded, right?
This is actually a great question for which I found an interesting attempt: https://andys.page/posts/llm_sampling_strategies/
(No affiliation)
If you use HuggingFace models, then a few simpler decoding algorithms are already implemented for the `generate` method of all supported models.
Here is a blog post that describes it: https://huggingface.co/blog/how-to-generate.
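For example, something along these lines (the model id and the generation settings here are just illustrative; see the linked post for details):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"      # illustrative model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("The answer to the expression 2 + 2 is", return_tensors="pt")

# Beam search: keep the 4 highest-scoring partial sequences at each step.
beam_out = model.generate(**inputs, num_beams=4, max_new_tokens=20)

# Sampling with temperature / top-p / top-k instead of beam search.
sample_out = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
    top_k=50,
    max_new_tokens=20,
)

print(tokenizer.decode(beam_out[0], skip_special_tokens=True))
print(tokenizer.decode(sample_out[0], skip_special_tokens=True))
```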
I will warn you though that beam search is typically what you do NOT want. Beam search approximately optimizes for the "most likely sequence at the token level." This is rarely what you need in practice with open-ended generations (e.g. a question-answering chat bot). In practice, you need the "most likely semantic sequence," which is a much harder problem.
Of course, various approximations for semantic alignment are currently in the literature, but it's still a wide-open problem.
I have no idea why you say this. Most of our pipelines will run greedy, for reproducibility.
Maybe we turn the temp up if we are returning conversational text back to a user.
There's a whole bunch of different normalization and sampling techniques that you can perform that can alter the quality or expressiveness of the model, e.g. https://docs.sillytavern.app/usage/common-settings/#sampler-...
There’s some fancier stuff too like techniques that take into account where recent tokens were drawn from in the distribution and update either the top_p or the temperature so that sequences of tokens have a minimum unlikeliness. Beam search is less common with really large models because the computation is really expensive.
In practice, beam search doesn't seem to work well for generative models.
Temperature and top_k (two very similar parameters) were both introduced to account for the fact that human text is stochastically unpredictable from sentence to sentence, as shown in this 2021 reproduction of an older graph from the 2018/2019 HF documentation: https://lilianweng.github.io/posts/2021-01-02-controllable-t...
It could be that beam search with a much longer length does turn out to be better, or that some merging of the techniques works well, but I don't think so. The query-key-value part of transformers is focused on a single token in many ways, in relation to the overall context. The architecture is not meant for longer forms as such; there is no default "two token" system. And with 50k-100k tokens in most GPT models, you would be looking at 50k*50k combinations, a great deal more parameters, and then issues with sparsity of data.
Just about everything in GPT models (e.g. learned positional encodings/embeddings, depending on the model iteration) is so focused on capturing the richness of a single token at a single index that the architecture is not designed for beam search like this, one could say, and that's without even considering the training complications.
Yes, this is a fundamental weakness with LLMs. Unfortunately this is likely unsolvable because the search space is exponential. Techniques like beam search help, but can only introduce a constant scaling factor.
That said, LLMs reach their current performance despite this limitation.