A simplified explanation, which I think I heard from Karpathy, is that transformer models only do computation when they generate (decode) a token. So generating more tokens (using CoT) gives the model more time to “think”.
Obviously this doesn’t capture all the nuance.
I have another explanation. LLMs are essentially trained on "A B", i.e. on how plausible it is that B follows A.
There's simply a much larger space of plausible possibilities for shorter completions: A B1, A B2, etc. Like if I ask you to give a short reply to a nuanced question, you could reply with a thoughtful answer, a superficially correct-sounding answer, convincing BS, etc.
Whereas if you force someone to explain their reasoning, the space of plausible completions shrinks. If you start with convincing BS and work through it honestly, you will conclude that you should reverse course. (This is similar to how one of the best ways to debunk toxic beliefs with honest people is simply to openly ask them to play out the consequences, walking through the impact of ideas that sound good without much thought.)
This is similar to the reason that loading your prompt with things that reduce the space of plausible completions is effective prompt engineering.
I was going to write pretty much this exact same comment. I am an amateur in how LLMs work, definitely, but I always thought this was the plausible explanation.
If I want the "assistant "LLM to tell me "How much 5 times 2 is", if I feed it the line "5 * 2 = " as if it's already started giving that answer, it will very likely write 5*2 = 10.
Since LLMs operate on semantic relationships between tokens, the more a bunch of tokens are "close" to a given "semantic topic", the more the LLM will keep outputting tokens in that topic. It's the reason why if you ask an LLM to "review and grade poetry", eventually it starts saying the same thing even about rather different poems -- the output is so filled with the same words, that it just keeps repeating them.
Another example:
If I ask the LLM to solve me a riddle, just by itself, the LLM may get it wrong. If, however, I start the answer, unravelling a tiny bit of the problem it will very likely give the right answer, as if it's been "guided" onto the right "problem space".
By getting LLMs to "say" how they are going to solve things and checking for errors, each word basically tugs on the next one, homing in on the correct solution.
In other words:
If an LLM has to answer a question -- any question --, but right after we ask the question we "populate" its answer with some text, what text is more likely to make the LLM answer incorrectly?
- Gibberish nonsense
- Something logical and related to the problem?
Evidently, the more gibberish we give to it, the more likely it is to get it wrong, since we're moving away from the "island of relevant semantic meaning", so to speak. So if we just get the LLM to feed itself more relevant tokens, it automatically guides itself to a better answer. It's kind of like there's an "objective, ideal" sequence of tokens, and it can work as an attractor. The more the LLM outputs words, the more it gets attracted to that sequence...that...."island of relevant semantic meaning".
But, again, I know nothing of this. This is just how I view it, conceptually. It's probably very wrong.
That reminds me ... You know how LLMs have a hard time being corrected? If I ask it not to format responses as bullet lists, after 1-2 rounds it does it again. Why? Because the context is filled with examples where it has used bullet lists, and it acts like an attractor.
I ask it not to start phrases with "However..." and it does it again. Maybe just having the word However in the prompt acts like an attractor that compels the LLM to use it, even when I actually asked the opposite. Probably also the fault of heavy handed RLHF telling it to balance any user position with the opposite take.
This is one of many ways LLMs are being crippled by terrible UI controls. You can't do simple things like edit the conversation history to make it forget things.
If you haven't already, I recommend trying the OpenAI Playground instead of ChatGPT. It is the same underlying AI (i.e. GPT-4), but you have much more control over the inputs.
Bonus 1: Since you pay per token, it's much cheaper than a ChatGPT subscription
Bonus 2: You can increase the context window dramatically (IIRC 8000 is the max for the Playground, while 2000 is the max for ChatGPT)
You can edit the conversation history though. You need to try alternative apps/UIs instead of the product websites like ChatGPT. Those are only for collecting more training data from users instead of being the most useful interface possible.
Using a 3rd-party interface to the LLMs (like typingmind.com) is both better and cheaper than using ChatGPT.
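For what it's worth, with a raw API client you really can rewrite the history before every call. A minimal sketch, assuming the current OpenAI Python SDK and "gpt-4" as the model (both just illustrative choices):

    # Keep the conversation history yourself and edit it between turns.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    history = [
        {"role": "system", "content": "You are a helpful assistant. Never use bullet lists."},
        {"role": "user", "content": "Summarise the plot of Hamlet."},
    ]

    reply = client.chat.completions.create(model="gpt-4", messages=history)
    answer = reply.choices[0].message.content

    # If the answer came back as a bullet list, drop or rewrite it before the
    # next turn, so the unwanted format never becomes an "attractor" in context.
    if answer.strip().startswith(("-", "*")):
        answer = "(previous answer removed)"
    history.append({"role": "assistant", "content": answer})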
Facebook had a paper about "system 2" LLM attention, where they identified which parts of the input would be distracting for the LLM and just deleted them.
https://arxiv.org/abs/2311.11829
It helps, but it still gets stuck in local optima based on what it started with. I've never seen it turn around and correct its faulty reasoning unless it tried to actually run the code and observed an Exception. If I respond with "but have you considered XYZ?", my leading question will usually cause it to correct itself, even when it wasn't incorrect.
We need some way to generate multiple independent thoughts in parallel. Each separate thought is constructed using chain of thought to improve the reliability. Then you have some way to "reduce" these multiple thoughts into a single solution. The analogy would be a human brainstorming session where we try to attack the same problem from multiple angles and we try to decorrelate each idea/approach.
We already have that: it's called beam decoding, and there are tree-of-thought approaches as well. For each beam you can pick the one with the best logprob, but it's not a given that the result will be better, because logprobs only capture the model's decisiveness, not correctness, so it'll still fail if a model is confidently wrong.
I think this is different, because you could include tool use in the branches. E.g. (a rough sketch follows the list):
1. rewrite the following question in five different ways.
2. For each version of the question, write python code to do the work.
3. Look at all the outputs, write an answer
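A very rough sketch of that branch-and-reduce loop; llm() and run_in_sandbox() are hypothetical helpers, not any particular library:

    # 1. rewrite the question, 2. generate + run code per rewrite, 3. reduce.
    def llm(prompt: str) -> str:
        raise NotImplementedError("plug in a real LLM client here")

    def run_in_sandbox(code: str) -> str:
        raise NotImplementedError("execute generated code safely, return its output")

    def branch_and_reduce(question: str, n_branches: int = 5) -> str:
        # 1. Rewrite the question in several different ways.
        rewrites = [llm(f"Rewrite this question in different words:\n{question}")
                    for _ in range(n_branches)]
        # 2. For each version, have the model write Python code and run it.
        outputs = []
        for r in rewrites:
            code = llm(f"Write a short Python script that answers:\n{r}")
            outputs.append(run_in_sandbox(code))
        # 3. Look at all the outputs and write a single answer.
        joined = "\n---\n".join(outputs)
        return llm(f"Question: {question}\nOutputs of several attempts:\n{joined}\n"
                   f"Write the single best final answer.")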
And this is why taking your time to write a detailed software help request delivers a good chance that you will solve your problem all by your lonesome.
A rubber duck is all you need.
Yes, my fear of stack overflow moderators has caused me to solve many problems before I even finish writing the question.
I think you're right. I would go a step further and say that all learning is roughly synonymous with reducing the output space, and that humans do the exact same thing. There are more ways to get the wrong answer to a math problem than there are to get the right answer. When you learn someone's name, you're narrowing your output to be a single name rather than all plausible names.
The output of a generative model is practically infinite. I suspect it's possible to continually narrow the space of completions and never converge on a single output. If this turns out to be true, it would bode well for the scalability of few-shot learning.
Actually, one of the best ways is pretending to be more extreme than them. Agree with them on everything, which is disarming, but then take it a step or two even further. Then they're like, "now hang on, what about X and Y" trying to convince you to be more reasonable, and pretty soon they start seeing the holes and backtrack to a more reasonable position.
https://www.pnas.org/doi/abs/10.1073/pnas.1407055111
This raises the question: why does giving them more time to "think" yield better answers, and is there any limit to that? If I make them write hundreds of pages of explanation, there must be diminishing returns of some kind. What influences the optimal amount of thinking?
My guess is that good answers are more well reasoned than answers that are short and to the point, and this is picked up in training or fine-tuning or some other step.
And probably the optimal amount of thinking has something to do with the training set or the size of the network (wild guesses).
I think it's fairly simple: you're creating space for intermediary tokens to be generated, where those intermediary tokens represent "thoughts" or a simulated internal dialog.
Without that, it's analogous to asking someone a question and they immediately start responding from some information they'd heard before, rather than taking some time to have an inner dialog with themself.
There's a recent paper which seeks to explicitly perform time-to-think using pause tokens[1].
There are obviously pros and cons to each, but nothing excludes us from combining the two either.
1. Think before you speak: Training Language Models With Pause Tokens https://arxiv.org/abs/2310.02226v2
Look at it from an algorithmic perspective. In computer science many algorithms take a non-constant number of steps to execute. However, in transformer models there are a limited number of decoder blocks, and a limited number of FFN layers in each block. This places a theoretical upper bound on the complexity of the algorithms a decoder network can execute in a single token-generation pass.
This explains why GPT4 cannot accurately perform large number multiplication and decimal exponentiation. [0]
This example can extend to general natural language generation. While some answers can be immediately retrieved or generated by a "cache" / algorithm which exists in latent space, some tokens have better quality when their latent-space algorithm is executed in multiple steps.
[0] https://www.semanticscholar.org/reader/817e52b815560f95171d8...
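A back-of-the-envelope version of the fixed-depth argument, treating the schoolbook method's carry chain as inherently sequential (the constants and the multiplication example are mine, just to make the bound explicit):

    % one forward pass of a decoder with L blocks:
    \text{sequential steps available} = O(L) \quad \text{(fixed by the architecture)}
    % schoolbook multiplication of two n-digit numbers, carries done one by one:
    \text{sequential steps needed} \approx \Omega(n)
    % once n is large enough that this exceeds O(L), a single token-generation
    % pass cannot finish the algorithm; emitting intermediate tokens buys more passes.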
This paper suggests that a large language model should "think ahead" by predicting not only the next token but also a "supporting thought." The approach involves generating all tokens simultaneously, allowing for a single forward pass that produces both the next token and a supporting thought, which might consist of, for example, 16 tokens.
This supporting thought influences the model's prediction. The process is then extended to multiple supporting thoughts by ingeniously masking cross-attention between thoughts to ensure their independence. So in essence we can fill all the remaining context with supporting thoughts and benefit from all of them in the same single forward pass.
The supporting thoughts themselves are trained with the objective to maximize the probability of a longer sequence ahead, using RL. So they are trained to optimize for longer-term, instead of the myopic next token prediction task.
https://arxiv.org/abs/2403.09629
The autoregressive transformer architecture has a constant cost per token, no matter how hard the task is. You can ask the most complicated reasoning question, and it takes the same amount of computation to generate the next token compared to the simplest yes / no question. This is due to architectural constraints. Letting the LLM generate "scratch" data to compute (attend to relevant information) is a way of circumventing the constant cost limitation. The harder the task, the more "scratch" you need so more relevant context is available for future tokens.
That's flatly wrong. Each successive token costs progressively more. The deeper a token is in the sequence, the more past states it has to attend to. As a proof, just remember how slow it gets when the context is large, and how snappy when you first start a chat.
You're both kinda right. The type of computation that happens for that attention step that you refer to is parallel. I would say the thing that is "constant" is the computation graph depth (the number of sequential computations) which is actually important in computing certain functions.
https://blog.wtf.sg/posts/2023-02-03-the-new-xor-problem/
Flash attention, which is widely used, is no longer parallel. The attention matrix is solved batch by batch.
The way I worded it, it might seem wrong, and I agree with you. When I said "constant" I meant without any optimizations that speed up shorter contexts: with the full designed context, architecturally, it is constant. You can pad shorter active contexts with zeroes and avoid attending to the empty positions as an optimization, but that is just an optimization, not an architectural property. If you want "more computation", you fill the context with relevant data (chain of thought, or n-shot stuff), which is the "trick" Karpathy alluded to (it provides more context to attend to), and I agree with that analysis.
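Rough per-token numbers for one pass through the stack, just to pin down what "constant" means here (the constants c1, c2 are approximate and mine):

    \text{FLOPs for the token at position } n \;\approx\; L\,(\,c_1\, n d + c_2\, d^2\,)
    % c_1 n d: attention against the n cached keys/values (grows with position,
    %          the parent's point)
    % c_2 d^2: projections and MLP (fixed per token)
    % If you always pad/attend out to the full designed context N, this becomes
    % the architectural constant L (c_1 N d + c_2 d^2), which is the reading above.
    % Either way the sequential depth stays O(L).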
The tokens are also necessary to store information, or at least off-load it from neuron activations.
E.g. if you asked an LLM "think about X and then do Y", if the "think X" part is silent, the LLM has a high chance of:
a) just not doing that, or
b) thinking about it but then forgetting, because the capacity of the 'RAM' (i.e. neuron activations) is unknown but probably no more than a few tokens' worth.
Actually, has anyone tried to measure how much non-context data (i.e. new data generated from context data) an LLM can keep "in memory" without writing it down?
I don’t think commonly used LLM architectures have internal state that carries over between inference steps, so shouldn’t that be none? Unless you mean the previously generated tokens up to the context limit which is well defined.
Correct, there's no internal state, but CoT techniques simulate this by providing a space for the model to generate tokens which represent intermediary thoughts.
Sorry, I meant the information that is inferred (from scratch on every token) from the entire context, and is then reduced to that single token. Every time a token is generated, the LLM looks at the entire context, does some processing (and critically, this step generates new data that is inferred from the context) and then the result of all that processing is reduced to a single token.
My conjecture is that the LLM "knows" some things that it does not put into words. I don't know what it is, but it seems wasteful to drop the entire state on every token. I even suspect that there is something like a "single logic step" of some conclusions from the context. Though I may be committing the fallacy of thinking in symbolic terms of something that is ultimately statistical.
Do LLMs not also think when they encode the prompt? If Karpathy's explanation is accurate, longer prompts should also help even if they don't contain additional information, just by virtue of giving more time to think.
The time processing the longer prompt isn't being spent churning (i.e. "thinking") on the problem at hand, it's spent calculating attention matrices between all the tokens. The time spent on this is a function of the number of FLOPs you have available.
So no, if you just fill up your context window with garbage, the LLM will not perform better at your task/question.
Does every token require a full model computation?
No, you can cache some of the work you did when processing the previous tokens. This is one of the key optimization ideas designed into the architecture.
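Concretely, that optimization is the key/value cache. A minimal sketch with Hugging Face transformers and gpt2 as a stand-in (the model choice is just for illustration):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    ids = tok("The quick brown fox", return_tensors="pt").input_ids

    # First pass: process the whole prompt and keep the per-layer key/value cache.
    out = model(input_ids=ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)

    # Later passes: feed only the newest token plus the cache, so each new token
    # attends to cached keys/values instead of re-encoding the whole prompt.
    out2 = model(input_ids=next_id, past_key_values=past, use_cache=True)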
So my experience creating products on GPT3.5-Turbo is that there is an upper limit to how much instructional complexity the model can handle at a time. It isn't really about "adding computation", though you are doing this. The key is to construct the process so that the model only has to focus on a limited scope to make the decision on.
In effect you are kind of creating a tree structure of decisions that build off of each other. By generating intermediate tokens the model can now only pay attention to the smaller set of already collapsed decisions. It is a little more complicated than that as the model will create anticipatory behavior where intermediate steps get biased by an incorrect result that the model anticipates.
Also I should say it isn't just instructional complexity, it is ambiguity which creates the upper limit on capability.
This is true. You can get a similar effect by asking the model to plan its path first without writing any code, then asking it to review its plan for deficiencies, and finally asking it to enact the plan and write the code.
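Something like this plan -> review -> implement loop, where chat(history) is a hypothetical helper that calls whatever model you use and returns its reply:

    # Each step only has to attend to the already-collapsed decisions above it.
    def chat(history: list[dict]) -> str:
        raise NotImplementedError("call your model of choice here")

    history = [{"role": "system", "content": "You are a careful coding assistant."}]

    steps = [
        "Plan how you would implement a rate limiter for our API. No code yet.",
        "Review your plan for deficiencies or missed edge cases and revise it.",
        "Now enact the revised plan and write the code.",
    ]

    for step in steps:
        history.append({"role": "user", "content": step})
        history.append({"role": "assistant", "content": chat(history)})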
Do you think there is a fundamental difference between masked language modelling and causal language modelling? I feel like most LLMs are decoder-only models just because they are easier to train, since their attention mask is fixed.
One of the things I’ve been doing with the models I’ve been using with coding is adding the stack and primary dependencies in the system prompt and then asking or conversing. It has helped out a lot, or at least feels like it has.
That's what I thought at first, but that actually doesn't make sense: because of the mask used in attention, the amount of work done on a string is the same even if the string is followed by padding. Then I realised that an LLM's working memory is limited to its activations, which can be limiting. But it can extend its working memory by writing partial results to the output and reading them back in. E.g. if you tell it to "think of a number" without telling you what it is, it can't do that; there is nowhere to store that number, it has no temporary storage other than the tape. But if you ask it to "think step by step", you let it store intermediate results (thoughts) on the tape, giving it extra storage it can use for thinking.