Mildly surprised to see no mention of my top 2 LLM fails:
1) you’re sampling a distribution; if you only sample once, your sample is not representative of the distribution.
For evaluating prompts and running in production, your hallucination rate is inversely proportional to the number of times you sample.
Sampling many times and voting is a highly effective (but slow) strategy (a rough sketch follows at the end of this comment).
There is almost zero value in evaluating a prompt by only running it once.
2) Sequences are generated in order.
Asking an LLM to make a decision and then justify that decision, in that order, is literally meaningless.
Once the “decision” tokens are generated, the justification does not influence them. The two don’t happen “all at once”; there is a specific sequence to generating output, and later output cannot magically influence output which has already been generated.
This is true for sequential outputs from an LLM (obviously), but it is also true inside a single output: the tokens of that output are themselves a sequence.
If you’re generating structured output (e.g. JSON, XML) that doesn’t look ordered, and your output is something like {decision: …, reason: …}, the reason field literally does nothing for the decision.
…but, it is valuable to “show the working out” when, as above, you then evaluate multiple solutions to a single request and pick the best one(s).
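Roughly, in code, point 1 looks something like this: a minimal sketch assuming an OpenAI-style chat client, with the model name and helper functions as placeholders; exact-match voting only makes sense when the answer is a short, constrained label (normalise longer answers before counting).

```python
# Minimal sketch of "sample N times and vote" (point 1 above).
# Assumes the official openai>=1.x client; model name is a placeholder.
from collections import Counter

from openai import OpenAI

client = OpenAI()

def sample_once(prompt: str, temperature: float = 0.7) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return resp.choices[0].message.content.strip()

def sample_and_vote(prompt: str, n: int = 10) -> str:
    """Draw n samples from the same prompt and return the majority answer."""
    answers = [sample_once(prompt) for _ in range(n)]
    winner, count = Counter(answers).most_common(1)[0]
    print(f"{count}/{n} samples agreed on {winner!r}")
    return winner
```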
You don't need to hit an LLM multiple times to get multiple distributions: just provide a list of perspectives, ask the model to answer the question from each of them in turn, then combine the results right there in the prompt. I have tested this approach a bunch; it works.
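Roughly what I mean, as a sketch; the perspective list and wording here are purely illustrative.

```python
# Sketch of the single-call, multi-perspective pattern described above.
PERSPECTIVES = ["a security reviewer", "a performance engineer", "an end user"]

def multi_perspective_prompt(question: str) -> str:
    parts = [
        "Answer the question below from each of the following perspectives in turn.",
        f"Question: {question}",
        "",
    ]
    for i, perspective in enumerate(PERSPECTIVES, 1):
        parts.append(f"{i}. Answer as {perspective}.")
    parts.append("Finally, reconcile the answers above into one combined final answer.")
    return "\n".join(parts)

print(multi_perspective_prompt("Should we cache this endpoint?"))
```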
This isn't correct.
You're just sampling a different distribution.
You can adjust the shape of the distribution with your prompt, certainly... and if you make a good prompt, perhaps you can narrow the 'solution space' that you sample from.
...but you're still sampling randomly from a distribution, and the Nth token relies on the (N-1)th token as an input; that means a random deviation toward a bad solution compounds into a bad solution, regardless of your prompt.
...
Consider the prompt "Your name is Pete. What is your name?"
Seems like a fairly narrow distribution right?
However, there's a small chance that the first generated token is 'D'; it's small, but non-zero, which means it happens from time to time. The higher the temperature, the more random the output tokens.
How do you imagine that completion runs when it happens? Doug? Dane? Daniel? Dave? Don't know? I'll tell you what it is not: it's not Pete.
That's the issue here: when you sample, the solution space is wide, and any single sample has some probability P of being a stupid hallucination.
When you sample multiple times, the chance that every sample is that hallucination is P * P * P * P, and so on, once per sample.
You can therefore control your error rate this way, because you can calculate the chance of failure as P^N.
Yes, obviously, if your P(good answer) < P(bad answer) it has the opposite effect.
...but no, sampling once does not save you from this problem, no matter what your prompt is or how good it is.
Furthermore, when you're evaluating prompts, only sampling once means you have no way of knowing whether it was a good prompt or not. Whereas if you sample, say, 10 times, you can see obviously, from the outputs (e.g. Pete, Pete, Pete, Pete, Potato, Pete, Pete <--- ), what the prompt is doing.
You can measure the error rate of your prompts this way (sketch at the end of this comment).
If you don't, honestly, you really have no idea if your prompts are any good at all. You're just guessing.
People who run a prompt, tweak it, run it, tweak it, run it, tweak it, etc. are observing random noise, not doing prompt engineering.
Never sample only once.
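To make the arithmetic concrete, here's a rough sketch. The 5% per-sample hallucination rate is made up, and P^N assumes the samples fail independently (a caveat raised further down the thread).

```python
# Back-of-the-envelope version of the P^N argument above.
p = 0.05  # assumed per-sample hallucination rate (illustrative only)
for n in (1, 3, 5, 10):
    print(f"n={n:2d}  P(every sample hallucinates) = {p**n:.2e}")

# Measuring a prompt the same way: run it N times and look at the spread,
# e.g. samples = ["Pete", "Pete", "Pete", "Pete", "Potato", "Pete", "Pete"].
from collections import Counter

def empirical_error_rate(samples: list[str], expected: str) -> float:
    print(Counter(samples))  # shows the spread of answers at a glance
    return sum(1 for s in samples if s != expected) / len(samples)
```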
I suggest you spend 20 hours evaluating the results of 10 prompts vs 1 prompt with multiple perspectives to learn the truth about the matter rather than trying to armchair expert.
Edit in response to your wall of text: I have *extensively* tested the results of multi-shot prompting vs repeated single shot prompting, and the differences between them are not material to the outcome of "averaging" results, or selecting the best result. You can theorize all you want, but the real world would like a word.
Just click on the 'regenerate result' button a few times and see what happens before you change the prompt. That's all it takes.
It's an easy adjustment to workflows that people often either forget to do or don't realise they should be doing.
Sorry; I'm not trying to criticize; I'm just telling you that's how it works.
That's an early step that matters more when you're hitting a chat interface with a hidden temperature setting. Once you get a prompt dialled in, you usually want to lower the temperature to the minimum value that still produces the desired results.
I think the two of you are arguing different things.
OP is saying that you can’t evaluate any prompt from just one generation with that prompt.
You need to run that prompt several times to approximate any prompt’s performance. That’s just how probability works.
I don’t believe OP is arguing the effectiveness of running 10 different prompts vs a single multiple perspective prompt.
I mean, if you are attempting to use multiple sampling to avoid this kind of error, just use temperature 0 sampling and do it once.
In your example you will get Pete every time.
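As a sketch, assuming an OpenAI-style client (the model name is a placeholder). Note that temperature 0 is effectively greedy decoding, and on some providers it is still not perfectly deterministic across runs, but it collapses nearly all of the variance in a case like this.

```python
# Temperature-0 version of the "Pete" example above.
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Your name is Pete. What is your name?"}],
    temperature=0,  # always pick (close to) the highest-probability token
)
print(resp.choices[0].message.content)
```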
Sure, that's fair.
I will say, though, that using temperature 0 without understanding it (or worse, testing at temp > 0 and then setting temp to 0 for production, which I literally had to stop someone I know and respect as a developer from doing), and using top_k and top_p without understanding what they do, is my #3 LLM fail.
/shrug
...but yes, as you say, in a trivial case like binary decision making, a zero or very low temperature can reduce the need to sample multiple times; and as you say, once the output is deterministic, sampling multiple times doesn't help at all.
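For anyone unsure what those knobs actually do, here's a toy, model-free sketch: the logits are made up, and top_p follows the usual nucleus-sampling convention.

```python
# Toy illustration of how temperature, top_k and top_p reshape the
# next-token distribution (no model involved; logits are invented).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sampling_dist(logits, temperature=1.0, top_k=None, top_p=None):
    probs = softmax(np.asarray(logits, dtype=float) / max(temperature, 1e-8))
    if top_k is not None:
        # zero out everything but the k highest-probability tokens
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
    if top_p is not None:
        # keep the smallest set of top tokens whose cumulative mass reaches top_p
        order = np.argsort(probs)[::-1]
        mass_before = np.cumsum(probs[order]) - probs[order]
        keep = np.zeros_like(probs, dtype=bool)
        keep[order[mass_before < top_p]] = True
        probs = np.where(keep, probs, 0.0)
    return probs / probs.sum()

logits = [5.0, 3.5, 3.0, 1.0, -1.0]               # five fake candidate tokens
print(sampling_dist(logits))                       # plain softmax
print(sampling_dist(logits, temperature=0.2))      # sharper: rare tokens nearly vanish
print(sampling_dist(logits, top_k=2))              # only the 2 best survive
print(sampling_dist(logits, top_p=0.9))            # smallest set covering 90% of the mass
```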
What are some good metrics to evaluate LLM output performance in general? Or is it too hard to quantify at this stage (or not understood well enough)? Perhaps the latter, or else those metrics could be in the loss function itself..
There are a few subtle misconceptions being spread here:
1) Hallucination rate is not inversely proportional to the number of samples unless you assume statistical independence. Since you’re sampling from the same generative process each time, any inherent bias of the LLM can affect every sample (e.g. see Golden Gate Claude). Naively calculating the hallucination rate as P^N is going to be a massive underestimate of the true error rate for many tasks requiring factual accuracy (see the simulation sketch below).
2) You’re right that output tokens are generated autoregressively, but you are thinking like a human. Transformer attention layers are permutation invariant. The ordering of the output (e.g. decision first, then justification later) is inconsequential; either can be derived from the input context and hidden state where there is no causal masking of attention.
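A quick toy simulation of the independence caveat in point 1 (all numbers illustrative): if some fraction of errors come from a shared bias that hits every sample for a given question, majority voting can't vote them away, and P^N badly underestimates the true error rate.

```python
# Toy model: some questions trigger a shared bias (every sample gives the
# same wrong answer); the rest fail independently per sample.
import random

def voted_error_rate(n_samples, p_random_err=0.2, p_bias=0.05, trials=100_000):
    errors = 0
    for _ in range(trials):
        if random.random() < p_bias:
            errors += 1                      # bias hits every sample; voting can't fix it
            continue
        wrong = sum(random.random() < p_random_err for _ in range(n_samples))
        if wrong > n_samples / 2:            # majority vote lands on a wrong answer
            errors += 1
    return errors / trials

print(voted_error_rate(n_samples=1))   # ~0.24
print(voted_error_rate(n_samples=9))   # ~0.07: floored by the bias, nowhere near 0.2**9
```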
Justification before decision still works out better in practice, though, because of chain of thought [1]. You'll tend to get more accurate and better-justified decisions.
With decision before justification, you tend to have a greater risk of the output being a wrong decision followed by convincing BS justifying it.
(edit: Another way you could think of it is, LLMs still can't violate causality. Attention heads' ability to look in both directions with respect to a particular token's position in the sequence does not enable them to see into the future and observe tokens that don't exist yet.)
1: https://arxiv.org/abs/2201.11903
I totally agree; that's what I had to do with my patchbot that evaluates haproxy patches for backporting ( https://github.com/haproxy/haproxy/tree/master/dev/patchbot/ ). Originally it would just provide a verdict and then justify it, and it worked extremely poorly, often with a justification that directly contradicted the verdict. I swapped that around, asking for the analysis first and the final verdict last, and now the success rate is totally amazing (particularly with Mistral, which remains unbeatable at this task because it follows instructions extremely well).
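For illustration, the pattern looks roughly like this; a generic sketch of "analysis first, verdict last", not the actual patchbot prompt.

```python
# Because output is generated left to right, putting the analysis field before
# the verdict field means the verdict tokens are conditioned on the analysis,
# not the other way around.
import json

PROMPT_TEMPLATE = """Review the following patch and decide whether it should be backported.

Patch:
{patch}

Respond with JSON, using exactly this field order:
{{
  "analysis": "<your reasoning about risk, scope and dependencies>",
  "verdict": "<yes | no | uncertain>"
}}"""

def parse_verdict(reply: str) -> str:
    return json.loads(reply)["verdict"]  # generated after, and conditioned on, the analysis
```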
You find Mistral to be the best "open"/local model? Or you find it to be the best model period?
Your second point I either don’t correctly understand, or it seems to fly in the face of a lot of proven techniques. Chain-of-thought, ReAct, and decision transformers all showcase that the order of an LLM’s output matters, because the tokens the LLM emits before the “answer” can nudge the model to sample from a higher-quality part of the distribution for the remainder of its output.
Is this true if you are using RAG too?
The core issue the parent is talking about is whether the decision tokens are built on the reasoning tokens, or the reasoning tokens are generated to fit decision tokens that already exist. RAG just provides the context the LLM should reason about.
If you set the temperature to zero, the output will always be the same, not a distribution. If instead you increase the temperature, the LLM will sometimes choose tokens other than the one with the highest score, but the output won’t be that much different.
This is crazy town banana pants.
Beam search [1] has long been a great way to decode from language models, even before transformers. Essentially you keep the top N most promising partial sequences (beams) at each step and extend those, rather than committing to a single path (a toy sketch follows below).
OpenAI doesn't offer beam search yet, just temperature and top_p, but I hope they add support for it, because it's far more efficient than just starting over each time.
[1]: https://www.width.ai/post/what-is-beam-search
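A toy sketch of the idea: a made-up scorer stands in for the model here; a real implementation would score continuations with the model's log-probabilities.

```python
# Toy beam search: keep the beam_width best partial sequences at each step.
def next_token_logprobs(prefix):
    # Stand-in for the model: tiny fixed vocabulary with made-up log-probs.
    return {"Pete": -0.2, "Dave": -2.5, "<eos>": -0.7}

def beam_search(beam_width=3, max_len=4):
    beams = [((), 0.0)]                                   # (tokens, total log-prob)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == "<eos>":
                candidates.append((tokens, score))        # finished beams carry over
                continue
            for tok, lp in next_token_logprobs(tokens).items():
                candidates.append((tokens + (tok,), score + lp))
        # keep only the beam_width best-scoring partial sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

for tokens, score in beam_search():
    print(f"{score:7.2f}  {' '.join(tokens)}")
```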
We allude to 2) when talking about using explanations first, but I totally agree. One minor comment: explanations after the answer can sometimes be useful for understanding how the model came to a particular generation during post-hoc evals.
Point 1 is also a good callout. I added something on this for the LLM-judge part, but it’s relevant more broadly.
To the user
But these tools are marketed as if you only need to run them once to get a good result; the companies behind them would really like you to stop hammering the button that deletes their money.
As an aside:
This isn't really true, and it requires you to fuzz the prompt itself for best effect, making the "spam the LLM with requests" problem much worse.