This is a great write-up! I nodded my head through the whole post. It very much aligns with our experience over the past year.
I wrote a simple example (overkiLLM) on getting reliable output from many unreliable outputs here[0]. This doesn't employ agents, just an approach I was interested in trying.
I chose writing an H1 as the task, but a similar approach would work for writing any short blob of text. The script generates a ton of variations, then uses head-to-head voting to pick the best ones.
This all runs locally / free using ollama.
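Rough sketch of the core loop, in case it helps (this is not the actual overkiLLM code, just the idea; the model name, prompts, and personas are placeholders, and it assumes a local Ollama server with a pulled model):

    # Generate variations, then run head-to-head votes with "persona" judges.
    import random
    from collections import Counter

    import ollama

    MODEL = "llama3"  # placeholder; any locally pulled model works

    def generate_variations(task, n=20):
        """Ask the model for n candidate H1s, one call each."""
        variations = []
        for _ in range(n):
            resp = ollama.chat(model=MODEL, messages=[
                {"role": "user", "content": f"Write one punchy H1 for: {task}"}
            ])
            variations.append(resp["message"]["content"].strip())
        return variations

    def vote(a, b, persona):
        """Have a 'persona' pick the better of two candidates (crude parsing)."""
        resp = ollama.chat(model=MODEL, messages=[
            {"role": "system", "content": f"You are {persona}, judging marketing copy."},
            {"role": "user", "content": f"Which H1 is better?\nA: {a}\nB: {b}\nAnswer A or B only."}
        ])
        return a if "A" in resp["message"]["content"].upper()[:3] else b

    def tournament(variations, personas, rounds=100):
        """Random pairwise matchups; tally wins and keep the top candidates."""
        wins = Counter()
        for _ in range(rounds):
            a, b = random.sample(variations, 2)
            wins[vote(a, b, random.choice(personas))] += 1
        return wins.most_common(5)

    personas = ["David Ogilvy", "a skeptical engineer", "a startup founder"]
    print(tournament(generate_variations("analytics without SQL"), personas))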
Oh this is fun! So you basically define personalities by picking well-known people that are probably represented in the training data and ask them (their LLM-imagined doppelganger) to vote?
In the research literature, this process is done not by "agent" voting but by taking a similarity score between answers, and choosing the answer that is most representative.
Another approach is to use multiple agents to generate a distribution over predictions, sort of like Bayesian estimation.
Any chance you could expand on both of these, even enough to assist in digging deeper into them? TIA.
The TLDR is you can prompt the LLM to take different perspectives than its default, then combine those. If the LLM is estimating a number, the different perspectives give you a distribution over the truth, which shows you the range of biases and the most likely true answer (given wisdom of the crowd). If the LLM is generating non-quantifiable output, you can find the "average" of the answers (using embeddings or other methods) and select that one.
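For the non-quantifiable case, here's a minimal sketch of picking the "most representative" answer with embeddings (the answers and model name are just illustrative; any embedding model would do):

    # Pick the answer closest to the centroid of all answers, i.e. the one
    # most similar on average to the rest.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    answers = [
        "Revenue will likely grow 10-15% next year.",
        "Expect roughly 12% revenue growth.",
        "Growth should land around 10%.",
        "Revenue will double.",  # the outlier
    ]

    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(answers, normalize_embeddings=True)

    # Cosine similarity of each answer to every other answer.
    sim = emb @ emb.T
    mean_sim = sim.mean(axis=1)

    print("most representative:", answers[int(np.argmax(mean_sim))])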
Ah ok, so both are implemented via calls to the LLM, as opposed to a standard algorithmic approach?
Once you have Bayesian prior distributions (which it makes total sense for LLMs to estimate), you can apply tons of nifty statistical techniques. It's only the bottom layer of the analysis stack that's LLM-generated.
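e.g. if you prompt for a numeric estimate from several perspectives, the samples give you a rough distribution you can feed into whatever statistics you like (sketch only; the perspectives and numbers below are made up, standing in for LLM outputs):

    # Treat repeated LLM estimates (one per prompted perspective) as samples
    # from a distribution over the true value.
    import numpy as np

    estimates = {
        "optimistic analyst": 120.0,
        "pessimistic analyst": 80.0,
        "industry veteran": 100.0,
        "outside forecaster": 95.0,
        "first-principles engineer": 105.0,
    }

    samples = np.array(list(estimates.values()))
    print("point estimate (mean):", samples.mean())
    print("spread (std):", samples.std(ddof=1))
    print("80% interval:", np.percentile(samples, [10, 90]))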
For my use case (generating an interesting H1), using a similarity score would defeat the purpose. I'm looking for the diamond in the rough, which is often dissimilar from the others. With head-to-head voting, that diamond can still get a high number of votes.
That approach definitely has promise. I would have agents rate answers and take the highest rated rather than vote for them, though, since you lose information about ranking and preference gradients with n choose 1. Also, you can do that whole process in one prompt; if you're re-prompting for each comparison, it's cheaper to batch it up.
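Something like this, batching all candidates into one rating prompt (a sketch under the same ollama setup as above; the JSON output format is just a convention I'm assuming, and in practice you'd want to guard against non-JSON replies):

    # Rate all candidates in a single prompt instead of pairwise votes,
    # keeping the full preference gradient.
    import json

    import ollama

    MODEL = "llama3"  # placeholder

    def rate_all(candidates, persona):
        numbered = "\n".join(f"{i}: {c}" for i, c in enumerate(candidates))
        resp = ollama.chat(model=MODEL, messages=[
            {"role": "system", "content": f"You are {persona}. Rate each H1 from 1-10."},
            {"role": "user", "content": (
                f"{numbered}\n\nReturn JSON mapping index to score, e.g. {{\"0\": 7}}."
            )},
        ])
        scores = json.loads(resp["message"]["content"])  # fragile; sketch only
        return {candidates[int(i)]: s for i, s in scores.items()}

    candidates = [
        "Powerful Analytics Without Engineering or SQL",
        "Analytics Made Accessible for Everyone",
    ]
    print(rate_all(candidates, "David Ogilvy"))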
To clarify the first part: the research suggests you can use the same prompt over multiple runs as the input to picking the answer.
I'd be curious to see some examples and maybe intermediate results?
Here are some examples[0]:
this one scored high:
Pinned Down - Powerful Analytics Without the Need for Engineering or SQL
this one scored low:
Analytics Made Accessible for Everyone.
Each time I've compared the top scoring results to those at the bottom, I've always preferred the top scoring variations.
0 - https://docs.google.com/spreadsheets/d/1hdu2BlhLcLZ9sruVW8a_...
I love the spreadsheet. That's exactly what I was looking for. Thank you!