I'm not sure people in these comments are reading this paper correctly.
This seems to essentially disprove the whole idea of multi-agent setups like Chain-of-thought and LLM-Debate.
Because the paper introduces an alternative method that simply runs the same query multiple times on the same LLM, without any context shared across queries, then runs a similarity algorithm on the answers and picks the most common one. (Which makes sense to me: if an LLM is giving you a mixture of "hallucinations" and correct answers, the correct answers will be similar to each other, while the hallucinations will hopefully be chaotic.)
And this simple algorithm performs just as well as (and sometimes better than) all the other multi-agent algorithms.
This suggests that the other multi-agent schemes with their clever prompts aren't really doing anything special; their improved results come mostly from the fact that the LLM is run multiple times, not from the prompt asking the LLM to pick the best answer.
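To make that concrete, here's a rough sketch of my reading of the setup (not the authors' code; query_llm is a placeholder for whatever model call you use, and plain string similarity stands in for whatever similarity metric the paper actually applies):

    import difflib

    def most_typical(answers):
        # Score each answer by its total similarity to all the others,
        # then return the one the ensemble "agrees on" most.
        def sim(a, b):
            return difflib.SequenceMatcher(None, a, b).ratio()
        scores = [sum(sim(a, b) for j, b in enumerate(answers) if j != i)
                  for i, a in enumerate(answers)]
        return answers[scores.index(max(scores))]

    def sample_and_vote(query_llm, prompt, n=10):
        # n fully independent queries: same model, same prompt, no shared context.
        answers = [query_llm(prompt) for _ in range(n)]
        return most_typical(answers)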
https://en.wikipedia.org/wiki/Lorenz_system
Years ago weather simulations started tweaking input params and running their models over and over. Discarding outliers, taking averages. It works pretty well.
Because LLMs are mostly sampled with a random seed (via temperature), feeding them the same input and averaging the output is going to get you a better guess.
Lorenz also gives some clues (if not an outright explanation) as to why the "hallucination" problem is likely unsolvable.
If you buy into this line of thinking then it quickly becomes apparent that LLMs are more or less a dead end when it comes to AGI. Simulating isn't emulating... an LLM is as likely to become intelligent as a forecast is to control the weather.
LLMs already are intelligent. They're not the same as humans, but they are able to give intelligent answers to highly nontrivial questions.
I have yet to see an LLM that is cooperative. The magic of collaborating with someone is that we can both understand the problem and reason about it.
The current degree of LLM intelligence is not compelling for a social creature like me.
Have you ever talked to real average people?
I would say an LLM is more intelligent than at least some people I know. And in the domain of programming, most people I know. Simply by the fact that most people don't know programming.
LLMs are idiot savants that can do a few things very well and fail horribly at others. And they require careful prodding to correctly process tricky logical questions, exposing what they are at the core: text expanders and parroters. Highly useful of course to save typing effort and to aggregate insights over large context lengths. If anything, dealing with LLMs has helped me appreciate the capabilities of people more.
They're much more than that. You can ask an LLM a question that it has never seen before, and it will give you a logical, reasonable answer. That requires knowledge of the world and the ability to reason.
LLMs aren't the same as humans, but neither are dogs or cats, and they're obviously intelligent in their own ways.
They will give that answer because they are forced to give it. The softmax turns whatever marginal outputs the model head produces into a probability distribution. This means that if they don't have an answer, they are quite likely to "hallucinate" one. This is of course influenced by the patterns they learned. And directing them to be more structured also utilizes patterns of structured thinking that are either part of finetuning or found somewhere in the training data.
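A tiny illustration of that point (toy numbers, nothing model-specific): the softmax turns whatever logits come out of the model head into a distribution that sums to 1, so something always gets sampled, however weak the evidence behind it.

    import numpy as np

    def softmax(logits):
        e = np.exp(logits - np.max(logits))  # subtract max for numerical stability
        return e / e.sum()

    confident = np.array([8.0, 1.0, 0.5])    # the model "knows": one token dominates
    clueless  = np.array([0.2, 0.1, 0.15])   # the model "doesn't know": near-flat logits

    print(softmax(confident))   # ~[0.998, 0.001, 0.001]
    print(softmax(clueless))    # ~[0.35, 0.32, 0.33] -- still a valid distribution, still gets sampled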
The cat/dog vs. human analogy is a very bad comparison, since cat and dog brains work fundamentally like human brains, while transformers are something completely different.
Ever talked to a sales person? They also start making up things when they don't know.
You can't seem to accept that a computer can be intelligent. Can an ant be intelligent? Can an ant brain produced in a lab be intelligent? Can a computer-simulated ant brain be intelligent? Can an LLM that is way smarter than an ant be intelligent?
Nobody in their right mind expects truth from a salesperson. You deal with them to negotiate a price, not to inform yourself about a topic.
Computers might very well one day count as "intelligent" (whatever that even means), but it would be an insult to humans and even to ants to call today's LLMs "intelligent". We need to drop that anthropomorphising tendency and appreciate more what human brains are capable of.
This is ChatGPT's snarky reply to your argument:
Anything that can write that is intelligent.
Are they? You realize that's entirely speculative right? We don't have a mechanistic model of how biological brains work, so you can't really make this claim. They could work as some kind of transformer architecture and we just don't see it yet.
So is your brain. So is mine.
I brought up the dog/cat analogy because those animals, while intelligent, are unbelievably dumb in some ways that are difficult for humans to comprehend. When people say that LLMs can't reason, they typically bring up certain tasks where the LLM falls on its face. I could bring up cases in which my dog fails in some task in a way that is completely incomprehensible to me. He's intelligent, but he has some puzzling blind spots.
Transformers mechanically work very differently from the human brain, but they also share a lot in common. They are a neural system that learns an internal representation of the world, and which is able to use that representation to reason about novel situations and give rational answers.
Programmers aren’t any better than someone who doesn’t know how to program.
Programming skill isn’t a measure of intelligence.
Go outside. Talk to real people. Touch some grass.
I have a friend called Nick, but we call him Nikipedia, since he has a crazy amount of facts stored in his brain. When we go to quizzes, our group is the most likely to win.
I can tell you this: LLMs know more than Nick and would beat these quizzes every single time.
You can use any definition of "intelligence" that makes you happy, no problem.
Surprised to read that.
I use them as a cooperative partner by default.
Also: quite a few people have had instances work with other instances, sometimes of the same model and sometimes of other models.
Cooperation is more than an i/o loop. Layering and pooling models is nice though.
Perhaps "conceptualization" is the indicator here.
Perhaps I'm up too late, but I can't think what else there is to cooperation besides two or more agents doing things in alignment with some goal? (Regardless of who or what sets that goal.)
Also I don't know what you mean by "conceptualization".
It's fuzzy because intelligence is relative, right?
I mean "being able to conceive an idea". As humans, two or more of us can reason our way to a conclusion without domain knowledge. There is an upper limit where the idea is incomplete (assuming respectful ignorance), but it's generative nonetheless.
With an LLM I have to prompt engineer to guide it. I would rather have it generate novel concepts to push domain boundaries. They work great as knowledge bases though.
That sounds like step-by-step thinking?
I generally have to in humans, too. I mean, you and I are prompting each other, aren't we?
For me the difference between prompting a human and prompting an AI is that I can reset the AI, I can't make a human forget a previous analogy that had only confused them. (And likewise, I don't expect that I fully forget bad analogies which confuse me, though I do try).
IMO, that's their weakest part. We had knowledge bases before — where each claim can be easily localised within the model, corrected when it needs to be, verified in advance, and which give predictable output — LLMs are none of those things.
LLMs are much better at understanding the question (constant time for a fixed-length output, even when the query is phrased badly and relatively complex), and being able to synthesise things in the form of "${x} won't work, try ${y}".
Huh. Do you think integrating the Semantic Web metadata and ontologies in LLM training can help us bootstrap conceptual modeling using natural language?
I really can't relate to that experience. On the contrary I think this is something LLMs are really good at.
You could convince me with a ReAct agent in a shared environment.
Do you have any models that you find compelling? Maybe a domain model that you like or have wanted to try.
Don't get me wrong, I still use LLMs, but they just really need that extra augmentation for any non-trivial task.
Is it even allowed to ask questions??
Edit: my science fiction joke in the 90s was AI through bots chatting in IRC channels. They could seamlessly integrate human intelligence that way.
Up until this point, I agree.
This puts humans on too high a pedestal: LLMs aren't magic, and we're not magic either.
(There's other reasons for me to think Transformers aren't the answer, but not this kind of reasoning).
We pretty much are compared to present-day neural architectures. How many simulated neurons and synapses are in the largest architectures, and how do those numbers compare to humans?
It’s a non-starter to assume that virtual “synapses and neurons” behave like ours do. We barely understand how ours work.
Also, modern LLMs built on the transformers architecture no longer use the neuron-inspired perceptron style topology for most of their compute.
I’ve heard that spiking NNs are supposed to mimic organic brains more closely, but I haven’t read into them much yet.
The attention mechanism is in practice implemented using three linear layers. The matrix multiplication to average the output and to implement the masking is the only non-neuronal part of that computation, but it can be seen as an activation function.
Usually, linear perceptrons and ReLUs or GeLUs are used. Due to the enormous compute requirements of evaluating models of interesting size, other types of neural networks and activation functions have received very little attention (pun intended) so far.
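For reference, here's a bare-bones single-head version of that computation (a numpy sketch, not any particular implementation): three linear layers produce queries, keys and values, and the softmax-weighted matrix multiply does the averaging and masking.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def self_attention(X, Wq, Wk, Wv, causal=True):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv            # the three linear layers
        scores = Q @ K.T / np.sqrt(K.shape[-1])     # similarity of each token to every other
        if causal:
            mask = np.tril(np.ones_like(scores))    # each position attends only to itself and earlier ones
            scores = np.where(mask == 1, scores, -1e9)
        return softmax(scores) @ V                  # weighted average of the value vectors

    rng = np.random.default_rng(0)
    X = rng.normal(size=(4, 8))                     # 4 tokens, model width 8
    Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
    print(self_attention(X, Wq, Wk, Wv).shape)      # (4, 8)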
Using ReLU instead of sigmoid is a significant departure with regards to how closely it models actual neurons.
Using non fully connected layers is as well. Our brains likely aren’t fully connected, but the connections that matter are made stronger through living life and learning.
If you squint, it’s kind of like training a dense series of linear layers, but that’s not what we’re doing anymore (for the better)
Comparing NNs to organic brains is an apples to oranges comparison, is what I’m saying.
I agree that the biggest difference is an artificial neural network's missing ability to adapt.
Lack of adaptation is mainly a feature: we choose not to train them in real time and instead make available fixed models with repeatable behaviour. We could, if we wanted to, update the model weights continuously in response to feedback.
I think the biggest difference is that they need far more examples than we need, to learn anything.
Unknown for the actual largest models due to secrecy; roughly 1% of a human's for the largest public ones… but also, organic neurons are definitely a bit different from digital ones, and the jury is still out on whether those differences matter and, if so, by how much.
The comparison would therefore be with a mid-sized rodent, horse, or raven rather than a human.
(But even that's misleading, because the LLM doesn't have to use tokens to represent "contract left supracoracoideus" and "lay egg").
Edit: also, I've not heard much suggestion that anyone knows how certain genes do things like giving humans the inherent capability to recognise and create smiles or other similar reflexes, so we don't really know how much of our brains are pre-trained by evolution; furthermore, I think organic life is more sample-efficient at learning than any AI so far.
Tokens aren't a necessary differentiator here. There is no fundamental technical reason why tokenization is used, it just has certain practical advantages. And the distinction almost disappears when we look at multimodal transformers, which process images, audio, and video broken apart into sequences of blocks of binary data.
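For example (a simplified sketch, not any specific model's pipeline), a ViT-style tokeniser just slices an image into fixed-size patches and flattens each one into a vector that then plays the role of a token:

    import numpy as np

    def patchify(image, patch=16):
        # image: (H, W, C) array -> (num_patches, patch*patch*C) "tokens"
        H, W, C = image.shape
        rows, cols = H // patch, W // patch
        image = image[: rows * patch, : cols * patch]
        blocks = image.reshape(rows, patch, cols, patch, C).swapaxes(1, 2)
        return blocks.reshape(rows * cols, patch * patch * C)

    tokens = patchify(np.zeros((224, 224, 3)))
    print(tokens.shape)   # (196, 768): a sequence of 196 patch tokens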
There's no reason for any specific tokenisation, but the Transformer always has some tokenisation.
Tokens are allowed to be blocks of pixels, for example. No reason we couldn't have a token be a specific muscle or sensory nerve.
What I'm saying is that Large Language Models don't have a body, so no nerves and muscles to have to be represented within them; conversely, organic life does have those things and thus organic brains must spend some of their complexity on those things.
This means they have the possibility to equal us for language even with no capacity for vision, walking, tying shoelaces, or playing catch.
Even if from a technical perspective you're right, I think people need to be careful with the "x is not special" talk. It is a put-down, and it's how things like human and animal rights get obliterated and how the environment gets ruined.
"Trees aren't special", "Dolphins aren't special", "Koalas suck, let's put a mine here instead", "Pigs don't have emotions or are dumb, so it's fine to factory farm", etc.
Indeed. But I said "X is not magic", rather than "X is not special" — until we have an answer to the hard problem of consciousness (or agree which of the 40 definitions of the word "consciousness" we're using when discussing if an AI has it), we can't possibly determine if an LLM has it or not.
(My gut feeling says "LLMs are not conscious", but my gut has had a lot of false beliefs over the years as well as correct ones, so I give it a corresponding level of trust).
Fair enough then. I sort of use the terms interchangeably in this context.
When you think about it, a bird is “magic” in the sense that there is a whole universe and ecosystem giving that bird the platform for existence. A real living bird isn’t just a concept.
So sometimes I wonder if we just say we’re insignificant because it’s a simpler way to think. It makes the idea of death and loss easier to bear.
If I tell myself I’m just a speck of dust and that I’m not special, it can be quite comforting.
Conceptually we understand things about how birds work, but the fact that there is a blob of millions or billions of cells functioning to produce a bird, which can fly, completely autonomously, is quite peculiar. There is a type of magic or wonder to it all, which makes me think birds are both special and magic, if you think differently about existence and not just about the intellectual concept of a bird.
My gut feeling is that consciousness isn’t as deep and mysterious as people think it is. It’s possible that consciousness is an inevitable result of putting a sufficiently intelligent mind into a body and, as a result, the mind can’t help but weave a story about itself that connects events together.
Similarly with other properties of intelligence and the brain that we like to think are mysterious and deep.
I don't get the argument. I don't think something being magic will stop humans from exploiting it. At the end of the day, intelligent people are great at coming up with excuses for why they should do something bad: "Just chop that one tree down, it's in the wrong place anyway", "Just kill that one dolphin, it's old anyway". Taken together, these add up to bad outcomes we dislike. Much better to discourage / fine / ban all tree chopping and dolphin killing and let select professionals remove sick trees and dolphins.
The weather isn’t magic either. It’s produced by physical mechanisms. But everyone would probably agree that a model simulating some rough aggregate of those mechanisms isn’t “weather” itself.
On the other hand. Take that weather model and render its output into a stereoscopic 3D world with photorealistic particle systems and whatever. To someone wearing a Vision Pro or similar high-def VR headset, the model is now “the weather” in the system their senses occupy. It’s missing a lot of actual sensory cues — the rain isn’t wet, the wind won’t chill your skin, and so on. But it’s close enough for some convincing applications. A caveman with no experience with technology would undoubtedly believe himself transported into a different world with real weather.
LLMs are a bit like that now. Their simulation abilities took such a sudden leap, we’re like cavemen wearing headsets.
The only way I can model what you're trying to say, is if I assume you think "the mind" is a separate kind of substance, and not merely information processing that just happens to be implemented on biological electrochemistry in our skulls.
A (philosophical) dualist can easily say that no computation is ever intelligent. I don't think this can ever be said by a (philosophical) materialist.
On the contrary, sit and listen in a college cafeteria, and it quickly becomes apparent most conversation participants are LLMs.*
These are not synonyms, true.
I don't see uncertainty about intelligence as a property of an LLM as being equivalent to certainty about weather control as an effect of a forecast.
Among other things, whether weather was controlled would tend to be agreed by all observers, while it's often unclear if intelligence is being observed in these threads. :-)
---
* While my last line was a joke, humans in LLM mode was not. We can drive on autopilot and get where we need to go while not being able to remember how we got there. We definitely converse on autopilot too, indistinguishably from LLMs talking to each other: after an opening line, every word of every sentence in the entire exchange is perfectly predictable to a stranger. Are the speakers intelligent? What about the stranger who knows what they will say next? To say LLMs are not intelligent is easier if we agree humans spend a good deal of time being unintelligent.
GTA 5 is a simulation. Do you expect to be arrested outside your front door for the car you stole in-game?
Weather forecasting is a simulation, it tells you what the weather will look like in the next few days. It gets better as we get more sensors, collect more data and build more accurate models based on those two factors. It will never leap to weather.
Language forecasting (because this is what an LLM is) is a simulation. It tells you what the next token (word) will be based on what came before it. It gets better as we collect more data and hone and refine these models. It will never make the leap to intelligence.
To say that LLMs are intelligent means that language is a requirement for intelligence. That's some fairly magical thinking... but any sufficiently advanced technology...
Intelligence breaks the pattern here. A simulated intelligence is intelligent, just as simulated math is math and simulated computers are computers. The point of contention shouldn't be whether LLMs are intelligences or simulated intelligences, but whether they're simulating something else.
Right. This is Searle's "a simulated plane won't get you to Japan" argument.
That's true. But a simulated calculator is perfectly effective for doing your taxes.
Like Searle’s Chinese Room argument [0]?
I think a challenge with the simulated-is-real math/calculator argument is that the simulation operates syntactically, through derivation, without meaning.
E.g. a simulation of ZF set theory cannot tell you the truth value of the Axiom of Choice - because it’s independent of the ZF axioms (it is undecidable in the Gödel incompleteness sense).
But “Although originally controversial, the axiom of choice is now used without reservation by most mathematicians” [1] - I guess its truth is self-evident semantically.
So because of incompleteness, simulated math/calc will always be “missing” something.
Of course a LLM will happily say A of C is true (or not) but is it just parroting from the dataset or hallucinating?
[0]: https://plato.stanford.edu/entries/chinese-room/
[1]: https://en.m.wikipedia.org/wiki/Axiom_of_choice
Eh - I'm not really interested in rehashing this old argument. I'm just trying to point out the flaw in Searle's plane analogy.
Not sure if it counts, but there is a police chase video online someplace with a guy on drugs who claims he thought he was playing GTA. The way he throws people out of their vehicles and crashes their cars suggests he wasn't lying.
Due to quantum theory and chaos theory it is impossible to simulate any system to 100%. Yet, this does not mean it is impossible to design intelligent systems which are indistinguishable from their 'real' counterparts. Maybe we are at the level where a fly can be simulated accurately enough to make a distinction moot, maybe we have enough compute to simulate a mouse. We will get to a point where we can simulate a human brain. It will be indistinguishable from intelligence. I don't think the methodology really matters. In the end everything is compute.
When I was a kid, it was the definition of intelligence that separated humans from animals.
And there's a reason "dumb" means "mute" and independently "stupid".
It may well be an incorrect requirement. It may be a single form of intelligence out of many which happen to correlate in humans, but not in minds created by artifice.
But it does have a history.
Excuse the bluntness, but you're the CTO of a fintech company. Your analysis of people's social lives is probably about as valuable as a janitor's.
Let's address what is being said, rather than who is saying it. The latter doesn't turn into an interesting conversation.
What's being said is incredibly uninteresting, mostly because of the source.
I expect an observant janitor would have quite useful insights into people's social lives
LLMs were specifically trained to emulate human interaction patterns. Of course we sound like them at times. It's the things we can do that they can't that are relevant.
If I study Einstein and learn to do a really good impression, the statement "Einstein often sounds like karmacondon" will be true. That does not make me Einstein.
Wrong alt, hooande ;)
Some people report speaking like this: opening their mouths and not knowing how the sentence will end.
I don't experience that, I think.
Possibly used to? I have in the past had some autonomous verbal responses, for a bit this included echoing greetings — great when it's "hello", embarrassing when it's "happy birthday".
Kinda; System 1, system 2 — the best LLMs do better than most people's system 1, worse than most people's system 2. Bat and ball, $1.10.
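(For anyone who hasn't seen it: the bat-and-ball question is the classic System 1 trap. A bat and a ball cost $1.10 together, and the bat costs $1.00 more than the ball. System 1 blurts out "10 cents"; written down, b + (b + 1.00) = 1.10, so the ball costs 5 cents and the bat $1.05.)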
Why is it so important to you that everyone recognizes this intelligence? What is at stake in your mind here?
This impulse towards reductivism/behaviorism in order to defend the LLMs is still profoundly interesting. It always ends up feeling like the person wants to be like an LLM, not the other way around. I think people feel lost in a deep way, and this line of thought becomes deeply comforting.
Like, it seems so many people want the future and themselves to become comprehensible all at once. "Why worry so much about myself? I'm just a stochastic parrot like an LLM anyway... Attention is all I need!"
I get it, life is hard. But we need to keep the dream alive. You gotta hope for better.
All this makes the future sound so dull. Like I'm going to wake up one day and all pizza will be shitty, tasteless pizza, but everyone will tell me: "well really, look at it, it has cheese, sauce, toppings... It's pizza! You can eat it."
I don't think many people believe that LLMs are a way to AGI (whatever that actually means). But LLMs can still have many valid uses even if their prospects are limited in scope.
I recently read an interesting thread that laid out the case for LLMs being a path to AGI: https://old.reddit.com/r/singularity/comments/13ox85j/how_do...
The argument boils down to the idea that language isn't simply strings of words or bits of factual information, but an actual encoding of logic. By training statistical models on vast amounts of logic, we've given them a generalizable ability to perform logic. A sufficiently advanced LLM could thus potentially fulfill some definition of AGI.
To be clear, this doesn't in any way imply that LLMs could ever fit the definition of artificial consciousness, which would be a completely different form of strong AI. They're effectively just mathematical functions (albeit extremely complicated ones), which simply take inputs and return outputs without any intervening subjective experience. Even if they can perform a complicated task, retrieve and effectively summarize complicated information, or say all the right things as a conversational partner, they have no concept of the meaning of their output.
Maybe that limitation in itself puts a ceiling on their potential. Maybe the best possible LLM can only ever be 99.99% effective, and that 0.01% of the time it will go completely off the rails and disregard its instructions or hallucinate something ridiculous. Maybe the only way to overcome that is by keeping a human or a true artificial consciousness in the loop, in which case LLMs would still be extremely useful, but a flawed AGI if "AGI" at all. Or maybe a sufficiently advanced LLM and/or a sufficiently advanced error correction architecture will actually be enough to mitigate those issues.
I don't have a strong opinion on where LLMs are ultimately headed, but I'm looking forward to seeing how it all unfolds. It's amazing how capabilities that were strictly in the realm of sci-fi so quickly became mundane.
So are human brains, which are subject to the laws of physics, and which work just as mechanistically as any computer.
Unless you hold a dualist view that the brain accesses a spiritual realm outside of the physical world, then the fact that a computer operates mechanistically does not mean that it lacks consciousness.
The process of a human responding to a prompt isn't the same process an LLM follows. It involves subjectively experiencing being asked the question, having feelings about the question, possibly visualizing something related to the question, possibly reflecting on memories, wondering about how possible answers might be received and affect their future reputation, expressing their answer with a range of different emotions, and so on.
There may be aspects of the brain that behave like statistical models, but the broader system seems more complex than that. I don't see that as in any way inherently spiritual. I expect that it could be artificially reproduced one way or another, but would be extremely complicated.
It's not the same process, but it is a deterministic function, which was one of your objections to LLMs. Humans operate according to physical laws, after all.
LLMs are definitely here to stay. Even if they don't turn out to be the road to AGI, they can be used by all sorts of sub-AGI agents as a "language centre". An encoder can be used to extract meaning from input, and an autoregressive decoder conditioned on the agent's internal state can be used to keep a conversation going. What's not clear at all is whether the traditional transformer architecture will endure.
Please tell Sam Altman ASAP
Thanks!
You think he doesn’t know?
Everything he says is marketing for OpenAI.
Same as any other CEO with their company.
There are plenty of people - technical and non-technical - who seem to be acting like AGI is right around the corner thanks to LLMs, and who are, more broadly, vastly overstating the current capabilities of LLMs. I’m observing this in real life as much as on the internet. There are two very distinct groups of people that stand out to me: (1) High level execs with vested interests around AI and (2) Managers who haven’t even bothered to create an OpenAI account and are asking their subordinates to use ChatGPT for them, in what is an unforeseen usage of LLMs: by human proxy.
I think you are missing a step. A lot of people believe AI will advance so much that it will be indistinguishable from the best possible human reasoning. The evolution of LLMs just gives us a clue about the speed of improvement of AI. That does not mean that LLMs, which are one form of AI, will become AGI. They are just one path that AI is following, and will probably become a subset of something more advanced.
Except that a weather forecasting model can't experiment on the weather, whereas an LLM system may be designed to perform experiments and take feedback?
But if I set the temperature to 0, the model will pick the most probable token and the output will always be the same. We already know that this by no means guarantees a correct answer. So how can multiple runs be better?
Yes, but picking the most similar output from a bunch of queries with a higher temperature is not the same thing as the output from a single low temperature query.
Possibly, but it still doesn't explain why multiple runs result in a better answer. The authors also haven't compared the multiple-run results with a single run at zero temperature. So maybe all the overhead just achieves the same result already encoded in the network? I don't know.
Also, the result is somewhat counterintuitive. We know that with a student who has a low level of understanding, if we ask a hard question and let them try many times, the most accurate answer is often not the most popular one but a single outlier. And that is with someone who retains memory, has reasoning capacity, and learns continuously, none of which is the case with an LLM.
Btw: HN is for discussion. If some just want to vote for the beauty contest, please leave.
I found this other paper that tests Temperature: https://arxiv.org/abs/2402.05201
It appears that temperature has no impact on problem solving performance. So this paper isn't getting improved performance because the token for the correct answer is more probable.
My theory is that the multiple queries are allowing the whole probability space of possible answers to be sampled. Not just the probabilities of the most likely output token, but the probabilities of all possible internal model states.
And sampling that probability space of the whole model state and finding the average is a very different mathematical operation to just picking a single model state at random and then picking the most probable output tokens.
I wonder if there is a clever/more efficient shortcut that could come from before the sample is taken on each token. We have the logits after all.
If I'm reading this correctly, they had to discard Llama 2 answers and only use GPT-3.5 given answers to test the hypothesis.
GPT-3.5 answering questions through the OAI API alone is not an acceptable method of testing problem solving ability across a range of temperatures. OpenAI does some blackbox wizardry on their end.
There are many complex and clever sampling techniques for which temperature is just one (possibly dynamic) component
One example from the llama.cpp codebase is dynamic temperature sampling
https://github.com/ggerganov/llama.cpp/pull/4972/files
Not sure what you mean by whole model state given that there are tens of thousands of possible tokens and the models have billions of parameters in XX,XXX-dimensional space. How many queries across how many sampling methods might you need? Err..how much time? :)
This is a bad analogy.
Here’s what is actually happening with no “common sense but wrong” understanding of it:
- You have a set of probabilities per token.
- You randomize them.
This is not a “bad student being asked multiple times”; it is a system with randomized probabilities, creating a probability distribution.
If you want to see what a probability distribution looks like (e.g. an electron cloud), then sampling only once is the wrong way to do it.
You basically have two distributions; the first one is the LLM, the second one is the shape generated by adding the random factor in the temperature.
This allows you to escape the “local maxima” encoded in the LLM distribution to find highly probable solutions that are outside the sample space of the “zero temperature”.
If you want a better analogy, look up at the night sky full of stars. Draw a circle in the sky; that’s the LLM distribution.
The result from a zero temperature will be the brightest point in that circle.
When you push the temperature up, you blur the sky randomly. Some points become brighter, some dimmer, but the radius of the circle increases.
If there is a very bright point outside the original sample circle, 10x brighter than the brightest point inside it, then repeated random samples will repeatedly find it.
It makes perfect sense that an expanded probability distribution sampled repeatedly could find a “good average solution” if that solution is significantly better than the best “zero temp” solution.
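A minimal sketch of that picture (toy numbers, nothing from the paper): temperature rescales the logits before the softmax, so a higher temperature flattens the distribution that gets sampled and lets lower-probability tokens through.

    import numpy as np

    def sample(logits, temperature, n=10, seed=0):
        rng = np.random.default_rng(seed)
        logits = np.asarray(logits, dtype=float)
        if temperature == 0:                      # greedy: always the single brightest point
            return [int(np.argmax(logits))] * n
        z = logits / temperature                  # higher T flattens, lower T sharpens
        p = np.exp(z - z.max())
        p /= p.sum()
        return list(rng.choice(len(p), size=n, p=p))

    logits = [2.0, 1.5, 0.5]                      # token 0 is the "brightest point"
    print(sample(logits, temperature=0))          # all 0s
    print(sample(logits, temperature=1.0))        # mostly 0, sometimes 1 or 2
    print(sample(logits, temperature=2.0))        # spread out further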
This is the same reason we have 'temp' at all; by widening the solution space probability distribution, you can find better maxima. Turns out, sampling multiple times lets you have more chances to find better maxima.
This is more like "well that seems obviously like a good idea" than "somewhat counterintuitive"; it's just slow and expensive to do it.
You can also adjust the probability distribution by other existing methods, obviously; what's surprising here is not that it works, but that it seems to work so well. Probably (and I note they did not try this in their paper) multi-sample + voting on the output from other methods would also be highly effective.
Just from reading the comments around here, it feels intuitive to me that looking at a heatmap of a cascading pendulum would be more “accurate” than looking at just one snapshot, and also that the joints on the pendulums don’t necessarily need to be interlinked between iterations of the simulation.
According to their code, they used temperature 1. https://anonymous.4open.science/r/more_agent_is_all_you_need...
Could multiple agents be used such that tokens emitted from LLM A are passed to B and the output of B is passed back to A, meaning two agents would generate an output in a simple round-robin way? Both would share context in this case. My computer isn't big enough to run two large models, but this could perhaps be tried on tiny models.
I realize that for more than two, very specialised agents this will require some intelligent way to pass the output to the relevant specialist agents only. And it also means that there must be some overlap between the agents.
That is what’s already been done under the term "multi-agent". This paper argues that there’s no need for any such message-passing or context sharing, you just literally run the same query several times on the same model, fully independently, and then pick a "typical" reply according to some similarity metric.
The paper says that it enhances multi-agent methods. It is not a replacement for them; it's an enhancement of existing methods.
Running the same query several times on the same model and taking the consensus opinion is still a multi-agent method.
how is chain-of-thought multi-agent?
How is what’s described here chain-of-thought?
They were replying to this:
I expect that to give you something close to the confidence of the underlying model to some specific claim, which is good, but I still expect legends (urban and cultural) to be high-ranked.
They'd be very human mistakes, but still mistakes.
I think the only way past that is to build a world model, look for contradictions, and then look for new evidence to resolve those contradictions.
It would be interesting to plug this into a Bayesian-optimization-like framework: find the regions of language space where the models maximally disagree, then target those areas for extra training.
This is a very similar idea to ensemble models, which have been used for a long time in ML and proven to be very good. You average out the results of several predictors (or you let them vote and pick the most common prediction value), thereby reducing the noise in the prediction by choosing the common denominator of multiple predictions.
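For a concrete (non-LLM) version of the same idea, here's a quick sketch using scikit-learn's hard-voting ensemble; the particular base classifiers and dataset are arbitrary choices for illustration:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB

    X, y = load_iris(return_X_y=True)

    # Three independent predictors; the ensemble returns the majority class.
    ensemble = VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
            ("nb", GaussianNB()),
        ],
        voting="hard",
    )

    print(cross_val_score(ensemble, X, y, cv=5).mean())                        # voted ensemble
    print(cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean())  # single predictor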
This is done in aerospace as well… however, even different teams clean-room writing to the same spec tend to make the same errors in their code, which ends up breaking the statistical model under which this approach was selected.
I'm not sure you have read the paper at all. Chain of thought prompting is not a multi-agent algorithm. The paper says that it enhances existing methods such as prompt engineering (chain of thought) and multi-agent debate. The sampling method presented in the paper is orthogonal to those methods.
Not my experience. I had multiple LLMs hallucinate hard when asked the same question multiple times. The only way to break the cycle is to follow everything up with questions demanding clarification: "are you sure?", "this is wrong, correct the answer".
The good news is that you can use this setup for self-supervised RL (artificial dreaming? increasing contrast?).
I had a very similar idea a few months ago. I wanted to use this approach to have the LLM provide the probability that the generated answer is correct. The probability would simply be the fraction of all generated answers that matched the selected one. (Each answer would be generated with a different seed, and the question would be single-choice.) The two issues I found were 1) the cost, and 2) on some problems, LLMs are wrong more often than they are right.
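A sketch of that estimate (query_llm again being a placeholder for your model call, and answers assumed to be single-choice labels): count the votes and report the winning fraction as a rough confidence.

    from collections import Counter

    def answer_with_confidence(query_llm, prompt, n=20):
        votes = Counter(query_llm(prompt) for _ in range(n))   # e.g. answers like "A", "B", "C"
        answer, count = votes.most_common(1)[0]
        return answer, count / n                               # e.g. ("B", 0.65)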
Hopefully, as inference gets cheaper and of higher quality, someone will come up with a more feasible solution.
I don't think this type of method can scale indefinitely, it's essentially just "better" sampling within dense areas of knowledge space. It cannot help with better exploration outside these dense areas, because these explorations won't have a consensus among agents almost by definition.
My impression from GitHub Copilot is that hallucinations are the result of certain true facts having a low likelihood, with Copilot giving you the most likely answer anyway.
Typically I have a certain library that does things in a very unorthodox and undocumented way, and when I ask Copilot for an example it gives me wonderful, totally understandable code full of made-up functions that I wouldn't need in the first place if the library actually worked that way.
I don't think that running that query multiple times would help.