The effectiveness of search goes hand-in-hand with the quality of the value function. But today, value functions are incredibly domain-specific, and there is weak or no current evidence (as far as I know) that we can make value functions that generalize well to new domains. This article effectively makes a conceptual leap from "chess has good value functions" to "we can make good value functions that enable search for AI research". I mean yes, that'd be wonderful - a holy grail - but can we really?
In the meantime, 1000x or 10000x inference time cost for running an LLM gets you into pretty ridiculous cost territory.
Yeah, Stockfish is probably evaluating many millions of positions when looking 40-ply ahead, even with the limited number of legal chess moves in a given position, and with an easy criterion for heavy early pruning (once a branch becomes losing, there's not much point continuing it). I can't imagine the cost of evaluating millions of LLM continuations, just to select the optimal one!
Where tree search might make more sense applied to LLMs is for coarser-grained reasoning, where the branching isn't based on alternate word continuations but on alternate what-if lines of thought. But even then, costs could easily become prohibitive, both for generation and evaluation/pruning, and using such a biased approach seems as much to fly in the face of the bitter lesson as to be suggested by it.
Yes, absolutely, and well put - a strong property of chess is that next states are fast and cheap to enumerate, which makes search particularly easy and strong, whereas with an LLM next states are much slower, harder to define, and more expensive to enumerate.
The cost of the LLM isn't the only or even the most important cost that matters. Take the example of automating AI research: evaluating moves effectively means inventing a new architecture or modifying an existing one, launching a training run, and evaluating the new model on some suite of benchmarks. The ASI has to do this in a loop, gather feedback, and update its priors - what people refer to as "grad student descent". The cost of running each train-eval iteration during your search is going to be significantly more than the cost of generating the code for the next model.
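A rough sketch of that loop (purely illustrative; `propose`, `train`, and `evaluate` are placeholders, not any real API):

    # Illustrative "grad student descent" loop: each search step requires a full
    # train + eval cycle, which dwarfs the cost of generating the candidate itself.
    def grad_student_descent(propose, train, evaluate, budget):
        history = []                      # feedback used to update priors
        best, best_score = None, float("-inf")
        for _ in range(budget):
            candidate = propose(history)  # e.g. an LLM emitting an architecture change
            model = train(candidate)      # by far the most expensive step
            score = evaluate(model)       # benchmark suite
            history.append((candidate, score))
            if score > best_score:
                best, best_score = candidate, score
        return best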
You're talking about applying tree search as a form of network architecture search (NAS), which is different from applying it to LLM output sampling.
Automated NAS has been tried for (highly constrained) image classifier design, before simpler designs like ResNets won the day. Doing this for billion parameter sized models would certainly seem to be prohibitively expensive.
I'm not following. How do you propose search is performed by the ASI designed for "AI Research"? (as proposed by the article)
Fair enough - he discusses GPT-4 search halfway down the article, but by the end is discussing self-improving AI.
Certainly compute to test ideas (at scale) is the limiting factor for LLM developments (says Sholto @ Google), but if we're talking about moving beyond LLMs, not just tweaking them, then it seems we need more than architecture search anyway.
Well people certainly are good at finding new ways to consume compute power. Whether it’s mining bitcoins or training a million AI models at once to generate a “meta model” that we think could achieve escape velocity. What happens when it doesn’t? And Sam Altman and the author want to get the government to pay for this? Am I reading this right?
Isn't evaluating against different effective "experts" within the model effectively what MoE [1] does?
[1] https://en.wikipedia.org/wiki/Mixture_of_experts
No - MoE is just a way to add more parameters to a model without increasing the cost (number of FLOPs) of running it.
The way MoE does this is by having multiple alternate parallel paths through some parts of the model, together with a routing component that decides which path (one only) to send each token through. These paths are the "experts", but the name doesn't really correspond to any intuitive notion of an expert. So, rather than having 1 path with N parameters, you have M paths (experts), each with N parameters, but each token only goes through one of them, so the number of FLOPs is unchanged.
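A minimal PyTorch-style sketch of that structure (illustrative only; top-1 routing, no load balancing):

    # Illustrative top-1 MoE layer: M expert MLPs plus a router that sends each
    # token through exactly one of them, so per-token FLOPs match a single expert
    # while total parameters grow M-fold.
    import torch
    import torch.nn as nn

    class Top1MoE(nn.Module):
        def __init__(self, d_model, d_hidden, num_experts):
            super().__init__()
            self.router = nn.Linear(d_model, num_experts)   # routing component
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                              nn.Linear(d_hidden, d_model))
                for _ in range(num_experts)
            ])

        def forward(self, x):                               # x: (tokens, d_model)
            expert_idx = self.router(x).argmax(dim=-1)      # one expert per token
            out = torch.zeros_like(x)
            for i, expert in enumerate(self.experts):
                mask = expert_idx == i
                if mask.any():
                    out[mask] = expert(x[mask])             # each token takes 1 path
            return out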
With tree search, whether for a game like Chess or potentially LLMs, you are growing a "tree" of all possible alternate branching continuations of the game (sentence), and keeping the number of these branches under control by evaluating each branch (= sequence of moves) to see if it is worth continuing to grow and, if not, discarding it ("pruning" it off the tree).
With Chess, pruning is easy since you just need to look at the board position at the tip of the branch and decide if it's a good enough position to continue playing from (extending the branch). With an LLM each branch would represent an alternate continuation of the input prompt, and to decide whether to prune it or not you'd have to pass the input + branch to another LLM and have it decide if it looked promising or not (easier said than done!).
So, MoE is just a way to cap the cost of running a model, while tree search is a way to explore alternate continuations and decide which ones to discard, and which ones to explore (evaluate) further.
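For concreteness, a hypothetical sketch of that expand-and-prune loop (`generate_continuations` and `score_with_llm` are placeholders, each of which would be its own expensive LLM call):

    # Hypothetical expand-and-prune loop over LLM continuations. Every expansion
    # and every pruning decision is an LLM call, which is where the cost blows up.
    def tree_search(prompt, generate_continuations, score_with_llm,
                    depth=3, branch=4, keep=2):
        frontier = [prompt]
        for _ in range(depth):
            candidates = []
            for text in frontier:
                for cont in generate_continuations(text, n=branch):
                    candidates.append(text + cont)
            # "pruning": ask another LLM which branches look promising
            candidates.sort(key=score_with_llm, reverse=True)
            frontier = candidates[:keep]
        return frontier[0]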
How does MoE choose an expert?
From the outside, and if we squint a bit, this looks a lot like an inverted attention mechanism where the token attends to the experts.
Usually there’s a small neural network that makes the choice for each token in an LLM.
From what I can gather it depends, but could be a simple Softmax-based layer[1] or just argmax[2].
There was also a recent post[3] about a model where they used a cross-attention layer to let the expert selection be more context aware.
[1]: https://arxiv.org/abs/1701.06538
[2]: https://arxiv.org/abs/2208.02813
[3]: https://news.ycombinator.com/item?id=40675577
I don't know the details, but there are a variety of routing mechanisms that have been tried. One goal is to load balance tokens among the experts so that each expert's parameters are equally utilized, which it seems must sometimes conflict with wanting to route to an expert based on the token itself.
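A sketch of one common load-balancing recipe (roughly along the lines of the Switch Transformer's auxiliary loss, reconstructed from memory, so treat the details as approximate):

    # Approximate load-balancing auxiliary loss: penalize the router when the
    # fraction of tokens sent to each expert diverges from its average routing
    # probability mass; the loss is minimized when tokens are spread evenly.
    import torch

    def load_balance_loss(router_logits):
        # router_logits: (tokens, num_experts)
        probs = torch.softmax(router_logits, dim=-1)
        num_tokens, num_experts = probs.shape
        assignment = probs.argmax(dim=-1)                   # token -> chosen expert
        f = torch.bincount(assignment, minlength=num_experts).float() / num_tokens
        p = probs.mean(dim=0)                               # mean prob per expert
        return num_experts * torch.sum(f * p)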
This confuses me. Positions that seem like they could be losing (but haven’t lost yet) could become winning if you search deep enough.
Yes, and even a genuinely losing position (with perfect play) can still win if it is sharp enough and causes your opponent to make a mistake! There's also just the relative strength of one branch vs another - you have to prune some if there are too many.
I was just trying to give the flavor of it.
Chess engines typically assume that the opponent plays to the best of their abilities, don't they?
The Contempt Factor is used by engines sometimes.
"The Contempt Factor reflects the estimated superiority/inferiority of the program over its opponent. The Contempt factor is assigned as draw score to avoid (early) draws against apparently weaker opponents, or to prefer draws versus stronger opponents otherwise."
https://www.chessprogramming.org/Contempt_Factor
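Illustratively (sign conventions differ between engines), the idea is just that a draw is not scored as a neutral 0:

    # Illustrative only: with positive contempt the engine treats a draw as
    # slightly bad for itself (avoid early draws vs. apparently weaker opponents);
    # with negative contempt a draw looks slightly good (prefer draws vs. stronger ones).
    def draw_score(contempt_centipawns):
        return -contempt_centipawns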
Imagine that there was some non-constructive proof that white would always win in perfect play. Would a well constructed chess engine always resign as black? :P
To do that, the LLM would have to have some notion of "lines of thought". They don't. That is completely foreign to the design of LLMs.
Right - this isn't something that LLMs currently do. Adding search would be a way to add reasoning. Think of it as part of a reasoning agent - external scaffolding similar to tree of thoughts.
Pruning isn't quite as easy as you make it sound. There are lots of famous examples where chess engines misevaluate a position because they prune out apparently losing moves that are actually winning.
Eg https://youtu.be/TtJeE0Th7rk?si=KVAZufm8QnSW8zQo
Self-evaluation might be good enough in some domains? Then the AI is doing repeated self-evaluation, trying things out to find a response that scores higher according to its self metric.
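Something like repeated best-of-n sampling against the model's own score (a toy sketch; `generate` and `self_score` are placeholders):

    # Toy sketch of repeated self-evaluation: sample several candidate responses
    # and keep whichever one the model itself rates highest. This only helps if
    # the self-metric actually tracks real quality in the domain.
    def best_of_n(prompt, generate, self_score, n=8):
        candidates = [generate(prompt) for _ in range(n)]
        return max(candidates, key=lambda response: self_score(prompt, response))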
Being able to fix your errors and improve over time until there are basically no errors is what humans do. So far, all AI models just corrupt knowledge; they don't purify knowledge the way humanity does, except when scripted with a good value function by a human, as with AlphaGo, where the value function is winning games.
This is why you need to constantly babysit today's AI and tell it to do steps and correct itself all the time: you are much better at getting to pure knowledge than the AI is, and it would quickly veer off into nonsense otherwise.
You've got to take a step back and look at LLMs like ChatGPT. With 180 million users and assuming 10,000 tokens per user per month, that's 1.8 trillion interactive tokens per month.
LLMs are given tasks, generate responses, and humans use those responses to achieve their goals. This process repeats over time, providing feedback to the LLM. This can scale to billions of iterations per month.
The fascinating part is that LLMs encounter a vast diversity of people and tasks, receiving supporting materials, private documents, and both implicit and explicit feedback. Occasionally, they even get real-world feedback when users return to iterate on previous interactions.
In the role of an assistant, LLMs are primed to learn from the outcomes of their actions, scaled across many people. Thus they can learn from our collective feedback signals over time.
Yes, that relies a lot on humans in the loop, not just the real world in the loop, but humans are also dependent on culture and society; I see no need for AI to be able to do it without society. I actually think that AGI will be a collective/network of humans and AI agents, and this perspective fits right in. AI will be the knowledge and experience flywheel of humanity.
To what extent do you know this to be true? Can you describe the mechanism that is used?
I would contrast your statement with cases where ChatGPT generated something, I read it, noted various incorrect things, and then walked away. Further, there are cases where the human does not realize there are errors. In both cases I'm not aware of any kind of feedback loop that would even really be possible - I never told the LLM it was wrong. Nor should the LLM assume it was wrong because I run more queries. Thus, there is no signal back that the answers were wrong.
Hence, where do you see the feedback loop existing?
Like, for example, a developer working on a project will iterate many times; some code generated by the AI might produce errors, and they will discuss that with the model to fix it. This way the model gets not just single-round interactions, but multi-round interactions with feedback.
I think the general pattern will be people sticking with the task longer when it fails, trying to solve it with persistence. This is all aggregated over a huge number of sessions and millions of users.
To protect privacy, we could train only preference models from this feedback data, and then fine-tune the base model without using the sensitive interaction logs directly. The model would learn a preference for how to act in specific contexts, but not remember the specifics.
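For example, a standard pairwise (Bradley-Terry style) preference loss could be trained on "response A was preferred over B in this context" tuples mined from the logs, so the later fine-tune never touches the raw interactions (a sketch under those assumptions; `reward_model` is a placeholder scoring function):

    # Sketch: train a preference/reward model on pairwise comparisons only.
    # reward_model(context, response) -> scalar score tensor (placeholder).
    import torch.nn.functional as F

    def preference_loss(reward_model, context, chosen, rejected):
        r_chosen = reward_model(context, chosen)
        r_rejected = reward_model(context, rejected)
        # push the preferred response's score above the dispreferred one's
        return -F.logsigmoid(r_chosen - r_rejected).mean()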
This is where I quibble. Accurately detecting that someone is actually still on the same task (indicating they are not satisfied) is perhaps as challenging as generating the answer to begin with.
That is why I also mentioned the cases where people don't know the result was incorrect. That'll potentially drive a strong (but false) "this answer was correct" signal.
So, during development of a tool, I can envision that feedback loop. But for something simply presented to millions, without a way to determine false negatives or false positives - how exactly does that feedback loop work?
I think AGI is going to have to learn itself on-the-job, same as humans. Trying to collect human traces of on-the-job training/experience and pre-training an AGI on them seems doomed to failure, since the human trace is grounded in the current state (of knowledge/experience) of the human's mind, but what the AGI needs is updates relative to its own state. In any case, for a job like (human-level) software developer, AGI is going to need human-level runtime reasoning and learning (>> in-context learning by example), even if it were possible to "book pre-train" it rather than train it on the job.
Outside of repetitive genres like CRUD-apps, most software projects are significantly unique, even if they re-use learnt developer skills - it's like Chollet's ARC test on mega-steroids, with dozens/hundreds of partial-solution design techniques, and solutions that require a hierarchy of dozens/hundreds of these partial-solutions (cf Chollet core skills, applied to software) to be arranged into a solution in a process of iterative refinement.
There's a reason senior software developers are paid a lot - it's not just economically valuable, it's also one of the more challenging cognitive skills that humans are capable of.
Kinda, but also not entirely.
One of the things OpenAI did to improve performance was to train an AI to determine how a human would rate an output, and use that to train the LLM itself. (Kinda like a GAN, now that I think about it.)
https://forum.effectivealtruism.org/posts/5mADSy8tNwtsmT3KG/...
But this process has probably gone as far as it can go, at least with current architectures for the parts, as per Amdahl's law.
This works perfectly in games. e.g. Alpha Zero. In other domains, not so much.
Games are closed systems. There’s no unknowns in the rule set or world state because the game wouldn’t work if there were. No unknown unknowns. Compare to physics or biology where we have no idea if we know 1% or 90% of the rules at this point.
Self-evaluation would still work great even where there are probabilistic and changing rule sets. The linchpin of the whole operation is automated loss function evaluation, not a set of known and deterministic rules. Once you have to pay and employ humans to compute loss functions, the scale falls apart.
Sorry but I have to ask: what makes you think this would be a good idea?
This will just lead to the evaluatee finding anomalies in the evaluator and exploiting them for maximum gains. It has happened many times already: an ML model controlled an object in a physical-world simulator, and all it learned was to exploit simulator bugs [1].
[1] https://boingboing.net/2018/11/12/local-optima-r-us.html
That's a natural tendency for optimization algorithms.
We humans learn our own value function.
If I get hungry for example, my brain will generate a plan to satisfy that hunger. The search process and the evaluation happen in the same place, my brain.
The "search" process for your brain structure took 13 billion years and 20 orders of magnitude more computation than we will ever harness.
So what’s your point? That we can’t create AGI because it took evolution a really long time?
Creating a human-level intelligence artificially is easy: just copy what happens in nature. We already have this technology, and we call it IVF.
The idea that humans aren't the only way of producing human-level intelligence is taken as a given in many academic circles, but we don't really have any reason to believe that. It's an article of faith (as is its converse – but the converse is at least in-principle falsifiable).
“Creating a human-level intelligence artificially is easy: just copy what happens in nature. We already have this technology, and we call it IVF.”
What’s the point of this statement? You know that IVF has nothing to do with artificial intelligence (as in intelligent machines). Did you just want to sound smart?
Of course it is related to the topic.
It is related because the goal of all of this is to create human level intelligence or better.
And that is a probable way to do it, unlike these other, less established methods that we don't know whether they will work or not.
Even if the best we ever do is something with the intelligence and energy use of the human brain, that would still be a massive (5 OOMs?) improvement on the status quo.
You need to pay people, and they use a bunch of energy commuting, living in air conditioned homes, etc. which has nothing to do with powering the brain.
I don't think there's much in our brain of significance to intelligence older than ~200M years.
200M years ago you had dinosaurs; they were significantly dumber than mammals.
400M years ago you had fish and arthropods, even dumber than dinosaurs.
Brain size grows as intelligence grows: the smarter you are, the more use you have for compute, so the bigger your brain gets. It took a really long time for intelligence to develop enough that brains as big as mammals' were worth it.
Big brain (intelligence) comes at a huge cost, and is only useful if you are a generalist.
I'd assume that being a generalist drove intelligence. It may have started with warm-bloodedness and feathers/fur, and been further boosted in mammals with milk production (& similar caring for young by birds) - all features that reduce dependence on specific environmental conditions and therefore expose the species to more diverse environments where intelligence becomes valuable.
I'm surprised you think we will harness so little computation. The universe's lifetime is many orders of magnitude longer than 13 billion years - let alone the 4.5 billion years of Earth's own history - and the universe is much larger than Earth's biosphere, most of which probably has not been exploring the space of possible computations very efficiently.
Neither the Earth nor life have been around for 13 billion years.
I think we have ok generalized value functions (aka LLM benchmarks), but we don't have cheap approximations to them, which is what we'd need to be able to do tree search at inference time. Chess works because material advantage is a pretty good approximation to winning and is trivially calculable.
Stockfish doesn't use material advantage as an approximation to winning though. It uses a complex deep learning value function that it evaluates many times.
Still, the fact that there are obvious heuristics makes that function easier to train and presumably means it doesn't need an absurd number of weights.
No, without assigning value to pieces, the heuristics are definitely not obvious. You're talking about 20-year-old chess engines or beginner projects.
Everyone understands a queen is worth more than a pawn. Even if you don't know the exact value of one piece relative to another, the rough estimate "a queen is worth five to ten pawns" is a lot better than not assigning value at all. I highly doubt even 20 year old chess engines or beginner projects value a queen and pawn the same.
After that, just adding up the material on both sides, without taking into account the position of the pieces at all, is a heuristic that will correctly predict the winning player on the vast majority of all possible board positions.
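As a toy illustration of that heuristic (standard piece values, position ignored entirely):

    # Toy material-count evaluation: sum conventional piece values per side and
    # ignore placement. Crude, but it already separates most won/lost positions.
    PIECE_VALUES = {"P": 1, "N": 3, "B": 3, "R": 5, "Q": 9}   # king excluded

    def material_balance(pieces):
        """pieces: iterable of piece letters, uppercase = White, lowercase = Black."""
        score = 0
        for piece in pieces:
            value = PIECE_VALUES.get(piece.upper(), 0)
            score += value if piece.isupper() else -value
        return score   # positive favours White, negative favours Black

    # e.g. material_balance("QPP" + "pppr") == 11 - 8 == 3  (White is up material)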
He agrees with you on the 20yr old engines and beginner projects.
Do you believe that there will be a "general AI" breakthrough? I feel as though you have expressed the reason I am so skeptical of all these AI researchers who believe we are on the cusp of it (what "general AI" means exactly never seems to be very well-defined).
I think capitalistic pressures favor narrow superhuman AI over general AI. I wrote on this two years ago: https://argmax.blog/posts/agi-capitalism/
Since I wrote about this, I would say that OpenAI's directional struggles are some confirmation of my hypothesis.
summary: I believe that AGI is possible but will take multiple unknown breakthroughs on an unknown timeline, and most likely requires a long-term concerted effort with much less immediate payoff than pursuing narrow superhuman AI, such that serious efforts at AGI are not much incentivized under capitalism.
But I thought the history of capitalism is an invasion from the future by an artificial intelligence that must assemble itself entirely from its enemy’s resources.
NB: I agree; I think AGI will first be achieved with genetic engineering, which is a path of much lesser resistance than using silicon hardware (which is probably at minimum a century-plus away from being powerful enough to emulate a human brain).
All you need for a good value function is a high-quality simulation of the task.
Some domains have better versions of this than others (eg theorem provers in math precisely indicate when you've succeeded)
Incidentally, Lean could add a search-like feature to help human researchers, and this would advance AI progress on math as well.
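For instance, a proof assistant gives an exact, machine-checkable terminal signal - the proof either type-checks or it doesn't - so no learned value function is needed at the leaves (toy Lean 4 example):

    -- Toy Lean 4 example: the "evaluation" of a finished proof attempt is exact.
    -- Either this term type-checks against the stated theorem or it does not.
    theorem add_comm_example (a b : Nat) : a + b = b + a :=
      Nat.add_comm a b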