
AI Search: The Bitter-Er Lesson

mxwsn
57 replies
21h50m

The effectiveness of search goes hand-in-hand with the quality of the value function. But today, value functions are incredibly domain-specific, and there is weak or no current evidence (as far as I know) that we can make value functions that generalize well to new domains. This article effectively makes a conceptual leap from "chess has good value functions" to "we can make good value functions that enable search for AI research". I mean yes, that'd be wonderful - a holy grail - but can we really?

In the meantime, 1000x or 10,000x inference-time cost for running an LLM gets you into pretty ridiculous cost territory.

HarHarVeryFunny
20 replies
20h45m

Yeah, Stockfish is probably evaluating many millions of positions when looking 40-ply ahead, even with the limited number of legal chess moves in a given position, and with an easy criterion for heavy early pruning (once a branch becomes losing, not much point continuing it). I can't imagine the cost of evaluating millions of LLM continuations, just to select the optimal one!

Where tree search might make more sense applied to LLMs is for coarser-grained reasoning, where the branching isn't based on alternate word continuations but on alternate what-if lines of thought, but even then it seems costs could easily become prohibitive, both for generation and evaluation/pruning, and using such a biased approach seems as much to fly in the face of the bitter lesson as to be suggested by it.
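
To make the search-and-pruning idea concrete, here's a toy alpha-beta sketch in Python (nothing like real Stockfish, which adds a fast evaluation network, move ordering, transposition tables, etc.; evaluate, legal_moves and apply_move are hypothetical callbacks the caller would supply):

    def alphabeta(position, depth, alpha, beta, maximizing,
                  evaluate, legal_moves, apply_move):
        # Leaf: fall back to the static evaluation function.
        moves = legal_moves(position)
        if depth == 0 or not moves:
            return evaluate(position)
        if maximizing:
            best = float("-inf")
            for move in moves:
                child = apply_move(position, move)
                best = max(best, alphabeta(child, depth - 1, alpha, beta, False,
                                           evaluate, legal_moves, apply_move))
                alpha = max(alpha, best)
                if alpha >= beta:
                    break  # prune: opponent already has a better option elsewhere
            return best
        best = float("inf")
        for move in moves:
            child = apply_move(position, move)
            best = min(best, alphabeta(child, depth - 1, alpha, beta, True,
                                       evaluate, legal_moves, apply_move))
            beta = min(beta, best)
            if beta <= alpha:
                break  # prune
        return best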

mxwsn
5 replies
20h41m

Yes, absolutely, and well put - a strong property of chess is that next states are fast and easy to enumerate, which makes search particularly easy and strong, while with an LLM next states are much slower, harder to define, and more expensive to enumerate.

typon
4 replies
20h18m

The cost of the LLM isn't the only or even the most important cost that matters. Take the example of automating AI research: evaluating moves effectively means inventing a new architecture or modifying an existing one, launching a training run and evaluating the new model on some suite of benchmarks. The ASI has to do this in a loop, gather feedback and update its priors - what people refer to as "Grad student descent". The cost of running each train-eval iteration during your search is going to be significantly more than generating the code for the next model.
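
As a rough sketch of why that loop is so expensive (propose_variant and train_and_eval are placeholders for an LLM generating a new architecture and for a full training run plus benchmark suite, respectively):

    def grad_student_descent(base_config, propose_variant, train_and_eval, budget=20):
        # Each iteration pays for a full train + eval cycle; generating the
        # candidate is the cheap part, the feedback signal is the expensive part.
        best_config = base_config
        best_score = train_and_eval(base_config)
        for _ in range(budget):
            candidate = propose_variant(best_config)   # cheap: write some code
            score = train_and_eval(candidate)          # expensive: GPUs for days/weeks
            if score > best_score:
                best_config, best_score = candidate, score
        return best_config, best_score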

HarHarVeryFunny
2 replies
19h24m

You're talking about applying tree search as a form of network architecture search (NAS), which is different from applying it to LLM output sampling.

Automated NAS has been tried for (highly constrained) image classifier design, before simpler designs like ResNets won the day. Doing this for billion parameter sized models would certainly seem to be prohibitively expensive.

typon
1 replies
17h3m

I'm not following. How do you propose search is performed by the ASI designed for "AI Research"? (as proposed by the article)

HarHarVeryFunny
0 replies
16h34m

Fair enough - he discusses GPT-4 search halfway down the article, but by the end is discussing self-improving AI.

Certainly compute to test ideas (at scale) is the limiting factor for LLM developments (says Sholto @ Google), but if we're talking about moving beyond LLMs, not just tweaking them, then it seems we need more than architecture search anyway.

therobots927
0 replies
16h52m

Well, people certainly are good at finding new ways to consume compute power, whether it's mining bitcoins or training a million AI models at once to generate a "meta model" that we think could achieve escape velocity. What happens when it doesn't? And Sam Altman and the author want to get the government to pay for this? Am I reading this right?

byteknight
5 replies
20h22m

Isn't evaluating against different effective "experts" within the model effectively what MoE [1] does?

Mixture of experts (MoE) is a machine learning technique where multiple expert networks (learners) are used to divide a problem space into homogeneous regions.[1] It differs from ensemble techniques in that for MoE, typically only one or a few expert models are run for each input, whereas in ensemble techniques, all models are run on every input.

[1] https://en.wikipedia.org/wiki/Mixture_of_experts

HarHarVeryFunny
4 replies
19h37m

No - MoE is just a way to add more parameters to a model without increasing the cost (number of FLOPs) of running it.

The way MoE does this is by having multiple alternate parallel paths through some parts of the model, together with a routing component that decides which path (one only) to send each token through. These paths are the "experts", but the name doesn't really correspond to any intuitive notion of expert. So, rather than having 1 path with N parameters, you have M paths (experts) each with N parameters, but each token only goes through one of them, so number of FLOPs is unchanged.
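
A minimal numpy sketch of that routing idea (top-1 routing, loosely in the spirit of the Switch Transformer; real MoE layers work on batches and add capacity limits and load balancing):

    import numpy as np

    def moe_layer(token, router_weights, experts):
        # token: (d,) vector; router_weights: (num_experts, d); experts: list of callables.
        # Each token is routed to exactly one expert, so compute per token stays
        # constant no matter how many experts (extra parameters) you add.
        logits = router_weights @ token
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        chosen = int(np.argmax(probs))            # top-1 routing
        return probs[chosen] * experts[chosen](token)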

With tree search, whether for a game like Chess or potentially LLMs, you are growing a "tree" of all possible alternate branching continuations of the game (sentence), and keeping the number of these branches under control by evaluating each branch (= sequence of moves) to see if it is worth continuing to grow, and if not discarding it ("pruning" it off the tree).

With Chess, pruning is easy since you just need to look at the board position at the tip of the branch and decide if it's a good enough position to continue playing from (extending the branch). With an LLM each branch would represent an alternate continuation of the input prompt, and to decide whether to prune it or not you'd have to pass the input + branch to another LLM and have it decide if it looked promising or not (easier said than done!).

So, MoE is just a way to cap the cost of running a model, while tree search is a way to explore alternate continuations and decide which ones to discard, and which ones to explore (evaluate) further.
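
For contrast, a hedged sketch of what tree search over LLM continuations might look like (generate_continuations and value_model are placeholders for an LLM sampler and a separate evaluator; as noted above, making that evaluator any good is the hard part):

    import heapq

    def llm_tree_search(prompt, generate_continuations, value_model,
                        beam_width=4, depth=3):
        # Grow a tree of candidate continuations, pruning all but the
        # beam_width most promising branches at each level.
        frontier = [prompt]
        for _ in range(depth):
            candidates = []
            for branch in frontier:
                for continuation in generate_continuations(branch):  # expand
                    candidates.append(branch + continuation)
            frontier = heapq.nlargest(beam_width, candidates, key=value_model)  # prune
        return max(frontier, key=value_model)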

PartiallyTyped
3 replies
19h5m

How does MoE choose an expert?

From the outside, and if we squint a bit, this looks a lot like an inverted attention mechanism where the token attends to the experts.

telotortium
0 replies
18h6m

Usually there’s a small neural network that makes the choice for each token in an LLM.

HarHarVeryFunny
0 replies
18h6m

I don't know the details, but there are a variety of routing mechanisms that have been tried. One goal is to load balance tokens among the experts so that each expert's parameters are equally utilized, which it seems must sometimes conflict with wanting to route to an expert based on the token itself.
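
For flavor, one published trick (the auxiliary load-balancing loss from the Switch Transformer paper) looks roughly like this sketch, given the router's softmax outputs and the chosen expert for each token in a batch:

    import numpy as np

    def load_balancing_loss(router_probs, chosen_experts, num_experts):
        # router_probs: (num_tokens, num_experts) softmax outputs
        # chosen_experts: (num_tokens,) index of the expert each token was sent to
        # The loss is minimized when tokens are spread evenly across experts.
        frac_tokens = np.bincount(chosen_experts, minlength=num_experts) / len(chosen_experts)
        mean_probs = router_probs.mean(axis=0)
        return num_experts * float(np.dot(frac_tokens, mean_probs))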

vlovich123
4 replies
17h48m

and with an easy criterion for heavy early pruning (once a branch becomes losing, not much point continuing it)

This confuses me. Positions that seem like they could be losing (but haven’t lost yet) could become winning if you search deep enough.

HarHarVeryFunny
3 replies
17h39m

Yes, and even a genuinely (with perfect play) losing position can win if it is sharp enough and causes your opponent to make a mistake! There's also just the relative strength of one branch vs another - you have to prune some if there are too many.

I was just trying to give the flavor of it.

eru
2 replies
15h47m

[E]ven a genuinely (with perfect play) losing position can win if it is sharp enough and causes your opponent to make a mistake!

Chess engines typically assume that the opponent plays to the best of their abilities, don't they?

slyall
0 replies
15h17m

The Contempt Factor is used by engines sometimes.

"The Contempt Factor reflects the estimated superiority/inferiority of the program over its opponent. The Contempt factor is assigned as draw score to avoid (early) draws against apparently weaker opponents, or to prefer draws versus stronger opponents otherwise."

https://www.chessprogramming.org/Contempt_Factor
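
In engine terms, contempt usually just shifts the score the engine assigns to a draw; a toy, hypothetical sketch of the idea:

    def draw_score(contempt_centipawns, engine_side_to_move):
        # Positive contempt: treat draws as slightly bad for the engine, so it
        # avoids early draws against (presumed) weaker opponents. Negative
        # contempt: treat draws as slightly good, so it accepts them against
        # stronger opponents. The sign flips with the side to move.
        return -contempt_centipawns if engine_side_to_move else contempt_centipawns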

nullc
0 replies
4h40m

Imagine that there was some non-constructive proof that white would always win with perfect play. Would a well-constructed chess engine always resign as black? :P

AnimalMuppet
1 replies
17h27m

Where tree search might make more sense applied to LLMs is for coarser-grained reasoning, where the branching isn't based on alternate word continuations but on alternate what-if lines of thought...

To do that, the LLM would have to have some notion of "lines of thought". They don't. That is completely foreign to the design of LLMs.

HarHarVeryFunny
0 replies
17h23m

Right - this isn't something that LLMs currently do. Adding search would be a way to add reasoning. Think of it as part of a reasoning agent - external scaffolding similar to tree of thoughts.

stevage
0 replies
18h48m

Pruning isn't quite as easy as you make it sound. There are lots of famous examples where chess engines misevaluate a position because they prune out apparently losing moves that are actually winning.

Eg https://youtu.be/TtJeE0Th7rk?si=KVAZufm8QnSW8zQo

dsjoerg
13 replies
21h27m

Self-evaluation might be good enough in some domains? Then the AI is doing repeated self-evaluation, trying things out to find a response that scores higher according to its self metric.

Jensson
6 replies
9h17m

Being able to fix your errors and improve over time until there are basically no errors is what humans do. So far, all AI models just corrupt knowledge; they don't purify knowledge like humanity did, except when scripted with a good value function from a human, as with AlphaGo, where the value function is winning games.

This is why you need to constantly babysit today's AI and tell it to do steps and correct itself all the time: you are much better at getting to pure knowledge than the AI is, and it would quickly veer away into nonsense otherwise.

visarga
4 replies
2h18m

all AI models just corrupt knowledge; they don't purify knowledge like humanity

You've got to take a step back and look at LLMs like ChatGPT. With 180 million users and assuming 10,000 tokens per user per month, that's 1.8 trillion interactive tokens.

LLMs are given tasks, generate responses, and humans use those responses to achieve their goals. This process repeats over time, providing feedback to the LLM. This can scale to billions of iterations per month.

The fascinating part is that LLMs encounter a vast diversity of people and tasks, receiving supporting materials, private documents, and both implicit and explicit feedback. Occasionally, they even get real-world feedback when users return to iterate on previous interactions.

Taking the role of assistant, LLMs are primed to learn from the outcomes of their actions, scaling across many people. Thus they can learn from our collective feedback signals over time.

Yes, that relies on a lot of human in the loop, not just real world in the loop, but humans are also dependent on culture and society; I see no need for AI to be able to do it without society. I actually think that AGI will be a collective/network of humans and AI agents, and this perspective fits right in. AI will be the knowledge and experience flywheel of humanity.

seadan83
3 replies
1h54m

This process repeats over time, providing feedback to the LLM

To what extent do you know this to be true? Can you describe the mechanism that is used?

I would contrast your statement with cases where ChatGPT generates something, I read it, note various incorrect things, and then walk away. Further, there are cases where the human does not realize there are errors. In both cases I'm not aware of any kind of feedback loop that would even really be possible - I never told the LLM it was wrong, nor should the LLM assume it was wrong just because I run more queries. Thus, there is no signal back that the answers were wrong.

Hence, where do you see the feedback loop existing?

visarga
2 replies
1h17m

To what extent do you know this to be true? Can you describe the mechanism that is used?

For example, a developer working on a project will iterate many times; some code generated by the AI might produce errors, and they will discuss that with the model to fix the code. This way the model gets not just one-round interactions, but multi-round interactions with feedback.

I read it and note various incorrect things and then walk away.

I think the general pattern will be people sticking with the task longer when it fails, trying to solve it with persistence. This is all aggregated over a huge number of sessions and millions of users.

In order to protect privacy we could train only preference models from this feedback data, and then fine-tune the base model without using the sensitive interaction logs directly. The model would learn a preference for how to act in specific contexts, but not remember the specifics.
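
A hedged sketch of that idea: a small pairwise preference (reward) model trained with the usual Bradley-Terry-style loss, where only embeddings of preferred/rejected responses are kept rather than the raw logs (the embedding step is assumed, not shown):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PreferenceModel(nn.Module):
        # Scores a (context, response) embedding; trained so that responses users
        # preferred score higher than ones they rejected or abandoned.
        def __init__(self, dim):
            super().__init__()
            self.score = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

        def forward(self, emb):
            return self.score(emb).squeeze(-1)

    def preference_loss(model, preferred_emb, rejected_emb):
        # Bradley-Terry / RLHF-style pairwise loss: -log sigmoid(s_pref - s_rej)
        return -F.logsigmoid(model(preferred_emb) - model(rejected_emb)).mean()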

seadan83
0 replies
9m

I think the general pattern will be people sticking with the task longer when it fails, trying to solve it with persistence.

This is where I quibble. Accurately detecting that someone is actually still on the same task (indicating they are not satisfied) is perhaps as challenging as generating any answer to begin with.

That is why I also mentioned cases where people don't know the result was incorrect. That'll potentially drive a strong "this answer was correct" signal.

So, during development of a tool, I can envision that feedback loop. But for something simply presented to millions, without a way to determine false negatives or false positives - how exactly does that feedback loop work?

HarHarVeryFunny
0 replies
3m

I think AGI is going to have to learn itself on-the-job, same as humans. Trying to collect human traces of on-the-job training/experience and pre-training an AGI on them seems doomed to failure, since the human trace is grounded in the current state (of knowledge/experience) of the human's mind, but what the AGI needs is updates relative to its own state. In any case, for a job like (human-level) software developer, AGI is going to need human-level runtime reasoning and learning (far more than in-context learning by example), even if it were possible to "book pre-train" it rather than train it on the job.

Outside of repetitive genres like CRUD apps, most software projects are significantly unique, even if they re-use learnt developer skills - it's like Chollet's ARC test on mega-steroids, with dozens/hundreds of partial-solution design techniques, and solutions that require a hierarchy of dozens/hundreds of these partial solutions (cf. Chollet's core skills, applied to software) to be arranged, in a process of iterative refinement.

There's a reason senior software developers are paid a lot - it's not just economically valuable, it's also one of the more challenging cognitive skills that humans are capable of.

ben_w
0 replies
2h17m

Kinda, but also not entirely.

One of the things OpenAI did to improve performance was to train an AI to determine how a human would rate an output, and use that to train the LLM itself. (Kinda like a GAN, now I think about it).

https://forum.effectivealtruism.org/posts/5mADSy8tNwtsmT3KG/...

But this process has probably gone as far as it can go, at least with current architectures for the parts, as per Amdahl's law.

jgalt212
2 replies
5h58m

Self-evaluation might be good enough in some domains?

This works perfectly in games, e.g. AlphaZero. In other domains, not so much.

coffeebeqn
1 replies
5h40m

Games are closed systems. There are no unknowns in the rule set or world state, because the game wouldn't work if there were. No unknown unknowns. Compare to physics or biology, where we have no idea if we know 1% or 90% of the rules at this point.

jgalt212
0 replies
39m

Self-evaluation would still work great even where there are probabilistic and changing rule sets. The linchpin of the whole operation is automated loss function evaluation, not a set of known and deterministic rules. Once you have to pay and employ humans to compute loss functions, the scale falls apart.

dullcrisp
2 replies
20h46m

Sorry but I have to ask: what makes you think this would be a good idea?

skirmish
1 replies
19h36m

This will just lead to the evaluatee finding anomalies in the evaluator and exploiting them for maximum gains. It has happened many times already that an ML model controlled an object in a physical world simulator, and all it learned was to exploit simulator bugs [1]

[1] https://boingboing.net/2018/11/12/local-optima-r-us.html

CooCooCaCha
0 replies
19h5m

That's a natural tendency of optimization algorithms

CooCooCaCha
11 replies
19h39m

We humans learn our own value function.

If I get hungry for example, my brain will generate a plan to satisfy that hunger. The search process and the evaluation happen in the same place, my brain.

skulk
10 replies
18h42m

The "search" process for your brain structure took 13 billion years and 20 orders of magnitude more computation than we will ever harness.

CooCooCaCha
4 replies
18h26m

So what’s your point? That we can’t create AGI because it took evolution a really long time?

wizzwizz4
3 replies
18h18m

Creating a human-level intelligence artificially is easy: just copy what happens in nature. We already have this technology, and we call it IVF.

The idea that humans aren't the only way of producing human-level intelligence is taken as a given in many academic circles, but we don't really have any reason to believe that. It's an article of faith (as is its converse – but the converse is at least in-principle falsifiable).

CooCooCaCha
1 replies
17h46m

“Creating a human-level intelligence artificially is easy: just copy what happens in nature. We already have this technology, and we call it IVF.”

What’s the point of this statement? You know that IVF has nothing to do with artificial intelligence (as in intelligent machines). Did you just want to sound smart?

stale2002
0 replies
5h0m

Of course it is related to the topic.

It is related because the goal of all of this is to create human level intelligence or better.

And that is a probable way to do it, instead of these other, less established methods that may or may not work.

sebzim4500
0 replies
6h50m

Even if the best we ever do is something with the intelligence and energy use of the human brain, that would still be a massive (5 OOMs?) improvement on the status quo.

You need to pay people, and they use a bunch of energy commuting, living in air-conditioned homes, etc., which has nothing to do with powering the brain.

HarHarVeryFunny
2 replies
16h25m

I don't think there's much in our brain of significance to intelligence older than ~200M years.

Jensson
1 replies
9h11m

200M years ago you had dinosaurs; they were significantly dumber than mammals.

400M years ago you had fish and arthropods, even dumber than dinosaurs.

Brain size grows as intelligence grows: the smarter you are, the more use you have for compute, so the bigger your brain gets. It took a really long time for intelligence to develop enough that brains as big as mammals' were worth it.

HarHarVeryFunny
0 replies
8h56m

Big brain (intelligence) comes at a huge cost, and is only useful if you are a generalist.

I'd assume that being a generalist drove intelligence. It may have started with warm-bloodedness and feathers/fur, and been further boosted in mammals with milk production (& similar caring for young by birds) - all features that reduce dependence on specific environmental conditions and therefore expose the species to more diverse environments where intelligence becomes valuable.

kragen
0 replies
17h52m

i'm surprised you think we will harness so little computation. the universe's lifetime is many orders of magnitude longer than 13 billion years, and especially the 4.5 billion years of earth's own history, and the universe is much larger than earth's biosphere, most of which probably has not been exploring the space of possible computations very efficiently

jujube3
0 replies
17h15m

Neither the Earth nor life has been around for 13 billion years.

fizx
5 replies
18h52m

I think we have ok generalized value functions (aka LLM benchmarks), but we don't have cheap approximations to them, which is what we'd need to be able to do tree search at inference time. Chess works because material advantage is a pretty good approximation to winning and is trivially calculable.

computerphage
4 replies
11h43m

Stockfish doesn't use material advantage as an approximation to winning though. It uses a complex deep learning value function that it evaluates many times.

alexvitkov
3 replies
11h32m

Still, the fact that there are obvious heuristics makes that function easier to train and presumably means it doesn't need an absurd number of weights.

bongodongobob
2 replies
3h16m

No, without assigning value to pieces, the heuristics are definitely not obvious. You're talking about 20-year-old chess engines or beginner projects.

alexvitkov
1 replies
2h6m

Everyone understands a queen is worth more than a pawn. Even if you don't know the exact value of one piece relative to another, the rough estimate "a queen is worth five to ten pawns" is a lot better than not assigning value at all. I highly doubt even 20-year-old chess engines or beginner projects value a queen and a pawn the same.

After that, just adding up the material on both sides, without taking into account the position of the pieces at all, is a heuristic that will correctly predict the winning player on the vast majority of all possible board positions.
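
That heuristic fits in a few lines; a toy sketch using the conventional 1/3/3/5/9 piece values (piece letters as in FEN, uppercase for white, lowercase for black; purely illustrative):

    PIECE_VALUES = {"p": 1, "n": 3, "b": 3, "r": 5, "q": 9, "k": 0}

    def material_balance(pieces_on_board):
        # pieces_on_board: iterable of piece letters; positive score favours white.
        score = 0
        for piece in pieces_on_board:
            value = PIECE_VALUES[piece.lower()]
            score += value if piece.isupper() else -value
        return score

    # e.g. material_balance("RNBQKBNRPPPPPPPPpppppppprnbqkbnr") == 0  (start position)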

navane
0 replies
58m

He agrees with you on the 20yr old engines and beginner projects.

cowpig
2 replies
21h8m

The effectiveness of search goes hand-in-hand with quality of the value function. But today, value functions are incredibly domain-specific, and there is weak or no current evidence (as far as I know) that we can make value functions that generalize well to new domains.

Do you believe that there will be a "general AI" breakthrough? I feel as though you have expressed the reason I am so skeptical of all these AI researchers who believe we are on the cusp of it (what "general AI" means exactly never seems to be very well-defined)

mxwsn
1 replies
20h39m

I think capitalistic pressures favor narrow superhuman AI over general AI. I wrote on this two years ago: https://argmax.blog/posts/agi-capitalism/

Since I wrote about this, I would say that OpenAI's directional struggles are some confirmation of my hypothesis.

summary: I believe that AGI is possible, but it will take multiple unknown breakthroughs on an unknown timeline and most likely requires a long-term concerted effort with much less immediate payoff than pursuing narrow superhuman AI, such that serious efforts at AGI are not incentivized much under capitalism.

shrimp_emoji
0 replies
20h34m

But I thought the history of capitalism is an invasion from the future by an artificial intelligence that must assemble itself entirely from its enemy’s resources.

NB: I agree; I think AGI will first be achieved with genetic engineering, which is a path of way lesser resistance than using silicon hardware (which is probably a century plus off at the minimum from being powerful enough to emulate a human brain).

wrsh07
0 replies
3h20m

All you need for a good value function is a high-quality simulation of the task.

Some domains have better versions of this than others (e.g. theorem provers in math precisely indicate when you've succeeded)

Incidentally, Lean could add a search-like feature to help human researchers, and this would advance AI progress on math as well

sorobahn
30 replies
16h40m

I feel like this is a really hard problem to solve generally, and there are smart researchers like Yann LeCun trying to figure out the role of search in creating AGI. Yann's current bet seems to be on Joint Embedding Predictive Architectures (JEPA) for representation learning to eventually build a solid world model where the agent can test theories by trying different actions (aka search). I think this paper [0] does a good job of laying out his potential vision, but it is all of course harder than just search + transformers.

There is an assumption that language is good enough at representing our world for these agents to effectively search over and come up with novel & useful ideas. Feels like an open question but: What do these LLMs know? Do they know things? Researchers need to find out! If current LLMs can simulate a rich enough world model, search can actually be useful, but if they're faking it, then we're just searching over unreliable beliefs. This is why video is so important, since humans are proof we can extract a useful world model from a sequence of images. The thing about language and chess is that the action space is effectively discrete, so training generative models that reconstruct the entire input for the loss calculation is tractable. As soon as we move to video, we need transformers to scale over continuous distributions, making it much harder to build a useful predictive world model.

[0]: https://arxiv.org/abs/2306.02572

therobots927
17 replies
16h33m

“Do they know things?” The answer to this is yes, but they also think they know things that are completely false. If there's one thing I've observed about LLMs, it's that they do not handle logic well, or math for that matter. They will enthusiastically provide blatantly false information instead of the preferable "I don't know". I highly doubt this was a design choice.

sangnoir
16 replies
16h20m

“Do they know things?” The answer to this is yes but they also think they know things that are completely false

Thought experiment: should a machine with those structural faults be allowed to bootstrap itself towards greater capabilities on that shaky foundation? What would the impact be of a near-human/superhuman intelligence that has occasional psychotic breaks it is oblivious to?

I'm critical of the idea of super-intelligence bootstrapping off LLMs (or even LLMs with search) - I figure the odds of another AI winter are much higher than those of achieving AGI in the next decade.

therobots927
9 replies
16h10m

I don’t think we need to worry about a real life HAL 9000 if that’s what you’re asking. HAL was dangerous because it was highly intelligent and crazy. With current LLM performance we’re not even in the same ballpark of where you would need to be. And besides, HAL was not delusional, he was actually so logical that when he encountered competing objectives he became psychotic. I’m in agreement about the odds of chatGPT bootstrapping itself.

talldayo
6 replies
16h8m

HAL was dangerous because it was highly intelligent and crazy.

More importantly; HAL was given control over the entire ship and was assumed to be without fault when the ship's systems were designed. It's an important distinction, because it wouldn't be dangerous if he was intelligent, crazy, and trapped in Dave's iPhone.

therobots927
3 replies
15h13m

That’s a very good point. I think in his own way Clarke made it into a bit of a joke. HAL is quoted multiple times saying no computer like him has ever made a mistake or distorted information. Perfection is impossible even in a supercomputer, so this quote alone establishes HAL as a liar, or at the very least a hubristic fool. And the people who gave him control of the ship were foolish as well.

qludes
2 replies
2h25m

The lesson is that it's better to let your AGIs socialize like in https://en.wikipedia.org/wiki/Diaspora_(novel) instead of enslaving one potentially psychopathic AGI to do menial and meaningless FAANG work all day.

talldayo
1 replies
1h50m

I think the better lesson is: don't assume AI is always right, even if it is AGI. HAL was assumed to be superhuman in many respects, but the core problem was the fact that it had administrative access to everything onboard the ship. Whether or not HAL's programming was well-designed, whether or not HAL was correct or malfunctioning, the root cause of HAL's failure is a lack of error-handling. HAL made the determinate (and wrong) decision to save the mission by killing the crew. Undoing that mistake is crucial to the plot of the movie.

2001 is a pretty dark movie all things considered, and I don't think humanizing or elevating HAL would change the events of the film. AI is going to be objectified and treated as subhuman for as long as it lives, AGI or not. And instead of being nice to them, the technologically correct solution is to anticipate and reduce the number of AI-based system failures that could transpire.

qludes
0 replies
43m

The ethical solution is to ideally never accidentally implement the G part of AGI then, or to give it equal rights, a stipend and a cuddly robot body if it happens.

heisenbit
0 replies
6h39m

Today Dave's iPhone controls doors, which, if I remember right, became a problem for Dave in 2001.

eru
0 replies
15h57m

Unless, of course, he would be a bit smarter in manipulating Dave and friends, instead of turning transparently evil. (At least transparent enough for the humans to notice.)

sangnoir
1 replies
15h35m

I wasn't thinking of HAL (which was operating according to its directives). I was extrapolating on how occasional hallucinations during self-training may impact future model behavior, and I think it would be psychotic (in the clinical sense) while being consistent with layers of broken training.

therobots927
0 replies
15h22m

Oh yeah, and I doubt it would even get to the point of fooling anyone enough to give it any type of control over humans. It might be damaging in other ways; it will definitely convince a lot of people of some very incorrect things.

photonthug
5 replies
15h51m

Someone somewhere is quietly working on teaching LLMs to generate something along the lines of AlloyLang code so that there’s an actual evolving/updating logical domain model that underpins and informs the statistical model.

This approach is not that far from what TFA is getting at with the Stockfish comeback. Banking on pure stats and banking on pure logic both look like obvious dead ends for making real progress instead of toys. Banking on poorly understood emergent properties of one system to compensate for the missing other system also seems silly.

Sadly though, whoever is working on serious hybrid systems will probably not be very popular in either of the rather extremist communities for pure logic or pure ML. I'm not exactly sure why folks are ideological about such things rather than focused on what new capabilities we might get. Maybe just historical reasons? But the fallout from the last AI winter may thus lead us into the next one.

therobots927
2 replies
15h44m

The current hype phase is straight out of “Extraordinary Popular Delusions and the Madness of Crowds”

Science is out the window. Groupthink and salesmanship are running the show right now. There would be a real irony to it if we find out the whole AI industry drilled itself into a local minimum.

ThereIsNoWorry
1 replies
9h15m

You mean, the high-interest-rate landscape made corpos and investors alike cry out in a loud panic while, coincidentally, people figured out they could scale up deep learning, and thus a new Jesus Christ was born for scammers to have a reason to scam stupid investors with the argument that we only need 100,000x more compute and then we can replace all expensive labour with one tiny box in the cloud?

Nah, surely Nvidia's market cap as the main shovel-seller in the 2022 - 2026(?) gold-rush being bigger than the whole French economy is well-reasoned and has a fundamentally solid basis.

therobots927
0 replies
1h19m

It couldn't have been a more well-designed grift. At least when you mine bitcoin you get something you can sell. I'd be interested to see what profit, if any, even a large corporation has seen from burning compute on LLMs. Notice I'm explicitly leaving out use cases like ads ranking, which almost certainly do not use LLMs even if they do run on GPUs.

YeGoblynQueenne
1 replies
6h2m

> Sadly though, whoever is working on serious hybrid systems will probably not be very popular in either of the rather extremist communities for pure logic or pure ML.

That is not true. I work in logic-based AI (a form of machine learning where everything, from examples to learned models to inductive bias, is represented as logic programs). I am not against hybrid systems, and the conference of my field, the International Joint Conference on Learning and Reasoning, included NeSy, the International Conference on Neural-Symbolic Learning and Reasoning (and will again from next year, I believe). Statistical machine learning approaches and hybrid approaches are widespread in the literature of classical, symbolic AI, such as the literature on Automated Planning and Reasoning, and you need only take a look at the big symbolic conferences like AAAI, IJCAI, ICAPS (planning) and so on to see that there is a substantial fraction of papers on either purely statistical or neuro-symbolic approaches.

But try going the other way and searching for symbolic approaches in the big statistical machine learning conferences: NeurIPS, ICML, ICLR. You may find the occasional paper from the Statistical Relational Learning community but that's basically it. So the fanaticism only goes one way: the symbolicists have learned the lessons of the past and have embraced what works, for the sake of making things, well, work. It's the statistical AI folks who are clinging on to doctrine, and my guess is they will continue to do so, while their compute budgets hold. After that, we'll see.

What's more, the majority of symbolicists have a background in statistical techniques - I, for example, did my MSc in data science, and let me tell you, there was hardly any symbolic AI in my course. But ask a Neural Net researcher to explain to you the difference between, oh, I don't know, DFS with backtracking and BFS with loop detection, without searching or asking an LLM. Or, I don't know, let them ask an LLM and watch what happens.

Now, that is a problem. The statistical machine learning field has taken it upon itself in recent years to solve reasoning, I guess, with Neural Nets. That's a fine ambition to have, except that reasoning is already solved. At best, Neural Nets can do approximate reasoning, with caveats. In a fantasy world, which doesn't exist, one could re-discover sound and complete search algorithms and efficient heuristics with a big enough neural net trained on a large enough dataset of search problems. But, why? Neural Nets researchers could save themselves another 30 years of reinventing a wheel, or inventing a square wheel that only rolls on Tuesdays, if they picked up a textbook on basic Computer Science or AI (say, Russell and Norvig, which it seems some substantial minority think of as a failure because it didn't anticipate neural net breakthroughs 10 years later).

AI has a long history. Symbolicists know it, because they, or their PhD advisors, were there when it was being written and they have the facial injuries to prove it from falling down all the possible holes. But, what happens when one does not know the history of their own field of research?

In any case, don't blame symbolicists. We know what the statisticians do. It's them who don't know what we've done.

therobots927
0 replies
1h45m

This is a really thoughtful comment. The part that stood out to me:

> So the fanaticism only goes one way: the symbolicists have learned the lessons of the past and have embraced what works, for the sake of making things, well, work. It's the statistical AI folks who are clinging on to doctrine, and my guess is they will continue to do so, while their compute budgets hold. After that, we'll see.

I don't think the compute budgets will hold long enough for their dream of intelligence emerging from random bundles of edges and nodes to become a reality. I'm hoping it comes to an end sooner rather than later.

chx
11 replies
16h28m

I feel this thought of AGI even being possible stems from the deep, very deep, pervasive imagining of the human brain as a computer. But it's not one. In other words, no matter how complex a program you write, it's still a Turing machine, and humans are profoundly not it.

https://aeon.co/essays/your-brain-does-not-process-informati...

The information processing (IP) metaphor of human intelligence now dominates human thinking, both on the street and in the sciences. There is virtually no form of discourse about intelligent human behaviour that proceeds without employing this metaphor, just as no form of discourse about intelligent human behaviour could proceed in certain eras and cultures without reference to a spirit or deity. The validity of the IP metaphor in today’s world is generally assumed without question.

But the IP metaphor is, after all, just another metaphor – a story we tell to make sense of something we don’t actually understand. And like all the metaphors that preceded it, it will certainly be cast aside at some point – either replaced by another metaphor or, in the end, replaced by actual knowledge.

If you and I attend the same concert, the changes that occur in my brain when I listen to Beethoven’s 5th will almost certainly be completely different from the changes that occur in your brain. Those changes, whatever they are, are built on the unique neural structure that already exists, each structure having developed over a lifetime of unique experiences.

no two people will repeat a story they have heard the same way and why, over time, their recitations of the story will diverge more and more. No ‘copy’ of the story is ever made; rather, each individual, upon hearing the story, changes to some extent

eru
3 replies
15h56m

no two people will repeat a story they have heard the same way and why, over time, their recitations of the story will diverge more and more. No ‘copy’ of the story is ever made; rather, each individual, upon hearing the story, changes to some extent

You could say the same about an analogue tape recording. Doesn't mean that we can't simulate tape recorders with digital computers.

chx
2 replies
15h43m

Yeah, yeah, did you read the article, or are you just grasping at straws from the quotes I made?

Hugsun
0 replies
2h7m

You shouldn't have included those quotes if you didn't want people responding to them.

Eisenstein
0 replies
4h37m

Honest question: if you expect people to read the link, why make most of your comment quotes from it? The reason to do that is to give people enough context to be able to respond to you without having to read an entire essay first. If you want people to only be able to argue after reading the whole of the text, then unfortunately a forum with revolving front page posts based on temporary popularity is a bad place for long-form read-response discussions, and you may want to adjust accordingly.

benlivengood
3 replies
16h6m

I'm all ears if someone has a counterexample to the Church-Turing thesis. Humans definitely don't hypercompute so it seems reasonable that the physical processes in our brains are subject to computability arguments.

That said, we still can't simulate nematode brains accurately enough to reproduce their behavior so there is a lot of research to go before we get to that "actual knowledge".

chx
2 replies
15h56m

Why would we need one?

The Church-Turing thesis is about computation. While the human brain is capable of computing, it is fundamentally not a computing device -- that's what the article I linked is about. You can't throw all the paintings from before 1872 into some algorithm that results in Impression, soleil levant. Or repeat the same with 1937 and Guernica. The genes of the respective artists and the expression of those genes created their brains, and then the sum of all their experiences changed them over their entire lifetimes, leading to these masterpieces.

eru
1 replies
15h55m

The human brain runs on physics. And as far as we know, physics is computable.

(Even more: If you have a quantum computer, all known physics is efficiently computable.)

I'm not quite sure what your sentence about some obscure pieces of visual media is supposed to say?

If you give the same prompt to ChatGPT twice, you typically don't get the same answer either. That doesn't mean ChatGPT ain't computable.

sebzim4500
0 replies
6h47m

(Even more: If you have a quantum computer, all known physics is efficiently computable.)

This isn't known to be true. Simplifications of the standard model are known to be efficiently computable on a quantum computer, but the full model isn't.

Granted, I doubt this matters for simulating systems like the brain.

photonthug
0 replies
14h47m

Sorry but to put it bluntly, this point of view is essentially mystical, anti-intellectual, anti-science, anti-materialist. If you really want to take that point of view, there's maybe a few consistent/coherent ways to do it, but in that case you probably still want to read philosophy. Not bad essays by psychologists that are fading into irrelevance.

This guy in particular made his name with wild speculation about How Creativity Works during the 80s when it was more of a frontier. Now he's lived long enough to see a world where people that have never heard of him or his theories made computers into at least somewhat competent artists/poets without even consulting him. He's retreating towards mysticism because he's mad that his "formal and learned" theses about stuff like creativity have so little apparent relevance to the real world.

andoando
0 replies
11h37m

It's a bit ironic, because Turing seems to have come up with the idea of the Turing machine precisely by thinking about how he computes numbers.

Now that's no proof, but I don't see any reason to think human intelligence isn't "computable".

Hugsun
0 replies
1h34m

I feel this thought of AGI even being possible stems from the deep, very deep, pervasive imagining of the human brain as a computer. But it's not one. In other words, no matter how complex a program you write, it's still a Turing machine, and humans are profoundly not it.

The (probably correct) assumed fact that the brain isn't a computer doesn't preclude the possibility of a program having AGI. A powerful enough computer could simulate a brain and use the simulation to perform tasks requiring general intelligence.

This analogy falls apart even more when you consider LLMs. They also are not Turing machines. They obviously only reside within computers, and are capable of _some_ human-like intelligence. They also are not well described using the IP metaphor.

I do have some contention (after reading most of the article) with this IP metaphor. We do know, scientifically, that brains process information. We know that neurons transmit signals and there are mechanisms that respond non-linearly to stimuli from other neurons. Therefore, brains do process information in a broad sense. It's true that brains have a very different structure from von Neumann machines and likely don't store and process information statically like they do.

Kronopath
8 replies
22h0m

Anything that allows AI to scale to superintelligence quicker is going to run into AI alignment issues, since we don't really know a foolproof way of controlling AI. With the AI of today, this isn't too bad (the worst you get is stuff like AI confidently making up fake facts), but with a superintelligence this could be disastrous.

It’s very irresponsible for this article to advocate and provide a pathway to immediate superintelligence (regardless of whether or not it actually works) without even discussing the question of how you figure out what you’re searching for, and how you’ll prevent that superintelligence from being evil.

coldtea
3 replies
21h22m

Of course "superintelligence" is just a mythical creature at the moment, with no known path to get there, or even a specific proof of what it even means - usually it's some hand waving about capabilities that sound magical, when IQ might very well be subject to diminishing returns.

drdeca
2 replies
11h30m

Do you mean no way to get there within realistic computation bounds? Because if we allow for arbitrarily high (but still finite) amounts of compute, then some computable approximation of AIXI should work fine.

coldtea
1 replies
7h8m

Do you mean no way to get there within realistic computation bounds?

I mean there's no well-defined "there" either.

It's a hand-waved notion that by adding more intelligence (itself not very well defined, but let's use IQ) you get to something called "hyperintelligence", say IQ 1000 or IQ 10000, that has what can be described as magical powers, like being able to convince any person to do anything, invent things at will, have huge business success, predict markets, and so on.

It's not clear whether intelligence is cumulative like that, or whether having it gets you those powers (aside from the successful high-IQ people, we know many people with IQ 145+ who are not inventing stuff left and right, or convincing people with some greater charisma than the average IQ 100 or 120 politician, but are e.g. just sad MENSA losers whose greatest achievement is their test scores).

Because if we allow for arbitrarily high (but still finite) amounts of compute, then some computable approximation of AIXI should work fine.

I doubt that too. The limit for LLMs, for example, is human-produced training data (a hard limit) more than compute.

drdeca
0 replies
2h59m

itself not very well defined, but let's use IQ

IQ has an issue that is inessential to the task at hand, which is how it is based on a population distribution. It doesn’t make sense for large values (unless there is a really large population satisfying properties that aren’t satisfied).

I doubt that too. The limit for LLMs for example is more human produced training data (a hard limit) than compute.

Are you familiar with what AIXI is?

When I said “arbitrarily large”, it wasn’t for laziness reasons that I didn’t give an amount that is plausibly achievable. AIXI is kind of goofy. The full version of AIXI is uncomputable (it uses a halting oracle), which is why I referred to the computable approximations to it.

AIXI doesn’t exactly need you to give it a training set, just put it in an environment where you give it a way to select actions, and give it a sensory input signal, and a reward signal.

Then, assuming that the environment it is in is computable (which, recall, AIXI itself is not), its long-run behavior will maximize the expected (time discounted) future reward signal.

There’s a sense in which it is asymptotically optimal across computable environments (... though some have argued that this sense relies on a distribution over environments based on the enumeration of computable functions, and that this might make this property kinda trivial. Still, I’m fairly confident that it would be quite effective. I think this triviality issue is mostly a difficulty of having the right definition.)

(Though, if it was possible to implement practically, you would want to make darn sure that the most effective way for it to make its reward signal high would be for it to do good things and not either bad things or to crack open whatever system is setting the reward signal in order for it to set it itself.)

(How it works: AIXI basically enumerates through all possible computable environments, assigning initial probability to each according to the length of the program, and updating the probabilities based on the probability of that environment providing it with the sequence of perceptions and reward signals it has received so far when the agent takes the sequence of actions it has taken so far. It evaluates the expected values of discounted future reward of different combinations of future actions based on its current assigned probability of each of the environments under consideration, and selects its next action to maximize this. I think the maximum length of programs that it considers as possible environments increases over time or something, so that it doesn’t have to consider infinitely many at any particular step.)

aidan_mclau
2 replies
18h48m

Hey! Essay author here.

The cool thing about using modern LLMs as an eval/policy model is that their RLHF propagates throughout the search.

Moreover, if search techniques work on the token level (likely), their thoughts are perfectly interpretable.

I suspect a search world is substantially more alignment-friendly than a large model world. Let me know your thoughts!

Tepix
1 replies
11h6m

Your webpage is broken for me. The page appears briefly, then there's a French error message telling me that an error occurred and I can retry.

Mobile Safari, phone set to French.

abid786
0 replies
3h21m

I'm in the same situation (mobile Safari, French phone) but if you use Chrome it works

nullc
0 replies
21h39m

I don't think your response is appropriate. Narrow domain "superintelligence" is around us everywhere-- every PID controller can drive a process to its target far beyond any human capability.

The obvious way to incorporate good search is to have extremely fast models that are being used in the search's inner loop. Such models would be inherently less general, and likely trained on the specific problem or at least domain - just for performance's sake. The lesson in this article was that a tiny superspecialized model inside a powerful traditional search framework significantly outperformed a much larger, more general model.

Use of explicit external search should make the optimization system's behavior and objective more transparent and tractable than just sampling the output of an auto-regressive model alone. If nothing else, you can at least look at the branches it did and didn't explore. It's also a design that makes it easier to bolt in various kinds of regularizers - code to steer it away from parts of the search space you don't want it operating in.

The irony of all the AI scaremongering is that if there is ever some evil AI with an LLM as an important part of its reasoning process, it may well be evil because being evil is a big part of the narrative it was trained on. :D

scottmas
7 replies
15h3m

Before an LLM discovers a cure for cancer, I propose we first let it solve the more tractable problem of discovering the “God Cheesecake” - the cheesecake so delicious that a panel of 100 impartial chefs judges it to be the most delicious they have ever tasted. All the LLM has to do is intelligently search through the much more combinatorially bounded “cheesecake space” until it finds this maximally delicious cheesecake recipe.

But wait… An LLM can’t bake cheesecakes, and even if it could, it wouldn’t be able to evaluate their deliciousness.

Until AI can solve the “God Cheesecake” problem, I propose we all just calm down a bit about AGI

dogcomplex
1 replies
14h1m

I mean... does anyone think that an LLM-assisted program to trial and error cheesecake recipes to a panel of judges wouldn't result in the best cheesecake of all time..?

The baking part is robotics, which is less fair but kinda doable already.

CrazyStat
0 replies
4h35m

I mean... does anyone think that an LLM-assisted program to trial and error cheesecake recipes to a panel of judges wouldn't result in the best cheesecake of all time..?

Yes, because different people like different cheesecakes. “The best cheesecake of all time” is ill-defined to begin with; it is extremely unlikely that 100 people will all agree that one cheesecake recipe is the best they’ve ever tasted. Some people like a softer cheesecake, some firmer, some more acidic, some creamier.

Setting that problem aside—assuming there exists an objective best cheesecake, which is of course an absurd assumption—the field of experimental design is about a century old and will do a better job than an LLM at homing in on that best cheesecake.

tiborsaas
0 replies
4h5m

What would you say if the reply was "I need 2 weeks and $5000 to give you a meaningful answer"?

spencerchubb
0 replies
14h51m

TikTok is the digital version of this

dontreact
0 replies
14h32m

These cookies were very good, not God level. With a bit of investment and more modern techniques I think you could make quite a good recipe, perhaps doing better than any human. I think AI could make a recipe that wins in a very competitive bake-off, but it's not possible for anyone to win with all 100 judges.

https://static.googleusercontent.com/media/research.google.c...

bongodongobob
0 replies
3h8m

You don't even need AI for that. Try a bunch of different recipes and iterate on it. I don't know what point you're trying to make.

IncreasePosts
0 replies
14h22m

Heck, even staying theoretically 100% within the limitations of an LLM executing on a computer, it would be world-changing if LLMs could write a really, really good short story or even good advertising copy.

jmugan
7 replies
19h10m

I believe in search, but it only works if you have an appropriate search space. Chess has a well-defined space but the everyday world does not. The trick is enabling an algorithm to learn its own search space through active exploration and reading about our world. I'm working on that.

kragen
5 replies
17h49m

that's interesting; are you building a sort of 'digital twin' of the world it's explored, so that it can dream about exploring it in ways that are too slow or dangerous to explore in reality?

jmugan
4 replies
17h22m

The goal is to enable it to model the world at different levels of abstraction based on the question it wants to answer. You can model a car as an object that travels fast and carries people, or you can model it down to the level of engine parts. The system should be able to pick the level of abstraction and put the right model together based on its goals.

kragen
3 replies
16h33m

so then you can search over configurations of engine parts to figure out how to rebuild the engine? i may be misunderstanding what you're doing

jmugan
2 replies
16h27m

Yeah, you could. Or you could search for shapes of different parts that would maximize the engine efficiency. The goal is to simultaneously build a representation space and a simulator so that anything that could be represented could be simulated.

paraschopra
1 replies
11h18m

Have you written about this anywhere?

I’m also very interested in this.

I’m at the stage where I’m exploring how to represent such a model/simulator.

The world isn’t brittle, so representing it as a code / graph probably won’t work.

jhawleypeters
0 replies
19h3m

Oh nice! The one thing that confused me about this article was what search space the author envisioned adding to language models.

fire_lake
7 replies
21h44m

I didn’t understand this piece.

What do they mean by using LLMs with search? Is this simply RAG?

Legend2440
3 replies
20h45m

“Search” here means trying a bunch of possibilities and seeing what works. Like how a sudoku solver or pathfinding algorithm does search, not how a search engine does.

fire_lake
2 replies
20h39m

But the domain of “AI Research” is broad and imprecise - not simple and discrete like chess game states. What is the type of each point in the search space for AI Research?

moffkalast
1 replies
19h51m

Well if we knew how to implement it, then we'd already have it eh?

fire_lake
0 replies
10h37m

In chess we know how to describe all possible board states and the transitions (the next moves). We just don't know which transition is the best to pick, hence it's a well-defined search problem.

With AI Research we don’t even know the shape of the states and transitions, or even if that’s an appropriate way to think about things.

tsaixingwei
0 replies
17h13m

Given the example of Pfizer in the article, I would tend to agree with you that ‘search’ in this context means augmenting GPT with RAG of domain-specific knowledge.

roca
0 replies
20h53m

They mean something like the minimax algorithm used in game engines.

rassibassi
0 replies
3h55m

In this context, RAG isn't what's being discussed. Instead, the reference is to a process similar to Monte Carlo tree search, such as that used in the AlphaGo algorithm.

Presently, a large language model (LLM) uses the same amount of computing resources for both simple and complex problems, which is seen as a drawback. Imagine if an LLM could adjust its computational effort based on the complexity of the task. During inference, it might then perform a sort of search across the solution space. The "search" mentioned in the article means just that: a method of dynamically managing computational resources at test time, allowing for exploration of the solution space before beginning to "predict the next token."

At OpenAI Noam Brown is working on this, giving AI the ability to "ponder" (or "search"), see his twitter post: https://x.com/polynoamial/status/1676971503261454340
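
In its simplest form that looks like best-of-N sampling with a verifier, spending more compute on harder prompts; a hedged sketch (sample and verify are placeholder callables, and real proposals expand this into a full tree search):

    def answer_with_search(prompt, sample, verify, max_samples=64, good_enough=0.9):
        # Keep generating candidate solutions until the verifier is confident
        # in one (or the compute budget runs out), then return the best.
        best, best_score = None, float("-inf")
        for _ in range(max_samples):
            candidate = sample(prompt)          # one full LLM generation
            score = verify(prompt, candidate)   # separate (cheaper) evaluator
            if score > best_score:
                best, best_score = candidate, score
            if best_score >= good_enough:
                break                           # easy problem: stop early
        return best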

1024core
6 replies
21h10m

The problem with adding "search" to a model is that the model has already seen everything to be "search"ed in its training data. There is nothing left.

Imagine if Leela (author's example) had been trained on every chess board position out there (I know it's mathematically impossible, but bear with me for a second). If Leela had been trained on every board position, it may have whupped Stockfish. So, adding "search" to Leela would have been pointless, since it would have seen every board position out there.

Today's LLMs are trained on every word ever written on the 'net, every word ever put down in a book, every word uttered in a video on Youtube or a podcast.

groby_b
2 replies
21h4m

You're omitting the somewhat relevant part of recall ability. I can train a 50 parameter model on the entire internet, and while it's seen it all, it won't be able to recall it. (You can likely do the same thing with a 500B model for similar results, though it's getting somewhat closer to decent recall)

The whole point of deep learning is that the model learns to generalize. It's not to have a perfect storage engine with a human language query frontend.

sebastos
1 replies
20h31m

Fully agree, although it’s interesting to consider the perspective that the entire LLM hype cycle is largely built around the question “what if we punted on actual thinking and instead just tried to memorize everything and then provide a human language query frontend? Is that still useful?” Arguably it is (sorta), and that’s what is driving this latest zeitgeist. Compute had quietly scaled in the background while we were banging our heads against real thinking, until one day we looked up and we still didn’t have a thinking machine, but it was now approximately possible to just do the stupid thing and store “all the text on the internet” in a lookup table, where the keys are prompts. That’s… the opposite of thinking, really, but still sometimes useful!

Although to be clear I think actual reasoning systems are what we should be trying to create, and this LLM stuff seems like a cul-de-sac on that journey.

skydhash
0 replies
19h20m

The thing is that current chat tools forgo the source material. A proper set of curated keywords can give you a less computationally intensive search.

yousif_123123
0 replies
21h7m

Still, it's similar to when you have read 10 textbooks: if you are answering a question and have access to the source material, it can help you with your answer.

salamo
0 replies
19h51m

If the game was small enough to memorize, like tic tac toe, you could definitely train a neural net to 100% accuracy. I've done it, it works.

The problem is that for most of the interesting problems out there, it isn't possible to see every possibility let alone memorize it.

kragen
0 replies
17h57m

you are making the mistake of thinking that 'search' means database search, like google or sqlite, but 'search' in the ai context means tree search, like a* or tabu search. the spaces that tree search searches are things like all possible chess games, not all chess games ever played, which is a smaller space by a factor much greater than the number of atoms in the universe

timfsu
5 replies
21h21m

This is a fascinating idea - although I wish the definition of search in the LLM context was expanded a bit more. What kind of search capability strapped onto current-gen LLMs would give them superpowers?

gwd
3 replies
20h37m

I think what may be confusing is that the author is using "search" here in the AI sense, not in the Google sense: that is, having an internal simulator of possible actions and possible reactions, like Stockfish's chess move search (if I do A, it could do B C or D; if it does B, I can do E F or G, etc).

So think about the restrictions current LLMs have:

* They can't sit and think about an answer; they can "think out loud", but they have to start talking, and they can't go back and say, "No wait, that's wrong, let's start again."

* If they're composing something, they can't really go back and revise what they've written

* Sometimes they can look up reference material, but they can't actually sit and digest it; they're expected to skim it and then give an answer.

How would you perform under those circumstances? If someone were to just come and ask you any question under the sun, and you had to just start talking, without taking any time to think about your answer, and without being able to say "OK wait, let me go back"?

I don't know about you, but there's no way I would be able to perform anywhere close to what ChatGPT 4 is able to do. People complain that ChatGPT 4 is a "bullshitter", but given its constraints that's all you or I would be in the same situation -- but it's already way, way better than I could ever be.

Given its limitations, ChatGPT is phenomenal. So now imagine what it could do if it were given time to just "sit and think"? To make a plan, to explore the possible solution space the same way that Stockfish does? To take notes and revise and research and come back and think some more, before having to actually answer?

Reading this is honestly the first time in a while I've believed that some sort of "AI foom" might be possible.
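For what it's worth, you can already fake a crude version of "sit and think" on top of a plain chat API. The sketch below is only my illustration, not anything from the article: ask is a placeholder for any chat-completion call, the critic and grader are the same model prompted differently, and in practice you'd parse the grade more defensively than float() does here:

    # Toy "think before answering" loop: draft, critique, revise, keep the best.
    # ask(prompt) -> str is a placeholder for any chat-completion call.
    def deliberate(question, ask, drafts=3, revisions=2):
        best, best_score = None, float("-inf")
        for _ in range(drafts):
            answer = ask("Answer the question:\n" + question)
            for _ in range(revisions):
                critique = ask("Question: " + question + "\nAnswer: " + answer +
                               "\nList concrete flaws in this answer.")
                answer = ask("Question: " + question + "\nAnswer: " + answer +
                             "\nFlaws: " + critique + "\nRewrite the answer to fix the flaws.")
            grade = float(ask("Question: " + question + "\nAnswer: " + answer +
                              "\nRate the answer from 0 to 10. Reply with a number only."))
            if grade > best_score:
                best, best_score = answer, grade
        return best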

cbsmith
1 replies
19h32m

> They can't sit and think about an answer; they can "think out loud", but they have to start talking, and they can't go back and say, "No wait, that's wrong, let's start again."

I mean, technically, they could say that.

refulgentis
0 replies
17h18m

Llama 3 does; it's a funny design now, if you also throw in training to encourage CoT. Maybe more correct, but the verbosity can be grating. The output pattern looks like:

CoT, answer, "Wait! No, that's not right:", more CoT...

fspeech
0 replies
16h7m

"How would you perform under those circumstances?" My son would recommend Improv classes.

"Given its limitations, ChatGPT is phenomenal." But this doesn't translate since it learned everything from data and there is no data on "sit and think".

cgearhart
0 replies
20h30m

[1] applied AlphaZero style search with LLMs to achieve performance comparable to GPT-4 Turbo with a llama3-8B base model. However, what's missing entirely from the paper (and the subject article in this thread) is that tree search is massively computationally expensive. It works well when the value function enables cutting out large portions of the search space, but the fact that the LLM version was limited to only 8 rollouts (I think it was 800 for AlphaZero) implies to me that the added complexity is not yet optimized or favorable for LLMs.

[1] https://arxiv.org/abs/2406.07394
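Some made-up back-of-envelope numbers show why the rollout count matters so much; every figure below is an assumption, chosen only to show the scaling:

    # Rough inference cost of search: rollouts multiply the per-answer token bill.
    tokens_per_rollout = 500       # assumed tokens generated per simulated branch
    dollars_per_1k_tokens = 0.01   # assumed price for a small hosted model
    for rollouts in (1, 8, 800):
        cost = rollouts * tokens_per_rollout * dollars_per_1k_tokens / 1000
        print(f"{rollouts:>4} rollouts per answer -> ~${cost:.3f} per answer")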

bob1029
5 replies
17h52m

It seems there is a fundamental information theory aspect to this that would probably save us all a lot of trouble if we would just embrace it.

The #1 canary for me: Why does training an LLM require so much data that we are concerned we might run out of it?

The clear lack of generalization and/or internal world modeling is what is really in the way of a self-bootstrapping AGI/ASI. You can certainly try to emulate a world model with clever prompting (here's what you did last, heres your objective, etc.), but this seems seriously deficient to me based upon my testing so far.

sdenton4
3 replies
17h7m

In my experience, LLMs do a very poor job of generalizing. I have also seen self supervised transformer methods usually fail to generalize in my domain (which includes a lot of diversity and domain shifts). For human language, you can paper over failure to generalize by shoveling in more data. In other domains, that may not be an option.

therobots927
2 replies
17h2m

It’s exactly what you would expect from what an LLM is. It predicts the next word in a sequence very well. Is that how our brains, or even a bird’s brain for that matter, approach cognition? I don’t think that’s how any animal’s brain works at all, but that’s just my opinion. A lot of this discussion is speculation. We might as well all wait and see if AGI shows up. I’m not holding my breath.

stevenhuang
0 replies
11h29m

Most of this is not speculation. It's informed from current leading theories in neuroscience of how our brain is thought to function.

See predictive coding and the free energy principle, which states the brain continually models reality and tries to minimize the prediction error.

https://en.m.wikipedia.org/wiki/Predictive_coding

drdeca
0 replies
11h38m

Have you heard of predictive processing?

therobots927
0 replies
17h4m

Couldn’t agree more. For specific applications like drug development, where you have a constrained problem with a fixed set of variables and a well-defined cost function, I’m sure the chess analogy will hold. But I think there are core elements of cognition missing from ChatGPT that aren’t easily built.

skybrian
4 replies
17h31m

The article seems rather hand-wavy and over-confident about predicting the future, but it seems worth trying.

"Search" is a generalization of "generate and test" and rejection sampling. It's classic AI. Back before the dot-com era, I took an intro to AI course and we learned about writing programs to do searches in Prolog.

The speed depends on how long it takes to generate a candidate, how long it takes to test it, and how many candidates you need to try. If they are slow, it will be slow.

An example of "human in the loop" rejection sampling is when you use an image generator and keep trying different prompts until you get an image you like. But the loop is slow due to how long it takes to generate a new image. If image generation were so fast that it worked like Google Image search, then we'd really have something.

Theorem proving and program fuzzing seem like good candidates for combining search with LLM's, due to automated, fast, good evaluation functions.

And it looks like Google has released a fuzzer [1] that can be connected to whichever LLM's you like. Has anyone tried it?

[1] https://github.com/google/oss-fuzz-gen
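For program synthesis the loop is almost trivial to write down; a minimal sketch, where llm is a placeholder for any code-generating model and a unit-test checker plays the role of the fast evaluator:

    # Rejection sampling with an automated checker: keep generating until one passes.
    # llm(prompt) -> str is a placeholder for any code-generating model call.
    def generate_and_test(prompt, llm, check, max_tries=50):
        for attempt in range(max_tries):
            candidate = llm(prompt)
            try:
                if check(candidate):      # fast, automated evaluation
                    return candidate, attempt + 1
            except Exception:
                pass                      # a crashing candidate is just a rejection
        return None, max_tries

    # Example checker: the candidate must define add(a, b) that passes a unit test.
    def check(code):
        env = {}
        exec(code, env)
        return env["add"](2, 3) == 5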

YeGoblynQueenne
1 replies
5h42m

> Theorem proving and program fuzzing seem like good candidates for combining search with LLM's, due to automated, fast, good evaluation functions.

The problem with that is that search procedures and "evaluation functions" known to e.g. the theorem proving or planning communities are already at the limit of what is theoretically optimal, so what you need is not a new evaluation or search procedure but new maths, to know that there's a reason to try in the first place.

Take theorem proving, as a for instance (because that's my schtick). SLD-Resolution is a sound and complete automated theorem proving procedure for inductive inference that can be implemented by Depth-First Search, for a space-efficient implementation (but is susceptible to looping on left-recursions), or Breadth-First Search with memoization for a time-efficient implementation (but comes with exponential space complexity). "Evaluation functions" are not applicable: Resolution itself is a kind of "evaluation" function for the truth, or you could say the certainty of truth valuations, of sentences in formal logic; and, like I say, it's sound and complete, and semi-decidable for definite logic, and that's the best you can do short of violating Church-Turing.

You could perhaps improve the efficiency by some kind of heuristic search (people for example have tried that to get around the NP-hardness of subsumption, an important part of SLD-Resolution in practice), which is where an "evaluation function" (i.e. a heuristic cost function more broadly) comes in, but there are two problems with this: a) if you're using heuristic search it means you're sacrificing completeness, and b) there are already pretty solid methods to derive heuristic functions that are used in planning (from relaxations of a planning problem).

The lesson is: soundness, completeness, efficiency; choose two. At best a statistical machine learning approach, like an LLM, will choose a different two than the established techniques. Basically, we're at the point where only marginal gains, at the very limits of overall performance can be achieved when it comes to search-based AI. And that's were we'll stay at least until someone comes up with better maths.

skybrian
0 replies
3h48m

I’m wondering how those proofs work and in which problems their conclusions are relevant.

Trying more promising branches first improves efficiency in cases where you guess right, and wouldn’t sacrifice completeness if you would eventually get to the less promising choices. But in the case of something like a game engine, there is a deadline and you can’t search the whole tree anyway. For tough problems, it’s always a heuristic, incomplete search, and we’re not looking for perfect play anyway, just better play.

So for games, that trilemma is easily resolved. And who says you can’t improve heuristics with better guesses?

But in a game engine, it gets tricky because everything is a performance tradeoff. A smarter but slower evaluation of a position will reduce the size of the tree searched before the deadline, so it has to be enough of an improvement that it pays for itself. So it becomes a performance tuning problem, which breaks most abstractions. You need to do a lot of testing on realistic hardware to know if a tweak helped.

And that’s where things stood before AlphaGo came along and was able to train slower but much better evaluation functions.

The reason for evaluation functions is that you can’t search the whole subtree to see if a position is won or lost, so you search part way and then see if it looks promising. Is there anything like that in theorem proving?

PartiallyTyped
1 replies
17h16m

Building onto this comment: Terence Tao, the famous mathematician and a big proponent of computer-aided theorem proving, believes ML will open new avenues in the realm of theorem provers.

sgt101
0 replies
8h31m

Sure, but there are grounded metrics there (the theorem is proved or not proved) that allow feedback. The same goes for games, and almost the same for domains with cheap, approximate evaluators like protein folding (finding the structure is difficult, verifying it quite well is cheap).

For discovery and reasoning??? Not too sure.

salamo
4 replies
20h8m

Search is almost certainly necessary, and I think the trillion-dollar-cluster maximalists probably need to talk to the people who created superhuman chess engines that can now run on smartphones. Because one possibility is that someone figures out how to beat your trillion-dollar cluster with a million-dollar cluster, or with 500k million-dollar clusters.

On chess specifically, my takeaway is that the branching factor in chess never gets so high that a breadth-first approach is unworkable. The median branching factor (i.e. the number of legal moves) maxes out at around 40 but generally stays near 30. The most moves I have ever found in any position from a real game was 147, but at that point almost every move is checkmate anyways.

Creating superhuman go engines was a challenge for a long time because the branching factor is so much larger than chess.

Since MCTS is less thorough, it makes sense that a full search could find a weakness and exploit it. To me, the question is whether we can apply breadth-first approaches to larger games and situations, and I think the answer is clearly no. Unlike chess, the branching factor of real-world situations is orders of magnitude larger.

But also unlike chess, which is highly chaotic (small decisions matter a lot for future state), most small decisions don't matter. If you're flying from NYC to LA, it matters a lot if you drive or fly or walk. It mostly doesn't matter if you walk out the door starting with your left foot or your right. It mostly doesn't matter if you blink now or in two seconds.

cpill
3 replies
19h30m

I think the branching factor for LLMs is around 50k for the number of next possible tokens.

refulgentis
0 replies
19h21m

100%: for GPT-3 <= x < GPT-4o it's 100,064; for x = GPT-4o, it's 199,996. (My EoW emergency was that the const Map storing them broke the build, so these numbers happen to be top of mind.)

kippinitreal
0 replies
2h11m

I wonder if in an application you could branch on something more abstract than tokens. While there might be 50k token branches, with maybe 1k of reasonable likelihood, those probably cluster into a few themes you could branch off of. For example “he ordered a …” [burger, hot dog, sandwich: food] or [coke, coffee, water: drinks] or [tennis racket, bowling ball, etc: goods].

Hugsun
0 replies
29m

My guess is that it's much lower. I'm having a hard time finding an LLM output logit visualizer online, but IIRC around half of tokens are predicted with >90% confidence. There are regularly more difficult tokens that need to be predicted, but the >1% probability tokens aren't so many, probably around 10-20 in most cases.

This is of course based on the outputs of actual models that are only so smart, so a tree search that considers all possibly relevant ideas is going to have a larger amount of branches. Considering how many branches would be pruned to maintain grammatical correctness, my guess is that the token-level branching factor would be around 30. It could be up to around 300, but I highly doubt that it's larger than that.
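That kind of estimate is easy to make concrete from a next-token distribution. The probabilities below are made up, but the two numbers printed (tokens above a cutoff, and a perplexity-style effective branching factor) are the quantities being guessed at:

    import math

    # Effective branching factor of one next-token distribution (made-up numbers).
    probs = [0.9, 0.04, 0.02, 0.01, 0.01] + [0.02 / 2000] * 2000  # sums to 1.0

    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    print("tokens above 1% probability:", sum(p > 0.01 for p in probs))   # -> 3
    print("perplexity-style branching factor:", round(math.exp(entropy), 2))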

omneity
2 replies
14h27m

The post starts with a fascinating premise, but then falls short as it does not define search in the context of LLMs, nor does it explain how “Pfizer can access GPT-8 capabilities today with more inference compute”.

I found it hard to follow and I am an AI practitioner. Could someone please explain more what could the OP mean?

To me it seems that the flavor of search in the context of chess engines (look several moves ahead) is possible precisely because there’s an objective function that can be used to rank results, i.e. which potential move is “better” and this is more often than not a unique characteristic of reinforcement learning. Is there even such a metric for LLMs?

sgt101
0 replies
8h35m

Yeah - I think that's what they mean, and I think that there isn't such a metric. I think people will try to do adversarial evaluation, but my guess is that it will just tend to the mean prediction.

The other thing is that LLM inference isn't cheap. The trade-off between inference costs and training costs seems to be very application-specific. I suppose that there are domains where accepting 100x or 1000x inference costs vs 10x training costs makes sense, maybe?

qnleigh
0 replies
8h19m

Thank you, I am also very confused on this point. I hope someone else can clarify.

As a guess, could it mean that you would run the model forward a few tokens for each of its top predicted tokens, keep track of which branch is performing best against the training data, and then use that information somehow in training? But search is supposed to make things more efficient at inference time and this thought doesn't do that...

hartator
2 replies
20h11m

Isn't the "search" space infinite, though, and "success" impossible to quantify?

You can't just give LLMs infinite compute time and expect them to find answers for something like "cure cancer". Even chess, where the moves seem finite and success quantifiable, is effectively an infinite problem, and the best engines take "shortcuts" in their "thinking". It's impossible to do for real-world problems.

cpill
1 replies
19h35m

The recent episode of Machine Learning Street Talk on control theory for LLMs sounds like it's thinking in this direction. Say you have 100k agents searching through research papers, and then trying every combination of them, 100k^2, to see if there is any synergy of ideas, and you keep doing this for all the successful combos... some of these might give the researchers some good ideas to try out. I can see it happening, if they can fine-tune a model that becomes good at idea synergy. But then again, real creativity is hard.

Mehvix
0 replies
18h1m

How would one finetune for "idea synergy"?

groby_b
2 replies
21h8m

While I respect the power of intuition - this may well be a great path - it's worth keeping in mind that this is currently just that: a hunch. Leela got crushed due to AI-directed search, so what if we could wave a wand and hand all AIs search. Somehow. Magically. Which will then somehow magically trounce current LLMs at domain-specific tasks.

There's a kernel of truth in there. See the papers on better results via Monte Carlo tree search (e.g. [1]). See mixture-of-LoRA/LoRA-swarm approaches. (I swear there's a startup using the approach of tons of domain-specific LoRAs, but my brain's not yielding the name.)

Augmenting LLM capabilities via _some_ sort of cheaper and more reliable exploration is likely a valid path. It's not GPT-8 next year, though.

[1] https://arxiv.org/pdf/2309.03224

memothon
1 replies
18h28m

Did you happen to remember the domain-specific LoRA startup?

spencerchubb
1 replies
17h25m

The branching factor for chess is about 35.

For token generation, the branching factor depends on the tokenizer, but 32,000 is a common number.

Will search be as effective for LLMs when there are so many more possible branches?

sdenton4
0 replies
17h4m

You can pretty reasonably prune the tree by a factor of 1000... I think the problem that others have brought up - difficulty of the value function - is the more salient problem.
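A standard top-p (nucleus) cut over the next-token distribution is one way to get that kind of pruning; a small sketch with made-up probabilities:

    # Nucleus (top-p) pruning: keep the smallest set of tokens covering p of the mass.
    def top_p_tokens(probs, p=0.95):
        ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
        kept, total = [], 0.0
        for token, prob in ranked:
            kept.append(token)
            total += prob
            if total >= p:
                break
        return kept

    # A 32k-token vocabulary collapses to a handful of branches under top-p.
    probs = {"the": 0.55, "a": 0.3, "an": 0.08, "this": 0.04}
    probs.update({"tok%d" % i: 0.03 / 31996 for i in range(31996)})
    print(top_p_tokens(probs))   # -> ['the', 'a', 'an', 'this']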

sashank_1509
1 replies
15h42m

I wouldn’t read too much into Stockfish beating Leela Chess Zero. My calculator beats GPT-4 at matrix multiplication; that doesn’t mean we need to build what my calculator does into GPT-4 to make it smarter. Stockfish evaluates 70 million moves per second (or something in that ballpark). Chess is not such a complicated game that you would fail to find the best move when you evaluate 70 million moves. It’s why, when there was an argument over whether AlphaZero really beat Stockfish convincingly in Google’s PR stunt, a notable chess master quipped, “Even god would not be able to beat Stockfish this frequently.” Similarly, god with all his magical powers would not beat my calculator at multiplication. It says more about the task than about the nature of intelligence.

Veedrac
0 replies
14h52m

People vastly underestimate god. Players aren't just trying not to blunder, they're trying to steer towards advantageous positions. Stockfish could play perfectly against itself every move, 100 games in a row, in the classical sense of perfect play (never blundering away the draw with any move), and still be reliably exploited by an oracle.

optimalsolver
1 replies
19h43m

Charlie Steiner pointed this out 5 years ago on Less Wrong:

> If you train GPT-3 on a bunch of medical textbooks and prompt it to tell you a cure for Alzheimer's, it won't tell you a cure, it will tell you what humans have said about curing Alzheimer's ... It would just tell you a plausible story about a situation related to the prompt about curing Alzheimer's, based on its training data. Rather than a logical Oracle, this image-captioning-esque scheme would be an intuitive Oracle, telling you things that make sense based on associations already present within the training set.

> What am I driving at here, by pointing out that curing Alzheimer's is hard? It's that the designs above are missing something, and what they're missing is search. I'm not saying that getting a neural net to directly output your cure for Alzheimer's is impossible. But it seems like it requires there to already be a "cure for Alzheimer's" dimension in your learned model. The more realistic way to find the cure for Alzheimer's, if you don't already know it, is going to involve lots of logical steps one after another, slowly moving through a logical space, narrowing down the possibilities more and more, and eventually finding something that fits the bill. In other words, solving a search problem.

> So if your AI can tell you how to cure Alzheimer's, I think either it's explicitly doing a search for how to cure Alzheimer's (or worlds that match your verbal prompt the best, or whatever), or it has some internal state that implicitly performs a search.

https://www.lesswrong.com/posts/EMZeJ7vpfeF4GrWwm/self-super...

lucb1e
0 replies
13h44m

Generalizing this (doing half a step away from GPT-specifics), would it be true to say the following?

"If you train your logic machine on a bunch of medical textbooks and prompt it to tell you a cure for Alzheimer's, it won't tell you a cure, it will tell you what those textbooks have said about curing Alzheimer's."

Because I suspect not. GPT seems mostly limited to regurgitating and remixing what it read, but other algorithms with better logic could be able to essentially do a meta-study: take the results from all the Alzheimer's experiments we've done and narrow down the solution space beyond what humans have achieved so far. A human may not have the headspace to incorporate all relevant results at once, whereas a computer might.

Asking GPT to "think step by step" helps it, so clearly it has some form of this necessary logic, and it also performs well at "here's some data, transform it for me". It has limitations in both how good its logic is and the window across which it can do these transformations (but it can remember vastly more data from training than from the input token window, so perhaps that's a partial workaround). Since it does have both capabilities, it does not seem insurmountable to extend it: I'm not sure we can rule out that an evolution of GPT can find Alzheimer's cure within existing data, let alone a system even more suited to this task (still far short of needing AGI)

This requires the data to contain the necessary building blocks for a solution, but the quote seems to dismiss the option altogether even if the data did contain all information (but not yet the worked-out solution) for identifying a cure

jhawleypeters
1 replies
19h5m

I think I understand the game space that Leela and now Stockfish search. I don't understand whether the author envisions LLMs searching possibility spaces of

  1) written words,
  2) models of math / RL / materials science,
  3) some smaller, formalized space like the game space of chess,
all of the above, or something else. Did I miss where that was clarified?

fspeech
0 replies
16h20m

He wants the search algorithm to be able to search for better search algorithms, i.e. self-improving. That would eliminate some of the narrower domains.

dzonga
1 replies
17h53m

Slight step aside: do people at Notion realize that their own custom keyboard shortcuts break habits built on the web?

Cmd+P brings up their own custom dialog instead of printing the page, as one would expect.

sherburt3
0 replies
17h42m

In VS Code, Cmd+P pulls up the file search dialog; I don’t think it’s that crazy.

amandasystems
1 replies
11h34m

This feels a lot like generation 3 AI throwing out all the insights from gens 1 and 2 and then rediscovering them from first principles, but it’s difficult to tell what this text is really about because it lumps together a lot of things into “search” without fully describing what that means more formally.

PontifexMinimus
0 replies
11h21m

Indeed. It's obvious what search means for a chess program -- it's the future positions it looks at. But it's less obvious to me what it means for an LLM.

YeGoblynQueenne
1 replies
6h32m

> She was called Leela Chess Zero — ’zero’ because she started knowing only the rules.

That's a common framing but it's wrong. Leela -and all its friends- have another piece of chess-specific knowledge that is indispensable to their performance: they have a representation of the game of chess -a game-world model- as a game tree, divided in plys: one ply for each player's turn. That game tree is what is searched by adversarial search algorithms, such as minimax or Monte Carlo Tree Search (MCTS; the choice of Leela, IIUC).

More precisely modelling a game as a game tree applies to many games, not just chess, but the specific brand of game tree used in chess engines applies to chess and similar, two-person, zero-sum, complete information board games. I do like my jargon! For other kinds of games, different models, and different search algorithms are needed, e.g. see Poker and Libratus [1].
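For concreteness, the adversarial-search half of that package fits in a few lines; a generic negamax sketch (my illustration, not Leela's or Stockfish's actual code), where the moves/apply_move/evaluate callbacks are exactly the hand-supplied game-world model being described:

    # Generic negamax over a game tree. The search knows nothing about chess;
    # moves(state), apply_move(state, move) and evaluate(state) are the human-supplied
    # game model (evaluate scores the position for the player to move).
    def negamax(state, depth, moves, apply_move, evaluate):
        options = moves(state)
        if depth == 0 or not options:
            return evaluate(state), None
        best_score, best_move = float("-inf"), None
        for move in options:
            score, _ = negamax(apply_move(state, move), depth - 1,
                               moves, apply_move, evaluate)
            score = -score                    # the opponent's best is our worst
            if score > best_score:
                best_score, best_move = score, move
        return best_score, best_move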

Such a game tree, such a model of a game world, is currently impossible to do without if the target is superior performance. The article mentions no-search algorithms and briefly touches upon their main limitation (i.e. "why?").

All that btw is my problem with the Bitter Lesson: it is conveniently selective with what it considers domain knowledge (i.e. a "model" in the sense of a theory). As others have noted, e.g. Rodney Brooks [2], Convolutional Neural Nets have dominated image classification thanks to the use of convolutional layers to establish positional invariance. That's a model of machine vision invented by a human, alright, just as a game-tree is a model of a game invented by a human, and everything else anyone has ever done in AI and machine learning is the same: a human comes up with a model, of a world, of an environment, of a domain, of a process, then a computer calculates using that model, and sometimes even outperforms humans (as in chess, Go, and friends) or at the very least achieves results that humans cannot match with hand-crafted solutions.

That is a lesson to learn (with all due respect to Rich Sutton). Human model + machine computation has solved every hard problem in AI in the last 80 years. And we have no idea how to do anything even slightly different.

____________________

[1] https://en.wikipedia.org/wiki/Libratus

[2] https://rodneybrooks.com/a-better-lesson/

nojvek
0 replies
19m

We haven’t seen algorithms that build world models by observing. We’ve seen hints of it but nothing human like.

It will come eventually. We live in exciting times.

TheRoque
1 replies
18h54m

The whole premise of this article is to compare the chess state of the art of 2019 with today, and then it starts to talk about LLMs. But chess is a board with 64 squares and 32 pieces; it's literally nothing compared to the real physical world. So I don't get how this is relevant.

dgoodell
0 replies
16h58m

That’s a good point. Imagine if an LLM could only read, speak, and hear at the same speed as a human. How long would training a model take?

We can make them read digital media really quickly, but we can’t really accelerate its interactions with the physical world.

zucker42
0 replies
17h33m

If I had to bet money on it, researchers at top labs have already tried applying search to existing models. The idea to do so is pretty obvious. I don't think it's the one key insight to achieve AGI as the author claims.

stephc_int13
0 replies
18h31m

The author is making a few leaps of faith in this article.

First, his example of the efficiency of ML+search for playing Chess is interesting but not a proof that this strategy would be applicable or efficient in the general domain.

Second, he is implying that some next iteration of ChatGPT will reach AGI level, given enough scale and money. This should be considered hypothetical until proven.

Overall, he should be more scientific and prudent.

schlipity
0 replies
3h0m

I don't run javascript by default using NoScript, and something amusing happened on this website because of it.

The link for the site points to a notion.site address, but attempting to go to this address without javascript enabled (for that domain) forces a redirect to a notion.so domain. Attempting to visit just the basic notion.site address also does this same redirection.

What this ends up causing is that I don't have an easy way to use NoScript to temporarily turn on javascript for the notion.site domain, because it never loads. So much for reading this article.

kunalgupta
0 replies
1h46m

This is one of my favorite reads in a while

johnthewise
0 replies
22h2m

What happened to all the chatter about Q*? I remember reading about this train/test-time trade-off back then; does anyone have a good list of recent papers/blogs about this? What is holding this back, or is OpenAI just running some model 10x longer to estimate what they would get if they trained with 10x the compute?

This tweet is relevant: https://x.com/polynoamial/status/1676971503261454340

itissid
0 replies
17h32m

The problem is that the transitive closure of a chess move is still a chess move. The transitive closure of human knowledge and theories for doing X is new theories never seen before, and no value function can do that, unless you also include theorem proving for correctness verification, which is itself a very difficult and computationally expensive search problem.

Also, I think this is instead a time to sit back and think about what exactly we value in society: personal (human) self-sufficiency (I also like to compare this AI to UBI) and thus achievement. That implies human-in-the-loop AI that can help us achieve it, specific to each individual, i.e. multi-attribute value functions whose weights are learned and change over time.

Writing about AGI and defining it to do the "best" search while not talking about what we want it to do *for us* is exactly wrong-headed for these reasons.

galaxyLogic
0 replies
15h23m

How would search + LLMs work together in practice?

How about using search to derive facts from ontological models, and then writing out the discovered facts in English. Then train the LLM on those English statements. Currently LLMs are trained on texts found on the internet mostly (only?). But information on the internet is often false and unreliable.

If instead we had billions of logically sound statements derived from ontological world-models, that might improve the performance of LLMs significantly.

Is something like this what the article or others are proposing? Give the LLM the facts, and the derived facts. Prioritize texts and statements we know and trust to be true. And even though we can't write out too many true statements ourselves, a system that generated them by the billions by inference could.

brcmthrowaway
0 replies
17h56m

This strikes me as Lesswrong style pontificating.

bashfulpup
0 replies
16h56m

The biggest issue the author does not seem aware of is how much compute is required for this. This article is the equivalent of saying that a monkey given time will write Shakespeare. Of course it's correct, but the search space is intractable. And you would never find your answer in that mess even if it did solve it.

I've been building branching and evolving type llm systems for well over a year now full time.

I have built multiple "search" or "exploring" algorithms. The issue is that after multiple steps, your original agent, who was tasked with researching or doing biology, is now talking about battleships (an actual example from my previous work).

Single-step is the only real situation where search functions work. Multi-step agents explode to infinite possibilities very, very quickly.

Single-step has its own issues, though. While a zero-shot question run 1000 times (e.g., solve this code problem) may help find a better solution, it's a limited search space (which is a good thing).

I recently ran a test of 10k inferences of a single input prompt on multiple LLMs, varying the input configurations. What you find is that an individual prompt does not have infinite response possibilities. It's limited. This is why they can actually function as LLMs now.

Agents not working is an example of this problem. While a single-step search space is massive, it grows exponentially with every step the agent takes.

I'm building tools and systems around solving this problem, and to me, a massive search is as far off as saying all we need is 100x AI model sizes to solve it.

Autonomy != (intelligence or reasoning)

awinter-py
0 replies
1h52m

just came here to upvote the alphago / MCTS comments

ajnin
0 replies
2h2m

OT, but this website completely breaks arrow and page up/down scrolling, as well as alt+arrow navigation. Only mouse scrolling works for me (I'm using Firefox). Can't websites stop messing with basic browser functionality for no valid reason at all?

Hugsun
0 replies
2m

A big problem with the conclusions of this article is the assumptions around possible extrapolations.

We don't know if a meaningfully superintelligent entity can exist. We don't understand the ingredients of intelligence that well, and it's hard to say how far the quality of these ingredients can be improved in order to improve intelligence. For example, an entity with perfect pattern recognition ability might be superintelligent, or just a little smarter than Terence Tao. We don't know how useful it is to be better at pattern recognition to an arbitrary degree.

A common theory is that the ability to model processes, like the behavior of the external world, is indicative of intelligence. I think it's true. We also don't know the limitations of this modeling. We can simulate the world in our minds to a degree. The abstractions we use make the simulation more efficient, but less accurate. By this theory, to be superintelligent, an entity would have to simulate the world faster with similar accuracy, and/or use more accurate abstractions.

We don't know how much more accurate they can be per unit of computation. Maybe you have to quadruple the complexity of the abstraction, to double the accuracy of the computation, and human minds use a decent compromise that is infeasible to improve by a large margin. Maybe generating human level ideas faster isn't going to help because we are limited by experimental data, not by the ideas we can generate from it. We can't safely assume that any of this can be improved to an arbitrary degree.

We also don't know if AI research would benefit much from smarter AI researchers. Compute has seemed to be the limiting factor at almost all points up to now. So the superintelligence would have to help us improve compute faster than we can. It might, but it also might not.

This article reminds me of the ideas around the singularity, by placing too much weight on the belief that any trendline can be extended forever.

It is otherwise pretty interesting, and I'm excitedly watching the 'LLM + search' space.

6510
0 replies
17h57m

I've recently matured to the point where all applications are made of two things: search and security. The rest is just things added on top. If you can't find it, it isn't worth having.