https://twitter.com/gfodor/status/1735348301812383906
If DeepMind just definitively proved neural networks can generate genuinely new knowledge then it’s the most important discovery since fire.
If this were actually the case, why wouldn't everyone be talking about this? I am impressed it was done on PaLM 2, given that's less advanced than GPT-4 and Gemini. Will be wild to see what the next few generations of models can do utilizing methods like this.
The heavy lifting here is done by the evolutionary algorithm. The LLM is just being asked "propose some reasonable edits to these 20 lines of python" to replace a random mutation operator. It feels generous to credit the neural network with the knowledge generation here.
It's also highly dependent on the nature of the problem, even beyond the need to have the "hard to generate, easy to evaluate" structure. You have to be able to decompose the problem in such a way that a very short Python function is all that you want to evolve.
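To make that setup concrete, here is a minimal sketch of the kind of loop being described. Everything in it is invented for illustration: the toy objective, the names, and especially `mutate`, which stands in for the LLM call (or for the random mutator it replaces) so the sketch runs end to end.

```python
import random

def evaluate(candidate_src: str) -> float:
    """Fixed, human-written evaluator (the 'easy to evaluate' half).
    Toy objective: we want priority(x) to match 2*x + 1 on a few points."""
    namespace = {}
    try:
        exec(candidate_src, namespace)
        priority = namespace["priority"]
        return -sum(abs(priority(x) - (2 * x + 1)) for x in range(5))
    except Exception:
        return float("-inf")

def mutate(parent_srcs: list[str]) -> str:
    """Stand-in for the LLM: in FunSearch the model is shown a couple of
    parent programs and asked to write the next version. Here we just emit
    a random small program so the sketch is self-contained."""
    return (
        "def priority(x):\n"
        f"    return x * {random.randint(-3, 3)} + {random.randint(-3, 3)}\n"
    )

# Only the short `priority` function is evolved; the evaluator and the
# search loop are a fixed skeleton written by humans.
population = ["def priority(x):\n    return 0\n"]
for _ in range(200):
    child = mutate(population[-2:])
    if evaluate(child) >= max(evaluate(p) for p in population):
        population.append(child)

print(max(population, key=evaluate))
```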
There's nothing generous about it. It wouldn't be possible without the LLM. Traditional computational solvers fall well short, as the authors say and demonstrate.
Hahaha "Just".
We can see from the ablations that simply asking the LLM to generate a large number of potential solutions without the evolutionary search performs terribly.
It's certainly true that the LLM mutator produces more reasonable edits than a random mutator. But the value the LLM is bringing is that it knows how to produce reasonable-looking programs, not that it has some special understanding of bin packing or capset finding.
The only thing the LLM sees is two previously generated code samples, labeled <function>_v0 and <function>_v1 and then a header for <function>_v2 which it fills in. Look at Extended Data Fig. 1.
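Paraphrasing that figure, the prompt layout is roughly as below; the names, signatures, and docstrings here are my illustration rather than the paper's exact text.

```python
"""Finds large cap sets."""

def priority_v0(el, n):
    """Returns the priority with which we want to add `el` to the cap set."""
    return 0.0

def priority_v1(el, n):
    """Improved version of `priority_v0`."""
    # ...a previously generated, higher-scoring candidate goes here...
    return 0.0

def priority_v2(el, n):
    """Improved version of `priority_v1`."""
    # the LLM's only job is to complete this function body
    ...
```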
At the current level of model capability, both are necessary; nobody is denying that.
It doesn't take a genius, however, to see which is the more valuable contributor to the system. The authors say as much when they anticipate that most of any further improvement will come from more capable models rather than from changes to the search algorithm. GPT-4 may well have halved the search time, and so on.
So it is a general problem solver then, great. That's what we want.
Indeed, we can look at the ablations and see that swapping in a much weaker code LLM (15B StarCoder in place of 540B Codey) makes little difference:
"This illustrates that FunSearch is robust to the choice of the model as long as it has been trained sufficiently well to generate code."
In the sense that it does not rely on the LLM having any understanding of the problem, yes. But not in the sense that it can be applied to problems that are not naturally decomposable into a single short Python function.
Really stretching the meaning of "little difference" here
"While ‘StarCoder’ is not able to find the full-sized admissible set in any of the five runs, it still finds large admissible sets that improve upon the previous state of the art lower bound on the cap set capacity."
https://imgur.com/a/xerOOLn
It's literally telling you how much more valuable the LLM (any coding LLM) is than the alternatives, while also demonstrating that prediction competence still matters.
I don't think five runs is sufficient to reach the conclusion you're reaching here, especially when we can see that one of the Codey runs is worse than all of the StarCoder runs. If Codey is so much more valuable, why is this happening?
It's certainly true that LLMs are more valuable than the alternative tested, which is a set of hand-crafted code mutation rules. But it's important to think about why an LLM is better. There are two big pitfalls for the hand-crafted rules. First, a hand-crafted rule has to make local changes: it's vanishingly improbable for it to do something like "change < to <= in each of this series of if-statements" or "increase all constants by 1", whereas those are natural, simple changes an LLM might make. Second, random mutations have no concept of parsimonious code. There's nothing preventing them from generating code that computes values and then neglects to return them, or multiplies a previous computation by zero, or any number of other obviously useless variations.
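An invented toy example of both pitfalls (not code from the paper):

```python
def score_v0(xs):
    total = sum(x * x for x in xs)
    best = max(xs)
    return total + best

# Pitfall 2 in action: a random token-level mutation happily produces
# valid-but-useless code, dropping a value it computed and zeroing out
# the result.
def score_random_mutant(xs):
    total = sum(x * x for x in xs)
    best = max(xs)      # computed, then never used
    return total * 0    # multiplied by zero

# The kind of small, coherent, code-shaped edit an LLM mutator tends to
# make instead (and coordinated edits like "bump every constant" are
# reachable in one step, which pitfall 1 rules out for local mutators).
def score_llm_mutant(xs):
    total = sum(x * x for x in xs)
    best = max(xs)
    return total + 2 * best
```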
What the LLM brings to the table here is the ability to avoid pitfalls like the above. It writes code that is shaped sensibly: it can make a more natural set of edits than "just randomly delete an operator". But that's all it's doing. That's not, as the tweet quoted above put it, "generating genuinely new knowledge".
Put it this way - assuming for a moment that Codey ending up with a higher score than StarCoder is not random chance, do you think it's because Codey has some greater understanding of the admissible-set problem, or because Codey generates a different set of minor code edits?
Agreed. If the phrase “yeah well you could also do that with a group of moderately trained people” applies, you have a use case for an LLM.
Having the right hunches on what to try based on accumulated knowledge & experiences is a key thing that distinguishes masters from apprentices.
A fun story from a UCLA math PhD: “terry tao was on both my and my brother's committee.
he solved both our dissertation problems before we were done talking, each of us got "wouldn't it have been easier to...outline of entire proof"”
https://twitter.com/AAAzzam/status/1735070386792825334
Current LLMs are far from Terence Tao but Tao himself wrote this:
“The 2023-level AI can already generate suggestive hints and promising leads to a working mathematician and participate actively in the decision-making process. When integrated with tools such as formal proof verifiers, internet search, and symbolic math packages, I expect, say, 2026-level AI, when used properly, will be a trustworthy co-author in mathematical research, and in many other fields as well.”
https://unlocked.microsoft.com/ai-anthology/terence-tao/
Do you think the role the LLM plays in this system is analogous to what Tao is talking about?
What Tao does when proposing an idea most likely encapsulates much more than what a current LLM does. I’m no Terence Tao but I sometimes come up with useful ideas. In a more complex case, I revise those ideas in my head and sometimes on paper several times before they become useful (analogous to using evolutionary algorithms).
However, it is impractical to think consciously of all possible variations. So the brain only surfaces ones likely to be useful. This is the role an LLM plays here.
An expert or an LLM with more relevant experiences would be better at suggesting these variations to try. Chess grandmasters often don’t consciously simulate more possibilities than novices.
Critically, though, the LLM is not acting as a domain expert here. It's acting as a code mutator. The expertise it brings to the table is not mathematical: Codey doesn't have any special insight into bin packing heuristics. It's not generating "suggestive hints and promising leads", it's just helping the evolutionary search avoid nonsense code mutations.
An LLM serves as a sort of “expert” programmer here. Programming itself is a pretty complex domain in which a purely evolutionary algorithm isn’t efficient.
We can imagine LLMs trained in mathematical programming or a different domain playing the same role more effectively in this framework.
I disagree that the type of coding the LLM is doing here is "expert" level in any meaningful sense. Look for example at the code for the bin-packing heuristics:
https://github.com/google-deepmind/funsearch/blob/main/bin_p...
These aren't complex or insightful programs - they're pretty short simple functions, of the sort you typically get from program evolution. The LLM's role here is just proposing edits, not leveraging specialized knowledge or even really exercising the limits of existing LLMs' coding capabilities.
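For a sense of scale, the evolved heuristics have roughly this shape; the example below is a hypothetical illustration in the same spirit, not one of the functions from the repo.

```python
import numpy as np

def priority(item: float, bins: np.ndarray) -> np.ndarray:
    """Hypothetical online bin-packing heuristic: score each bin for `item`.
    A few arithmetic tweaks on top of best-fit; short, simple, and the kind
    of function a code LLM can edit sensibly without domain insight."""
    residual = bins - item
    fits = residual >= 0
    # Prefer tight fits, give a small bonus for exact fits, forbid overflow.
    return np.where(fits, -residual + 0.5 * np.isclose(residual, 0.0), -np.inf)
```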
To be clearer, the word “expert” in my comment above is in quotation marks because an LLM has more programming expertise than a non-programmer or an evolutionary algorithm, but not anywhere near a true expert programmer.
People might argue about how to apportion credit among the various parts. LLMs by themselves are quite limited in their creative potential to do new things or generally make progress, but if you couple their capabilities (as is done here) with other tech, their powers really shine.
Likewise when that coupled "tech" is a person. LLMs don't do my job or further my projects by replacing me, but I can use them to generate examples in seconds that would take me minutes or hours, and they can provide ideas for options moving forward.
I said “WOW!” out loud.
An LLM can discover a new result in a high-dimensional geometry problem that hasn't advanced in 20 years!? That goes way beyond gluing little bits of plagiarized training data together in a plausible way.
This suggests that there are hidden depths to LLMs’ capabilities if we can just figure out how to prompt and evaluate them correctly.
This significantly broke my expectations. Who knows what discovery could be hiding behind the next prompt and random seed.
Moreover, LLMs almost ALWAYS extrapolate, and never interpolate. They don't regurgitate training data. Doing so is virtually impossible.
An LLM's input (AND feature) space is enormous. Hundreds or thousands of dimensions. 3D space isn't like 50D or 5,000D space. The space is so combinatorially vast that basically no two points are neighbors. You cannot take your input and "pick something nearby" to a past training example. There IS NO nearby. No convex hull to walk around in. This "curse of dimensionality" wrecks arguments that these models only produce "in distribution" responses. They overwhelmingly can't! (Check out the literature of LeCun et al. for more rigor re. LLM extrapolation.)
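You can see the flavor of that geometric claim with a toy experiment (mine, not from the cited work): as dimension grows, the distance to your nearest neighbour approaches the distance to a random point, so nothing is meaningfully "nearby".

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
for d in (3, 50, 5000):
    pts = rng.random((500, d))                # 500 random points in the unit cube
    dmat = squareform(pdist(pts))             # all pairwise Euclidean distances
    np.fill_diagonal(dmat, np.inf)
    nearest = dmat.min(axis=1).mean()         # mean distance to nearest neighbour
    typical = dmat[np.isfinite(dmat)].mean()  # mean distance to a random point
    print(f"d={d:5d}  nearest={nearest:.2f}  typical={typical:.2f}  "
          f"ratio={nearest / typical:.2f}")
```

The ratio climbs toward 1 as d grows: in high dimensions even your "nearest" training example is about as far away as any other point.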
LLMs are creative. They work. They push into new areas daily. This reality won't change regardless of how weirdly, desperately the "stochastic parrot" people wish it were otherwise. At this point they're just denialists pushing goalposts around. Don't let 'em get to you!
While I very much do not think this is all they do, I don't think this statement is correct. Some research indicates that it is not:
https://not-just-memorization.github.io/extracting-training-...
Anecdotally, there were also a few examples I tried earlier this year (on GPT3.5 and GPT4) of being able to directly prompt for training data. They were patched out pretty quick but did work for a while. For example, asking for "fast inverse square root" without specifying anything else would give you the famous Quake III code character for character, including comments.
Your examples at best support, not contradict, my position.
1. Repeating "company" fifty times followed by random factoids is way outside of training data distribution lol. That's actually a hilarious/great example of creative extrapolation.
2. Extrapolation often includes memory retrieval. Recalling bits of past information is perfectly compatible with critical thinking, be it from machines or humans.
3. GPT4 never merely regurgitated the legendary fast root approximation to you. You might've only seen that bit, but that's confusing an iceberg with its tip. The actual output completion ran several hundred tokens, setting up GPT as this fantasy role-play writer who must finish this Simplicio-style dialogue between some dudes named USER and ASSISTANT, etc. This conversation, which does indeed end with Carmack's famous code, is nowhere near a training example to simply pluck from the combinatorial ether.
The "random factoids" were verbatim training data though, one of their extractions was >1,000 tokens in length.
I interpreted the claim that it can't "regurgitate training data" to mean that it can't reproduce verbatim a non-trivial amount of its training data. Based on how I've heard the word "regurgitate" used, if I were to rattle off the first page of some book from memory on request I think it would be fair to say I regurgitated it. I'm not trying to diminish how GPT does what it does, and I find what it does to be quite impressive.
Do you have a specific reference? I've mostly ignored LLMs until now because it seemed like the violent failure mode (confident + competent + wrong) renders them incapable of being a useful tool[1]. However this application, combined with the dimensionality idea, has me interested.
I do wish the authors of the work referenced here made it more clear what, if anything, the LLM is doing here. It's not clear to me it confers some advantage over a more normal genetic programming approach to these particular problems.
[1] in the sense that useful, safe tools degrade predictably. An airplane which stalls violently and in an unrecoverable manner doesn't get mass-produced. A circular saw which disintegrates when the blade binds throwing shrapnel into its operator's body doesn't pass QA. Etc.
"Learning in High Dimension Always Amounts to Extrapolation" [1]
[1] https://arxiv.org/abs/2110.09485
Thank you. I'll give this a look.
LLM + brute force coded by humans.
And that not only produced a human-interpretable method, but beat out 20 years of mathematicians working on a high-profile open question.
Who’s to say our brains themselves aren’t brute force solution searchers?
Yes, it's just much slower (many trillions of times) at number crunching.
The first advance in 20 years was by Fred Tyrrell last year (https://arxiv.org/abs/2209.10045), who showed that the combinatorial quantity in question is between 2.218 and 2.756, improving on the previous lower bound of 2.2174. DeepMind has now shown that the number is between 2.2202 and 2.756.
That's why the DeepMind authors describe it as "the largest increase in the size of cap sets in the past 20 years," not as the only increase. The arithmetic is that 2.2202-2.218 is larger than 2.218-2.2174. (If you consider this a very meaningful comparison to make, you might work for DeepMind.)
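Spelled out, using only the bounds quoted above:

```python
# Improvement in the lower bound on the cap set capacity:
funsearch_gain = 2.2202 - 2.2180   # FunSearch over Tyrrell's bound
tyrrell_gain = 2.2180 - 2.2174     # Tyrrell over the previous bound
print(round(funsearch_gain, 4), round(tyrrell_gain, 4))  # 0.0022 vs 0.0006
```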
It’s kind of like humans: seed of two people and a random interest to pursue, what could they do?!? It makes poverty and children dying unnecessarily even more depressing.
Neural networks have been able to "generate new knowledge" for a long time.
So have LLMs, https://www.nature.com/articles/s41587-022-01618-2
From the paper:
We note that FunSearch currently works best for problems having the following characteristics: a) availability of an efficient evaluator; b) a “rich” scoring feedback quantifying the improvements (as opposed to a binary signal); c) ability to provide a skeleton with an isolated part to be evolved. For example, the problem of generating proofs for theorems [52–54] falls outside this scope, since it is unclear how to provide a rich enough scoring signal.
Don't Look Up
In some sense, it's not especially interesting, because millions of programmers are doing this with Copilot every day. I get that in a lot of cases it's _just_ applying common knowledge to new domains, but it is novel code for the most part.
In this example, I feel it's relatively limited to finding new algorithms/functions. Which is great, but compared to the discovery of fire, and tons of things in between, like, say, electricity, it wouldn't feel in the same ballpark at all.