https://twitter.com/gfodor/status/1735348301812383906
If DeepMind just definitively proved neural networks can generate genuinely new knowledge then it’s the most important discovery since fire.
If this were actually the case, why wouldn't everyone be talking about this? I am impressed it was done on PaLM 2, given that's less advanced than GPT-4 and Gemini. Will be wild to see what the next few generations of models can do utilizing methods like this.
The heavy lifting here is done by the evolutionary algorithm. The LLM is just being asked "propose some reasonable edits to these 20 lines of python" to replace a random mutation operator. It feels generous to credit the neural network with the knowledge generation here.
It's also highly dependent on the nature of the problem, even beyond the need to have the "hard to generate, easy to evaluate" structure. You have to be able to decompose the problem in such a way that a very short Python function is all that you want to evolve.
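To make that setup concrete, here is a minimal sketch of the kind of loop being described. Everything in it is invented for illustration: the toy objective, the names, and especially `mutate`, which stands in for the LLM call (or for the random mutator it replaces) so the sketch runs end to end.

```python
import random

def evaluate(candidate_src: str) -> float:
    """Fixed, human-written evaluator (the 'easy to evaluate' half).
    Toy objective: we want priority(x) to match 2*x + 1 on a few points."""
    namespace = {}
    try:
        exec(candidate_src, namespace)
        priority = namespace["priority"]
        return -sum(abs(priority(x) - (2 * x + 1)) for x in range(5))
    except Exception:
        return float("-inf")

def mutate(parent_srcs: list[str]) -> str:
    """Stand-in for the LLM: in FunSearch the model is shown a couple of
    parent programs and asked to write the next version. Here we just emit
    a random small program so the sketch is self-contained."""
    return (
        "def priority(x):\n"
        f"    return x * {random.randint(-3, 3)} + {random.randint(-3, 3)}\n"
    )

# Only the short `priority` function is evolved; the evaluator and the
# search loop are a fixed skeleton written by humans.
population = ["def priority(x):\n    return 0\n"]
for _ in range(200):
    child = mutate(population[-2:])
    if evaluate(child) >= max(evaluate(p) for p in population):
        population.append(child)

print(max(population, key=evaluate))
```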
There's nothing generous about it. It wouldn't be possible without the LLM. Traditional computational solvers fall well short, as the authors say and demonstrate.
Hahaha "Just".
We can see from the ablations that simply asking the LLM to generate a large number of potential solutions without the evolutionary search performs terribly.
It's certainly true that the LLM mutator produces more reasonable edits than a random mutator. But the value the LLM is bringing is that it knows how to produce reasonable-looking programs, not that it has some special understanding of bin packing or capset finding.
The only thing the LLM sees is two previously generated code samples, labeled <function>_v0 and <function>_v1 and then a header for <function>_v2 which it fills in. Look at Extended Data Fig. 1.
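Paraphrasing that figure, the prompt layout is roughly as below; the names, signatures, and docstrings here are my illustration rather than the paper's exact text.

```python
"""Finds large cap sets."""

def priority_v0(el, n):
    """Returns the priority with which we want to add `el` to the cap set."""
    return 0.0

def priority_v1(el, n):
    """Improved version of `priority_v0`."""
    # ...a previously generated, higher-scoring candidate goes here...
    return 0.0

def priority_v2(el, n):
    """Improved version of `priority_v1`."""
    # the LLM's only job is to complete this function body
    ...
```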
At the current level of model capability, both are necessary; nobody is denying that.
It doesn't take a genius, however, to see which is the more valuable contributor to the system. The authors say as much when they anticipate that most of any further improvement will come from more capable models rather than from changes to the search algorithm. GPT-4 may well have halved the search time, and so on.
So it is a general problem solver then, great. That's what we want.
Indeed, we can look at the ablations and see that swapping in a much weaker code LLM (15B StarCoder in place of 540B Codey) makes little difference:
"This illustrates that FunSearch is robust to the choice of the model as long as it has been trained sufficiently well to generate code."
In the sense that it does not rely on the LLM having any understanding of the problem, yes. But not in the sense that it can be applied to problems that are not naturally decomposable into a single short Python function.
Really stretching the meaning of "little difference" here
"While ‘StarCoder’ is not able to find the full-sized admissible set in any of the five runs, it still finds large admissible sets that improve upon the previous state of the art lower bound on the cap set capacity."
https://imgur.com/a/xerOOLn
It's literally telling you how much more valuable the LLM (any coding LLM) is than the alternatives, while also demonstrating that prediction competence still matters.
I don't think five runs is sufficient to reach the conclusion you're reaching here, especially when we can see that one of the Codey runs is worse than all of the StarCoder runs. If Codey is so much more valuable, why is this happening?
It's certainly true that LLMs are more valuable than the alternative tested, which is a set of hand-crafted code mutation rules. But it's important to think about why an LLM is better. There are two big pitfalls for the hand-crafted rules. First, a hand-crafted rule has to make local changes: it's vanishingly improbable for it to do something like "change < to <= in each of this series of if-statements" or "increase all constants by 1", whereas those are natural, simple changes an LLM might make. Second, random mutations have no concept of parsimonious code. There's nothing preventing them from generating code that computes values and then neglects to return them, or multiplies a previous computation by zero, or any number of other obviously useless variations.
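An invented toy example of both pitfalls (not code from the paper):

```python
def score_v0(xs):
    total = sum(x * x for x in xs)
    best = max(xs)
    return total + best

# Pitfall 2 in action: a random token-level mutation happily produces
# valid-but-useless code, dropping a value it computed and zeroing out
# the result.
def score_random_mutant(xs):
    total = sum(x * x for x in xs)
    best = max(xs)      # computed, then never used
    return total * 0    # multiplied by zero

# The kind of small, coherent, code-shaped edit an LLM mutator tends to
# make instead (and coordinated edits like "bump every constant" are
# reachable in one step, which pitfall 1 rules out for local mutators).
def score_llm_mutant(xs):
    total = sum(x * x for x in xs)
    best = max(xs)
    return total + 2 * best
```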
What the LLM brings to the table here is the ability to avoid pitfalls like the above. It writes code that is shaped sensibly: it can make a more natural set of edits than "just randomly delete an operator". But that's all it's doing. That's not, as the tweet quoted above put it, "generating genuinely new knowledge".
Put it this way - assuming for a moment that Codey ending up with a higher score than StarCoder is not random chance, do you think it's because Codey has some greater understanding of the admissible-set problem, or because Codey generates a different set of minor code edits?
Agreed. If the phrase “yeah well you could also do that with a group of moderately trained people” applies, you have a use case for an LLM.
Having the right hunches on what to try based on accumulated knowledge & experiences is a key thing that distinguishes masters from apprentices.
A fun story from a UCLA math PhD: “terry tao was on both my and my brother's committee.
he solved both our dissertation problems before we were done talking, each of us got "wouldn't it have been easier to...outline of entire proof"”
https://twitter.com/AAAzzam/status/1735070386792825334
Current LLMs are far from Terence Tao but Tao himself wrote this:
“The 2023-level AI can already generate suggestive hints and promising leads to a working mathematician and participate actively in the decision-making process. When integrated with tools such as formal proof verifiers, internet search, and symbolic math packages, I expect, say, 2026-level AI, when used properly, will be a trustworthy co-author in mathematical research, and in many other fields as well.”
https://unlocked.microsoft.com/ai-anthology/terence-tao/
Do you think the role the LLM plays in this system is analogous to what Tao is talking about?
What Tao does when proposing an idea most likely encapsulates much more than what a current LLM does. I’m no Terence Tao but I sometimes come up with useful ideas. In a more complex case, I revise those ideas in my head and sometimes on paper several times before they become useful (analogous to using evolutionary algorithms).
However, it is impractical to think consciously of all possible variations. So the brain only surfaces ones likely to be useful. This is the role an LLM plays here.
An expert or an LLM with more relevant experiences would be better at suggesting these variations to try. Chess grandmasters often don’t consciously simulate more possibilities than novices.
Critically, though, the LLM is not acting as a domain expert here. It's acting as a code mutator. The expertise it brings to the table is not mathematical: Codey doesn't have any special insight into bin packing heuristics. It's not generating "suggestive hints and promising leads", it's just helping the evolutionary search avoid nonsense code mutations.
An LLM serves as a sort of “expert” programmer here. Programming itself is a pretty complex domain in which a purely evolutionary algorithm isn’t efficient.
We can imagine LLMs trained in mathematical programming or a different domain playing the same role more effectively in this framework.
I disagree that the type of coding the LLM is doing here is "expert" level in any meaningful sense. Look for example at the code for the bin-packing heuristics:
https://github.com/google-deepmind/funsearch/blob/main/bin_p...
These aren't complex or insightful programs - they're pretty short simple functions, of the sort you typically get from program evolution. The LLM's role here is just proposing edits, not leveraging specialized knowledge or even really exercising the limits of existing LLMs' coding capabilities.
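For a sense of scale, the evolved heuristics have roughly this shape; the example below is a hypothetical illustration in the same spirit, not one of the functions from the repo.

```python
import numpy as np

def priority(item: float, bins: np.ndarray) -> np.ndarray:
    """Hypothetical online bin-packing heuristic: score each bin for `item`.
    A few arithmetic tweaks on top of best-fit; short, simple, and the kind
    of function a code LLM can edit sensibly without domain insight."""
    residual = bins - item
    fits = residual >= 0
    # Prefer tight fits, give a small bonus for exact fits, forbid overflow.
    return np.where(fits, -residual + 0.5 * np.isclose(residual, 0.0), -np.inf)
```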
To be clearer, the word “expert” in my comment above is in quotation marks because an LLM has more programming expertise than a non-programmer or an evolutionary algorithm, but not anywhere near a true expert programmer.
People might argue about how to apportion credit among the various parts. LLMs by themselves are quite limited in their creative potential to do new things or generally make progress, but if you couple their capabilities (as is done here) with other tech, their powers really shine.
Likewise when that coupled "tech" is a person. LLMs don't do my job or further my projects by replacing me, but I can use them to generate examples in seconds that would take me minutes or hours, and they can provide ideas for options moving forward.
I said “WOW!” out loud.
An LLM can discover a new result in a high-dimensional geometry problem that hasn't advanced in 20 years!? That goes way beyond gluing little bits of plagiarized training data together in a plausible way.
This suggests that there are hidden depths to LLMs’ capabilities if we can just figure out how to prompt and evaluate them correctly.
This significantly broke my expectations. Who knows what discovery could be hiding behind the next prompt and random seed.
Moreover, LLMs almost ALWAYS extrapolate, and never interpolate. They don't regurgitate training data. Doing so is virtually impossible.
An LLM's input (AND feature) space is enormous. Hundreds or thousands of dimensions. 3D space isn't like 50D or 5,000D space. The space is so combinatorially vast that basically no two points are neighbors. You cannot take your input and "pick something nearby" to a past training example. There IS NO nearby. No convex hull to walk around in. This "curse of dimensionality" wrecks arguments that these models only produce "in distribution" responses. They overwhelmingly can't! (Check out the literature of LeCun et al. for more rigor re. LLM extrapolation.)
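You can see the flavor of that geometric claim with a toy experiment (mine, not from the cited work): as dimension grows, the distance to your nearest neighbour approaches the distance to a random point, so nothing is meaningfully "nearby".

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
for d in (3, 50, 5000):
    pts = rng.random((500, d))                # 500 random points in the unit cube
    dmat = squareform(pdist(pts))             # all pairwise Euclidean distances
    np.fill_diagonal(dmat, np.inf)
    nearest = dmat.min(axis=1).mean()         # mean distance to nearest neighbour
    typical = dmat[np.isfinite(dmat)].mean()  # mean distance to a random point
    print(f"d={d:5d}  nearest={nearest:.2f}  typical={typical:.2f}  "
          f"ratio={nearest / typical:.2f}")
```

The ratio climbs toward 1 as d grows: in high dimensions even your "nearest" training example is about as far away as any other point.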
LLMs are creative. They work. They push into new areas daily. This reality won't change regardless of how weirdly, desperately the "stochastic parrot" people wish it were otherwise. At this point they're just denialists pushing goalposts around. Don't let 'em get to you!
While I very much do not think this is all they do, I don't think this statement is correct. Some research indicates that it is not:
https://not-just-memorization.github.io/extracting-training-...
Anecdotally, there were also a few examples I tried earlier this year (on GPT3.5 and GPT4) of being able to directly prompt for training data. They were patched out pretty quick but did work for a while. For example, asking for "fast inverse square root" without specifying anything else would give you the famous Quake III code character for character, including comments.
Your examples at best support, not contradict, my position.
1. Repeating "company" fifty times followed by random factoids is way outside of training data distribution lol. That's actually a hilarious/great example of creative extrapolation.
2. Extrapolation often includes memory retrieval. Recalling bits of past information is perfectly compatible with critical thinking, be it from machines or humans.
3. GPT4 never merely regurgitated the legendary fast root approximation to you. You might've only seen that bit, but that's confusing an iceberg with its tip. The actual output completion ran several hundred tokens, setting up GPT as this fantasy role-play writer who must finish this Simplicio-style dialogue between some dudes named USER and ASSISTANT, etc. This conversation, which does indeed end with Carmack's famous code, is nowhere near a training example to simply pluck from the combinatorial ether.
The "random factoids" were verbatim training data though, one of their extractions was >1,000 tokens in length.
I interpreted the claim that it can't "regurgitate training data" to mean that it can't reproduce verbatim a non-trivial amount of its training data. Based on how I've heard the word "regurgitate" used, if I were to rattle off the first page of some book from memory on request I think it would be fair to say I regurgitated it. I'm not trying to diminish how GPT does what it does, and I find what it does to be quite impressive.
Do you have a specific reference? I've mostly ignored LLMs until now because it seemed like the violent failure mode (confident + competent + wrong) renders them incapable of being a useful tool[1]. However this application, combined with the dimensionality idea, has me interested.
I do wish the authors of the work referenced here made it more clear what, if anything, the LLM is doing here. It's not clear to me it confers some advantage over a more normal genetic programming approach to these particular problems.
[1] in the sense that useful, safe tools degrade predictably. An airplane which stalls violently and in an unrecoverable manner doesn't get mass-produced. A circular saw which disintegrates when the blade binds throwing shrapnel into its operator's body doesn't pass QA. Etc.
"Learning in High Dimension Always Amounts to Extrapolation" [1]
[1] https://arxiv.org/abs/2110.09485
Thank you. I'll give this a look.
LLM + brute force coded by humans.
And that not only produced a human-interpretable method, but beat out 20 years of mathematicians working on a high-profile open question.
Who’s to say our brains themselves aren’t brute force solution searchers?
Yes, it's just much slower (many trillions of times) at number crunching.
The first advance in 20 years was by Fred Tyrrell last year (https://arxiv.org/abs/2209.10045), who showed that the combinatorial quantity in question is between 2.218 and 2.756, improving on the previous lower bound of 2.2174. DeepMind has now shown that the number is between 2.2202 and 2.756.
That's why the DeepMind authors describe it as "the largest increase in the size of cap sets in the past 20 years," not as the only increase. The arithmetic is that 2.2202-2.218 is larger than 2.218-2.2174. (If you consider this a very meaningful comparison to make, you might work for DeepMind.)
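Spelled out, using only the bounds quoted above:

```python
# Improvement in the lower bound on the cap set capacity:
funsearch_gain = 2.2202 - 2.2180   # FunSearch over Tyrrell's bound
tyrrell_gain = 2.2180 - 2.2174     # Tyrrell over the previous bound
print(round(funsearch_gain, 4), round(tyrrell_gain, 4))  # 0.0022 vs 0.0006
```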
It’s kind of like humans: seed of two people and a random interest to pursue, what could they do?!? It makes poverty and children dying unnecessarily even more depressing.
Neural networks have been able to "generate new knowledge" for a long time.
So have LLMs, https://www.nature.com/articles/s41587-022-01618-2
From the paper:
We note that FunSearch currently works best for problems having the following characteristics: a) availability of an efficient evaluator; b) a “rich” scoring feedback quantifying the improvements (as opposed to a binary signal); c) ability to provide a skeleton with an isolated part to be evolved. For example, the problem of generating proofs for theorems [52–54] falls outside this scope, since it is unclear how to provide a rich enough scoring signal.
Don't Look Up
In some sense, it's not especially interesting, because millions of programmers are doing this with Copilot every day. I get that in a lot of cases it's _just_ applying common knowledge to new domains, but it is novel code for the most part.
In this example, I feel it's relatively limited to finding new algorithms/functions. Which is great, but compared to the discovery of fire, and tons of things in between, like, say, electricity, it wouldn't feel in the same ballpark at all.