AI solves International Math Olympiad problems at silver medal level

adverbly
42 replies
2h47m

First, the problems were manually translated into formal mathematical language for our systems to understand. In the official competition, students submit answers in two sessions of 4.5 hours each. Our systems solved one problem within minutes and took up to three days to solve the others.

Three days is interesting... Not technically silver medal performance I guess, but let's be real I'd be okay waiting a month for the cure to cancer.

lolinder
12 replies
2h5m

It feels pretty disingenuous to claim silver-medal status when your machine played by significantly different rules. The article is light on details, but it says they wired it up to a theorem prover, presumably with feedback sent back to the AI model for re-evaluation.

How many cycles of guess-and-check did it take over the course of three days to get the right answer?

If the IMO contestants were allowed to use theorem provers and were given 3 days (even factoring in sleep) would AlphaProof still have gotten silver?

let's be real I'd be okay waiting a month for the cure to cancer.

I don't think these results suggest that we're on the brink of knowledge coming at a substantially faster rate than before. Humans have been using theorem provers to advance our understanding for decades. Now an LLM has been wired up to one too, but it still took 8x as long to solve the problems as our best humans did without any computer assistance.

golol
6 replies
1h54m

I believe you are misreading this.

First of all, this is not a sport and the point is not to compare AI to humans. The point is to compare AI to IMO-difficulty problems.

Secondly, this is not some hacky trick where brute force and some theorem-prover magic are massaged to solve a select few problems and then you never hear about it again. They are building a general pipeline which turns informal natural language mathematics (of which we have ungodly amounts available) into formalized mathematics, and in addition trains a model to prove such kinds of mathematics. This can also work for theory building. This can become a real mathematical assistant that can help a mathematician test an argument, play with variations of a definition, try 100 combinations of some estimates, apply a classic but lengthy technique etc. etc.

lolinder
3 replies
1h33m

First of all, this is not a sport and the point is not to compare AI to humans. The point is to compare AI to IMO-difficulty problems.

If this were the case then the headline would be "AI solves 4/6 IMO 2024 problems", it wouldn't be claiming "silver-medal standard". Medals are generally awarded by comparison to other contestants, not to the challenges overcome.

This can become a real mathematical assistant that can help a mathematician test an argument, play with variations of a definition, try 100 combinations of some estimates, apply a classic but lengthy technique etc. etc.

This is great, and I'm not complaining about what the team is working on, I'm complaining about how it's being sold. Headlines like these from lab press releases will feed the AI hype in counterproductive ways. The NYT literally has a headline right now: "Move Over Mathematicians, Here Comes AlphaProof".

golol
2 replies
1h29m

At the IMO, "silver medal" afaik is defined as some range of points, which more or less equals some range of problems solved. For me it is fair to say that "silver-medal performance" is IMO language for about 4/6 problems solved. And what's the problem if some clickbait websites totally spin the result? They would've done it anyways even with a different title, and I also don't see the harm. Let people be wrong.

mathnmusic
1 replies
1h14m

No, "silver medal" is defined as a range of points to be earned in the allotted time (4.5 hours for both papers of 3 problems each).

lolinder
0 replies
13m

And the cutoffs are chosen after the results are in, not in advance.

riku_iki
1 replies
1h25m

They are building a general pipeline which turns informal natural lamguage mathematics

but this part currently sucks, because they didn't trust it and formalized problems manually.

golol
0 replies
1h23m

Yea that's fair, but I don't think it will keep sucking forever as formalization is in principle just a translation process.

ToucanLoucan
3 replies
2h2m

I am so exhausted of the AI hype nonsense. LLMs are not fucking curing cancer. Not now, not in five years, not in a hundred years. That's not what they do.

LLM/ML is fascinating tech that has a lot of legitimate applications, but it is not fucking intelligent, artificial or otherwise, and I am sick to death of people treating it like it is.

jercos
0 replies
1h5m

A significant part of the problem is that a majority of people are unaware of just how simple what they consider "intelligence" really is. You don't need actual intelligence to replace the public-facing social role of a politician, or a talking head, or a reactive-only middle manager. You just need words strung together that fit a problem.

apsec112
0 replies
1h58m

What observation, if you saw it, do you think would falsify that hypothesis?

29athrowaway
0 replies
1h43m

It is not artificial? So it is natural, then?

regularfry
0 replies
1h21m

I'm not sure it matters that it had access to a theorem prover. The fact that it's possible to build a black box that solves hard proofs on its own at all is the fascinating bit.

it still took 8x as long to solve the problems as our best humans did without any computer assistance.

Give it a year and that ratio will be reversed. At least. But also it matters less how long it takes if doubling the number of things reasoning at a best-human level is pronounced "ctrl-c, ctrl-v".

nnarek
7 replies
2h18m

"three days" does not say anything about how much computational power is used to solve problems, maybe they have used 10% of all GCP :)

vlovich123
4 replies
2h9m

The thing is though, once we have a benchmark that we pass, it's pretty typical to be able to bring down the time required in short order through performance improvements and iterating on ideas. So if you knew you had AGI but it took 100% of all GCP for 3 years to give a result, within the next 5 years that would come down significantly (not least because you'd build HW dedicated to accelerating the slow parts).

tsimionescu
3 replies
2h4m

That's patently false for many classes of problems. We know exactly how to solve the traveling salesman problem, and have for decades, but we're nowhere close to solving a random 1000 city case (note: there are approximate methods that can find good, but not optimal, results on millions of cities). Edit: I should say 1,000,000 city problem, as there are some solutions for 30-60k cities from the 2000s.

And there are good reasons to believe that theorem finding and proof generation are at least NP-hard problems.
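
To make the scaling point concrete, here is a minimal brute-force exact-TSP sketch (my illustration, not from the thread): it is provably correct for any input, yet the n! tour space makes it hopeless beyond a dozen or so cities.

    import itertools
    import math

    def exact_tsp(cities):
        """Shortest tour over all cities; exact but O(n!) in the worst case."""
        n = len(cities)
        dist = lambda i, j: math.dist(cities[i], cities[j])
        best_len, best_tour = float("inf"), None
        # Fix city 0 as the start so rotations of the same tour aren't recounted.
        for perm in itertools.permutations(range(1, n)):
            tour = (0, *perm)
            length = sum(dist(tour[k], tour[(k + 1) % n]) for k in range(n))
            if length < best_len:
                best_len, best_tour = length, tour
        return best_len, best_tour

    print(exact_tsp([(0, 0), (0, 1), (2, 1), (2, 0)]))  # a tiny instance runs instantly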

vlovich123
1 replies
1h36m

We're not talking about mathematical optimality here, either in the solution found or in the time taken. The point is whether this finds results more cheaply than a human can, and right now it's better on some problems while on others it's worse. Clearly if a human can do it, there is a way to solve it in a cheaper amount of time, and it would be flawed reasoning to think the current runtime is already asymptotically optimal.

While I agree that not all problems show this kind of acceleration in performance, that's typically only true if you've already spent so much time trying to solve it that you've asymptoted to the optimal solution. Right now we're nowhere near the asymptote for AI improvements. Additionally, there are so many research dollars flowing into AI precisely because the potential upside here is nowhere near realized and there are lots of research lines still left to be explored. Geoffrey Hinton ended the AI winter.

visarga
0 replies
1h17m

The point is whether this finds results more cheaply than a human can

If you need to solve 1000 problems in 3 days you wouldn't find the humans that can do it. So it would not be cheaper if it's not possible.

rowanG077
0 replies
1h57m

The person said typically, not always the case. Just because there are obviously cases where it didn't happen doesn't mean it's not still typically the case.

falcor84
1 replies
2h6m

And say they did use 10% of all GCP? Would it be less impressive? This is a result that was considered by experts to be far beyond the state of the art; it's absolutely ok if it's not very efficient yet.

Also, for what it's worth, I'm pretty sure that I wouldn't have been able to solve it myself in three days, even if I had access to all of GCP, Azure and AWS (except if I could mine crypto to then pay actual IMO-level mathematicians to solve it for me).

nnarek
0 replies
1h34m

yes it is very impressive, especially autoformalization of problems written in natural language and also proof search of theorems

ZenMikey
6 replies
2h34m

I haven't read TFA as I'm at work, but I would be very interested to know what the system was doing in those three days. Were there failed branches it explored? Was it just fumbling its way around until it guessed correctly? What did the feedback loop look like?

visarga
1 replies
1h24m

just fumbling its way around until it guessed correctly

As opposed to the 99.9999% of the human population who can't do it even if their life depends on it?

dsign
0 replies
44m

I was going to come here to say that. I remember being a teenager and giving up in frustration at IMO problems. And I was competing at IPhO.

thomasahle
0 replies
41m

They just write "it's like AlphaZero". So presumably they used a version of MCTS where each terminal node is scored by Lean as either correct or incorrect.

Then they can train a network to evaluate intermediate positions (score network) and one to suggest things to try next (policy network).
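
For illustration, here's a self-contained toy of that AlphaZero-style loop (my sketch, not DeepMind's code): MCTS where only terminal states get an exact verdict from a verifier (integer arithmetic standing in for Lean), while a hand-written heuristic stands in for the learned score network.

    import math

    TARGET, MAX_DEPTH, MOVES = 12, 4, [1, 2, 3, 4, 5]  # toy "tactics"

    class Node:
        def __init__(self, state):
            self.state, self.children = state, {}
            self.visits, self.total = 0, 0.0

    def is_terminal(state):
        return len(state) == MAX_DEPTH

    def verify(state):
        # Exact verdict at terminal nodes, playing the role of the Lean kernel.
        return 1.0 if sum(state) == TARGET else 0.0

    def heuristic_value(state):
        # Stand-in for a learned score network on intermediate positions.
        return 1.0 / (1.0 + abs(TARGET - sum(state)))

    def ucb(parent, child, c=1.4):
        if child.visits == 0:
            return float("inf")
        return child.total / child.visits + c * math.sqrt(math.log(parent.visits) / child.visits)

    def mcts(root, iters=2000):
        for _ in range(iters):
            node, path = root, [root]
            # Select: descend by UCB until reaching a terminal or unvisited node.
            while node.visits > 0 and not is_terminal(node.state):
                for m in MOVES:
                    node.children.setdefault(m, Node(node.state + (m,)))
                node = max(node.children.values(), key=lambda ch: ucb(path[-1], ch))
                path.append(node)
            value = verify(node.state) if is_terminal(node.state) else heuristic_value(node.state)
            for n in path:  # backpropagate the outcome up the tree
                n.visits += 1
                n.total += value
        return root

    root = mcts(Node(()))
    print(max(root.children.values(), key=lambda ch: ch.visits).state)  # most promising first "tactic"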

qsort
0 replies
2h25m

I can't find a link to an actual paper, that just seems to be a blog post. But from what I gather the problems were manually translated to Lean 4, and then the program is doing some kind of tree search. I'm assuming they are leveraging the proof checker to provide feedback to the model.

lacker
0 replies
1h9m

The training loop was also applied during the contest, reinforcing proofs of self-generated variations of the contest problems until a full solution could be found.

So they had three days to keep training the model, on synthetic variations of each IMO problem.
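
Read literally, that suggests a test-time reinforcement loop along these lines (a rough sketch; `propose_variant`, `search_proof`, and `reinforce` are hypothetical stand-ins, not DeepMind's API):

    from typing import Callable, Optional

    def contest_time_loop(problem: str,
                          propose_variant: Callable[[str], str],
                          search_proof: Callable[[str], Optional[str]],
                          reinforce: Callable[[str, str], None],
                          budget: int = 10_000) -> Optional[str]:
        for _ in range(budget):
            if (proof := search_proof(problem)) is not None:
                return proof                    # the real target is solved
            variant = propose_variant(problem)  # nearby, self-generated problem
            if (p := search_proof(variant)) is not None:
                reinforce(variant, p)           # a verified win becomes training data
        return None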

10100110
5 replies
1h44m

Don't confuse interpolation with extrapolation. Curing cancer will require new ideas. IMO requires skill proficiency in tasks where the methods of solving are known.

trotro
1 replies
1h15m

The methods are known, but the solutions to the IMO problems weren't. So the AI did extrapolate a solution.

Also, there's no reason to affirm that an eventual cure for cancer requires fundamentally new methods. Maybe the current methods are sufficient, it's just that nobody has been "smart" enough to put the pieces together. (disclaimer: not an expert at all)

dsign
0 replies
39m

I think you are correct though. We don't need new physics to cure cancer. But we may need information-handling, reasoning and simulation systems which are orders of magnitude bigger and more complex than anything we have this year. We also need to stop pussy-footing and diddling with ideologies and start working on the root cause of cancer and almost every other disease, which is aging.

visarga
0 replies
1h23m

Search is extrapolation. Learning is interpolation. Search+Learn is the formula used by AZ. Don't forget AZ taught us humans a thing or two about a game we had a 2000-year head start in, and starting from scratch, not from human supervision.

trueismywork
0 replies
1h33m

They are the same things

golol
0 replies
1h25m

Mathematicians spend most of their time interpolating between known ideas and it would be extremely helpful to have computer assistance with that.

ComplexSystems
4 replies
2h26m

Or the simultaneous discovery of thousands of cryptographic exploits...

poincaredisk
3 replies
2h11m

Still waiting for the first one. I'm not holding my breath - just like fuzzing found a lot of vulnerabilities in low-level software, I expect novel automated analysis approaches will yield some vulnerabilities - but that won't be a catastrophic event just like fuzzing wasn't.

throwaway240403
0 replies
1h3m

Hope that's true. Really mucks up the world a bit if not.

sqeaky
0 replies
2h1m

I hope it doesn't find a new class of bug. Finding another thing like Spectre could be problematic.

EDIT - I hope that if that new class of bug exists, it is found. I hope that new class of bug doesn't exist.

criddell
0 replies
51m

It's rumored that the NSA has 600 mathematicians working for them. If they are the ones finding the exploits you will probably never hear about them until they are independently discovered by someone who can publish.

wongarsu
2 replies
2h9m

The problem solved "within minutes" is also interesting. I'd interpret that as somewhere between 2 and 59 minutes. Given the vagueness probably on the higher end, otherwise they'd celebrate it more. The students had 6 tasks in 9 hours, so on average 1.5h per task. If you add the time a student would take to (correctly!) translate the problems to their input format, their best-case runtime is probably about as fast as a silver-medalist would take to solve the problem on their own.

But even if they aren't as fast as humans yet this is very valuable. Both as a stepping stone, and because at a certain scale compute is much easier to scale than skilled mathematicians.

gjm11
1 replies
1h55m

They say "our systems" (presumably meaning AlphaProof and AlphaGeometry 2) solved one problem "within minutes", and later on the page they say that the geometry question (#4) was solved by AlphaGeometry in 19 seconds.

So either (1) "within minutes" was underselling the abilities of the system, or (2) what they actually meant was that the geometry problem was solved in 19 seconds, one of the others "within minutes" (I'd guess #1 which is definitely easier than the other two they solved), and the others in unspecified times of which the longer was ~3 days.

I'd guess it's the first of those.

(Euclidean geometry has been a kinda-solved domain for some time; it's not super-surprising that they were able to solve that problem quickly.)

As for the long solve times, I would guess they're related to this fascinating remark:

The training loop was also applied during the contest, reinforcing proofs of self-generated variations of the contest problems until a full solution could be found.

visarga
0 replies
1h20m

Euclidean geometry still requires constructions to solve, and those are based on intuition.

Smaug123
26 replies
2h19m

So I am extremely hyped about this, but it's not clear to me how much heavy lifting this sentence is doing:

First, the problems were manually translated into formal mathematical language for our systems to understand.

The non-geometry problems which were solved were all of the form "Determine all X such that…", and the resulting theorem statements are all of the form "We show that the set of all X is {foo}". The downloadable solutions from https://storage.googleapis.com/deepmind-media/DeepMind.com/B... don't make it clear whether the set {foo} was decided by a human during this translation step, or whether the computer found it. I want to believe that the computer found it, but I can't find anything to confirm. Anyone know?

ocfnash
8 replies
2h16m

The computer did find the answers itself. I.e., it found "even integers" for P1, "{1,1}" for P2, and "2" for P6. It then also provided a Lean proof in each case.

nnarek
5 replies
2h7m

The formal definition of the first theorem already contains the answer to the problem: "{α : ℝ | ∃ k : ℤ, Even k ∧ α = k}" (which means the set of even real numbers). If they say that they translated the first problem into a formal definition, then it is very interesting how they initially formalized the problem without including the answer in it.
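
For concreteness, the published statement has roughly this shape in Lean 4 with Mathlib (my sketch of the shape only, not necessarily DeepMind's exact encoding, with `sorry` in place of the proof):

    import Mathlib

    -- The answer set (reals equal to an even integer) is already part of
    -- the statement, which is exactly the concern raised above.
    theorem imo2024_p1 :
        {α : ℝ | ∀ n : ℕ, 0 < n → (n : ℤ) ∣ ∑ i in Finset.Icc 1 n, ⌊(i : ℝ) * α⌋} =
        {α : ℝ | ∃ k : ℤ, Even k ∧ α = k} := by
      sorry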

riku_iki
0 replies
2h0m

It's not clear if the theorem is the actual input formal definition, or if the formal definition was in a different form.

pishpash
0 replies
1h13m

Exactly, a problem and its answer are just different ways of describing the same object. Every step of a proof is a transformation/translation of the same object. It would be disingenuous to say that some heavy lifting isn't done in formalizing a problem but it seems that step is also performed by a machine:

"We established a bridge between these two complementary spheres by fine-tuning a Gemini model to automatically translate natural language problem statements into formal statements, creating a large library of formal problems of varying difficulty."

I'm confused, is the formalization by Gemini or "manually"? Which is it?

golol
0 replies
2h1m

I would expect that in the data they train AlphaProof on, they have some concept of a "vague problem" which could just look like

{Formal description of the set in question} = ?

And then AlphaProof has to find candidate descriptions of this set and prove a theorem that they are equal to the above.

I doubt they would claim to solve the problem if they provided half of the answer.

cygaril
0 replies
1h59m

Come up with many possible answers, formalize them all, and then try to prove or disprove each of them.
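
In outline, that loop might look like this (a sketch; all three helpers are hypothetical stand-ins for a generator model and a Lean-backed prover, not any real API):

    from typing import Callable, Optional

    def guess_and_settle(problem: str,
                         propose_answers: Callable[[str], list[str]],
                         try_prove: Callable[[str, str], Optional[str]],
                         try_disprove: Callable[[str, str], Optional[str]]) -> Optional[tuple[str, str]]:
        for answer in propose_answers(problem):
            if (proof := try_prove(problem, answer)) is not None:
                return answer, proof   # verified: the answer plus its Lean proof
            if try_disprove(problem, answer) is not None:
                continue               # refuted: discard and try the next guess
        return None                    # budget exhausted with nothing settled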

Smaug123
0 replies
1h46m

(You're talking to one of the people who was part of the project, which is why I took @ocfnash's answer as authoritative: they did not cheat.)

Davidzheng
1 replies
54m

Can you elaborate on how it makes guesses like this? Does it do experiments first? Is it a raw LLM? Is it a feedback loop based on partial progress?

Sharlin
0 replies
23m

"AlphaProof is a system that trains itself to prove mathematical statements in the formal language Lean. It couples a pre-trained language model with the AlphaZero reinforcement learning algorithm, which previously taught itself how to master the games of chess, shogi and Go."

dooglius
7 replies
2h8m

The linked page says

While the problem statements were formalized into Lean by hand, the answers within the problem statements were generated and formalized by the agent.

However, it's unclear what initial format was given to the agents that allowed this step

pclmulqdq
3 replies
1h17m

So if Lean was used to find the answers, where exactly is the AI? A thin wrapper around Lean?

xrisk
0 replies
54m

Lean is just the language, presumably used to drive the AI towards the program (“the proof”).

pishpash
0 replies
50m

The AI is the "solver network", which is the (directed) search over solutions generated by Lean. The AI's contribution is doing an efficient search, I suppose.

I'm also waiting for my answer on the role of the Gemini formalizer, but reading between the lines, it looks like it was only used during training the "solver network", but not used in solving the IMO problems. If so then the hyping is greatly premature, as the hybrid formalizer/solver is the whole novelty of this, but it's not good enough to use end-to-end?

You cannot say AlphaProof learned enough to solve problems if formalization made them easier to solve in the first place! You can say that the "solver network" part learned enough to solve formalized problems better than prior training methods.

dooglius
0 replies
55m

Lean checks that the proof is valid, it didn't find the proof.

Smaug123
2 replies
1h47m

FWIW, GPT-4o transcribed a screenshot of problem 1 perfectly into LaTeX, so I don't think "munge the problem into machine-readable form" is per se a difficult part of it these days even if they did somehow take shortcuts (which it sounds like they didn't).

pclmulqdq
1 replies
1h15m

Comparing "turn photo into LaTeX" to "translate theorems into Lean" is like comparing a child's watercolor drawing to the Mona Lisa.

Smaug123
0 replies
55m

… no? After the LaTeX output, I told stock GPT-4o that the answer was "all even integers", and asked for the statement in Lean. I had to make two changes to its output (both of which were compile-time errors, not misformalisations), and it gave me the formalisation of the difficult direction of the problem.

Both changes were trivial: it had one incorrect (but unnecessary) import, and it used the syntax from Lean 3 instead of Lean 4 in one lambda definition. A system that was trained harder on Lean would not make those mistakes.

The one actual error it made was in not proposing that the other direction of the "if and only if" is required. Again, I am quite confident that this formalisation failure mode is not hard to solve in a system that is, like, actually trained to do this.

Obviously formalising problems that a working mathematician solves is dramatically harder than formalising IMO problems, and is presumably way ahead of the state of the art.

summerlight
5 replies
1h31m

To speak generally, that translation part is much easier than the proof part. The problem with automated translation is that the result might be incorrect. This happens a lot even when people try formal methods by hand, so I guess the researchers concluded that they'd have to audit every single translation regardless of using an LLM or whatever other tools.

thomasahle
3 replies
55m

You'd think that, but Timothy Gowers (the famous mathematician they worked with) wrote (https://x.com/wtgowers/status/1816509817382735986)

However, LLMs are not able to autoformalize reliably, so they got them to autoformalize each problem many times. Some of the formalizations were correct, but even the incorrect ones were useful as training data, as often they were easier problems.

So they didn't actually solve autoformalization, which is why they still needed humans to translate the input IMO 2024 problems.

The reason why formalization is harder than you think is that there is no way to know if you got it right. You can use Reinforcement Learning with proofs and have a clear signal from the proof checker. We don't have a way to verify formalizations the same way.
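
In outline, the workaround Gowers describes is best-of-k sampling with the compiler as a weak filter (a sketch; `llm_formalize` and `lean_accepts` are hypothetical stand-ins):

    from typing import Callable

    def harvest_formalizations(problem_text: str,
                               llm_formalize: Callable[[str], str],
                               lean_accepts: Callable[[str], bool],
                               k: int = 32) -> list[str]:
        # Keep every sampled statement that Lean will elaborate. Per Gowers,
        # even mis-translations are useful training data, often as easier
        # problems; what the compiler cannot check is faithfulness.
        samples = [llm_formalize(problem_text) for _ in range(k)]
        return [s for s in samples if lean_accepts(s)]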

thrdbndndn
1 replies
16m

However, LLMs are not able to autoformalize reliably, so they got them to autoformalize each problem many times. Some of the formalizations were correct, but even the incorrect ones were useful as training data, as often they were easier problems.

A small detail wasn't clear to me: for these incorrectly formalized problems, how do they get the correct answer as ground truth for training? Did they have a human manually solve them?

(In contrast, for problems actually from "a huge database of IMO-type problems", they already have answers.)

adrianN
0 replies
7m

You write proofs in a formal language that can be machine checked. If the checker is happy, the proof is correct (unless there is a bug in the checker, but that is unlikely).
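
A trivial Lean example of the idea (mine, not from the thread): the kernel either accepts the proof term or it doesn't, so no human re-checking is needed.

    -- If Lean's kernel accepts this term, the proof is correct; a bogus
    -- "proof" simply fails to typecheck.
    example (a b : Nat) : a + b = b + a := Nat.add_comm a b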

llwu
0 replies
27m

We don't have a way to verify formalizations the same way.

While there is no perfect method, it is possible to use the agent to determine if the statement is false, has contradictory hypotheses, or a suspiciously short proof.

ajross
0 replies
34m

To speak generally, that translation part is much easier than the proof part.

To you or me, sure. But I think the proof that it isn't for this AI system is that they didn't do it. Asking a modern LLM to "translate" something is a pretty solved problem, after all. That argues strongly that what was happening here is not a "translation" but something else, like a semantic distillation.

If you ask an AI (or person) to prove the halting problem, they can't. If you "translate" the question into a specific example that does halt, they can run it and find out.

I'm suspicious, basically.

zerocrates
0 replies
1h46m

Interesting that they have a formalizer (used to create the training data) but didn't use it here. Not reliable enough?

kurthr
0 replies
1h3m

As is often the case, creating a well-formed problem statement takes as much knowledge (if not work) as finding the solution.

But seriously, if you can't ask the LLM to solve the right question, you can't really expect it to give you the right answer unless you're really lucky. "I'm sorry, but I think you meant to ask a different question. You might want to check the homework set again to be sure, but here's what I think you really want."

golol
0 replies
2h4m

When presented with a problem, AlphaProof generates solution candidates and then proves or disproves them by searching over possible proof steps in Lean.

To me, this sounds like AlphaProof receives a "problem", whatever that is (how do you formalize "determine all X such that..."? One is asked to show that an abstract set is actually some easily understandable set...). Then it generates candidate theorems, presumably in Lean, i.e. the set is {n : P(n)} for some formula or something. Then it searches for proofs.

I think if AlphaProof did not find {foo} but was given it, then it would be very outrageous to claim that it solved the problem.

I am also very hyped.

lolinder
12 replies
1h56m

This is a fun result for AI, but a very disingenuous way to market it.

IMO contestants aren't allowed to bring in paper tables, much less a whole theorem prover. They're given two 4.5 hour sessions (9 hours total) to solve all the problems with nothing but pencils, rulers, and compasses [0].

This model, meanwhile, was wired up to a theorem prover and took three solid days to solve the problems. The article is extremely light on details, but I'm assuming that most of that time was guess-and-check: feed the theorem prover a possible answer, get feedback, adjust accordingly.

If the IMO contestants were given a theorem prover and three days (even counting breaks for sleeping and eating!), how would AlphaProof have ranked?

Don't get me wrong, this is a fun project and an exciting result, but their comparison to silver medalists at the IMO is just feeding into the excessive hype around AI, not accurately representing its current state relative to humanity.

[0] 5.1 and 5.4 in the regulations: https://www.imo-official.org/documents/RegulationsIMO.pdf

golol
7 replies
1h41m

The point is not to compare AI and humans, it is to compare AI and IMO-level math problems. It's not for sport.

lolinder
5 replies
1h39m

They're literally comparing AI to human IMO contestants. "AlphaProof solves 4/6 IMO problems correctly" would be the non-comparison version of this press release and would give a better sense of how it's actually doing.

golol
4 replies
1h35m

"Solving IMO problems at Silver-Medal level" is pretty much equivalent to solving something like 4/6 problems. It is only a disingenious comparison if you want to read it as a comparison. I mean yea, many people will, but I don't care anout them. People who are technically interested in this know that the point is not to have a competition of AI with humans.

lolinder
3 replies
1h30m

but I don't care about them

It's great that you feel safe being so aloof, but I believe we have a responsibility in tech to turn down the AI hype valve.

The NYT is currently running a piece with the headline "Move Over, Mathematicians, Here Comes AlphaProof". People see that, and people react, and we in tech are not helping matters by carelessly making false comparisons.

golol
1 replies
1h27m

Why? Why is hype bad? What actual harm does it cause?

Also the headline is fair, as I do believe that AlphaProof demonstrates an approach to mathematics that will indeed invade mathematicians' workspaces. And I say that as a mathematician.

Davidzheng
0 replies
26m

For sure. I feel like mathematicians are not paying attention (maybe deliberately); we have a real shot at solving some incredibly hard open problems in the next few years.

visarga
0 replies
1h6m

I think search-based AI is on a different level than imitation models like GPT. This is not a hallucinating model, and it could potentially discover things that are not written in any books.

Search is amazing. Protein folding searches for the lowest energy configuration. Evolution searches for ecological fit. Culture searches for progress, and science for understanding. Even placing a foot on the ground searches for dynamic equilibrium. Training an ML model searches for the best parameters to fit a dataset. Search is universal. Supervised learning is just imitative; search is extrapolative.

kenjackson
0 replies
1h35m

Exactly. The point is whether we can eventually get AI to solve problems which we as humans can't. Not whether we can win the IMO with a computer.

blackbear_
1 replies
38m

And why aren't you complaining that human participants could train and study for thousands of hours before attempting the problems? And that the training materials they used were themselves created and perfected by hundreds of other people, who themselves spent countless hours studying?

lolinder
0 replies
24m

Because so did the AI?

gjm11
0 replies
1h51m

Working mathematicians mostly don't use theorem provers in their work, and find that when they do they go significantly more slowly (with of course the compensating advantage of guaranteeing no mistakes in the final result).

A theorem prover is probably more useful for the typical IMO problem than for the typical real research problem, but even so I'd guess that even with a reasonable amount of training most IMO contestants would not do much better for having access to a theorem prover.

Having three days would be a bigger deal, for sure. (But from "computers can't do this" to "computers can do this, but it takes days" is generally a much bigger step than from "computers can do this, but it takes days" to "computers can do this in seconds".)

Davidzheng
0 replies
27m

I can tell you that as someone who could have gotten bronze (I was too weak for the team) and is now a math PhD--I would not have scored as well as AlphaProof in three days, most likely. In most problems, either you find an idea soon or it can take much, much longer. It's just not a matter of steady work and constant progress.

piombisallow
11 replies
2h41m

IMO problems aren't fundamentally different from chess or other games, in that the answer is already known.

almostgotcaught
2 replies
2h26m

in that the answer is already known.

you realize this holds true for all of math, right? Outside of Gödel incompleteness potholes, every proof/theorem is a permutation of ZFC. And you can fix the potholes by just filling them in with more Cs.

lanstin
0 replies
1h43m

There are elementary Diophantine equations that are independent of ZFC. What is your approach for those?

diffeomorphism
0 replies
1h54m

That seems shallow and not really useful. Like saying sorting is easy: just take all permutations and look at each one to see whether it is sorted... it just might take you longer than the heat death of the universe to actually do that.

Smaug123
2 replies
2h34m

I really don't understand what you mean by this. 1) it's not known whether chess is a win for White or not. 2) IMO problems, such as 2024 problem 1 which the system solved, are often phrased as "Determine all X such that…".

diffeomorphism
1 replies
2h1m

You are attacking a straw man and the point made is pretty good.

Competition problems are designed to be actually solvable by contestants. In particular, the problems should be solvable using a reasonable collection of techniques, and many "prep courses" will teach you techniques, tools and algorithms; a good starting point is to throw that stuff at any given problem. So, just as with chess openings, putting in lots of legwork will give you some good results for that part. You might very well lose in the mid and late game, just as this AI might struggle with "actual problems".

It is of course still very impressive, but that is an important point.

Smaug123
0 replies
1h50m

I'm attacking nobody! I literally couldn't understand the point, so I said so: as stated, its premises are simply clearly false!

Your point, however, is certainly a good one: IMO problems are an extremely narrow subset of the space of mathematical problems, which is itself not necessarily even 50% of the space of the work of a mathematician.

wufufufu
1 replies
2h33m

Kinda? Chess isn't solved. Complex problems can have better solutions discovered in the future.

jeremyjh
0 replies
2h29m

It isn't solved but the evaluation (which side is better, by how much, and which moves are best) of a strong engine is - for all practical purposes - an answer to every chess position you can pose to it. This makes it easy to gauge improvement and benchmark against other systems compared to some other problems.

osti
0 replies
2h30m

But the answer is probably not known by you, in particular.

energy123
0 replies
2h28m

IMO and Chess are the same in the most important respect, you can use Lean or a simulated chess game to create unlimited quality training labels. Any problem of this category should be solved with enough compute and clever architecture/metacognition design. The more intractable problems are where data is hard to find or synthesize.

Davidzheng
0 replies
2h32m

lol

Ericson2314
10 replies
2h28m

The lede is a bit buried: they're using Lean!

This is important for more than Math problems. Making ML models wrestle with proof systems is a good way to avoid bullshit in general.

Hopefully more humans write types in Lean and similar systems as a much better way of writing prompts.

Smaug123
5 replies
2h23m

And while AlphaProof is clearly extremely impressive, it does give the computer an advantage that a human doesn't have in the IMO: nobody's going to be constructing Gröbner bases in their head, but `polyrith` is just eight characters away. I saw AlphaProof used `nlinarith`.

xrisk
1 replies
46m

Can you give some context on how using Lean helps?

In my understanding, proofs are usually hard to transcribe into Lean, which is why nobody _writes_ proofs in Lean.

What is nlinarith?

llwu
0 replies
23m

nlinarith is a proof automation that attempts to finish a proof using the simplex method to find a linear combination of hypotheses and things that have already been proven, as well as some quadratic terms of them.

Docs: https://leanprover-community.github.io/mathlib4_docs/Mathlib...
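
For a feel of what it automates, a standard toy usage (my example, not from AlphaProof's output):

    import Mathlib

    -- Two-variable AM-GM: given the hint 0 ≤ (x - y) ^ 2, nlinarith finds
    -- the linear combination that closes the goal.
    example (x y : ℝ) : x * y ≤ (x ^ 2 + y ^ 2) / 2 := by
      nlinarith [sq_nonneg (x - y)]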

empath75
0 replies
44m

Don't think of Lean as a tool that the AI is using (i.e., cheating); think of Lean plus AlphaProof as a single system. There's no reason to draw artificial boundaries around where the AI is and where the tools that the AI is using are. Lean itself is a traditional symbolic artificial intelligence system.

People always want to knock generative AIs for not being able to reason, and we've had automated systems that reason perfectly well for decades, but for some reason that doesn't count as AI to people.

Ericson2314
0 replies
1h31m

Hehe, well, we'll need to have a tool-assisted international math Olympiad then.

Davidzheng
0 replies
51m

Good. I want my AI to use all the advantages it has to reinvent the landscape of mathematics

Ericson2314
3 replies
2h25m

They're def gonna go after the Riemann hypothesis with this, hehe.

nwoli
2 replies
2h11m

Guessing the context here is that the RH was recently translated into Lean. Would be very cool if they threw their compute on that

Smaug123
1 replies
2h8m

I think you might be thinking of the recent project to start Fermat's Last Theorem? The Riemann hypothesis has been easy to state (given what's in Mathlib) for years.

Davidzheng
0 replies
50m

Yeah lol, I don't think either is hard to formalize in Lean.

petters
9 replies
2h38m

The problems were first converted into a formal language. So they were only partly solved by the AI.

golol
2 replies
2h14m

Formalization is in principle just a translation process and should be a much simpler problem than the actual IMO problem. Besides, they also trained a Gemini model which formalizes natural language problems, and this is how they generated training data for AlphaProof. I would therefore expect that they could also have formalized the IMO problems with that model, and just did it manually because the point is not to demonstrate formalizing but proving capabilities.

riku_iki
0 replies
1h24m

Formalization is in principle just a translation process and should be a much simpler problem than the actual IMO problem

Maybe not, because you need to connect many complicated topics/terms/definitions together, and you don't have a way to reliably verify that a formalized statement is correct.

They built an automatic formalization network in this case, but didn't trust it and formalized the problems manually.

pishpash
0 replies
36m

Yet the facts at hand are the opposite of what you say. A reliable formalizer was a more difficult problem than solving formalized IMO problems, because they have not produced one.

rpois
1 replies
2h23m

Does this formalization process include giving it the answer it should try to prove?

jeremyjh
1 replies
2h31m

Yes and it is difficult for me to believe that there is not useful human analysis and understanding involved in this translation that the AI is helpless without. But that I suppose is a problem that could be tackled with a different model...

adrianN
0 replies
2h19m

Even so, having a human formalize the problems and an AI to find machine checkable proofs could be very useful for mathematicians.

trotro
0 replies
1h7m

But formalization is the easy part for humans. I'm sure every mathematician would be happy if the only thing required to prove a result was to formalize it in Lean and feed it to the AI to find the proof.

clbrmbr
0 replies
2h18m

IIUC, a Gemini-based system could translate the natural language questions into Lean, but in the blog post they don’t really commit to whether this was done just to generate training data or was used in the competition.

balls187
9 replies
2h17m

What was the total energy consumption required to achieve this result (both training and running)?

And how much CO2 was released into earth's atmosphere?

amelius
5 replies
2h10m

There's no energy limit in the IMO rules.

balls187
4 replies
1h58m

The point isn't IMO rules.

It's that we are living in a period of time where there are very real consequences of nearly a century of unchecked CO2 due to human industry.

And AI (like crypto before it) requires considerable energy consumption. Because of this, I believe we (people who believe in AI) need to hold companies accountable by having them very transparently disclose those energy costs.

amelius
3 replies
1h26m

What if at some point AI figures out a solution to climate change?

Sivart13
1 replies
1h4m

Why would we be any more likely to implement it, relative to the solutions that humans have already figured out for climate change?

sensanaty
0 replies
43m

Well we can be confident in the knowledge that techbros might finally take the issue seriously if an AI tells them to!

ks2048
0 replies
19m

I know this is not an uncommon opinion in tech circles, but I believe it's an insane thing to hang humanity's hopes on. There's no reason to think AI will be omnipotent.

ozten
2 replies
2h15m

Compared to all of the humans who compete at this level and their inputs and outputs for the trailing 5 years.

balls187
1 replies
2h5m

And?

The result is (likely) net energy consumption, resulting in (likely) net CO2 emissions.

So, what did it cost us for this achievement in AI?

EDIT TO ADD: It's fair to think that such a presser should not include answers to my questions. But, it's also fair to want that level of transparency given we are dealing with climate change.

regularfry
0 replies
25m

You're not wrong and in general the conclusion is that AI emits less CO2 than a human performing the same task. Whether that's true for this specific task is worth asking, as is the question of how efficient such a process can be made.

golol
8 replies
2h20m

This is the real deal. AlphaGeometry solved a very limited set of problems with a lot of brute-force search. This is a much broader method that I believe will have a great impact on the way we do mathematics. They are really implementing a self-feeding pipeline from natural language mathematics to formalized mathematics, where they can train both formalization and proving. In principle this pipeline can also learn basic theory building, like creating auxiliary definitions and lemmas. I really think this is the holy grail of proof assistance and will allow us to formalize most mathematics that we create very naturally. Humans will work post-rigorously and let the machine assist with filling in the details.

visarga
7 replies
1h26m

a lot of brute force search

Don't dismiss search, it might be brute force but it goes beyond human level in Go and silver at IMO. Search is also what powers evolution which created us, also by a lot of brute forcing, and is at the core of scientific method (re)search.

kypro
3 replies
1h13m

My old AI professor used to say that every problem is a search problem.

The issue is that to find solutions for useful problems you're often searching through highly complex and often infinite solution spaces.

pishpash
1 replies
41m

There's no problem with search. The goal is to search most efficiently.

deely3
0 replies
13m

You mean that by improving search we can solve any problem? What if the solution space is infinite? Even if we make the search algo 10x or 100x more performant, the solution space will still be infinite, no?

visarga
0 replies
1h0m

For some problems, validation is expensive. Like the particle collider or space telescope, or testing the COVID vaccine. It's actually validation that is the bottleneck in search, not ideation.

thomasahle
0 replies
43m

Also AlphaProof had to search for 60 hours for one of the IMO problems it solved.

Eridrus
0 replies
1h16m

Search is great, search works, but there was not a tonne to learn from the AlphaGeometry paper unless you were specifically interested in solving geometry problems.

Davidzheng
0 replies
53m

Yes and there's a lot of search here too. That's a key to the approach

cs702
8 replies
20m

In 1997, machines defeated a World Chess Champion for the first time, using brute-force "dumb search." Critics noted that while "dumb search" worked for chess, it might not necessarily be a general strategy applicable to other cognitive tasks.[a]

In 2016, machines defeated a World Go Champion for the first time, using a clever form of "dumb search" that leverages compute, DNNs, reinforcement learning (RL), and self-play. Critics noted that while "dumb search" worked for Go, it might not necessarily be a general strategy applicable to other cognitive tasks.[a]

In 2024, machines solved insanely hard math problems at the Silver Medal level in an International Math Olympiad for the first time, using a clever form of "dumb search" that leverages compute, DNNs, RL, and a formal language. Perhaps "dumb search" isn't as dumb as the critics would like it to be?

---

[a] http://www.incompleteideas.net/IncIdeas/BitterLesson.html

klyrs
2 replies
14m

A brute force search has perfect knowledge. Calling it "dumb" encourages bad analogies -- it's "dumb" because it doesn't require advanced reasoning. It's also "genius" because it always gets the right answer eventually. It's hugely expensive to run.

And you keep shifting the goalpost on what's called "dumb" here.

adrianN
1 replies
12m

You might have missed the point.

bdjsiqoocwk
0 replies
4m

Tell us what the point was, I think a lot of us are missing it.

TheDudeMan
1 replies
13m

So all forms of search are dumb in your view?

outlore
0 replies
3m

I think OP is arguing the opposite. In other words, critics of AI innovations call them “dumb search”. The goal post keeps moving.

tux3
0 replies
2m

The success in Go was very much not dumb search. Where dumb search had failed to achieve the level of even a very weak player, the neural net's "intuition" about good moves without any search was already very strong. Only the combination was superhuman, and that was anything but the dumb search that had been tried before.

Today's announcement is also not about proving Lean theorems by "dumb search". The success is about search + neural networks.

You're attacking critics for criticizing the solution that has failed, while confusing it for the solution that just worked.

sobellian
0 replies
9m

From whence are you quoting the phrase "dumb search?"

fspeech
0 replies
3m

By "dumb" I assume you meant brute-force. Search, as opposed to extrapolation, is what actually produces suprise or creative results, whether it happens in a person's head or on a computer. The issue is to produce the heuristics that can let one push back the combinatorial explosion.

majikaja
3 replies
2h1m

It would be nice if on the page they included detailed descriptions of the proofs it came up with, more information about the capabilities of the system and insights into the training process...

If the data is synthetic and covers a limited class of problems I would imagine what it's doing mostly reduces to some basic search pattern heuristics which would be of more value to understand than just being told it can solve a few problems in three days.

majikaja
1 replies
1h22m

I found those, I just would have appreciated if the content of the mathematics wasn't sidelined to a separate download as if it's not important. I felt the explanation on the page was shallow, as if they just want people to accept it's a black box.

All I've learnt from this is that they used an unstated amount of computational resources just to basically brute-force what a human is already capable of doing in far less time.

Davidzheng
0 replies
49m

Very few humans can after years of training. Please don't trivialize.

c0l0
3 replies
2h34m

That's great, but does that particular model also know if/when/that it does not know?

ibash
0 replies
2h24m

Yes

AlphaProof is a system that trains itself to prove mathematical statements in the formal language Lean. … Formal languages offer the critical advantage that proofs involving mathematical reasoning can be formally verified for correctness.

foota
0 replies
2h32m

Never?

Edit: To defend my response, the model definitely knows when it hasn't yet found a correct response, but this is categorically different from knowing that it does not know (and of course monkeys with typewriters, etc., can always find a proof eventually if one exists).

diffeomorphism
0 replies
2h17m

While that was probably meant to be rhetorical, the answer surprisingly seems to be an extremely strong "Yes, it does". Exciting times.

highcountess
2 replies
2h22m

Sorry, mathematicians. Your irrelevance approaches. Do you have any other useful skills?

ykonstant
0 replies
1h32m

Programming in Lean?

dpbriggs
0 replies
57m

This is more analogous to programmers working with Copilot. There's an exciting possibility here of mathematicians feeding these systems subproblems to assist in proving larger theorems.

skywhopper
1 replies
1h33m

Except it didn’t. The problem statements were hand-encoded into a formal language by human experts, and even then only one problem was actually solved within the time limit. So, claiming the work was “silver medal” quality is outright fraudulent.

noud
0 replies
1h6m

I had exactly the same feeling when reading this blog. Sure, the techniques used to find the solutions are really interesting. But they claim more than they achieve. The official problem statements are not provided in Lean, and the time limit is 2 x 4.5 hours, not 3 days.

The article claims they have another model that can work without formal languages, and that it looks very promising. But they don't mention how well that model performed. Would that model also perform at silver medal level?

Also note that if the problems are provided in a formal language, you can always find the solution in a finite amount of time (provided the solution exists). You can brute-force over all possible solutions until you find the one that proves the statement. This may take a very long time, but it will find the solutions eventually. You will always solve all the problems and win the IMO at gold medal level. AlphaProof seems to do something similar, but takes smarter decisions about which possible solutions to try and which ones to skip. What would be the reason they don't achieve gold?

robinhouston
1 replies
1h41m

Some more context is provided by Tim Gowers on Twitter [1].

Since I think you need an account to read threads now, here's a transcript:

Google DeepMind have produced a program that in a certain sense has achieved a silver-medal performance at this year's International Mathematical Olympiad.

It did this by solving four of the six problems completely, which got it 28 points out of a possible total of 42. I'm not quite sure, but I think that put it ahead of all but around 60 competitors.

However, that statement needs a bit of qualifying.

The main qualification is that the program needed a lot longer than the human competitors -- for some of the problems over 60 hours -- and of course much faster processing speed than the poor old human brain.

If the human competitors had been allowed that sort of time per problem they would undoubtedly have scored higher.

Nevertheless, (i) this is well beyond what automatic theorem provers could do before, and (ii) these times are likely to come down as efficiency gains are made.

Another qualification is that the problems were manually translated into the proof assistant Lean, and only then did the program get to work. But the essential mathematics was done by the program: just the autoformalization part was done by humans.

As with AlphaGo, the program learnt to do what it did by teaching itself. But for that it needed a big collection of problems to work on. They achieved that in an interesting way: they took a huge database of IMO-type problems and got a large language model to formalize them.

However, LLMs are not able to autoformalize reliably, so they got them to autoformalize each problem many times. Some of the formalizations were correct, but even the incorrect ones were useful as training data, as often they were easier problems.

It's not clear what the implications of this are for mathematical research. Since the method used was very general, there would seem to be no obvious obstacle to adapting it to other mathematical domains, apart perhaps from insufficient data.

So we might be close to having a program that would enable mathematicians to get answers to a wide range of questions, provided those questions weren't too difficult -- the kind of thing one can do in a couple of hours.

That would be massively useful as a research tool, even if it wasn't itself capable of solving open problems.

Are we close to the point where mathematicians are redundant? It's hard to say. I would guess that we're still a breakthrough or two short of that.

It will be interesting to see how the time the program takes scales as the difficulty of the problems it solves increases. If it scales with a similar ratio to that of a human mathematician, then we might have to get worried.

But if the function human time taken --> computer time taken grows a lot faster than linearly, then more AI work will be needed.

The fact that the program takes as long as it does suggests that it hasn't "solved mathematics".

However, what it does is way beyond what a pure brute-force search would be capable of, so there is clearly something interesting going on when it operates. We'll all have to watch this space.

1. https://x.com/wtgowers/status/1816509803407040909?s=46

visarga
0 replies
1h10m

If the human competitors had been allowed that sort of time per problem they would undoubtedly have scored higher.

Or if AlphaProof used more compute they could have slashed that time to 1/10 or less. It's arbitrary as long as we don't define what compute the AI should be entitled to use here.

StefanBatory
1 replies
1h47m

Wow, that's absolutely impressive to hear!

Also it's making me think that in 5-10 years almost all tasks involving computer scientists or mathematicians will be done by AI. Perhaps people going into trades had a point.

visarga
0 replies
1h3m

Everything that allows for cheap validation is going that way. Math, code, or things we can simulate precisely. LLM ideation + Validation is a powerful combination.

golol
0 replies
2h10m

This Wikipedia page makes AM come across as a nonsense project whose outputs no one (besides the author) bothered to decipher.

refulgentis
0 replies
1h17m

Goalposts at the moon, FUD at "but what if it's obviously fake?".

Real, exact quotes from the top comments at 1 PM EST.

"I want to believe that the computer found it, but I can't find anything to confirm."

"Curing cancer will require new ideas"

"Maybe they used 10% of all of GCP [Google compute]"

quirino
0 replies
1h47m

I honestly expected the IOI (International Olympiad in Informatics) to be "beaten" much earlier than the IMO. There's AlphaCode, of course, but as of the latest update I don't think it was quite at "silver medal" level. And available LLMs are probably not even at "honourable mention" level.

I wonder if some class of problems will emerge that human competitors are able to solve but are particularly tricky for machines. And which characteristics these problems will have (e.g. they'll require some sort of intuition or visualization that is not easily formalized).

Given how much of a dent LLMs are already making on beginner competitions (AtCoder recently banned using them on ABC rounds [1]), I can't help but think that soon these competitions will be very different.

[1] https://info.atcoder.jp/entry/llm-abc-rules-en

pmcf
0 replies
48m

I read this as “In My Opinion” and really thought this about AI dealing with opinionated people. Nope. HN is still safe. For now…

osti
0 replies
2h31m

So they weren't able to solve the combinatorics problem. I'm not super well versed in competition math, but combinatorics always seem to be the most interesting problems to me.

mupuff1234
0 replies
2h28m

Can it / did it solve problems that weren't solved yet?

michael_nielsen
0 replies
32m

A good brief overview here from Tim Gowers (a Fields Medallist, who participated in the effort), explaining and contextualizing some of the main caveats: https://x.com/wtgowers/status/1816509803407040909

lo_fye
0 replies
40m

Remember when people thought computers would never be able to beat a human Grand Master at chess? Ohhh, pre-2000 life, how I miss thee.

ksoajl
0 replies
11m

It's fascinating to see AI reaching the level of a silver medalist in the International Math Olympiad! This milestone highlights the progress in AI's ability to handle complex reasoning tasks. However, it's worth pondering how far we are from an AI achieving gold and what this means for future human-AI collaboration in mathematics. Can AI eventually surpass human ingenuity in problem-solving, or will it always be a complementary tool?

iamronaldo
0 replies
2h47m

No benchmarks of any kind?

gallerdude
0 replies
1h10m

Sometimes I wonder if in 100 years, it's going to be surprising to people that computers had a use before AI...

fancyfredbot
0 replies
1h56m

I'm seriously jealous of the people getting paid to work on this. Sounds great fun and must be incredibly satisfying to move the state of the art forward like that.

dmitrygr
0 replies
17m

First, the problems were manually translated into formal mathematical language

That is more than half the work of solving them. The headline should read "AI solves the simple part of each IMO problem at silver medal level".

brap
0 replies
1h10m

Are all of these specialized models available for use? Like, does it have an API?

I wonder because on one hand they seem very impressive and groundbreaking, on the other it’s hard to imagine why more than a handful of researchers would use them

Jun8
0 replies
1h6m

Tangentially: I found it fascinating to follow along with the solution to Problem 6: https://youtu.be/7h3gJfWnDoc ("aquaesulian" is a nod to the ancient name of Bath, Aquae Sulis). There's no advanced math and each step is quite simple, I'd guess at a middling 8th-grader level.

Note that the 6th question is generally the hardest (“final boss”) and many top performers couldn’t solve it.

I don’t know what Lean is or how see AI’s proofs but an AI system that can explain such a question on par with the YouTuber above would be fantastic!

HL33tibCe7
0 replies
26m

This is kind of an ideal use-case for AI, because we can say with absolute certainty whether their solution is correct, completely eliminating the problem of hallucination.

Davidzheng
0 replies
2h32m

HOLY SHIT. It's amazing

AyyEye
0 replies
38m

Parlor tricks. Wake me up when AI can reliably identify which number is circled at the level of my two-year-old.