First, the problems were manually translated into formal mathematical language for our systems to understand. In the official competition, students submit answers in two sessions of 4.5 hours each. Our systems solved one problem within minutes and took up to three days to solve the others.
Three days is interesting... Not technically silver-medal performance, I guess, but let's be real, I'd be okay waiting a month for the cure for cancer.
It feels pretty disingenuous to claim silver-medal status when your machine played by significantly different rules. The article is light on details, but it says they wired it up to a theorem prover, presumably with feedback sent back to the AI model for re-evaluation.
How many cycles of guess-and-check did it take over the course of three days to get the right answer?
If the IMO contestants were allowed to use theorem provers and were given 3 days (even factoring in sleep), would AlphaProof still have gotten silver?
I don't think these results suggest that we're on the brink of knowledge coming at a substantially faster rate than before. Humans have been using theorem provers to advance our understanding for decades. Now an LLM has been wired up to one too, but it still took 8x as long to solve the problems as our best humans did without any computer assistance.
I believe you are misreading this.
First of all, this is not a sport and the point is not to compare AI to humans. The point is to compare AI to IMO-difficulty problems.
Secondly, this is not some hacky trick where brute force and some theorem prover magic are massaged to solve a select few problems and then you'll never hear about it again. They are building a general pipeline which turns informal natural language mathematics (of which we have ungodly amounts available) into formalized mathematics, and in addition trains a model to prove such kinds of mathematics. This can also work for theory building. This can become a real mathematical assistant that can help a mathematician test an argument, play with variations of a definition, try 100 combinations of some estimates, apply a classic but lengthy technique etc. etc.
If this were the case, then the headline would be "AI solves 4/6 IMO 2024 problems"; it wouldn't be claiming "silver-medal standard". Medals are generally awarded by comparison to other contestants, not to the challenges overcome.
This is great, and I'm not complaining about what the team is working on, I'm complaining about how it's being sold. Headlines like these from lab press releases will feed the AI hype in counterproductive ways. The NYT literally has a headline right now: "Move Over Mathematicians, Here Comes AlphaProof".
At the IMO, "silver medal" afaik is defined as some range of points, which more or less equals some range of problems solved. For me it is fair to say that "silver-medal performance" is IMO language for about 4/6 problems solved. And what's the problem if some clickbait websites totally spin the result? They would've done it anyway even with a different title, and I also don't see the harm. Let people be wrong.
No, "silver medal" is defined as a range of points to be earned in the allotted time (4.5 hours for both papers of 3 problems each).
And the cutoffs are chosen after the results are in, not in advance.
but this part currently sucks, because they didn't trust it and formalized the problems manually.
Yeah, that's fair, but I don't think it will keep sucking forever, as formalization is in principle just a translation process.
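To make "just a translation process" concrete, here's a toy illustration (my own example, nothing to do with the IMO problems or the announcement) of what the informal-to-formal step has to produce. The informal sentence "the sum of two even natural numbers is even" becomes, in Lean 4 with Mathlib:

    import Mathlib

    -- Informal statement: "the sum of two even natural numbers is even."
    -- In Mathlib, `Even n` unfolds to `∃ r, n = r + r`.
    theorem even_add_even (a b : ℕ) (ha : Even a) (hb : Even b) :
        Even (a + b) := by
      obtain ⟨x, hx⟩ := ha    -- hx : a = x + x
      obtain ⟨y, hy⟩ := hb    -- hy : b = y + y
      exact ⟨x + y, by omega⟩ -- a + b = (x + y) + (x + y), by linear arithmetic

The hard part isn't writing one of these by hand; it's doing the translation reliably at scale for statements far messier than this one, which is presumably why they punted and formalized the contest problems manually this time.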
I am so exhausted of the AI hype nonsense. LLMs are not fucking curing cancer. Not now, not in five years, not in a hundred years. That's not what they do.
LLM/ML is fascinating tech that has a lot of legitimate applications, but it is not fucking intelligent, artificial or otherwise, and I am sick to death of people treating it like it is.
A significant part of the problem is that a majority of people are unaware of just how simple what they consider "intelligence" really is. You don't need actual intelligence to replace the public-facing social role of a politician, or a talking head, or a reactive-only middle manager. You just need words strung together that fit a problem.
What observation, if you saw it, do you think would falsify that hypothesis?
It is not artificial? So it is natural, then?
I'm not sure it matters that it had access to a theorem prover. The fact that it's possible to build a black box that solves hard proofs on its own at all is the fascinating bit.
Give it a year and that ratio will be reversed. At least. But also it matters less how long it takes if doubling the number of things reasoning at a best-human level is pronounced "ctrl-c, ctrl-v".
"three days" does not say anything about how much computational power is used to solve problems, maybe they have used 10% of all GCP :)
The thing is, though, once we have a benchmark that we pass, it's pretty typical to be able to bring down the time required in short order through performance improvements and iterating on ideas. So if you knew you had AGI but it took 100% of all GCP for 3 years to give a result, within the next 5 years that would come down significantly (not least because you'd build HW dedicated to accelerating the slow parts).
That's patently false for many classes of problems. We know exactly how to solve the traveling salesman problem, and have for decades, but we're nowhere close to solving a random 1,000-city case exactly (note: there are approximate methods that can find good, but not optimal, results on millions of cities). Edit: I should say a 1,000,000-city problem, as there are some exact solutions for 30-60k-city instances from the 2000s.
And there are good reasons to believe that theorem finding and proof generation are at least NP-hard problems.
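To put rough numbers on the TSP point: the classic exact algorithm (Held-Karp dynamic programming) runs in O(n^2 * 2^n), which is why "we know exactly how to solve it" and "we can solve 1,000 cities exactly" are very different claims. Modern exact solvers like Concorde do far better with branch-and-cut (hence those tens-of-thousands-of-cities results), but the blow-up is the point. A toy sketch, not production code:

    import itertools
    import math
    import random

    def held_karp(dist):
        """Exact TSP tour length via Held-Karp DP: O(n^2 * 2^n) time, O(n * 2^n) memory."""
        n = len(dist)
        # dp[(mask, j)]: cheapest path starting at city 0, visiting exactly the
        # cities in bitmask `mask`, and ending at city j.
        dp = {(1 | (1 << j), j): dist[0][j] for j in range(1, n)}
        for size in range(3, n + 1):
            for subset in itertools.combinations(range(1, n), size - 1):
                mask = 1 | sum(1 << j for j in subset)
                for j in subset:
                    prev = mask ^ (1 << j)
                    dp[(mask, j)] = min(dp[(prev, k)] + dist[k][j]
                                        for k in subset if k != j)
        full = (1 << n) - 1
        return min(dp[(full, j)] + dist[j][0] for j in range(1, n))

    random.seed(0)
    n = 15  # already millions of DP transitions; at n = 1000 the 2^n term is astronomical
    pts = [(random.random(), random.random()) for _ in range(n)]
    dist = [[math.dist(a, b) for b in pts] for a in pts]
    print(round(held_karp(dist), 4))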
We're not talking about mathematical optimality here, either of the solution found or of the time taken. The point is whether this finds results more cheaply than a human can, and right now it's better on some problems and worse on others. Clearly, if a human can do it, there is a cheaper way to get it done, and it would be flawed reasoning to think the current runtime is already asymptotically optimal.
While I agree that not all problems show this kind of acceleration in performance, that's typically only true if you've already spent so much time trying to solve the problem that you've asymptoted to the optimal solution. Right now we're nowhere near the asymptote for AI improvements. Additionally, there are so many research dollars flowing into AI precisely because the potential upside is nowhere near realized and there are lots of research lines still left to be explored. Geoffrey Hinton ended the AI winter.
If you needed to solve 1,000 problems in 3 days, you wouldn't find enough humans who could do it. It can't be cheaper if it isn't possible at all.
The person said "typically", not "always". Just because there are obviously cases where it didn't happen doesn't mean it isn't still typically the case.
And say they did use 10% of all GCP? Would it be less impressive? This is a result that was considered by experts to be far beyond the state of the art; it's absolutely ok if it's not very efficient yet.
Also, for what it's worth, I'm pretty sure that I wouldn't have been able to solve it myself in three days, even if I had access to all of GCP, Azure and AWS (except if I could mine crypto to then pay actual IMO-level mathematicians to solve it for me).
Yes, it is very impressive, especially the autoformalization of problems written in natural language, and also the proof search for theorems.
I haven't read TFA as I'm at work, but I would be very interested to know what the system was doing in those three days. Were there failed branches it explored? Was it just fumbling its way around until it guessed correctly? What did the feedback loop look like?
As opposed to the 99.9999% of the human population who can't do it even if their lives depended on it?
I was going to come here to say that. I remember being a teenager and giving up in frustration at IMO problems. And I was competing at IPhO.
This is NOT the paper, but probably a very similar solution: https://arxiv.org/abs/2009.03393
They just write "it's like AlphaZero". So presumably they used a version of MCTS where each terminal node is scored by Lean as either correct or incorrect.
Then they can train a network to evaluate intermediate positions (score network) and one to suggest things to try next (policy network).
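For anyone who hasn't seen the AlphaZero loop before, here's a toy sketch of the shape of that search. To be clear, nothing below is from DeepMind's system: the "theorem" is just reaching a target number with a few arithmetic moves, the policy and score/value networks are placeholder heuristics, and the verifier plays the role Lean plays for real proofs (a hard accept/reject at terminal nodes).

    import math

    TACTICS = ["+1", "*2", "-3"]          # stand-ins for proof tactics
    START, GOAL, MAX_DEPTH = 1, 37, 12    # toy "theorem": reach 37 from 1

    def apply_tactic(state, tactic):
        value, depth = state
        new_value = {"+1": value + 1, "*2": value * 2, "-3": value - 3}[tactic]
        return (new_value, depth + 1)

    def verifier(state):
        """Hard 0/1 judgment; this is the role Lean plays for real proofs."""
        return 1.0 if state[0] == GOAL else 0.0

    def policy_net(state):
        """Placeholder policy network: uniform priors over tactics."""
        return {t: 1.0 / len(TACTICS) for t in TACTICS}

    def value_net(state):
        """Placeholder score network: crude distance-to-goal guess in (0, 1]."""
        return 1.0 / (1.0 + abs(GOAL - state[0]))

    class Node:
        def __init__(self, state):
            self.state, self.children = state, {}
            self.visits, self.total_value = 0, 0.0

    def select_child(node, priors, c_puct=1.5):
        """PUCT rule: exploit high average value, explore high-prior, low-visit moves."""
        def score(item):
            tactic, child = item
            q = child.total_value / child.visits if child.visits else 0.0
            u = c_puct * priors[tactic] * math.sqrt(node.visits + 1) / (1 + child.visits)
            return q + u
        return max(node.children.items(), key=score)[1]

    def mcts(root, simulations=2000):
        for _ in range(simulations):
            node, path = root, [root]
            while node.children:                          # selection
                node = select_child(node, policy_net(node.state))
                path.append(node)
            if verifier(node.state) == 1.0 or node.state[1] >= MAX_DEPTH:
                reward = verifier(node.state)             # terminal: verifier decides
            else:                                         # expansion + evaluation
                for t in TACTICS:
                    node.children[t] = Node(apply_tactic(node.state, t))
                reward = value_net(node.state)
            for n in path:                                # backpropagation
                n.visits += 1
                n.total_value += reward
        return root

    root = mcts(Node((START, 0)))
    best = max(root.children.items(), key=lambda kv: kv[1].visits)[0]
    print("most promising first tactic:", best)

Swap the placeholders for learned networks and Lean, and the overall shape (search guided by a policy prior, value estimates at interior nodes, a hard verifier at the leaves) is the AlphaZero pattern they seem to be alluding to.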
I can't find a link to an actual paper, that just seems to be a blog post. But from what I gather the problems were manually translated to Lean 4, and then the program is doing some kind of tree search. I'm assuming they are leveraging the proof checker to provide feedback to the model.
The training loop was also applied during the contest, reinforcing proofs of self-generated variations of the contest problems until a full solution could be found.
So they had three days to keep training the model, on synthetic variations of each IMO problem.
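In other words, something shaped like expert iteration, run against each problem during the contest. A hedged sketch of that loop below; every name in it is a placeholder with toy behavior, not an actual DeepMind component:

    import random

    random.seed(0)

    def generate_variations(problem, n=8):
        # Placeholder: produce related/simpler statements around the target problem.
        return [f"{problem} (variant {i})" for i in range(n)]

    def search_for_proof(statement, model_strength):
        # Placeholder: pretend a stronger model finds checkable proofs more often.
        return f"proof of {statement}" if random.random() < model_strength else None

    def lean_accepts(proof):
        # Placeholder for the Lean check: only verified proofs become training data.
        return proof is not None

    def finetune(model_strength, num_verified):
        # Placeholder: each verified proof nudges the model a little.
        return min(1.0, model_strength + 0.01 * num_verified)

    model_strength = 0.05
    for _ in range(200):                       # "three days" of rounds, compressed
        variants = generate_variations("IMO 2024 Problem 6")
        proofs = [search_for_proof(v, model_strength) for v in variants]
        verified = [p for p in proofs if lean_accepts(p)]
        model_strength = finetune(model_strength, len(verified))
    print(f"final (toy) model strength: {model_strength:.2f}")

The part that keeps this from being pure guess-and-check is the filter in the middle: only proofs the checker accepts feed back into training, so the model is reinforced exclusively on verified reasoning.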
Don't confuse interpolation with extrapolation. Curing cancer will require new ideas. The IMO requires proficiency at tasks where the methods of solution are known.
The methods are known, but the solutions to the IMO problems weren't. So the AI did extrapolate a solution.
Also, there's no reason to assert that an eventual cure for cancer requires fundamentally new methods. Maybe the current methods are sufficient; it's just that nobody has been "smart" enough to put the pieces together. (disclaimer: not an expert at all)
I think you are correct though. We don't need new physics to cure cancer. But we may need information-handling, reasoning and simulation systems which are orders of magnitude bigger and more complex than anything we have this year. We also need to stop pussy-footing and diddling with ideologies and start working on the root cause of cancer and almost every other disease, which is aging.
Search is extrapolation. Learning is interpolation. Search + learn is the formula used by AZ. Don't forget AZ taught us humans a thing or two about a game we had a 2,000-year head start on, starting from scratch rather than from human supervision.
They are the same thing.
Mathematicians spend most of their time interpolating between known ideas and it would be extremely helpful to have computer assistance with that.
Or the simultaneous discovery of thousands of cryptographic exploits...
Still waiting for the first one. I'm not holding my breath - just like fuzzing found a lot of vulnerabilities in low-level software, I expect novel automated analysis approaches will yield some vulnerabilities - but that won't be a catastrophic event just like fuzzing wasn't.
Hope that's true. Really mucks up the world a bit if not.
I hope it doesn't find a new class of bug. Finding another thing like Spectre could be problematic.
EDIT - I hope that if that new class of bug exists, it is found. I hope that new class of bug doesn't exist.
It's rumored that the NSA has 600 mathematicians working for them. If they are the ones finding the exploits you will probably never hear about them until they are independently discovered by someone who can publish.
The problem solved "within minutes" is also interesting. I'd interpret that as somewhere between 2 and 59 minutes; given the vagueness, probably on the higher end, otherwise they'd celebrate it more. The students had 6 tasks in 9 hours, so 1.5h per task on average. If you add the time a student would need to (correctly!) translate the problems into the system's input format, its best-case runtime is probably about what a silver medalist would take to solve the problem on their own.
But even if they aren't as fast as humans yet, this is very valuable: both as a stepping stone, and because past a certain point compute is much easier to scale than skilled mathematicians.
They say "our systems" (presumably meaning AlphaProof and AlphaGeometry 2) solved one problem "within minutes", and later on the page they say that the geometry question (#4) was solved by AlphaGeometry in 19 seconds.
So either (1) "within minutes" was underselling the abilities of the system, or (2) what they actually meant was that the geometry problem was solved in 19 seconds, one of the others "within minutes" (I'd guess #1 which is definitely easier than the other two they solved), and the others in unspecified times of which the longer was ~3 days.
I'd guess it's the first of those.
(Euclidean geometry has been a kinda-solved domain for some time; it's not super-surprising that they were able to solve that problem quickly.)
As for the long solve times, I would guess they're related to this fascinating remark:
Euclidean geometry still requires constructions to solve, and those are based on intuition.