AlphaGeometry: An Olympiad-level AI system for geometry

nybsjytm
23 replies
22h56m

If I read their paper right, this is legit work (much more legit than DeepMind's AI math paper last month falsely advertised as solving an open math research problem) but it's still pretty striking how far away the structure of it is from the usual idea of automated reasoning/intelligence.

A transformer is trained on millions of elementary geometry theorems and used as brute search for a proof, which because of the elementary geometry context has both a necessarily elementary structure and can be easily symbolically judged as true or false. When the brute search fails, an extra geometric construction is randomly added (like adding a midpoint of a certain line) to see if brute search using that extra raw material might work. [edit: as corrected by Imnimo, I got this backwards - the brute search is just pure brute search, the transformer is used to predict which extra geometric construction to add]

Also (not mentioned in the blog post) the actual problem statements had to be modified/adapted, e.g. the actual problem statement "Let AH1, BH2 and CH3 be the altitudes of a triangle ABC. The incircle W of triangle ABC touches the sides BC, CA and AB at T1, T2 and T3, respectively. Consider the symmetric images of the lines H1H2, H2H3, and H3H1 with respect to the lines T1T2, T2T3, and T3T1. Prove that these images form a triangle whose vertices lie on W." had to be changed to "Let ABC be a triangle. Define point I such that AI is the bisector of angle BAC and CI is the bisector of angle ACB. Define point T1 as the foot of I on line BC. Define T2 as the foot of I on line AC. Define point T3 as the foot of I on line AB. Define point H1 as the foot of A on line BC. Define point H2 as the foot of B on line AC. Define point H3 as the foot of C on line AB. Define point X1 as the intersection of circles (T1,H1) and (T2,H1). Define point X2 as the intersection of circles (T1,H2) and (T2,H2). Define point Y2 as the intersection of circles (T2,H2) and (T3,H2). Define point Y3 as the intersection of circles (T2,H3) and (T3,H3). Define point Z as the intersection of lines X1X2 and Y2Y3. Prove that T1I=IZ."

killerstorm
17 replies
22h3m

it's still pretty striking how far away the structure of it is from the usual idea of automated reasoning/intelligence.

How so? Reasoning is fundamentally a search problem.

The process you described is exactly the process humans use: i.e. make a guess about what's useful, try to work out the details mechanically, and if you get stuck, make another guess, etc. So the process is like searching through a tree.

People figured out this process back in 1955 (and made a working prototype which can prove theorems): https://en.wikipedia.org/wiki/Logic_Theorist but it all hinges on using good 'heuristics'. Neural networks are relevant here as they can extract heuristics from data.
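To make the tree-search framing concrete, here is a toy sketch (entirely my own illustration; entails, ranked_guesses and apply_guess are hypothetical stand-ins, with the neural network's job being the ordering of the guesses):

    def prove(state, goal, depth=0, max_depth=8):
        if entails(state, goal):          # the mechanical part: details worked out
            return True
        if depth == max_depth:
            return False
        for guess in ranked_guesses(state, goal):  # a learned heuristic orders the guesses
            if prove(apply_guess(state, guess), goal, depth + 1, max_depth):
                return True
        return False                      # stuck: backtrack and guess again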

What do you think is "the usual idea of automated reasoning"? Some magic device which can solve any problem using a single linear pass?

amelius
8 replies
20h14m

Well, what's different is that humans invent new abstractions along the way such as complex numbers and Fourier transforms.

MrPatan
6 replies
20h11m

Not a lot of humans do

amelius
2 replies
20h6m

When a human has done the same thing many times they tend to try to generalize and take shortcuts. And make tools. Perhaps I missed something but I haven't seen a neural net do that.

mitthrowaway2
1 replies
17h38m

Is that very different from the distillation and amplification process that happens during training, where the neural net learns to predict in one step what initially required several steps of iterated execution?

WalterSear
0 replies
17h5m

IMHO, yes. It's not an (internal) invention occurring as a result of insight into the process - it's an (external) human training one model on the output of another model.

skepticATX
1 replies
17h40m

Every single human has abstractions that are unique to them. Your world model isn’t the same as mine.

It’s just that usually these abstractions are fuzzy and hard to formalize, so they aren’t shared. It doesn’t mean that they don’t exist.

sdenton4
0 replies
2h47m

This is laughable... Neural networks also have basically unknowable abstractions encoded in their weights. There was some work not long ago which taught an ANN to do modular arithmetic, and found that it was doing Fourier transforms with the learned weights....

littlestymaar
0 replies
19h52m

True, but we're talking about Olympiad level math skills here.

gowld
0 replies
20h12m

A neural network is nothing but a heap of new abstractions from data.

jhrmnn
5 replies
20h23m

I think the intuition here is that an “intelligent” system would lean much more heavily on the heuristics than on the search, like humans. It’s hard to quantify in this case, but certainly in the case of chess, when engines were about as good as the best human players, they were doing orders of magnitude more search than humans. Which made them feel more like chess _engines_ than chess AI. AlphaZero certainly made a step in the right direction, but it’s still far from how humans play.

killerstorm
4 replies
20h2m

The comment above mentions "brute force" incorrectly. It's fundamentally impossible to brute force googol combinations...

tsimionescu
2 replies
11h10m

The key insight in this whole thread is that AlphaGeometry only works because the search space is not a googol of combinations. So, it doesn't really generalize to many other fields of math. We shouldn't expect an AlphaCategoryTheory or AlphaNumberTheory anytime soon.

killerstorm
1 replies
6h42m

"Brute force" = going through all combinations indiscriminately.

Using heuristics, beam search, etc, = "not brute force".

Calling something smart "brute force" is wrong.

We shouldn't expect an AlphaCategoryTheory or AlphaNumberTheory anytime soon.

There's already a number of papers demonstrating use of LLMs for math in general:

"Autoformalization with Large Language Models" https://arxiv.org/abs/2205.12615

"Large language models can write informal proofs, translate them into formal ones, and achieve SoTA performance in proving competition-level maths problems!" https://twitter.com/AlbertQJiang/status/1584877475502301184

"Magnushammer: A Transformer-based Approach to Premise Selection" https://twitter.com/arankomatsuzaki/status/16336564683345879...

I find it rather sad that the reaction to amazing research is "but it's just..." and "it might not work on ...". Like, maybe appreciate it a bit and consider the possibility that something similar to it might work?

tsimionescu
0 replies
2h21m

AlphaGeometry works, in simplified terms, in two stages:

1. Heuristically add a candidate construction using an NN (a transformer)

2. Brute force search through all possible deductions from the current set of constructions using a symbolic solver

If you don't find a solution after step 2, repeat. There may be some backtracking involved to try a different set of constructions as well.

This approach only works because, in geometry, the set of possible deductions from a given construction is actually quite small.

Also, note that this approach overall is essentially an optimization, not amazing new capabilities. Replacing step 1 with a random construction still solves 10 problems on average in the given time, compared to 30 with the new approach. The existing algorithm, relying mostly on brute force search, is probably able to solve all of the geometry problems if given, say, 10 times as much time as the olympiad students (so not some absurd amount of time).
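In pseudocode the loop looks roughly like this (my paraphrase; deductive_closure and suggest_constructions are hypothetical names, not AlphaGeometry's actual API):

    def solve(premises, goal, max_rounds=16):
        state = deductive_closure(set(premises))    # try pure symbolic deduction first
        for _ in range(max_rounds):
            if goal in state:
                return True                          # proof found
            # step 1: the LM proposes an auxiliary construction
            state |= suggest_constructions(state, goal)
            # step 2: brute-force all new deductions with the symbolic engine
            state = deductive_closure(state)
        return goal in state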

nybsjytm
0 replies
16h57m

The search space here is not so large. Read this paper, a key part of the present work: https://doi.org/10.1023/A:1006171315513

somewhereoutth
1 replies
20h29m

Some magic device which can solve any problem using a single linear pass?

You mean like a calculator?

gowld
0 replies
20h11m

A calculator that can solve any problem human could, and more? Sure.

NooneAtAll3
1 replies
13h10m

How did you edit your comment after it became visible to others?

tsimionescu
0 replies
11h24m

You can edit a comment for about an hour if I remember correctly, regardless of it being visible. You can't delete it after someone responds to it though.

Imnimo
1 replies
22h39m

A transformer is trained on millions of elementary geometry theorems and used as brute search for a proof, which because of the elementary geometry context has both a necessarily elementary structure and can be easily symbolically judged as true or false. When the brute search fails, an extra geometric construction is randomly added (like adding a midpoint of a certain line) to see if brute search using that extra raw material might work.

I don't think this is quite right. The brute force search is performed by a symbolic solver, not the transformer. When it runs out of new deductions, the transformer is asked to suggest possible extra constructions (not randomly added).

nybsjytm
0 replies
22h32m

Thanks for the correction, it looks like you're right.

alexey-salmin
0 replies
19h5m

When the brute search fails, an extra geometric construction is randomly added (like adding a midpoint of a certain line) to see if brute search using that extra raw material might work

That's exactly how I was taught geometry in school, and I hated it with all my guts. Only after making it into the math department of the university did I learn to do it properly and to enjoy it.

treprinum
18 replies
22h58m

What happens to science when the most talented kids won't be able to compete in ~2 years tops? Would our civilization reach plateau or start downward trajectory as there will be no incentive to torture oneself to become the best in the world? Will it all fade away like chess once computers started beating grandmasters?

lacker
3 replies
20h49m

The incentive to compete in the IMO is that it's fun to do math contests, it's fun to win math contests, and (if you think that far ahead as a high schooler) it looks good on your resume. None of that incentive will go away if the computers get better at math contests.

treprinum
1 replies
20h45m

I'd be super demoralized if anything I could do a future pocket machine could do much better and faster. Like my very best is not enough to even tread water.

gowld
0 replies
15h9m

That applies to most things humans do for fun.

No one needs these contest math problems solved -- by requirement, they are solved before any student attempts them.

jhbadger
0 replies
19h43m

People still compete in foot races even though cars can go faster. People still play chess and Go even though computers can beat them.

jeanloolz
3 replies
22h39m

Chess never faded away. It actually has never been as big as it is today.

treprinum
2 replies
20h43m

That might be true in the number of active chess players, but it's no longer viewed as a peak intellectual game and the magic is long gone; it's just another intense hobby some people pursue, but one that basic machines can do better.

westcoast49
0 replies
20h15m

It’s still viewed as a peak intellectual game. And the magic is still there; just watch or listen to the fans of Magnus Carlsen when they’re viewing, discussing or analyzing his games.

mewpmewp2
0 replies
12h31m

Maybe not the peak intellectual game, because there are overall so many other games, digital ones included, but it is unaffected by the fact that computers have beaten humans.

ironlake
2 replies
22h56m

Chess is immensely more popular since Deep Blue beat Kasparov.

treprinum
0 replies
18h41m

Chess was the drosophila of AI - something one could study in detail and invent newer and newer approaches to solving. It no longer has that function; it was surpassed by Go for a brief moment until that one got solved as well. A whole generation that was raised on this drosophila is slowly fading away. For the new entrants it's no longer the game to beat; it's more like a brain-stimulating exercise/hobby, not something capturing the imagination or telling us something about the very base of our intelligence anymore.

jstummbillig
0 replies
22h45m

True, but the fun is currently derived from humans going up against humans as a sport. Machines are tools (for learning or cheating) but we are not interested in competition with them.

How will it work for math, where humans do useful work right now? I cannot see battle math becoming a big thing, but maybe when some of the tedious stuff goes away, math will become philosophy, and poking the smarter system is where the entertainment will be?

Agingcoder
1 replies
22h56m

I never studied math to become the best in the world !

treprinum
0 replies
20h49m

Are you an Olympiad Gold Medalist? Did you move the field of math significantly (beyond a basic PhD)?

sdenton4
0 replies
20h30m

We just need a Netflix series about a geometry prodigy with excellent fashion sense.

imranq
0 replies
17h52m

Unless you can figure out AI that is available everywhere, all the time (even in the most remote jungle with no power), there will always be value in making humans smarter

hiAndrewQuinn
0 replies
19h44m

I've accepted that the human form, with its ~3 pounds of not especially optimized brainpower, is basically going to be relegated to the same status as demoscene hardware for anything that matters after this century.

That's cool by me, though. This bit of demoscene hardware experiences qualia, and that combination is weird and cool enough to make me want to push myself in new and weird directions. That's what play is in a way.

WalterSear
0 replies
16h57m

The same thing that happened when they allowed calculators in the classroom.

EvgeniyZh
0 replies
22h51m

I don't think chess faded away

empath-nirvana
11 replies
1d

“What the I.M.O. is testing is very different from what creative mathematics looks like for the vast majority of mathematicians,” he said.

Not to pick on this guy, but this is ridiculous goal post shifting. It's just astounding what people will hand-wave away as not requiring intelligence.

mb7733
6 replies
1d

No, that sort of thing has been said about math competitions for a long time. It's not a new argument put forward as something against AI.

An analogy with software is that math competitions are like (very hard) leetcode.

There was an article posted on HN recently that is related: https://benexdict.io/p/math-team

esafak
5 replies
1d

What is the argument; that math competitions are easy for computers but hard for humans?

nitwit005
2 replies
1d

It's a quote from the article. The argument is naturally there.

esafak
1 replies
1d

I was referring to your leetcode analogy; those too are hard for humans.

nyssos
0 replies
14h31m

No, the analogue of competition math here is writing programs to solve leetcode problems: they emphasize quickly and reliably applying known tools, not developing new ones.

silveraxe93
0 replies
23h21m

The usual argument is:

We test developer skill by giving them leetcode problems, but leetcode, while requiring programming skill, is nothing like a real programmer's job.

mb7733
0 replies
19h37m

Like I said, it's got nothing to do with computers or AI. This point of view predates any kind of AI that would be capable of doing either.

The analogy is as follows: just as with leetcode & job interviews, to excel at math competitions one must grind problems and learn tricks in order to quickly solve problems in a high-pressure environment. And, just like how solving leetcode problems is pretty different from what a typical computer scientist or software engineer does, doing math competitions is pretty different from what a typical mathematician does.

nybsjytm
0 replies
22h40m

As said by others, that is an absolutely commonplace saying, albeit usually having nothing whatsoever to do with AI. See for instance https://terrytao.wordpress.com/career-advice/advice-on-mathe...

Also if you read about the structure of AlphaGeometry, I think it's very hard to maintain that it "requires intelligence." As AI stuff goes, it's pretty comprehensible and plain.

drawnwren
0 replies
1d

Yeah, this would be akin to saying "What leetcode is testing is very different from what professional programming looks like".

It isn't untrue, but both are considered metrics by the community. (Whether rightfully or not seems irrelevant to ML).

adverbly
0 replies
1d

I guess it kind of depends on what a goal post should represent. If it represents an incremental goal then obviously you are correct.

I think the ultimate goal though is clearly that AI systems will be able to generate new high-value knowledge and proofs. I think realistically the final goal post has always been at that point.

GPerson
0 replies
1d

I’m not sure if it’s goal post shifting or not, but it is a true statement.

qwertox
10 replies
22h43m

Google DeepMind keeps publishing these things while having missed the AI-for-the-people train, to the point that I'm getting more and more the feeling that they have such a huge arsenal of AI tech piling up in their secret rooms that they're already simulating entire AI-based societies.

As if they're stuck in a loop of "let's improve this a bit more until it is even better", while AGI is already a solved problem for them.

Or they're just an uncoordinated, chaotic bunch of AI teams without a leader who unifies them. By leader I don't mean Demis Hassabis, but rather someone like Sundar Pichai.

nybsjytm
3 replies
22h31m

This is at almost the polar opposite end of the spectrum from "AGI," it's centered on brute search.

margorczynski
1 replies
18h8m

Brute-force searching all possible mathematical constructs, theorems, etc. to see which one fits the problem would probably take you practically an infinite amount of time.

Tbh, the way I see it, this works very much like a human does: via "instinct" it gathers relevant knowledge based on the problem and then "brute searches" some combinations to see which one holds. But this "intuition" is the crucial part, where brute search completely fails and you need very aggressive compression of the knowledge space.

nybsjytm
0 replies
16h55m

Brute-force searching all possible mathematical constructs, theorems, etc. to see which one fits the problem would probably take you practically an infinite amount of time.

That's not the kind of search which is being done. Read this paper: https://doi.org/10.1023/A:1006171315513

danielmarkbruce
0 replies
21h32m

They are not polar opposites.

hacketthfk
1 replies
17h40m

They are optimizing for Nature papers, not general usability.

Not even one thing they did was without an associated Nature paper, and the paper was always first.

extractionmech
0 replies
14h10m

Maybe they have a secret client. I mean someone must be doing this for our side. If not them then who?

senseiV
0 replies
19h53m

simulating entire AI-based societies.

Didn't they already have scaled-down simulations of this?

ks2048
0 replies
17h6m

I don’t get the criticism. It seems like basic research and the kind of thing that would lead the way to “AGI” (combining LLM-style prediction with logical reasoning). Unless you’re talking about what’s the point of publishing Nature papers - then it’s probably that the people involved want some concrete recognition of their work up to this point. And I suppose tech/investor press until they get something useful out of it.

generationP
0 replies
19h21m

What they did here is one of the first steps towards an AI that generates nontrivial and correct mathematical proofs. (An alternative benchmark would be IMO-level inequalities. Other topics such as algebra, combinatorics and number theory are probably no easier than the rest of mathematics, thus less useful as stepping stones.)

And an AI that generates nontrivial and correct mathematical proofs would, in turn, be a good first step towards an AI that can "think logically" in the common-sense meaning of the word (i.e., not just mess around with words and sentences but actually have some logical theory behind what it is saying). It might be a dead end, but it might not be. Even if it is, it will be a boon to mathematics.

achierius
0 replies
21h21m

I think Occam's razor would have something to say about the relative likelihood of those two options.

1024core
8 replies
1d

From the Nature article:

Note that the performance of GPT-4 performance on IMO problems can also be contaminated by public solutions in its training data.

Doesn't anybody proofread in Nature?

chefandy
4 replies
1d

Not a subject matter expert, so forgive me if the question is unintentionally obtuse, but it seems like a reasonable statement. They seem to be implying that problems with existing public solutions wouldn't be a good indicator of performance in solving novel problems-- likely a more important evaluative measure than how fast it can replicate an answer it's already seen. Since you couldn't know if that solution was in its training data, you couldn't know if you were doing that or doing a more organic series of problem-solving steps, therefore contaminating it. What's the problem with saying that?

mlyle
3 replies
1d

They're referring to "performance of performance".

Not that it's a big deal. I notice problems like this slip into my writing more and more without detection as I get older :/

I don't see the sentence in the Nature paper, though.

chefandy
1 replies
1d

Ah-- I've got super bad ADHD so I usually don't even catch things like that. I'm a terrible editor. I wouldn't be surprised if the paper authors were paying attention to this thread, and gave a call to Nature after seeing the parent comment.

1024core
0 replies
20h14m

The text is still there.

Paper: https://www.nature.com/articles/s41586-023-06747-5

Section title: Geometry theorem prover baselines

Second paragraph under that. 3rd sentence from the end of the para.

1024core
0 replies
20h15m

Paper: https://www.nature.com/articles/s41586-023-06747-5

Section title: Geometry theorem prover baselines

Second paragraph under that. 3rd sentence from the end of the para.

dr_kiszonka
2 replies
1d

Weird. I can't find the sentence you are quoting in the paper.

1024core
1 replies
20h15m

Paper: https://www.nature.com/articles/s41586-023-06747-5

Section title: Geometry theorem prover baselines

Second paragraph under that. 3rd sentence from the end of the para.

dr_kiszonka
0 replies
8h59m

Thank you.

marojejian
7 replies
1d1h

Though this particular model doesn't sound generalizable, the neuro-symbolic approach seems very promising to me:

- linking the (increasingly powerful) "system 1" tools that are most of current ML with more structured "system 2" tools, like logical proof generation, which can plan and check the veracity / value of output.

- System 2 chugs along 'till it gets stuck, then system 1 jumps in to provide an intuitive guess on what part of state space to check next.

- Here they leveraged the ability to generate proofs by computer to create a data set of 100m proofs, enabling scalable self-supervised learning. Seems to me the symbolic domains are well formed to allow such generation of data, which while low value in each instance, might allow valuable pre-training in aggregate.

Putting these elements together is an approach that could get us quite far.

The key milestone will be moving away from the need to use specific formal / symbolic domains, and to generate a pretrained system that can generalize the skills learned from those domains.

gizmo686
5 replies
1d

The key milestone will be moving away from the need to use specific formal / symbolic domains, and to generate a pretrained system that can generalize the skills learned from those domains.

You do not need to solve everything at once. This approach has the potential to revolutionize both math and programming by moving formal verification from being a niche tool into a regular part of every practitioner's toolbox.

It also completely solves (within the domain where it applies) one of the most fundamental problems of AI, which the current round is calling "hallucinations"; however, that solution only works because we have a non-AI system to prove correctness.

At a high level, this approach is not really that new. Biochem has been using AI to help find candidate molecules, which are then verified by physical experimentation.

Combinatorial game AI has been using AI as an input to old-fashioned Monte Carlo searches.

empath-nirvana
2 replies
19h53m

however, that solution only works because we have a non-AI system to prove correctness

I'm not sure why you are calling it non-AI. There's no reason why some AI system has to be some single neural network like GPT and not a hacked together conglomeration of a bunch of neural networks and symbolic logic systems. Like is it cheating to use a SAT solver? Is a SAT solver itself not artificial intelligence of a kind?

cubefox
1 replies
19h27m

In modern terminology, AI = machine learning model. The hand coded GOFAI from the past is just called software.

lucubratory
0 replies
17h33m

No, you can't just declare a word redefined when it is still in common usage in its correct meaning, which includes both GOFAI and deep learning/neural network approaches. AI is defined by its effects, not its architecture.

jvanderbot
0 replies
1d

Moreover, it's not so recently that people began training networks to help make guesses for branch and bound type solvers.

initplus
0 replies
20h31m

however, that solution only works because we have a non-AI system to prove correctness

But this is actually a really common scenario. Checking a solution for correctness is often much easier than actually finding a correct solution in the first place.

uptownfunk
0 replies
15h1m

This I suspect is the closest chance to reaching some flavor of AGI

FranchuFranchu
6 replies
20h54m

As soon as ChatGPT got released, I tried to make it solve IMO-style problems. It failed.

If this is legit, this is huge. Finding geometric proofs means proving things, and proving propositions is what an intelligence does. In my opinion, this is the closest we have gotten so far to AGI. Really excited to see what the future holds.

CamperBob2
2 replies
11h49m

As soon as ChatGPT got released, I tried to make it solve IMO-style problems. It failed.

Have you tried the same questions with ChatGPT 4? It is a transformational change (no pun intended) over the earlier releases, and over all open-source models.

Just today, I needed to interpret some awkwardly-timestamped log data. I asked it a few questions along the lines of "What time was it 10,000 seconds before xx:yy:zz?" It didn't give me the answer, but it wrote and executed a Python program that did.

ChatGTP
1 replies
8h34m

From the paper:

When producing full natural-language proofs on IMO-AG-30, however, GPT-4 has a success rate of 0% ...

CamperBob2
0 replies
50m

"Also, the C++ code this talking dog wrote for me is full of buffer overflow vulnerabilities, and my Commodore 64 still can't run Crysis. There's nothing but hype here, move along."

News flash: either AI gains the ability to postulate theorems, generate proofs, and validate them, or mathematics is as dead as Latin. Humans have reached their limit.

gowld
1 replies
20h8m

Proving propositions has been computer work for several decades.

This is not a pure LLM. Search algorithms are obviously incredibly powerful at solving search problems. The LLM contribution is an "intuitive" way to optimize the search.

extractionmech
0 replies
14h6m

This is basically semantic pruning.

1024core
0 replies
20h13m

From the paper:

When producing full natural-language proofs on IMO-AG-30, however, GPT-4 has a success rate of 0% ...

artninja1988
5 replies
22h18m

To train AlphaGeometry's language model, the researchers had to create their own training data to compensate for the scarcity of existing geometric data. They generated nearly half a billion random geometric diagrams and fed them to the symbolic engine. This engine analyzed each diagram and produced statements about their properties. These statements were organized into 100 million synthetic proofs to train the language model.

With all the bickering about copyright, could something similar be used for coding LLMs? It would kill the IP issues, at least for coding.

nodogoto
1 replies
21h38m

What statements about properties of randomly generated code snippets would be useful for coding LLMs? You would need to generate text explaining what each snippet does, but that would require an existing coding LLM, so any IP concerns would persist.

artninja1988
0 replies
21h27m

Yeah, but couldn't some system be built that understands what the different code snippets do by compiling them or whatever?

cubefox
1 replies
19h20m

It's the topic of this fascinating paper:

https://arxiv.org/abs/2207.14502

artninja1988
0 replies
17h49m

Hey, thanks! I'll add it to my reading list

sangnoir
0 replies
20h49m

For language, the symbolic engine itself would likely be trained on copyrighted input, unlike the geometry engine, since math & math facts are not covered by copyright.

If you couple random text generation with such an engine for language, you'd be laundering your training data using the extra step (and the quality will likely be worse due to multiplicative errors).

apsec112
5 replies
1d1h

Interesting that the transformer used is tiny. From the paper:

"We use the Meliad library for transformer training with its base settings. The transformer has 12 layers, embedding dimension of 1,024, eight heads of attention and an inter-attention dense layer of dimension 4,096 with ReLU activation. Overall, the transformer has 151 million parameters, excluding embedding layers at its input and output heads. Our customized tokenizer is trained with ‘word’ mode using SentencePiece and has a vocabulary size of 757. We limit the maximum context length to 1,024 tokens and use T5-style relative position embedding. Sequence packing is also used because more than 90% of our sequences are under 200 in length."

albertzeyer
3 replies
1d

It's not tiny; this is a quite normal size outside the field of LLMs, e.g. for normal-sized language models, translation models, or acoustic models. Some people would even call this large.

apsec112
2 replies
1d

It's tiny by the standards of transformers; I'm pretty sure most transformers trained (across all domains) are larger than this.

kristjansson
0 replies
22h49m

It's tiny by the standards of LLMs; the first letter of _L_LM might give you some indication as to where those fit in the overall landscape.

albertzeyer
0 replies
23h49m

No. Where did you get that from?

Looking at NeurIPS 2023:

https://openreview.net/group?id=NeurIPS.cc/2023/Conference#t...

Some random spotlight papers:

- https://openreview.net/pdf?id=YkBDJWerKg: Transformer (VPT) with 248M parameters

- https://openreview.net/pdf?id=CAF4CnUblx: Vit-B/16 with 86M parameters

- https://openreview.net/pdf?id=3PjCt4kmRx: Transformer with 282M parameters

Also, in my field (speech recognition, machine translation, language modeling), all using Transformer variants, this is a pretty normal model size.

ipnon
0 replies
1d

This suggests there may be some more low-hanging fruit in the hard sciences for transformers to bite at, so long as the problems can be properly formulated. This was not a problem of scaling, it seems.

szmerdi
4 replies
15h14m

As a former problem designer for the IMO and similar contests, I deeply enjoyed reading this paper. At the same time, I'd like to point out that it was clear geometry had to be the first topic to fall to AI (i.e., smart knowledge and inference-method indexing). Among math olympiad topics, geometry problems are often the most "mechanical." Once you can express the problem in terms of coordinates (think XY or the complex plane), you're looking at a finite set of steps a computer can use to find a solution. Of course, time limits and human error get in the way of this being practical during the IMO itself. Back in the day, I'd use WolframAlpha to verify geometry proofs for problems I designed (+conjectures) using this approach.
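For readers who haven't seen this style, a minimal toy example of the coordinate-bashing idea using sympy (my own illustration, nothing to do with DeepMind's engine): place the triangle symbolically and let the computer grind out the algebra, here checking that the three medians of any triangle are concurrent.

    import sympy as sp

    x, y = sp.symbols('x y', real=True)
    A, B, C = sp.Matrix([0, 0]), sp.Matrix([1, 0]), sp.Matrix([x, y])
    G = (A + B + C) / 3  # candidate point: the centroid

    def collinear(P, Q, R):
        # the 2D cross product of (Q-P) and (R-P) vanishes iff P, Q, R are collinear
        return sp.simplify((Q - P)[0] * (R - P)[1] - (Q - P)[1] * (R - P)[0]) == 0

    # each median joins a vertex to the opposite side's midpoint; G lies on all three
    assert collinear(A, (B + C) / 2, G)
    assert collinear(B, (A + C) / 2, G)
    assert collinear(C, (A + B) / 2, G)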

Algebra (especially inequalities) can be similar – pull off some intense calculations and you often have your answer.

Where I'm really excited to see intelligent systems make progress is with Number Theory and Combinatorics problems. The search spaces are far more complex and they often require proving that something is impossible. These are the problems that would be difficult to solve with brute force computation.

fovc
1 replies
14h42m

As a consumer of those problems, first of all, thank you! Even decades on from high school I occasionally enjoy working on them.

But agree that geometry was obviously going to be first. From what I’ve gathered here, it’s not “brute forcing” in terms of relying on algebraic geometry, vectors, or complex number solutions, but it is brute force in that it’s exhaustively looking for “interesting” constructions.

Geometry was always my worst subject, but even so I felt like if given the right construction the problem was much easier. Unfortunately I never developed the intuition to hit on those constructions quickly. It seems this AI doesn’t either, but it can churn through them much faster — There are only so many perpendiculars and parallels and bisectors you can construct which you can more or less mechanically evaluate (map out all angles and ratios, try power of a point, etc.)

While this is incredibly impressive, it seems like Deep Mind:Kasparov::AlphaGeo:Terry Tao in the “engine vs AI” sense.

I agree Algebra is likely next. Like geometry, more often than not you “just” need a clever substitution or 3, of which there are only so many options to choose from.

Some combinatorics problems I think would also be amenable to this search strategy (e.g., finding things to count 2 ways), but that seems like a bridge further away, and only gets you a subset of problems.

Number theory would be my guess for the final frontier before a perfect 42.

fovc
0 replies
6h17m

Heh, can’t edit now, but s/Deep Mind/Deep Blue/

Dracophoenix
1 replies
11h35m

How did you get into such a position? Is there an application process of some sort?

After verifying solvability, what is the process for selecting the specific problems that will show up in the finished set? Is it by vote or some other evaluation?

szmerdi
0 replies
3h52m

Each year, national math olympiad committees submit their toughest problems to the IMO. A select group makes the "shortlist," a pool of ~32 questions (8 each for geometry, algebra, number theory, combinatorics). From this pool, the final six IMO problems are chosen through a rigorous voting process.

What makes a great IMO problem? It's all about originality. Ideal problems aren't just about applying advanced theorems, but rather test creative thinking and mastery of fundamental principles. Of course, there were cases in which the IMO committee didn't realize a problem was an obvious corollary of a grad-level theorem, and some well-read (high-school) students benefited from knowing that!

My Journey: I was once a top-ranked competitor myself. After high school, I joined my country's math olympiad committee.

qrian
3 replies
18h52m

I was all ready to be skeptical, since for most of these kinds of works the outputs are 'not like human proofs', but then I saw Evan Chen's quote that it is indeed a clean human-readable proof. Evan Chen is a prominent member of the Olympiad math community and the author of a famous Olympiad geometry book[1], so this time I will have to concede that machines have indeed conquered part of the IMO problems.

[1]: https://web.evanchen.cc/geombook.html

qrian
2 replies
18h16m

But then again, there's an error in the proof of IMO P3, in Fig. 1f and Step 26 of the full proof in the supplementary material[1], which states that ∠GMD = ∠GO2D; this is incorrect. It should be ∠GMD + ∠GO2D = π. I am trying to follow its logic but I cannot parse Step 25. Did it hallucinate this step?

It has the right idea of O2 being on the nine point circle though.

edit: I retract my statement. It looks like it is using directed angles[2], in which case the statement becomes correct.
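Concretely, since directed angles between lines are defined only modulo π, a single directed-angle equality covers both undirected configurations:

    ∠GMD ≡ ∠GO2D (mod π)  ⟺  ∠GMD = ∠GO2D  or  ∠GMD + ∠GO2D = π

which is presumably why Step 26 checks out under that convention.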

[1]: https://storage.googleapis.com/deepmind-media/DeepMind.com/B...

[2]: https://web.evanchen.cc/handouts/Directed-Angles/Directed-An...

polygamous_bat
1 replies
18h7m

I haven’t checked the proof yet, but could it be possible that it’s using directed angles? Threw me off initially when I first learned it as well, but it can be handy for different configurations.

qrian
0 replies
18h4m

Looks like you are correct. I did not know about directed angles. Thanks.

bnprks
3 replies
22h53m

I appreciate that the authors released code and weights with their paper! This is the first high-profile DeepMind paper I can recall that has runnable inference code + checkpoints released. (Though I'm happy to be corrected by earlier examples I've missed)

I don't yet see a public copy of the training set / example training code, but still this is a good step towards providing something other researchers can build on -- which is after all the whole point of academic papers!

pk-protect-ai
2 replies
20h16m

Yeap. I'm missing the datasets as well. They have generated 100M synthetic examples ... Were these examples generated with AlphaGeometry? Where is the filtering code and initial input to generate these synthetics?

Am I wrong that they are using a T5 model? At least they are using the SentencePiece T5 vocabulary.

How many GPU hours did they spend training this model? Which training parameters were used?

Don't get me wrong, I find this system fascinating; it is what applied engineering should look like. But I'd like to know more about the training details and the initial data they used, as well as the methods of synthetic data generation.

jsenn
1 replies
18h43m

The methods section of the paper describes the training data generation as well as the model settings: https://www.nature.com/articles/s41586-023-06747-5#Sec16

pk-protect-ai
0 replies
18h41m

ty for pointing that out!

WhitneyLand
2 replies
23h23m

It was interesting. One of the reviewers noted that one of the generated proofs didn’t seem that elegant in his opinion.

Given enough compute, I wonder how much this would be improved by having it find as many solutions as possible within the allotted time and proposing the simplest one.

spookie
0 replies
23h2m

Elegance would require more than this. Reasoning, for one.

jacobolus
0 replies
22h34m

The issue is that the computer-generated proofs operate at a very low level, step by step, like writing a program in assembly language without the use of coherent structure.

The human proof style instead chunks the parts of a solution into human-meaningful "lemmas" (helper theorems) and builds bodies of theory into well-defined and widely used abstractions like complex numbers or derivatives, with a corpus of accepted results.

Some human proofs of theorems also have a bit of this flavor of inscrutable lists of highly technical steps, especially the first time something is proven. Over time, the most important theorems are often recast in terms of a more suitable grounding theory, in which they can be proven with a few obvious statements or sometimes a clever trick or two.

Imnimo
2 replies
1d

I'm very curious how often the LM produces a helpful construction. Surely it must be doing better than random chance, but is it throwing out thousands of constructions before it finds a good one, or is it able to generate useful proposals at a rate similar to human experts?

They say in the paper, "Because the language model decoding process returns k different sequences describing k alternative auxiliary constructions, we perform a beam search over these k options, using the score of each beam as its value function. This set-up is highly parallelizable across beams, allowing substantial speed-up when there are parallel computational resources. In our experiments, we use a beam size of k = 512, the maximum number of iterations is 16 and the branching factor for each node, that is, the decoding batch size, is 32."

But I don't totally understand how 512 and 16 translate into the total number of constructions proposed. They also note that ablating beam size and max iterations seems to only somewhat degrade performance. Does this imply that the model is actually pretty good at putting helpful constructions near the top, and only for the hardest problems does it need to produce thousands?
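For reference, here is a generic beam-search skeleton (my own illustration, not the paper's code) showing how those knobs interact: each round expands every kept hypothesis into "branch" continuations and keeps only the k highest-scoring overall, so at most k * branch * max_iter = 512 * 32 * 16 candidates ever get scored. The expand and score arguments are hypothetical stand-ins for the LM decoder and its value function.

    import heapq

    def beam_search(start, expand, score, k=512, branch=32, max_iter=16):
        beam = [start]
        for _ in range(max_iter):
            # ask the decoder for `branch` continuations of each kept hypothesis
            candidates = [c for hyp in beam for c in expand(hyp, branch)]
            if not candidates:
                break
            # keep only the k best-scoring hypotheses for the next round
            beam = heapq.nlargest(k, candidates, key=score)
        return beam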

refulgentis
0 replies
22h58m

IMHO: this bumps, hard, against limitations of language / human-machine analogies.

But let's try -- TL;DR 262,144, but don't take it literally:

- The output of one decoding step is a token, ~3/4 of a word. Let's just say 1 word.

- Tokens considered per output token = beam_size * branching_factor * max_iterations = 512 * 32 * 16 = 262,144.

- Let's take their sample solution and get a token count: https://storage.googleapis.com/deepmind-media/DeepMind.com/B...

- Total tokens for the solution = 2,289.

- Total # of tokens considered = 262,144 * 2,289 = 600,047,616.

- Hack: "number of solutions considered" = total tokens considered / total tokens in solution = 262,144 (the same # as the number of tokens considered at each step, which makes sense).

kingkongjaffa
0 replies
23h28m

thanks TIL what Beam search is https://en.wikipedia.org/wiki/Beam_search

zodiac
1 replies
23h35m

The real TIL (to me) is that the previous state-of-the-art could solve 10 of these! I'd heard there was a decision algorithm for plane geometry problems but I didn't know it was a practical one. Some searching turned up http://www.mmrc.iss.ac.cn/~xgao/paper/book-area.pdf as a reference.

lacker
0 replies
1h36m

Yes, and even the non-neural-network, symbolic plus linear algebra component of AlphaGeometry is able to outperform the previous state of the art. So a decent amount of work here went into the components that aren't neural networks at all.

topsycatt
1 replies
17h24m

Oh hey! I ghost-wrote an earlier article that references the same idea (https://blog.google/technology/ai/bard-improved-reasoning-go...). I wonder if they came to it independently or if they read mine and it planted some seeds? Either way, super cool work.

j2kun
0 replies
17h17m

They have been working on these ideas since like 2021 they said, so I don't think it's that recent.

summerlight
1 replies
1d

It looks like there's some interesting work connecting ML with symbolic reasoning (or searching). I'm closer to a layman in this area, but IIUC the latter is known to be rife with yet-to-be-understood heuristics to prune the solution space, and ML models are pretty good in this area. I'm not in a position to suggest what needs to happen to further push the boundary, but my impression is that the current significant blocker is that we don't really have a way to construct a self-feedback loop structure that consistently iterates and improves the model from its own output. If this can be done properly, we may see something incredible in a few years.

rafaelero
0 replies
22h46m

I didn't look further into their method, but it seems to me that symbolic reasoning is only important insofar as it makes the solution verifiable. That still is a glorious capability, but a very narrow one.

m3kw9
1 replies
1d

So create ai modules for every niche subject and have a OpenAI like function calling system to inference them?

kfarr
0 replies
1d

Yes, but I would argue geometry is not niche; it is the foundation of reasoning about the physical world, not just math problems.

ipsum2
1 replies
20h59m

Do normal proofs for Olympiad level geometry questions require 150+ steps like the one that was generated here? It seems inelegant.

lacker
0 replies
20h51m

When a human solves these problems, you might state a "step" in natural language. Like, "apply a polar transformation to this diagram". But that same single step, if you translate into a low-level DSL, could be dozens of different steps.

For an example, check out this formalization of a solution to a non-geometry IMO problem:

https://gist.github.com/mlyean/b4cc46cf6b0705c1226511a2b404d...

Roughly the same solution is written here in human language:

https://artofproblemsolving.com/wiki/index.php/1972_IMO_Prob...

It's not precisely clear what counts as a "step" but to me this seems like roughly 50 steps in the low-level formalized proof, and 5 steps in the human-language proof.

So, my conclusion is just that I wouldn't necessarily say that a 150-step automated proof was inelegant. It can also be that the language used to express these proofs fills in more of the details.

It's like the difference between counting lines of code in compiled code and in a high-level language; counting the steps is really dependent on the level at which you express it.

gumballindie
1 replies
1d

I suppose we can now replace high schoolers.

dandanua
0 replies
23h57m

and their teachers

generationP
1 replies
19h25m

What deductive system are they using to verify the proofs? I'm asking because the conventions of olympiad geometry are slightly different from those of the rest of mathematics (you can make informal "general position" arguments; you can pooh-pooh issues of sign and betweenness; you can apply theorems to unstated limiting cases; etc.), and it is far from obvious to me how this logic can be formalized without causing contradictions.

lacker
0 replies
1h32m

They built their own symbolic geometry engine, called DDAR. The IMO judges will typically let you get away with much less rigor than a symbolic engine like this produces, so I don't think they are taking advantage of any of these conventions.

aldousd666
1 replies
9h30m

This is fascinating and inspiring progress! What's crazy to me is how many people are looking to put stakes in the ground in front of this, claiming breathlessly, before anyone can speak, that this isn't real AI progress or that this isn't actually doing it how humans supposedly do, and implying that therefore it's not a big deal. I'm sorry to break it to you folks, this is just another signpost moment, one more event to mark our place in history as we sail on. And when we look back on it, there aren't going to be all the disclaimers people are trying to throw down now. We had ImageNet and Deep Blue, we had AlphaZero, AlphaGo, GPT-4, AlphaFold, and now we have AlphaGeometry. As DJ Khaled likes to point out, this is... another one.

ChatGTP
0 replies
8h58m

This reads like another "edge lord" or "karma farming" comment. These seem to be more and more frequent here lately.

The point of the forum is that people who are interested in these things can discuss them together openly. People are having discussions about the way they interpret the paper and what they think it means. You seem to be telling those people that's pointless and we should just accept this as a breakthrough because you told us so. Which I think goes against the spirit of this forum and the ability for people to look at the world with an objective lens.

If this work is as good as you say it is, and it very well might be, it will stand on its own no matter what people say about it. Time will tell if this was meaningful or not.

ChatGTP
1 replies
1d

“Our AI system”

It’s our maths. No one else owns it?

Funny game the AI field is playing now. One-upmanship seems to be the aim of the game.

drdeca
0 replies
22h48m

The AI system that they produced.

“Our” doesn’t always imply ownership/title. I do not own my dad.

zbyforgotp
0 replies
22h20m

This sounds like the famous system 1 and system 2 thinking, with the prover doing the systematic system 2 thinking and the LLM doing the intuitive system 1 thinking.

xbmcuser
0 replies
8h32m

Would something like this help AI/computers see the world better for robotics, self-driving, etc.? I am not a programmer or in this field, but intuitively, reading the headline alone made me feel that the world of robotics and self-driving, i.e. where AI interacts with the physical world, is about to change.

wigster
0 replies
19h42m

The hounds of Tindalos will be sniffing around this.

sidcool
0 replies
22h22m

And it's been open sourced. The code and model both!

sega_sai
0 replies
1d1h

Just in case the DOI link from the press-release doesn't work for you -- https://www.nature.com/articles/s41586-023-06747-5 here's the paper.

riku_iki
0 replies
1d

My understanding is that they encoded domain in several dozens of mechanical rules described in extended data table 1, and then did transformer-guided brute force search for solutions.

logicallee
0 replies
5h49m

I wonder how well it would do with html page layout. It is nice to see advancements in geometric reasoning!

To give you a basis for comparison, I've found that (paid) ChatGPT even quite recently was not able to reason about simple HTML and CSS geometric constructs, such as what div is inside what container: with just two divs or containers on the page, it still applied styling to the wrong one, and I had to move the line myself to the one it meant to style, even after I described to ChatGPT what the page currently looked like.

I just tried it again, it did succeed with this prompt:

"Create an HTML page that you divide in half vertically, you will set the background of divs to different colors. After dividing the page in half vertically, on the left half of the page put three different divs each taking up a third of the vertical space, color them red, orange, yellow. On the right half of the page create two different divs each taking half of the vertical space, color them green, blue. Within each div put a paragraph in black letters with the color and location of the div it is located in, such as "Left hand side, top div. This background is red." and so on for top, middle, and bottom on the left and top and bottom on the right, with the colors red, orange, yellow, green and blue respectively as their backgrounds. Reply with the complete html and css style page in single html file that I can copy and paste into a single file, when loaded it should show the layout I have described." When I try this right now I get a successsful page.

Here is the code it generated: https://paste.sh/drpp3Kci#FoQHQT6YtMknnI-y2N7beqAd which you can see created this https://ibb.co/NFwnvWL

It previously failed at similar layout questions.

If ChatGPT had this "Olympiad-level" geometric reasoning it would be amazing for producing html pages that include complex layout. It currently doesn't always succeed at this.

Layout is the area that I've found ChatGPT is weakest in for generating code, which I chalk up to the fact that it doesn't have a vision subsystem, so it can't really picture what it's doing.

jksk61
0 replies
14h28m

bruh! with 100 mln elements in the dataset it is basically printing out previously known problems during testing.

hyh1048576
0 replies
17h24m

Maybe it's mentioned somewhere but I missed it, could someone provide a list of the five problems that AlphaGeometry failed to solve? The paper mentioned IMO 2008p6 and IMO 2019p2. What are the other three?

hcarlens
0 replies
1d

Also relevant: https://aimoprize.com/

($10m prize for models that can perform well at IMO)

graycat
0 replies
14h18m

Given triangle ABC, by Euclidean construction, find D on AB and E on BC so that the lengths

AD = DE = EC.

gisearkr
0 replies
16h44m

There's a worked example near the article's start proving that an isosceles triangle has equal angles: the so-called pons asinorum. The caption concludes with "In this example, just one construct is required."

That strikes me as odd, because I'd expect the symbolic engine to require zero such constructs to find a proof. Although dropping a bisector is the standard grade-school proof strategy, there's a well-known proof that defines no additional entities: since AB = AC, we have (by side-side-side) that △ABC ≅ △ACB, and therefore that ∠ABC = ∠ACB.

It's funny because what that proof is most famous for is having been rediscovered by an automated theorem prover in the 1950s-1960s. (Wikipedia states that this is a myth, and that it was only rediscovered by Marvin Minsky pretending to be an automated theorem prover... which is true; but then Herbert Gelernter later wrote such a theorem prover! [1])

[1] https://www.qedcat.com/function/27.3.pdf#page=18

compthink
0 replies
19h44m

Very interesting. I was suspicious they could get an LLM to do this alone, and reading the description, they don't. It's still AI I would say, maybe even more so, since it alternates between a generative machine learning model that comes up with hypothesis points and an optimizer that solves for those points and simplifies the problem. So it really is a mathematical reasoning system. Also good to note that it is tuned specifically to these types of geometry problems.

apsec112
0 replies
1d1h

Links to the paper appear broken, but here is a link:

https://www.nature.com/articles/s41586-023-06747-5

aconz2
0 replies
2h27m

summary: generate 500M synthetic examples where you start with random constructions, deduce (not sure if until saturation or to a fixed depth) and use each conclusion along with its dependency graph and original premises as a training example. The 500M reduces to 100M after dedupe and they used 7.5M CPU hours. 9M/100M used auxiliary/exogenous constructions (which I think of as like if you're doing this on program search and the program requires some constant pi that isn't derivable from the inputs). They encode the examples for a transformer and finetune on the auxiliary construction subset to eke out 2 more problems. Then beam search using the model to generate the next term and the logic engine to expand the closure until the conclusion is in the set.
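My rough mental model of that pipeline as pseudocode (sample_premises, deduce_closure, traceback and dedupe are my own hypothetical names, not theirs):

    def generate_training_data(n_attempts):
        examples = []
        for _ in range(n_attempts):
            premises = sample_premises()            # random geometric construction
            facts, deps = deduce_closure(premises)  # forward-chain the symbolic rules
            for conclusion in facts:
                proof = traceback(conclusion, deps) # minimal dependency subgraph
                examples.append((proof.premises, proof.steps, conclusion))
        return dedupe(examples)                     # 500M examples -> ~100M after dedupe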

They make a big deal about these auxiliary constructions but I'm not sure the results back that up. I also don't understand what "without pretraining" means; they say without pretraining it gets 21 problems, without fine-tuning 23 problems, and altogether 25. So are they getting 21/25 without their synthetic data at all??

I want to look at how the dependency graph is encoded in the training examples and whether you might expect different results from different encodings. Ideally I'd think you'd want that part of the model/embedding to be invariant across all possible (big number!) encodings.

The human comparisons are really weak in Fig 2, and unfortunately that is probably what the media will mainly focus on. I do find the results interesting and wonder how well this translates to program synthesis, for example. The paper mentioned in these comments, "Language Models Can Teach Themselves to Program Better", which I have yet to read, sounds related. I'd be very interested to see more methods that are very different from LLM heuristic beam search, even though it does seem compelling and they get to publish lots of papers with it.

ackfoobar
0 replies
23h20m

Euclidean geometry is decidable. Does that make it easier for computers, compared to other IMO problems?

TFortunato
0 replies
1d

I'm a big fan of approaches like this, that combine deep learning / newer techniques w/ "GOFAI" approaches.

JacobiX
0 replies
2h41m

The most surprising observation for me is that the well-known Wu's Method for proving geometric theorems solved 10 problems! AlphaGeometry solved 25 problems.