What I wonder, as a computer scientist:
If you want to solve grade school math problems, why not use an 'add' instruction? It's been around since the 50s, runs a billion times faster than an LLM, every assembly-language programmer knows how to use it, every high-level language has a one-token equivalent, and it doesn't hallucinate answers (integer overflow aside).
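To make the overflow caveat concrete, here is a minimal sketch. Python integers never overflow, so `add_i32` is a hypothetical helper of my own that imitates what a 32-bit two's-complement 'add' would do; it is not any particular CPU's semantics.

```python
def add_i32(a: int, b: int) -> int:
    """Add two integers the way a 32-bit two's-complement register would (illustrative helper)."""
    total = (a + b) & 0xFFFFFFFF  # keep only the low 32 bits
    # If the sign bit is set, reinterpret the bit pattern as a negative number.
    return total - 0x100000000 if total & 0x80000000 else total


print(2 + 3)                  # plain addition: exact and deterministic
print(add_i32(2**31 - 1, 1))  # the one failure mode: wraps around to the most negative value
```

The wraparound is wrong in the arithmetic sense, but it is the same wrong answer every time, which is the point of the comparison.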
We also know how to solve complex reasoning chains that require backtracking. Prolog has been around since 1972. It's not used that much because that's not the programming problem that most people are solving.
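For anyone who hasn't seen it, this is roughly the kind of depth-first search with backtracking that Prolog gives you for free. It's a sketch in Python rather than Prolog, using N-queens as a stand-in problem; `solve_queens` is my own illustrative name.

```python
def solve_queens(n, placed=()):
    """Yield every way to place n non-attacking queens.

    placed[r] is the column of the queen already placed in row r.
    """
    row = len(placed)
    if row == n:
        yield placed
        return
    for col in range(n):
        # A candidate is safe if it shares no column or diagonal with earlier queens.
        if all(col != c and abs(col - c) != row - r for r, c in enumerate(placed)):
            yield from solve_queens(n, placed + (col,))
        # Reaching the next loop iteration is the backtrack:
        # the failed choice is dropped and the next column is tried.


print(next(solve_queens(4)))
print(sum(1 for _ in solve_queens(8)))
```

The search abandons a partial placement as soon as it cannot be extended, which is exactly the behaviour a left-to-right text generator lacks.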
Why not use each tool for what it's good at, and pick different tools for the problems they handle better? LLMs are good for summarization, autocompletion, and as an input to many other language problems like spelling and bigrams. They're not good at math. Computers are really good at math.
There's a theorem that an LLM can compute any computable function. That's true, but so can lambda calculus. We don't program in raw lambda calculus because it's terribly inefficient. Same with LLMs for arithmetic problems.
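To illustrate the lambda calculus point, here is a sketch of addition with Church numerals, written as Python lambdas. The names `zero`, `succ`, `add`, `church` and `to_int` are the usual textbook ones, chosen by me for illustration.

```python
# A Church numeral n is a function that applies f to x exactly n times.
zero = lambda f: lambda x: x
succ = lambda n: lambda f: lambda x: f(n(f)(x))
add = lambda m: lambda n: lambda f: lambda x: m(f)(n(f)(x))


def church(k):
    """Build the Church numeral for a non-negative Python int."""
    n = zero
    for _ in range(k):
        n = succ(n)
    return n


def to_int(n):
    """Decode a Church numeral by counting how many times it applies f."""
    return n(lambda i: i + 1)(0)


print(to_int(add(church(2))(church(3))))  # same answer as 2 + 3, via a pile of function calls
print(2 + 3)                              # one machine instruction
```

Both lines compute the same sum. The first does it with a number of function applications that grows with the size of the operands, the second with a single add.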
I feel very comfortable saying, as a mathematician, that the ability to solve grade school maths problems would not be at all a predictor of ability to solve real mathematical problems at a research level.
The reasons LLMs fail at solving mathematical problems are: 1) they are terrible at arithmetic, 2) they are terrible at algebra, 3) most importantly, they are terrible at complex reasoning (more specifically, they mix up quantifiers and don't really understand the complex logical structure of many arguments), and 4) they (current LLMs) cannot backtrack when what they have already written turns out not to lead to a solution, and it is too expensive to give them the thousands of restarts they would need to randomly guess their way through the problem if you did give them that facility.
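As an example of what mixing up quantifiers means in point 3, here is a small Lean 4 sketch of my own (not taken from any LLM transcript). Swapping "exists-forall" into "forall-exists" is valid, but the reverse swap is not, and the natural numbers give a counterexample.

```lean
-- Valid direction: one witness y that works for every x gives a witness for each x.
theorem exists_forall_imp {α β : Type} (P : α → β → Prop) :
    (∃ y, ∀ x, P x y) → (∀ x, ∃ y, P x y) :=
  fun ⟨y, h⟩ x => ⟨y, h x⟩

-- The converse fails. Every natural number has something larger...
example : ∀ x : Nat, ∃ y : Nat, x < y :=
  fun x => ⟨x + 1, Nat.lt_succ_self x⟩

-- ...but no single natural number is larger than all of them.
example : ¬ ∃ y : Nat, ∀ x : Nat, x < y :=
  fun ⟨y, h⟩ => Nat.lt_irrefl y (h y)
```

An argument that silently treats the second statement as if it were the third is the kind of error I mean.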
Solving grade-school problems might mean progress in 1 and 2, but that is not at all impressive, as there are perfectly good tools out there that solve those problems just fine, and old-style AI researchers have built perfectly good tools for 3. The hard problem to solve is problem 4, and this is something you teach people how to do at a university level.
(I should add that another important problem is what is known as premise selection. I didn't list that because LLMs have actually been shown to manage this ok in about 70% of theorems, which basically matches records set by other machine learning techniques.)
(Real mathematical research also involves what is known as lemma conjecturing. I have never once observed an LLM do it, and I suspect they cannot do so. Basically the parameter set of the LLM dedicated to mathematical reasoning is either large enough to model the entire solution from end to end, or the LLM is likely to completely fail to solve the problem.)
I personally think this entire article is likely complete bunk.
Edit: after reading replies I realise I should have pointed out that humans do not simply backtrack. They learn from failed attempts in ways that LLMs do not seem to. The material they are trained on surely contributes to this problem.