It's funny, this kind of subtree mutation was looked at pretty deeply by Koza and Adami in the 90s under the rubric of Genetic Algorithms, but with a slightly different optimization function.
There's one reference in the paper, from 2000, to GAs for fast generation of program trees, but that's missing the main show.
Hope they're reading this and dig into those guys' work.
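(For anyone who hasn't seen it, here's a minimal sketch of what subtree mutation on an expression tree can look like; the nested-tuple representation and all the names are made up for illustration, not taken from the paper or from Koza's systems.)

    import random

    OPS = ["+", "-", "*"]
    TERMINALS = ["x", "1", "2"]

    def random_tree(depth=3):
        """Grow a random expression tree as nested tuples: (op, left, right)."""
        if depth == 0 or random.random() < 0.3:
            return random.choice(TERMINALS)
        return (random.choice(OPS), random_tree(depth - 1), random_tree(depth - 1))

    def mutate(tree, p=0.2):
        """Replace one randomly chosen subtree with a freshly grown one."""
        if not isinstance(tree, tuple) or random.random() < p:
            return random_tree(depth=2)   # swap in a new random subtree here
        op, left, right = tree
        # Recurse into only one child, so a single subtree changes per call.
        if random.random() < 0.5:
            return (op, mutate(left, p), right)
        return (op, left, mutate(right, p))

    print(mutate(random_tree()))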
You can also say backpropagation is the chain rule from centuries ago.
It is a computationally clever application of the chain rule to minimize the amount of computation needed to compute gradients for all parameters in the network.
IMO backprop is the most trivial implementation of differentiation in neural networks. Do you know an easier way to compute gradients, even one with larger overhead? If so, please share it.
since you asked ... how about Monte Carlo with Gibbs sampling?
Backprop is the application of dynamic programming to the chain rule for total derivatives, which sounds trivial only in retrospect.
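(A toy sketch of that dynamic-programming view, assuming a single chain of scalar ops: the forward pass caches intermediates and the backward pass reuses them, so each local derivative is computed once. Names are illustrative.)

    import math

    def forward(x):
        # loss = exp(sin(x)^2); cache intermediates for the backward pass
        y = math.sin(x)
        z = y * y
        loss = math.exp(z)
        return loss, (x, y, z)

    def backward(cache):
        x, y, z = cache
        dloss_dz = math.exp(z)            # reuses the cached z
        dloss_dy = dloss_dz * 2 * y       # chain rule, reuses dloss_dz
        dloss_dx = dloss_dy * math.cos(x)
        return dloss_dx

    loss, cache = forward(0.5)
    print(backward(cache))                # d/dx exp(sin(x)^2) at x = 0.5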
My first forays into making neural networks used replacement rules to modify an expression tree until all the “D” operators went away, but that takes exponential complexity in network depth if you aren’t careful. Finite differences is linear in the number of parameters, as is differentiation by dual numbers.
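(Since dual numbers came up: a minimal sketch of forward-mode differentiation with them, with made-up class and function names, just to show the epsilon component carrying the derivative along.)

    class Dual:
        def __init__(self, val, eps=0.0):
            self.val, self.eps = val, eps    # value and derivative parts

        def __add__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            return Dual(self.val + other.val, self.eps + other.eps)

        def __mul__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            # product rule carried by the epsilon component
            return Dual(self.val * other.val,
                        self.val * other.eps + self.eps * other.val)

    def f(x):
        return x * x * x + x * 2   # f(x) = x^3 + 2x

    x = Dual(3.0, 1.0)             # seed dx/dx = 1
    print(f(x).val, f(x).eps)      # 33.0 and f'(3) = 3*3^2 + 2 = 29.0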
You can do forward propagation. Humans typically find forward easier than backward.
I totally didn't realize this until these comments. Neat!
I went digging in Wikipedia... the Backpropagation article was created in 2005, and yet its association with (and derivation from) the chain rule wasn't mentioned until 2014, through a borrow from the German article.
https://en.wikipedia.org/w/index.php?title=Backpropagation&o...
Backpropagation is just an application of the chain rule -- cool that we all learned it in high school!
These kinds of Genetic Algorithms are still being researched in academia. I attended a seminar a couple of years ago on the subject. It’s still a total dead end imho.
I used to (early 00s) be super big into GA and GP until a professor of mine at the time described the whole class of algorithms as "Marginally better than brute forcing". That really resonated with my experience, and was just too spot-on to ignore.
GP finds good working trees for complex real-world problems in a single night on a consumer GPU, trees that you would need to brute-force for (millions of) years to find. Quite a margin!
Genetic Programming [1], specifically. I have his two bricks from '92 and '94 (Genetic Programming: On the Programming of Computers by Means of Natural Selection, and Genetic Programming II: Automatic Discovery of Reusable Programs). I've not read his two later ones.
The big problem they seemed to get stuck on was partly doing it fast enough, and partly ending up with a result that was comprehensible. The latter in particular seems to be far better with LLMs. You tended to spend a lot of time trying to reorganise and prune trees to get something you could decipher, so the primary, and too limited, value became cases where you could justify investing a lot of resources in finding more optimal versions of very small, compact algorithms. But the challenge there is that there is often so much lower-hanging fruit in most code bases that few people get to the point where it's worth trying.
I still love the idea at a conceptual level...
[1] https://www.genetic-programming.com/johnkoza.html
Some more recent alternatives to Koza's GP use some very different search mechanisms. FFX & PGE are both very fast.
https://seminars.math.binghamton.edu/ComboSem/worm-chiu.pge_...
https://arxiv.org/pdf/2209.09675
I authored PGE and have thought that RL, and more recently diffusion techniques, might help these algos. All of the algos need better ways to guide the search, or help it get unstuck from local optima, which happens surprisingly fast. Most work in GP / EVO is about avoiding premature convergence.
Are these the references?
- https://web.archive.org/web/20021224053225/http://smi-web.st...
- https://www.genetic-programming.com/jkpdf/tr1314.pdf
Wow, what a blast from the past! I was playing around a LOT with GP around that time, and the name Koza is certainly familiar. I even think I did some semi-similar things, instead of my normal approach, which was simplistic but inefficient in that it generated lots of invalid code.
In my last comment I suggested the authors may not be familiar with Koza and Adami, but I didn't realize the corresponding author is Stuart Russell, co-author with Peter Norvig of "Artificial Intelligence: A Modern Approach" -- "the authoritative, most-used AI textbook, adopted by over 1500 schools," according to their site. https://aima.cs.berkeley.edu/
Whoops!