I can't assess this, but I do worry that overnight some algorithmic advance will enhance LLMs by orders of magnitude and the next big model to get trained is suddenly 10,000x better than GPT-4 and nobody's ready for it.
I've spent some time playing with their Jupyter notebooks. The most useful (to me, anyway) is their Example_3_classfication.ipynb ([1]).
It works as advertised with the parameters selected by the authors, but if we modify the network shape in the second half of the tutorial (Classification formulation) from (2, 2) to (2, 2, 2), it fails to generalize. The training loss gets down to 1e-9, while the test loss stays around 3e-1. Going to larger network sizes does not help either.
I would really like to see a bigger example, with many more parameters and more data complexity, and whether it could be trained at all. MNIST would be a good start.
Update: I increased the training dataset size 100x, and that helps with the overfitting, but now I can't get the training loss below 1e-2. Still iterating on it; GPU acceleration would really help - right now, my progress is limited by the speed of my CPU.
1. https://github.com/KindXiaoming/pykan/blob/master/tutorials/...
Update2: got it to 100% training accuracy, 99% test accuracy with (2, 2, 2) shape.
Changes:
1. Increased the training set from 1000 to 100k samples. This solved overfitting.
2. In the dataset generation, slightly reduced noise (0.1 -> 0.07) so that classes don't overlap. With an overlap, naturally, it's impossible to hit 100%.
3. Most important & specific to KANs: train for 30 steps with grid=5 (5 segments for each activation function), then 30 steps with grid=10 (initializing from the previous model), and then 30 steps with grid=20. This is idiomatic to KANs and covered in Example_1_function_fitting.ipynb (rough sketch after this list): https://github.com/KindXiaoming/pykan/blob/master/tutorials/...
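For reference, a rough sketch of that coarse-to-fine grid schedule, assuming the API as it appears in the pykan tutorials (KAN, train, initialize_from_another_model; names may differ in newer versions) and assuming the usual tutorial dataset dict; the two-moons construction, sizes and noise here just mirror the changes described above:

    import torch
    from sklearn.datasets import make_moons
    from kan import KAN   # assuming the pykan package from the linked repo

    # Assumed dataset format: dict with 'train_input', 'train_label', 'test_input', 'test_label'.
    def moons(n, noise=0.07):
        X, y = make_moons(n_samples=n, noise=noise)
        return torch.tensor(X, dtype=torch.float32), torch.tensor(y[:, None], dtype=torch.float32)

    dataset = {}
    dataset['train_input'], dataset['train_label'] = moons(100_000)
    dataset['test_input'], dataset['test_label'] = moons(10_000)

    # Coarse-to-fine schedule: 30 steps at grid=5, then 10, then 20,
    # warm-starting each stage from the previous model.
    # (Loss/metric arguments from the classification tutorial are omitted for brevity.)
    model = KAN(width=[2, 2, 2], grid=5, k=3)
    model.train(dataset, opt="LBFGS", steps=30)
    for grid in (10, 20):
        model = KAN(width=[2, 2, 2], grid=grid, k=3).initialize_from_another_model(
            model, dataset['train_input'])
        model.train(dataset, opt="LBFGS", steps=30)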
Overall, my impressions are:
- it works!
- the reference implementation is very slow. A GPU implementation is dearly needed.
- it feels like it's a bit too non-linear, and training is not as stable as it is with MLP + ReLU.
- Scaling is not guaranteed to work well. Really need to see if MNIST is possible to solve with this approach.
I will definitely keep an eye on this development.
This makes me wonder what you could achieve if instead of iteratively growing the grid, or worrying about pruning or regularization, you governed network topology with some sort of evolutionary algorithm.
I believe there is a Google paper out there that tried that.
There are 1000s; there is a whole field and set of conferences. You can find more by searching "Genetic Programming" or "Symbolic Regression".
KAN, with its library of variables and math operators, very much resembles this family of algorithms, problems, and limitations. The lowest-hanging fruit they usually leave on the proverbial tree is that you can use fast regression techniques for the constants and coefficients (toy sketch below); there is no need to leave them to random perturbations or gradient descent. What you really need to figure out is the form or shape of the model, rather than leaving it up to the human (as KAN does).
You can do much better by growing an AST with memoization and non-linear regression. So much so that the EVO folks gave a best paper award to a non-EVO, deterministic algorithm at their conference.
https://seminars.math.binghamton.edu/ComboSem/worm-chiu.pge_... (author)
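To illustrate the coefficient-fitting point with a toy example (the candidate form and all names here are made up for illustration): once the form of the model is fixed, the constants can be recovered by ordinary nonlinear regression rather than by evolutionary perturbations.

    import numpy as np
    from scipy.optimize import curve_fit

    # Toy illustration: the *form* a*sin(b*x) + c is assumed given (that is the hard,
    # structural part); the constants are then fit by nonlinear least squares.
    def candidate(x, a, b, c):
        return a * np.sin(b * x) + c

    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 200)
    y = 2.0 * np.sin(1.5 * x) + 0.5 + 0.05 * rng.standard_normal(x.size)

    params, _ = curve_fit(candidate, x, y, p0=[1.0, 1.4, 0.0])
    print(params)   # roughly [2.0, 1.5, 0.5] given a reasonable initial guess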
Increased the training set from 1000 to 100k samples. This solved overfitting.
Solved overfitting or created more? Even if your sets are completely disjoint, with something like two moons, the more data you have, the lower the variance.
It’d be really cool to see a transformer with the MLP layers swapped for KANs and then compare its scaling properties with vanilla transformers
Why was this your first thought? Is the MLP layer a limiting factor for transformers? I thought the bottleneck was the renormalization part.
At small input sizes, yes, the MLP dominates compute. At large input sizes, attention matters more.
This is the first thought that came to my mind too.
Given that it's sparse, will this just be a replacement for MoE?
After trying this out with the Fourier implementation above, swapping the MLP/attention linear layers for KANs (all, or even just a few layers) produces diverging loss. KANs don't require normalization for good forward-pass dynamics, but they may be trickier to train in a deep net.
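A back-of-the-envelope way to see this (rough counts assumed for illustration, ignoring the attention input/output projections and constant factors): the MLP block scales like n·d² in sequence length n, while the attention score/value matmuls scale like n²·d.

    # Rough per-layer FLOP estimates (illustrative only).
    def mlp_flops(n, d, expansion=4):
        # two matmuls: d -> expansion*d and back
        return 2 * n * d * (expansion * d) * 2

    def attention_matmul_flops(n, d):
        # Q @ K^T and scores @ V (projections ignored)
        return 2 * n * n * d * 2

    d = 1024
    for n in (128, 32768):
        print(n, mlp_flops(n, d) / attention_matmul_flops(n, d))
    # -> ~32x at n=128 (MLP dominates), ~0.125x at n=32768 (attention dominates)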
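For context, a hypothetical sketch of the kind of swap being described (the kan_layer argument is a placeholder for whichever KAN implementation is used; this is not the experiment's actual code):

    import torch.nn as nn

    class BlockWithKANFFN(nn.Module):
        """Pre-norm transformer block with the usual MLP sublayer replaced by a KAN layer."""
        def __init__(self, d_model, n_heads, kan_layer):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)
            self.ffn = kan_layer              # replaces Linear -> GELU -> Linear

        def forward(self, x):                 # x: (batch, seq, d_model)
            h = self.norm1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
            x = x + self.ffn(self.norm2(x))
            return x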
It's so refreshing to come across new AI research different from the usual "we modified a transformer in this and that way and got slightly better results on this and that benchmark." All those new papers proposing incremental improvements are important, but... everyone is getting a bit tired of them. Also, anecdotal evidence and recent work suggest we're starting to run into fundamental limits inherent to transformers, so we may well need new alternatives.[a]
The best thing about this new work is that it's not an either/or proposition. The proposed "learnable spline interpolations as activation functions" can be used in conventional DNNs, to improve their expressivity. Now we just have to test the stuff to see if it really works better.
Very nice. Thank you for sharing this work here!
Everyone is getting tired of those papers.
This is how science is :)
95% of researchers will produce mediocre-to-nice improvements to what we already have, so that some of them eventually grow up and do something really exciting.
Nothing wrong with incremental improvements. Giant leaps (almost always) only happen because of a lack of your niche domain expertise. And I mean niche niche
There's a ton, actually. They just tend to go through extra rounds of review (or never make it...) and never make it to HN unless there are special circumstances (this one is MIT and CIT). Unfortunately we've let PR become a very powerful force (it's always been a thing, but it seems more influential now). We can fight against this by upvoting things like this and, if you're a reviewee, by not focusing on SOTA (it's clearly been gamed and is clearly leading us in the wrong direction).
I read a book on NNs by Robert Hecht-Nielsen in 1989, during the NN hype of the time (I believe it was the 2nd hype cycle, the first beginning with Rosenblatt's original hardware perceptron and dying with Minsky and Papert's "Perceptrons" book a decade or two earlier).
Everything described was laughably basic by modern standards, but the motivation given in that book was the Kolmogorov representation theorem: a modest 3-layer network with the right activation functions can represent any continuous m-to-n function.
Most research back then focused on 3-layer networks, possibly for that reason. Sigmoid activation was king, and vanishing gradients were the main issue. It took two decades until AlexNet brought NN research back from the AI winter of the 1990s.
Feels like someone stuffed splines into decision trees.
splines, yes.
I'm not seeing decision trees, though. Am I missing something?
"KANs’ nodes simply sum incoming signals without applying any non-linearities." (page 2 of the PDF)
I definitely think I'm projecting and maybe seeing things that aren't there. If you replaced splines with linear weights, it kind of looks like a decision tree to me.
It’s Monte Carlo all the way down
From the preprint - 100 input dimensions is considered "high", and most problems considered have 5 or fewer input dimensions. This is typical of the physics-inspired settings I've seen considered in ML. The next step would be demonstrating them on MNIST, which, at 784 dimensions, is tiny by modern standards.
In actual business processes there are lots of ML problems with fewer than 100 input dimensions. But for most of them decision trees are still competitive with neural networks or even outperform them.
The aid to explainability seems at least somewhat compelling. Understanding what a random forest did isn't always easy. And if what you want isn't the model but the closed form of what the model does, this could be quite useful. When those hundred input dimensions interact nonlinearly in a million ways, that's nice. Or, more likely, I'd use it when I don't want to find a pencil to derive the closed form of what I'm trying to do.
How does back propagation work now? Do these suffer from vanishing or exploding gradients?
Page 6 explains how they did backpropagation (https://arxiv.org/pdf/2404.19756), and page 2 says that previous efforts to leverage the Kolmogorov-Arnold representation failed to use backpropagation, so maybe using backpropagation to train multilayer networks with this architecture is their main contribution?
Unsurprisingly, the possibility of using Kolmogorov-Arnold representation theorem to build neural networks has been studied [8, 9, 10, 11, 12, 13]. However, most work has stuck with the original depth-2 width-(2n + 1) representation, and did not have the chance to leverage more modern techniques (e.g., back propagation) to train the networks. Our contribution lies in generalizing the original Kolmogorov-Arnold representation to arbitrary widths and depths, revitalizing and contextualizing it in today’s deep learning world, as well as using extensive empirical experiments to highlight its potential role as a foundation model for AI + Science due to its accuracy and interpretability.
No, the activations are a combination of the basis function and the spline function. It's still a little unclear to me how the grid works, but it seems like this shouldn't suffer any more than a generic ReLU MLP.
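Concretely, the paper's per-edge activation combines a fixed silu basis with a learnable spline, each with its own weight. A minimal sketch of one such edge (using simple degree-1 "hat" basis functions instead of the paper's cubic B-splines, with names made up for illustration):

    import torch
    import torch.nn.functional as F

    def kan_edge(x, grid, coeffs, w_b=1.0, w_s=1.0):
        # phi(x) = w_b * silu(x) + w_s * sum_i c_i * B_i(x)
        # B_i here are degree-1 hat functions on `grid`; the paper uses cubic (k=3) B-splines.
        width = grid[1] - grid[0]
        basis = torch.clamp(1 - torch.abs((x.unsqueeze(-1) - grid) / width), min=0)
        return w_b * F.silu(x) + w_s * (basis * coeffs).sum(-1)

    x = torch.linspace(-1, 1, 5)
    grid = torch.linspace(-1, 1, 6)                 # 6 knots -> 6 basis functions
    coeffs = torch.randn(6, requires_grad=True)     # the learnable part
    y = kan_edge(x, grid, coeffs)                   # gradients flow to `coeffs` as usual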
https://kindxiaoming.github.io/pykan/intro.html
At the end of this example, they recover the symbolic formula that generated their training set: exp(x₂² + sin(3.14x₁)).
It's like a computation graph with a library of "activation functions" that is optimised, and then pruned. You can recover good symbolic formulas from the pruned graph.
Maybe not meaningful for MNIST.
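A rough sketch of that recovery step, following the pykan docs linked above (method names as they appear in those docs; details may differ in newer versions):

    # After training: prune the graph, snap each learned spline to a symbolic form
    # from a small library, then read off the closed-form expression.
    model = model.prune()
    model.auto_symbolic(lib=['x', 'x^2', 'exp', 'sin'])
    formula = model.symbolic_formula()[0][0]
    print(formula)    # something like exp(x_2**2 + sin(3.14*x_1)) for the docs' example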
I wonder if Breiman’s ACE (alternating conditional expectation) is useful as a building block here.
It will easily recover this formula, because it is separable under the log transformation (which ACE recovers as well).
But ACE doesn’t work well on non-separable problems - not sure how well KAN will.
Perhaps a hasty comment but linear combinations of B-splines are yet another (higher-degree) B-spline. Isn't this simply fitting high degree B-splines to functions?
That would be true for a single node / single layer. But once the output of one layer is fed into the input of the next, it is no longer just a linear combination of splines: the layers compose, and a composition of splines is not itself a spline of the same degree.
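A quick toy check of that point (plain polynomials standing in for spline pieces): composition multiplies degree, so stacking layers is not equivalent to fitting one higher-degree spline.

    import numpy as np

    p = np.polynomial.Polynomial([0.0, 1.0, 0.0, 0.5])   # an inner cubic piece
    q = np.polynomial.Polynomial([1.0, -2.0, 0.0, 0.3])  # an outer cubic piece
    print(q(p).degree())   # 9: degrees multiply under composition, unlike under linear combination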
The success we're seeing with neural networks is tightly coupled with the ability to scale - the algorithm itself works at scale (more layers), but it also scales well with hardware (neural nets mostly consist of matrix multiplications, and GPUs have specialised matrix-multiplication acceleration). One of the most impactful neural network papers, AlexNet, was impactful because it showed that NNs could be put on the GPU, scaled and accelerated, to great effect.
It's not clear from the paper how well this algorithm will scale, both in terms of the algorithm itself (does it still train well with more layers?), and ability to make use of hardware acceleration, (e.g. it's not clear to me that the structure, with its per-weight activation functions, can make use of fast matmul acceleration).
It's an interesting idea, that seems to work well and have nice properties on a smaller scale; but whether it's a good architecture for imagenet, LLMs, etc. is not clear at this stage.
with its per-weight activation functions
Sounds like something which could be approximated by a DCT (discrete cosine transform). JPEG compression does this, and there are hardware accelerations for it.
can make use of fast matmul acceleration
Maybe not, but matmul acceleration was done in hardware because it's useful for some problems (graphics initially).
So if these per weight activations functions really work, people will be quick to figure out how to run them in hardware.
Eli5: why aren't these more popular and broadly used?
Because they have just been invented!
Interesting!
Would this approach (with non-linear learning) still be able to utilize GPUs to speed up training?
Seconded. I’m guessing you could create an implementation that is able to do that and then write optimised Triton/CUDA kernels to accelerate them, but I'd need to investigate further.
I quickly skimmed the paper, got inspired to simplify it, and created a PyTorch layer:
https://github.com/GistNoesis/FourierKAN/
The core is really just a few lines.
In the paper they use spline interpolation to represent the 1-D functions that they sum. Their code seemed aimed at smaller sizes. Instead I chose a different representation: Fourier coefficients, used to interpolate the functions of the individual coordinates.
It should give an idea of the representation power of Kolmogorov-Arnold networks; it should probably converge more easily than their spline version, but the spline version has fewer operations (a minimal sketch of the idea is included below).
Of course, if my code doesn't work, it doesn't mean theirs doesn't.
Feel free to experiment, and publish a paper if you want.
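For intuition, here is a minimal illustrative sketch of the idea described above (names and shapes are assumptions, not the linked repo's actual code): each output coordinate is a sum over input coordinates of learned 1-D functions, each represented by a truncated Fourier series.

    import math
    import torch
    import torch.nn as nn

    class TinyFourierKANLayer(nn.Module):
        """Sketch: phi_{o,i}(x_i) = sum_k a_{o,i,k} cos(k x_i) + b_{o,i,k} sin(k x_i); y_o = sum_i phi_{o,i}(x_i)."""
        def __init__(self, in_dim, out_dim, num_freqs=8):
            super().__init__()
            # one set of cos/sin coefficients per (output, input) pair: [cos/sin, out, in, freq]
            self.coeffs = nn.Parameter(
                torch.randn(2, out_dim, in_dim, num_freqs) / math.sqrt(in_dim * num_freqs))
            self.register_buffer("k", torch.arange(1, num_freqs + 1).float())

        def forward(self, x):                          # x: (batch, in_dim)
            ang = x.unsqueeze(-1) * self.k             # (batch, in_dim, num_freqs)
            y = torch.einsum('bik,oik->bo', torch.cos(ang), self.coeffs[0]) \
              + torch.einsum('bik,oik->bo', torch.sin(ang), self.coeffs[1])
            return y

    layer = TinyFourierKANLayer(2, 3)
    print(layer(torch.randn(4, 2)).shape)              # torch.Size([4, 3])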
you really are a pragmatic programmer, Noesis
Very interesting! Kolmogorov neural networks can represent discontinuous functions [1], but I've wondered how practically applicable they are. This repo seems to show that they have some use after all.
Looks super interesting
I wonder how many more new architectures are going to be found in the next few years
This really reminds me of Petri nets, but an analog version? Instead of places and discrete tokens, we have activation functions and signals. You can only trigger a transition if an activation function (place) has the right signal (tokens).
Looks very interesting, but my guess would be that this would run into the problem of exploding/vanishing gradients at larger depths, just like TanH or sigmoid networks do.
1. Interestingly the foundations of this approach and MLP were invented / discovered around the same time about 66 years ago:
1957: https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Arnold_repr...
1958: https://en.wikipedia.org/wiki/Multilayer_perceptron
2. Another advantage of this approach is that it has only one class of parameters (the coefficients of the local activation functions), as opposed to an MLP, which has three classes (weights, biases, and the globally uniform activation function).
3. Everybody is talking transformers. I want to see diffusion models with this approach.
What is there to be worried about? Technical progress will happen, sometimes in sudden jumps. Some company will become a leader; competitors will catch up after a while.
"Technical progress" has been destroying our habitat for centuries, causing lots of other species to go extinct. Pretty much the entire planet surface has been 'technically progressed', spreading plastics, climate change and whatnot over the entirety of it.
Are you assuming that this particular "progress" would be relatively innocent?
On the other hand, the same "technical progress" (if we're putting machine learning, deforestation, and mining in the same bag) gave you medicine, which turns many otherwise deadly diseases into inconveniences, and, in a large portion of the world, allows you to work less than 12 hrs/7 days per week without dying of hunger. A few hundred years ago, unless you were born into the lucky 0.01% of the ruling population, working from dawn to sunset was the norm for a lot more people than now.
I'm not assuming that something 10k x better than GPT-4 will be good or bad; I don't know. I was just curious what exactly to be worried about. I think in the current state, LLMs are already advanced enough for bad uses like article generation for SEO, spam, scams, etc., and I wonder if an order of magnitude better model would allow for something worse.
Where did you learn that history?
What do you mean by "better"?
I had a European peasant in the 1600-1700s in mind when I wrote about the amount of work. During the season, they worked all day; off-season, they had "free time" that went into taking care of the household, inventory, etc., so it's still work. I can't quickly find a reliable source in English to link, so I may be wrong here.
"Better" was referring to what OP wrote in the top comment. I guess 10x faster, 10x longer context, and 100x less prone to hallucinations would make a good "10k x better" than GPT-4.
Sorry, I can't fit that with what you wrote earlier: "12 hrs/7 days per week to not die from hunger".
Those peasants paid taxes, i.e. some of their work was exploited by an army or a priest rather than hunger, and, as you mention, they did not work "12 hrs/7 days per week".
Do you have a better example?
Many species went extinct during Earth's history. Evolution requires quite aggressive competition.
The way the habitat got destroyed by humans is stupid because it might put us in danger. You can call me "speciesist", but I do care more about humans than about any particular other species.
So I think progress should be geared towards human species survival and, if possible, towards preventing the extinction of other species. Some of the current developments are a bit too much on the side of "I don't care about anyone's survival" (which is stupid and inefficient).
If other species die, we follow shortly. This anthropocentric view really ignores how much of our food chain exists because of other animals surviving despite human activities.
Evolution is the result of catastrophes and atrocities. You use the word as if it has positive connotations, which I find weird.
How do you come to the conclusion "stupid" rather than evil? Aren't we very aware of the consequences of how we are currently organising human societies, and have been for a long time?
I would worry if I'd own Nvidia shares.
Actually, that would be fantastic for NVIDIA shares:
1. A new architecture would make all/most of these upcoming Transformer accelerators obsolete => back to GPUs.
2. Higher-performance LLMs on GPUs => we can speed up LLMs with 1T+ parameters. So LLMs become more useful, and more GPUs would be purchased.
1. A new architecture would make all/most of these upcoming Transformer accelerators obsolete => back to GPUs.
There's no guarantee that that is what would happen. The right (or wrong, depending on your POV) algorithmic breakthrough might make GPUs obsolete for AI by making CPUs (or analog computing units, or DSPs, or "other") the preferred platform to run AI.
I think this is unlikely. There has never (in the visible fossil record) been a mutation that suddenly made tigers an order of magnitude stronger and faster, or humans an order of magnitude more intelligent. It's been a long time (if ever?) since chip transistor density made a multiple-order-of-magnitude leap. Any complex optimized system has many limiting factors, and it's unlikely that all of them would leap forward at once. The current generation of LLMs is not as complex or optimized as tigers or humans, but they're far enough along that changing one thing is unlikely to result in a giant leap.
If and when something radically better comes along, say an alternative to back-propagation that is more like the way our brains learn, it will need a lot of scaling and refinement to catch up with the then-current LLM.