
A new type of neural network is more interpretable

thomasahle
28 replies
1d1h

KANs can be modeled as just another activation architecture in normal MLPs, which is of course not surprising, since they are very flexible. I made a chart of different types of architectures here: https://x.com/thomasahle/status/1796902311765434694

Curiously, KANs are not very efficient when implemented with normal matrix multiplications in, say, PyTorch. But with a custom CUDA kernel, or with torch.compile, they can be very fast: https://x.com/thomasahle/status/1798408687981297844

byteknight
24 replies
1d

Side question:

Can people this deep in the field read that visualization with all the formulas and actually grok what's going on? I'm trying to understand just how far behind the average math person I am (obviously very very very far, but quantifiable lol)

Krei-se
14 replies
23h37m

You don't need to be better at math than high-school level. AI is a chain of functions, and you differentiate through that chain to get the gradient of the loss function, which tells you which parameters to change to get a better result (simplified!).
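
To make the "chain of functions + gradient" idea concrete, here's a toy sketch (numbers made up, nothing to do with any real model): fit a single weight w so that w*x matches a target, by repeatedly nudging w against the derivative of the loss.

    w = 0.0
    x, y_true = 2.0, 6.0                     # we want the model to learn w = 3
    for _ in range(50):
        y_pred = w * x                       # forward pass through the "chain"
        loss = (y_pred - y_true) ** 2        # how wrong we are
        grad = 2 * (y_pred - y_true) * x     # derivative of the loss w.r.t. w
        w -= 0.05 * grad                     # nudge the parameter downhill
    print(w)                                 # ends up close to 3.0

Real networks do exactly this, just with billions of parameters and automatic differentiation doing the derivative bookkeeping.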

Now this structure of functions is different in each implementation, but the types of functions are quite similar, even though a large model will combine billions of those nodes and weights. Those visualizations tell you, e.g., that some models connect neurons back to ones earlier in the chain to better remember a state. But the activation is usually just a weight and a threshold.

KAN changes the functions on the edges to more sophisticated ones than just "multiply by 0.x", and uses known physical formulas that you can actually explain to a human, instead of the result coming from 100 different weights which tell you nothing.

The language models we use currently may map how your brain works, but how strongly the neurons are connected, and to which others, does not tell you anything. Instead a computer can chain different functions like you would chain a normal work task, explain each step to you, and combine those learned routines across different tasks.

I am by no means an expert in this field, but I do a lot of category theory, mainly because I wanted a more explainable neural network. So take my POV with a grain of salt, but please don't be discouraged from learning this. If you can program a little and remember some calculus you can definitely grasp these concepts after learning the vocabulary!

godelski
12 replies
21h23m

You don't need to be better at math than high-school level.

I'm very tired of this... it needs to stop as it literally hinders ML progress

1) I know one (ONE) person who took multivariate calculus in high school. They did so by going to the local community college. I know zero people who took linear algebra. I just checked the listing of my old high school. Over a decade later neither multivariate calculus nor linear algebra is offered.

2) There's something I like to tell my students

  You don't need math to train a good model, but you do need to know math to know why your model is wrong.
I'm sure many here recognize the reference[0], but being able to make a model that performs successfully on a test set[1] is not always meaningful. For example, about a year ago I was working at a very big tech firm and increased their model's capacity on customer data by over 200% with a model that performed worse on their "test set". No additional data was used, nor did I make any changes to the architecture. Figure that out without math. (note, I was able to predict poor generalization performance PRIOR to my changes and accurately predict my model's significantly higher generalization performance)

3) Math isn't just writing calculations down. That's part of it -- a big part -- but the concepts are critical. And to truly understand those concepts, you at some point need to do these calculations. Because at the end of the day, math is a language[2].

4) Just because the simplified view is not mathematically intensive does not mean math isn't important nor does it mean there isn't extremely complex mathematics under the hood. You're only explaining the mathematics in a simple way that is only about the updating process. There's a lot more to ML. And this should obviously be true since we consider them "black boxes"[3]. A lack of interpretability is not due to an immutable law, but due to our lack of understanding of a highly complex system. Yes, maybe each action in that system is simple, but if that meant the system as a whole was simple then I welcome you to develop a TOE for physics. Emergence is useful but also a pain in the ass[4].

[0] https://en.wikipedia.org/wiki/All_models_are_wrong

[1] For one, this is more accurately called a validation set. Test sets are held out. No more tuning. You're done. This is self-referential to my point.

[2] If you want to fight me on this, at least demonstrate to me you have taken an abstract algebra course and understand ideals and rings. Even better if axioms and set theory. I accept other positions, but too many argue from the basis of physics without understanding the difference between a physics model and physics itself. Just because math is the language of physics does not mean math (or even physics) is inherently an objective principle (physics is a model).

[3] I hate this term. They are not black, but they are opaque. Which is to say that there is _some_ transparency.

[4] I am using the term "emergence" in the way a physicist would, not what you've seen in an ML paper. Why? Well read point 4 again starting at footnote [3].

programjames
5 replies
18h8m

I know many people who did take multivariate calculus, group/ring theory, and thermodynamics in high school, and I think this should be the norm. I believe I consider "high school" math what most people consider "undergraduate", and everything up to linear algebra goes under "middle school" in my mental model (ages 12-14). So, I'm probably one of those people propagating "ML math is easy, you only need high school knowledge!", but I acknowledge that's still more than most people ever learn.

andrewflnr
4 replies
16h22m

I acknowledge that's still more than most people ever learn.

So you know you're wrong as a matter of plain fact, but you're going to continue to spout your "mental model" as truth anyway?

What are you trying to say here? It doesn't matter much what "should" be high school knowledge unless you're designing curriculum. If no one actually learns it in high school then a phrase like "you only need high school knowledge" means nothing to most people.

programjames
2 replies
13h15m

As I said,

I know many people who did take...

In fact, the vast majority of my friends did, so my mental model is more useful to me than one that apportions a larger cut to the rest of the population. I also find it egregious that thirteen years of schooling doesn't get everyone to this level, so I want to hold the education system accountable by not loosening my standard.

If [almost] no one actually learns it in high school then a phrase like "you only need high school knowledge" means nothing to most people.

I agree that this isn't as good at conveying information (unless the consensus changes), but that's not all I'm trying to do.

godelski
1 replies
12h0m

  > so my mental model 
This is the point though. If you know your mental model is wrong, you should update your model rather than perpetuate the errors. It's okay to be wrong and no one is upset at you for being wrong (at least not me). But if you are knowingly wrong, don't try to justify it; use the signal to help you change your model. I know it isn't easy, but recognize that defending your bad model makes this harder. It is okay to admit fault and you'll often be surprised how this can turn a conversation around. (FWIW, I think a lot of people struggle with this, including me. This comment is even me trying to reinforce this behavior in myself. But I think you will also be receptive because I think your intent and words diverged; I hope I can be part of that feedback signal that so many provided to me.)

  > so I want to hold the education system accountable
So hold the system accountable, not the people in it. I think you intend to blame the system, but I think if you read your message carefully, you'll see a very reasonable interpretation is that you're blaming the person, because you're suggesting this is a level of math that everyone should know.

For a frame of reference, the high school I went to is (currently) in the top 20% of CA and top 10% of the country. I checked their listings and, while there's a 50% participation rate in AP (they also have IB), they do not offer Linear Algebra or anything past Calc I. So I think this should help you update your model to consider what opportunities people have. I think this is especially important because we should distinguish opportunity from potential and skill. I firmly believe metrics hinder the chance of any form of meritocracy, in part due to the fact that opportunity is so disproportionate (more so due to the fact that metrics are models. And you know what they say about all models ;).

If we want to actually make a better society and smarter population, we should not belittle people for lacking opportunities that are out of their control. Instead I think we should recognize this and make sure that we are not the ones denying opportunities. If we talk about education (with exceptions at the extreme ends), I think we can recognize that the difference between a top-tier high school student and one a bit below average is not that huge. Post undergrad it certainly grows, but I don't think it is that large either. So I'm just saying, a bit of compassion goes a long way. Opportunity compounds, so the earlier the better. I'm fond of the phrase "the harder I work, the luckier I get" because your hard work does contribute to your success, but it isn't the only factor[0]. We know "advanced" math, so we know nothing in real life is univariate, right? You work hard so that you may take advantage of opportunities that come your way, but the cards you are dealt are out of your control. And personally, I think we should do our best to ensure that the dominating factor that determines outcomes is what someone can actually control. And more importantly, that we recognize how things compound (also [0]).

I'm not mad or angry with you. But I think you should take a second to reevaluate your model. I'm sure it has utility, but I'm sure you're not always in a setting where it is useful (like now). If you are, at least recognize how extreme your bubble is.

[0] I highly suggest watching, even if you've seen it before. https://www.youtube.com/watch?v=3LopI4YeC4I

programjames
0 replies
9h28m

I think we should do our best to ensure that the dominating factor that determines outcomes is what someone can actually control.

I think this is where I'm coming from as well. When I got to university, I met tons of people who were just connected to the right resources, be they textbooks, summer camps, math tutors, or college counselors. My lucky break was a smart father and learning about AoPS in 4th grade, but I still wish I knew what else was out there.

It'd be great if people didn't need to get lucky to learn this stuff. There is a whole group of people paid to set standards and make people aware of what is out there. The standards filter down from the board of education to the teachers, and the teachers don't actually have much sway in what they teach (re: r/teachers). So, my ultimate goal for imposing my definition of "high school math" on everyone else is to make it common enough that the standards reflect it, rather than continuing the slow weakening of standards that has happened over the past few decades[*].

But... now that I type this all out, it seems pretty far removed, and probably does more harm than good (except in my bubble). It'd be much more effective to send a few emails or get myself elected to one of these seats.

[*]: Note, standards have obviously risen since the early 1900s, but they've actually fallen in the last twenty years.

godelski
0 replies
11h19m

  > "you only need high school knowledge" means nothing to most people.
I think people intend to use it to tell people the barrier is low. But by trivializing the difficulties of calculus (may be easy now, but was it before you learned it?), you place that barrier higher than it was before. The result is the opposite of the intent.

I'll even state it now, as someone who highly advocates for learning math:

  You don't even need calculus to build good models. At most, a rudimentary understanding of algebra, but I'm not sure even that. A little programming skill, which can freely and easily be obtained, is all that is necessary to begin. So if you can read and can motivate yourself, you can build good and useful models. It might just take longer if you don't have these yet.
With that said, be careful not to fall victim to your own success. The barrier to entry may be (very) low, but it is a long way to the top. So don't ignore the fundamentals, and use your excitement and success to motivate yourself through the boring and hard parts. Unfortunately, there's a steep curve to reap the rewards of your math knowledge (in ML; you'll reap rewards even in daily life much sooner!). But it is well worth it. ML is only a magical black box because you have not achieved this yet (this does not mean ML becomes a white box). Not knowing what you don't know makes it hard to progress. But I promise math will help illuminate things (e.g. understanding when and where CNNs vs transformers should be used inside architectures; how many parameters you need in hidden layers; how to make your models robust; why they fail; how to identify where they will fail before it happens; and much more. These are enormously helpful, and more so if you wish to build products and not just research papers or blogs. If models are black boxes due to being compositions of simple, well-understood functions, I think you can probably guess how small and subtle changes can have large effects on performance. You'll at least learn a bit about this concept (chaos) in differential equations).

Onavo
3 replies
20h40m

1) I know one (ONE) person who took multivariate calculus in high school.

Unless you are specifically dealing with intractable Bayesian integral problems, the multivariate calculus involved in NNs is primarily differentiation, not integration. The fun problems like boundary conditions and Stokes/Green that make up the meat of multivariable calculus don't truly apply when you are dealing with differentiation only. In other words you only need the parts of calc 2/3 that can be taught in an afternoon, not the truly difficult parts.

I'm sure many here recognize the reference[0], but being able to make a model that performs successfully on a test set[1] is not always meaningful. (sic) ...[2] If you want to fight me on this, at least demonstrate to me you have taken an abstract algebra course and understand ideals and rings. Even better if axioms and set theory.

Doesn't matter; if it creates value, it is sufficiently correct for all intents and purposes. Pray tell me how discrete math and abstract algebra have anything to do with day-to-day ML research. If you want to appeal to physics, sure, there are plenty of Ising models, energy functions, and belief propagation in ML, but you have lost all credibility bringing up discrete math.

Again, those correlation tests you use to fact-check your model are primarily linear frequentist models. Most statistics practitioners outside of graduate research will just be plugging formulas, not doing research-level proofs.

Just because the simplified view is not mathematically intensive does not mean math isn't important nor does it mean there isn't extremely complex mathematics under the hood. You're only explaining the mathematics in a simple way that is only about the updating process. There's a lot more to ML.

Are you sure? The traditional linear algebra (and similar) models never (or rarely) outperformed neural networks, except perhaps on efficiency, absent hardware acceleration and all other things being equal. A flapping bird wing is beautiful from a bioengineering point of view, but the aerospace industry is powered by dumb, (mostly) static airfoils. Just because something is elegant doesn't mean it solves problems. A scaled-up CNN is about as boring as a NN can get, yet it beats the pants off all those traditional computer vision algorithms that I am sure contain way more "discrete math and abstract algebra".

That being said, more knowledge is always a good thing, but I am not naive enough to believe that ML research can only be advanced by people with "mathematical maturity". It's still in a highly empirical stage where experimentation (regardless of whether it's guided by mathematical intuition) dominates. I have seen plenty of interesting ML results from folks who don't know what ELBOs and KL divergences are.

godelski
2 replies
18h59m

  > intractable Bayesian integral problems
With ML, most of what we are doing is modeling intractable distributions...

  > the multivariate calculus involved in NNs is primarily differentiation
Sure, but I'm not sure what your critique is here. This is confirming my point. Maybe I should have been clearer by adding a line that most people do not take calculus in high school. Where it is offered, these are the advanced courses, and I'd be wary of being so pejorative. I know a large number of great mathematicians, computer scientists, and physicists who did not take calculus in high school. I don't think we need to discourage anyone or needlessly make them feel dumb. I'd rather encourage more people to undertake further math education, and I believe the lessons learned from calculus are highly beneficial in everyday real-world usage, without requiring explicit formula writing (as referenced in my prior post).

Which, as a side note, I've found is an important point and one of the most difficult lessons to learn in order to be an effective math teacher: once you understand something, it often seems obvious, and it is easy to forget how much you struggled to get to that point. If you can remember the struggle, you will be a better teacher. I also encourage teaching, as revisiting material can reveal holes in your knowledge, and often overconfidence (though the problem repeats as you teach a course for a long time). Clearly this is something that Feynman recognized, and it led to his famous studying technique.

  > Doesn't matter, if it creates value
Value is too abstract and I think you should clarify. If you need a mine, digging it with a spoon creates value. But I don't understand your argument here and it appears to me that you also don't agree since you later discuss traditional (presumably GLMs?) statistics models vs ML. This argument seems to suggest that both create value but one creates _more_ value. And in this sense, yes I agree that it is important to consider what has more value. After all, isn't all of this under the broad scope of optimization? ;)

  > Pray tell me how discrete math and abstract algebra have anything to do with day-to-day ML research.
Since we both answered the first part I'll address the second. First, I'm not sure I claimed abstract algebra was necessary; that was a comment about whether you were going to argue with me about "math being a language". So, miscommunication. Second, there's quite a lot of research on equivalent networks, gradient analysis, interpretability, and so on that does require knowledge of fields, groups, rings, sets, and I'll even include measure theory. And, as with how you answered the first part, there's a fair amount of statistics.

  > Most statistics practitioners outside of graduate research will just be plugging formulas
And? I may be misinterpreting, but this argument suggests to me that you believe that this effort was fruitless. But I think you discount that the knowledge gained from this is what enables one to know which tools to use. Again, referencing the prior point about not needing to explicitly write equations. The knowledge gained is still valuable, and I believe mathematics is the best way we have to teach these lessons in a generalizable manner. And personally I'd argue that it is common to use the wrong tools due to lack of nuanced understanding and one's natural tendency to get lazy (we all do it, including me). So even if a novice could use a flow chart for analysis, I hope we both realize how often the errors will appear. And how these types of errors will __devalue__ the task.

I think there is also an issue with how one analyzes value and reward. We're in a complicated enough society -- certainly a field -- that it is frequent for costs to be outsourced to others and to time. It is frequent to gain reward immediately or in the short term but have overall negative rewards in the medium to long term. It is unfortunate that these feedback signals degrade (noise) with time, but that is the reality of the world. I can even give day to day examples if you want (as well as calc), but this is long enough.

  > Are you sure? The traditional linear algebra (and similar) models never (or rarely) outperformed neural networks
I don't know how to address this because I'm not sure where I made this claim. Though I will say that there are plenty of problems where traditional methods do win out, where xgboost is better, and that computational costs are a factor in real world settings. But it is all about context. There's no strictly dominating method. But I just don't think I understand your argument because it feels non-sequitur.

  > A flapping bird wing...  [vs] static airfoils.
I think this example better clarifies your lack of understanding of aerospace engineering than your argument. I'm guessing you're making this conclusion from observation rather than from principles. There is a lot of research that goes into ornithopters, and this is not due to aesthetics. But again, context matters; there is no strictly dominating method.

I think miscommunication is happening on this point due to a difference in usage of "elegance." If we reference MW, I believe you are using it with definition 1c while I'm using it with 1d. As in, it isn't just aesthetics. There's good reason nature went down this path instead of another. It's the same reason that context matters: all optimization problems are solved under constraints. Solution spaces are also quite large, and as we've referenced before, in these large intractable spaces there's usually no global optimum. This is often true even in highly constrained problems.

  > more knowledge is always a good thing
Glad we agree. I hope we all try to continually learn and challenge our own beliefs. I do want to ensure we recognize the parts of our positions that we agree upon and not strictly focus on the differentiation.

  > ML research can only be advanced by people with "mathematical maturity"
No such claim was ever made and I will never make such a claim. Nor will I make such a claim about any field. If you think it has, I'd suggest taking a second to cool off and reread what I wrote with this context in mind. Perhaps we'll be in much more agreement then. (specifically what I tell my students and the meaning of the referenced "all models are wrong but some models are useful".) Misinterpretation has occurred. The fault can be mine, but I'm lacking the words to adequately clarify so I hope this can do so. I'm sorry to outsource the work to you, but I did try to revise and found it lacking. I think this will likely be more efficient. I do think this is miscommunication on both sides and I hope we both can try to minimize this.

Onavo
1 replies
18h46m

With ML, most of what we are doing is modeling intractable distributions...

I am aware, and we specifically don't compute those directly because they are intractable, which makes it mostly unnecessary for low-level ML practitioners to be familiar with their theoretical properties. MCMC exists for a reason, and modern deep learning contains almost zero direct integration. There is lots of sampling but few integrals.

I have seen high schoolers use and implement VAEs without understanding what the reparametrization trick is.
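
For what it's worth, the trick itself is only a couple of lines; a rough sketch (mu and log_var are stand-ins for hypothetical encoder outputs):

    import torch

    mu, log_var = torch.zeros(8), torch.zeros(8)   # stand-ins for encoder outputs
    sigma = torch.exp(0.5 * log_var)
    z = mu + sigma * torch.randn_like(sigma)       # reparametrization: the sample is now
                                                   # differentiable w.r.t. mu and sigma

Understanding why that keeps gradients flowing is a separate matter, which is rather the point.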

Value is too abstract and I think you should clarify

The value of LLMs and similar deep learning classifiers/generators is self evident. If your research is only good for publishing papers, you should stay in academia. You are in no position to judge or gatekeep ML research.

I think this example better clarifies your lack of understanding of aerospace engineering than your argument.

I am a pilot, software engineer, and a machine learning practitioner with plenty of interdisciplinary training in other scientific fields. I assure you I am more than familiar with the basics of fluid dynamics and flight principles. Granny knows how to suck eggs, no need for the lecture.

First, I'm not sure I claimed abstract algebra was necessary; that was a comment about whether you were going to argue with me about "math being a language"

You claimed that people needed to know rings, groups and set theory to debate you on understanding ML. ̶I̶ ̶t̶h̶i̶n̶k̶ ̶y̶o̶u̶ ̶a̶r̶e̶ ̶t̶h̶e̶ ̶o̶n̶e̶ ̶w̶h̶o̶ ̶n̶e̶e̶d̶s̶ ̶t̶o̶ ̶g̶o̶ ̶b̶a̶c̶k̶ ̶t̶o̶ ̶s̶c̶h̶o̶o̶l̶ ̶a̶n̶d̶ ̶s̶t̶o̶p̶ ̶g̶a̶t̶e̶ ̶k̶e̶e̶p̶i̶n̶g̶.̶ ̶ ̶Y̶o̶u̶ ̶r̶e̶m̶i̶n̶d̶ ̶m̶e̶ ̶o̶f̶ ̶t̶h̶o̶s̶e̶ ̶f̶u̶n̶c̶t̶i̶o̶n̶a̶l̶ ̶p̶r̶o̶g̶r̶a̶m̶m̶e̶r̶s̶ ̶w̶h̶o̶ ̶w̶o̶u̶l̶d̶ ̶r̶e̶w̶r̶i̶t̶e̶ ̶n̶e̶u̶r̶a̶l̶ ̶n̶e̶t̶w̶o̶r̶k̶ ̶l̶i̶b̶r̶a̶r̶i̶e̶s̶ ̶i̶n̶ ̶H̶a̶s̶k̶e̶l̶l̶ ̶b̶e̶l̶i̶e̶v̶i̶n̶g̶ ̶c̶a̶t̶e̶g̶o̶r̶y̶ ̶t̶h̶e̶o̶r̶y̶ ̶w̶o̶u̶l̶d̶ ̶u̶n̶l̶o̶c̶k̶ ̶s̶o̶m̶e̶ ̶m̶a̶g̶i̶c̶ ̶i̶n̶s̶i̶g̶h̶t̶ ̶t̶h̶a̶t̶ ̶w̶o̶u̶l̶d̶ ̶l̶e̶a̶d̶ ̶t̶h̶e̶m̶ ̶t̶o̶w̶a̶r̶d̶s̶ ̶A̶G̶I̶.̶

̶I̶t̶ ̶m̶u̶s̶t̶ ̶b̶e̶ ̶n̶i̶c̶e̶ ̶u̶p̶ ̶t̶h̶e̶r̶e̶ ̶i̶n̶ ̶t̶h̶e̶ ̶i̶v̶o̶r̶y̶ ̶t̶o̶w̶e̶r̶ ̶o̶f̶ ̶a̶c̶a̶d̶e̶m̶i̶a̶.̶ I pity your students. Those who teach has a duty to encourage value creation and seeking out knowledge for its own sake, not constantly dangling a carrot in front of the student like leading a donkey. Don't gatekeep.

I don't know how to address this because I'm not sure where I made this claim.

I am referring specifically to: I'm sure many here recognize the reference[0], but being able to make a model that performs successfully on a test set[1] is not always meaningful. For example, about a year ago I was working at a very big tech firm and increased their model's capacity on customer data by over 200% with a model that performed worse on their "test set". No additional data was used, nor did I make any changes to the architecture. Figure that out without math. (note, I was able to predict poor generalization performance PRIOR to my changes and accurately predict my model's significantly higher generalization performance).

̶T̶h̶e̶r̶e̶ ̶a̶r̶e̶ ̶m̶a̶n̶y̶ ̶w̶a̶y̶s̶ ̶t̶o̶ ̶t̶e̶s̶t̶ ̶c̶a̶u̶s̶a̶l̶i̶t̶y̶.̶ ̶T̶h̶e̶ ̶d̶a̶t̶a̶ ̶s̶c̶i̶e̶n̶c̶e̶/̶s̶t̶a̶t̶i̶s̶t̶i̶c̶ ̶w̶a̶y̶s̶ ̶a̶r̶e̶ ̶S̶p̶e̶a̶r̶m̶a̶n̶/̶P̶e̶a̶r̶s̶o̶n̶ ̶r̶a̶n̶k̶s̶ ̶a̶n̶d̶ ̶t̶ ̶t̶e̶s̶t̶s̶.̶ ̶T̶h̶o̶s̶e̶ ̶a̶r̶e̶ ̶g̶e̶n̶e̶r̶a̶l̶l̶y̶ ̶l̶i̶n̶e̶a̶r̶.̶ ̶ ̶h̶t̶t̶p̶s̶:̶/̶/̶l̶i̶n̶d̶e̶l̶o̶e̶v̶.̶g̶i̶t̶h̶u̶b̶.̶i̶o̶/̶t̶e̶s̶t̶s̶-̶a̶s̶-̶l̶i̶n̶e̶a̶r̶/̶ ̶ ̶A̶l̶t̶e̶r̶n̶a̶t̶i̶v̶e̶l̶y̶ ̶t̶h̶e̶r̶e̶ ̶a̶r̶e̶ ̶M̶L̶ ̶m̶e̶t̶h̶o̶d̶s̶ ̶l̶i̶k̶e̶ ̶g̶r̶a̶p̶h̶i̶c̶a̶l̶ ̶m̶o̶d̶e̶l̶s̶ ̶b̶u̶t̶ ̶I̶ ̶d̶o̶n̶'̶t̶ ̶t̶h̶i̶n̶k̶ ̶t̶h̶a̶t̶'̶s̶ ̶w̶h̶a̶t̶ ̶y̶o̶u̶ ̶a̶r̶e̶ ̶r̶e̶f̶e̶r̶r̶i̶n̶g̶ ̶t̶o̶ ̶h̶e̶r̶e̶.̶ ̶F̶o̶r̶ ̶d̶e̶e̶p̶ ̶l̶e̶a̶r̶n̶i̶n̶g̶ ̶s̶p̶e̶c̶i̶f̶i̶c̶a̶l̶l̶y̶ ̶t̶h̶e̶r̶e̶ ̶a̶r̶e̶ ̶t̶r̶i̶c̶k̶s̶ ̶w̶i̶t̶h̶ ̶s̶a̶m̶p̶l̶i̶n̶g̶ ̶t̶h̶a̶t̶ ̶y̶o̶u̶ ̶c̶a̶n̶ ̶u̶s̶e̶ ̶t̶o̶ ̶e̶y̶e̶b̶a̶l̶l̶ ̶t̶h̶i̶n̶g̶s̶,̶ ̶g̶u̶i̶d̶e̶d̶ ̶b̶y̶ ̶i̶n̶t̶u̶i̶t̶i̶o̶n̶.̶ ̶ ̶H̶e̶r̶e̶'̶s̶ ̶a̶ ̶g̶o̶o̶d̶ ̶r̶e̶f̶e̶r̶e̶n̶c̶e̶ ̶o̶f̶ ̶w̶h̶a̶t̶ ̶I̶ ̶m̶e̶a̶n̶:̶ ̶ ̶h̶t̶t̶p̶s̶:̶/̶/̶m̶a̶t̶h̶e̶u̶s̶f̶a̶c̶u̶r̶e̶.̶g̶i̶t̶h̶u̶b̶.̶i̶o̶/̶p̶y̶t̶h̶o̶n̶-̶c̶a̶u̶s̶a̶l̶i̶t̶y̶-̶h̶a̶n̶d̶b̶o̶o̶k̶/̶l̶a̶n̶d̶i̶n̶g̶-̶p̶a̶g̶e̶.̶h̶t̶m̶l̶ ̶h̶t̶t̶p̶s̶:̶/̶/̶a̶r̶x̶i̶v̶.̶o̶r̶g̶/̶a̶b̶s̶/̶2̶3̶0̶5̶.̶1̶8̶7̶9̶3̶ ̶ ̶A̶g̶a̶i̶n̶ ̶m̶o̶r̶e̶ ̶o̶f̶ ̶t̶h̶e̶s̶e̶ ̶a̶r̶e̶ ̶e̶m̶p̶i̶r̶i̶c̶a̶l̶ ̶c̶o̶m̶m̶o̶n̶ ̶s̶e̶n̶s̶e̶.̶ No need for mathematical maturity or any grasp of discrete mathematics.

Maybe I should have been clearer by adding a line that most people do not take calculus in high school. Where it is offered, these are the advanced courses, and I'd be wary of being so pejorative. I know a large number of great mathematicians, computer scientists, and physicists who did not take calculus in high school. I don't think we need to discourage anyone or needlessly make them feel dumb. I'd rather encourage more people to undertake further math education, and I believe the lessons learned from calculus are highly beneficial in everyday real-world usage, without requiring explicit formula writing (as referenced in my prior post).

Okay, fair, you have a point. I forgot not all schools offer AP classes and advanced mathematics.

I believe we both share the view that education is important, but disagree on how much mathematical understanding is truly necessary to apply or advance ML. I suppose we will have to agree to disagree.

godelski
0 replies
14h21m

  > but disagree on how much mathematical understanding is truly necessary to apply or advance ML
We do not disagree on this point. I have been explicitly clear about this and stated it several times. And this is the last instance I will do so.

We do disagree on one thing, but it isn't about math, science, or ML. If you would like to have a real conversation, I would be happy to. But it is required that you respond in good faith and more carefully read what I've written. I expect you to respect my time as much as I've respected yours.

You should be proud of your credentials and the work you've accomplished. I intimately understand the hard work it takes to achieve each one of those things, but I don't want to have a pissing contest or try to diminish yours. You should be proud of them. But if you want to take your anger out on someone, I suggest going elsewhere. HN is not the place for that and I personally will have none of it.

Krei-se
1 replies
12h20m

I don't see any lessons here, just rambling.

godelski
0 replies
11h33m

Then allow me to clarify:

  - Very few high schools in America offer these classes. Even fewer people take them. The lie you tell yourself is in not recognizing your bubble. You might think you're encouraging others, but you're doing the opposite. People who had those opportunities are likely not the ones that feel like ML is beyond their capabilities. 

  - While you can be successful in ML without math, this does not mean you should discourage its pursuit (just as you shouldn't place it as a gate keeping requirement. Even Calc and LA aren't required!). 

  -  Math is about a way of thinking and approaching problems. These skills generalize beyond the ability to solve mathematical functions. 

  - The mathematical knowledge compounds and will make your models better. This may be nonobvious, especially given your suggested background, since you've lived with this knowledge for quite some time. But if you haven't gone into things like statistical theory (more than ISLR), probability, measure theory, optimization, and so on, it is quite difficult to see how these help you, in the same way it's hard to see what's on a shelf above you. It can also be difficult to explain how these help if you lack the language. But if you want to build good products (that work in the real world and not just in a demo), you'll find this knowledge is invaluable. If you don't understand why, let this be a signal of your overconfidence. Models aren't worth shit if they don't generalize (I'm not talking about AGI, I'm talking about generalizing to customer data)[0].
[0] Being an ML researcher, I specifically have a horse in this race. The more half assed scam products (e.g. Rabbit, Devin, etc) that get out there, the more the public turns to believing ML is another Silicon Valley hype scam. Hype is (unfortunately) essential and allows for bootstrapping, but the game is to replace the bubble before it pops. The more you put into that bubble the more money comes, but also the more ground you have to make up, and the less time you have to do so. Success is the bubble popping without anyone noticing, not how loud it pops.

woolion
0 replies
9h0m

If you do a lot of category theory, you most likely have a high "mathematical maturity" (Terry Tao has spoken about this). Even if the math is fairly basic, you need to understand what is important where, which function could be replaced, etc. With mathematical maturity you realize how some details are not really significant, even though they take up a lot of mental space when you don't have it. It's part of the progression.

thomasahle
5 replies
23h42m

The tensor diagrams are not quite standard (yet). That's why I also include more "classical" neural network diagrams next to them.

I've recently been working on a library for doing automatic manipulation and differentiation of tensor diagrams (https://github.com/thomasahle/tensorgrad), and to me they are clearly a cleaner notation.

For a beautiful introduction to tensor networks, see also Jordan Taylor's blog post (https://www.lesswrong.com/posts/BQKKQiBmc63fwjDrj/graphical-...)

thomasahle
0 replies
13h7m

This stuff is super cool! It basically generalizes tensor diagrams to general computational graphs.

However, when thinking about ML architectures, I actually like that classical tensor diagrams make it harder to express non-associative architectures. E.g. RNNs are much harder to write than Transformers.

cshimmin
1 replies
10h5m

I'm familiar with almost all of these architectures, but not the tensor diagram notation. I can't figure out what "B" is? I thought maybe it's a bias vector, but then why does it only appear on the input data, and not on subsequent fc layers?

cgadski
0 replies
7h8m

B is the batch dimension: the number of data vectors going through at once. You can erase the line labeled by B without much loss. (You just get the diagram for the feed-forward of a single vector.)

cgadski
0 replies
10h9m

After learning about tensor diagrams a few months ago, they're my default notation for tensors. I liked your chart and also Jordan Taylor's diagram for multi-head attention.

Some notes for other readers seeing this for the first time:

My favorite property of these diagrams is that they make it easy to re-interpret a multilinear expression as a multilinear function of any of its variables. For example, in standard matrix notation you'd write x^T A x to get a quadratic form with respect to the variable x. I think most people read this either left to right or right to left: take a matrix-vector product, and then take an inner product between vectors. Tensor notation is more like Prolog: the diagram

  x - A - x 
involves these two indices/variables (the lines) "bound" by three tensors/relations (A and two copies of x). That framing makes it easier to think about the expression as a function of A: it's just a "Frobenius inner product" between -A- and the tensor product -x x-. The same thing happens with the inner product between a signal and a convolution of two other signals. In standard notation it might take a little thought to remember how to differentiate <x, y * z> with respect to y (<x, y * z> = <y, x * z'>, where z' is the time-reversal of z), but thinking with a tensor diagram reminds you to focus on the relation x = y + z (a 3-dimensional tensor) constraining the indices x, y and z of your three signals. All of this becomes increasingly critical when you have more indices involved. For example, how can you write the flattened matrix vec(AX + XB) as a matrix-vector product of vec(X) so we can solve the equation AX + XB = C? (Example stolen from your book.)
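
(For anyone who wants the closed form on that last one, nothing specific to tensor diagrams, just the standard column-stacking identity vec(AXB) = (B^T ⊗ A) vec(X):

    \operatorname{vec}(AX + XB) = (I \otimes A + B^\top \otimes I)\,\operatorname{vec}(X)

so AX + XB = C becomes the ordinary linear system (I ⊗ A + B^T ⊗ I) vec(X) = vec(C).)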

I still have to get a hold of all the rules for dealing with non-linearities ("bubbles") though. I'll have to take a look at your tensor cookbook :) I'm also sad that I can't write tensor diagrams easily in my digital notes.

Tensor diagrams are algebraically the same thing as factor graphs in probability theory. (Tensors correspond to factors and indices correspond to variables.) The only difference is that factors in probability theory need to be non-negative. You can define a contraction over indices for tensors taking values in any semiring though. The max-plus semiring gives you maximum log-likelihood problems, and so on.

danielmarkbruce
0 replies
22h39m

Yes. But it's not difficult math in 99% of cases, it's just notation. It may as well be written in Japanese.

canjobear
0 replies
18h40m

Not hard to understand. The visualization is more or less the computation graph that PyTorch builds up. And the einsum code is even clearer.

There’s definitely a practice effect though. I know people who aren’t used to it will have their eyes glaze over when they read einsum notation.

Mc91
0 replies
23h58m

I'm not deep in the field at all, I did about four hours of Andrew Ng's deep learning course, and have played around a little bit with Pytorch and Python (although more to install LLMs and Stable Diffusion than to do Pytorch directly, although I did that a little too). I also did a little more reading and playing with it all, but not that much.

Do I understand the Python? Somewhat. I know a relu is a rectified linear unit, which is a type of activation function. I have seen einsum before but forget what it is.

For the classical diagram I know what the nodes, edges and weights are. I have some idea what the formulas do, but not totally.

I'm unfamiliar with tensor diagrams.

So I have very little knowledge of this field, and I have a decent grasp of some of what it means, a vague grasp on other parts, and tensor diagrams I have little to no familiarity with.

kherud
2 replies
1d

Interesting, thanks for sharing! Do you have an explanation or idea why compilation slows some architectures down?

thomasahle
1 replies
23h50m

Consider the function:

    relu(np.outer(x, y)) @ z.
This takes n^2 time and memory in the naive implementation. But clearly, the memory could be reduced to O(n) with the right "fusing" of the operations.

KANs are similar. This is the forward code for KANs:

   x = einsum("bi,oik->boik", x, w1) + b1
   x = einsum("boik,oik->bo", relu(x), w2) + b2
This is the forward code for an Expansion / Inverse Bottleneck MLP:

   x = einsum("bi,iok->bok", x, w1) + b1
   x = einsum("bok,okp->bp", relu(x), w2) + b2
Both take nd^2 time, but Inverse Bottleneck only takes nd memory. For KANs to match the memory usage, the two einsums must be fused.

It's actually quite similar to flash-attention.
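
To illustrate the fusing, here's a rough PyTorch sketch (not the reference implementation; shapes assumed as x: (B, I), w1/b1/w2: (O, I, K), b2: (O)). The naive version materializes the full (B, O, I, K) intermediate; chunking over the output dimension keeps only a (B, chunk, I, K) slice alive, which is the kind of thing a fused kernel or torch.compile aims to do automatically:

    import torch

    def kan_forward_naive(x, w1, b1, w2, b2):
        # materializes the full (B, O, I, K) intermediate
        h = torch.relu(torch.einsum("bi,oik->boik", x, w1) + b1)
        return torch.einsum("boik,oik->bo", h, w2) + b2

    def kan_forward_chunked(x, w1, b1, w2, b2, chunk=8):
        # same math, but only a (B, chunk, I, K) slice is ever in memory
        outs = []
        for o in range(0, w1.shape[0], chunk):
            h = torch.relu(torch.einsum("bi,oik->boik", x, w1[o:o+chunk]) + b1[o:o+chunk])
            outs.append(torch.einsum("boik,oik->bo", h, w2[o:o+chunk]) + b2[o:o+chunk])
        return torch.cat(outs, dim=1)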

godelski
0 replies
21h19m

Which is to say, a big part is lack of optimization.

Personally, I think this is fine in context: it is a new formulation, and the optimization is difficult and non-obvious. It shouldn't be expected that every researcher can recognize and solve all optimization problems.

smusamashah
14 replies
1d1h

One downside of KANs is that they take longer per parameter to train—in part because they can’t take advantage of GPUs. But they need fewer parameters. Liu notes that even if KANs don’t replace giant CNNs and transformers for processing images and language, training time won’t be an issue at the smaller scale of many physics problems.

They don't even say that it might be possible to take advantage of GPUs in the future. Reads like a fundamental problem with these.

scotty79
10 replies
1d1h

I wonder what's the issue ... GPUs can do very complex stuff

hansvm
5 replies
1d1h

A usual problem is that GPUs don't branch on instructions efficiently. The next most likely problem is that they don't branch on data efficiently. Ideas fundamentally requiring the former or the latter are hard to port efficiently.

A simple example of something hard to port to a GPU is a deep (24-level) binary tree with large leaf sizes (4kb). Particular trees can be optimized further, particular operations on trees might have further optimizations, and trees with nicer dimensionality might have tricks available, but solving that problem in the abstract is 32x slower on a GPU than "good" GPU problems. That's not a death knell, but it substantially cuts down the set of constraints under which a GPU would be a better fit than a CPU.

Instruction branching is much worse, when required. Runtime is exponential.

As far as KANs are concerned, the problem is more with data branching. Each spline computation requires its own set of data and is only used once. The math being done on the aggregate computations is non-negligible, but fast relative to the memory loads. You quickly enter a regime where (1) you're bottlenecked on RAM bandwidth, and (2) for a given RAM load you can't efficiently use the warp allocated to it.
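
To make the "own set of data" point concrete, here's a rough numpy sketch of a KAN-style layer using degree-1 (hat) B-splines on a shared uniform grid (the shapes and the shared grid are simplifying assumptions on my part): every (output, input) edge carries its own K coefficients, so unlike a dense layer's single GEMM there's a (B, O, I, K)-sized pile of memory loads with relatively little arithmetic per byte.

    import numpy as np

    def kan_layer_splines(x, coeffs, grid):
        # x: (B, I) inputs; coeffs: (O, I, K) per-edge spline coefficients;
        # grid: (K,) shared, uniformly spaced knots.
        h = grid[1] - grid[0]
        # hat-function basis: each edge evaluates its own learned 1-D function of x[:, i]
        basis = np.maximum(0.0, 1.0 - np.abs(x[:, None, :, None] - grid) / h)  # (B, 1, I, K)
        return (basis * coeffs).sum(axis=(2, 3))  # (B, O): summed over inputs and knots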

You can tweak the parameters a bit to alleviate that problem (smaller splines allow you to load and parallelize a few at once, larger ones allow you to do more work at once), but it's a big engineering challenge to fully utilize a GPU for that architecture. Your best bets are (1) observing something clever that allows you to represent the same result with different computations, and (2), a related idea, constructing a different KAN-inspired algorithm with similar expressivity that is more amenable to acceleration. My gut says (2) is more likely, but we'll see.

More succinctly: The algorithm as written is not a good fit for the GPU primitives we have. It might be possible to bridge that gap, but that isn't guaranteed.

scotty79
2 replies
23h13m

What if instead of splines there were Fourier series or something like that? Would that be easier to infer and learn on GPU if it was somehow teachable?

EDIT: FourierKAN exists https://arxiv.org/html/2406.01034v1

jbay808
0 replies
18h25m

I'd expect Chebyshev polynomials to be much faster and easier to work with than splines, certainly, and probably faster than Fourier series as well. (Especially if there aren't trig instructions in hardware, because then each sine or cosine is itself a Chebyshev polynomial to evaluate.)
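
For what it's worth, evaluating a Chebyshev "edge function" is just a short run of fused multiply-adds with no data-dependent gathers; a toy sketch with made-up coefficients (assuming inputs already scaled into [-1, 1]):

    import numpy as np

    coeffs = np.array([0.1, 1.2, -0.3, 0.05])       # hypothetical learned coefficients
    x = np.linspace(-1.0, 1.0, 8)                    # inputs scaled to the domain [-1, 1]
    y = np.polynomial.chebyshev.chebval(x, coeffs)   # Clenshaw recurrence: no branching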

hansvm
0 replies
20h51m

Fourier subcomponents are definitely teachable in general. I'd expect FourierKAN to have a similar runtime to a normal KAN, only really benefitting from a GPU on datasets where you get better predictive performance than a normal KAN.

earthnail
1 replies
1d1h

What about cards with higher memory bandwidth, like Groq’s LPUs? Would that help with data branching?

hansvm
0 replies
20h57m

Data branching in general, no (pulling from 32 places is still 32x as expensive in that architecture, but you might be able to load bigger chunks in each place). For a KAN, a bit (it shifts the constants involved when I was talking about smaller vs bigger splines above -- sparsity and dropout will tend to make the GPU tend toward that worst-case though). You still have the problem that you're heavily underutilizing the GPU's compute.

MattPalmer1086
1 replies
1d1h

I suspect it is because they have different activation functions on each edge, rather than using the same one over lots of data.

XMPPwocky
0 replies
1d

Are the activation functions truly different, or just different parameter values to one underlying function?

raidicy
0 replies
1d1h

From my limited understanding: No one has written GPU code for it yet.

UncleOxidant
0 replies
1d1h

There's not a lot of details there, but GPUs tend to not like code with a lot of branching. I'm guessing that's probably the issue.

nickpsecurity
0 replies
1d1h

I’ve seen neural nets combined with decision trees. There’s a few ways to do such hybrids. One style essentially uses the accurate, GPU-trained networks to push the interpretable networks to higher accuracy.

Do any of you think that can be done cost-effectively with KAN’s? Especially using pre-trained, language models like LlaMa-3 to train the interpretable models?

endymi0n
0 replies
5h28m

This looks interesting for sure:

"ReLU-KAN: New Kolmogorov-Arnold Networks that Only Need Matrix Addition, Dot Multiplication, and ReLU" https://arxiv.org/abs/2406.02075#

Ameo
5 replies
22h33m

I've tried out and written about[1] KANs on some small-scale modeling, comparing them to vanilla neural networks, as previously discussed here: https://news.ycombinator.com/item?id=40855028.

My main finding was that KANs are very tricky to train compared to NNs. It's usually possible to get per-parameter loss roughly on par with NNs, but it requires a lot of hyperparameter tuning and extra tricks in the KAN architecture. In comparison, vanilla NNs were much easier to train and worked well under a much broader set of conditions.

Some people commented that we've invested an incredible amount of effort into getting really good at training NNs efficiently, and many of the things in ML libraries (optimizers like Adam, for example) are designed and optimized specifically for NNs. For that reason, it's not really a good apples-to-apples comparison.

I think there's definitely potential in KANs, but they aren't a magic bullet. I'm also a bit dubious about interpretability claims; the splines that are usually used for KANs don't really offer much more insight to me than just analyzing the output of a neuron in a lower layer of a NN.

[1] https://cprimozic.net/blog/trying-out-kans/

Lerc
1 replies
15h46m

This is sort of my view as well: most of the hype and the criticisms of KANs seem to be fairly unfounded.

I do think they have a lot of potential, but what has been published so far does not represent a panacea. Perhaps they will have an impact like transformers, perhaps they will only serve in a little niche. You can't really tell immediately how refinements will alter the usability.

Finding out what those refinements are and how they change things is what research is all about. I have been quite enjoying following the progress at https://github.com/mintisan/awesome-kan and seeing the variety of things being tried. I have a few ideas of my own I might try sometime.

Between KANs and fixed activation function networks there is an entire continuum of activation function tuning available for research.

Buckets of simple-parameter activation functions, something like x*sigmoid(mx) (ReLU when m is large, GeLU at m=1.7, SiLU at m=1). This adds a small number of parameters for presumably some gain.

Single activation functions as above per neuron.

Multi parameterizable activation functions, in batches, or per neuron.

Many parameter function approximators, in batches, or per neuron.

Full KANs without weights.

I can see some significant acclaim being awarded to the person who can calculate a unified formula for determining where additional parameters should go for the largest impact.

sigmoid10
0 replies
4h58m

My big issue with KANs is that MLPs can trivially be made functionally identical to them up to an arbitrarily small error. Just take a group of neurons/layers and thanks to the UAT you can get them to model any reasonably well behaved activation function. Now redefine that group as a KAN node and you have something that works exactly the same way. In that sense it is actually strange that KANs with the same number of parameters don't outperform MLPs. This could be seen as a hint that activation functions are not really what matters in the end. This is also something that the biology of real neural networks seems to suggest from experiments with rats. Although there is far too little conclusive research in that area. I'm about 50:50 on whether this question will be solved by biologists or computer scientists.
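
A minimal sketch of that argument (my own construction, purely illustrative): a one-hidden-layer ReLU net with one unit per knot interval reproduces any piecewise-linear 1-D "edge function" exactly, which is all a linear spline is; smoother functions just need more units in the usual UAT limit.

    import numpy as np

    def relu_net_from_knots(f, knots):
        # Build y0 + sum_k c_k * relu(x - t_k): a tiny one-hidden-layer ReLU net
        # that matches f exactly at the knots and is piecewise-linear in between.
        y = f(knots)
        slopes = np.diff(y) / np.diff(knots)
        c = np.concatenate(([slopes[0]], np.diff(slopes)))  # slope changes at the knots
        return lambda x: y[0] + np.maximum(0.0, x[:, None] - knots[:-1]) @ c

    # e.g. emulate a sin-shaped "KAN edge" with 32 ReLU units
    g = relu_net_from_knots(np.sin, np.linspace(-3, 3, 33))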

wanderingmind
0 replies
13h50m

Really detailed work. Thank you. For those looking to jump straight to code, here is the link to the codebase discussed in the blog.

https://github.com/Ameobea/kan

smus
0 replies
20h32m

Not just the optimizers, but the initialization schemes for neural networks have been explicitly tuned for stable training of neural nets with traditional activation functions. I'm not sure as much work has gone into initialization for KANs.

I 100% agree with the idea that these won't be any more interpretable and I've never understood the argument that they would be. Sure, if the NN was a single neuron I can see it, but as soon as you start composing these things you lose all interpretability imo

alexnewman
0 replies
11h12m

I’m very happy to hear someone else say the quiet part out loud . Everyone claims nn aren’t interpretable, but that’s never been my experience . Quiet the contrary

asdfman123
4 replies
20h59m

Can someone ELIF this for me?

I understand how neural networks try to reduce their loss function to get the best result. But what's actually different about the KANs?

yobbo
0 replies
2h17m

The output of an MLP is a black-box function f(x, y).

The output of a KAN is a nice formula like exp(0.3sin(x) + 4cos(y)). This is what is meant by interpretable.

svachalek
0 replies
18h13m

I'm not an ML person and am just learning from this article, but I understand a little bit about ML and the key thing I get out of it is the footnote in the diagram.

A regular neural network (MLP) has matrices full of floating point numbers that act as weights. A weight is a linear function y=wx, meaning if I plot the input x and output y on Cartesian coordinates, it will generate a straight line. Increasing or decreasing the input also increases or decreases the output by consistent amounts. We won't have points where increasing the input suddenly has more or less effect than the previous increase, or starts sending the output in the other direction. So we train the network by having it learn multiple layers of these weights, connecting them with some magic glue functions that are part of the design, not something that is trained up. The end result is that the output can have a complex relationship with the input by being passed through all these layers.

In contrast, in a KAN rather than weights (acting as linear functions) we let the network learn other kinds of functions. These are nonlinear so it's possible that as we increase the input, the output keeps rising in an accelerating fashion, or turns around and starts decreasing. We can learn much more complex relationships between input and output, but lose some of the computational efficiency of the MLP approach (huge matrix operations are what GPUs are built for, while you need a CPU to do arbitrary math).

So with the KAN we end up with few but more complex "neurons", made up of complex functions. And if I understand what they're getting at here, the appeal of this is that you can inspect one of those neurons and get a clear formula that describes what it is doing, because all the complexity is distilled into a formula in the neuron. While with an MLP you have to track what is happening through multiple layers of weights and do more work to figure out how it all works.

Again I'm not in the space, but I imagine the functions that come out of a KAN still aren't super intuitive formulas that look like something out of Isaac Newton's notebooks; they're probably full of bizarre constants and unintuitive factors that cancel each other out.

Lerc
0 replies
15h9m

I'm not sure if this counts as ELIF but it's a gross simplification

perceptron layer is

output = simple_function( sum(many_inputs*many_weights) + extra_weight_for_bias)

a KAN layer is

output = sum(fancy_functions(many_inputs))

but I could be wrong, it's been a day.
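
If it helps, the same gross simplification in runnable form (a toy sketch of my own, not any particular library's API):

    import numpy as np

    def perceptron_layer(x, W, b):
        # weighted sum per output, then one fixed simple_function (ReLU here)
        return np.maximum(0.0, W @ x + b)

    def kan_layer_toy(x, edge_fns):
        # edge_fns[o][i] is a learned 1-D function for the edge from input i to output o;
        # each output is just the sum of its edges' outputs
        return np.array([sum(f(xi) for f, xi in zip(fns, x)) for fns in edge_fns])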

Grimblewald
0 replies
13h13m

A KAN is, in a way, like a network of networks, each edge representing its own little network of sorts. I could be very wrong, I am still digesting the article myself, but that is my superficial take.

jcims
2 replies
1d

I can find descriptions at one level or another (eg RNN vs CNN) but is there a deeper kingdom/phylum/class type taxonomy of neural network architectures that can help a layman understand how they differ and how they align, ideally with specific references to contemporary ones in use or being researched?

I don't know why I'm interested because I'm not planning to actually do any work in the space, but I always struggle to understand when some new architecture is announced if it's a fundamental shift or if it's an optimization.

jcims
0 replies
1d

Perfect!!! Thank you!

zygy
1 replies
23h55m

Naive question: what's the intuition for how this is different from increasing the number of learnable parameters on a regular MLP?

slashdave
0 replies
20h38m

Orthogonality ensures that each weight has its own, individual importance. In a regular MLP, the weights are naturally correlated.

xg15
1 replies
21h51m

Then they could summarize the entire KAN in an intuitive one-line function (including all the component activation functions), in some cases perfectly reconstructing the physics function that created the dataset.

The idea of KANs sounds really exciting, but just to nitpick, you could also write any traditional NN as a closed-form "one line" expression - the line will just become very very long. I don't see how the expression itself would become less complex if you used splines instead of weights (even if this resulted in fewer neurons for the same decision boundary).
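
(For concreteness, a two-hidden-layer MLP is already the "one line"

    f(x) = W_3 \,\sigma(W_2 \,\sigma(W_1 x + b_1) + b_2) + b_3

and expanding the matrices entrywise is exactly what makes that line very very long.)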

rsfern
0 replies
13h4m

In the original KAN paper, they do two things to address this: first they have some sparsity-inducing regularization, and second they have a symbolification step so that you can ideally find a compact symbolic model after learning a sparse computation graph of splines.

I guess in principle you could do something similar with MLPs but since MLP representations are sort of delocalized they might be harder to sparsify and symbolify

BenoitP
1 replies
1d1h

I wonder if a set of learned functions (can|does) reproduce the truth tables from First Order Logic.

I think it'd be easy to check.

----

Anyways that's great news for differentiability. For now, 'if' conditions expressed in JAX are tricky (at least for me), and are de facto an optimization barrier. If they're learnable and already built into the network, I'd say that's a great thing.

zeknife
0 replies
22h50m

It is easy to construct an MLP that implements any basic logic function. But XOR requires at least one hidden layer.
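
For example, the classic construction (step() being a hard threshold; a sketch, not the only way to do it):

    def step(v):
        return 1.0 if v > 0 else 0.0

    def xor_mlp(x1, x2):
        h1 = step(x1 + x2 - 0.5)    # hidden unit 1: fires if at least one input is on (OR)
        h2 = step(x1 + x2 - 1.5)    # hidden unit 2: fires only if both inputs are on (AND)
        return step(h1 - h2 - 0.5)  # output: OR and not AND = XOR

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, xor_mlp(a, b))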

theptip
0 replies
3h6m

One downside of KANs is that they take longer per parameter to train—in part because they can’t take advantage of GPUs.

This seems like a big gap. Anyone know if this is a fundamental architecture mismatch, or just that no one has written the required CUDA kernels yet?

novaRom
0 replies
7h36m

I am a bit skeptical. There were a lot of papers and experiments in the 80s and 90s about different ANN architectures as alternatives to f(x*w+b). The reality today is that all practical SOTA models are still multiply-accumulate-threshold based. It comes down to speed and simplicity.

noduerme
0 replies
10h38m

This sounds a bit like allowing each neuron's function to perform its own symbolic regression? But for predicting physical phenomena you might get better performance per cycle from an A-Life swarm of competing symbolic regression cells than from trying to harness them as a single organism. Why do you need a NN to model what's basically a deterministic result set, and why is that a good test?

martingoodson
0 replies
10h49m

We hosted Ziming Liu at the London Machine Learning Meetup a few weeks ago. He gave a great talk on this fascinating work.

Here's the recording https://youtu.be/FYYZZVV5vlY?si=ReoygVJMgY9oje3p

Bluestein
0 replies
1d1h

(I am wondering if there might not be a perverse incentive not to improve on interpretability for major incumbents ...

... given how, what you can "see" (ie. have visibility into) is something that regulatory stakeholders can ask you to exercise control over, or for oversight or information about ...

... whereas a "black box" they have trained and control - but few understand - can perhaps give you "plausible deniability" of the "we don't know how it works either" type.-