KANs can be modeled as just another activation architecture in normal MLPs, which is of course not surprising, since they are very flexible. I made a chart of different types of architectures here: https://x.com/thomasahle/status/1796902311765434694
Curiously, KANs are not very efficient when implemented with normal matrix multiplications in PyTorch, say. But with a custom CUDA kernel, or using torch.compile, they can be very fast: https://x.com/thomasahle/status/1798408687981297844
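To make that concrete, here's a minimal sketch of why a naive KAN layer maps onto einsums and why it is slow without fusion. This is not the official pykan implementation; the polynomial basis and the shapes are illustrative assumptions:

```python
import numpy as np

def kan_layer(x, coef):
    """Toy KAN-style layer: each edge applies its own learned 1-D function,
    here a small polynomial basis (real KANs use splines).
    x: (batch, d_in); coef: (d_in, d_out, k) per-edge basis coefficients."""
    k = coef.shape[-1]
    # The (batch, d_in, k) intermediate below is what a naive PyTorch
    # implementation materializes in full; a fused kernel avoids it.
    basis = np.stack([x**p for p in range(k)], axis=-1)
    # Contract the basis with per-edge coefficients, summing over inputs.
    return np.einsum('bik,iok->bo', basis, coef)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
coef = rng.normal(size=(3, 5, 2))
y = kan_layer(x, coef)
assert y.shape == (4, 5)
```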
Side question:
Can people this deep in the field read that visualization with all the formulas and actually grok what's going on? I'm trying to understand just how far behind I am from the average math person (obviously very very very far, but quantifiable lol)
You don't need to be better at math than high-school level. AI is a chain of functions, and you differentiate through them to get the gradient of the loss function, which tells you which parameters to change to get a better result (simplified!).
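That simplified loop (chain a function, differentiate, nudge the parameter) fits in a few lines. The names and numbers below are made up for illustration:

```python
# A "network" that is just one weight w, a loss, and a hand-derived gradient.
def loss(w, x, y):
    pred = w * x            # the "chain of functions" (one link here)
    return (pred - y) ** 2  # squared error

def grad(w, x, y):
    # d/dw (w*x - y)^2 = 2*(w*x - y)*x, by the chain rule
    return 2 * (w * x - y) * x

w, x, y = 0.0, 2.0, 6.0     # the "true" weight is 3
for _ in range(100):
    w -= 0.1 * grad(w, x, y)  # step against the gradient
assert abs(w - 3.0) < 1e-6
```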
Now, this structure of functions is different in each implementation, but the types of functions are quite similar, even though a large model will combine billions of those nodes and weights. Those visualizations tell you, e.g., that some models connect neurons back to earlier ones in the chain to better remember a state. But the activation is usually just a weighted sum and a threshold.
KAN changes the functions on the edges to more sophisticated ones than just "multiply by 0.x", and uses known physical formulas that you can actually explain to a human, instead of the result coming from 100s of different weights which tell you nothing.
The language models we use currently may loosely map how your brain works, but how strongly the neurons are connected, and to which others, does not tell you anything. Instead, a computer could chain different functions like you would chain a normal work task, explain each step to you, and combine those learned routines across different tasks.
I am by no means an expert in this field, but I do a lot of category theory, especially because I wanted a more explainable neural network. So take my pov with a grain of salt, but please don't be discouraged from learning this. If you can program a little and remember some calculus, you can definitely grasp these concepts after learning the vocabulary!
I'm very tired of this... it needs to stop as it literally hinders ML progress
1) I know one (ONE) person who took multivariate calculus in high school. They did so by going to the local community college. I know zero people who took linear algebra. I just checked the listing of my old high school. Over a decade later neither multivariate calculus nor linear algebra is offered.
2) There's something I like to tell my students
I'm sure many here recognize the reference[0], but being able to make a model that performs successfully on a test set[1] is not always meaningful. For example, about a year ago I was working at a very big tech firm and increased their model's capacity on customer data by over 200% with a model that performed worse on their "test set". No additional data was used, nor did I make any changes to the architecture. Figure that out without math. (Note: I was able to predict the poor generalization performance PRIOR to my changes, and accurately predicted my model's significantly higher generalization performance.)

3) Math isn't just writing calculations down. That's part of it -- a big part -- but the concepts are critical. And to truly understand those concepts, you at some point need to do these calculations. Because at the end of the day, math is a language[2].
4) Just because the simplified view is not mathematically intensive does not mean math isn't important nor does it mean there isn't extremely complex mathematics under the hood. You're only explaining the mathematics in a simple way that is only about the updating process. There's a lot more to ML. And this should obviously be true since we consider them "black boxes"[3]. A lack of interpretability is not due to an immutable law, but due to our lack of understanding of a highly complex system. Yes, maybe each action in that system is simple, but if that meant the system as a whole was simple then I welcome you to develop a TOE for physics. Emergence is useful but also a pain in the ass[4].
[0] https://en.wikipedia.org/wiki/All_models_are_wrong
[1] For one, this is more accurately called a validation set. Test sets are held out. No more tuning. You're done. This is self-referential to my point.
[2] If you want to fight me on this, at least demonstrate to me that you have taken an abstract algebra course and understand ideals and rings. Even better if you know axioms and set theory. I accept other positions, but too many argue from the basis of physics without understanding the difference between a model of physics and physics itself. Just because math is the language of physics does not mean math (or even physics) is inherently an objective principle (physics is a model).
[3] I hate this term. They are not black, but they are opaque. Which is to say that there is _some_ transparency.
[4] I am using the term "emergence" in the way a physicist would, not what you've seen in an ML paper. Why? Well read point 4 again starting at footnote [3].
I know many people who did take multivariate calculus, group/ring theory, and thermodynamics in high school, and I think this should be the norm. I consider what most people call "undergraduate" math to be "high school" math, and everything up to linear algebra goes under "middle school" in my mental model (ages 12-14). So, I'm probably one of those people propagating "ML math is easy, you only need high school knowledge!", but I acknowledge that's still more than most people ever learn.
So you know you're wrong as a matter of plain fact, but you're going to continue to spout your "mental model" as truth anyway?
What are you trying to say here? It doesn't matter much what "should" be high school knowledge unless you're designing curriculum. If no one actually learns it in high school then a phrase like "you only need high school knowledge" means nothing to most people.
As I said,
In fact, the vast majority of my friends did, so my mental model is more useful to me than one that apportions a larger cut to the rest of the population. I also find it egregious that thirteen years of schooling doesn't get everyone to this level, so I want to hold the education system accountable by not loosening my standard.
I agree that this isn't as good at conveying information (unless the consensus changes), but that's not all I'm trying to do.
For a frame of reference, the high school I went to is (currently) in the top 20% of CA and top 10% of the country. I checked their listings, and while there's a 50% participation rate in AP (they also have IB), they do not offer Linear Algebra or anything past Calc I. So I think this should help you update your model to consider what opportunities people have. I think this is especially important because we should distinguish opportunity from potential and skill. I firmly believe metrics hinder the chance of any form of meritocracy, in part because opportunity is so disproportionate (and more so because metrics are models, and you know what they say about all models ;).
If we want to actually make a better society and smarter population, we should not be diminutive to people for the lack of opportunities that are out of their control. Instead, I think we should recognize this and make sure that we are not the ones denying opportunities. If we talk about education (with exceptions at the extreme ends), I think we can recognize that the difference between a top-tier high school student and one a bit below average is not that huge. Post-undergrad it certainly grows, but I don't think it is that large either. So I'm just saying, a bit of compassion goes a long way. Opportunity compounds, so the earlier the better. I'm fond of the phrase "the harder I work, the luckier I get" because your hard work does contribute to your success, but it isn't the only factor[0]. We know "advanced" math, so we know nothing in real life is univariate, right? You work hard so that you may take advantage of opportunities that come your way, but the cards you are dealt are out of your control. And personally, I think we should do our best to ensure that the dominating factor determining outcomes is what someone can actually control. And more importantly, that we recognize how things compound (also [0]).
I'm not mad or angry with you. But I think you should take a second to reevaluate your model. I'm sure it has utility, but I'm sure you're not always in a setting where it is useful (like now). If you are, at least recognize how extreme your bubble is.
[0] I highly suggest watching, even if you've seen it before. https://www.youtube.com/watch?v=3LopI4YeC4I
I think this is where I'm coming from as well. When I got to university, I met tons of people who were just connected to the right resources, be they textbooks, summer camps, math tutors, or college counselors. My lucky break was a smart father and learning about AoPS in 4th grade, but I still wish I knew what else was out there.
It'd be great if people didn't need to get lucky to learn this stuff. There is a whole group of people paid to set standards and make people aware of what is out there. The standards filter down from the board of education to the teachers, and the teachers don't actually have much sway in what they teach (re: r/teachers). So, my ultimate goal for imposing my definition of "high school math" on everyone else is to make it common enough that the standards reflect that, rather than a slow trend of weakening standards that has happened in the past few decades[*].
But... now that I type this all out, it seems pretty far removed, and probably does more harm than good (except in my bubble). It'd be much more effective to send a few emails or get myself elected to one of these seats.
[*]: Note, standards have obviously risen since the early 1900s, but they've actually fallen in the last twenty years.
I'll even state it now, as someone who highly advocates for learning math:
With that said, be cautious that you don't fall victim to your success. The barrier to entry may be (very) low, but it is a long way to the top. So don't ignore the fundamentals, and use your excitement and success to motivate yourself through the boring and hard parts. Unfortunately, there's a steep curve to reap the rewards of your math knowledge (in ML; you'll reap rewards in daily life much sooner!). But it is well worth it. ML is only a magical black box because you have not achieved this yet (this does not mean ML becomes a white box). Not knowing what you don't know makes it hard to progress. But I promise math will help illuminate things (e.g. understanding when and where CNNs vs transformers should be used inside architectures; how many parameters you need in hidden layers; how to make your models robust; why they fail; how to identify where they will fail before it happens; and much more. These are enormously helpful, and more so if you wish to build products and not just research papers or blogs. If models are black boxes due to the composition of simple and well understood functions, I think you can probably guess how small and subtle changes can have large effects on performance. You'll at least learn a bit about this concept (chaos) in differential equations).

Unless you are specifically dealing with intractable Bayesian integral problems, the multivariate calculus involved in NNs is primarily differentiation, not integration. The fun problems like boundary conditions and Stokes/Green that make up the meat of multivariable calculus don't truly apply when you are dealing with differentiation only. In other words, you only need the parts of calc 2/3 that can be taught in an afternoon, not the truly difficult parts.
Doesn't matter; if it creates value, it is sufficiently correct for all intents and purposes. Pray tell, how does discrete math or abstract algebra have anything to do with day-to-day ML research? If you want to appeal to physics, sure, there are plenty of Ising models, energy functions, and belief propagation in ML, but you lost all credibility bringing up discrete math.
Again those correlation tests you use to fact check your model are primarily linear frequentist models. Most statistics practitioners outside of graduate research will just be plugging formulas, not doing research level proofs.
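The "those tests are linear models" claim can be checked numerically: for instance, the Pearson correlation coefficient is exactly the OLS slope on standardized variables. A small sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)  # noisy linear relationship

# Pearson correlation the usual way
r = np.corrcoef(x, y)[0, 1]

# Same quantity as the slope of a least-squares fit on z-scored data
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()
slope = np.polyfit(zx, zy, 1)[0]

assert abs(r - slope) < 1e-10
```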
Are you sure? The traditional linear algebra (and similar) models never (or rarely) outperformed neural networks, except perhaps on efficiency, absent hardware acceleration and all other things being equal. A flapping bird wing is beautiful from a bioengineering point of view but the aerospace industry is powered by dumb (mostly) static airfoils. Just because something is elegant doesn't mean it solves problems. A scaled up CNN is about as boring a NN can get, yet it beats the pants off all those traditional computer vision algorithms that I am sure contain way more "discrete math and abstract algebra".
That being said, more knowledge is always a good thing, but I am not naive enough to believe that ML research can only be advanced by people with "mathematical maturity". It's still in the highly empirical stage, where experimentation (regardless of whether it's guided by mathematical intuition) dominates. I have seen plenty of interesting ML results from folks who don't know what ELBOs and KL divergences are.
Which, as a side note, I've found is an important point and one of the most difficult lessons to learn to be an effective math teacher: once you understand something, it often seems obvious, and it is easy to forget how much you struggled to get to that point. If you can remember the struggle, you will be a better teacher. I also encourage teaching, as revisiting material can reveal the holes in your knowledge, and often overconfidence (though the problem repeats as you teach the same course for a long time). Clearly this is something Feynman recognized, and it led to his famous studying technique.
Value is too abstract, and I think you should clarify. If you need a mine, digging it with a spoon creates value. I don't fully understand your argument here, and it appears to me that you don't fully hold it either, since you later discuss traditional statistics models (presumably GLMs?) vs ML. This argument seems to suggest that both create value but one creates _more_ value. And in this sense, yes, I agree that it is important to consider what has more value. After all, isn't all of this under the broad scope of optimization? ;)

Since we both answered the first part, I'll address the second. First, I'm not sure I claimed abstract algebra was necessary; that was a comment about what it would take to argue with me about "math being a language". So, miscommunication. Second, there's quite a lot of research on equivalent networks, gradient analysis, interpretability, and so on that does require knowledge of fields, groups, rings, sets, and I'll even include measure theory. Like how you answered the first part, there's a fair amount of statistics.

And? I may be misinterpreting, but this argument suggests to me that you believe this effort was fruitless. But I think you discount that the knowledge gained from this is what enables one to know which tools to use. Again, referencing the prior point about not needing to explicitly write equations: the knowledge gained is still valuable, and I believe mathematics is the best way we have to teach these lessons in a generalizable manner. Personally, I'd argue that it is common to use the wrong tools due to a lack of nuanced understanding and one's natural tendency to get lazy (we all do it, including me). So even if a novice could use a flow chart for analysis, I hope we both realize how often errors will appear, and how these types of errors will __devalue__ the task.

I think there is also an issue with how one analyzes value and reward.
We're in a complicated enough society -- certainly a field -- that it is frequent for costs to be outsourced to others and to time. It is frequent to gain reward immediately or in the short term but have overall negative rewards in the medium to long term. It is unfortunate that these feedback signals degrade (noise) with time, but that is the reality of the world. I can even give day to day examples if you want (as well as calc), but this is long enough.
I don't know how to address this because I'm not sure where I made this claim. Though I will say that there are plenty of problems where traditional methods do win out, where xgboost is better, and that computational costs are a factor in real-world settings. But it is all about context; there's no strictly dominating method. I just don't think I understand your argument, because it feels like a non sequitur.

I think this example better illustrates a gap in your aerospace knowledge than it supports your argument. I'm guessing you're drawing this conclusion from observation rather than from principles. There is a lot of research that goes into ornithopters, and it is not driven by aesthetics. But again, context matters; there is no strictly dominating method.

I think miscommunication is happening on this point due to a difference in usage of "elegance." If we reference MW, I believe you are using it with definition 1c while I'm using it with 1d. As in, it isn't just aesthetics. There's good reason nature went down this path instead of another. It's the same reason context matters: all optimization problems are solved under constraints. Solution spaces are also quite large and, as we've referenced before, in these large intractable spaces there's usually no global optimum. This is often true even in highly constrained problems.
Glad we agree. I hope we all try to continually learn and challenge our own beliefs. I do want to ensure we recognize the parts of our positions that we agree upon and not focus strictly on the differences. No such claim was ever made and I will never make such a claim, nor will I make such a claim about any field. If you think it has been, I'd suggest taking a second to cool off and rereading what I wrote with this context in mind (specifically, what I tell my students and the meaning of the referenced "all models are wrong but some models are useful"). Perhaps we'll be in much more agreement then. Misinterpretation has occurred. The fault can be mine, but I'm lacking the words to adequately clarify, so I hope this can do so. I'm sorry to outsource the work to you, but I did try to revise and found it lacking; I think this will likely be more efficient. I do think this is miscommunication on both sides, and I hope we both can try to minimize it.

I am aware, and we specifically don't directly compute those because they are intractable, which renders the need for low-level ML practitioners to be familiar with their theoretical properties mostly unnecessary. MCMC exists for a reason, and modern deep learning contains almost zero direct integration. There is lots of sampling but few integrals.
I have seen high schoolers use and implement VAEs without understanding what the reparametrization trick is.
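For reference, the reparametrization trick in isolation is small enough to sketch. Instead of sampling z ~ N(mu, sigma^2) directly (which blocks gradients), you sample eps ~ N(0, 1) and compute z = mu + sigma * eps, keeping mu and sigma as differentiable inputs. A minimal numpy sketch (shapes are illustrative):

```python
import numpy as np

def reparam_sample(mu, log_var, rng):
    # eps carries the randomness; mu and log_var stay differentiable.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps  # z = mu + sigma * eps

rng = np.random.default_rng(0)
mu = np.zeros(100_000)
log_var = np.zeros(100_000)  # sigma = 1
z = reparam_sample(mu, log_var, rng)

# Empirically distributed as N(0, 1)
assert abs(z.mean()) < 0.02
assert abs(z.std() - 1) < 0.02
```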
The value of LLMs and similar deep learning classifiers/generators is self evident. If your research is only good for publishing papers, you should stay in academia. You are in no position to judge or gatekeep ML research.
I am a pilot, software engineer, and a machine learning practitioner with plenty of interdisciplinary training in other scientific fields. I assure you I am more than familiar with the basics of fluid dynamics and flight principles. Granny knows how to suck eggs, no need for the lecture.
You claimed that people needed to know rings, groups and set theory to debate you on understanding ML. ̶I̶ ̶t̶h̶i̶n̶k̶ ̶y̶o̶u̶ ̶a̶r̶e̶ ̶t̶h̶e̶ ̶o̶n̶e̶ ̶w̶h̶o̶ ̶n̶e̶e̶d̶s̶ ̶t̶o̶ ̶g̶o̶ ̶b̶a̶c̶k̶ ̶t̶o̶ ̶s̶c̶h̶o̶o̶l̶ ̶a̶n̶d̶ ̶s̶t̶o̶p̶ ̶g̶a̶t̶e̶ ̶k̶e̶e̶p̶i̶n̶g̶.̶ ̶ ̶Y̶o̶u̶ ̶r̶e̶m̶i̶n̶d̶ ̶m̶e̶ ̶o̶f̶ ̶t̶h̶o̶s̶e̶ ̶f̶u̶n̶c̶t̶i̶o̶n̶a̶l̶ ̶p̶r̶o̶g̶r̶a̶m̶m̶e̶r̶s̶ ̶w̶h̶o̶ ̶w̶o̶u̶l̶d̶ ̶r̶e̶w̶r̶i̶t̶e̶ ̶n̶e̶u̶r̶a̶l̶ ̶n̶e̶t̶w̶o̶r̶k̶ ̶l̶i̶b̶r̶a̶r̶i̶e̶s̶ ̶i̶n̶ ̶H̶a̶s̶k̶e̶l̶l̶ ̶b̶e̶l̶i̶e̶v̶i̶n̶g̶ ̶c̶a̶t̶e̶g̶o̶r̶y̶ ̶t̶h̶e̶o̶r̶y̶ ̶w̶o̶u̶l̶d̶ ̶u̶n̶l̶o̶c̶k̶ ̶s̶o̶m̶e̶ ̶m̶a̶g̶i̶c̶ ̶i̶n̶s̶i̶g̶h̶t̶ ̶t̶h̶a̶t̶ ̶w̶o̶u̶l̶d̶ ̶l̶e̶a̶d̶ ̶t̶h̶e̶m̶ ̶t̶o̶w̶a̶r̶d̶s̶ ̶A̶G̶I̶.̶
̶I̶t̶ ̶m̶u̶s̶t̶ ̶b̶e̶ ̶n̶i̶c̶e̶ ̶u̶p̶ ̶t̶h̶e̶r̶e̶ ̶i̶n̶ ̶t̶h̶e̶ ̶i̶v̶o̶r̶y̶ ̶t̶o̶w̶e̶r̶ ̶o̶f̶ ̶a̶c̶a̶d̶e̶m̶i̶a̶.̶ I pity your students. Those who teach have a duty to encourage value creation and seeking out knowledge for its own sake, not to constantly dangle a carrot in front of the student like leading a donkey. Don't gatekeep.
I am referring specifically to: I'm sure many here recognize the reference[0], but being able to make a model that performs successfully on a test set[1] is not always meaningful. For example, about a year ago I was working at a very big tech firm and increased their model's capacity on customer data by over 200% with a model that performed worse on their "test set". No additional data was used, nor did I make any changes to the architecture. Figure that out without math. (note, I was able to predict poor generalization performance PRIOR to my changes and accurately predict my model's significantly higher generalization performance).
̶T̶h̶e̶r̶e̶ ̶a̶r̶e̶ ̶m̶a̶n̶y̶ ̶w̶a̶y̶s̶ ̶t̶o̶ ̶t̶e̶s̶t̶ ̶c̶a̶u̶s̶a̶l̶i̶t̶y̶.̶ ̶T̶h̶e̶ ̶d̶a̶t̶a̶ ̶s̶c̶i̶e̶n̶c̶e̶/̶s̶t̶a̶t̶i̶s̶t̶i̶c̶ ̶w̶a̶y̶s̶ ̶a̶r̶e̶ ̶S̶p̶e̶a̶r̶m̶a̶n̶/̶P̶e̶a̶r̶s̶o̶n̶ ̶r̶a̶n̶k̶s̶ ̶a̶n̶d̶ ̶t̶ ̶t̶e̶s̶t̶s̶.̶ ̶T̶h̶o̶s̶e̶ ̶a̶r̶e̶ ̶g̶e̶n̶e̶r̶a̶l̶l̶y̶ ̶l̶i̶n̶e̶a̶r̶.̶ ̶ ̶h̶t̶t̶p̶s̶:̶/̶/̶l̶i̶n̶d̶e̶l̶o̶e̶v̶.̶g̶i̶t̶h̶u̶b̶.̶i̶o̶/̶t̶e̶s̶t̶s̶-̶a̶s̶-̶l̶i̶n̶e̶a̶r̶/̶ ̶ ̶A̶l̶t̶e̶r̶n̶a̶t̶i̶v̶e̶l̶y̶ ̶t̶h̶e̶r̶e̶ ̶a̶r̶e̶ ̶M̶L̶ ̶m̶e̶t̶h̶o̶d̶s̶ ̶l̶i̶k̶e̶ ̶g̶r̶a̶p̶h̶i̶c̶a̶l̶ ̶m̶o̶d̶e̶l̶s̶ ̶b̶u̶t̶ ̶I̶ ̶d̶o̶n̶'̶t̶ ̶t̶h̶i̶n̶k̶ ̶t̶h̶a̶t̶'̶s̶ ̶w̶h̶a̶t̶ ̶y̶o̶u̶ ̶a̶r̶e̶ ̶r̶e̶f̶e̶r̶r̶i̶n̶g̶ ̶t̶o̶ ̶h̶e̶r̶e̶.̶ ̶F̶o̶r̶ ̶d̶e̶e̶p̶ ̶l̶e̶a̶r̶n̶i̶n̶g̶ ̶s̶p̶e̶c̶i̶f̶i̶c̶a̶l̶l̶y̶ ̶t̶h̶e̶r̶e̶ ̶a̶r̶e̶ ̶t̶r̶i̶c̶k̶s̶ ̶w̶i̶t̶h̶ ̶s̶a̶m̶p̶l̶i̶n̶g̶ ̶t̶h̶a̶t̶ ̶y̶o̶u̶ ̶c̶a̶n̶ ̶u̶s̶e̶ ̶t̶o̶ ̶e̶y̶e̶b̶a̶l̶l̶ ̶t̶h̶i̶n̶g̶s̶,̶ ̶g̶u̶i̶d̶e̶d̶ ̶b̶y̶ ̶i̶n̶t̶u̶i̶t̶i̶o̶n̶.̶ ̶ ̶H̶e̶r̶e̶'̶s̶ ̶a̶ ̶g̶o̶o̶d̶ ̶r̶e̶f̶e̶r̶e̶n̶c̶e̶ ̶o̶f̶ ̶w̶h̶a̶t̶ ̶I̶ ̶m̶e̶a̶n̶:̶ ̶ ̶h̶t̶t̶p̶s̶:̶/̶/̶m̶a̶t̶h̶e̶u̶s̶f̶a̶c̶u̶r̶e̶.̶g̶i̶t̶h̶u̶b̶.̶i̶o̶/̶p̶y̶t̶h̶o̶n̶-̶c̶a̶u̶s̶a̶l̶i̶t̶y̶-̶h̶a̶n̶d̶b̶o̶o̶k̶/̶l̶a̶n̶d̶i̶n̶g̶-̶p̶a̶g̶e̶.̶h̶t̶m̶l̶ ̶h̶t̶t̶p̶s̶:̶/̶/̶a̶r̶x̶i̶v̶.̶o̶r̶g̶/̶a̶b̶s̶/̶2̶3̶0̶5̶.̶1̶8̶7̶9̶3̶ ̶ ̶A̶g̶a̶i̶n̶ ̶m̶o̶r̶e̶ ̶o̶f̶ ̶t̶h̶e̶s̶e̶ ̶a̶r̶e̶ ̶e̶m̶p̶i̶r̶i̶c̶a̶l̶ ̶c̶o̶m̶m̶o̶n̶ ̶s̶e̶n̶s̶e̶.̶ No need for mathematical maturity or any grasp of discrete mathematics.
Okay fair you have a point. I forgot not all schools offer AP classes and advanced mathematics.
I believe we both share the view that education is important, but disagree on how much mathematical understanding is truly necessary to apply or advance ML. I suppose we will have to agree to disagree.
We do disagree on one thing, but it isn't about math, science, or ML. If you would like to have a real conversation, I would be happy to. But it is required that you respond in good faith and more carefully read what I've written. I expect you to respect my time as much as I've respected yours.
You should be proud of your credentials and the work you've accomplished. I intimately understand the hard work it takes to achieve each one of those things, and I don't want to have a pissing contest or try to diminish yours. But if you want to take your anger out on someone, I suggest going elsewhere. HN is not the place for that, and I personally will have none of it.
I don't see any lessons here, just rambling.
Then allow me to clarify:
[0] Being an ML researcher, I specifically have a horse in this race. The more half-assed scam products (e.g. Rabbit, Devin, etc.) that get out there, the more the public turns to believing ML is another Silicon Valley hype scam. Hype is (unfortunately) essential and allows for bootstrapping, but the game is to replace the bubble before it pops. The more you put into that bubble the more money comes, but also the more ground you have to make up, and the less time you have to do so. Success is the bubble popping without anyone noticing, not how loud it pops.

If you do a lot of category theory, you most likely have a high "mathematical maturity" (Terry Tao has spoken about this). Even if the math is fairly basic, you need to understand what is important where, which function could be replaced, etc. With mathematical maturity you realize how some details are not really significant, while they take up a lot of mental space when you don't have it. It's part of the progression.
The tensor diagrams are not quite standard (yet). That's why I also include more "classical" neural network diagrams next to them.
I've recently been working on a library for doing automatic manipulation and differentiation of tensor diagrams (https://github.com/thomasahle/tensorgrad), and to me they are clearly a cleaner notation.
For a beautiful introduction to tensor networks, see also Jordan Taylor's blog post (https://www.lesswrong.com/posts/BQKKQiBmc63fwjDrj/graphical-...)
These remind me of interaction combinators [1], which are being used in the Bend programming language [2]. I think it'd be good for the standard to also be a valid interaction net.
[1]: https://core.ac.uk/download/pdf/81113716.pdf
[2]: https://news.ycombinator.com/item?id=40390287
This stuff is super cool! It basically generalizes tensor diagrams to general computational graphs.
However, when thinking about ML architectures, I actually like that classical tensor diagrams make it harder to express non-associative architectures. E.g. RNNs are much harder to write than Transformers.
I'm familiar with almost all of these architectures, but not the tensor diagram notation. I can't figure out what "B" is? I thought maybe it's a bias vector, but then why does it only appear on the input data, and not on subsequent fc layers?
B is the number of data vectors going in (the batch size). You can erase the line labeled B without much loss; you just get the diagram for the feed-forward pass of a single vector.
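The "erasable B line" can be seen directly in einsum form. The weights and shapes below are made up for illustration:

```python
import numpy as np

W = np.arange(6.0).reshape(2, 3)   # fc layer: 3 inputs -> 2 outputs
x = np.ones(3)                     # a single data vector
X = np.stack([x, 2 * x])           # a batch of two vectors, shape (B, 3)

single = np.einsum('oi,i->o', W, x)     # the diagram without the B line
batched = np.einsum('oi,bi->bo', W, X)  # the same diagram with the B line

# The B index just rides along; each batch row is the single-vector result.
assert np.allclose(batched[0], single)
assert np.allclose(batched[1], 2 * single)
```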
After learning about tensor diagrams a few months ago, they're my default notation for tensors. I liked your chart and also Jordan Taylor's diagram for multi-head attention.
Some notes for other readers seeing this for the first time:
My favorite property of these diagrams is that they make it easy to re-interpret a multilinear expression as a multilinear function of any of its variables. For example, in standard matrix notation you'd write x^T A x to get a quadratic form with respect to the variable x. I think most people read this either left to right or right to left: take a matrix-vector product, and then take an inner product between vectors. Tensor notation is more like Prolog: the diagram
involves these two indices/variables (the lines) "bound" by three tensors/relations (A and two copies of x). That framing makes it easier to think about the expression as a function of A: it's just a "Frobenius inner product" between -A- and the tensor product -x x-. The same thing happens with the inner product between a signal and a convolution of two other signals. In standard notation it might take a little thought to remember how to differentiate <x, y * z> with respect to y (<x, y * z> = <y, x * z'>, where z' is the time-reversal of z), but thinking with a tensor diagram reminds you to focus on the relation x = y + z (a 3-dimensional tensor) constraining the indices x, y and z of your three signals. All of this becomes increasingly critical when you have more indices involved. For example, how can you write the flattened matrix vec(AX + XB) as a matrix-vector product of vec(X), so we can solve the equation AX + XB = C? (Example stolen from your book.)

I still have to get a hold of all the rules for dealing with non-linearities ("bubbles") though. I'll have to take a look at your tensor cookbook :) I'm also sad that I can't write tensor diagrams easily in my digital notes.
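For readers who want the vec(AX + XB) answer spelled out: with column-stacking vec(.), the standard identities give vec(AX) = (I ⊗ A) vec(X) and vec(XB) = (B^T ⊗ I) vec(X), so the Sylvester equation becomes an ordinary linear system. A quick numerical check with random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4
A = rng.normal(size=(m, m))
B = rng.normal(size=(n, n))
X = rng.normal(size=(m, n))

vec = lambda M: M.flatten(order='F')  # column-stacking vec

# vec(A X + X B) = (I_n (x) A + B^T (x) I_m) vec(X)
K = np.kron(np.eye(n), A) + np.kron(B.T, np.eye(m))
assert np.allclose(K @ vec(X), vec(A @ X + X @ B))
```

Solving AX + XB = C then reduces to `np.linalg.solve(K, vec(C))` and reshaping back.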
Tensor diagrams are algebraically the same thing as factor graphs in probability theory. (Tensors correspond to factors and indices correspond to variables.) The only difference is that factors in probability theory need to be non-negative. You can define a contraction over indices for tensors taking values in any semiring though. The max-plus semiring gives you maximum log-likelihood problems, and so on.
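To make the semiring swap concrete: ordinary matrix contraction uses (+, *); substituting the max-plus semiring (max, +) turns the same contraction into a best-path / max-log-likelihood computation. A tiny sketch with made-up probabilities:

```python
import numpy as np

def maxplus_matmul(A, B):
    # C[i, j] = max_k (A[i, k] + B[k, j]) -- matmul over the (max, +) semiring
    return (A[:, :, None] + B[None, :, :]).max(axis=1)

logA = np.log(np.array([[0.9, 0.1], [0.4, 0.6]]))
logB = np.log(np.array([[0.7, 0.3], [0.2, 0.8]]))
C = maxplus_matmul(logA, logB)

# C[i, j] is the log-probability of the single best intermediate state,
# rather than the sum over all of them (which plain matmul would give).
assert np.allclose(np.exp(C[0, 0]), max(0.9 * 0.7, 0.1 * 0.2))
```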
Yes. But it's not difficult math in 99% of cases, it's just notation. It may as well be written in Japanese.
Not hard to understand. The visualization is more or less the computation graph that PyTorch builds up. And the einsum code is even clearer.
There’s definitely a practice effect though. I know people who aren’t used to it will have their eyes glaze over when they read einsum notation.
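For readers seeing einsum for the first time, here is a small example of the style (not the code from the chart; the attention-score shapes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
b, h, t, d = 2, 4, 5, 8                 # batch, heads, tokens, head dim
Q = rng.normal(size=(b, h, t, d))
K = rng.normal(size=(b, h, t, d))

# "queries dot keys over the feature axis, for every batch/head/pair"
scores = np.einsum('bhqd,bhkd->bhqk', Q, K) / np.sqrt(d)

# The same computation with explicit transposes, for comparison:
ref = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(d)
assert np.allclose(scores, ref)
```

The subscript string names every axis, which is exactly the practice effect: once the labels read naturally, the transpose bookkeeping disappears.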
I'm not deep in the field at all; I did about four hours of Andrew Ng's deep learning course, and have played around a little with PyTorch and Python (although more to install LLMs and Stable Diffusion than to use PyTorch directly, though I did that a little too). I also did a little more reading and playing with it all, but not that much.
Do I understand the Python? Somewhat. I know a relu is a rectified linear unit, which is a type of activation function. I have seen einsum before but forget what it is.
For the classical diagram I know what the nodes, edges and weights are. I have some idea what the formulas do, but not totally.
I'm unfamiliar with tensor diagrams.
So I have very little knowledge of this field, and I have a decent grasp of some of what it means, a vague grasp on other parts, and tensor diagrams I have little to no familiarity with.
Interesting, thanks for sharing! Do you have an explanation or idea why compilation slows some architectures down?
Consider the function:
This takes n^2 time and memory in the naive implementation. But clearly, the memory could be reduced to O(n) with the right "fusing" of the operations.

KANs are similar. This is the forward code for KANs:
This is the forward code for an Expansion / Inverse Bottleneck MLP:

Both take nd^2 time, but the Inverse Bottleneck only takes nd memory. For KANs to match the memory usage, the two einsums must be fused. It's actually quite similar to flash-attention.
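Since the referenced forward code isn't reproduced here, a hedged stand-in for the memory point: two chained operations whose large intermediate can be tiled away without changing the result. The MLP and the chunk size below are illustrative, not the code from the tweets:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 64, 16
x = rng.normal(size=(n, d))
W1 = rng.normal(size=(d, 4 * d))
W2 = rng.normal(size=(4 * d, d))

# Naive: the full (n, 4d) intermediate lives in memory at once.
naive = np.maximum(x @ W1, 0) @ W2

# "Fused" (tiled over the batch): only an (8, 4d) tile exists at a time,
# the same idea that flash-attention applies to attention scores.
fused = np.concatenate(
    [np.maximum(x[i:i + 8] @ W1, 0) @ W2 for i in range(0, n, 8)]
)
assert np.allclose(naive, fused)
```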
Which is to say, a big part is lack of optimization.
Personally, I think this is fine in context: it's a new formulation, and the optimization is difficult and non-obvious. It shouldn't be expected that every researcher can recognize and solve every optimization problem.