The whole "mystery" of transformer is that instead of a linear sequence of static weights times values in each layer, you now have 3 different matrices that are obtained from the same input through multiplication of learned weights, and then you just multiply the matrices together. I.e more parallelism which works out nice, but very restrictive since the attention formula is static.
We aren't going to see more progress until we have a way to generalize the compute graph itself as a learnable parameter. I dunno if this is even possible in the traditional sense of gradients, due to chaotic effects (i.e. small changes cause big shifts in performance); it may have to be some form of genetic algorithm or PSO happening under the hood.
That's a bold statement since a ton of progress has been made without learning the compute graph.
We have made progress in efficiency, not functionality. Instead of searching Google or Stack Overflow or any particular documentation, we just go to ChatGPT.
Information compression is cool, but I want actual AI.
Fascinating. What’s “actual AI”?
It’s whatever computers can’t do, dummy! :P
Is Ibn Sina's answer (Avicenna, ca. year 1000) fine?
Or, "Intelligence is the ability to reason, determining concepts".
(And a proper artificial such thing is something that does it well.)
The idea that there has been no progress in functionality is silly.
Your whole brain might just be doing "information compression" by that analogy. An LLM is sort of learning concepts. Even Word2Vec "learned" that king - male + female = queen, and that's a small model that's really just one part (not exact, but similar) of a transformer.
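If anyone wants to poke at that analogy themselves, something like this should work with gensim and its downloadable pretrained vectors (the model name and the printed output are from memory, so treat them as assumptions):

    import gensim.downloader as api

    wv = api.load("glove-wiki-gigaword-100")   # small pretrained embedding set
    print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
    # usually prints something like [('queen', 0.78...)]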
From my naive perspective, there seems to be a plateau that everyone is converging on, somewhere between the ChatGPT 3.5 and 4 level of performance, with some suspecting that the implementation of 4 might involve several expert models, which would already be extra sauce external to the LLM. That, combined with the observation that generative models converge to the same output given the same training data, regardless of architecture (having trouble finding the link, it was posted here some weeks ago), suggests that external secret sauce, outside the model, might be where the near-term gains are.
I suppose we'll see in the next year!
We already have competitors to Transformers:
https://arxiv.org/abs/2312.00752
Where do I enter in my credit card info?
You hire people to implement a product based on this?
A ton of progress can be made climbing a tree, but if your goal is reaching the moon it becomes clear pretty quickly that climbing taller trees will never get you there.
True, but it is the process of climbing trees that gives you the insight into whether taller trees help or not, and if not, what to do next.
With enough thrust, even p̵i̵g̵s̵ trees can fly.
It basically already does this: it can learn to ignore some paths and amplify the more important ones, and then you can just cut those paths without a noticeable loss of quality. The problem is that you aren't going to win anything from it; the resulting non-matrix multiplication would be slower, or the same at best.
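Magnitude pruning is about the simplest version of "cutting those paths" (toy numpy sketch; the point stands that the resulting sparse matrix usually isn't any faster on real hardware):

    import numpy as np

    def prune_by_magnitude(W, keep_ratio=0.5):
        # zero out the smallest-magnitude weights, i.e. cut the least useful paths
        threshold = np.quantile(np.abs(W), 1.0 - keep_ratio)
        return np.where(np.abs(W) >= threshold, W, 0.0)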
The issue is that you are thinking of this in terms of information compression, which is what LLMs are.
I'm more concerned with an LLM having the ability to be trained to the point where a subset of the graph represents all the NAND gates necessary for a CPU and RAM, so that when you ask it questions it can actually run code to compute the answers accurately instead of offering a statistical best guess, i.e. decompression after lossy compression.
Just give it a computer? Even a virtual machine. It can output assembly code or high-level code that gets compiled.
The issue is not having access to a CPU; the issue is whether the model can be trained in such a way that it has representative structures for applicable problem solving. Furthermore, the structures themselves should
Philosophically, you can't just start ad-hoc-ing functionality on top of LLMs and expect major progress. Sure, you can make them better, but you will never get to the state where AI is massively useful.
For example, let's say you gather a whole bunch of experts in their respective fields and give them the task of putting together a detailed plan for how to build a flying car. You will have people doing design, running simulations, researching material sourcing, creating CNC programs for manufacturing parts, sourcing tools and equipment, writing software, etc. And when executing this plan, they would be open to feedback on anything missed and could advise on how to proceed.
An AI with the above capability should be able to go out on the internet, gather the respective data, run any sort of algorithms it needs to run, and, perhaps after a month of number crunching on a rented cloud TPU rack, produce a step-by-step plan, with costs, for how to do all of that. And it would be better than those experts because it should be able to create much higher fidelity simulations to account for things like vibration and predict whether some connector is going to wobble loose.
LLMs seem like the least efficient way to accomplish this. NAND gates, for example, are inherently 1-bit operators, but LLMs use more bits. If the weights are all binary, then gradients are restricted to -1, 0, and 1, which doesn't give you much room to make incremental improvements. You can add extra bits back, but that's pure overhead. But all this is beside the real issue, which is that LLMs, and NNs in general, are inherently fuzzy; they guess. Computers aren't, and we already have perfect simulators.
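For what it's worth, the usual workaround for training binary weights is exactly the "extra bits as overhead" you describe: keep full-precision shadow weights and push gradients straight through the sign function. A rough PyTorch sketch of the idea (illustrative, not a tuned implementation):

    import torch

    class BinaryLinear(torch.nn.Module):
        def __init__(self, d_in, d_out):
            super().__init__()
            # full-precision "shadow" weights, kept so gradient steps can be incremental
            self.w = torch.nn.Parameter(0.01 * torch.randn(d_out, d_in))

        def forward(self, x):
            w_bin = torch.sign(self.w)                 # +/-1 weights in the forward pass
            # straight-through estimator: backward pass acts as if sign() were identity
            w_ste = self.w + (w_bin - self.w).detach()
            return x @ w_ste.t()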
Consider how humans design things. We don't talk through every CPU cycle to convince ourselves a design works; we use bespoke tooling. Not all problems are language-shaped.
Evolution created various neural structures in biological brains (visual cortex, medulla, thalamus, etc) rather ad-hoc, and those resulted in "massively useful" systems. Why should AI be different?
Well, just remember that NAND gates are themselves made of transistors, which are statistical devices of a sort, just designed to appear digital once combined at the NAND level.
This is why I am very interested in analog again: quantum stuff is statistical already, so why go from statistical (analog) to digital (a huge drop-off in performance; just look at basic addition in an ALU) and back to statistical? Not sure if it will ever be worth it, but I can't rule it out.
How can gradient descent work on compute graphs when the space of compute graphs is discrete?
Perhaps in a way similar to this paper: https://arxiv.org/abs/1806.09055
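For context, that paper (DARTS) sidesteps the discreteness by relaxing the choice of operation on each edge into a softmax-weighted mixture, so the architecture parameters become differentiable. A stripped-down sketch of that relaxation (PyTorch, toy ops only):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MixedOp(nn.Module):
        def __init__(self, dim):
            super().__init__()
            # candidate operations for one edge of the graph (toy choices)
            self.ops = nn.ModuleList([nn.Identity(), nn.Linear(dim, dim), nn.Tanh()])
            # architecture parameters, trained by gradient descent like any other weight
            self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

        def forward(self, x):
            w = F.softmax(self.alpha, dim=0)
            # soft choice of op; after the search you keep only argmax(alpha)
            return sum(wi * op(x) for wi, op in zip(w, self.ops))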
I wonder why this hasn't taken off.
It can't. There's no gradient, since it's not a sufficiently nice space for one. You can use gradient-free methods, but I'd be shocked if there were an efficient enough way to do that.
Genetic algorithms figured out GI (general intelligence) the first time around, but it took a while.
Could you please expand?
Evolution built our brains.
Though to be fair, actual biological evolution is more complex than simple genetic algorithms. More like evolution strategies with meta-parameter-learning and adaptive rate tuning among other things.
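A (1+1)-ES with the classic 1/5 success rule is about the simplest example of that kind of adaptive rate tuning (toy numpy sketch, nothing like real biology of course):

    import numpy as np

    def one_plus_one_es(f, x, sigma=0.5, iters=2000):
        fx = f(x)
        for _ in range(iters):
            cand = x + sigma * np.random.randn(*x.shape)   # mutate the current solution
            fc = f(cand)
            success = fc < fx
            if success:
                x, fx = cand, fc                           # keep improvements
            # 1/5 success rule: grow the step size on success, shrink it on failure
            sigma *= 1.5 if success else 1.5 ** -0.25
        return x, fx

    # e.g. minimizing a simple quadratic:
    # best_x, best_f = one_plus_one_es(lambda v: np.sum(v ** 2), np.ones(5))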
Can you explain exactly what you mean by this? I understand what a compute graph is, but I'm not getting the idea of making it a learnable parameter.
Never mind; after looking at my own question 3-4 times, it clicked.
Couldn't find your contact info. Email me?
That's not it at all. What's special about transformers is that they allow each element in a sequence to decide which parts of the data are most important to it from each other element in the sequence, then extract those out and compute on them. The big theoretical advantage over RNNs (which were used for sequences prior to transformers) is that transformers support this in a lossless way, as each element has full access to all the information in every other element in the sequence (or at least all the ones that occurred before it, in time sequences). RNNs and "linear transformers," on the other hand, compress past values, so generally the last element of a long sequence will not have access to all the information in the first element of the sequence (unless the RNN's internal state is really, really big, so it doesn't need to discard any information).
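To make the contrast concrete, a toy numpy sketch (shapes and names purely illustrative): the RNN squeezes the whole history into one fixed-size state, while attention lets the last token read every earlier token directly:

    import numpy as np

    def rnn_last_state(X, Wh, Wx):
        # X: (seq_len, d); the entire past gets compressed into h
        h = np.zeros(Wh.shape[0])
        for x in X:
            h = np.tanh(Wh @ h + Wx @ x)
        return h                                        # fixed size, whatever seq_len is

    def attention_for_last_token(X, Wq, Wk, Wv):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q[-1] @ K.T / np.sqrt(K.shape[-1])     # last token vs every token
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ V                                    # direct access to every position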
Agreed. Seems analogous to how human mental processes are used to solve the kinds of problems we'd like LLMs to solve (going beyond "language processing," which transformers do well, to actual reasoning, which they can only mimic). Although you risk it becoming a Turing machine by giving it flow control, and then training is a problem, as you say. Perhaps not intractable, though.
Hyperparameter tuning already goes some of the way towards learning the compute graph, though it's very constrained and requires a lot more training.