It’s cool that progress is being made on alternative LLM architectures, and I did upvote the link.
However, I found this article somewhat frustrating. Showing the quality of the model is only half of the story, but the article suddenly ends there. If people are going to be motivated to adopt an entirely different architecture, then performance and context size deserve at least as much discussion.
Given the linear scaling being shown, it seems like the primary thing people are going to want to see is the next frontier of LLMs: context sizes of ~1M tokens. The word “context” does not even appear in this article, which is disappointing. If there were a discussion of context, it would be nice to see whether it passes the passkey test.
The article also appears to reuse a chart from RWKV-4 showing how awesome a linear function is compared to a quadratic one, but… cool story? It’s not even clear what this chart is truly showing. Is this chart only showing generated tokens, or is this including prompt tokens? As I have never used RWKV, I have no idea how the prompt processing speed compares to the token generation speed. Prompt processing speed has been a big problem for Mixtral, for example.
As a reader, I want to see a couple of actual examples of X prompt tokens + Y generated tokens, and the tokens/s for X and Y for RWKV-5 and for Mistral on the same hardware. On the Mistral side, it is trivial to collect this information in llama.cpp, but I don’t know how the tooling is for RWKV.
RWKV does not have a context size, or, to look at it another way, it has an infinite one.
As far as I understand it, there is an internal state that holds new information while reading input; later information can overwrite earlier information, which is arguably human-like behaviour.
If later input overwrites previous input in the internal state, it means the model does have a limit to how much input it can "remember" at any given time and that limit is less than infinite.
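To make that concrete, here is a toy sketch (my own illustration, not RWKV's actual update rule; the state size and `decay` factor are made up) of a fixed-size recurrent state: every token gets folded into the same small buffer, so older contributions fade unless the model keeps re-encoding them.

    import numpy as np

    d = 8                    # fixed state size, independent of input length
    decay = 0.9              # toy forgetting factor (hypothetical, not RWKV's parameterization)
    state = np.zeros(d)

    def read_token(state, token_embedding):
        # Fold the new token into the same fixed-size buffer.
        # Older contributions are scaled down by `decay`, so they fade
        # unless they keep getting refreshed.
        return decay * state + token_embedding

    rng = np.random.default_rng(0)
    for t in range(10_000):          # arbitrarily long input...
        state = read_token(state, rng.normal(size=d))

    print(state.shape)               # ...but the state is still (8,): memory is bounded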
You can think of it like your own memory. Can you remember a very important thing from 10 years ago? Can you remember every single thing since then? Some things will remain for a basically infinite period, some will have a more limited scope.
I'm not sure I understand your concept of human memory.
It is pretty well established that very few people are able to remember details of things for any reasonable period of time. The way that we keep those memories is by recalling them and playing the events over again in our mind. This 'refreshes' them, but at the expense of 'corrupting' them. It is almost certain that things important to you, which you are sure you remember correctly, are wrong in many details -- you have at times gotten a bit hazy on some aspect, tried to recall it, 'figured it out', and stored that as your original memory without knowing it.
'Concepts', on the other hand, like doing math or riding a bike, seem different to me: you don't really know how to ride a bike, in the sense that you couldn't explain the muscle movements needed to balance and move on a bicycle, but when you get on one, you go through the process of figuring it out again. So even though you 'never forget how to ride a bike', you never really knew how to do it; you just got good at re-learning it incredibly quickly every time you tried.
Can you correct me on any misconceptions I may have about either how I think memories work, or how my thoughts should coincide with how these models work?
I was going more for an ELI5 answer than making comparisons to specific brain concepts. The main idea was that the RNN keeps a rolling context, so there's no clear cutoff... I suspect if you tried, you could fine-tune this to remember some things better than others: some effectively forever, others would degrade the way you said.
There's a limit to the amount, but not to the duration (in theory). It can hold on to something it considers important for an arbitrary amount of time.
One of three things has to be true. Either:
a) this is false
b) perfect recall is false (i.e. as the internal state is overwritten, you lose information about previous entries in the context)
c) the inference time scales by the context length.
It’s not possible to have perfect recall over an arbitrary length in fixed time.
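A rough counting argument for why (my own numbers, purely illustrative): a fixed state of S bits can distinguish at most 2^S different histories, while the number of possible inputs grows exponentially with length, so past some length two different inputs must collapse to the same state.

    import math

    vocab_size = 65_536          # assumption: ~2^16 distinct tokens
    state_bits = 16 * 4096       # assumption: a 4096-dim fp16 state ~= 65,536 bits

    # A fixed state distinguishes at most 2**state_bits histories, but there are
    # vocab_size**n sequences of length n, so perfect recall breaks down once
    # vocab_size**n > 2**state_bits, i.e. beyond roughly:
    max_len = state_bits / math.log2(vocab_size)
    print(max_len)               # ~4096 tokens for these toy numbers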
Not just hard. Totally not possible at all.
That would mean you can scan an infinite amount of data perfectly in fixed time.
So… hrm… this kind of claim rings some alarm bells when it’s combined with this kind of sweeping announcement.
It seems too good to be true; either it’s not that good, or the laws of the universe no longer hold.
(b) is the sacrifice made in these linear attention type architectures.
As a mitigation, you can leave a few normal attention layers in the model but replace the rest.
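Roughly like this, as a sketch (the block names and layer counts are invented, not any particular model's layout): most layers use the linear/recurrent mixer, and a handful of ordinary softmax-attention layers are kept as an exact-lookup path over the whole context.

    # Hypothetical hybrid stack: mostly linear-attention layers, with a few
    # full-attention layers left in as a "perfect recall" escape hatch.
    def build_layers(n_layers=24, full_attn_every=8):
        layers = []
        for i in range(n_layers):
            if i % full_attn_every == full_attn_every - 1:
                layers.append("FullAttentionBlock")    # quadratic, exact recall
            else:
                layers.append("LinearAttentionBlock")  # linear, lossy state
        return layers

    print(build_layers())   # 21 linear blocks + 3 full-attention blocks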
Perfect recall is often a function of the architecture allowing data to bleed through linkages. You can increase perfect token recall through dilated WaveNet-style structures, or, in the case of v5, through multi-head linear attention, which creates multiple pathways where information can skip forward in time.
Here is a relevant tidbit from the RWKV paper "Limitations" section (https://arxiv.org/abs/2305.13048):
There’s a difference between the computation requirements of long context lengths and the accuracy of the model on long context length tasks.
In principle it has no context size limit, but (last time I checked) in practice there is one for implementation reasons.
These models don't have a fixed context size and are progressively fine-tuned for longer and longer contexts. The context length also doesn't impact inference cost.
Another aspect of performance is not just how well the trained model performs, but whether it is data-efficient (performance per token trained). The comparison with Pythia (an open GPT) is shown in the article.
The RWKV-4 paper is quite detailed and has examples of prompts and responses on the last few pages:
https://arxiv.org/abs/2305.13048
And IIRC, RWKV-5 is very similar to RetNet, which is detailed here:
https://arxiv.org/abs/2307.08621
Edit: now that I've thought about it more, data efficiency seems like a highly important aspect given their noble goal of being fully multilingual. This is fairly interesting theoretically, and also for other applications where an abundance of data is not a given.
For linear transformers, the current metric is "perfect token recall": the ability of the model to recall a randomized sequence of data. You can find the limit of a particular model architecture by training a model of a particular size to echo randomized data; I believe this was touched on in the Zoology paper.
This doesn't prevent the model from retaining sequences or information beyond this metric, as information can easily be compressed in the state, but anything within that window can be perfectly recalled by the model.
Internal testing has placed the value for Eagle around the 2.5k PTR (perfect token recall) mark, while community fine-tunes done on the partial checkpoints for long-distance information gathering and memorization have been shown to easily dwarf that.
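For reference, a minimal version of that echo test (my own sketch of the setup, not the exact protocol from the Zoology paper or RWKV's internal testing) looks roughly like this: feed the model a random token sequence, ask it to repeat it, and find the longest length at which it still gets every token right.

    import numpy as np

    rng = np.random.default_rng(0)

    def recall_accuracy(model_generate, seq_len, vocab_size=1000, trials=20):
        """Fraction of trials where the model echoes a random sequence exactly."""
        exact = 0
        for _ in range(trials):
            seq = rng.integers(0, vocab_size, size=seq_len).tolist()
            # `model_generate` is a stand-in: prompt = the random sequence plus
            # some "now repeat it" marker, output = the model's continuation.
            out = model_generate(seq)
            exact += int(out[:seq_len] == seq)
        return exact / trials

    # Sweep lengths to find where exact recall breaks down (the PTR mark):
    # for n in (512, 1024, 2048, 2560, 3072):
    #     print(n, recall_accuracy(my_model, n))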
Prompt processing speed benefits from the same GEMM optimizations as standard transformers, with the extra benefit that those GEMM optimizations work for batch inference as well (no need for vLLM, as memory allocation is static per agent).
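To illustrate what "the same GEMM optimizations" means in practice, here is a toy (un-normalized, no-decay) linear-attention layer computed two ways: prompt processing as one big causal-masked matmul, and decoding as a per-token recurrent update on a fixed-size state. This is a generic linear-attention sketch, not RWKV-5's actual time-mixing formula.

    import numpy as np

    rng = np.random.default_rng(0)
    T, d = 16, 4                       # toy sequence length and head dimension
    Q, K, V = rng.normal(size=(3, T, d))

    # Prompt processing: one causal-masked matmul pair, GEMM-friendly.
    mask = np.tril(np.ones((T, T)))
    parallel_out = (Q @ K.T * mask) @ V

    # Decoding: the same computation as a recurrence on a fixed d x d state.
    state = np.zeros((d, d))
    recurrent_out = np.empty((T, d))
    for t in range(T):
        state = state + np.outer(K[t], V[t])   # fold token t into the state
        recurrent_out[t] = Q[t] @ state        # O(d^2) per token, independent of T

    print(np.allclose(parallel_out, recurrent_out))   # True: same outputs, two schedules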