It’s cool that progress is being made on alternative LLM architectures, and I did upvote the link.
However, I found this article somewhat frustrating. Showing the quality of the model is only half of the story, but the article suddenly ends there. If people are going to be motivated to adopt an entirely different architecture, then performance and context size deserve at least as much discussion.
Given the linear scaling being shown, it seems like the primary thing people are going to want to see is the next frontier of LLMs: context sizes of ~1M tokens. The word “context” does not even appear in this article, which is disappointing. If there were a discussion of context, it would be nice to see whether it passes the passkey test.
The article also appears to reuse a chart from RWKV-4 showing how awesome a linear function is compared to a quadratic one, but… cool story? It’s not even clear what this chart is truly showing. Is this chart only showing generated tokens, or is this including prompt tokens? As I have never used RWKV, I have no idea how the prompt processing speed compares to the token generation speed. Prompt processing speed has been a big problem for Mixtral, for example.
As a reader, I want to see a couple of actual examples of X prompt tokens + Y generated tokens, and the tokens/s for X and Y for RWKV-5 and for Mistral on the same hardware. On the Mistral side, it is trivial to collect this information in llama.cpp, but I don’t know how the tooling is for RWKV.
RWKV does not have a context size, or, to look at it another way, it has an infinite one.
As far as I understand it, there is an internal state that holds new information while reading input; later information can overwrite earlier information, which is arguably human-like behaviour.
If later input overwrites previous input in the internal state, it means the model does have a limit to how much input it can "remember" at any given time and that limit is less than infinite.
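To make that concrete, here is a toy sketch (my own illustration, not RWKV's actual update rule; the state size and `decay` factor are made up) of a fixed-size recurrent state: every token gets folded into the same small buffer, so older contributions fade unless the model keeps re-encoding them.

    import numpy as np

    d = 8                    # fixed state size, independent of input length
    decay = 0.9              # toy forgetting factor (hypothetical, not RWKV's parameterization)
    state = np.zeros(d)

    def read_token(state, token_embedding):
        # Fold the new token into the same fixed-size buffer.
        # Older contributions are scaled down by `decay`, so they fade
        # unless they keep getting refreshed.
        return decay * state + token_embedding

    rng = np.random.default_rng(0)
    for t in range(10_000):          # arbitrarily long input...
        state = read_token(state, rng.normal(size=d))

    print(state.shape)               # ...but the state is still (8,): memory is bounded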
You can think of it like your own memory. Can you remember a very important thing from 10 years ago? Can you remember every single thing since then? Some things will remain for a basically infinite period, some will have a more limited scope.
I'm not sure I understand your concept of human memory.
It is pretty well established that very few people are able to remember details of things for any reasonable period of time. The way that we keep those memories is by recalling them and playing the events over again in our mind. This 'refreshes' them, but at the expense of 'corrupting' them. It is almost certain that things important to you, which you are sure you remember correctly, are wrong in many details -- you have at times gotten a bit hazy on some aspect, tried to recall it, 'figured it out', and stored that as your original memory without knowing it.
'Concepts', on the other hand, like doing math or riding a bike, seem different to me: you don't really know how to ride a bike, in the sense that you couldn't explain the muscle movements needed to balance and move on a bicycle, but when you get on one, you go through the process of figuring it out again. So even though you 'never forget how to ride a bike', you never really knew how to do it; you just got good at re-learning it incredibly quickly every time you tried.
Can you correct me on any misconceptions I may have about either how I think memories work, or how my thoughts should coincide with how these models work?
I was going more for an ELI5 answer than making comparisons to specific brain concepts. The main idea was that the RNN keeps a rolling context, so there's no clear cutoff... I suspect if you tried, you could fine-tune this to remember some things better than others: some effectively forever, others would degrade the way you said.
There's a limit to the amount, but not to the duration (in theory). It can hold on to something it considers important for an arbitrary amount of time.
One of three things has to be true. Either:
a) this is false
b) perfect recall is false (i.e. as the internal state is overwritten, you lose information about previous entries in the context)
c) the inference time scales by the context length.
It’s not possible to have perfect recall over an arbitrary length in fixed time.
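A rough counting argument for why (my own numbers, purely illustrative): a fixed state of S bits can distinguish at most 2^S different histories, while the number of possible inputs grows exponentially with length, so past some length two different inputs must collapse to the same state.

    import math

    vocab_size = 65_536          # assumption: ~2^16 distinct tokens
    state_bits = 16 * 4096       # assumption: a 4096-dim fp16 state ~= 65,536 bits

    # A fixed state distinguishes at most 2**state_bits histories, but there are
    # vocab_size**n sequences of length n, so perfect recall breaks down once
    # vocab_size**n > 2**state_bits, i.e. beyond roughly:
    max_len = state_bits / math.log2(vocab_size)
    print(max_len)               # ~4096 tokens for these toy numbers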
Not just hard. Totally not possible at all.
That would mean you can scan an infinite amount of data perfectly in fixed time.
So… hrm… this kind of claim rings some alarm bells when it’s combined with this kind of sweeping announcement.
It seems too good to be true; either it’s not that good, or the laws of the universe no longer hold.
(b) is the sacrifice made in these linear attention type architectures.
As a mitigation, you can leave a few normal attention layers in the model but replace the rest.
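Roughly like this, as a sketch (the block names and layer counts are invented, not any particular model's layout): most layers use the linear/recurrent mixer, and a handful of ordinary softmax-attention layers are kept as an exact-lookup path over the whole context.

    # Hypothetical hybrid stack: mostly linear-attention layers, with a few
    # full-attention layers left in as a "perfect recall" escape hatch.
    def build_layers(n_layers=24, full_attn_every=8):
        layers = []
        for i in range(n_layers):
            if i % full_attn_every == full_attn_every - 1:
                layers.append("FullAttentionBlock")    # quadratic, exact recall
            else:
                layers.append("LinearAttentionBlock")  # linear, lossy state
        return layers

    print(build_layers())   # 21 linear blocks + 3 full-attention blocks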
Perfect recall is often a function of the architecture allowing data to bleed through linkages. You can increase perfect token recall through dilated WaveNet-style structures, or, in the case of v5, through multi-head linear attention, which creates multiple pathways where information can skip forward in time.
Here is a relevant tidbit from the RWKV paper "Limitations" section (https://arxiv.org/abs/2305.13048):
There’s a difference between the computation requirements of long context lengths and the accuracy of the model on long context length tasks.
In principle it has no context size limit, but (last time I checked) in practice there is one for implementation reasons.
These models don't have a fixed context size and are progressively fine-tuned for longer and longer contexts. The context length also doesn't impact inference cost.
Another aspect of performance is not just how well the trained model performs, but whether it is data-efficient (performance per token trained). The comparison with Pythia (an open GPT) is shown in the article.
The RWKV-4 paper is quite detailed and has examples of prompts and responses on the last few pages:
https://arxiv.org/abs/2305.13048
And IIRC, RWKV-5 is very similar to RetNet, which is detailed here:
https://arxiv.org/abs/2307.08621
Edit: now that I've thought about it more, data efficiency seems like a highly important aspect given their noble goal of being fully multilingual. This is fairly interesting theoretically, and also for other applications where an abundance of data is not a given.
For linear transformers, the current metric is "perfect token recall": the ability of the model to recall a randomized sequence of data. You can find the limit of a particular model architecture by training a model of a particular size to echo randomized data; I believe this was touched on in the Zoology paper.
This doesn't prevent the model from retaining sequences or information beyond this metric, as information can easily be compressed in the state, but anything within that window can be perfectly recalled by the model.
Internal testing has placed the value for Eagle around the 2.5k PTR (perfect token recall) mark, while community fine-tunes done on the partial checkpoints for long-distance information gathering and memorization have been shown to easily dwarf that.
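For reference, a minimal version of that echo test (my own sketch of the setup, not the exact protocol from the Zoology paper or RWKV's internal testing) looks roughly like this: feed the model a random token sequence, ask it to repeat it, and find the longest length at which it still gets every token right.

    import numpy as np

    rng = np.random.default_rng(0)

    def recall_accuracy(model_generate, seq_len, vocab_size=1000, trials=20):
        """Fraction of trials where the model echoes a random sequence exactly."""
        exact = 0
        for _ in range(trials):
            seq = rng.integers(0, vocab_size, size=seq_len).tolist()
            # `model_generate` is a stand-in: prompt = the random sequence plus
            # some "now repeat it" marker, output = the model's continuation.
            out = model_generate(seq)
            exact += int(out[:seq_len] == seq)
        return exact / trials

    # Sweep lengths to find where exact recall breaks down (the PTR mark):
    # for n in (512, 1024, 2048, 2560, 3072):
    #     print(n, recall_accuracy(my_model, n))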
Prompt processing speed benefits from the same GEMM optimizations as standard transformers, with the extra benefit that those GEMM optimizations work for batch inference as well (no need for vLLM, as memory allocation is static per agent).
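To illustrate what "the same GEMM optimizations" means in practice, here is a toy (un-normalized, no-decay) linear-attention layer computed two ways: prompt processing as one big causal-masked matmul, and decoding as a per-token recurrent update on a fixed-size state. This is a generic linear-attention sketch, not RWKV-5's actual time-mixing formula.

    import numpy as np

    rng = np.random.default_rng(0)
    T, d = 16, 4                       # toy sequence length and head dimension
    Q, K, V = rng.normal(size=(3, T, d))

    # Prompt processing: one causal-masked matmul pair, GEMM-friendly.
    mask = np.tril(np.ones((T, T)))
    parallel_out = (Q @ K.T * mask) @ V

    # Decoding: the same computation as a recurrence on a fixed d x d state.
    state = np.zeros((d, d))
    recurrent_out = np.empty((T, d))
    for t in range(T):
        state = state + np.outer(K[t], V[t])   # fold token t into the state
        recurrent_out[t] = Q[t] @ state        # O(d^2) per token, independent of T

    print(np.allclose(parallel_out, recurrent_out))   # True: same outputs, two schedules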