There is a lot of hype around LLMs, but (BUT!) Mistral well deserves the hype. I use their original 7B model, as well as some derived models, all the time. I can’t wait to see what they release next (which I expect to be a commercial product, although the MoE model set they just released is free).
Another company worthy of some hype is 01.AI, which released their Yi-34B model. I have been running Yi locally on my Mac (using “ollama run yi:34b”) and it is amazing.
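For anyone who would rather script against it than use the CLI, here is a rough sketch of hitting the local Ollama HTTP API from Python. The port, endpoint, and field names are what I believe the current API uses; double-check the Ollama docs, and the model needs to be pulled already.

    import json
    import urllib.request

    def ask_local_model(prompt, model="yi:34b"):
        # Send a non-streaming generation request to the local Ollama server.
        payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
        req = urllib.request.Request(
            "http://localhost:11434/api/generate",
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["response"]

    print(ask_local_model("Explain 4-bit quantization in two sentences."))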
Hype away Mistral and 01.AI, hype away…
How do these small models compare to gpt4 for coding and technical questions?
I noticed that gpt3.5 is practically useless to me (either wrong or too generic), while gpt4 provides a decent answer 80% of the time.
They are not close to GPT-4. Yet. But the rate of improvement is higher than I expected. I think there will be open source models at GPT-4 level that can run on consumer GPUs within a year or two. Possibly requiring some new techniques that haven't been invented yet. The rate of adoption of new techniques that work is incredibly fast.
Of course, GPT-5 is expected soon, so there's a moving target. And I can't see myself using GPT-4 much after GPT-5 is available, if it represents a significant improvement. We are quite far from "good enough".
Curious thought: at some point a competitor’s AI might become so advanced that you can just ask it to tell you how to create your own, analogous system. Easier than trying to catch up on your own. Corporations will have to include their own trade secrets among the things that AIs aren’t presently allowed to talk about, like medical issues or sex.
How to create my own LLM?
Step 1: get a billion dollars.
That’s your main trade secret.
What is inherent about AIs that requires spending a billion dollars?
Humans learn a lot of things from very little input. Seems to me there's no reason, in principle, that AIs could not do the same. We just haven't figured out how to build them yet.
What we have right now, with LLMs, is a very crude brute-force method. That suggests to me that we really don't understand how cognition works, and much of this brute computation is actually unnecessary.
If we knew how to build humans for cheap, then it wouldn't require spending a billion dollars. Your reasoning is circular.
It's precisely because we don't know how to build these LLMs cheaply that one must spend so much money to build them.
The point is that it's not inherently necessary to spend a billion dollars. We just haven't figured it out yet, and it's not due to trade secrets.
Transistors used to cost a billion times more than they do now [1]. Do you have any reason to suspect AIs to be different?
[1] https://spectrum.ieee.org/how-much-did-early-transistors-cos...
However, you would still need billions of dollars if you want state-of-the-art chips today, say 3nm.
Similarly, an LLM may at some point not require a billion dollars; you may be able to get one on par with, or surpassing, GPT-4 cheaply. But state-of-the-art AI will still require substantial investment.
Maybe not $1 billion, but you'd want quite a few million.
According to [1], a 70B model needs about $1.7 million of GPU time (see the back-of-envelope sketch below for where a number that size comes from).
And when you spend that - you don't know if your model will be a damp squib like Bard's original release. Or if you've scraped the wrong stuff from the internet, and you'll get shitty results because you didn't train on a million pirated ebooks. Or if your competitors have a multimodal model, and you really ought to be training on images too.
So you'd want to be ready to spend $1.7 million more than once.
You'll also probably want $$$$ to pay a bunch of humans to choose between responses for human feedback to fine-tune the results. And you can't use the cheapest workers for that, if you need great English language skills and want them to evaluate long responses.
And if you become successful, maybe you'll also want $$$$ for lawyers after you trained on all those pirated ebooks.
And of course you'll need employees - the kind of employees who are very much in demand right now.
You might not need billions, but $10M would be a shoestring budget.
[1] https://twitter.com/moinnadeem/status/1681371166999707648
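To sanity-check that $1.7 million figure, here is a back-of-envelope sketch. The token count, utilization, and $/GPU-hour below are my own assumptions for illustration, not numbers from the linked tweet.

    # Rough estimate of training cost for a 70B model (assumed inputs, not the tweet's).
    params = 70e9                 # model parameters
    tokens = 2e12                 # training tokens, roughly Llama-2-70B scale
    flops = 6 * params * tokens   # standard ~6*N*D estimate of training FLOPs

    peak = 312e12                 # A100 bf16 peak, FLOP/s
    utilization = 0.40            # assumed model FLOPs utilization
    gpu_hours = flops / (peak * utilization) / 3600

    price = 1.00                  # assumed $/GPU-hour at committed/bulk pricing
    print(f"{gpu_hours:,.0f} GPU-hours, ~${gpu_hours * price:,.0f}")
    # -> roughly 1.9 million GPU-hours, i.e. around $1.9M at $1/GPU-hour

So a number like $1.7M is the right ballpark for a single training run, before any failed runs, data work, RLHF, or salaries.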
And when you spend that - you don't know if your model will be a damp squib like Bard's original release. Or if you've scraped the wrong stuff from the internet, and you'll get shitty results because you didn't train on a million pirated ebooks.
This just screams to me that we don’t have a clue what we’re doing. We know how to build various model architectures and train them, but if we can’t even roughly predict how they’ll perform then that really says a lot about our lack of understanding.
Most of the people replying to my original comment seem to have dropped the “in principle” qualifier when interpreting my remarks. That’s quite frustrating because it changes the whole meaning of my comment. I think the answer is that there isn’t anything in principle stopping us from cheaply training powerful AIs. We just don’t know how to do it at this point.
And they also need 8 hours of sleep per day, and are mostly worthless for the first 18 years. Oh, also they may tell you to fuck off while they go on a 3000-mile nature walk for 2 years because they like the idea of free love better.
Knowing how birds fly really doesn't make a useful aircraft that can carry 50 tons of supplies, or one that can go over the speed of sound.
This is the power of machines and bacteria: throwing massive numbers at the problem. Being able to solve problems of cognition by throwing 1 GW of power at them will absolutely get us to solving how our brain does it with 20 watts in a shorter period of time.
I agree about training time, but bear in mind LLMs like GPT4 and Mistral also have noisy recall of vastly more written knowledge than any human can read in their lifetime, and this is one of the features people like about LLMs.
You can't replace those types of LLM with a human, the same way you can't replace Google Search (or GitHub Search) with a human.
Acquiring and preparing that data may end up being the most expensive part.
Because that billion dollars gets you the R&D to know how to do it?
The original point was that an “AI” might become so advanced that it would be able to describe how to create a brain on a chip. This is flawed for two main reasons.
1. The models we have today aren’t able to do this. We are able to model existing patterns fairly well but making new discoveries is still out of reach.
2. Any company capable of creating a model which had singularity-like properties would discover them first, simply by virtue of the fact that they have first access. Then they would use their superior resources to write the algorithm and train the next-gen model before you even procured your first H100.
It might work for fine-tuning an open model to a narrow use case.
But creating a base model is out of reach. You probably need on the order of hundreds of millions of dollars (if not a billion) to get close to GPT-4.
As someone who doesn’t know much about how these models work or are created I’d love to see some kind of breakdown that shows what % of the power of GPT4 is due to how it’s modelled (layers or whatever) vs training data and the computing resources associated with it.
This isn't precisely knowable now, but it might be something academics figure out years from now. Of course, first principles of 'garbage in, garbage out' would put data integrity very high; the LLM code itself is supposedly not even 100k lines of code, and the HW is crazy advanced.
so the ordering is probably data, HW, LLM model
This also fits the general ordering of
data = all human knowledge
HW = integrated complexity of most technologists
LLM = small team
Still requires the small team to figure out what to do with the first two, but it only happened now because the HW is good enough.
LLMs would have been invented by Turing and Shannon et al. almost certainly nearly 100 years ago if they had access to the first two.
By % of cost it's 99.9% compute cost and 0.1% data costs.
In terms of "secret sauce" it's 95% data quality and 5% architectural choices.
That’s true now, but maybe GPT6 will be able to tell you how to build GPT7 on an old laptop, and you’ll be able to summon GPT8 with a toothpick and three cc’s of mouse blood.
Model merging is easy, and a unique model merge may be hard to replicate if you don’t know the original recipe.
Model merging can create truly unique models. Love to see shit from ghost in the shell turn into real life
Yes training a new model from scratch is expensive, but creating a new model that can’t be replicated by fine tuning is easy
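To make "easy" concrete: the crudest merge is just a weighted average of two checkpoints that share an architecture. A minimal sketch, with made-up file names and mixing ratio (real recipes like SLERP or per-layer mixes are fancier, but this is the core idea):

    import torch

    alpha = 0.6  # how much of model A to keep; this ratio is the "recipe"

    # Both checkpoints must have the same architecture and tensor names.
    state_a = torch.load("model_a.bin", map_location="cpu")
    state_b = torch.load("model_b.bin", map_location="cpu")

    merged = {name: alpha * state_a[name] + (1 - alpha) * state_b[name]
              for name in state_a}

    torch.save(merged, "merged_model.bin")

Without knowing alpha (or the per-layer schedule) and the exact parent models, reproducing someone else's merge is hard, which is the point about unique merges above.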
The limiting factor isn’t knowledge of how to do it, it is GPU access and RLHF training data.
I’m both excited and scared to think about this “significant improvement” over GPT-4.
It can make our jobs a lot easier or it can take our jobs.
Isn't that the same? At some point, your job becomes so easy that anyone can do it.
It's weird for programmers to be worried about getting automated out of a job when my job as a programmer is basically to try as hard as I can to automate myself out of a job.
You’re supposed to automate yourself out but not tell anyone. Didn’t you see that old Simpsons episode from the 90s about the self-driving trucks? The drivers rightfully STFU about their innovation and cashed in on great work-life balance, and Homer ruined it by blabbering about it to everyone, causing the drivers to try to go after him.
We are trying to keep SWE salaries up, and lowering the barrier to entry will drop them.
I expect the demand for SWE to grow faster than productivity gains.
The idea that demand scales to fill supply doesn’t work when supply becomes effectively infinite. Induction from the past is likely wrong in this case
I don't see the current tech making supply infinite. Not even close.
Maybe a more advanced type of model they'll invent in the next few years. Who knows... But GPT-like models? Nah, they won't write useful code applicable in prod without supervision by an experienced engineer.
LLMs are going to spit out a lot of broken shit that needs fixing. They're great at small context work but full applications require more than they're capable of imo.
Even if so, the next gen model will fix it.
Hey, I doubt it.
I believe one of the problems that OSS models need to solve is... the dataset. All of them lack a good and large dataset.
And this is most noticeable if you ask anything that is not in English-American-ish.
Maybe it should be an independent model in charge only of converting your question to American English and back, instead of trying to make a single model speak all languages
I don't think this is a good idea. A good model, if we are really aiming at anything that resembles AGI (or even a good LLM like GPT-4), is a model that has world knowledge. The world is not just English.
There’s a lot of world knowledge that is just not present in an American English corpus. For example knowledge of world cuisine & culture. There’s precious few good English sources on Sichuan cooking.
There are indeed already open source models rivaling ChatGPT-3.5, but GPT-4 is an order of magnitude better.
The sentiment that GPT-4 is going to be surpassed by open source models soon is something I only notice on HN. Makes me suspect people here haven't really tried the actual GPT-4 but instead the various scammy services like Bing that claim they are using GPT-4 under the hood when they are clearly not.
Makes me suspect you don't follow HN user base very closely.
You're 100% right and I apologize that you're getting downvoted, in solidarity I will eat downvotes with you.
HNs funny right now because LLMs are all over the front page constantly, but there's a lot of HN "I am an expert because I read comments sections" type behavior. So many not even wrong comments that start from "I know LLaMa is local and C++ is a programming language and I know LLaMa.cpp is on GitHub and software improves and I've heard of Mistral."
Mistral's latest just released model is well below GPT-3 out of the box. I've seen people speculate that with fine-tuning and RLHF you could get GPT-3 like performance out of it but it's still too early to tell.
I'm in agreement with you, I've been following this field for a decade now and GPT-4 did seem to cross a magical threshold for me where it was finally good enough to not just be a curiosity but a real tool. I try to test every new model I can get my hands on and it remains the only one to cross that admittedly subjective threshold for me.
Still, for a 7B model, this is quite impressive.
The early information I see implies it is above. Mind you, that is mostly because GPT-3 was comparatively low: for instance its 5-shot MMLU score was 43.9%, while Llama2 70B 5-shot was 68.9%[0]. Early benchmarks[1] give Mixtral scores above Llama2 70B on MMLU (and other benchmarks), thus transitively, it seems likely to be above GPT-3.
Of course, GPT-3.5 has a 5-shot score of 70, and it is unclear yet whether Mixtral is above or below, and clearly it is below GPT-4’s 86.5. The dust needs to settle, and the official inference code needs to be released, before there is certainty on its exact strength.
(It is also a base model, not a chat finetune; I see a lot of people saying it is worse, simply because they interact with it as if it was a chatbot.)
[0]: https://paperswithcode.com/sota/multi-task-language-understa...
[1]: https://github.com/open-compass/MixtralKit#comparison-with-o...
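For anyone wondering what "5-shot" means in those scores: the model is shown five already-answered questions in the prompt before the one being graded, with no weight updates. A toy sketch (the questions below are invented, not real MMLU items):

    # Five solved (question, answer) pairs would go in this list; one shown for brevity.
    examples = [
        ("What is 2 + 2?\n(A) 3 (B) 4 (C) 5 (D) 6", "B"),
    ]

    def five_shot_prompt(test_question):
        parts = [f"{q}\nAnswer: {a}\n" for q, a in examples]
        parts.append(f"{test_question}\nAnswer:")
        return "\n".join(parts)

    print(five_shot_prompt("Which planet is largest?\n(A) Earth (B) Mars (C) Jupiter (D) Venus"))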
Have you played with finetunes, like Cybertron? Augmented in wrappers and retrievers like GPT is?
It's not there yet, but it's waaaay closer than the plain Mistral chat release.
what types of things do you ask ChatGPT to do for you regarding coding?
Typically a few lines snippets that would require me a few minutes of thinking but that ChatGPT will provide immediately. It often works, but there are setbacks. For instance, if I'm lazy and don't very carefully check the code, it can produce bugs and cancel the benefits.
It can be useful, but I can see how it'll generate a class of lazy coders who can't think by themselves and just try to get the answer from ChatGPT. An amplified Stack Overflow syndrome.
If you can run yi34b, you can run phind-codellama. It's much better than yi and mistral for code questions. I use it daily. More useful than gpt3 for coding, not as good as gpt4, except that I can copy and paste secrets into it without sending them to openai.
Thanks, I will give codellama a try.
Open source models will probably catch up at the same rate as open source search engines have caught up to Google search.
One thing people should keep in mind when reading others’ comments about how good an LLM is at coding, is that the capability of the model will vary depending on the programming language. GPT-4 is phenomenal at Java because it probably ate an absolutely enormous amount of Java in training. Also, Java is a well-managed language with good backwards-compatibility, so patterns in code written at different times are likely to be compatible with each other. Finally, Java has been designed so that it is hard for the programmer to make mistakes. GPT-4 is great for Java because Java is great for GPT-4: it provides what the LLM needs to be great.
How do you use these models? If you don't mind sharing. I use GPT-4 as an alternative to googling, haven't yet found a reason to switch to something else. I'll for example use it to learn about the history, architecture, cultural context, etc of a place when I'm visiting. I've found it very ergonomic for that.
I use them in my editor with my plugin https://github.com/David-Kunz/gen.nvim
Interesting use case, but the issue is wasting all this compute energy for prediction?
Can you explain what you mean by this question?
I’ve used LM Studio. It hasn’t reached peak user friendliness, but it’s a nice enough GUI. You’ll need to fiddle with resource allocation settings and select an optimally quantized model for best performance. But you can do all that in the UI.
LM Studio is an accessible, simple way to use them. That said, expecting them to be anywhere near as good as GPT-4 is going to lead to disappointment.
If you want to experiment, Kobold.cpp is a great interface and goes a long way toward guaranteeing backwards compatibility with outdated model formats.
I host them here: https://app.lamini.ai/playground
You can play with them, tune them, and download the weights
It isn’t exactly the same as open source because weights != source code, but it is close in the sense that it is editable
IMO we just don’t have great tools for editing LLMs like we do for code, but they are getting better
Prompt engineering, RAG, and finetuning/tuning are effective for editing LLMs. They are getting easier and better tooling is starting to emerge
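As a sketch of the RAG part of that: score your documents against the question, take the top matches, and paste them into the prompt. Real systems use an embedding model and a vector store for the scoring step; the word-overlap scorer below is just a self-contained stand-in, and the documents and question are made up.

    documents = [
        "Our refund policy allows returns within 30 days of purchase.",
        "Support is available Monday through Friday, 9am to 5pm.",
        "Premium accounts include priority shipping.",
    ]

    def score(question, document):
        # Stand-in for embedding similarity: count shared words.
        return len(set(question.lower().split()) & set(document.lower().split()))

    def retrieve(question, k=2):
        return sorted(documents, key=lambda d: score(question, d), reverse=True)[:k]

    question = "Can I return an item I bought last week?"
    context = "\n".join(retrieve(question))
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
    # `prompt` then goes to whichever model you are wrapping or tuning.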
You mind sharing what you find so amazing about Yi-34B? I haven’t had a chance to try it.
I just installed it on my 32GB Mac yesterday. First impressions: it does very well at reasoning, it does very well answering general common-sense world-knowledge questions, and so far when it generates Python code, the code works and is well documented. I know this is just subjective, but I have been running a 30B model for a while on my Mac and Yi-34B just feels much better. With 4-bit quantization, I can still run Emacs, terminal windows, and a web browser with a few tabs without seeing much page faulting. Anyway, please try it and share a second opinion.
The 200K finetunes are also quite good at understanding their huge context.
How does it compare to other models, and to ChatGPT in particular?
No comparison to be made.
I concur, Yi 34B and Mistral 7B are fantastic.
But you need to run the top Yi finetunes instead of the vanilla chat model. They are far better. I would recommend Xaboros/Cybertron, or my own merge of several models on huggingface if you want the long context Yi.