Incredible: it rivals Llama 3 8B with only 3.8B parameters, less than a week after Llama 3's release.
And on LMSYS English, Llama 3 8B is on par with GPT-4 (not GPT-4-Turbo), as well as Mistral-Large.
Source: https://chat.lmsys.org/?leaderboard (select English in the dropdown)
So we now have an open-source LLM approximately equivalent in quality to GPT-4 that can run on phones? Kinda? Wild.
(I'm sure there's a lot of nuance to it, for one these benchmarks are not so hard to game, we'll see how the dust settles, but still...)
Phi-3-mini 3.8b: 71.2
Phi-3-small 7b: 74.9
Phi-3-medium 14b: 78.2
Phi-2 2.7b: 58.8
Mistral 7b: 61.0
Gemma 7b: 62.0
Llama-3-In 8b: 68.0
Mixtral 8x7b: 69.9
GPT-3.5 1106: 75.3
(these are averages across all tasks for each model, but looking at individual scores shows a similar picture)
Can’t wait to see some Phi-3 fine tunes! Will be testing this out locally, such a small model that I can run it without quantization.
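For anyone else planning to try it locally: once the weights are up on Hugging Face, something like the sketch below should work with transformers (the model id is taken from the link further down the thread; the loading flags are my guesses and untested):

    # Rough, untested sketch: unquantized local inference with Hugging Face transformers.
    # Assumes the weights land as "microsoft/Phi-3-mini-4k-instruct" (see the link
    # later in this thread); exact flags may differ once the model card is up.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "microsoft/Phi-3-mini-4k-instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # ~7.6 GB of weights at bf16, no quantization needed
        device_map="auto",
        trust_remote_code=True,      # earlier Phi releases shipped custom modeling code
    )

    messages = [{"role": "user", "content": "Explain the Monty Hall problem in two sentences."}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(inputs, max_new_tokens=200)
    print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))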
Feels incredible to be living in a time of such breakneck innovation. What are the chances we'll have a <100B-parameter GPT-4/Claude Opus-level model in the next 5 years?
5 years? 5 years is a millennium these days.
We'll have small local models beating GPT-4/Claude Opus in 2024. We already have sub-100B models trading blows with earlier GPT-4 versions, and the future is racing toward us. All these little breakthroughs are piling up.
Absolutely not on the first one. Not even close.
Why not? There's still 7 months left for breakthroughs.
We already do. It's called Llama 3 70B Instruct.
Llama 3 is awful in non-English. 95% of their training data is in English....
GPT is still the king when talking about multiple languages/knowledge.
Is it released?
It feels like it's going to be closer than that. People always forget that GPT4 and Opus have the advantage of behind-the-curtain tool use that you just can't see, so you don't know how much of a knowledge or reasoning leg-up they're getting from their internal tooling ecosystem. They're not really directly comparable to a raw LLM downloaded from HF.
What we need is a standardised open harness for open source LLMs to sit in that gives them both access to tools and the ability to write their own, and that's (comparatively speaking) a much easier job than training up another raw frontier LLM: it's just code, and they can write a lot of it.
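The core of such a harness really isn't much code. Here is a toy sketch of the loop; the JSON tool-call convention, the run_llm stub, and the example tool are all made up for illustration, not from any existing standard:

    # Toy tool-use harness around a local LLM: the model requests a tool via a simple
    # JSON convention, the harness runs it and feeds the result back. Everything here
    # (the JSON format, run_llm, the example tool) is an illustrative assumption.
    import json

    def calculator(expression: str) -> str:
        """Example tool: evaluate a basic arithmetic expression (toy only)."""
        return str(eval(expression, {"__builtins__": {}}, {}))

    TOOLS = {"calculator": calculator}

    def run_llm(messages: list[dict]) -> str:
        """Placeholder: swap in a call to whatever local model you host.
        This stub just echoes the last message so the sketch runs end to end."""
        return "No model wired up yet: " + messages[-1]["content"]

    def harness(user_prompt: str, max_steps: int = 5) -> str:
        messages = [
            {"role": "system", "content":
             'Call a tool by replying with JSON: {"tool": "<name>", "input": "<arg>"}. '
             "Available tools: " + ", ".join(TOOLS)},
            {"role": "user", "content": user_prompt},
        ]
        reply = ""
        for _ in range(max_steps):
            reply = run_llm(messages)
            try:
                call = json.loads(reply)
                result = TOOLS[call["tool"]](call["input"])
                messages.append({"role": "tool", "content": result})
            except (ValueError, KeyError, TypeError):
                return reply  # not a tool call: treat it as the final answer
        return reply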
In 5 years' time we'll have adaptive compute, and the idea of talking about the parameter count of a model will seem as quaint as talking about the cylinder capacity of a jet engine.
Source?
Right thanks for the reminder, I added it
Thanks. I don't see them being "well above GPT-4"; it's merely 1 point ahead? Also, no idea why one would want to exclude GPT-4-Turbo, the flagship "GPT-4" model, but w/e.
I also don't think they "beat Llama 3 8B"; their own abstract says "rivals that of models such as Mixtral 8x7B and GPT-3.5", "rivals" not even "beats".
Great model, but let's not overplay it.
In the English category: GPT-4-0314 (ELO 1166), Llama 3 8B Instruct (ELO 1161), Mistral-Large-2402 (ELO 1151), GPT-4-0613 (ELO 1148).
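For a sense of scale, a few Elo points is close to a coin flip head-to-head. The standard Elo expected-score formula converts a rating gap into a win probability (back-of-envelope, assuming the usual 400-point logistic scale):

    # Back-of-envelope: what an Elo gap means head-to-head, assuming the
    # standard 400-point logistic Elo scale.
    def expected_win_rate(elo_a: float, elo_b: float) -> float:
        return 1 / (1 + 10 ** ((elo_b - elo_a) / 400))

    print(expected_win_rate(1166, 1161))  # GPT-4-0314 vs Llama 3 8B: ~0.507
    print(expected_win_rate(1166, 1148))  # GPT-4-0314 vs GPT-4-0613: ~0.526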
You are right, I toned down the language, I got a bit overexcited, and I missed the difference in the versions of GPT-4. And LMSYS is a subjective benchmark for what users prefer, which I'm sure has weird inherent biases.
It's just that any signal of a 3.8B model being anywhere in the vicinity of GPT-4 is huge.
Yeah, GPT3.5, in a phone, at ~1,000 tokens/sec ... nice!
12 tokens per second.
Whoops, made the same mistake as @ignoramous :P
This inductive logic is way overblown.
Judging by a single benchmark? Without even trying it out with real-world usage?
Any potential caveats in such a leaderboard notwithstanding, on that leaderboard alone there is a huge gap between Llama 3 8B and Mistral-Large, let alone any of the GPT-4 variants.
By the way, for beating benchmarks, see "Pretraining on the Test Set Is All You Need".
It's easy to miss: select English in the dropdown. The scores are quite different in Overall and in English for LMSYS.
As I've stated in other comments, yeah... Agreed, I'm stretching it a bit. It's just that any indication of a 3.8B model being in the vicinity of GPT-4 is huge.
I'm sure that when things are properly measured by third-parties it will show a more sober picture. But still, with good fine-tunes, we'll probably get close.
It's a very significant demonstration of what could be possible soon.
Firstly, English is a highly subjective category.
Secondly, Llama 3 usually opens with sentences like ‘What a unique question!’ or ‘What an insightful thought’, which might make people like it more than the competition because of the pandering.
While Llama 3 is singular in terms of size-to-quality ratio, calling the 8B model close to GPT-4 would be a stretch.
Yes, I don't know why people don't realize how well cheap tricks work in Chatbot Arena. A single base model can produce hundreds of points of Elo difference depending on how it is tuned. And in most cases, heavy instruction tuning even slightly decreases reasoning ability on standard benchmarks; you can see base models scoring better on MMLU/ARC most of the time on the Hugging Face leaderboard.
Even GPT-4-1106 seems to only sound better than GPT-4-0613 and work for a wider range of prompts. But with a well-defined prompt and follow-up questions, I don't think there is an improvement in reasoning.
When I tried Phi-2 it was just bad. I don't know where you got this fantasy that people accept obviously wrong answers because of "pandering".
Obviously a correct answer matters more, but ~100-200 Elo points could be gained just from better writing. Answer correctness would account for a range of ~500 Elo in comparison.
In my use cases, better writing makes for a better answer.
Per the paper, Phi-3-mini (which is English-only) quantised to 4-bit uses 1.8 GB of RAM and outputs 1212 tokens/sec (correction: 12 tokens/sec) on iOS.
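The 1.8 GB figure roughly matches naive back-of-envelope math for 4-bit weights (ignoring activations, KV cache, and any layers kept at higher precision):

    # Rough memory estimate for 3.8B parameters at 4 bits per weight
    # (ignores activations, KV cache, and any unquantized layers).
    params = 3.8e9
    bytes_per_param = 4 / 8  # 4-bit quantization = half a byte per weight
    print(params * bytes_per_param / 1e9)  # ~1.9 GB, close to the reported 1.8 GB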
A model on par with GPT-3.5 running on phones!
(weights haven't been released, though)
Phi-1, Phi-1.5, and Phi-2 have all had their weights released, and those weights are available under the MIT License.
Hopefully Microsoft will continue that trend with Phi-3.
I think you meant "12 tokens/sec", which is still nice, just a little less exciting than a kilotoken/sec.
Weights are coming tomorrow.
tomorrow is now: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct
Weights will be released tomorrow, according to one of the tech report authors on Twitter.
Thanks! The HTML version on archive.is has messed up markup and shows 1212 instead: https://archive.is/Ndox6
No, we don't. LMsys is just one, very flawed benchmark.
Why is LMsys flawed?
Many people treat LMsys as gospel because it's the only large-scale, up-to-date qualitative benchmark. All the numeric benchmarks seem to miss real-world applicability.
Agreed, but it's wild that even one benchmark shows this. Based on what we knew just a few months ago, these models should be so far from each other in every benchmark.
On par in some categories. Phi was intended for reasoning, not storing data, due to its small size. I mean, it's still great, but the smaller it gets, the more facts from outside the prompt's context simply won't be known at all.
I wonder if that's a positive or negative. How does it affect hallucinations?
It depends what you want to do. If you want a chat bot that can replace most Google queries, you want as much learned data as possible and the whole Wikipedia consumed. If you want a RAG style system, you want good reasoning about the context and minimal-or-no references to extra information. It's neither positive nor negative without a specific use case.
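To make the RAG case concrete, here is a minimal sketch of the pattern; retrieve() and the prompt wording are made-up stand-ins, not from any particular library:

    # Minimal RAG-style prompt assembly: retrieved passages go into the prompt and the
    # model is told to reason only from them, so a small model's lack of memorized
    # facts matters much less. retrieve() is a made-up stand-in for a real retriever.
    def retrieve(query: str, k: int = 3) -> list[str]:
        """Stand-in for a vector-search / BM25 retriever."""
        return ["<passage 1>", "<passage 2>", "<passage 3>"][:k]

    def build_rag_prompt(question: str) -> str:
        context = "\n\n".join(retrieve(question))
        return (
            "Answer using ONLY the context below. If the answer is not in the context, "
            "say you don't know.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
        )

    print(build_rag_prompt("When was the building permit issued?"))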
Where did you get this from?
No, not even close... Even Gemini has a huge UX gap compared to GPT-4/Opus; for an 8B model I won't even attempt this argument.
"But still"? Lets be realistic, all of these benchmark scores are absolute garbage. Yes, the open source community is making great strides, they are getting closer but the gap is still wide when comparing to commercially available models.
It's not open source, but it is open weights - like distributing a precompiled executable. In particular, what makes it open weights rather than just weights-available is that it is licensed under an OSI-approved license (MIT) rather than a restrictive proprietary license.
I really wish these companies would release the training source, evaluation suites, and code used to curate/filter training data (since safety efforts can lead to biases). Ideally they would also share the training data but that may not be fully possible due to licensing.
At a glance, it looks like Phi-3 was trained on an English only, STEM-strong dataset. See how they are not as strong in HumanEval, Trivia, etc. But of course it's very good.