
Phi-3 Technical Report

oersted
39 replies
15h17m

Incredible: it rivals Llama 3 8B with only 3.8B parameters, less than a week after Llama 3's release.

And on LMSYS English, Llama 3 8B is on par with GPT-4 (not GPT-4-Turbo), as well as Mistral-Large.

Source: https://chat.lmsys.org/?leaderboard (select English in the dropdown)

So we now have an open-source LLM approximately equivalent in quality to GPT-4 that can run on phones? Kinda? Wild.

(I'm sure there's a lot of nuance to it, for one these benchmarks are not so hard to game, we'll see how the dust settles, but still...)

Phi-3-mini 3.8b: 71.2

Phi-3-small 7b: 74.9

Phi-3-medium 14b: 78.2

Phi-2 2.7b: 58.8

Mistral 7b: 61.0

Gemma 7b: 62.0

Llama-3-Instruct 8b: 68.0

Mixtral 8x7b: 69.9

GPT-3.5 1106: 75.3

(these are averages across all tasks for each model, but looking at individual scores shows a similar picture)

crakenzak
8 replies
15h13m

Can’t wait to see some Phi-3 fine tunes! Will be testing this out locally, such a small model that I can run it without quantization.

Feels incredible to be living in a time of such breakneck innovation. What are the chances we'll have a <100B parameter GPT4/Claude Opus model in the next 5 years?

Deverauxi
2 replies
14h36m

5 years? 5 years is a millennium these days.

We'll have small local models beating GPT-4/Claude Opus in 2024. We already have sub-100B models trading blows with former GPT-4 models, and the future is racing toward us. All these little breakthroughs are piling up.

refulgentis
1 replies
14h9m

Absolutely not on the first one. Not even close.

ashirviskas
0 replies
1h37m

Why not? There's still 7 months left for breakthroughs.

bugglebeetle
1 replies
13h40m

We already do. It’s called LLama 3 70B Instruct.

vitorgrs
0 replies
10h58m

Llama 3 is awful in non-English. 95% of their training data is in English....

GPT is still the king when talking about multiple languages/knowledge.

stavros
0 replies
14h47m

Is it released?

regularfry
0 replies
5h32m

It feels like it's going to be closer than that. People always forget that GPT4 and Opus have the advantage of behind-the-curtain tool use that you just can't see, so you don't know how much of a knowledge or reasoning leg-up they're getting from their internal tooling ecosystem. They're not really directly comparable to a raw LLM downloaded from HF.

What we need is a standardised open harness for open source LLMs to sit in that gives them both access to tools and the ability to write their own, and that's (comparatively speaking) a much easier job than training up another raw frontier LLM: it's just code, and they can write a lot of it.
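A minimal sketch in Python of what I mean (the model call is stubbed out, and the JSON tool-call format and tool names are illustrative assumptions, not an existing standard):

```python
import json

# Registry of tools the model is allowed to call. Names and call format
# are illustrative; there is no agreed standard yet, which is the point.
def word_count(text):
    return str(len(text.split()))

def read_file(path):
    with open(path) as f:
        return f.read()

TOOLS = {"word_count": word_count, "read_file": read_file}

def query_model(prompt):
    """Stub for a local LLM call (llama.cpp server, Ollama, etc.).
    Expected to return plain text, or a JSON tool call such as
    {"tool": "read_file", "input": "notes.txt"}."""
    raise NotImplementedError

def run_harness(question, max_steps=5):
    transcript = question
    for _ in range(max_steps):
        reply = query_model(transcript)
        try:
            call = json.loads(reply)
        except json.JSONDecodeError:
            return reply                      # plain text: final answer
        tool = TOOLS.get(call.get("tool"))
        if tool is None:
            return reply                      # unknown tool: give up gracefully
        result = tool(call["input"])
        # Feed the tool output back so the model can keep reasoning with it.
        transcript += f"\n[{call['tool']} returned]: {result}\n"
    return "Stopped after too many tool calls."
```

The hard part is the surrounding ecosystem (sandboxing, tool discovery, letting the model write new tools safely), not this loop.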

nl
0 replies
12h48m

What are chances we’ll have a <100B parameter GPT4/Claude Opus model in the next 5 years?

In 5 years time we'll have adaptive compute and the idea of talking about the parameter count of a model will seem as quaint as talking about the cylinder capacity of a jet engine.

moralestapia
6 replies
15h9m

And on LMSYS English, Llama 3 8B is well above GPT-4

Source?

oersted
5 replies
15h7m

Right, thanks for the reminder, I added it.

moralestapia
4 replies
15h3m

Thanks, I don't see them being "well above GPT-4", merely 1 point? Also, no idea why one would want to exclude GPT-4-Turbo, the flagship "GPT-4" model, but w/e.

I also don't think they "beat Llama 3 8B"; their own abstract says "rivals that of models such as Mixtral 8x7B and GPT-3.5", "rivals" not even "beats".

Great model, but let's not overplay it.

oersted
3 replies
14h53m

In the English category: GPT-4-0314 (ELO 1166), Llama 3 8B Instruct (ELO 1161), Mistral-Large-2402 (ELO 1151), GPT-4-0613 (ELO 1148).

You are right, I toned down the language, I got a bit overexcited, and I missed the difference in the versions of GPT-4. And LMSYS is a subjective benchmark for what users prefer, which I'm sure has weird inherent biases.

It's just that any signal of a 3.8B model being anywhere in the vicinity of GPT-4 is huge.

moralestapia
2 replies
14h46m

Yeah, GPT3.5, in a phone, at ~1,000 tokens/sec ... nice!

mlyle
1 replies
14h20m

at ~1,000 tokens/sec

12 tokens per second.

moralestapia
0 replies
6h32m

Whoops, made the same mistake as @ignoramous :P

jxy
6 replies
14h44m

This inductive logic is way overblown.

Incredible, beat Llama 3 8B with 3.8B parameters after less than a week of release.

Judging by a single benchmark? Without even trying it out in real-world usage?

And on LMSYS English, Llama 3 8B is on par with GPT-4 (not GPT-4-Turbo), as well as Mistral-Large.

Any potential caveats of such a leaderboard notwithstanding, on that leaderboard alone there is a huge gap between Llama 3 8B and Mistral-Large, let alone any of the GPT-4 models.

By the way, for beating benchmarks, see "Pretraining on the Test Set Is All You Need".

oersted
5 replies
14h36m

It's easy to miss: select English in the dropdown. The scores are quite different in Overall and in English for LMSYS.

As I've stated in other comments, yeah... Agreed, I'm stretching it a bit. It's just that any indication of a 3.8B model being in the vicinity of GPT-4 is huge.

I'm sure that when things are properly measured by third-parties it will show a more sober picture. But still, with good fine-tunes, we'll probably get close.

It's a very significant demonstration of what could be possible soon.

saretup
4 replies
14h4m

Firstly, English is a highly subjective category.

Secondly, Llama 3 usually opens with sentences like 'What a unique question!' or 'What an insightful thought', which might make people like it more than the competition because of the pandering.

While Llama 3 is singular in terms of its size-to-quality ratio, calling the 8B model close to GPT-4 would be a stretch.

YetAnotherNick
3 replies
12h9m

Yes, I don't know how people don't realize how well cheap tricks work in Chatbot Arena. A single base model can produce hundreds of points of ELO difference depending on how it is tuned. And in most cases, instruction tuning even slightly decreases reasoning ability on standard benchmarks. You can see that base models score better on MMLU/ARC most of the time on the Hugging Face leaderboard.

Even GPT-4-1106 seems to only sound better than GPT-4-0613 and work for a wider range of prompts. But with a well-defined prompt and follow-up questions, I don't think there is an improvement in reasoning.

imtringued
2 replies
11h52m

When I tried Phi-2 it was just bad. I don't know where you got the fantasy that people accept obviously wrong answers because of "pandering".

YetAnotherNick
1 replies
10h34m

Obviously a correct answer matters more, but ~100-200 ELO points could be gained just from better writing. Answer correctness would account for a range of ~500 ELO in comparison.

rgbrgb
0 replies
2h32m

just for better writing

in my use cases, better writing makes a better answer

ignoramous
5 replies
15h4m

Phi-3-mini 3.8b: 71.2

Per the paper, phi-3-mini (which is English-only), quantised to 4-bit, uses 1.8GB of RAM and outputs 1212 tokens/sec (correction: 12 tokens/sec) on iOS.
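That RAM figure lines up with a back-of-envelope estimate (ignoring the KV cache, activations, and runtime overhead):

```python
params = 3.8e9            # phi-3-mini parameter count
bits_per_weight = 4       # 4-bit quantization
weight_bytes = params * bits_per_weight / 8
# ~1.90 GB decimal (about 1.77 GiB), close to the reported 1.8GB
print(f"{weight_bytes / 1e9:.2f} GB")
```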

A model on par with GPT-3.5 running on phones!

(weights haven't been released, though)

coder543
4 replies
14h47m

(weights haven't been released, though)

Phi-1, Phi-1.5, and Phi-2 have all had their weights released, and those weights are available under the MIT License.

Hopefully Microsoft will continue that trend with Phi-3.

outputs 1212 tokens/sec on iOS

I think you meant "12 tokens/sec", which is still nice, just a little less exciting than a kilotoken/sec.

intellectronica
1 replies
12h1m

Weights are coming tomorrow.

jph00
0 replies
13h43m

Weights will be released tomorrow, according to one of the tech report authors on Twitter.

ignoramous
0 replies
13h31m

you meant 12 tokens/sec

Thanks! The HTML version on archive.is has messed up markup and shows 1212 instead: https://archive.is/Ndox6

zone411
2 replies
14h34m

So we now have an open-source LLM approximately equivalent in quality to GPT-4 that can run on phones?

No, we don't. LMsys is just one, very flawed benchmark.

ukuina
0 replies
14h1m

Why is LMsys flawed?

Many people treat LMsys as gospel because it's the only large-scale, up-to-date qualitative benchmark. All the numeric benchmarks seem to miss real-world applicability.

oersted
0 replies
14h33m

Agreed, but it's wild that even one benchmark shows this. Based on what we knew just a few months ago, these models should be so far from each other in every benchmark.

viraptor
2 replies
11h23m

On par in some categories. Phi was intended for reasoning, not storing knowledge, due to its small size. I mean, it's still great, but the smaller it gets, the more facts from outside the prompt's context it simply won't know at all.

candiodari
1 replies
6h47m

I wonder if that's a positive or negative. How does it affect hallucinations?

viraptor
0 replies
6h12m

It depends on what you want to do. If you want a chatbot that can replace most Google queries, you want as much learned data as possible and the whole of Wikipedia consumed. If you want a RAG-style system, you want good reasoning about the context and minimal-or-no reliance on outside information. It's neither positive nor negative without a specific use case.
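For the RAG case, the core of it is just prompt assembly over retrieved chunks; a rough sketch (the retrieve/generate functions are stand-ins for whatever vector store and local model you use):

```python
def answer_with_rag(question, retrieve, generate, k=4):
    """retrieve(question, k) -> list of text chunks from your document store;
    generate(prompt) -> model output. Both are placeholders for real components."""
    chunks = retrieve(question, k)
    context = "\n\n".join(chunks)
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)
```

The small model only has to reason over the supplied context, so its lack of memorized facts matters much less there.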

karmasimida
0 replies
9h52m

Where did you get this from?

So we now have an open-source LLM approximately equivalent in quality to GPT-4 that can run on phones

No, not even close ... Even Gemini has a huge UX gap compared to GPT-4/Opus; for an 8B model I won't even attempt that argument.

infecto
0 replies
6h29m

"But still"? Lets be realistic, all of these benchmark scores are absolute garbage. Yes, the open source community is making great strides, they are getting closer but the gap is still wide when comparing to commercially available models.

blackeyeblitzar
0 replies
2h50m

It's not open source, but it is open weight - like distributing a precompiled executable. In particular, what makes it open weights rather than merely weights-available is that it is licensed under an OSI-approved license (MIT) rather than a restrictive proprietary license.

I really wish these companies would release the training source, evaluation suites, and code used to curate/filter training data (since safety efforts can lead to biases). Ideally they would also share the training data but that may not be fully possible due to licensing.

alecco
0 replies
9h59m

At a glance, it looks like Phi-3 was trained on an English-only, STEM-heavy dataset. See how it is not as strong on HumanEval, Trivia, etc. But of course it's very good.

modeless
23 replies
14h24m

Everyone needs to take these benchmark numbers with a big grain of salt. According to what I've read, Phi-2 was much worse than its benchmark numbers suggested. This model follows the same training strategy. Nobody should be assuming these numbers will translate directly into a high ranking on the LMSYS leaderboard, or usefulness in everyday tasks. Let's not dethrone Llama 3 until some real world testing can be done.

That said, I don't think it's impossible for a small model to be very good. I see their "synthetic data" as essentially a way of distilling GPT-4 into smaller models. It would be exciting if a large fraction of the performance of huge models could be transferred to small ones! If true, then Chinchilla-optimal training could make sense again, as you could optimally train a ginormous model and then distill it afterward for efficient inference.

bt1a
13 replies
14h3m

This won't dethrone Llama 3, but it's equally impressive.

They mention this model's relative weakness on the TruthfulQA eval, since packing 'knowledge' into a small model is lossier than packing in problem-solving skills (which shine on MMLU).

Regardless - still a very useful thing to have offline and on the fly. Those scores are nothing to scoff at.

Given that these pipelines are likely harder to imitate than new architectures like Transformers, I assume there has been and will be an intense focus on synthetic data generation and cleansing. Llama 3 used 15T tokens in its training corpus vs 4.8T in the "scaled-up" version of phi-3. If you made it to the end of this disjointed ramble, I'm sorry.

IvanAchlaqullah
7 replies
12h20m

TruthfulQA

Wait, people still use this benchmark? I hear there's a huge flaw in it.

For example, fine-tuning a model on 4chan makes it score better on TruthfulQA. It becomes very offensive afterwards though, for obvious reasons. See GPT-4chan [1]

[1] https://www.youtube.com/watch?v=efPrtcLdcdM

thomashop
2 replies
10h9m

Couldn't it be that training it on 4chan makes it more truthful for some reason?

wongarsu
1 replies
6h28m

Could it be that people who can talk anonymously with no reputation to gain or lose and no repercussions to fear actually score high on truthfulness? Could it be that truthfulness is actually completely unrelated to the offensiveness of the language used to signal in-group status?

cptcobalt
0 replies
2h49m

This unironically feels like good research & paper potential.

nurumaik
0 replies
9h8m

scores better

very offensive

Any cons?

hoseja
0 replies
11h35m

Looks like a good and useful benchmark.

andy99
0 replies
9h10m

Not sure I understand your example? It's not an offensiveness benchmark, in fact I can imagine a model trained to be inoffensive would do worse on a truth benchmark. I wouldn't go so far as to say truthfulQA is actually testing how truthful a model is or its reasoning. But it's one of the least correlated with other benchmarks which makes it one of the most interesting. Much more so than running most other tests that are highly correlated with MMLU performance. https://twitter.com/gblazex/status/1746295870792847562

andai
0 replies
5h15m

"Omit that training data..."

Grimblewald
4 replies
8h30m

Even Llama 3 has its issues. I've been quite impressed so far, but if the context gets a little long it freaks out, gets stuck repeating the same token, or just fails to finish an answer. This is with the full f16 8B model, so it can't be put down to quantization. It also doesn't handle complex instructions quite as well as the benchmarks would imply it should.

andai
2 replies
5h28m

Supposedly LLMs (especially smaller ones) are best suited to tasks where the answer is in the text, i.e. summarization, translation, and answering questions.

Asking it to answer questions on its own is much more prone to hallucination.

To that end I've been using Llama 3 for summarizing transcripts of YouTube videos. It does a decent job, but... every single time (literally 100% of the time), it will hallucinate a random name for the speaker.* Every time! I thought it might be the system prompt, but there isn't one.

My own prompt is just "{text}\n\n###\n\nPlease summarize the text above."

If I ask it to summarize in bullet points, it doesn't do that.

I'm assuming there was something in the (instruct) training data that strongly encourages that, i.e. a format of summaries beginning with the author's name? Seems sensible enough, but obviously backfires when there's literally no data and it just makes something up...

*In videos where the speaker's name isn't in the transcript. If it's a popular field, it will often come up with something plausible (e.g. Andrew Ng for an AI talk.) If it's something more obscure, it'll dream up something completely random.

woodson
0 replies
1h41m

Especially for small models, I had very bad results when using them for translation. Even trying all kinds of tricks didn't help (apparently prompting in the target language helps for some). Encoder-decoder models such as FLAN-T5 or MADLAD-400 seemed far superior at equal or even smaller model size.

kiratp
0 replies
1h14m

The technique to use is to give the model an “out” for the missing/negative case.

"{text}\n\n###\n\nPlease summarize the text above. The text is a video transcript. It may not have the names of the speakers in it. If you need to refer to an unnamed speaker, call them Speaker_1, Speaker_2 and so on."

behohippy
0 replies
2h50m

I had this same issue with incomplete answers on longer summarization tasks. If you ask it to "go on" it will produce a better completion, but I haven't seen this behaviour in any other model.

refulgentis
4 replies
14h10m

Phi-2 wasn't chat/instruct tuned, so it didn't do well in chat; it was a base model. But the benchmark numbers were real.

irjustin
2 replies
13h49m

I'm pretty naive, so please forgive me if this is a stupid question.

To me, what the parent comment is saying is that even though the benchmarks are cool, it's not super helpful to the everyday person. Because if you can't chat with it very well (even in a narrow context), what utility does it have, great benchmarks or not?

svnt
1 replies
13h17m

Both are saying the same thing: in order for the base model that is phi to perform well as a chat agent, it would need to be tuned for that purpose before its benchmark results could have real-world value.

imjonse
0 replies
12h37m

From this report. Phi-2 was not instruct tuned indeed.

"Our models went through post-training with both supervised instruction fine-tuning, and preference tuning with DPO. We have worked on generating and curating various instruction and preference data. This has improved the model chat capabilities, robustness, as well as its safety."

nl
0 replies
12h58m

I had a lot of issues trying to get Phi-2 to perform as well as the benchmarks indicated on non-chat tasks.

It felt a lot like it was overfitted to the exact types of tasks in the benchmarks (i.e., not a data leak), but if you tried something a bit off track it didn't know what to do. At the time my hypothesis was that the small model just didn't have the capacity to generalise well enough, but since then Gemma 2B has come out and seems to be OK.

So now I have no idea why, but yes: the benchmarks for Phi-2 didn't represent how it worked for me on real-world tasks where you'd expect it to be OK.

ankit219
1 replies
8h28m

Not trying to disparage them, but their models always give the feeling that they are overfitted on benchmarks, which is why they perform so well there. On everyday tasks they're much worse, whether chat or simple completion tasks.

Distilling can work and there are papers which suggest it does, but we still do not have a reliable mechanism which can distill knowledge from larger teacher models to smaller student models.

moffkalast
0 replies
3h12m

This was the case for Phi-2, it was notoriously rubbish in practical use.

spmurrayzzz
0 replies
4h31m

I don't think we can call it distillation, at least not in the conventional ML use of the word, since you're not interacting with the actual model architecture; specifically, you're not computing a loss between the predictions of the parent model and the distilled target model.

This is an important distinction when it comes to assessing model collapse risk, which is a risk I think has probably been overstated enough to this point that it's now being understated.
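For reference, conventional distillation in the Hinton et al. sense blends a hard-label loss with a KL term against the teacher's softened logits; a minimal PyTorch sketch (the temperature and weighting values are illustrative):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend the usual cross-entropy on hard labels with a KL term that pulls
    the student's distribution toward the teacher's softened distribution."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(soft_student, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```

Generating synthetic training text with GPT-4 never touches the teacher's logits like this, which is why I'd only call it distillation loosely.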

HarHarVeryFunny
0 replies
2h28m

The Chinchilla paper is about how to design/train a model to optimize use of computing power, which we can equate to cost (FLOPs cost dollars).

The question that Chinchilla tries to answer is: for a given training budget (which you can think of as dollars or FLOPs), what is the optimal trade off of model size and quantity of training data to get the most performant model? Build a large model and train with less data, or build a smaller one and train with more data?

However, another consideration is minimizing total lifetime cost of the model: training cost + inference cost. You could train a model for longer (costing more) in order to get a given level of performance from a smaller model that will be cheaper for inference, or vice versa. For any given projected model lifetime inference volume, there is going to be a different answer.

It's not that Chinchilla-optimal models stopped making sense, but rather that this sort of consideration has people willing to pump more money (tokens) into smaller models to reduce inference cost for that level of capability.
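The usual approximations make this easy to play with: training compute is roughly 6 FLOPs per parameter per token, and the Chinchilla-optimal data budget works out to roughly 20 tokens per parameter. A rough sketch (these constants are rules of thumb, not exact values from the paper):

```python
def train_flops(n_params, n_tokens):
    # Standard approximation: ~6 FLOPs per parameter per training token.
    return 6 * n_params * n_tokens

def chinchilla_optimal_tokens(n_params):
    # Rough rule of thumb from the Chinchilla paper: ~20 tokens per parameter.
    return 20 * n_params

# phi-3-mini: 3.8B params trained on 3.3T tokens, far beyond "compute-optimal"
n = 3.8e9
print(chinchilla_optimal_tokens(n) / 1e9)              # ~76B tokens would be compute-optimal
print(train_flops(n, 3.3e12) / train_flops(n, 76e9))   # ~43x the compute-optimal training budget
```

That extra training compute is spent precisely to buy a smaller, cheaper-to-serve model at a given capability level.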

brcmthrowaway
21 replies
15h3m

If I were Apple I'd be quaking in my boots. They are getting too far behind to ever catch up. Nokia in 2010 vibes.

thoughtegting
4 replies
14h52m

If I were Apple, I would be developing something in total secrecy and then release it ahead of the rest of the competition when people least expect it. Very big ifs, but Siri can be updated everywhere overnight, and I don't see them rushing into anything like this.

golergka
3 replies
14h47m

If I were Apple, I would just buy one of the major LLM companies. They have the cash.

bingbingbing777
2 replies
13h40m

They've been buying AI companies and have nothing to show for it.

golergka
0 replies
11h40m

Showing off work in progress is not really their thing.

elbear
0 replies
13h12m

Why do you think that is? Do you think their culture is an obstacle or is it something else?

seydor
2 replies
12h28m

Apple's advantage is that their devices are safeguarding people from the dangers of AI

talldayo
0 replies
1h54m

That's a very eloquent variation on the word "censorship"

Are you next going to tell us that the CIA's access to iCloud data protects their users from terrorism too?

fauigerzigerk
0 replies
9h27m

How so? And what dangers?

PedroBatista
2 replies
14h39m

They'll just do what they have been doing for ~20 years: wait, pick the "winner", polish the "user experience", call it Apple magic, and incorporate it into their product cycles.

Some day their playbook will become so mediocre that it won't stick anymore, but I think they are safe on this one, for now.

mirekrusin
0 replies
9h0m

Considering that experiments cost tens to hundreds of millions of dollars a pop, this may not be that bad a strategy.

fauigerzigerk
0 replies
9h28m

True for hardware, but their record on software is far less convincing.

vessenes
0 replies
14h39m

I don't think MS has a special sauce here, just a willingness to publish. To the extent that MS has disclosed the bulk of what they are doing with Phi, it's a combination of a really nice initial idea - "use written texts + GPT-4 to generate high-quality prompts where we know the answer is great because it's written down" - and engineering.

To me this is advancing the state of the art on the impact of data quality, but it doesn't look to me like the Phi series has some magical special sauce otherwise. Data quality and synthetic data creation are not magical moats that Apple can't cross.

I'll say too that I'm psyched to try Phi-3; the sweet spot for me is a model that can be a local coding assistant and still answer random Q&A questions with some sophistication. I'm skeptical that 3-8B parameter models will bring the high level of sophistication sometimes needed in this cycle; there's still a very large gap with the larger models in daily use, despite some often-close benchmark scores.

Anyway, Apple-Phi-3 is in no way an impossibility.

oersted
0 replies
14h48m

If anything this is good for them. Apple's play here has always been getting their devices ready for running LLMs locally. This makes it way easier.

moralestapia
0 replies
14h55m

I don't recall Nokia being a 3 trillion dollar company. Your vibes may vary, though.

esafak
0 replies
15h0m

Did they ever claim to be a powerhouse in foundation models? Did your MacBook or iPhone become obsolete or stop working? They use the models, they don't release them because they don't hoard data.

ec109685
0 replies
14h47m

Eh, I think it’s showing that this class of model is becoming commoditized given there is a new one launching every week.

bt1a
0 replies
14h23m

I tore my hair out developing a SwiftUI app that could run llama.cpp and whisper.cpp simultaneously. I was eventually able to run a Q3_K Mistral 7B along with a smaller Whisper model, but grinding through Xcode is a nightmare.

They're working on MLX, but it only recently got Swift bindings. They just don't have the DEVELOPERS DEVELOPERS DEVELOPERS coked-out attitude, I guess.

astrange
0 replies
13h33m

I think that when people release new interesting software products it's good for hardware companies.

WanderPanda
0 replies
14h36m

The opposite is the case: with all the advancements, even by doing nothing, Apple (like everyone, including hobbyists) is moving closer to the frontier. Hopefully this trend stays alive!

IncreasePosts
0 replies
14h44m

How exactly does publicized research lead to them not being able to catch up? I don't think anything in this paper is patentable.

Deverauxi
0 replies
14h38m

They have something like 140 billion dollars in cash.

They’ll be fine.

hackerlight
8 replies
15h2m

Fewer tokens than Llama 3 (3.3T vs 15T) yet a better outcome. No doubt more information-dense training data. The interesting thing is the use of synthetic data, which they don't talk about.

vessenes
5 replies
14h37m

Actually, the original Phi papers did talk about their synthetic data strategy, and it's very cool -- essentially inverting high-quality textbook text using GPT-4 to create prompts, where the textbooks supply the answers. There may be more undisclosed, but it remains in my mind as one of the best ideas of the last twelve months -- so smart, and interesting, and apparently, it works well.
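Something like this is roughly how I picture the pipeline; to be clear, the prompt wording and the pairing scheme are my guesses at the shape of the idea, not the paper's actual recipe (assumes the OpenAI Python client):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def passage_to_training_pair(passage):
    """Ask GPT-4 to write a question that the given textbook passage answers.
    The passage itself then serves as the 'gold' completion for a synthetic
    training example. Prompt wording is illustrative, not from the paper."""
    prompt = (
        "Write one self-contained question or exercise that the following "
        "passage fully answers. Return only the question.\n\n" + passage
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    question = resp.choices[0].message.content.strip()
    return {"prompt": question, "completion": passage}
```

The neat part is that the expensive model only has to write the question; the answer quality is anchored by the human-written text.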

torginus
1 replies
13h1m

Except everything that comes out of an LLM (like GPT4) is highly suspect (at least in my experience).

samus
0 replies
9h29m

1. They need it for style and language, not necessarily for the facts

2. Since GPT-4 is seen as the very best general-purpose LLM in existence, it makes sense to emulate its performance with fewer resources.

3. Phi models are also trained with other high-quality data

xarope
0 replies
14h21m

perhaps that's the best path forward? Text and reference books (hopefully unbiased) for answers, and web scraped data for conversational tone.

astrange
0 replies
13h29m

I feel like literal dictionaries would make good training data; wonder if any of them have done that. LLMs are good at faking so it's hard to tell by asking them.

YetAnotherNick
0 replies
12h38m

No, they don't use textbook text at all, despite the paper title. They just asked GPT-4 to generate "textbook quality" content, which doesn't even exactly look like a textbook.

minimaxir
1 replies
14h58m

Yes, "chinchilla optimal" is a meme, but 15T might turn out to be too many tokens.

wrsh07
0 replies
14h49m

My understanding from this tweet thread [1] is that Chinchilla probably overspecified some of the model's hyperparameters.

tl;dr I'm looking forward to having lots of models (ideally models) trained with a wide range of parameters to narrow down "what is actually optimal"

I think there is an interesting tradeoff of data quality and data volume, though

(Eg if we train with the highest quality 10% of our data, does the model improve if we use the other 90%? What if we increase our data size by 10x?)

[1] https://twitter.com/tamaybes/status/1780639257389904013

smartmic
4 replies
11h34m

Hm, around 84 authors on one "scientific" paper. I wonder if this says something about (a) the quality of its content, (b) the direction academic (?) paper publishing is heading, (c) nothing at all, or (d) something else entirely.

samus
0 replies
9h33m

It's a tech report. Fair enough to include the whole lab.

lysecret
0 replies
10h0m

It just means you need a big machine and a lot of capital to make advances. Take a look at any paper coming out of CERN.

a_bonobo
0 replies
11h25m

I have been on far larger author lists :) There's probably a whole team for the training data generation and assessment, a whole team for the safety assessment (section 4), that stuff adds up.

0cf8612b2e1e
0 replies
1h24m

You should see physics. Stuff involving the Large Hadron Collider can have pages of authors.

It costs so little to share the credit if someone was an asset.

minimaxir
0 replies
2h46m

And with an MIT license!

Patrick_Devine
0 replies
19m

And of course if you want to try it out locally, `ollama run phi3`.

simonw
1 replies
15h14m

I'm getting a bit skeptical of MMLU at this point. As far as I can tell it's a set of multiple choice questions that hasn't been updated since 2020. We have to trust the model providers not to deliberately or accidentally train on it for those scores to be useful.

minimaxir
0 replies
15h2m

At the least, there are multiple benchmarks noted in the paper (21!) and the results are consistent across all of them.

I'd trust Microsoft to do decontamination testing, although the paper doesn't explicitly mention it other than "The prompts and number of shots are part of a Microsoft internal tool to evaluate language models, and in particular we did no optimization to the pipeline for the phi-3 models."
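For context, published decontamination checks are usually just long n-gram overlap tests between training documents and benchmark items (the GPT-3 paper used 13-grams); a minimal sketch of the idea:

```python
def ngrams(text, n=13):
    """Return the set of word n-grams in a text. 13-grams are a commonly
    used threshold in published decontamination setups (e.g. GPT-3)."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(train_doc, benchmark_items, n=13):
    """Flag a training document if it shares any long n-gram with a benchmark item."""
    doc_grams = ngrams(train_doc, n)
    return any(doc_grams & ngrams(item, n) for item in benchmark_items)
```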

m3kw9
1 replies
4h52m

Phi-2 was useless for practical purposes, except if you wanted to show your friends that it can write a poem. Llama 3 8B is slightly better but still in the same category; it's complete trash at coding vs GPT-4. Llama 3 400B "iS OPen SoURce!", but no, you will need to pay to access it, because most people cannot practically afford an A100 and set it up properly.

What I'm trying to say is that user experience is now as key as model smarts, and these models that barely touch GPT-4 cannot beat OpenAI right now as a whole package.

azinman2
0 replies
2h15m

I just tried to give gpt4 a scrape of a menu page and asked it to reformat it to csv. It hallucinated several parts and missed others. Llama3-70b hasn’t done that. So far it’s been more reliable. You can run a quantized version on consumer hardware or pay significantly less ($10 vs $1 in, $30 vs $1 out) on hosted platforms.

whereismyacc
0 replies
9h20m

they're just spamming "weights or it didn't happen"

i mean, fair

Havoc
1 replies
5h42m

Both previous Phi models have been epic letdowns when I actually tried them myself, so I have quite low confidence in this being reflective of the real world. Will try it anyway though.

imjonse
0 replies
45m

Phi-3 is instruct tuned though so hopefully better.

visarga
0 replies
14h23m

This shows the power of synthetic content - 3.3 trillion tokens! This approach can make a model even smaller and more efficient than training on organic text, and it will not be able to regurgitate NYT articles because it hasn't seen any of them. This is how copyright infringement claims can be avoided.

ur-whale
0 replies
14h3m

That's a whole lot of Zhangs!

mythz
0 replies
13h57m

I'll believe it when I try it for myself; Phi-2 was the clear worst of the 20 LLMs we evaluated (it was also the smallest, so that was expected).

But it was also slow for its size, and generated the longest responses with the most hallucinations, as well as the most empty responses. It was also ranked as the model with the lowest-quality answers.

maximsicora
0 replies
9h32m

insane

blackoil
0 replies
14h55m

Has anyone used these or similar models with fine-tuning and RAG? How is the performance over a narrow domain for simple queries? Is it good enough for, say, an informational chatbot?