
Phi-3 Technical Report

oersted
39 replies
15h17m

Incredible: it rivals Llama 3 8B with only 3.8B parameters, less than a week after Llama 3's release.

And on LMSYS English, Llama 3 8B is on par with GPT-4 (not GPT-4-Turbo), as well as Mistral-Large.

Source: https://chat.lmsys.org/?leaderboard (select English in the dropdown)

So we now have an open-source LLM approximately equivalent in quality to GPT-4 that can run on phones? Kinda? Wild.

(I'm sure there's a lot of nuance to it, for one these benchmarks are not so hard to game, we'll see how the dust settles, but still...)

Phi-3-mini 3.8b: 71.2

Phi-3-small 7b: 74.9

Phi-3-medium 14b: 78.2

Phi-2 2.7b: 58.8

Mistral 7b: 61.0

Gemma 7b: 62.0

Llama-3-Instruct 8b: 68.0

Mixtral 8x7b: 69.9

GPT-3.5 1106: 75.3

(these are averages across all tasks for each model, but looking at individual scores shows a similar picture)

crakenzak
8 replies
15h13m

Can’t wait to see some Phi-3 fine tunes! Will be testing this out locally, such a small model that I can run it without quantization.

Feels incredible to be living in a time of such breakneck innovation. What are the chances we'll have a <100B parameter GPT4/Claude Opus model in the next 5 years?

Deverauxi
2 replies
14h36m

5 years? 5 years is a millennium these days.

We'll have small local models beating GPT-4/Claude Opus in 2024. We already have sub-100B models trading blows with former GPT-4 models, and the future is racing toward us. All these little breakthroughs are piling up.

refulgentis
1 replies
14h9m

Absolutely not on the first one. Not even close.

ashirviskas
0 replies
1h37m

Why not? There's still 7 months left for breakthroughs.

bugglebeetle
1 replies
13h40m

We already do. It’s called LLama 3 70B Instruct.

vitorgrs
0 replies
10h58m

Llama 3 is awful in non-English. 95% of their training data is in English....

GPT is still the king when talking about multiple languages/knowledge.

stavros
0 replies
14h47m

Is it released?

regularfry
0 replies
5h32m

It feels like it's going to be closer than that. People always forget that GPT4 and Opus have the advantage of behind-the-curtain tool use that you just can't see, so you don't know how much of a knowledge or reasoning leg-up they're getting from their internal tooling ecosystem. They're not really directly comparable to a raw LLM downloaded from HF.

What we need is a standardised open harness for open source LLMs to sit in that gives them both access to tools and the ability to write their own, and that's (comparatively speaking) a much easier job than training up another raw frontier LLM: it's just code, and they can write a lot of it.
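A minimal sketch in Python of what I mean (the model call is stubbed out, and the JSON tool-call format and tool names are illustrative assumptions, not an existing standard):

```python
import json

# Registry of tools the model is allowed to call. Names and call format
# are illustrative; there is no agreed standard yet, which is the point.
def word_count(text):
    return str(len(text.split()))

def read_file(path):
    with open(path) as f:
        return f.read()

TOOLS = {"word_count": word_count, "read_file": read_file}

def query_model(prompt):
    """Stub for a local LLM call (llama.cpp server, Ollama, etc.).
    Expected to return plain text, or a JSON tool call such as
    {"tool": "read_file", "input": "notes.txt"}."""
    raise NotImplementedError

def run_harness(question, max_steps=5):
    transcript = question
    for _ in range(max_steps):
        reply = query_model(transcript)
        try:
            call = json.loads(reply)
        except json.JSONDecodeError:
            return reply                      # plain text: final answer
        tool = TOOLS.get(call.get("tool"))
        if tool is None:
            return reply                      # unknown tool: give up gracefully
        result = tool(call["input"])
        # Feed the tool output back so the model can keep reasoning with it.
        transcript += f"\n[{call['tool']} returned]: {result}\n"
    return "Stopped after too many tool calls."
```

The hard part is the surrounding ecosystem (sandboxing, tool discovery, letting the model write new tools safely), not this loop.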

nl
0 replies
12h48m

What are chances we’ll have a <100B parameter GPT4/Claude Opus model in the next 5 years?

In 5 years time we'll have adaptive compute and the idea of talking about the parameter count of a model will seem as quaint as talking about the cylinder capacity of a jet engine.

moralestapia
6 replies
15h9m

And on LMSYS English, Llama 3 8B is well above GPT-4

Source?

oersted
5 replies
15h7m

Right, thanks for the reminder, I added it.

moralestapia
4 replies
15h3m

Thanks, I don't see them being "well above GPT-4", merely 1 point? Also, no idea why one would want to exclude GPT-4-Turbo, the flagship "GPT-4" model, but w/e.

I also don't think they "beat Llama 3 8B"; their own abstract says "rivals that of models such as Mixtral 8x7B and GPT-3.5", "rivals" not even "beats".

Great model, but let's not overplay it.

oersted
3 replies
14h53m

In the English category: GPT-4-0314 (ELO 1166), Llama 3 8B Instruct (ELO 1161), Mistral-Large-2402 (ELO 1151), GPT-4-0613 (ELO 1148).

You are right, I toned down the language, I got a bit overexcited, and I missed the difference in the versions of GPT-4. And LMSYS is a subjective benchmark for what users prefer, which I'm sure has weird inherent biases.

It's just that any signal of a 3.8B model being anywhere in the vicinity of GPT-4 is huge.

moralestapia
2 replies
14h46m

Yeah, GPT3.5, in a phone, at ~1,000 tokens/sec ... nice!

mlyle
1 replies
14h20m

at ~1,000 tokens/sec

12 tokens per second.

moralestapia
0 replies
6h32m

Whoops, made the same mistake as @ignoramous :P

jxy
6 replies
14h44m

This inductive logic is way overblown.

Incredible, beat Llama 3 8B with 3.8B parameters after less than a week of release.

Judging by a single benchmark? Without even trying it out in real-world usage?

And on LMSYS English, Llama 3 8B is on par with GPT-4 (not GPT-4-Turbo), as well as Mistral-Large.

Any potential caveats of such a leaderboard notwithstanding, on that leaderboard alone there is a huge gap between Llama 3 8B and Mistral-Large, let alone any of the GPT-4 models.

By the way, for beating benchmarks, see "Pretraining on the Test Set Is All You Need".

oersted
5 replies
14h36m

It's easy to miss: select English in the dropdown. The scores are quite different in Overall and in English for LMSYS.

As I've stated in other comments, yeah... Agreed, I'm stretching it a bit. It's just that any indication of a 3.8B model being in the vicinity of GPT-4 is huge.

I'm sure that when things are properly measured by third-parties it will show a more sober picture. But still, with good fine-tunes, we'll probably get close.

It's a very significant demonstration of what could be possible soon.

saretup
4 replies
14h4m

Firstly, English is a highly subjective category.

Secondly, Llama 3 usually opens with sentences like 'What a unique question!' or 'What an insightful thought', which might make people like it more than the competition because of the pandering.

While Llama 3 is singular in terms of its size-to-quality ratio, calling the 8B model close to GPT-4 would be a stretch.

YetAnotherNick
3 replies
12h9m

Yes, I don't know how people don't realize how well cheap tricks work in Chatbot Arena. A single base model can produce hundreds of points of ELO difference depending on how it is tuned. And in most cases, instruction tuning even slightly decreases reasoning ability on standard benchmarks. You can see that base models score better on MMLU/ARC most of the time on the Hugging Face leaderboard.

Even GPT-4-1106 seems to only sound better than GPT-4-0613 and work for a wider range of prompts. But with a well-defined prompt and follow-up questions, I don't think there is an improvement in reasoning.

imtringued
2 replies
11h52m

When I tried Phi-2 it was just bad. I don't know where you got the fantasy that people accept obviously wrong answers because of "pandering".

YetAnotherNick
1 replies
10h34m

Obviously a correct answer matters more, but ~100-200 ELO points could be gained just from better writing. Answer correctness would account for a range of ~500 ELO in comparison.

rgbrgb
0 replies
2h32m

just for better writing

in my use cases, better writing makes a better answer

ignoramous
5 replies
15h4m

Phi-3-mini 3.8b: 71.2

Per the paper, phi-3-mini (which is English-only), quantised to 4-bit, uses 1.8GB of RAM and outputs 1212 tokens/sec (correction: 12 tokens/sec) on iOS.
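That RAM figure lines up with a back-of-envelope estimate (ignoring the KV cache, activations, and runtime overhead):

```python
params = 3.8e9            # phi-3-mini parameter count
bits_per_weight = 4       # 4-bit quantization
weight_bytes = params * bits_per_weight / 8
# ~1.90 GB decimal (about 1.77 GiB), close to the reported 1.8GB
print(f"{weight_bytes / 1e9:.2f} GB")
```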

A model on par with GPT-3.5 running on phones!

(weights haven't been released, though)

coder543
4 replies
14h47m

(weights haven't been released, though)

Phi-1, Phi-1.5, and Phi-2 have all had their weights released, and those weights are available under the MIT License.

Hopefully Microsoft will continue that trend with Phi-3.

outputs 1212 tokens/sec on iOS

I think you meant "12 tokens/sec", which is still nice, just a little less exciting than a kilotoken/sec.

intellectronica
1 replies
12h1m

Weights are coming tomorrow.

jph00
0 replies
13h43m

Weights will be released tomorrow, according to one of the tech report authors on Twitter.

ignoramous
0 replies
13h31m

you meant 12 tokens/sec

Thanks! The HTML version on archive.is has messed up markup and shows 1212 instead: https://archive.is/Ndox6

zone411
2 replies
14h34m

So we now have an open-source LLM approximately equivalent in quality to GPT-4 that can run on phones?

No, we don't. LMsys is just one, very flawed benchmark.

ukuina
0 replies
14h1m

Why is LMsys flawed?

Many people treat LMsys as gospel because it's the only large-scale, up-to-date qualitative benchmark. All the numeric benchmarks seem to miss real-world applicability.

oersted
0 replies
14h33m

Agreed, but it's wild that even one benchmark shows this. Based on what we knew just a few months ago, these models should be so far from each other in every benchmark.

viraptor
2 replies
11h23m

On par in some categories. Phi was intended for reasoning, not storing knowledge, due to its small size. I mean, it's still great, but the smaller it gets, the more facts from outside the prompt's context it simply won't know at all.

candiodari
1 replies
6h47m

I wonder if that's a positive or negative. How does it affect hallucinations?

viraptor
0 replies
6h12m

It depends on what you want to do. If you want a chatbot that can replace most Google queries, you want as much learned data as possible and the whole of Wikipedia consumed. If you want a RAG-style system, you want good reasoning about the context and minimal-or-no reliance on outside information. It's neither positive nor negative without a specific use case.
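For the RAG case, the core of it is just prompt assembly over retrieved chunks; a rough sketch (the retrieve/generate functions are stand-ins for whatever vector store and local model you use):

```python
def answer_with_rag(question, retrieve, generate, k=4):
    """retrieve(question, k) -> list of text chunks from your document store;
    generate(prompt) -> model output. Both are placeholders for real components."""
    chunks = retrieve(question, k)
    context = "\n\n".join(chunks)
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)
```

The small model only has to reason over the supplied context, so its lack of memorized facts matters much less there.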

karmasimida
0 replies
9h52m

Where did you get this from?

So we now have an open-source LLM approximately equivalent in quality to GPT-4 that can run on phones

No, not even close ... Even Gemini has a huge UX gap compared to GPT-4/Opus; for an 8B model I won't even attempt that argument.

infecto
0 replies
6h29m

"But still"? Lets be realistic, all of these benchmark scores are absolute garbage. Yes, the open source community is making great strides, they are getting closer but the gap is still wide when comparing to commercially available models.

blackeyeblitzar
0 replies
2h50m

It's not open source, but it is open weight - like distributing a precompiled executable. In particular, what makes it open weights rather than merely weights-available is that it is licensed under an OSI-approved license (MIT) rather than a restrictive proprietary license.

I really wish these companies would release the training source, evaluation suites, and code used to curate/filter training data (since safety efforts can lead to biases). Ideally they would also share the training data but that may not be fully possible due to licensing.

alecco
0 replies
9h59m

At a glance, it looks like Phi-3 was trained on an English-only, STEM-heavy dataset. See how it is not as strong on HumanEval, Trivia, etc. But of course it's very good.

modeless
23 replies
14h24m

Everyone needs to take these benchmark numbers with a big grain of salt. According to what I've read, Phi-2 was much worse than its benchmark numbers suggested. This model follows the same training strategy. Nobody should be assuming these numbers will translate directly into a high ranking on the LMSYS leaderboard, or usefulness in everyday tasks. Let's not dethrone Llama 3 until some real world testing can be done.

That said, I don't think it's impossible for a small model to be very good. I see their "synthetic data" as essentially a way of distilling GPT-4 into smaller models. It would be exciting if a large fraction of the performance of huge models could be transferred to small ones! If true, then Chinchilla-optimal training could make sense again, as you could optimally train a ginormous model and then distill it afterward for efficient inference.

bt1a
13 replies
14h3m

This won't dethrone Llama 3, but it's equally impressive.

They mention this model's relative weakness on the TruthfulQA eval, since packing 'knowledge' into a small model is lossier than packing in problem-solving skills (which shine on MMLU).

Regardless - still a very useful thing to have offline and on the fly. Those scores are nothing to scoff at.

Given that these pipelines are likely harder to imitate than new architectures like Transformers, I assume there has been and will be an intense focus on synthetic data generation and cleansing. Llama 3 used 15T tokens in its training corpus vs 4.8T in the "scaled-up" version of phi-3. If you made it to the end of this disjointed ramble, I'm sorry.

IvanAchlaqullah
7 replies
12h20m

TruthfulQA

Wait, people still use this benchmark? I hear there's a huge flaw in it.

For example, fine-tuning a model on 4chan makes it score better on TruthfulQA. It becomes very offensive afterwards though, for obvious reasons. See GPT-4chan [1]

[1] https://www.youtube.com/watch?v=efPrtcLdcdM

thomashop
2 replies
10h9m

Couldn't it be that training it on 4chan makes it more truthful for some reason?

wongarsu
1 replies
6h28m

Could it be that people who can talk anonymously with no reputation to gain or lose and no repercussions to fear actually score high on truthfulness? Could it be that truthfulness is actually completely unrelated to the offensiveness of the language used to signal in-group status?

cptcobalt
0 replies
2h49m

This unironically feels like good research & paper potential.

nurumaik
0 replies
9h8m

scores better

very offensive

Any cons?

hoseja
0 replies
11h35m

Looks like a good and useful benchmark.

andy99
0 replies
9h10m

Not sure I understand your example? It's not an offensiveness benchmark, in fact I can imagine a model trained to be inoffensive would do worse on a truth benchmark. I wouldn't go so far as to say truthfulQA is actually testing how truthful a model is or its reasoning. But it's one of the least correlated with other benchmarks which makes it one of the most interesting. Much more so than running most other tests that are highly correlated with MMLU performance. https://twitter.com/gblazex/status/1746295870792847562

andai
0 replies
5h15m

"Omit that training data..."

Grimblewald
4 replies
8h30m

Even Llama 3 has its issues. I've been quite impressed so far, but if the context gets a little long it freaks out, gets stuck repeating the same token, or just fails to finish an answer. This is with the full f16 8B model, so it can't be put down to quantization. It also doesn't handle complex instructions quite as well as the benchmarks would imply it should.

andai
2 replies
5h28m

Supposedly LLMs (especially smaller ones) are best suited to tasks where the answer is in the text, i.e. summarization, translation, and answering questions.

Asking it to answer questions on its own is much more prone to hallucination.

To that end I've been using Llama 3 for summarizing transcripts of YouTube videos. It does a decent job, but... every single time (literally 100% of the time), it will hallucinate a random name for the speaker.* Every time! I thought it might be the system prompt, but there isn't one.

My own prompt is just "{text}\n\n###\n\nPlease summarize the text above."

If I ask it to summarize in bullet points, it doesn't do that.

I'm assuming there was something in the (instruct) training data that strongly encourages that, i.e. a format of summaries beginning with the author's name? Seems sensible enough, but obviously backfires when there's literally no data and it just makes something up...

*In videos where the speaker's name isn't in the transcript. If it's a popular field, it will often come up with something plausible (e.g. Andrew Ng for an AI talk.) If it's something more obscure, it'll dream up something completely random.

woodson
0 replies
1h41m

Especially for small models, I had very bad results when using them for translation. Even trying all kinds of tricks didn't help (apparently prompting in the target language helps for some). Encoder-decoder models such as FLAN-T5 or MADLAD-400 seemed far superior at equal or even smaller model size.

kiratp
0 replies
1h14m

The technique to use is to give the model an “out” for the missing/negative case.

"{text}\n\n###\n\nPlease summarize the text above. The text is a video transcript. It may not have the names of the speakers in it. If you need to refer to an unnamed speaker, call them Speaker_1, Speaker_2 and so on."

behohippy
0 replies
2h50m

I had this same issue with incomplete answers on longer summarization tasks. If you ask it to "go on" it will produce a better completion, but I haven't seen this behaviour in any other model.

refulgentis
4 replies
14h10m

Phi-2 wasn't chat/instruct tuned, so it didn't do well in chat; it was a base model. But the benchmark numbers were real.

irjustin
2 replies
13h49m

I'm pretty naive, so please forgive me if this is a stupid question.

To me, what the parent comment is saying is that even though the benchmarks are cool, it's not super helpful to the everyday person. Because if you can't chat with it very well (even in a narrow context), what utility does it have, great benchmarks or not?

svnt
1 replies
13h17m

Both are saying the same thing: in order for the base model that is phi to perform well as a chat agent, it would need to be tuned for that purpose before its benchmark results could have real-world value.

imjonse
0 replies
12h37m

From this report. Phi-2 was not instruct tuned indeed.

"Our models went through post-training with both supervised instruction fine-tuning, and preference tuning with DPO. We have worked on generating and curating various instruction and preference data. This has improved the model chat capabilities, robustness, as well as its safety."

nl
0 replies
12h58m

I had a lot of issues trying to get Phi-2 to perform as well as the benchmarks indicated on non-chat tasks.

It felt a lot like it was overfitted to the exact types of tasks in the benchmarks (i.e., not a data leak), but if you tried something a bit off track it didn't know what to do. At the time my hypothesis was that the small model just didn't have the capacity to generalise well enough, but since then Gemma 2B has come out and seems to be OK.

So now I have no idea why, but yes: the benchmarks for Phi-2 didn't represent how it worked for me on real-world tasks where you'd expect it to be OK.

ankit219
1 replies
8h28m

Not trying to disparage them, but their models always give the feeling that they are overfitted on benchmarks, which is why they perform so well there. On everyday tasks they're much worse, whether chat or simple completion tasks.

Distilling can work and there are papers which suggest it does, but we still do not have a reliable mechanism which can distill knowledge from larger teacher models to smaller student models.

moffkalast
0 replies
3h12m

This was the case for Phi-2, it was notoriously rubbish in practical use.

spmurrayzzz
0 replies
4h31m

I don't think we can call it distillation, at least not in the conventional ML use of the word, since you're not interacting with the actual model architecture; specifically, you're not computing a loss between the predictions of the parent model and the distilled target model.

This is an important distinction when it comes to assessing model collapse risk, which is a risk I think has probably been overstated enough to this point that it's now being understated.
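For reference, conventional distillation in the Hinton et al. sense blends a hard-label loss with a KL term against the teacher's softened logits; a minimal PyTorch sketch (the temperature and weighting values are illustrative):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend the usual cross-entropy on hard labels with a KL term that pulls
    the student's distribution toward the teacher's softened distribution."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(soft_student, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```

Generating synthetic training text with GPT-4 never touches the teacher's logits like this, which is why I'd only call it distillation loosely.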

HarHarVeryFunny
0 replies
2h28m

The Chinchilla paper is about how to design/train a model to optimize use of computing power, which we can equate to cost (FLOPs cost dollars).

The question that Chinchilla tries to answer is: for a given training budget (which you can think of as dollars or FLOPs), what is the optimal trade off of model size and quantity of training data to get the most performant model? Build a large model and train with less data, or build a smaller one and train with more data?

However, another consideration is minimizing total lifetime cost of the model: training cost + inference cost. You could train a model for longer (costing more) in order to get a given level of performance from a smaller model that will be cheaper for inference, or vice versa. For any given projected model lifetime inference volume, there is going to be a different answer.

It's not that Chinchilla-optimal models stopped making sense, but rather that this sort of consideration has people willing to pump more money (tokens) into smaller models to reduce inference cost for that level of capability.
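The usual approximations make this easy to play with: training compute is roughly 6 FLOPs per parameter per token, and the Chinchilla-optimal data budget works out to roughly 20 tokens per parameter. A rough sketch (these constants are rules of thumb, not exact values from the paper):

```python
def train_flops(n_params, n_tokens):
    # Standard approximation: ~6 FLOPs per parameter per training token.
    return 6 * n_params * n_tokens

def chinchilla_optimal_tokens(n_params):
    # Rough rule of thumb from the Chinchilla paper: ~20 tokens per parameter.
    return 20 * n_params

# phi-3-mini: 3.8B params trained on 3.3T tokens, far beyond "compute-optimal"
n = 3.8e9
print(chinchilla_optimal_tokens(n) / 1e9)              # ~76B tokens would be compute-optimal
print(train_flops(n, 3.3e12) / train_flops(n, 76e9))   # ~43x the compute-optimal training budget
```

That extra training compute is spent precisely to buy a smaller, cheaper-to-serve model at a given capability level.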

brcmthrowaway
21 replies
15h3m

If I were Apple I'd be quaking in my boots. They are getting too far behind to ever catch up. Nokia in 2010 vibes.

thoughtegting
4 replies
14h52m

If I were Apple, I would be developing something in total secrecy and then release it ahead of the rest of the competition when people least expect it. Very big ifs, but Siri can be updated everywhere overnight, and I don't see them rushing into anything like this.

golergka
3 replies
14h47m

If I were Apple, I would just buy one of the major LLM companies. They have the cash.

bingbingbing777
2 replies
13h40m

They've been buying AI companies and have nothing to show for it.

golergka
0 replies
11h40m

Showing off work in progress is not really their thing.

elbear
0 replies
13h12m

Why do you think that is? Do you think their culture is an obstacle or is it something else?

seydor
2 replies
12h28m

Apple's advantage is that their devices are safeguarding people from the dangers of AI

talldayo
0 replies
1h54m

That's a very eloquent variation on the word "censorship"

Are you next going to tell us that the CIA's access to iCloud data protects their users from terrorism too?

fauigerzigerk
0 replies
9h27m

How so? And what dangers?

PedroBatista
2 replies
14h39m

They'll just do what they have been doing for ~20 years: wait, pick the "winner", polish the "user experience", call it Apple magic, and incorporate it into their product cycles.

Some day their playbook will become so mediocre that it won't stick anymore, but I think they are safe on this one, for now.

mirekrusin
0 replies
9h0m

Considering that experiments cost tens to hundreds of millions of dollars a pop, this may not be that bad a strategy.

fauigerzigerk
0 replies
9h28m

True for hardware, but their record on software is far less convincing.

vessenes
0 replies
14h39m

I don't think MS has a special sauce here, just a willingness to publish. To the extent that MS has disclosed the bulk of what they are doing with Phi, it's a combination of a really nice initial idea - "use written texts + GPT-4 to generate high-quality prompts where we know the answer is great because it's written down" - and engineering.

To me this is advancing the state of the art on the impact of data quality, but it doesn't look to me like the Phi series has some magical special sauce otherwise. Data quality and synthetic data creation are not magical moats that Apple can't cross.

I'll say too that I'm psyched to try Phi-3; the sweet spot for me is a model that can be a local coding assistant and still answer random Q&A questions with some sophistication. I'm skeptical that 3-8B parameter models will bring the high level of sophistication sometimes needed in this cycle; there's still a very large gap with the larger models in daily use, despite some often-close benchmark scores.

Anyway, Apple-Phi-3 is in no way an impossibility.

oersted
0 replies
14h48m

If anything this is good for them. Apple's play here has always been getting their devices ready for running LLMs locally. This makes it way easier.

moralestapia
0 replies
14h55m

I don't recall Nokia being a 3 trillion dollar company. Your vibes may vary, though.

esafak
0 replies
15h0m

Did they ever claim to be a powerhouse in foundation models? Did your MacBook or iPhone become obsolete or stop working? They use the models, they don't release them because they don't hoard data.

ec109685
0 replies
14h47m

Eh, I think it’s showing that this class of model is becoming commoditized given there is a new one launching every week.

bt1a
0 replies
14h23m

I tore my hair out developing a SwiftUI app that could run llama.cpp and whisper.cpp simultaneously. I was eventually able to run a Q3_K Mistral 7B along with a smaller Whisper model, but grinding through Xcode is a nightmare.

They're working on MLX, but it only recently got Swift bindings. They just don't have the DEVELOPERS DEVELOPERS DEVELOPERS coked-out attitude, I guess.

astrange
0 replies
13h33m

I think that when people release new interesting software products it's good for hardware companies.

WanderPanda
0 replies
14h36m

The opposite is the case: with all the advancements, even by doing nothing, Apple (like everyone, including hobbyists) is moving closer to the frontier. Hopefully this trend stays alive!

IncreasePosts
0 replies
14h44m

How exactly does publicized research lead to them not being able to catch up? I don't think anything in this paper is patentable.

Deverauxi
0 replies
14h38m

They have something like 140 billion dollars in cash.

They’ll be fine.

hackerlight
8 replies
15h2m

Fewer tokens than Llama 3 (3.3T vs 15T) yet a better outcome. No doubt more information-dense training data. The interesting thing is the use of synthetic data, which they don't talk about.

vessenes
5 replies
14h37m

Actually, the original Phi papers did talk about their synthetic data strategy, and it's very cool -- essentially inverting high-quality textbook text using GPT-4 to create prompts, where the textbooks supply the answers. There may be more undisclosed, but it remains in my mind as one of the best ideas of the last twelve months -- so smart, and interesting, and apparently, it works well.
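Something like this is roughly how I picture the pipeline; to be clear, the prompt wording and the pairing scheme are my guesses at the shape of the idea, not the paper's actual recipe (assumes the OpenAI Python client):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def passage_to_training_pair(passage):
    """Ask GPT-4 to write a question that the given textbook passage answers.
    The passage itself then serves as the 'gold' completion for a synthetic
    training example. Prompt wording is illustrative, not from the paper."""
    prompt = (
        "Write one self-contained question or exercise that the following "
        "passage fully answers. Return only the question.\n\n" + passage
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    question = resp.choices[0].message.content.strip()
    return {"prompt": question, "completion": passage}
```

The neat part is that the expensive model only has to write the question; the answer quality is anchored by the human-written text.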

torginus
1 replies
13h1m

Except everything that comes out of an LLM (like GPT4) is highly suspect (at least in my experience).

samus
0 replies
9h29m

1. They need it for style and language, not necessarily for the facts

2. Since GPT-4 is seen as the very best general-purpose LLM in existence, it makes sense to emulate its performance with fewer resources.

3. Phi models are also trained with other high-quality data

xarope
0 replies
14h21m

perhaps that's the best path forward? Text and reference books (hopefully unbiased) for answers, and web scraped data for conversational tone.

astrange
0 replies
13h29m

I feel like literal dictionaries would make good training data; wonder if any of them have done that. LLMs are good at faking so it's hard to tell by asking them.

YetAnotherNick
0 replies
12h38m

No, they don't use textbook text at all, despite the paper title. They just asked GPT-4 to generate "textbook quality" content, which doesn't even exactly look like a textbook.

minimaxir
1 replies
14h58m

Yes, "chinchilla optimal" is a meme, but 15T might turn out to be too many tokens.

wrsh07
0 replies
14h49m

My understanding from this tweet thread [1] is that Chinchilla probably overspecified some of the model's hyperparameters.

tl;dr I'm looking forward to having lots of models (ideally models) trained with a wide range of parameters to narrow down "what is actually optimal"

I think there is an interesting tradeoff of data quality and data volume, though

(Eg if we train with the highest quality 10% of our data, does the model improve if we use the other 90%? What if we increase our data size by 10x?)

[1] https://twitter.com/tamaybes/status/1780639257389904013

smartmic
4 replies
11h34m

Hm, around 84 authors on one "scientific" paper. I wonder if this says something about (a) the quality of its content, (b) the direction academic (?) paper publishing is heading, (c) nothing at all, or (d) something else entirely.

samus
0 replies
9h33m

It's a tech report. Fair enough to include the whole lab.

lysecret
0 replies
10h0m

It just means you need a big machine and a lot of capital to make advances. Take a look at any paper coming out of CERN.

a_bonobo
0 replies
11h25m

I have been on far larger author lists :) There's probably a whole team for the training data generation and assessment, a whole team for the safety assessment (section 4), that stuff adds up.

0cf8612b2e1e
0 replies
1h24m

You should see physics. Stuff involving the Large Hadron Collider can have pages of authors.

It costs so little to share the credit if someone was an asset.

minimaxir
0 replies
2h46m

And with an MIT license!

Patrick_Devine
0 replies
19m

And of course if you want to try it out locally, `ollama run phi3`.

simonw
1 replies
15h14m

I'm getting a bit skeptical of MMLU at this point. As far as I can tell it's a set of multiple choice questions that hasn't been updated since 2020. We have to trust the model providers not to deliberately or accidentally train on it for those scores to be useful.

minimaxir
0 replies
15h2m

At the least, there are multiple benchmarks noted in the paper (21!) and the results are consistent across all of them.

I'd trust Microsoft to do decontamination testing, although the paper doesn't explicitly mention it other than "The prompts and number of shots are part of a Microsoft internal tool to evaluate language models, and in particular we did no optimization to the pipeline for the phi-3 models."
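For context, published decontamination checks are usually just long n-gram overlap tests between training documents and benchmark items (the GPT-3 paper used 13-grams); a minimal sketch of the idea:

```python
def ngrams(text, n=13):
    """Return the set of word n-grams in a text. 13-grams are a commonly
    used threshold in published decontamination setups (e.g. GPT-3)."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(train_doc, benchmark_items, n=13):
    """Flag a training document if it shares any long n-gram with a benchmark item."""
    doc_grams = ngrams(train_doc, n)
    return any(doc_grams & ngrams(item, n) for item in benchmark_items)
```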

m3kw9
1 replies
4h52m

Phi-2 was useless for practical purposes, except if you wanted to show your friends that it can write a poem. Llama 3 8B is slightly better but still in the same category; it's complete trash at coding vs GPT-4. Llama 3 400B "iS OPen SoURce!", but no, you will need to pay to access it, because most people cannot practically afford an A100 and set it up properly.

What I'm trying to say is that user experience is now as key as model smarts, and these models that barely touch GPT-4 cannot beat OpenAI right now as a whole package.

azinman2
0 replies
2h15m

I just tried to give gpt4 a scrape of a menu page and asked it to reformat it to csv. It hallucinated several parts and missed others. Llama3-70b hasn’t done that. So far it’s been more reliable. You can run a quantized version on consumer hardware or pay significantly less ($10 vs $1 in, $30 vs $1 out) on hosted platforms.

whereismyacc
0 replies
9h20m

they're just spamming "weights or it didn't happen"

i mean, fair

Havoc
1 replies
5h42m

Both previous Phi models have been epic letdowns when I actually tried them myself, so I have quite low confidence in this being reflective of the real world. Will try it anyway though.

imjonse
0 replies
45m

Phi-3 is instruct tuned though so hopefully better.

visarga
0 replies
14h23m

This shows the power of synthetic content - 3.3 trillion tokens! This approach can make a model even smaller and more efficient than training on organic text, and it will not be able to regurgitate NYT articles because it hasn't seen any of them. This is how copyright infringement claims can be avoided.

ur-whale
0 replies
14h3m

That's a whole lot of Zhangs!

mythz
0 replies
13h57m

I'll believe it when I try it for myself; Phi-2 was the clear worst of the 20 LLMs we evaluated (it was also the smallest, so that was expected).

But it was also slow for its size, and generated the longest responses with the most hallucinations, as well as the most empty responses. It was also ranked as the model with the lowest-quality answers.

maximsicora
0 replies
9h32m

insane

blackoil
0 replies
14h55m

Has anyone used these or similar models with fine-tuning and RAG? How is the performance over a narrow domain for simple queries? Is it good enough for, say, an informational chatbot?