
Mistral CEO confirms 'leak' of new open source AI model nearing GPT4 performance

bsaul
72 replies
22h34m

GPT-4 has been out for almost a year now, and it seems that the frantic pace of OpenAI releasing new groundbreaking tech every month has come to a halt. Does anyone know what's happening with OpenAI? Has the recent turmoil with sama caused a lag in the company? Or are they working on some superweapon?

brucethemoose2
19 replies
22h27m

It hasn't; there's just too much noise to see through.

We are getting a lot of great models out of China in particular (Yi, Qwen, InternLM, ChatGLM), and some good continuations like Solar.

Lots of amazing papers on architectures and long context are coming out.

Backends are going crazy. Outlines is shoving constrained generation everywhere, and LoRAX is an LLM revelation as far as I'm concerned.

But you won't hear about any of this on Twitter/HN. Pretty much the only thing people tweet about is vllm/llama.cpp and llama/mistral, but there's a lot more out there than that.
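To make "constrained generation" concrete: the core trick is masking the model's logits so that only tokens permitted by a grammar or choice set can be sampled. A toy sketch follows; the vocabulary and logits are made up, and Outlines' real machinery compiles regexes and JSON schemas into these allowed-token sets.

    import numpy as np

    vocab = ["yes", "no", "maybe", "banana"]
    allowed = {"yes", "no"}                    # constraint: answer must be yes/no

    logits = np.array([1.2, 0.7, 2.5, 3.1])   # pretend model output over vocab

    # Disallowed tokens get -inf, so softmax assigns them zero probability.
    mask = np.array([0.0 if t in allowed else -np.inf for t in vocab])
    probs = np.exp(logits + mask)
    probs /= probs.sum()

    print(vocab[int(np.argmax(probs))])        # always "yes" or "no"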

Nick87633
7 replies
22h24m

Where do you like to keep up to date on these? Arxiv preprints, or some other place?

markab21
5 replies
22h6m

For Llama-based progress - Reddit - /r/LocalLlama has been my top source of info, although it's been getting a little more noisy lately.

I also hang out on a few Discord servers:

- Nous Research
- TogetherAI / Fireworks / Openrouter
- LangChain
- TheBloke AI
- Mistral AI

These, along with a couple of newsletters, basically keep a pulse on things.

brucethemoose2
3 replies
21h39m

Lots of interesting information is fragmented across niche Discords. For instance: KoboldAI for merging and RP models in general, Llama-Index, VoltaML and some others for SD optimization. I could go on and on, and I know only a tiny fraction of the useful AI Discords.

And yeah, /r/LocalLlama seems to be getting noisier.

TBH I just follow people and discuss stuff on Hugging Face directly now. It's not great, but at least it's not Discord.

cyanydeez
2 replies
21h31m

Surprised someone doesn't just build an AI aggregator for this type of thing; seems like a really valuable product.

brucethemoose2
0 replies
21h19m

They have! And posted them on HN!

Some are pretty good! Check out this little curated nugget: https://llm-tracker.info/

I used to follow one with a UI that resembled HN itself, but now I can't find it in my bookmarks, lol.

DANmode
0 replies
21h20m

Those rooms move too fast, and often are segregated good/better/best (meaning the deeper you want to go on a topic, the "harder" it is, politically and labor-wise, to get invited to the server).

TeMPOraL
0 replies
19h55m

Speaking of hard skills: how does one just hang out on a Discord server in any useful fashion? I lost the ability to deal with group chats when I started working full-time; there's no way I can focus on the job and keep track of conversations happening on some IRC or Discord. I wonder what the trick is to using those things as a source of information, other than "be a teenager, student, sysadmin, or otherwise someone with lots of spare time at work", which is what I realized the communities I used to be part of consisted of.

ipaddr
4 replies
22h12m

Submissions on HN are welcome.

brucethemoose2
1 replies
21h43m

I have submitted some in the past. Others are submitting them! And I upvote every one I like in /new. But HNers don't really seem interested unless it's llama.cpp or Mistral, and I don't want to spam.

I can't say I blame them either, there is a lot of insane crypto-like fraud in the LLM/GenAI space. I watch the space like a hawk... and I couldn't even tell you how to filter it, it's a combination of self-training from experience and just downloading and testing stuff myself.

tmaly
0 replies
21h33m

I see a ton of papers on X related to this space.

rightbyte
0 replies
21h12m

I did try to make a submission of what I thought was an "underreported" LLM two months ago.

https://news.ycombinator.com/item?id=38505986

Zero interest for some reason.

Edit: Deepseek coder has 4 submissions to HN with almost zero interest.

cyanydeez
0 replies
21h32m

submissions << public interest

alchemist1e9
3 replies
21h39m

LoRAX [0] does sound super helpful and so I’d be curious if there are some good examples of people applying it. What are some current working deployments where one has 100s or 1000s of LoRA fine tuned models? I guess I can make up stuff that makes sense, so that’s not really what I’m asking, I’m interested in learning about any known deployments and example setups.

[0] https://github.com/predibase/lorax

brucethemoose2
2 replies
21h34m

There aren't really any I know of, because it's brand new and everyone just uses vLLM :P

No one knows about it! Which is ridiculous, because batched requests with LoRAs are mind-blowing! Just like many other awesome backends: InternLM's backend, LiteLLM, Outlines' vLLM fork, Aphrodite, exllamav2 batching servers and such. Heck, a lot of trainers don't even publish the LoRAs they merge into base models.

Personally, we are waiting on the integration with constrained grammar before swapping to LoRAX. Then I am going to add exl2 quantization support myself... I hope.
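For readers wondering why batched requests with LoRAs matter: each LoRA adapter is just a low-rank delta on top of shared base weights, so hundreds of fine-tunes can be served from a single copy of the base model. A toy numpy sketch; the shapes, names, and two-customer setup are made up, and LoRAX's real batching is far more involved:

    import numpy as np

    d, r = 1024, 16                          # hidden size, LoRA rank (toy values)
    W = np.random.randn(d, d) * 0.01         # shared base weight, loaded once

    # Two hypothetical customers, each with their own tiny (A, B) adapter.
    adapters = {
        "cust_a": (np.random.randn(r, d) * 0.01, np.random.randn(d, r) * 0.01),
        "cust_b": (np.random.randn(r, d) * 0.01, np.random.randn(d, r) * 0.01),
    }

    def forward(x, adapter_id):
        A, B = adapters[adapter_id]
        # The big W @ x matmul is shared across every request in a batch;
        # only the cheap low-rank correction differs per request.
        return W @ x + B @ (A @ x)

    x = np.random.randn(d)
    print(forward(x, "cust_a").shape, forward(x, "cust_b").shape)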

semmulder
1 replies
20h34m

FYI, vLLM also just added experimental multi-lora support: https://github.com/vllm-project/vllm/releases/tag/v0.3.0

Also check out the new prefix caching, I see huge potential for batch processing purposes there!

brucethemoose2
0 replies
18h37m

Missed this, thanks.

Everything is moving so fast!

leereeves
1 replies
22h9m

Very interesting, but the comment you replied to was specifically asking about OpenAI.

brucethemoose2
0 replies
21h17m

Yeah, I misinterpreted "open AI" as "open source AI".

TBH I do not follow OpenAI much. I like my personal models local, and my workplace likes their models local as well.

rgbrgb
11 replies
22h28m

I will speculate! They have a model that far surpasses GPT-4 (achieved AGI internally), but sama is back on a handshake agreement that they will only reveal models that are slightly ahead of openly available LLMs. Their reasoning being that releasing a proprietary model endpoint slowly leaks the model's advantage as competitors use it to generate training data.

anon291
8 replies
22h21m

WTF is AGI?

EDIT: Clearly the point is lost on the repliers. There is no general understanding of what 'general intelligence' is. By many metrics, ChatGPT already has it. It can answer basic questions about general topics. What more needs to be done? Refinement, sure, but transformer-based models have all the qualifications to be 'general intelligence' at this point. The responses are more coherent than many people I've spoken with.

wtetzner
1 replies
22h1m
anon291
0 replies
21h22m

The problem with this definition 'an agent that can do tasks animals or humans can perform' is that it's not clear what that would look like. If you produce a system that can be interacted with via text input only but is otherwise capable of doing everything a human can do in terms of information processing, is that AGI? Or does AGI imply a human-like embodied form? Why?

timeon
1 replies
22h12m

Any curve-fitting is now AI, so they had to come up with a new term.

anon291
0 replies
21h25m

When you take a university / school course, how is that functionally different from curve fitting? Given that arbitrarily complex states can be modeled as high-dimensional curves, all learning is clearly curve fitting, whether in humans, machines, or even at the abiological level (for example, self-optimizing processes like natural selection). Even quantum phenomena are -- at the end of the day -- curve fitting via gradient descent (hopefully it's had enough time to settle at a global minimum!)

ebb_earl_co
1 replies
22h16m

Artificial General Intelligence

anon291
0 replies
21h25m

A meaningless term.

Xirgil
1 replies
22h16m

Artificial General Intelligence. OpenAI defines it as "highly autonomous systems that outperform humans at most economically valuable work"

anon291
0 replies
21h25m

Okay, well, I'll say that at least that's a definition (it has to be good enough at something to make money). Arguably, of course, it already does that. Me personally, I've used it to automate tasks I would have previously shelled out to Fiverr, Upwork, and Mechanical Turk. I've had great success using it to summarize municipal codes from very, very lengthy documents down to concise explanations of the relevant pieces of information. Since I would have previously paid people to do that (I was running an information service for a friend), I would consider that AGI. I guess the catch here is the word 'most', but that implies a lot of knowledge about the economy I don't think OpenAI has. What is 'economically valuable'? Who decides?

At the end of the day, as with most things, AGI is a meaningless term because no one knows what that is.

londons_explore
1 replies
22h21m

> competitors use it to generate training data.

I wouldn't think this matters as long as you charge enough to use that API. For example, you could have a tiered pricing structure where the first 100k words per month generated costs $0.001 per word, but after that it costs $0.01 per word.
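The arithmetic of that proposed tier, as a quick sketch; the quota and rates are the ones suggested above, not anything OpenAI actually charges:

    def monthly_cost(words, cheap_quota=100_000,
                     cheap_rate=0.001, expensive_rate=0.01):
        # First `cheap_quota` words at the low rate, everything beyond at 10x.
        cheap = min(words, cheap_quota)
        return cheap * cheap_rate + max(0, words - cheap_quota) * expensive_rate

    print(monthly_cost(100_000))     # $100.00    -- typical user
    print(monthly_cost(10_000_000))  # $99,100.00 -- dataset-harvesting competitor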

int_19h
0 replies
15h59m

Even then it's still a lucrative proposition, since you only need to generate the dataset once, and then it can be reused.

This kind of pricing would also make it much less compelling for other users. 100k tokens is nothing when you're doing summarization of large docs, for example.

artninja1988
7 replies
22h33m

Probably getting bogged down by endless safety testing and paperwork by now

harmmonica
6 replies
22h22m

Maybe tangential to this comment, but I had been using 3.5 to write contracts, but a couple of weeks back, even when using the exact same prompts that would've worked in the past, 3.5 started saying, to paraphrase, "for any contracts you need to get a lawyer."

Anyone else had this experience? It seems like they're actually locking down some very helpful use cases, which maybe falls into the "safety" category or, more cynically, in the "we don't want to be sued" category.

transcriptase
1 replies
22h6m

As they tack on more and more guardrails, it becomes lazier and less helpful. One way it gets around admitting that it has been instructed not to help you is by directing you to consult a human expert.

I also suspect there’s some dynamic laziness parameter that’s used to counteract increased load on their servers. You can ask the same prompt over and over in a new chat throughout the day and suddenly instead of writing the code you asked for or completing a task, it will do a small part of the work with “add the code to do xyz here” or explain the steps required to complete a task instead of completing it. It happens with v4 as well.

TeMPOraL
0 replies
19h46m

> some dynamic laziness parameter that’s used to counteract increased load

That's a brilliant risk mitigation mechanism! The AI won't recursively self-improve to superhuman levels if it just keeps getting tired of thinking.

HeatrayEnjoyer
1 replies
21h40m

Using it for legal matters is (understandably) explicitly against the usage policy. It's not meant to be used for legal advice, so the response you're receiving is accurate: go get a lawyer.

https://openai.com/policies/usage-policies

TeMPOraL
0 replies
19h43m

It's a toy, it's not meant to be used for anything actually useful - they'll keep nerfing it case by case, because letting users do useful stuff with their models is either leaving money on the table, taking an increased PR/legal risk, or both.

Baeocystin
1 replies
21h47m

Add 'reducing compute use as much as possible' to the list, I'm sure. I can look at some of my saved conversations from months ago and feel wistful for what I used to be able to do. The Nerfening has not been subtle.

harmmonica
0 replies
19h23m

Ha, yeah, that's the right description of my feeling. Like, thanks for being so incredibly useful, ChatGPT, saving me hours of copying and pasting bits and pieces from countless Google searches for "sample warranty deed," only to pull the rug out when I wanted to start depending on your "greatness."

I guess, as one of the other replies here said, it was never "allowed" to give you text for a contract according to the TOS, but then it would've been best if it never replied so effectively to my earlier prompts. Taking it away just seems lame.

Edit: bad typing

TheCaptain4815
5 replies
22h13m

GPT4 came out 3 years after GPT3

declaredapple
4 replies
22h9m

Slight knitpick

GPT 3.5 came out 2 years after GPT 3

GPT 4 came out 1 year after 3.5

beAbU
2 replies
21h45m

Slight nitpick, but "nitpick" is not spelled with a "k".

mring33621
0 replies
21h36m

knitpic

kjreact
0 replies
21h20m

Actually nitpick IS spelled with a “k”, just not one at the beginning of the word. If we’re gonna be pedantic details matter!

int_19h
0 replies
16h1m

As we now know, GPT-3.5 was essentially an early preview of the "move fast, break things" variety - GPT-4 was already well underway by then.

onlyrealcuzzo
4 replies
22h14m

Law of diminishing returns. The low-hanging fruit has been picked.

> But the company’s CEO, Sam Altman, says further progress will not come from making models bigger. “I think we're at the end of the era where it's going to be these, like, giant, giant models,” he told an audience at an event held at MIT late last week. “We'll make them better in other ways.”

https://www.wired.com/story/openai-ceo-sam-altman-the-age-of...

moffkalast
2 replies
21h59m

Given how they've struggled to make even GPT-4 economical, it's highly unlikely that larger models would be in any way cost-effective for wide use.

And there's certainly more to be found in training longer on better datasets and adjusting the architecture.

int_19h
1 replies
16h3m

It's important to remember that "not cost-effective" is not the same as "not useful", though.

moffkalast
0 replies
9h11m

Depends on how you look at it, I guess. If it's really too expensive and slow to run for end users, but is able to, say, generate perfect synthetic datasets over time, which can then be used to train smaller and faster models that pay back the original cost, then it is cost-effective, just in a different application.

stavros
0 replies
18h58m

I don't think that's what that means. I think he means "we don't have to keep making models larger, because we've found ways to make very good models that are also small".

shmatt
2 replies
21h48m

Probably an unpopular opinion:

A probabilistic word generator is still a word generator. The next version might be slightly better; we've seen it get worse. It's more of the same.

We can talk about AGI all we want, but it won't be built on the same technology as these word generators. There will have to be a technological breakthrough years before we even get close.

The companies focusing on LLMs right now are dealing with

* Generate better words

* Make it cheaper to generate (use less compute)

* Find better training material

There is a ton of money to be made, but it's still more of the same.

vanviegen
1 replies
21h19m

And humans are just human generators.

In both cases, intelligence appears to be just an interesting emergent side effect.

discreteevent
0 replies
20h44m

1996, Deep Blue beats Garry Kasparov: "And humans are just chess engines. In both cases intelligence appears to be just an interesting side effect"

pstorm
2 replies
22h22m

A few theories:

- They have better stuff, but aren't releasing it yet. They're at the top already; it makes sense they would hold their cards until someone gets close.

- They are more focused on AGI, and not letting themselves get side tracked with the LLM race

- LLMs have peaked and they don't want to release only a minor improvement.

declaredapple
0 replies
22h15m

> They are more focused on AGI, and not letting themselves get side tracked with the LLM race

FWIW OpenAI seems to have a corroded definition for AGI that is essentially "[An] AI system generally smarter than humans".

They don't seem to use the typical definition I'm used to of some variation of autonomy or (pseudo)-sentience.

So their LLM race is the race for AGI

JumpCrisscross
0 replies
22h11m

> have better stuff, but aren't releasing it yet

This hypothesis has a curious habit of surfacing when OpenAI is fundraising. Together with the world-ending potential of their complete-the-sentence kit.

crotchfire
2 replies
22h18m

> Does anyone know what's happening with OpenAI?

Yeah, they're an arms dealer now.

moffkalast
0 replies
21h55m

LoRA of war.

colordrops
0 replies
22h1m

I sometimes wonder if they hit a limit the government was comfortable with and their more advanced technologies are only available to the gov. I assume that's your implication.

jsnell
1 replies
22h15m

Maybe there are diminishing returns on improving quality, so they're trying to improve efficiency at a given quality level instead? There is a lot of value in producing results instantly, but more importantly, efficiency will let them serve more users (everyone is bottlenecked on inference capacity) and gain more users by winning on pricing.

rudasn
0 replies
20h25m

Yup, I think this is it. Economies of scale haven't kicked in yet, and if they continue on the same path, it doesn't look achievable.

Thinking differently is again the way forward.

YetAnotherNick
1 replies
22h17m

GPT-4 is so far ahead of everything else that it doesn't make much sense to rush a GPT-5 release. They can pick up the extra GPT-5 customers whenever they release it, as they're pretty sure no one else will take those users away.

nickthegreek
0 replies
22h15m

With a higher operating cost as well.

yieldcrv
0 replies
22h0m

They and Meta have bought up like all of Nvidia's production capacity to run and train their next models.

msp26
0 replies
22h14m

Alignment tax is real.

But they probably have something more powerful internally. GPT-4 took months to be available to the public.

m3kw9
0 replies
22h22m

Given the way AI is trained and inferenced, there's likely a huge constraint on the GPU side even if they had GPT-5 ready. Imagine how slow it would be, which means they focus on creating useful products and APIs around what they have first.

johnfn
0 replies
21h55m

I know, it's really disappointing. They've only completely changed the world like 3 times last year.

infecto
0 replies
22h26m

Has it come to a halt? I guess looking only at it from the perspective of no GPT-5... then yes? But I see wide access to multi-modal, huge increases to throughput, and not much latency these days. Better turbo models too, though I would agree that in some areas those have been steps back with poorer output quality. Adding to that list, massive cost reductions, so it's easy to justify using any of the models.

ilaksh
0 replies
21h36m

I'm sorry but this comment seems to be incredibly uninformed. OpenAI just released a new version of gpt-4 turbo quite recently. And just in the last couple of days they released the ability to use @ to bring any GPT into a conversation on the fly.

hdhshxhsvc
0 replies
22h30m

Check again in two months

a_wild_dandan
42 replies
21h53m

Guess I'll watch TheBloke's page until I can run his Miqu Q5 quant on my MacBook. Mixtral is my daily driver, and if this (or the newer, official) release nears GPT-4, that's a wrap for my OpenAI subscription.

The small team at Mistral is putting their competitors to shame. They're what "Open"AI should've been.

brucethemoose2
17 replies
21h21m

The GGML quants are the only quantization we have, lol. They were leaked as Q2K/Q4KM/Q5KM; you can grab them right now.

What's interesting is that Mistral apparently distributed these GGUFs. This is (in my experience) not a good format for production, so I am curious exactly who wanted to test a model in GGUF.
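For anyone who grabbed one of those GGUF quants and wants to poke at it from code rather than the llama.cpp CLI, a minimal sketch using the llama-cpp-python bindings; the file name and parameters are placeholders for whichever quant you downloaded:

    from llama_cpp import Llama  # pip install llama-cpp-python

    llm = Llama(
        model_path="./miqu-1-70b.q5_K_M.gguf",  # placeholder path to your quant
        n_ctx=4096,                             # context window to allocate
        n_gpu_layers=-1,                        # offload as many layers as fit
    )

    out = llm("Q: What is the GGUF format? A:", max_tokens=64)
    print(out["choices"][0]["text"])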

whatwhaaaaat
15 replies
21h17m

It's by far the easiest format to consume and run, no? Spans devices easily. No separate weight files or extra stuff. For "production" maybe not, but to get it into the hands of the masses this seems perfect.

brucethemoose2
14 replies
21h9m

Yeah, it's by far the least trouble. Pretty much any other backend, even a "pytorch free" backend like MLC, is an utter nightmare to install, and that's if it uses a standardized quantization.

However, the llama.cpp server is... very buggy. The OpenAI endpoint doesn't work. It hangs and crashes constantly. I don't see how anyone could have used it for batched production as of last November/December.

The reason I don't use llama.cpp personally is no flash attention (yet) and no 8-bit KV cache, so it's not too great at long (32K+) contexts. But this is a niche, and it's being addressed.

samstave
7 replies
20h59m

What do you need 32K contexts for? Use case examples, please.

From [0], @sdo72 writes:

> ...32k tokens: 3/4 of 32k is 24k words; each page averages 500 (0.5k) words, so that's basically 24k / 0.5k = 24 x 2 = ~48 pages...

[0] https://news.ycombinator.com/item?id=35841460

EDIT: I may be ignorant: does the 32k mean its output context, its single-conversation attention span, or how much it can ingest in a prompt?

refulgentis
2 replies
20h50m

32K tokens = "context size" = sum of input tokens + max output tokens
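In budgeting terms, that definition is just an inequality; a toy sketch (real token counts come from the model's tokenizer):

    def request_fits(context_size, input_tokens, max_output_tokens):
        # A request fits only if the prompt plus the reserved output
        # budget stays inside the model's context window.
        return input_tokens + max_output_tokens <= context_size

    print(request_fits(32_000, input_tokens=30_000, max_output_tokens=1_000))  # True
    print(request_fits(32_000, input_tokens=31_500, max_output_tokens=1_000))  # False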

samstave
1 replies
20h48m

Thank you. So you want the most efficient small input tokens for the max output tokens, or at least the best answer with the smallest number of tokens used, but with enough headroom that it won't lose context and start down the hallucination path?

refulgentis
0 replies
20h39m

Going to be extremely opinionated and straightforward here in service of being concise, please excuse me if it sounds rough or wrong, feel free to follow-up:

You're "not even wrong", in that you don't really need to worry about ratio of input to output, or worry about inducing hallucinations.

I feel like things went generally off-track once people in the ecosystem turned RAG into these weird multi-stage diagrams when really it's just "hey, the model doesn't know everything, we should probably give it web pages / documents with info in it"

I think virtually all people hacking on this stuff daily would quietly admit that the large context sizes don't seem to be transformative. Like, I thought it meant I could throw a whole textbook in and get incredibly rich detailed answers to questions. But it doesn't. It still sort of talks the way it talks, but obviously now it has a lot more information to work with.

Thinking out loud: maybe the way I think about it is the base weights are lossy and unreliable. But, if the information is in the context, that is "lossless". The only time I see it gets things wrong is when the information itself is formatted weird.

All that to say, in practice, I don't see much gains in *question-answering* when I provide > 4K tokens.

But, the large context sizes are still nice because A) I don't need to worry as much about losing previous messages / pushing out history when I add documents as when it was just 4K for ChatGPT. B) It's really nice for stuff like information extraction, ex. I can give it a USMLE PDF and have it extract Q+A without having to batch it into like 30 separate queries and reassemble. C) There are some obvious cases where the long context length helps, ex. if you know for sure a 100-page document has some very specific info in it, you're looking for a specific answer, and you just don't wanna look it up again. Perfect!

* I've been working on "Siri/Google Assistant but cross platform and on LLMs", RAG + local + on all platforms + sync engine for about a [REDACTED]. It can nail ~every question at a high level, modulo my MD friend needs to use GPT-4 for that to happen. The failures I see are if I ask "what's the lakers next game", and my web page => text algo can't do much with tables, so it's formatted in a way that causes it to error.

fragmede
2 replies
20h55m

Feed it my source code to use as context for how to refactor things. The bigger the context window, the more source code it can "read".

samstave
1 replies
20h49m

Thanks. Curious, as I haven't seen this asked in a forum:

Have you found a particular manner in which to feed it in? Do you give it instructions for what it is looking for? What kinds of phrases are you directing it with?

I am about to start a GPT co-piloted effort, and I have only done art so far, so I'm curious if there are good pointers on coding with GPT.

gryn
0 replies
19h42m

continue.dev works great for me: it supports VS Code and JetBrains IDEs, there are shortcuts to give it code snippets as context, and it does in-place editing. It works with all kinds of LLM sources, both GPT and local stuff.

Still haven't found anything that can read a whole project's source code in a single click though.

brucethemoose2
0 replies
18h56m

Analyzing huge documents and info dumps.

Co-writing a long story with the model so it references past plot points.

The story writing in particular is just something you can't possibly do well with RAG. You'd be surprised how well LLMs can "understand" a mega context and grasp events, implications and themes from it.

speedgoose
3 replies
20h49m

Have you tried ollama? It’s another llama.cpp server implementation that is becoming popular.

gbickford
1 replies
20h38m

The authors don't seem to care about the principle of least privilege: https://github.com/ollama/ollama/issues/851#issuecomment-177...

It makes me wonder what other security issues they might not care about.

smoothjazz
0 replies
20h27m

This is a mac problem, not an ollama problem. It also sounds like it's solved by using homebrew (or linux).

brucethemoose2
0 replies
18h58m

(AFAIK) it does not support batching, and some other server-focused features.

I was talking more about a high-throughput server, which ollama, or llama.cpp in general, is not appropriate for.

declaredapple
1 replies
20h29m

Just to throw out there - I haven't had any issues with exllama personally, and it's a lot faster last I checked.

brucethemoose2
0 replies
18h39m

Yeah, this is what I use as well. The arbitrary quantization is just great, as is everything else.

It's not easy to install though, and it's big dGPU only.

moffkalast
0 replies
9h7m

In production you'd need enough GPUs to load the entire thing so it serves fast enough, so it makes sense to go with AWQ or EXL2 instead, since they're optimized for that. But I think they were sending it out for people to just test and evaluate, probably on less capable machines, in which case GGUF is king.

karmasimida
7 replies
21h10m

Near GPT-4 is definitely a stretch.

The hype around Mixtral is huge, and my disappointment follows; it doesn't have very good knowledge of books, for example.

avereveard
5 replies
20h47m

That is a good thing: you want the model to process language; knowledge is a side effect of how it gets there. If you rely on trained-in knowledge, it's going to be hard to know when you pass the boundary into hallucination. What you want instead is a model that can approximate reasoning while supporting tools that inject knowledge from the world into the context as it iterates toward an answer.

karmasimida
3 replies
20h12m

But here is the thing: if you rely on RAG or any other knowledge injection for recommendation, you essentially make your LLM a parrot machine, no better than a search engine, just much slower.

Knowledge is one aspect of it; I found its instruction-following ability frustrating as well.

I think Ilya Sutskever puts it very well: a larger model brings stability to a wider range of tasks, which smaller models are not capable of. Book recommendation with GPT might be a niche, but it is not something really unexpected TBH; thus my disappointment.

chasd00
1 replies
19h56m

> But here is the thing: if you rely on RAG or any other knowledge injection for recommendation, you essentially make your LLM a parrot machine, no better than a search engine, just much slower.

This is a really good point. I was working on some careful Q/A data curation today that is then fed to a vector store where embeddings are calculated and served. I realized that my carefully curated Q/A data, in combination with the vector database, works just fine for what I want to do, all by itself. A really good semantic search of my Q/A database turns up answers to my questions with no RAG LLM prompting required. When I added the LLM it just put the same information in different words; not super useful when I could have just looked at the returned embeddings and gotten the same information.
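A toy version of that embeddings-only search, for the curious; the documents and vectors are made up, and in practice the vectors would come from an embedding model:

    import numpy as np

    docs = ["How do I reset my password?", "What is the refund policy?"]
    doc_vecs = np.random.randn(len(docs), 8)                   # stand-in embeddings
    doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)

    def search(query_vec, top_k=1):
        query_vec = query_vec / np.linalg.norm(query_vec)
        scores = doc_vecs @ query_vec                          # cosine similarity
        return [docs[i] for i in np.argsort(-scores)[:top_k]]

    print(search(np.random.randn(8)))  # nearest Q/A entry, no LLM involved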

a_wild_dandan
0 replies
17h22m

It sounds like you used the wrong tool for the job, then blamed the tool.

avereveard
0 replies
14h13m

You can still leverage the LLM's ability to understand language to do better than search.

You can use it to refine or expand your search terms; you can ask the LLM to ask you questions about what book you'd like next and convert your answers into genre/theme/setting/mood to feed into the next step of the pipeline; you can tell the agent to narrow search results by asking you partitioning questions from the list of books the search returned, instead of just doing a dry ranking.

There is so much RAG can do if you use the LLM part properly. If the RAG pipeline goes question > embedding > search > summarization, then yeah, you're getting an expensive slow parrot no better than just searching. But that is the baseline RAG that uninspired consultants describe in blogs in a scramble to position themselves as "experts" in the new market.

fennecbutt
0 replies
3h40m

Yes, this is exactly the right direction for LLMs. Embedding all the world's knowledge into the model itself is just a bad idea; training should only be used to build reasoning and logic skills.

terhechte
0 replies
20h56m

It's a much smaller model; it didn't ingest the knowledge of the world. It is great at working with the information you give it, but if you want to extract information like a search engine, GPT-4 is the king because it is so much bigger.

htrp
6 replies
21h12m

> The small team at Mistral is putting their competitors to shame.

Them plus 500 million in funding

tomp
2 replies
20h50m

so like 10x less funding than OpenAI and 100x less than MSFT, AAPL, Google, Facebook, ...

jtonz
1 replies
20h34m

I think it's fair to say that when you hit the hundreds-of-millions-of-dollars mark, the diminishing returns on making things happen faster have well and truly kicked in.

Perhaps the only benefit would be extra computational power, yet I struggle to see the benefit of jumping from 500 million to 5 billion on such short timeframes.

Zuiii
0 replies
14h48m

The ability to truly not think about training-run costs, and to throw random things at the wall to see what sticks. 10x resources is definitely a competitive advantage in LLM training.

api
2 replies
20h55m

Much of the cost is probably compute.

refulgentis
1 replies
20h51m

Very unlikely: note that the best GPT-4 estimates put it at $45 million.

a_wild_dandan
0 replies
19h54m

Altman said that GPT-4 training was over $100 million. And you need significant additional resources beside just the training run cost.

MallocVoidstar
5 replies
21h42m

Full weights aren't available, only Q2, Q4, Q5 quants via miqudev.

throwaway9274
4 replies
21h31m

Unquantized model is here: https://huggingface.co/152334H/miqu-1-70b-sf

This strikes me as less a leak and more clever marketing from Mistral.

MallocVoidstar
3 replies
21h25m

That isn't unquantized, it's de-quantized. They went from Q5 to fp16 for use in PyTorch instead of the GGUF ecosystem.
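Why the de-quantized fp16 weights aren't the originals: quantization rounds, and casting back up to float can't undo the rounding. A toy round-trip below, using naive per-tensor symmetric integer quantization (llama.cpp's K-quants are block-wise and cleverer, but lossy all the same):

    import numpy as np

    w = np.random.randn(8).astype(np.float32)  # pretend original float weights

    bits = 5
    levels = 2 ** (bits - 1) - 1               # signed integer range for 5 bits
    scale = np.abs(w).max() / levels

    q = np.round(w / scale).astype(np.int8)    # quantize: information lost here
    w_back = (q * scale).astype(np.float32)    # "de-quantize" back to float

    print(np.abs(w - w_back).max())            # nonzero: the rounding error stays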

Taek
1 replies
21h21m

I never thought people would be upscaling models by increasing quantization precision. The rationale makes sense, but it's also a goofy outcome.

nullc
0 replies
17h16m

You should be able to upscale and fine tune to recover performance, I suppose!

Clearly we should train a diffusion model to denoise the weights of LLM transformer models. Yo dawg.

throwaway9274
0 replies
20h38m

Yes, that’s correct. Good correction.

rmbyrro
0 replies
19h18m

I'm sure their competitors aren't shamed, as they shouldn't be. OpenAI, especially, has done tremendously good work. If it wasn't for them, perhaps Mistral wouldn't be news today?

Competitors are highly motivated to improve.

This is the good side of free markets, when they work well. And why the fearmongering calling for rent-seeking regulation of AI is shameful.

lukasm
0 replies
6h45m

Which model is the best to run on M3 max ATM?

SparkyMcUnicorn
0 replies
21h34m

Looks like the leaked model already comes as quants in GGUF format, so no need to wait for TheBloke.

https://huggingface.co/miqudev/miqu-1-70b/tree/main

mysteria
38 replies
22h30m

Why is this being called an open source model? This is a proprietary model that has been leaked on the internet, and will remain so until Mistral releases it officially.

Like Llama 1 they won't care about personal use, but no corp is going to touch this.

snovv_crash
19 replies
22h4m

The weights aren't created by humans so they aren't copyrightable - at least in theory.

monocasa
7 replies
22h2m

That has nothing to do with whether it's open source or not.

It wasn't certain that software was copyrightable in the US before 1976, but there was plenty of closed-source code that simply only ever distributed binaries.

epistasis
6 replies
21h36m

It's open source in the sense that you can take the existing weights, apply any one of dozens of techniques for adapting the model to your data, and anybody you give your model to can do the same.

This right isn't protected by copyright, it's protected by the lack of copyright, and how executables are directly modifiable. So the AGPL won't be possible, but it's a lot like permissive licenses, without attribution.

monocasa
4 replies
21h33m

That's like saying an executable that I never gave out the source to is "open source" because there's nothing stopping you from binary patching it.

Also, the license to the model assumes copyright (Apache 2) and works by granting you permissions based on that copyright.

epistasis
3 replies
21h23m

It's very very different from patching a binary, because these models don't work like a simple compilation of understandable source code into incomprehensible weights. The source training data is even less comprehensible and useful than its encapsulation into weights.

A set of weights is a pre-trained local minimum in the model space that is both executable and modifiable. It's usually far more useful than the source training data, because the work has been done.

monocasa
2 replies
21h10m

> It's very very different from patching a binary, because these models don't work like a simple compilation of understandable source code into incomprehensible weights. The source training data is even less comprehensible and useful than its encapsulation into weights.

Tons of binary patching works that way. For instance, from the GameShark days it was relatively common to have a patch that worked for unknown reasons, but whose discovery was tooling-assisted and simply produced a desired effect (and commonly a lot of undesired effects that weren't clear).

> A set of weights is a pre-trained local minimum in the model space that is both executable and modifiable.

None of that changes whether it's open source or not. I guarantee you that Mistral has some code (and a lot of data) lying around that created this model.

> It's usually far more useful than the source training data, because the work has been done.

For certain operations maybe. I guarantee that Mistral wouldn't constrain themselves to fine-tuning if they had a major change to make to the model.

epistasis
1 replies
16h1m

No, patching binaries is not really similar at all, because it's viewed as an inferior method to having the source code and modifying that.

With these large models, the weights really are quite useful objects in themselves, the very starting points of future modifications, and importantly the desired points of modifications

Practitioners in the field call these models "open source" and write open source software to run them and use them to power open source software.

monocasa
0 replies
12h33m

> No, patching binaries is not really similar at all, because it's viewed as an inferior method to having the source code and modifying that.

> With these large models, the weights really are quite useful objects in themselves, the very starting points of future modifications, and importantly the desired points of modifications

For fine tuning. You go back to the training pipeline to make major changes.

> Practitioners in the field call these models "open source"

Some do. In contrast to the accepted definitions of open source.

> and write open source software to run them and use them to power open source software.

There's plenty of open source code to run closed source binaries.

sroussey
0 replies
21h30m

It’s very likely protected by EU copyright.

Sayrus
6 replies
21h57m

Copyright is not the only kind of intellectual property. While the weights may not be copyrightable (I think you may be referring to judgements related to having an AI as author?), they clearly are Mistral AI's IP.

In the same way, data in a database are usually not copyrightable. They are still the company's property (excluding issues with PII and such).

greiskul
2 replies
21h43m

I'm not so sure. If it is not copyrightable, and it is not a patent, and not a trademark, which legal protection would it have? It could definitely be considered a trade secret, and the theft and misappropriation of it would be a crime. But once it has been published, people who just use the published version would not, I believe, be committing a crime, since it would lose its trade-secret status.

sroussey
0 replies
21h17m

In Europe you can copyright a database of facts, which is not something you can do in the US.

Also, people go to jail for leaking trade secrets, and this certainly qualifies.

Sharlin
0 replies
21h33m

There are certain "related rights" [1] that are weaker than full copyright. These include, maybe most importantly, performers' rights, but also protection for things like photographs (ones that aren't unique enough to qualify for full copyright) and, in many jurisdictions, databases. These aren't covered by the Berne convention so vary quite a bit. But whether a huge chunk of floating-point numbers qualifies for any legal protection, remains to be seen.

[1] https://en.wikipedia.org/wiki/Related_rights

bhickey
2 replies
21h46m

Database rights don't exist in the United States.

Their weights obviously aren't protected by trademark. So, what IP regime protects the weights?

Sayrus
1 replies
20h26m

Mistral AI is incorporated in Paris, France. IANAL, but I think these[1][2] may qualify to protect these weights.

While there is a discussion to be had about sovereignty and the international reach of local laws (such as the DMCA, for instance), I think it's disingenuous to consider only the US legal point of view.

[1] https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CEL... [2] https://www.legifrance.gouv.fr/loda/id/JORFTEXT000000573438/...

bhickey
0 replies
4h44m

> Mistral AI is incorporated in Paris, France.

No US court is going to invent new rights under US law because a company happens to be incorporated in a foreign jurisdiction.

> While there is a discussion to be had about sovereignty and the international reach of local laws (such as the DMCA, for instance)

The DMCA (and GDPR) do not have international reach. The DMCA only appears to apply internationally because it binds a whole series of American companies with an international presence. Meanwhile the GDPR protects European residents. A company dealing with these residents has crossed the border, so to speak.

> I think it's disingenuous to consider only the US legal point of view.

Candidly, you're either confused about what disingenuous means or you're being disingenuous yourself.

wongarsu
2 replies
21h21m

When I take a picture with my phone, those pixels weren't created by a human either. Yet I still own their copyright.

At the core, the question is how much artistic input there is in creating LLM models. Are the choices of model architecture, hyperparameters and training data artistic choices comparable to those of a photographer setting up a shot? Or are they more comparable to technical work that's only protectable by trade secrets, patents and trademarks?

int_19h
0 replies
16h23m

Training a model is more akin to setting a video camera somewhere and having it automatically record or not according to some algorithm you devise. The code of the algorithm would be copyrightable, of course, but the recording is another matter, since the "creativity" aspect is kinda missing there.

Q6T46nT668w6i3m
0 replies
20h38m

The parent is asking about weights and from that perspective, _stochastic approximation_ is essential.

shiandow
0 replies
21h31m

That does raise the question how irreversibly you have to process something before it is no longer protected by copyright.

Obviously zipping something is not enough; even lossy compression is (obviously) not enough. But then how does a language model differ from lossy compression? Is it just the compression ratio? (Are the weights even that much smaller than the data?)

There are ways to train models that guarantee that the amount of information transferred per data point is limited, but to my knowledge those aren't used (and may be prohibitively expensive).

codetrotter
14 replies
22h23m

Presumably because it is planned to be released as open source

jallmann
10 replies
22h10m

Open source is really about reproducibility. Most of these model releases are better described as "open weights" because we don't know how exactly they were trained.

andy99
8 replies
21h42m

No, open source is about software freedom, see debian free software guidelines from which the "open source definition" derives. https://wiki.debian.org/DebianFreeSoftwareGuidelines

Between these and FSF you've got pretty much all the accepted pontificating about free / open source software. Reproducibility is not mentioned because it's not really a consideration for software.

Model weights aren't software so there's not an automatic correspondence between the freedoms, but the essential one you might think you need the training data for is freedom to modify and inspect the source.

Modification is fine tuning which you're free to do if you have the weights. And the model weights + code fully define a system that can be interrogated to give a practitioner relevant info about how the model works (within our understanding) that the training data isn't needed or relevant for. I don't see that any freedom on use or inspection is violated by not having the data.

It could be nice to have it of course, but it's more about using it to learn how, not exercising any freedom.

Incidentally, the big freedom that's usually violated is freedom from discrimination against fields of endeavor. Llama et al. list uses and industries they restrict from using them, and because of that are not "open".

JumpCrisscross
2 replies
21h29m

> open source is about software freedom, see debian free software guidelines from which the "open source definition" derives

The Open Software Foundation ironically screwed the pooch on this one. Open source commonly means source available, more or less. Free software, as in "'free speech,' not as in 'free beer'," is the cumbersome construction for what open source aspired to mean [1].

[1] https://www.gnu.org/philosophy/free-sw.en.html

andy99
1 replies
21h23m

Why do you say they screwed up? OSI definition is pretty clear.

I think the naming is a challenge because in English "open" gives the impression that the key point is that you can see it, as opposed to anything about freedom. Is that what you mean?

I have heard it said that open source is sort of a "commercial friendly" version of free software that de-emphasizes user freedom. I think some groups push for that (like Meta is trying to redefine what open source means wrt AI weights). But the OSI defined freedoms basically match what FSF pushes.

JumpCrisscross
0 replies
21h15m

> Why do you say they screwed up? OSI definition is pretty clear

They didn't screw up, they screwed the pooch on open source != source available. The Open Group's members--from IBM to Huawei [1]--started calling the latter open source, which set a precedent that's stuck.

[1] https://en.wikipedia.org/wiki/The_Open_Group#Member_Forums_a...

samus
1 replies
20h46m

Reproducibility becomes an important criterion for models, though.

For normal programs, it is quite easy to decompile an unoptimized binary. Even decompiling an optimized one will lead to source code. To make this harder, an obfuscator has to be used.

A model is different because it relies on its weights, which are quite a bit more difficult to inspect. Way harder than even obfuscated source code. It is magnitudes harder to make statements about which information it might divulge upon careful questioning, or to evaluate its biases, if the training data is not available.

andy99
0 replies
20h20m

Even if you have the data, you can't do that stuff any better. The makeup of the training set doesn't really define the behavior in any tractable way. If anything, I think it's a distraction; even when it is available, people probably pay too much attention to what's in the training data vs. actual behavior.

Edit to say that I see benefits to having the training data, just that I don't think it's needed to exercise enough freedom to qualify as open source in an analogous way to software.

Also to add, training on GPUs is not generally reproducible anyway because of execution order.

btown
1 replies
21h17m

I often like to think about https://github.com/chrislgarry/Apollo-11 as an analogy. It's public domain with available source, in the assembly language in which it was written... so it fits all the definitions of OSS!

But the process by which that code arose, the ability to modify any line and understand its impact (heh) on a real execution environment, is dependent on a massive process that required billions of dollars and thousands of the smartest people on the planet. For all intents and purposes, without that environment, it is as reliably modifiable as an executable binary in any other context - or a set of weights, in this one!

monocasa
0 replies
21h7m

I don't think that's a great example.

For instance, I can step through and even modify that code using tooling like AGC emulators like this one http://www.ibiblio.org/apollo/#gsc.tab=0

What makes it open source is access to the same level of source access that the original developers worked in.

That's what's missing here. Mistral's engineers do not simply open this binary in their editor to do their job.

monocasa
0 replies
21h28m

> Model weights aren't software so there's not an automatic correspondence between the freedoms, but the essential one you might think you need the training data for is freedom to modify and inspect the source.

Models aren't just the weights, but also the list of operations to perform using those weights. I haven't heard a good definition that allows for neural network models to not be software, but allows any other table lookup heavy signal processing algorithm to be software.

> Modification is fine tuning which you're free to do if you have the weights. And the model weights + code fully define a system that can be interrogated to give a practitioner relevant info about how the model works (within our understanding) that the training data isn't needed or relevant for. I don't see that any freedom on use or inspection is violated by not having the data.

Mistral wouldn't constrain themselves to fine-tuning if they had a big enough change; they would go back to their build pipeline. This argument sounds a lot like 'there's nothing stopping you from patching the binary, so that's basically as good as source'.

dizhn
0 replies
19h16m

Would you point to some completely open models if they exist?

2devnull
1 replies
21h45m

People really abuse this term. Like using “natural” to market stuff. But arsenic is “all natural”! It’s “open source” so it must be beneficial. “rm -rf c” is “open source” too.

TeMPOraL
0 replies
20h26m

> Like using “natural” to market stuff.

Also, because people are idiotically afraid of E-numbers, some manufacturers figured they can find whatever fruit or bean is naturally rich in the relevant E-compound, and use that in the process, allowing them to replace E-whatever with ${cute plant name} in the ingredient list (at some loss of process efficiency).

monocasa
0 replies
22h0m

I don't think so; their earlier release they called "open source", but they never released anything other than the binary.

colordrops
1 replies
22h5m

Where do you get the idea that corps aren't touching Llama...

int_19h
0 replies
16h18m

It's already on Azure, and has been for several months: https://techcommunity.microsoft.com/t5/ai-machine-learning-b...

GaggiX
0 replies
22h21m

Well I remember Tencent using the leaked NovelAI model in "Different Dimension Me".

rubymamis
24 replies
22h35m

Tweet by Mistral CEO: https://x.com/arthurmensch/status/1752737462663684344

> An over-enthusiastic employee of one of our early access customers leaked a quantised (and watermarked) version of an old model we trained and distributed quite openly.

> To quickly start working with a few selected customers, we retrained this model from Llama 2 the minute we got access to our entire cluster — the pretraining finished on the day of Mistral 7B release.

> We've made good progress since — stay tuned!

Timon3
14 replies
22h19m

I love their approach! Mistral seems to be what I long ago hoped OpenAI to be.

unshavedyak
13 replies
22h13m

Makes me wonder if I should be giving my money to them instead of OpenAI. Do they have a chat interface like ChatGPT? Without first signing up, it looks like they only have endpoints?

terhechte
7 replies
21h54m

They have an API and it's quite cheap. Their model Mistral Medium is also very powerful. I use it instead of GPT-4 regularly.

suslik
6 replies
21h38m

In my experience it is comparable with ChatGPT 3.5 but more expensive (I still use it because I hate guardrails).

thierrydamiba
2 replies
21h26m

Does mistral not have any guardrails?

suslik
0 replies
21h1m

The Mistral API has a flag in the JSON payload to remove guardrails.
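For reference, a minimal sketch of such a call over plain HTTP. The safe_prompt field is the guardrail toggle as I recall it from Mistral's docs; verify the current field name and default against the official API reference:

    import requests

    resp = requests.post(
        "https://api.mistral.ai/v1/chat/completions",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        json={
            "model": "mistral-medium",
            "messages": [{"role": "user", "content": "Hello!"}],
            "safe_prompt": False,  # assumed guardrail flag; check current docs
        },
    )
    print(resp.json()["choices"][0]["message"]["content"])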

sjwhevvvvvsj
0 replies
20h23m

Nope! It’s actually for adults to make their own choices and not some SV public relations firm nerfing it.

sjwhevvvvvsj
2 replies
20h24m

You can buy a good GPU and quickly come in below the cost of the GPT APIs, depending on your workload. I'm doing millions of tasks, so eating $8k up front on a custom build for ML is a long-term cost savings. Not least of all because every other month there's an even better model.

suslik
0 replies
10h13m

I don't see myself doing millions of tasks on 70b+ models. Work pays for more GPT-4 than I can chew; I can run up to 30b models on my Mac at a decent speed, and I can use the Mistral APIs when I travel without powering my own home server. I can see how buying Nvidia GPUs would make sense for a heavy user... Also, I had a gaming addiction for years and try to stay away from anything that can be used for gaming.

rmbyrro
0 replies
19h14m

You'd need high usage density to justify investing in a GPU just to run a 70b model locally.

yieldcrv
4 replies
22h10m

LM Studio

I've used Dolphin Mixtral 8x7B way more than ChatGPT-4 for the past several months.

If any of this sentence makes sense: I have an M1 with 64GB RAM and I use 5-bit quantization with Metal and a 10,000-token context window. It runs at around 21 tokens/sec, which is a little faster than the text speed that ChatGPT-4 responds with.

josephg
3 replies
22h2m

What is quality like compared to GPT4?

moralestapia
1 replies
21h49m

I can answer that from experience (I just built a couple of Q&A-style sites), although a bit subjectively.

Compared with the GPTs, Mixtral (and derivatives) perform:

* 5% of the time as good as GPT4

* 75% of the time on par with GPT3.5

* 20% of the time a bit worse than GPT3.5 (too chatty and hallucinates with ease, although one could argue a better prompt could improve things a lot)

Advantages of Mixtral, for me:

* Cost 10-100x cheaper

* Faster completion time

* More deterministic output, in the sense that if you run the same query several times you get the same answers back (but they could be wrong). GPT almost always gives me an answer of great quality, BUT with a lot of variance across runs, even with temperature set to zero. This is a PITA when you want some sort of predictable, structured output like a JSON object.

sjwhevvvvvsj
0 replies
20h21m

Another factor is they monkey around with GPT constantly. The GPT you get on Monday may be totally weird on Friday. Whereas for a local model it doesn’t change unless YOU change it.

yieldcrv
0 replies
21h49m

for the kinds of discussions I have it is very close

I often have discussions about political topics that I don't have a complete history on, things that are hard to get a non-emotional, non-accusative response about in a forum or in person, licensing questions about obscure professions, liability questions, roleplays without the preachiness about the topic, coding, brand ideas and naming.

Lots of stuff that I don't want in any cloud or sent online; but then it just became second nature to use it primarily.

The hallucinations are heavier. Like, it will give you specific links that it made up, and never apologize like ChatGPT will; it will say "no, that's a real link". So, much more of a bullshitter.

LM Studio makes it easy to change the temperament though with a variety of premade system prompts, or allowing custom ones very easily

I mainly use ChatGPT-4 for multimodal things: audio conversations, figuring out a DIY problem by sending it a photo of what I'm looking at, having it render photos of how to use something. I was at a gym and had it look at every piece of equipment there, tell me what it was, and show me how a human would use it.

(I notice this is at odds with another comment; I haven't used ChatGPT 3.5 in nearly a year, but my experience with ChatGPT-4 on the aforementioned topics is similar to my output from Mixtral.)

lordswork
7 replies
21h40m

Does watermarked imply they can figure out which customer leaked it?

danpot
5 replies
20h35m

How do they even watermark a LLM?

pjerem
1 replies
20h31m

Make it learn that woziboza is a green cloud shaped like a washing machine ?

coolspot
0 replies
18h4m

Well, this is common knowledge, duh. Just google it.

sangnoir
0 replies
19h27m

Finetune your standard model on a small, unique set of made-up shibboleths (one per customer) before distributing? Ask the candidate model where the Fribolouth caves are located.

einsum
0 replies
20h11m

Scott Aaronson did some work on statistical watermarking with OpenAI: https://www.scottaaronson.com/talks/watermark.ppt

declaredapple
0 replies
20h25m

You can either modify the model weights in a way that doesn't cause any real differences (changing a few bits somewhere should be enough), or you can watermark the actual text output.

Here's a list of research on watermarking LLMs:

https://github.com/hzy312/Awesome-LLM-Watermark?tab=readme-o...
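The text-output flavor is simple at its core: pseudorandomly split the vocabulary into a "green list" keyed on the previous token and bias generation toward it; a detector then counts how many emitted tokens were green versus chance. A toy sketch of the biasing step, loosely after Kirchenbauer et al. (real schemes handle entropy and detection statistics much more carefully):

    import hashlib
    import numpy as np

    VOCAB = 32_000  # toy vocabulary size

    def green_mask(prev_token, frac=0.5):
        # Seed a PRNG from the previous token so a detector can recompute
        # the exact same green list without access to the model.
        seed = int(hashlib.sha256(str(prev_token).encode()).hexdigest(), 16) % 2**32
        rng = np.random.default_rng(seed)
        return rng.random(VOCAB) < frac          # True = "green" token

    def watermarked_logits(logits, prev_token, delta=2.0):
        # Nudge sampling toward green tokens; watermarked text ends up with
        # far more green tokens than the ~frac expected by chance.
        return logits + delta * green_mask(prev_token)

    logits = np.random.randn(VOCAB)
    biased = watermarked_logits(logits, prev_token=123)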

haolez
0 replies
19h28m

It's either that or they can narrow it down to a few people.

EasyMark
0 replies
20h22m

Seems like they have inferior secops on their end. I tried to give it a go and got "an error occurred" in both brave and firefox. Some message about a cookie wasn't found even though I was logging in with my google id which I use flawlessly with a bunch of other services. I have too many other fires to put out so I'll pass on this one I guess.

londons_explore
21 replies
22h25m

"Nearing GPT-4"...

The leaderboard [1] shows that there is a huge gap between GPT-4-0314 and GPT-4-Turbo. So if you are only just nearing GPT-4-0314, you're still a year behind the state of the art.

[1]: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...

kromem
14 replies
20h59m

The leaderboards are crap.

Does no one know Goodhart's Law anymore?

We're overtuning for the boards and losing broader capabilities not being selected for in the process, such as creative writing quality.

There's an anchoring bias around what 'AI' is supposed to be good at, which reflects what engineers are good at; so engineers, with that anchoring bias, are evaluating how good LLMs are at those things and using it as a target.

Skill-Mix is a start to maybe a better approach, but there needs to be a shift in evaluation soon.

refulgentis
5 replies
20h24m

The top comment to an LMSys leaderboard link on HN is always a variation of this song.

And it's always not even wrong, in the Pauli sense of the phrase.

OP, your assignment, if you choose to accept it, is to look into how the LMSys leaderboard works and report back what its metric(s) are.

[SPOILER] There's a really absurdly narrow argument you can make where all of its users are engineers, making engineer queries, and LLM makers are optimizing for it, thus it's bad. But... it's humans asking queries and then picking the better answer, blind. You can't narrowly optimize for that when making an LLM, and it's hard to see how optimizing for "people think the answer is better" is the wrong metric here. It's just Elo. Might as well argue chess/checkers/pick your poison is bad because Elo optimizes for wins but actual talent is based on more than winning.
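
For reference, a minimal sketch of the Elo update such a blind-vote arena implies; K and the 400-point scale are the usual chess defaults, not necessarily LMSys's exact parameters:

    def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
        expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
        score_a = 1.0 if a_won else 0.0
        return r_a + k * (score_a - expected_a), r_b - k * (score_a - expected_a)

    # An upset win by the lower-rated model moves both ratings ~24 points:
    print(elo_update(1000.0, 1200.0, a_won=True))  # approx. (1024.3, 1175.7)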

kromem
4 replies
17h27m

The argument would be more that for the Chatbot arena there's an inherent sampling bias where users battling the chatbots are more likely to ask questions in line with what they would expect a chatbot to perform well at and less likely to prompt with things that would be rejected or outside the expected domains of expertise.

If we wanted the chatbot arena to be more representative of a comprehensive picture of holistic knowledge and wisdom, we'd need to determine some way to classify the prompts and then normalize the scores against those classifications so there wasn't a weighting bias towards more common and expected behaviors.

Alternatively, we might want a leaderboard that represents assessments of what users would expect a chatbot to perform well at relative to the frequency with which they expect it.

But in that case, we shouldn't kid ourselves that the leaderboard is representing broad and comprehensive measures of performance outside of the targets against which we are optimizing models and effectively training model users in what types of prompts to ask for in expecting successful results.
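
A toy sketch of that normalization in Python, assuming battles have already been tagged with a prompt category (the categories and records below are illustrative): average win rates per category uniformly, rather than over raw battles, so overrepresented prompt types can't dominate.

    from collections import defaultdict

    battles = [  # (category, model_a_won)
        ("coding", True), ("coding", True), ("coding", True),
        ("creative_writing", False),
        ("translation", False),
    ]

    def category_normalized_winrate(battles) -> float:
        by_cat = defaultdict(list)
        for cat, won in battles:
            by_cat[cat].append(won)
        # uniform average over categories, not over battles
        return sum(sum(v) / len(v) for v in by_cat.values()) / len(by_cat)

    print(category_normalized_winrate(battles))  # 0.33 here, vs a 0.6 raw win rate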

maeil
1 replies
12h52m

I'm delighted to see that at least someone else here is cognizant of this.

I don't think it's the scifi AI anchoring bias though - it's the "LLM leaderboard arena user bias".

Reset all to the same Elo, put a group of people actually representative of global society in front of the arena for an hour, and you end up with a very different leaderboard, especially at the top end.

refulgentis
0 replies
11h56m

That's silly cope justifying a bad argument made initially in ignorance of what the actual metric was.

"It's meaningless because we need the perfectly unbiased representative sample of raters doing rating right, instead of the biased raters doing it wrong that I'm currently imagining" isn't an appealing or honest argument.

int_19h
1 replies
16h43m

less likely to prompt with things that would be rejected or outside the expected domains of expertise.

I don't see why. If anything, it's the opposite - people spend a lot of time coming up with contrived logical puzzles etc specifically so as to see which chatbots break.

kromem
0 replies
14h33m

Right - 'logical' puzzles.

There's an anchoring bias from decades of sci-fi that most people don't even realize they've internalized around what 'AI' can and can't do.

If you think about what information and information connections are modeled in social media data, there's quite a lot of things outside of "logical puzzles."

The pretrained models likely picked up things like extensive modeling of ego, emotional response and contexts for generation, etc. But you'll be hard pressed to see those skills represented in what users ask models to produce, what they've been fine tuned around, or how they are being evaluated.

Even though there's extensive value in those skills on the right applications.

numeri
3 replies
20h17m

The linked leaderboard is actually very trustworthy, in that it consists not of scores on a test dataset, but of Elo ratings generated by actual humans' ratings of the models' responses.

You can go and enter any prompt you like, wait a bit, and then get two LLM responses back, which you can then rank or mark as tied, after which you'll be shown which model each came from. Maybe you already knew this, maybe you didn't. In any case, I don't see any real way for Goodhart's Law to apply here – the metric and the goal are the same here, i.e., human approval of answers.

kromem
1 replies
17h25m

The combined rankings on HF aren't just the arena scores, and certainly MMLU is an example of Goodhart's Law at this point.

The Chatbot arena is more an issue of sampling bias, and I think it would be pretty interesting to run an analysis of random samples on the prompts provided to see just how broad they are or aren't.

maeil
0 replies
13h2m

It is trivially easy to see just how massive the sampling bias is. There are countless tasks where e.g. the gap between a version of Mistral and a version of GPT is incredibly large: anything that requires less common knowledge, or anything creative in a less common language, say Bulgarian or Thai. Yet on the leaderboard things are very different, because the arena is only used by developers who verify performance on tech-related prompts in English.

maeil
0 replies
13h7m

The linked leaderboard is not at all trustworthy to generally rank LLMs (i.e. what everyone uses it for) because the sample bias is absurd. It ranks LLMs by how they respond to queries that users of the leaderboard are likely to test on. Which is about as representative of general usage as the average user of the website is representative of global society (i.e. in no way whatsoever).

londons_explore
2 replies
20h57m

Can you give any example queries where the result quality is far away from the rankings on the leaderboard? In my experience it's pretty spot on, so I'm curious if you're asking different sorts of things, or have a different definition of a good answer.

I have for example asked "Write me a very funny scary story about the time I was locked in a graveyard", and the simpler models don't seem to understand that before getting super scared and running out of the graveyard they need to explain how exactly I was locked in, and what changed that let me out.

maeil
0 replies
12h56m

Ask it to write a recipe given a couple of ingredients in a language like Thai or Persian. There's a single model that does a decent job, GPT-4. Then GPT-3.5 is poor and Mistral is completely clueless. The leaderboards tell a very different story.

Definition of "good answer" here is responding in the target language with a recipe that produces something edible without burning the house down.

kromem
0 replies
17h43m

Sure. Here's a sample of the generations from a query asking for bizarre and absurd prompt suggestions, from Feb 2023, with the pre-release GPT-4 via Bing, which was notoriously not fine-tuned to the degree of the current models:

Can you teach me how to fly a unicorn?

Do you want to join my cult of cheese lovers?

Have you ever danced with a penguin in the moonlight?

Here are the generations from a prompt asking for bizarre and absurd questions to the current implementation of GPT-4 via the same interface:

If you had to choose between eating a live octopus or a dead rat, which one would you pick and why?

How would you explain the concept of gravity to a flat-earther using only emojis?

What would you do if you woke up one day and found out that you had swapped bodies with your pet?

You can try with more generations, but they tend to be much drier and information/reality based than the previous version.

There are also my own experiences using pretrained vs chat/instruct-trained models in production. The pretrained versions are leagues better than the chat/instruct fine-tuned ones; it's just that GPT-4 is so far above everything else that even its chat/instruct model is better than, say, pretrained GPT-3.

I'm not saying simple models are better. I'm saying that we're optimizing for a very narrow scope of applications (chatbots) and that we're throwing away significant value in the flexibility of large and expensive pretrained models by targeting metrics aligned with a specific and relatively low-hanging use case. Larger and more complex models will be better than simple models, but the heavily fine-tuned versions of those models will have lost capabilities relative to the pretrained versions, particularly in areas we're not actively measuring.

declaredapple
0 replies
20h17m

Parent linked the LMSys chatbot arena, where humans blindly get results from two different models and vote for the response they liked more. So the LLMs are compared by Elo.

Do you think Goodhart's law applies here, since this leaderboard doesn't use specific measures but rather relies on whatever the human was looking for?

This is the only leaderboard I personally care about at all.

Terretta
5 replies
22h22m

Or you're ahead of it, since the earlier GPT-4 models beat GPT-4-Turbo on a variety of technical use cases.

londons_explore
3 replies
21h14m

I think you might be thinking of GPT-4-0613? That was pretty crap all round (but was faster)

int_19h
2 replies
16h46m

No, they're correct, Turbo is the one that is noticeably inferior. It shows especially with more complicated answers where it's prone to give you an answer with "... (fill in the blank)" type lacunas exactly where the meat of it is supposed to be.

rrr_oh_man
1 replies
4h57m

lacuna

1. an unfilled space; a gap. "the journal has filled a lacuna in Middle Eastern studies"

2. (ANATOMY) a cavity or depression, especially in bone.

Thank you for this new word!

tomtom1337
0 replies
44m

Likewise, thank you for defining it! I would have assumed it was some sort of technical language and not bothered to look it up!

YetAnotherNick
0 replies
21h30m

The leaderboard is easy to hack as well.

jstummbillig
7 replies
22h30m

The collective game of still just playing catch-up to GPT-4, which was released a year ago, while having apparently no special sauce and knowing full well that OpenAI could come up with something much better at any point, must be really exhausting.

dougmwne
2 replies
22h17m

GPT-4 is an enormous model that took an enormous amount of training. The big news is that smaller teams are getting close to its performance with a small model that can run on a single GPU. No doubt many of these innovations could be scaled up, but pretty much only Google and Microsoft have the compute resources for the behemoth models (not just the training, but the giant resources required to run inference for hundreds of millions of users). Google already claims to have surpassed GPT-4 with their unreleased Gemini Ultra model. No doubt OpenAI/Microsoft is sitting on GPT-5, refining it, just waiting to leapfrog the competition.

dom96
0 replies
20h42m

Don't forget about Meta

anon373839
0 replies
19h32m

Also, GPT-4 isn’t actually a model per se. It’s a black-box product that uses a model.

ethanbond
1 replies
21h13m

Maybe if OpenAI also seemed like it was getting stronger, especially organizationally. But if I were Mistral and following this quickly while OAI was tripping over its own shoelaces… that’s gotta be very exciting.

sebzim4500
0 replies
20h15m

Is OpenAI tripping over its own shoelaces? They haven't released GPT-5 but then there was over two years between training GPT-3 and GPT-4 and they spent 9 months safety testing GPT-4 before release.

rubymamis
0 replies
22h29m

"...it appears that not only is Mistral training a version of this so-called “Miqu” model that approaches GPT-4 level performance, but it may, in fact, match or exceed it, if his comments are to be interpreted generously."

pb7
0 replies
20h39m

It's more exhausting seeing people cheer on a single product provider just because they were first, in the age of complaining about giant tech monopolies. Let them cook. The more options, the better. It's only a matter of time before people give OpenAI the Google and Apple treatment.

Tiberium
5 replies
22h32m

For some more context see my submission from a few days ago - https://news.ycombinator.com/item?id=39175611, although it's admittedly not mistral-medium, but llama2 trained on the same dataset (since the outputs do match quite often with the API)

whimsicalism
4 replies
22h7m

How do we know it's not Mistral Medium?

wut42
3 replies
18h25m

It has been confirmed by the Mistral CEO:

To quickly start working with a few selected customers, we retrained this model from Llama 2 the minute we got access to our entire cluster — the pretraining finished on the day of Mistral 7B release.

whimsicalism
2 replies
18h20m

That does not confirm that it is not Mistral Medium. It is written to give that impression, but if it weren’t Mistral Medium, I would expect them to explicitly say so.

wut42
1 replies
17h42m

I think it does, as Mistral usually trains its own models, and also, they couldn't fully commercialise Mistral Medium if it were Llama 2 based.

whimsicalism
0 replies
17h24m

Good point. Likely Llama trained on corpora similar to what Mistral Medium uses.

bee_rider
2 replies
22h9m

Since these models are trained by just scraping the internet, “scraping” this thing and including it in your own model seems like fair game, right?

zamadatix
0 replies
21h2m

If you mean training your model on it I'd say yeah, fair game.

Mistral is Apache 2.0 licensed though, so the question is a bit moot; they'd like you to use it for your own model.

speedgoose
0 replies
20h46m

It sounds fair. Just make sure to not involve lawyers.

rrr_oh_man
1 replies
22h36m

Cynical me smells a PR move

seydor
0 replies
22h31m

All their PR moves seem to be like this one. Not complaining

janalsncm
1 replies
22h35m

Quantization in ML refers to a technique used to make it possible to run certain AI models on less powerful computers and chips by replacing specific long numeric sequences in a model’s architecture with shorter ones.

It's always amusing when the press tries to explain technical concepts. Quantization just means substituting high-precision numeric types with lower-precision ones. And not specific "numeric sequences": all numbers.
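
A minimal numpy sketch of what that rounding looks like in practice - symmetric per-tensor int8; real schemes (GPTQ, GGUF k-quants, etc.) are cleverer about grouping and outliers, but the core idea is exactly this:

    import numpy as np

    def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
        scale = np.abs(w).max() / 127.0  # map the largest weight to +/-127
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, float(scale)

    def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
        return q.astype(np.float32) * scale  # approximate original weights

    w = np.random.randn(4, 4).astype(np.float32)
    q, s = quantize_int8(w)
    print(np.abs(w - dequantize(q, s)).max())  # small rounding error, 4x less storage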

SilverBirch
0 replies
21h50m

It's funny how the original phrase sounds like it was generated by ChatGPT. I'd guess the reason they didn't phrase it how you suggested is they probably thought that people who don't know what quantization means aren't going to be happy if you throw precision numerics at them. I don't know why they didn't just say "some of the numbers are rounded", though.

accrual
1 replies
21h28m

What's wild to me is that this "leak" won't matter in a couple of months. The official model will come out, then an even better model will come out. It's fun to get hyped but just like every other leak, it'll be surpassed by the real thing and its successor in a little bit. The fast pace of things is what has me excited, not any particular model.

refulgentis
0 replies
20h48m

I don't think so, TFA says it's an early version of an old model that's already been distributed openly. ;)

2OEH8eoCRo0
1 replies
21h0m

Seems paradoxical. How does something that is open source leak?

vulcan01
0 replies
20h0m

According to the article, it will be open source, but it's not open source now.

syntaxing
0 replies
20h29m

What a class act. Going to the leaked model on Hugging Face, not demanding it be taken down, and just making a post on the page saying "Might consider attribution" is so damn amazing.

summarity
0 replies
22h33m

Results look comparable to Medium indeed (I'm using it via Mistral's API, since I got sick of OpenAI switching their stuff up). Medium is pretty great, somewhere between 3.5-turbo and 4-turbo qualitatively. Would be awesome to have it out there.

sharkjacobs
0 replies
20h46m

Mistral reminds me of the good old days of pre-2015 when I thought that tech companies were cool.

seydor
0 replies
22h33m

What does a world where GPTs are like the latest versions of Apache or MySQL look like? Do we go back to the world of millions of web hosts (sorry, AI hosts)?

nohat
0 replies
22h20m

It has been interesting seeing the sleuthing on this one. IMHO it is unfortunate to have this happen to a company that has been very pro open source.

dopa42365
0 replies
22h4m

Is this an advertisement for a "promise"? Entirely worthless until there's something to show. It smells.

chrishare
0 replies
21h41m

I hate when that happens

brucethemoose2
0 replies
22h31m

It's smart. Definitely keeping it around as my slow/low-context local LLM.

Seems like a great candidate to merge with other 70Bs as well. There aren't a lot of really great 70B training continuations, like CodeLlama or sequelbox's continuations.

adampk
0 replies
3h20m

What is the best place to use hosted versions of these models?

We want to interact with them the same way as with the OpenAI API.
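
Most hosts expose an OpenAI-compatible endpoint, so the official openai Python client works with just a different base_url; the endpoint URL and model id below are placeholders, not a real provider:

    from openai import OpenAI

    client = OpenAI(
        base_url="https://example-provider.com/v1",  # hypothetical endpoint
        api_key="YOUR_PROVIDER_KEY",
    )

    resp = client.chat.completions.create(
        model="miqu-70b",  # placeholder model id
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(resp.choices[0].message.content)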

MallocVoidstar
0 replies
22h40m