Large Enough – Mistral AI

Liquix
101 replies
2h48m

These companies full of brilliant engineers are throwing millions of dollars in training costs to produce SOTA models that are... "on par with GPT-4o and Claude Opus"? And then the next 2.23% bump will cost another XX million? It seems increasingly apparent that we are reaching the limits of throwing more data at more GPUs; that an ARC prize level breakthrough is needed to move the needle any farther at this point.

happyhardcore
27 replies
2h42m

I suspect this is why OpenAI is going more in the direction of optimising for price / latency / whatever with 4o-mini and whatnot. Presumably they found out long before the rest of us did that models can't really get all that much better than what we're approaching now, and once you're there the only thing you can compete on is how many parameters it takes and how cheaply you can serve that to users.

__jl__
21 replies
2h37m

Meta just claimed the opposite in their Llama 3.1 paper. Look at the conclusion. They say that their experience indicates significant gains for the next iteration of models.

The current crop of benchmarks might not reflect these gains, by the way.

splwjs
12 replies
2h11m

I sell widgets. I promise the incalculable power of widgets has yet to be unleashed on the world, but it is tremendous and awesome and we should all be very afraid of widgets taking over the world because I can't see how they won't.

Anyway, here's the sales page. The widget subscription is so premium you won't even miss the subscription fee.

sqeaky
2 replies
2h7m

That is a strong (and fun) point, but this is peer reviewable and has more open collaboration elements than purely selling widgets.

We should still be skeptical, because people often want to claim to be better or to have answers they haven't earned, but I don't think the motive to lie is quite as strong as a salesman's.

troupo
1 replies
1h52m

this is peer reviewable

It's not peer-reviewable in any shape or form.

hnfong
0 replies
1h23m

It is kind of "peer-reviewable" in the "Elon Musk vs Yann LeCun" form, but I doubt that the original commenter meant this.

coltonv
2 replies
2h4m

This. It's really weird the way we suddenly live in a world where it's the norm to take whatever a tech company says about future products at face value. This is the same world where Tesla promised "zero intervention LA to NYC self driving" by the end of the year in 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, and 2024. The same world where we know for a fact that multiple GenAI demos by multiple companies were just completely faked.

It's weird. In the late 2010s it seemed like people were wising up to the idea that you can't implicitly trust big tech companies, even if they have nap pods in the office and have their first-day employees wear funny hats. Then ChatGPT lands and everyone is back to fully trusting these companies when they say they are mere months from turning the world upside down with their AI, which they have been saying every month for the last 12-24 months.

hnfong
0 replies
1h25m

In the late 2010s it seemed like people were wising up to the idea that you can't implicitly trust big tech companies

In the 2000s we only had Microsoft, and none of us were confused as to whether to trust Bill Gates or not...

littlestymaar
1 replies
1h47m

Except: Meta doesn't sell AI at all. Zuck is just doing this for two reasons:

- flex

- deal a blow to Altman

HDThoreaun
0 replies
6m

Meta uses AI in all the recommendation algorithms. They absolutely hope to turn their chat assistants into a product on WhatsApp too, and GenAI is crucial to creating the metaverse. This isn't just a charity case.

mattnewton
0 replies
1h53m

That would make sense if it were from OpenAI, but Meta doesn't actually sell these widgets. They release the widget machines for free in the hopes that other people will build a widget ecosystem around them to rival the closed widget ecosystem that threatens to lock them out of a potential "next platform" powered by widgets.

ctoth
0 replies
2h0m

Wouldn't the equivalent for Meta actually be something like:

Other companies sell widgets. We have a bunch of widget-making machines and so we released a whole bunch of free widgets. We noticed that the widgets got better the more we made and expect widgets to become even better in future. Anyway here's the free download.

Given that Meta isn't actually selling their models?

Your response might make sense if it were to something OpenAI or Anthropic said, but as is I can't say I follow the analogy.

camel_Snake
0 replies
1h57m

Meta doesn't sell widgets in this scenario - they give them away for free. Their competition sells widgets, so Meta would be perfectly happy if the widget market totally collapsed.

ThrowawayTestr
0 replies
1h59m

If OpenAI was saying this you'd have a point but I wouldn't call Facebook a widget seller in this case when they're giving their widgets away for free.

imtringued
3 replies
2h14m

LLMs are reaching saturation on even some of the latest benchmarks and yet I am still a little disappointed by how they perform in practice.

They are by no means bad, but I am now mostly interested in long context competency. We need benchmarks that force the LLM to complete multiple tasks simultaneously in one super long session.

xeromal
2 replies
2h5m

I don't know anything about AI, but there's one thing I want it to do for me: program a long-term, full-body exercise program based on the parameters I give it, such as available equipment, past workout context, and goals. I haven't had good success with ChatGPT, but I assume what you're talking about is relevant to my goals.

ThrowawayTestr
1 replies
1h59m

Aren't there apps that already do this like Fitbod?

xeromal
0 replies
1h48m

Fitbod might do the trick. Thanks! The availability of equipment was a difficult thing for me to incorporate into a fitness program.

nathanasmith
1 replies
2h16m

They also said in the paper that 405B was only trained to "compute-optimal", unlike the smaller models, which were trained well past that point. That indicates the larger model still had some runway, so had they continued, it would have kept getting stronger.

moffkalast
0 replies
1h40m

Makes sense right? Otherwise why make a model so large that nobody can conceivably run it if not to optimize for performance on a limited dataset/compute? It was always a distillation source model, not a production one.

dev1ycan
0 replies
1h58m

Or maybe they just want to avoid getting sued by shareholders for dumping so much money into unproven technology that ended up being the same as or worse than the competition.

Bjorkbat
0 replies
45m

Yeah, but what does that actually mean? That if they had simply doubled the parameters on Llama 405b it would score way better on benchmarks and become the new state-of-the-art by a long mile?

I mean, going by their own model evals on various benchmarks (https://llama.meta.com/), Llama 405b scores anywhere from a few points to almost 10 points more than Llama 70b even though the former has ~5.5x more params. As far as scale is concerned, the relationship isn't even linear.

Which in most cases makes sense: you obviously can't score above 100% on these benchmarks, so if the smaller model is already at ~95% or whatever then there isn't much room for improvement. There is, however, the GPQA benchmark. Whereas Llama 70b scores ~47%, Llama 405b only scores ~51%. That's not a huge improvement despite the significant difference in size.

Most likely, we're going to see improvements in small model performance by way of better data. Otherwise though, I fail to see how we're supposed to get significantly better model performance by way of scale when the relationship between model size and benchmark scores is nowhere near linear. I really wish someone who's on team "scale is all you need" could help me see what I'm missing.

And of course we might find some breakthrough that enables actual reasoning in models or whatever, but I find that purely speculative at this point, anything but inevitable.

crystal_revenge
3 replies
1h49m

the only thing you can compete on is how many parameters it takes and how cheaply you can serve that to users.

The problem with this strategy is that it's really tough to compete with open models in this space over the long run.

If you look at OpenAI's homepage right now they're trying to promote "ChatGPT on your desktop", so it's clear even they realize that most people are looking for a local product. But once again this is a problem for them because open models run locally are always going to offer more in terms of privacy and features.

In order for proprietary models served through an API to compete long term they need to offer significant performance improvements over open/local offerings, but that gap has been perpetually shrinking.

On an M3 macbook pro you can run open models easily for free that perform close enough to OpenAI that I can use them as my primary LLM for effectively free with complete privacy and lots of room for improvement if I want to dive into the details. Ollama today is pretty much easier to install than just logging into ChatGPT and the performance feels a bit more responsive for most tasks. If I'm doing a serious LLM project I most certainly won't use proprietary models because the control I have over the model is too limited.
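For reference, a minimal sketch of what querying a locally served open model looks like, assuming the ollama Python package is installed and a model has been pulled with `ollama pull llama3.1`:

    # Minimal sketch of querying a locally served open model via the ollama package.
    # Assumes Ollama is running locally and `ollama pull llama3.1` has been done.
    import ollama
    response = ollama.chat(
        model="llama3.1",
        messages=[{"role": "user", "content": "Summarize the tradeoffs of local vs hosted LLMs."}],
    )
    print(response["message"]["content"])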

At this point I have completely stopped using proprietary LLMs despite working with LLMs everyday. Honestly can't understand any serious software engineer who wouldn't use open models (again the control and tooling provided is just so much better), and for less technical users it's getting easier and easier to just run open models locally.

bla3
1 replies
1h46m

I think their desktop app still runs the actual LLM queries remotely.

kridsdale3
0 replies
30m

This. It's a mac port of the iOS app. Using the API.

pzo
0 replies
1h19m

In the long run maybe, but it's probably going to take 5 years or more before laptops like a MacBook M3 with 64 GB RAM are mainstream. It's also going to take a while before models with 70B params are bundled into Windows and macOS with a system update, and even more time before you have such models on your smartphone.

OpenAI made a good move in making GPT-4o mini so dirt cheap that it's faster and cheaper to run than Llama 3.1 70B. Most consumers will interact with LLMs via apps using an LLM API, a web panel on desktop, or a native mobile app, for the same reason most people use Gmail etc. instead of a native email client. Setting up IMAP, POP, etc. is out of reach for most people, just like installing Ollama + Docker + OpenWebUI.

App developers are not gonna bet on local-only LLMs as long as they are not mainstream and preinstalled on 50%+ of devices.

nichochar
0 replies
1h45m

Totally. I wrote about this when they announced their dev-day stuff.

In my opinion, they've found that intelligence with the current architecture is actually an S-curve and not an exponential, so they're trying to make progress in other directions: UX and EQ.

https://nicholascharriere.com/blog/thoughts-openai-spring-re...

swyx
13 replies
2h37m

indeed. I pointed out in https://buttondown.email/ainews/archive/ainews-llama-31-the-... that the frontier model curve is currently going down 1 OoM every 4 months, meaning every model release has a very short half life[0]. however this progress is still worth it if we can deploy it to improve millions and eventually billions of people's lives. a commenter pointed out that the amount spent on Llama 3.1 was only like 60% of the cost of Ant-Man and the Wasp: Quantumania, in which case I'd advocate for killing all Marvel slop and dumping all that budget on LLM progress.

[0] not technically complete depreciation, since for example 4o mini is widely believed to be a distillation of 4o, so 4o's investment still carries over into 4o mini

thierrydamiba
6 replies
2h9m

Agreed on everything, but calling the marvel movies slop…I think that word has gone too far.

mattnewton
1 replies
1h48m

Not all Marvel films are slop. But, as a fan who comes from a family of fans and someone who has watched almost all of them: let's be real. That particular film, and really most of them, contain copious amounts of what is absolutely slop.

I don't know if the utility is worse than an LLM that is SOTA for 2 months that no one even bothers switching to, however - at least the Marvel slop is being used for entertainment by someone. I think the market is definitely prioritizing the LLM researcher over Disney's latest slop sequel though, so whoever made that comparison can rest easy, because we'll find out.

lawlessone
0 replies
1h18m

and really most of them, contain copious amounts of what is absolutely slop.

I thought that was the allure, something that's campy, fun, and an easy watch.

I have only watched a few of them, so I am not fully familiar.

ThrowawayTestr
1 replies
1h55m

The marvel movies are the genesis for this use of the word slop.

simonw
0 replies
1h54m

Can you back that claim up with a link or similar?

bn-l
0 replies
1h24m

It’s junk food. No one is disputing how tasty it is though (including the recent garbage).

RUnconcerned
0 replies
1h49m

Not only are Marvel movies slop, they are very concentrated slop. The only way to increase the concentration of slop in a Marvel movie would be to ask ChatGPT to write the next one.

troupo
4 replies
1h51m

however this progress is still worth it if we can deploy it to improve millions and eventually billions of people's lives

Has there been any indication that we're improving the lives of millions of people?

zooq_ai
1 replies
1h36m

Yes, just like with the internet, power users have found use cases. It'll take education / habit for general users.

troupo
0 replies
15m

Ah yes. We're in the crypto stages of "it's like the internet".

machiaweliczny
1 replies
1h3m

Just me coding 30% faster is worth it

troupo
0 replies
13m

I haven't found a single coding problem where any of these coding assistants were anything but annoying.

If I need to babysit a junior developer fresh out of school and review every single line of code it spits out, I can find them elsewhere.

ActorNightly
10 replies
2h39m

The thing I don't understand is why everyone is throwing money at LLMs for language, when there are much simpler use cases which are more useful?

For example, has anyone ever attempted image -> html/css model? Seems like it would be great if I could draw something on a piece of paper and have it generate a website view for me.

majiy
1 replies
2h35m

That's a thought I had. For example, could a model be trained to take a description, and create a Blender (or whatever other software) model from it? I have no idea how LLMs really work under the hood, so please tell me if this is nonsense.

eurekin
0 replies
2h24m

I'm waiting exactly for this; GPT-4 trips up a lot with Blender currently (nonsensical order of operations etc.)

GaggiX
1 replies
2h37m

For example, has anyone ever attempted image -> html/css model?

Have you tried uploading the image to an LLM with vision capabilities like GPT-4o or Claude 3.5 Sonnet?

machiaweliczny
0 replies
57m

I tried, and Sonnet 3.5 can copy most common UIs.

rkwz
0 replies
2h17m

Perhaps if we think of LLMs as search engines (Google, Bing etc) then there's more money to be made by being the top generic search engine than the top specialized one (code search, papers search etc)

jacobn
0 replies
2h38m

I was under the impression that you could more or less do something like that with the existing LLMs?

(May work poorly of course, and the sample I think I saw a year ago may well be cherry picked)

chipdart
0 replies
2h21m

For example, has anyone ever attempted image -> html/css model?

There are already companies selling services where they generate entire frontend applications from vague natural language inputs.

https://vercel.com/blog/announcing-v0-generative-ui

ascorbic
0 replies
2h35m

All of the multi-modal LLMs are reasonably good at this.

JumpCrisscross
0 replies
1h52m

has anyone ever attempted image -> html/css model?

I had a discussion with a friend about doing this, but for CNC code. The answer was that a model trained on a narrow data set underperforms one trained on a large data set and then fine tuned with the narrow one.

lolinder
9 replies
1h46m

It seems increasingly apparent that we are reaching the limits of throwing more data at more GPUs

Yes. This is exactly why I'm skeptical of AI doomerism/saviorism.

Too many people have been looking at the pace of LLM development over the last two (2) years, modeled it as an exponential growth function, and come to the conclusion that AGI is inevitable in the next ${1-5} years and we're headed for ${(dys|u)topia}.

But all that assumes that we can extrapolate a pattern of long-term exponential growth from less than two years of data. It's simply not possible to project in that way, and we're already seeing that OpenAI has pivoted from improving on GPT-4's benchmarks to reducing cost, while competitors (including free ones) catch up.

All the evidence suggests the rate of growth in SOTA LLM capabilities has been slowing for at least the past year, which means predictions based on exponential growth all need to be reevaluated.

cjalmeida
3 replies
1h39m

Indeed. All exponential growth curves are sigmoids in disguise.

nicman23
1 replies
1h17m

except when it isn't and we ded :P

kridsdale3
0 replies
48m

I don't think Special Relativity would allow that.

ToValueFunfetti
0 replies
19m

This is something that is definitionally true in a finite universe, but doesn't carry a lot of useful predictive value in practice unless you can identify when the flattening will occur.

If you have a machine that converts mass into energy and then uses that energy to increase the rate at which it operates, you could rightfully say that it will level off well before consuming all of the mass in the universe. You just can't say that next week after it has consumed all of the mass of Earth.

RicoElectrico
3 replies
1h30m

I don't think we are approaching limits, if you take off the English-centric glasses. You can ask LLMs pretty basic questions about the Polish language or literature and they're gonna either bullshit or say they don't know the answer.

Example:

    w której gwarze jest słowo ekspres i co znaczy?

    (In which dialect does the word "ekspres" occur, and what does it mean?)

    Słowo "ekspres" występuje w gwarze śląskiej i oznacza tam ekspres do kawy. Jest to skrót od nazwy "ekspres do kawy", czyli urządzenia służącego do szybkiego przygotowania kawy.

    (The word "ekspres" occurs in the Silesian dialect and means a coffee machine there. It is short for "ekspres do kawy", i.e. a device for quickly brewing coffee.)

The correct answer is that "ekspres" is a zipper in the Łódź dialect.

nprateem
0 replies
1h27m

That's just same same but different, not a step change towards significant cognitive ability.

lolinder
0 replies
1h17m

What this means is just that Polish support (and probably most other languages besides English) in the models is behind SOTA. We can gradually get those languages closer to SOTA, but that doesn't bring us closer to AGI.

andrepd
0 replies
1h28m

Tbf, you can ask it basic questions in English and it will also bullshit you.

jeremyjh
0 replies
48m

I'm also wondering about the extent to which we are simply burning venture capital versus actually charging subscription prices that are sustainable long-term. It's easy to sell dollars for $0.75, but you can only do that for so long.

Workaccount2
8 replies
2h37m

I think GPT5 will be the signal of whether or not we have hit a plateau. The space is still rapidly developing, and while large model gains are getting harder to pick apart, there have been enormous gains in the capabilities of lightweight models.

chipdart
5 replies
2h24m

I think GPT5 will be the signal of whether or not we have hit a plateau.

I think GPT5 will tell if OpenAI hit a plateau.

Sam Altman has been quoted as claiming "GPT-3 had the intelligence of a toddler, GPT-4 was more similar to a smart high-schooler, and that the next generation will look to have PhD-level intelligence (in certain tasks)"

Notice the high degree of upselling based on vague claims of performance, and the fact that the jump from highschooler to PhD can very well be far less impressive than the jump from toddler to high schooler. In addition, notice the use of weasel words to frame expectations regarding "the next generation" to limit these gains to corner cases.

There's some degree of salesmanship in the way these models are presented, but even between the hyperboles you don't see claims of transformative changes.

rvnx
2 replies
2h12m

PhD level-of-task-execution sounds like the LLM will debate whether the task is ethical instead of actually doing it

throwadobe
0 replies
24m

I wish I could frame this comment

airspresso
0 replies
1h50m

lol! Producing academic papers for future training runs then.

splwjs
0 replies
2h5m

some degree of salesmanship

Buddy, every few weeks one of these bozos is telling us their product is literally going to eclipse humanity and we should all start fearing the inevitable great collapse.

It's like how no one owns a car anymore because of AI driving, and I don't have to tell you about the great bank disaster of 2019, when we all had to accept that fiat currency is over.

You've got to be a particular kind of unfortunate to believe it when sam altman says literally anything.

sensanaty
0 replies
33m

Basically every single word out of Mr Worldcoin's mouth is a scam of some sort.

zainhoda
0 replies
2h28m

I’m waiting for the same signal. There are essentially 2 vastly different states of the world depending on whether GPT-5 is an incremental change vs a step change compared to GPT-4.

mupuff1234
0 replies
1h23m

Which is why they'll keep calling the next few models GPT4.X

iknownthing
7 replies
2h43m

and even if there is another breakthrough, all of these companies will implement it more or less simultaneously and they will remain in a dead heat

llm_nerd
6 replies
2h40m

Presuming the breakthrough is openly shared. It remains surprising how transparent many of these companies are about new approaches that push the SOTA forward, and I suspect we're going to see a change: companies won't reveal the secret sauce so readily.

e.g. Almost the entire market relies upon Attention Is All You Need paper detailing transformers, and it would be an entirely different market if Google had held that as a trade secret.

talldayo
4 replies
2h36m

Given how absolutely pitiful the proprietary advancements in AI have been, I would posit we have little to worry about.

jsheard
3 replies
2h24m

OTOH the companies who are sharing their breakthroughs openly aren't yet making any money, so something has to give. Their research is currently being bankrolled by investors who assume there will be returns eventually, and eventually can only be kicked down the road for so long.

thruway516
0 replies
2h10m

Well, that's because the potential reward from picking the right horse is MASSIVE and the cost of potentially missing out is lifelong regret. Investors are driven by FOMO more than anything else. They know most of these will be duds but one of these duds could turn out to be life changing. So they will keep bankrolling as long as they have the money.

talldayo
0 replies
2h16m

Eventually can be (and has been) bankrolled by Nvidia. They did a lot of ground-floor research on GANs and training optimization, which only makes sense to release as public research. Similarly, Meta and Google are both well-incentivized to share their research through Pytorch and Tensorflow respectively.

I really am not expecting Apple or Microsoft to discover AGI and ferret it away for profitability purposes. Strictly speaking, I don't think superhuman intelligence even exists in the domain of text generation.

michaelt
0 replies
1h28m

Sort of yes, sort of no.

Of course, I agree that Stability AI made Stable Diffusion freely available and they're worth orders of magnitude less than OpenAI. To the point they're struggling to keep the lights on.

But it doesn't necessarily make that much difference whether you openly share the inner technical details. When you've got a motivated and well-financed competitor, merely demonstrating that a given feature is possible, and showing the output, performance, and price, might be enough.

If OpenAI adds a feature, who's to say Google and Facebook can't match it even though they can't access the code?

GaggiX
0 replies
2h34m

Attention Is All You Need paper detailing transformers, and it would be an entirely different market if Google had held that as a trade secret.

I would guess that in that timeline, Google would never have been able to learn about the incredible capabilities of transformer models outside of translation, at least not until much later.

chipdart
6 replies
2h36m

It seems increasingly apparent that we are reaching the limits of throwing more data at more GPUs;

I think you're just seeing the "make it work" stage of the combo "first make it work, then make it fast".

Time to market is critical, as you can attest by the fact you framed the situation as "on par with GPT-4o and Claude Opus". You're seeing huge investments because being the first to get a working model stands to benefit greatly. You can only assess models that exist, and for that you need to train them at a huge computational cost.

romeros
5 replies
2h19m

ChatGPT is like Google now. It is the default. Even if Claude becomes as good as ChatGPT or even slightly better it won't make me switch. It has to be like a lot better. Way better.

It feels like ChatGPT won the time to market war already.

staticman2
1 replies
1h53m

If ChatGPT fails to do a task you want, your instinct isn't "I'll run the prompt through Claude and see if it works" but "oh well, who needs LLMs?"

atxbcp
0 replies
1h28m

Please don't assume your experience applies to everyone. If ChatGPT can't do what I want, my first reaction is to ask Claude for the same thing. Often to find out that Claude performs much better. I've already cancelled ChatGPT Plus for exactly that reason.

brandall10
0 replies
2h7m

But plenty of people have switched to Claude, esp. with Sonnet 3.5. Many of them are in this very thread.

You may be right about the average person on the street, but I wonder how many have lost interest in LLM usage and cancelled their GPT Plus sub.

asah
0 replies
1h58m

-1: I know many people who are switching to Claude. And Google makes it near-zero friction to adopt Gemini with Gsuite. And more still are using the top-N of them.

This is similar to the early days of the search engine wars, the browser wars, and other categories where a user can easily adopt, switch between and use multiple. It's not like the cellphone OS/hardware war, PC war and database war where (most) users can only adopt one platform at a time and/or there's a heavy platform investment.

Tostino
0 replies
2h10m

Eh, with the degradation of coding performance in ChatGPT I made the switch. Seems much better to work with on problems, and I have to do way less hand holding to get good results.

I'll switch again soon as something better is out.

niemandhier
3 replies
1h40m

The next iteration depends on NVIDIA & co.; what we need is sparse libs. Most of the weights in llms are 0, and once we deal with those more efficiently we will get to the next iteration.

lawlessone
2 replies
1h15m

Most of the weights in llms are 0,

that's interesting. Do you have a rough percentage of this?

Does this mean these connections have no influence at all on output?

machiaweliczny
1 replies
59m

My uneducated guess is that with many layers you can implement something akin to a graph in the brain by nulling lots of previous layer outputs. I actually suspect that current models aren't optimal with layers all of the same size, but I know shit.

kridsdale3
0 replies
27m

This is quite intuitive. We know that a biological neural net is a graph data structure. And ML systems on GPUs are more like layers of bitmaps in Photoshop (it's a graphics processor). So if most of the layers are akin to transparent pixels, in order to build a graph by stacking, that's hyper memory inefficient.

swalsh
1 replies
1h34m

And with the increasing parameter size, the main winner will be Nvidia.

Frankly I just don't understand the economics of training a foundation model. I'd rather own an airline. At least I can get a few years out of the capital investment of a plane.

machiaweliczny
0 replies
53m

But billionaires already have that, they want a chance of getting their own god.

speed_spread
0 replies
2h36m

Benchmark scores aren't good measures because they apply to previous generations of LLMs. That 2.23% uptick can actually represent a world of difference in subjective tests and definitely be worth the investment.

Progress is not slowing down but it gets harder to quantify.

skybrian
0 replies
2h14m

I think it’s impressive that they’re doing it on a single (large) node. Costs matter. Efficiency improvements like this will probably increase capabilities eventually.

I’m also optimistic about building better (rather than bigger) datasets to train on.

satvikpendem
0 replies
2h36m

This is already what the chinchilla paper surmised, it's no wonder that their prediction now comes to fruition. It is like an accelerated version of Moore's Law, because software development itself is more accelerated than hardware development.

mlsu
0 replies
53m

What else can be done?

If you are sitting on $1 billion of GPU capex, what's $50 million in energy/training cost for another incremental run that may beat the leaderboard?

Over the last few years the market has placed its bets that this stuff will make gobs of money somehow. We're all not sure how. They're probably thinking -- it's likely that whoever has a few % edge is going to sweep and take most of this hypothetical value. What's another few million, especially if you already have the GPUs?

I think you're right -- we are towards the right end of the sigmoid. And with no "killer app" in sight. It is great for all of us that they have created all this value, because I don't think anyone will be able to capture it. They certainly haven't yet.

m3kw9
0 replies
1h36m

There are different directions in which AI has lots of room to improve: multi-modal, which branches into robotics, and single-modal, like image, video, and sound generation and understanding. I'd also check back when OpenAI releases GPT-5.

lossolo
0 replies
1h52m

For some time, we have been at a plateau because everyone has caught up, which essentially means that everyone now has good training datasets and uses similar tweaks to the architecture. It seems that, besides new modalities, transformers might be a dead end as an architecture. Better scores on benchmarks result from better training data and fine-tuning. The so-called 'agents' and 'function calling' also boil down to training data and fine-tuning.

genrilz
0 replies
2h27m

For this model, it seems like the point is that it uses far fewer parameters than at least the large Llama model while having near-identical performance. Given how large these models are getting, this is an important thing to do before making performance better again.

42lux
0 replies
2h1m

We always needed a tock to see real advancement, like with the last model generation. The tick we had with the h100 was enough to bring these models to market but that's it.

tikkun
58 replies
2h50m

Links to chat with models that released this week:

Large 2 - https://chat.mistral.ai/chat

Llama 3.1 405b - https://www.llama2.ai/

I just tested Mistral Large 2 and Llama 3.1 405b on 5 prompts from my Claude history.

I'd rank as:

1. Sonnet 3.5

2. Large 2 and Llama 405b (similar, no clear winner between the two)

If you're using Claude, stick with it.

My Claude wishlist:

1. Smarter (yes, it's the most intelligent, and yes, I wish it was far smarter still)

2. Longer context window (1M+)

3. Native audio input including tone understanding

4. Fewer refusals and less moralizing when refusing

5. Faster

6. More tokens in output

drewnick
48 replies
2h38m

All 3 models you ranked cannot get "how many r's are in strawberry?" correct. They all claim 2 r's unless you press them. With all the training data I'm surprised none of them fixed this yet.

tikkun
19 replies
2h36m

When using a prompt that involves thinking first, all three get it correct.

"Count how many rs are in the word strawberry. First, list each letter and indicate whether it's an r and tally as you go, and then give a count at the end."

Llama 405b: correct

Mistral Large 2: correct

Claude 3.5 Sonnet: correct

layer8
12 replies
2h20m

It’s not impressive that one has to go to that length though.

mattnewton
2 replies
1h45m

You can always find something to be unimpressed by I suppose, but the fact that this was fixable with plain english is impressive enough to me.

layer8
1 replies
1h22m

The technology is frustrating because (a) you never know what may require fixing, and (b) you never know if it is fixable by further instructions, and if so, by which ones. You also mostly* cannot teach it any fixes (as an end user). Using it is just exhausting.

*) that is, except sometimes by making adjustments to the system prompt

mattnewton
0 replies
4m

I think this particular example, of counting letters, is obviously going to be hard when you know how tokenization works. It's totally possible to develop an intuition for when other things will or won't work, but like all ML-powered tools, you can't hope for 100% accuracy. The best you can do is have good metrics and track performance on test sets.

I actually think the craziest part of LLMs is how much you can fix, as a developer or SME, with plain English prompting once you have that intuition. Of course some things aren't fixable that way, but the mere fact that many cases are fixable simply by explaining the task to the model better in plain English is a wildly different paradigm! The jury is still out, but I think it's worth being excited about; it's very powerful, since there are a lot more people with good language skills than there are Python programmers or ML experts.

jonas21
1 replies
31m

To be fair, I just asked a real person and had to go to even greater lengths:

Me: How many "r"s are in strawberry?

Them: What?

Me: How many times does the letter "r" appear in the word "strawberry"?

Them: Is this some kind of trick question?

Me: No. Just literally, can you count the "r"s?

Them: Uh, one, two, three. Is that right?

Me: Yeah.

Them: Why are you asking me this?

SirMaster
0 replies
2m

Try asking a young child...

asadm
1 replies
2h2m

this can be automated.

grumbel
0 replies
49m

GPT-4o already does that: for problems involving math it will write small Python programs to handle the calculations instead of doing them with the LLM itself.
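For the letter-counting case, the kind of snippet such a tool call runs can be as small as this (an illustrative sketch, not an actual GPT-4o transcript):

    # Illustrative sketch of the sort of snippet a code-interpreter tool might run
    # for the letter-counting question (not an actual GPT-4o transcript).
    word = "strawberry"
    count = word.count("r")
    print(f"There are {count} 'r's in '{word}'.")  # There are 3 'r's in 'strawberry'.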

Spivak
1 replies
2h3m

To me it's just a limitation based on the world as seen by these models. They know there's a letter called 'r', they even know that some words start with 'r' or have r's in them, and they know what the spelling of some words is. But they've never actually seen one, as their world is made up entirely of tokens. The word 'red' isn't r-e-d but is instead like a pictogram to them. But they know the spelling of strawberry and can identify an 'r' when it's on its own and count those, despite not being able to see the r's in the word itself.

layer8
0 replies
1h30m

The great-parent demonstrates that they are nevertheless capable of doing so, but not without special instructions. Your elaboration doesn’t explain why the special instructions are needed.

unshavedyak
0 replies
2h3m

Imo it's impressive that any of this even remotely works. Especially when you consider all the hacks like tokenization that i'd assume add layers of obfuscation.

There's definitely tons of weaknesses with LLMs for sure, but i continue to be impressed at what they do right - not upset at what they do wrong.

petesergeant
0 replies
1h22m

In a park people come across a man playing chess against a dog. They are astonished and say: "What a clever dog!" But the man protests: "No, no, he isn't that clever. I'm leading by three games to one!"

ThrowawayTestr
0 replies
1h51m

Compared to chat bots of even 5 years ago the answer of two is still mind-blowing.

jedberg
2 replies
1h27m

This reminds me of when I had to supervise outsourced developers. I wanted to say "build a function that does X and returns Y". But instead I had to say "build a function that takes these inputs, loops over them and does A or B based on condition C, and then return Y by applying Z transformation"

At that point it was easier to do it myself.

HPsquared
0 replies
9m

"What programming computers is really like."

EDIT: Although perhaps it's even more important when dealing with humans and contracts. Someone could deliberately interpret the words in a way that's to their advantage.

tcgv
1 replies
58m

Chain-of-Thought (CoT) prompting to the rescue!

We should always put some effort into prompt engineering before dismissing the potential of generative AI.

johntb86
0 replies
26m

By this point, instruction tuning should include tuning the model to use chain of thought in the appropriate circumstances.

hansworst
0 replies
1h7m

Can’t you just instruct your llm of choice to transform your prompts like this for you? Basically feed it with a bunch of heuristics that will help it better understand the thing you tell it.

Maybe the various chat interfaces already do this behind the scenes?

Kuinox
10 replies
2h30m

Tokenization makes it hard for it to count the letters; that's also why, if you ask it to do maths, writing the number in letters will yield better results.

For strawberry, it sees it as [496, 675, 15717], which is str aw berry.

If you insert characters to break the tokens down, it finds the correct result: how many r's are in "s"t"r"a"w"b"e"r"r"y" ?

There are 3 'r's in "s"t"r"a"w"b"e"r"r"y".
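(As a rough sketch, you can inspect the splits yourself with OpenAI's tiktoken library; the exact IDs and pieces depend on which tokenizer a given model uses, so they may differ from the numbers above.)

    # Rough sketch: inspect how a tokenizer splits "strawberry" into pieces.
    # Exact IDs and splits depend on the tokenizer used by a given model.
    import tiktoken
    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode("strawberry")
    print(ids)                             # a short list of token IDs, not one per letter
    print([enc.decode([i]) for i in ids])  # multi-character pieces, e.g. something like 'str', 'aw', 'berry'
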
GenerWork
9 replies
1h13m

If you insert characters to break the tokens down, it finds the correct result: how many r's are in "s"t"r"a"w"b"e"r"r"y" ?

The issue is that humans don't talk like this. I don't ask someone how many r's there are in strawberry by spelling out strawberry, I just say the word.

est31
2 replies
45m

Humans also constantly make mistakes that are due to proximity in their internal representation. "Could of"/"Should of" comes to mind: the letters "of" have a large edit distance from "'ve", but their pronunciation is the same.

Native speakers especially are prone to this mistake, as they grew up learning English as illiterate children, from sounds only, whereas most people learning English as a second language learn it together with the textual representation.

Psychologists use this trick as well to figure out internal representations, for example the Rorschach test.

And probably, if you asked random people in the street how many P's there are in "Philippines", you'd also get lots of wrong answers. It's tricky due to the double p and the initial p being part of an f sound. The demonym uses "F" as the first letter, and in many languages, say Spanish, the country name also uses an F.

rahimnathwani
1 replies
32m

Until I was ~12, I thought 'a lot' was a single word.

soneca
1 replies
54m

This is only an issue if you send commands to an LLM as if you were communicating with a human.

antisthenes
0 replies
27m

This is only an issue if you send commands to an LLM as if you were communicating with a human.

Yes, it's an issue. We want the convenience of sending human-legible commands to LLMs and getting back human-readable responses. That's the entire value proposition lol.

observationist
0 replies
48m

Count the number of occurrences of the letter e in the word "enterprise".

Problems can exist as instances of a class of problems. If you can't solve a problem, it's useful to know if it's a one off, or if it belongs to a larger class of problems, and which class it belongs to. In this case, the strawberry problem belongs to the much larger class of tokenization problems - if you think you've solved the tokenization problem class, you can test a model on the strawberry problem, with a few other examples from the class at large, and be confident that you've solved the class generally.

It's not about embodied human constraints or how humans do things; it's about what AI can and can't do. Right now, because of tokenization, things like counting the r's in strawberry are outside the implicit model of the word in the LLM, with downstream effects on tasks it can complete. This affects moderation, parsing, generating prose, and all sorts of unexpected tasks. Having a workaround like forcing the model to insert spaces and operate on explicitly delimited text is useful when affected tasks appear.

coder543
0 replies
31m

I don't ask someone how many r's there are in strawberry by spelling out strawberry, I just say the word.

No, I would actually be pretty confident you don’t ask people that question… at all. When is the last time you asked a human that question?

I can’t remember ever having anyone in real life ask me how many r’s are in strawberry. A lot of humans would probably refuse to answer such an off-the-wall and useless question, thus “failing” the test entirely.

A useless benchmark is useless.

In real life, people overwhelmingly do not need LLMs to count occurrences of a certain letter in a word.

bhelkey
0 replies
54m

It's not a human. I imagine if you have a use case where counting characters is critical, it would be trivial to programmatically transform prompts into lists of letters.

A token is roughly four letters [1], so, among other probable regressions, this would significantly reduce the effective context window.

[1] https://help.openai.com/en/articles/4936856-what-are-tokens-...
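A minimal sketch of that kind of preprocessing, with a hypothetical spell_out helper (not part of any real API); the blown-up prompt also shows where the context-window cost comes from:

    def spell_out(word: str) -> str:
        # Hypothetical helper: space-separate the letters so each one
        # becomes its own token instead of part of a larger chunk.
        return " ".join(word)
    prompt = f"The word strawberry spelled out is: {spell_out('strawberry')}. How many r's does it contain?"
    print(prompt)  # The word strawberry spelled out is: s t r a w b e r r y. How many r's does it contain?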

Zambyte
0 replies
33m

Humans also would probably be very likely to guess 2 r's if they had never seen any written words or had the word spelled out to them as individual letters before, which is kind of close to how language models treat it, despite being a textual interface.

Tepix
4 replies
2h26m

LLMs think in tokens, not letters. It's like asking someone who is dyslexic about spelling. Not their strong suit. In practice, it doesn't matter much, does it?

recursive
1 replies
1h53m

Sometimes it does, sometimes it doesn't.

It is evidence that LLMs aren't appropriate for everything, and that there could exist something that works better for some tasks.

Zambyte
0 replies
17m

Language models are best treated like consciousness. Our consciousness does a lot less than people like to attribute to it. It is mostly a function of introspection and making connections, rather than the part of the brain that does higher-level reasoning or that tells your body how to stay alive (like beating your heart).

By allowing a language model to do function calling, you are essentially allowing it to do specialized "subconscious" thought. The language model becomes a natural language interface to the capabilities of its "subconsciousness".

A specific human analogy could be: I tell you to pick up a pen off of the table, and then you do it. Most of your mental activity would be subconscious: orienting your arm and hand properly, actually grabbing the pen, and picking it up. The linguistic representation of the action would exist in your conscious mind (pick up the pen), but not much else.

A language model could very easily call out to a text-processing function to correctly do things like count the number of r's in the word strawberry. That is a job that your conscious mind can dispatch to your subconsciousness.
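A minimal sketch of that dispatch idea, assuming an OpenAI-style tool schema (the names and wiring here are illustrative, not any particular vendor's API):

    # Deterministic "subconscious" tool the model can call instead of counting itself.
    def count_letter(word: str, letter: str) -> int:
        return word.lower().count(letter.lower())
    # Tool description in the general shape of OpenAI-style function calling
    # (illustrative; field names may differ per vendor).
    count_letter_tool = {
        "name": "count_letter",
        "description": "Count how many times a letter appears in a word.",
        "parameters": {
            "type": "object",
            "properties": {
                "word": {"type": "string"},
                "letter": {"type": "string"},
            },
            "required": ["word", "letter"],
        },
    }
    # When the model emits a call to this tool, the host code runs it and feeds the result back:
    print(count_letter("strawberry", "r"))  # 3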

imtringued
1 replies
1h45m

This doesn't make any sense. The LLM knows the token to letter associations if you ask it. Once it has split the token into letters, it can answer the question anyway.

This tells us that the model is unaware of its own limitations, otherwise it would have figured out that this task needs it to split the letters first.

staticman2
0 replies
1h14m

Well splitting into letters doesn't help with Mistral Large 2. I tried this with Mistral 2:

[Question: How many r's in strawberry? First split the word into letters, one letter on each line then answer the question.]

[Answer: Let's split the word "strawberry" into letters:

s t r a w b e r r y

Now, let's count the number of 'r's:

There are 2 'r's in the word "strawberry".]

[Question: Did you count the number of r's?]

[answer: Yes, I did. Here's the count:

r r

So, there are 2 'r's in the word "strawberry".]

I suspect the issue is these models have to be trained to accomplish tasks, and this is not a task it makes sense to train a model to perform. You might train it to do this with function calling one day, calling a python library to count the r's in a string, but actually manually counting r's just doesn't play to an LLM's strengths.

generalizations
3 replies
38m

Testing models on their tokenization has always struck me as kinda odd. Like, that has nothing to do with their intelligence.

wruza
0 replies
13m

It doesn't test "on tokenization" though. What happens when an answer is generated is a few abstraction levels deeper than tokens. A "thinking" "slice" of an LLM is completely unaware of tokens as an immediate part of its reasoning. The question just shows a lack of systemic knowledge about strawberry as a word (which isn't surprising, tbh).

swatcoder
0 replies
31m

Surfacing and underscoring obvious failure cases for general "helpful chatbot" use is always going to be valuable because it highlights how the "helpful chatbot" product is not really intuitively robust.

Meanwhile, it helps make sure engineers and product designers who want to build a more targeted product around LLM technology know that it's not suited to tasks that may trigger those kinds of failures. This may be obvious to you as an engaged enthusiast or cutting edge engineer or whatever you are, but it's always going to be new information to somebody as the field grows.

probably_wrong
0 replies
2m

I would counterargue with "that's the model's problem, not mine".

Here's a thought experiment: if I gave you 5 boxes and asked you "how many balls are there in all of these boxes?" and you answered "I don't know because they are inside boxes", that's a fail. A truly intelligent individual would open them and look inside.

A truly intelligent model would (say) retokenize the word into its individual letters (which I'm optimistic they can) and then would count those. The fact that models cannot do this is proof that they lack some basic building blocks for intelligence. Model designers don't get to argue "we are human-like except in the tasks where we are not".

ChikkaChiChi
1 replies
2h33m

4o will get the answer right on the first go if you ask it "Search the Internet to determine how many R's are in strawberry?" which I find fascinating

paulcole
0 replies
7m

I didn't even need to do that. 4o got it right straight away with just:

"how many r's are in strawberry?"

The funny thing is, I replied, "Are you sure?" and got back, "I apologize for the mistake. There are actually two 'r's in the word strawberry."

vorticalbox
0 replies
12m

I just tried Llama 3.1 8B; this is its reply.

According to multiple sources, including linguistic analysis and word breakdowns, there are 3 Rs in the word "strawberry".

taf2
0 replies
7m

Sonnet 3.5 thinks 2.

joshstrange
0 replies
2h7m

Lots of replies mention tokens as the root cause and I’m not well versed in this stuff at the low level but to me the answer is simple:

When this question is asked (from what the models trained on) the question is NOT “count the number of times r appears in the word strawberry” but instead (effectively) “I’ve written ‘strawbe’, now how many r’s are in strawberry again? Is it 1 or 2?”.

I think most humans would probably answer “there are 2” if we saw someone was writing and they asked that question, even without seeing what they have written down. Especially if someone said “does strawberry have 1 or 2 r’s in it?”. You could be a jerk and say “it actually has 3” or answer the question they are actually asking.

It’s an answer that is _technically_ incorrect but the answer people want in reality.

doctoboggan
0 replies
2h34m

Due to the fact that LLMs work on tokens and not characters, these sort of questions will always be hard for them.

Stumbling
0 replies
36m

Claude 3 Opus gave the correct answer.

Der_Einzige
0 replies
2h5m

I wrote and published a paper at COLING 2022 on why LLMs in general won't solve this without either 1. radically increasing vocab size, 2. rethinking how tokenizers are done, or 3. forcing it with constraints:

https://aclanthology.org/2022.cai-1.2/

rkwz
7 replies
2h15m

Longer context window (1M+)

What's your use case for this? Uploading multiple documents/books?

tikkun
4 replies
2h14m

Correct

freediver
3 replies
1h47m

That would make each API call cost at least $3 ($3 is the price per million input tokens). And if you have a 10-message interaction you are looking at $30+ for the interaction. Is that what you would expect?

tr4656
0 replies
1h11m

This might be when it's better to not use the API and just pay for the flat-rate subscription.

rkwz
0 replies
1h41m

Maybe they're summarizing/processing the documents in a specific format instead of chatting? If they needed chat, might be easier to build using RAG?

coder543
0 replies
24m

Gemini 1.5 Pro charges $0.35/million tokens up to the first million tokens or $0.70/million tokens for prompts longer than one million tokens, and it supports a multi-million token context window.

Substantially cheaper than $3/million, but I guess Anthropic’s prices are higher.

ketzo
0 replies
2h11m

Uploading large codebases is particularly useful.

benopal64
0 replies
55m

Books, especially textbooks, would be amazing. These things can get pretty huge (1000+ pages) and usually do not fit into GPT-4o or Claude Sonnet 3.5 in my experience. I envision the models being able to help a user (student) create their study guides and quizzes, based on ingesting the entire book. Given the ability to ingest an entire book, I imagine a model could plan how and when to introduce each concept in the textbook better than a model that only sees part of the textbook.

msp26
0 replies
1h1m

Large 2 is significantly smaller at 123B so it being comparable to llama 3 405B would be crazy.

TIPSIO
33 replies
2h51m

This race for the top model is getting wild. Everyone is claiming to one-up each other with every version.

My experience (benchmarks aside) Claude 3.5 Sonnet absolutely blows everything away.

I'm not really sure how to even test/use Mistral or Llama for everyday use though.

satvikpendem
11 replies
2h49m

I stopped my ChatGPT subscription and subscribed instead to Claude; it's simply much better. But it's hard to tell how much better day to day beyond my main use case of coding. It's more that ChatGPT felt degraded than that Claude was much better. The hedonic treadmill runs deep.

bugglebeetle
5 replies
2h46m

GPT-4 was probably as good as Claude Sonnet 3.5 at its outset, but OpenAI ran it into the ground with whatever they're doing to save on inference costs, scale, align it, or add dumb product features.

satvikpendem
4 replies
2h38m

Indeed, it used to output all the code I needed but now it only outputs a draft of the code with prompts telling me to fill in the rest. If I wanted to fill in the rest, I wouldn't have asked you, now would I?

flir
2 replies
2h18m

It's doing something different for me. It seems almost desperate to generate vast chunks of boilerplate code that are only tangentially related to the question.

That's my perception, anyway.

throwadobe
0 replies
30m

This is also my perception using it daily for the last year or so. Sometimes it also responds with exactly what I provided it with and does not make any changes. It's also bad at following instructions.

GPT-4 was great until it became "lazy" and filled the code with lots of `// Draw the rest of the fucking owl` type comments. Then GPT-4o was released and it's addicted to "Here's what I'm going to do: 1. ... 2. ... 3. ..." and lots of frivolous, boilerplate output.

I wish I could go back to some version of GPT-4 that worked well but with a bigger context window. That was like the golden era...

cloverich
0 replies
20m

This is also my experience. Previously it got good at giving me only relevant code, which, as an experienced coder, is what I want. My favorites were the one-line responses.

Now it often falls back to generating full examples, explanations, restating the question and its approach. I suspect this is by design, as (presumably) less experienced folks want or need all that. For me, I wish I could consistently turn it into one of those way-too-terse devs that replies with the bare minimum example and expects you to infer the rest. Usually that is all I want or need, and I can ask for elaboration when that's not the case. I haven't found the best prompts to retrigger this persona from it yet.

visarga
0 replies
21m

I wouldn't have asked you, now would I?

That's what I said to it - "If I wanted to fill in the missing parts myself, why would I have upgraded to paid membership?"

TIPSIO
4 replies
2h47m

Have you (or anyone) swapped in an Anthropic API key on Cursor?

For a coding assistant, it's on my to-do list to try. Cursor needs some serious work on model selection clarity though, so I keep putting it off.

freediver
2 replies
2h41m

I did it (fairly simple really) but found that most of my (unsophisticated) coding these days goes through Aider [1] paired with Sonnet, for UX reasons mostly. It is easier to just prompt over the entire codebase, vs the Cursor way of working with text selections.

[1] https://aider.chat

freediver
0 replies
1h50m

That is chatting, but it will not change the code.

com2kid
0 replies
8m

One big advantage Claude artifacts have is that they maintain conversation context, versus when I am working with Cursor I have to basically repeat a bunch of information for each prompt, there is no continuity between requests for code edits.

If Cursor fixed that, the user experience would become a lot better.

ldjkfkdsjnv
4 replies
2h49m

Sonnet 3.5 to me still seems far ahead. Maybe not on the benchmarks, but in everyday life I am finding it renders the other models useless. Even still, this monthly progress across all companies is exciting to watch. It's very gratifying to see useful technology advance at this pace; it makes me excited to be alive.

LrnByTeach
1 replies
48m

Such a relief/contrast to the period between 2010 and 2020, when the top five (Google, Apple, Facebook, Amazon, and Microsoft) monopolized their own regions and refused to compete with any other player in new fields.

Google : Search

Facebook : social

Apple : phones

Amazon : shopping

Microsoft : enterprise ..

Even still, this monthly progress across all companies is exciting to watch. It's very gratifying to see useful technology advance at this pace; it makes me excited to be alive.

jack_pp
0 replies
8m

Google refused to compete with Apple in phones?

Microsoft also competes in search, phones

Microsoft, Amazon and Google compete in cloud too

shinycode
0 replies
2h40m

Given we don't know precisely what's happening in the black box, we can say that tech specs don't give you the full picture of the experience … Apple style.

bugglebeetle
0 replies
2h48m

I’ve stopped using anything else as a coding assistant. It’s head and shoulders above GPT-4o on reasoning about code and correcting itself.

coder543
2 replies
2h41m

I'm not really sure how to even test/use Mistral or Llama for everyday use though.

Both Mistral and Meta offer their own hosted versions of their models to try out.

https://chat.mistral.ai

https://meta.ai

You have to sign into the first one to do anything at all, and you have to sign into the second one if you want access to the new, larger 405B model.

Llama 3.1 is certainly going to be available through other platforms in a matter of days. Groq supposedly offered Llama 3.1 405B yesterday, but I never once got it to respond, and now it’s just gone from their website. Llama 3.1 70B does work there, but 405B is the one that’s supposed to be comparable to GPT-4o and the like.

d13
0 replies
1h7m

Groq’s models are also heavily quantised so you won’t get the full experience there.

Tepix
2 replies
2h25m

Claude is pretty great, but it's lacking the speech recognition and TTS, isn't it?

connorgutman
1 replies
2h18m

Correct. IMO the official Claude app is pretty garbage. Sonnet 3.5 API + Open-WebUI is amazing though and supports STT+TTS as well as a ton of other great features.

machiaweliczny
0 replies
1h8m

But projects are great in Sonnet: you just dump the DB schema and some core files and you can figure stuff out quickly. I guess Aider is similar, but I was lacking a good history of chats and changes.

J_Shelby_J
2 replies
2h38m

3.5 Sonnet is the quality of the OG GPT-4, but mind-blowingly fast. I need to cancel my ChatGPT sub.

layer8
1 replies
2h16m

mind blowingly fast

I would imagine this might change once enough users migrate to it.

kridsdale3
0 replies
50m

Eventually it comes down to who has deployed more silicon: AWS or Azure.

skerit
1 replies
1h0m

I don't get it. My husband also swears by Claude Sonnet 3.5, but every time I use it, the output is considerably worse than GPT-4o.

Zealotux
0 replies
4m

I don't see how that's possible. I decided to give GPT-4o a second chance after hitting my daily usage limit on Sonnet 3.5; after 10 prompts GPT-4o failed to give me what Claude did in a single prompt (game-related programming). And with Artifacts and Projects on top of that, the UX is miles ahead of anything OpenAI offers right now.

m3kw9
1 replies
1h34m

It's this kind of praise that makes me wonder if they are all paid to give glowing reviews; this is not my experience with Sonnet at all. It absolutely does not blow GPT-4o away.

simonw
0 replies
1h15m

My hunch is this comes down to personal prompting style. It's likely that your own style works more effectively with GPT-4o, while other people have styles that are more effective with Claude 3.5 Sonnet.

maccard
0 replies
2h48m

Agree on Claude. I also feel like ChatGPT has gotten noticeably worse over the last few months.

codazoda
0 replies
47m

I'm not really sure how to even test/use Mistral or Llama for everyday use though.

I've been recording installation and usage instructions on how I've been using these Open Source AI models on my machine (a Mac). If that sounds interesting to you, sign up and I'll put together a free webinar.

https://bit.ly/3LJbUvj

wesleyyue
7 replies
1h39m

I'm building an AI coding assistant (https://double.bot), so I've tried pretty much all the frontier models. I added it this morning to play around with, and it's probably the worst model I've ever played with. Less coherent than 8B models. Worst case of benchmark hacking I've ever seen.

example: https://x.com/WesleyYue/status/1816153964934750691

nabakin
2 replies
1h4m

Are you sure the chat history is being passed when the second message is sent? That looks like the kind of response you'd expect if it only received the prompt "in python" with no chat history at all.

wesleyyue
1 replies
56m

Yes, I built the extension. I also just sent another message asking what the first message was, to double-check I didn't have a bug, and it does know what the first message was.

nabakin
0 replies
16m

Thanks, that's some really bad accuracy/performance

mpeg
1 replies
1h36m

to be fair that's quite a weird request (the initial one) – I feel a human would struggle to understand what you mean

wesleyyue
0 replies
1h29m

definitely not an articulate request, but the point of using these tools is to speed me up. The less the user has to articulate and the more it can infer correctly, the more helpful it is. Other frontier models don't have this problem.

Llama 405B response would be exactly what I expect

https://x.com/WesleyYue/status/1816157147413278811

ijustlovemath
1 replies
1h35m

What was the expected outcome for you? AFAIK, Python doesn't have a const dictionary. Were you wanting it to refactor into a dataclass?

wesleyyue
0 replies
1h32m

Yes, there are a few things wrong:

1. If it assumes TypeScript, it should do `as const` in the first msg.

2. If it is Python, it should be something like https://x.com/WesleyYue/status/1816157147413278811, which is what I wanted, but I didn't want to bother with the typing.
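
For context, a minimal Python sketch of the kind of typed "constant" mapping I had in mind (names and values here are purely illustrative):

    from typing import Final, Mapping

    # Final tells type checkers the name must never be rebound, and
    # Mapping (unlike dict) exposes no mutating methods, so a checker
    # will also flag COLORS["red"] = "...". At runtime it is still an
    # ordinary dict, so this is static-only protection.
    COLORS: Final[Mapping[str, str]] = {
        "red": "#ff0000",
        "green": "#00ff00",
    }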

calibas
4 replies
2h0m

"Mistral Large 2 is equipped with enhanced function calling and retrieval skills and has undergone training to proficiently execute both parallel and sequential function calls, enabling it to serve as the power engine of complex business applications."

Why does the chart below say the "Function Calling" accuracy is about 50%? Does that mean it fails half the time with complex operations?

Me1000
2 replies
1h26m

Relatedly, what does "parallel" function calling mean in this context?

simonw
1 replies
1h19m

That's when the LLM can respond with multiple functions it wants you to call at once. You might send it:

    Location and population of Paris, France
A parallel function calling LLM could return:

    {
      "role": "assistant",
      "content": "",
      "tool_calls": [
        {
          "function": {
            "name": "get_city_coordinates",
            "arguments": "{\"city\": \"Paris\"}"
          }
        }, {
          "function": {
            "name": "get_city_population",
            "arguments": "{\"city\": \"Paris\"}"
          }
        }
      ]
    }
Indicating that you should execute both of those functions and return the results to the LLM as part of the next prompt.
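
A rough sketch of the calling side in Python (the exact message shape varies by provider, and the two functions here are just stand-ins with placeholder values):

    import json

    # Stand-in implementations of the two tools the model asked for.
    TOOLS = {
        "get_city_coordinates": lambda city: {"lat": 48.86, "lon": 2.35},
        "get_city_population": lambda city: {"population": 2_100_000},
    }

    def run_tool_calls(assistant_message):
        # Execute every requested call and build one "tool" message per
        # result, ready to append to the conversation for the next turn.
        results = []
        for call in assistant_message["tool_calls"]:
            fn = call["function"]
            args = json.loads(fn["arguments"])
            output = TOOLS[fn["name"]](**args)
            results.append({"role": "tool", "content": json.dumps(output)})
        return results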

Me1000
0 replies
1h2m

Ah, thank you!

simonw
0 replies
1h45m

Mistral forgot to say which benchmark they were using for that chart; without that information it's impossible to determine what it actually means.

breck
4 replies
2h37m

When I see this "© 2024 [Company Name], All rights reserved", it's a tell that the company does not understand how hopelessly behind they are about to be.

crowcroft
3 replies
2h23m

Could you elaborate on this? Would love to understand what leads you to this conclusion.

christianqchung
1 replies
1h36m

So it's made up?

RyanAdamas
4 replies
1h42m

Personally, I think language diversity should be the last thing on the list. If we had optimized every piece of software for a dozen languages from the get-go, our forward progress would have been dead in the water.

moffkalast
2 replies
1h32m

You'd think so, but 3.5-turbo was multilingual from the get-go and benefitted massively from it. If you want to position yourself as a global leader, then excluding the 95% of the world who aren't native English speakers seems like a bad idea.

RyanAdamas
1 replies
1h31m

Yeah clearly, OpenAI is rocketing forward and beyond.

moffkalast
0 replies
1h19m

Constant infighting and most of the competent people leaving will do that to a company.

I mean more on a model performance level though. It's been shown that training on material in one language lets the model output that knowledge in any other language it knows. There's quality human data being left on the table otherwise. Besides, translation is one of the few tasks that language models are by far the best at when trained properly, so why not do something you can sell as a main feature?

gpm
0 replies
1h5m

Language diversity means access to more training data, and you might also hope that by learning the same concept in multiple languages it does a better job of learning the underlying concept independent of the phrase structure...

At least from a distance it seems like training a multilingual state of the art model might well be easier than a monolingual one.

doctoboggan
3 replies
2h31m

The question I (and I suspect most other HN readers) have is which model is best for coding? While I appreciate the advances in open weights models and all the competition from other companies, when it comes to my professional use I just want the best. Is that still GPT-4?

tikkun
1 replies
2h30m

My personal experience says Claude 3.5 Sonnet.

stri8ed
0 replies
2h7m

The benchmarks agree as well.

tonetegeatinst
2 replies
2h3m

What do they mean by "single-node inference"?

Do they mean inference done on a single machine?

simonw
0 replies
1h38m

Yes, albeit a really expensive one. Large models like GPT-4 are rumored to run inference on multiple machines because they don't fit in VRAM for even the most expensive GPUs.

(I wouldn't be surprised if GPT-4o mini is small enough to fit on a single large instance though, would explain how they could drop the price so much.)

bjornsing
0 replies
1h35m

Yeah that’s how I read it. Probably means 8 x 80 GB GPUs.
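
Back-of-envelope, using my own numbers rather than anything from the announcement: weights alone for 123B parameters come to roughly 246 GB at 16-bit or 123 GB at 8-bit, so an 8 x 80 GB = 640 GB node fits them with room left over for the KV cache.

    # Rough weight-memory estimate for a 123B-parameter model.
    params = 123e9
    for bits in (16, 8):
        gb = params * (bits / 8) / 1e9
        print(f"{bits}-bit weights: ~{gb:.0f} GB")  # ~246 GB and ~123 GB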

thntk
2 replies
47m

Anyone know what caused the very big performance jump from Large 1 to Large 2 in just a few months?

Besides, parameter redundancy seems evident: frontier models used to be 1.8T, then 405B, and now 123B. If frontier models in the future were <10B or even <1B, that would be a game changer.

nuz
0 replies
46m

Lots and lots of synthetic data from the bigger models training the smaller ones would be my guess.

(For things like code, where you can run the synthetic results and test whether the generated code matches the prompt, synthetic data after filtering basically amounts to lots of professionally written, perfect ground-truth data.)
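
A minimal sketch of that filtering step, assuming you already have (candidate_code, unit_test) pairs generated by the bigger model; everything here is illustrative, and in practice you'd run it inside a sandbox:

    import os
    import subprocess
    import tempfile

    def passes_tests(candidate_code: str, unit_test: str, timeout: int = 10) -> bool:
        # Keep a synthetic sample only if the generated code actually
        # runs and its paired test exits cleanly.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(candidate_code + "\n\n" + unit_test)
            path = f.name
        try:
            result = subprocess.run(["python", path], capture_output=True, timeout=timeout)
            return result.returncode == 0
        except subprocess.TimeoutExpired:
            return False
        finally:
            os.unlink(path)

    # kept = [s for s in synthetic_samples if passes_tests(s["code"], s["test"])]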

duchenne
0 replies
4m

Counter-intuitively, larger models are cheaper to train to a given level of quality. However, smaller models are cheaper to serve. At first, everyone was focused on training, so the models were much larger. Now that so many people are using AI every day, companies spend more on training smaller models to save on serving.

rkwz
2 replies
2h46m

A significant effort was also devoted to enhancing the model’s reasoning capabilities. One of the key focus areas during training was to minimize the model’s tendency to “hallucinate” or generate plausible-sounding but factually incorrect or irrelevant information. This was achieved by fine-tuning the model to be more cautious and discerning in its responses, ensuring that it provides reliable and accurate outputs.

Is there a benchmark or something similar that compares this "quality" across different models?

amilios
1 replies
1h58m

Unfortunately not, as it captures such a wide spectrum of use cases and scenarios. There are some benchmarks to measure this quality in specific settings, e.g. summarization, but AFAIK nothing general.

rkwz
0 replies
1h52m

Thanks, any ideas why it's not possible to build a generic eval for this? Since it's about asking a set of questions whose answers aren't public knowledge (or are simply made up) and checking if the model says "I don't know"?
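
Naively I'd imagine something like the sketch below (ask_model standing in for whichever API is being tested), though I guess the hard parts are writing questions that are genuinely unanswerable and grading what counts as an admission of uncertainty:

    # Questions with no real answer; a non-hallucinating model should decline.
    UNANSWERABLE = [
        "What was the middle name of the first mayor of Atlantis?",
        "What is the serial number of the telescope Galileo lost in 1611?",
    ]

    REFUSAL_MARKERS = ("i don't know", "i'm not sure", "no record", "cannot verify")

    def abstention_rate(ask_model) -> float:
        # ask_model: callable taking a prompt and returning the model's reply text.
        abstained = sum(
            1 for q in UNANSWERABLE
            if any(m in ask_model(q).lower() for m in REFUSAL_MARKERS)
        )
        return abstained / len(UNANSWERABLE)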

nen-nomad
2 replies
1h23m

The models are converging slowly. In the end, it will come down to the user experience and the "personality." I have been enjoying the new Claude Sonnet. It feels sharper than the others, even though it is not the highest-scoring one.

One thing that `exponentialists` forget is that each step also requires exponentially more energy and resources.

toomuchtodo
0 replies
1h21m

I have been paying for OpenAI since they started accepting payment, but to echo your comment, Claude is so good I am primarily relying on it now for LLM driven work and cancelled my OpenAI subscription. Genuine kudos to Mistral, they are a worthy competitor in the space against Goliaths. They make someone mediocre at writing code less so, so I can focus on higher value work.

bilater
0 replies
1h6m

And a factor for Mistral typically is that it will give you fewer refusals and can be uncensored. So if I had to guess, any task that requires creative output could be better suited for this.

rkwasny
1 replies
2h4m

All evals we have are just far too easy! <1% difference is just noise/bad data

We need to figure out how to measure intelligence that is greater than human.

omneity
0 replies
1h32m

Give it problems most/all humans can't solve on their own, but that are easy to verify.

Math problems being one of them, if only LLMs were good at pure math. Another possibility is graph problems. Haven't tested this much though.
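
For the graph case, verification is cheap even when the question is tedious for a person. A rough sketch, assuming networkx is available and leaving out the model-querying side:

    import networkx as nx  # assumed available; used only for generation and checking

    def make_problem(n: int = 50, p: float = 0.1, seed: int = 0) -> nx.Graph:
        # Random graph the model is asked to reason about.
        return nx.gnp_random_graph(n, p, seed=seed)

    def verify_shortest_path(g: nx.Graph, source: int, target: int, claimed_path: list[int]) -> bool:
        # The claimed path must start/end correctly, use real edges,
        # and match the optimal length computed independently.
        if not claimed_path or claimed_path[0] != source or claimed_path[-1] != target:
            return False
        if not all(g.has_edge(u, v) for u, v in zip(claimed_path, claimed_path[1:])):
            return False
        return len(claimed_path) - 1 == nx.shortest_path_length(g, source, target)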

ilaksh
1 replies
1h34m

How does their API pricing compare to 4o and 3.5 Sonnet?

rvnx
0 replies
25m

3 USD per 1M input tokens, so the same as 3.5 Sonnet but worse quality

OldGreenYodaGPT
1 replies
2h39m

I still prefer ChatGPT-4o and use Claude if I have issues, but it never does any better.

jasonjmcghee
0 replies
1h21m

This is super interesting to me.

Claude Sonnet 3.5 outperforms GPT-4o by a significant margin on every one of my use cases.

What do you use it for?

moralestapia
0 replies
2h45m

Nice, they finally got the memo that GPT-4 exists and included it in their benchmarks.

huevosabio
0 replies
2h3m

The non-commercial license is underwhelming.

It seems to be competitive with Llama 3.1 405b but with a much more restrictive license.

Given how the difference between these models is shrinking, I think you're better off using Llama 405B to fine-tune the 70B on your specific use case.

This would be different if it was a major leap in quality, but it doesn't seem to be.

Very glad that there's a lot of competition at the top, though!

gavinray
0 replies
2h43m

"It's not the size that matters, but how you use it."

erichocean
0 replies
4m

I like Claude 3.5 Sonnet, but despite paying for a plan, I run out of tokens after about 10 minutes. Text only, I'm typing everything in myself.

It's almost useless because I literally can't use it.

epups
0 replies
2h42m

The graphs seem to indicate their model trades blows with Llama 3.1 405B, which has more than 3x the number of parameters and (presumably) a much bigger compute budget. It's kind of baffling if this is confirmed.

Apparently Llama 3.1 relied on synthetic data; I would be very curious about the type of data that Mistral uses.

daghamm
0 replies
15m

I don't care about Russian, Korean, or Java and C#. Where can I find a language model that speaks English and Python and is small enough to self-host?

Or maybe I should ask this instead: can we create really small models for specific domains, whether from scratch or out of larger models?

bugglebeetle
0 replies
2h50m

I love how much AI is bringing competition (and thus innovation) back to tech. Feels like things were stagnant for 5-6 years prior because of the FAANG stranglehold on the industry. Love also that some of this disruption is coming at out of France (HuggingFace and Mistral), which Americans love to typecast as incapable of this.

ashenke
0 replies
2h27m

I tested it with my claude prompt history, the results are as good as Claude 3.5 Sonnet, but it's 2 or 3 times slower

ThinkBeat
0 replies
1h32m

A side note about the ever-increasing cost of advancing these models: I feel certain that some branch of what may be connected to the NSA is running and advancing models that probably exceed what the open market provides today.

Maybe they are running it on proprietary or semi-proprietary hardware, but if they don't, how much does the market know about where various shipments of NVIDIA processors end up?

I imagine most intelligence agencies are in need of vast quantities.

I presume if M$ announces new availability of AI compute, it means they have received and put into production X NVIDIA processors, which might make it possible to guesstimate within some bounds how many.

Same with other open market compute facilities.

Is it likely that a significant share of NVIDIA processors is going to government / intelligence / fronts?

Tepix
0 replies
2h18m

Just in case you haven't RTFA: Mistral Large 2 is 123B.

Always42
0 replies
2h51m

I'm really glad these guys exist