These companies full of brilliant engineers are pouring millions of dollars in training costs into producing SOTA models that are... "on par with GPT-4o and Claude Opus"? And then the next 2.23% bump will cost another XX million? It seems increasingly apparent that we are reaching the limits of throwing more data at more GPUs, and that an ARC-prize-level breakthrough is needed to move the needle any further at this point.
Links to chat with models that released this week:
Large 2 - https://chat.mistral.ai/chat
Llama 3.1 405b - https://www.llama2.ai/
I just tested Mistral Large 2 and Llama 3.1 405b on 5 prompts from my Claude history.
I'd rank as:
1. Sonnet 3.5
2. Large 2 and Llama 405b (similar, no clear winner between the two)
If you're using Claude, stick with it.
My Claude wishlist:
1. Smarter (yes, it's the most intelligent, and yes, I wish it was far smarter still)
2. Longer context window (1M+)
3. Native audio input including tone understanding
4. Fewer refusals and less moralizing when refusing
5. Faster
6. More tokens in output
None of the 3 models you ranked can get "how many r's are in strawberry?" correct. They all claim 2 r's unless you press them. With all the training data, I'm surprised none of them has fixed this yet.
When using a prompt that involves thinking first, all three get it correct.
"Count how many rs are in the word strawberry. First, list each letter and indicate whether it's an r and tally as you go, and then give a count at the end."
Llama 405b: correct
Mistral Large 2: correct
Claude 3.5 Sonnet: correct
It’s not impressive that one has to go to that length though.
You can always find something to be unimpressed by I suppose, but the fact that this was fixable with plain english is impressive enough to me.
The technology is frustrating because (a) you never know what may require fixing, and (b) you never know if it is fixable by further instructions, and if so, by which ones. You also mostly* cannot teach it any fixes (as an end user). Using it is just exhausting.
*) that is, except sometimes by making adjustments to the system prompt
I think this particular example, of counting letters, is obviously going to be hard when you know how tokenization works. It's totally possible to develop an intuition for when other things will or won't work, but like all ML-powered tools, you can't hope for 100% accuracy. The best you can do is have good metrics and track performance on test sets.
I actually think the craziest part of LLMs is just how much you, as a developer or SME, can fix with plain English prompting once you have that intuition. Of course some things aren't fixable that way, but the mere fact that many cases are fixable simply by explaining the task to the model better in plain English is a wildly different paradigm! The jury is still out, but I think it's worth being excited about: there are a lot more people with good language skills than there are Python programmers or ML experts.
To be fair, I just asked a real person and had to go to even greater lengths:
Me: How many "r"s are in strawberry?
Them: What?
Me: How many times does the letter "r" appear in the word "strawberry"?
Them: Is this some kind of trick question?
Me: No. Just literally, can you count the "r"s?
Them: Uh, one, two, three. Is that right?
Me: Yeah.
Them: Why are you asking me this?
Try asking a young child...
This can be automated.
GPT-4o already does that: for problems involving math, it will write small Python programs to handle the calculations instead of doing them with the LLM itself.
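For illustration, here's a minimal sketch of the kind of throwaway helper a model (or a user) can write to do the counting in ordinary code rather than inside the model; the word and letter are just placeholders:

    # Count a letter directly in code, sidestepping tokenization entirely.
    word = "strawberry"
    target = "r"
    count = sum(1 for ch in word.lower() if ch == target)
    print(f"There are {count} '{target}'s in '{word}'.")  # -> 3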
To me it's just a limitation based on the world as seen by these models. They know there's a letter called 'r', they even know that some words start with 'r' or have r's in them, and they know what the spelling of some words is. But they've never actually seen one, as their world is made up entirely of tokens. The word 'red' isn't r-e-d but is instead like a pictogram to them. But they know the spelling of strawberry and can identify an 'r' when it's on its own and count those, despite not being able to see the r's in the word itself.
The great-parent demonstrates that they are nevertheless capable of doing so, but not without special instructions. Your elaboration doesn’t explain why the special instructions are needed.
IMO it's impressive that any of this even remotely works, especially when you consider all the hacks like tokenization that I'd assume add layers of obfuscation.
There are definitely tons of weaknesses with LLMs, but I continue to be impressed at what they do right, not upset at what they do wrong.
In a park people come across a man playing chess against a dog. They are astonished and say: "What a clever dog!" But the man protests: "No, no, he isn't that clever. I'm leading by three games to one!"
Compared to chat bots of even 5 years ago the answer of two is still mind-blowing.
This reminds me of when I had to supervise outsourced developers. I wanted to say "build a function that does X and returns Y". But instead I had to say "build a function that takes these inputs, loops over them and does A or B based on condition C, and then return Y by applying Z transformation"
At that point it was easier to do it myself.
Exact instruction challenge https://www.youtube.com/watch?v=cDA3_5982h8
"What programming computers is really like."
EDIT: Although perhaps it's even more important when dealing with humans and contracts. Someone could deliberately interpret the words in a way that's to their advantage.
Chain-of-Thought (CoT) prompting to the rescue!
We should always put some effort into prompt engineering before dismissing the potential of generative AI.
By this point, instruction tuning should include tuning the model to use chain of thought in the appropriate circumstances.
Can’t you just instruct your LLM of choice to transform your prompts like this for you? Basically feed it a bunch of heuristics that will help it better understand the thing you tell it.
Maybe the various chat interfaces already do this behind the scenes?
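As a rough sketch of that idea (not how any particular chat interface actually works): one cheap model call rewrites the terse question into a step-by-step prompt, and a second call answers it. The model names and OpenAI-style client here are placeholders.

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def ask_with_rewrite(question: str) -> str:
        # Pass 1: turn a terse question into a "think step by step" prompt.
        rewritten = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": "Rewrite this question so the answerer is asked to "
                           "reason step by step and show its work before giving "
                           f"a final answer: {question}",
            }],
        ).choices[0].message.content

        # Pass 2: answer the rewritten prompt.
        answer = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": rewritten}],
        )
        return answer.choices[0].message.content

    print(ask_with_rewrite("How many r's are in strawberry?"))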
Tokenization makes it hard for it to count the letters; that's also why, if you ask it to do maths, writing the number in letters will yield better results.
For strawberry, it sees it as [496, 675, 15717], which is str aw berry.
If you insert characters to break the tokens up, it finds the correct result: how many r's are in "s"t"r"a"w"b"e"r"r"y" ?
There are 3 'r's in "s"t"r"a"w"b"e"r"r"y".
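If you want to see the splits yourself, here is a small sketch using the tiktoken library (my assumption; the IDs quoted above come from whichever tokenizer that commenter's model uses, so the exact pieces will differ):

    # pip install tiktoken
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # a GPT-4-era tokenizer
    ids = enc.encode("strawberry")
    pieces = [enc.decode([i]) for i in ids]
    print(ids, pieces)  # a few sub-word chunks, not ten individual letters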
If you insert characters to break the tokens up, it finds the correct result: how many r's are in "s"t"r"a"w"b"e"r"r"y" ?
The issue is that humans don't talk like this. I don't ask someone how many r's there are in strawberry by spelling out strawberry, I just say the word.
Humans also constantly make mistakes that are due to proximity in their internal representation. "Could of"/"Should of" comes to mind: the letters "of" have a large edit distance from "'ve", but their pronunciation is the same.
Native speakers especially are prone to the mistake, as they grew up learning English as illiterate children, from sounds only, compared to how most people learning English as a second language do it, together with the textual representation.
Psychologists use this trick as well to figure out internal representations, for example the Rorschach test.
And probably, if you asked random people on the street how many P's there are in "Philippines", you'd also get lots of wrong answers. It's tricky due to the double p and the initial p being part of an f sound. The demonym uses "F" as the first letter, and in many languages, say Spanish, the country name also uses an F.
Until I was ~12, I thought 'a lot' was a single word.
This is only an issue if you send commands to a LLM as you were communicating to a human.
This is only an issue if you send commands to a LLM as you were communicating to a human.
Yes, it's an issue. We want the convenience of sending human-legible commands to LLMs and getting back human-readable responses. That's the entire value proposition lol.
Count the number of occurrences of the letter e in the word "enterprise".
Problems can exist as instances of a class of problems. If you can't solve a problem, it's useful to know if it's a one off, or if it belongs to a larger class of problems, and which class it belongs to. In this case, the strawberry problem belongs to the much larger class of tokenization problems - if you think you've solved the tokenization problem class, you can test a model on the strawberry problem, with a few other examples from the class at large, and be confident that you've solved the class generally.
It's not about embodied human constraints or how humans do things; it's about what AI can and can't do. Right now, because of tokenization, things like understanding the number of Es in strawberry are outside the implicit model of the word in the LLM, with downstream effects on tasks it can complete. This affects moderation, parsing, generating prose, and all sorts of unexpected tasks. Having a workaround like forcing the model to insert spaces and operate on explicitly delimited text is useful when affected tasks appear.
I don't ask someone how many r's there are in strawberry by spelling out strawberry, I just say the word.
No, I would actually be pretty confident you don’t ask people that question… at all. When is the last time you asked a human that question?
I can’t remember ever having anyone in real life ask me how many r’s are in strawberry. A lot of humans would probably refuse to answer such an off-the-wall and useless question, thus “failing” the test entirely.
A useless benchmark is useless.
In real life, people overwhelmingly do not need LLMs to count occurrences of a certain letter in a word.
It's not a human. I imagine if you have a use case where counting characters is critical, it would be trivial to programmatically transform prompts into lists of letters.
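Something like this pre-processing step would do, as a rough sketch (the prompt framing is just an example):

    def spell_out(word: str) -> str:
        # "strawberry" -> "s t r a w b e r r y"
        return " ".join(word)

    word = "strawberry"
    prompt = (
        f'The word "{word}" spelled letter by letter is: {spell_out(word)}. '
        f"How many times does the letter 'r' appear in it?"
    )
    print(prompt)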
A token is roughly four letters [1], so, among other probable regressions, this would significantly reduce the effective context window.
[1] https://help.openai.com/en/articles/4936856-what-are-tokens-...
Humans also would probably be very likely to guess 2 r's if they had never seen any written words or had the word spelled out to them as individual letters before, which is kind of close to how language models treat it, despite being a textual interface.
LLMs think in tokens, not letters. It's like asking someone who is dyslexic about spelling. Not their strong suit. In practice, it doesn't matter much, does it?
Sometimes it does, sometimes it doesn't.
It is evidence that LLMs aren't appropriate for everything, and that there could exist something that works better for some tasks.
Language models are best treated like consciousness. Our consciousness does a lot less than people like to attribute to it. It is mostly a function of introspection and making connections, rather than being the part of the brain where higher-level reasoning happens, or the part that tells your body how to stay alive (like keeping your heart beating).
By allowing a language model to do function calling, you are essentially allowing it to do specialized "subconscious" thought. The language model becomes a natural language interface to the capabilities of its "subconsciousness".
A specific human analogy could be: I tell you to pick up a pen off of the table, and then you do it. Most of your mental activity would be subconscious, orienting your arm and hand properly to pick up the pen, actually grabbing the pen, and picking it up. The linguistic representation of the action would exist in your conscious mind (pick up the pen), but not much else.
A language model could very easily call out to a text processing function to correctly do things like count the number of r's in the word strawberry. That is a job that your conscious mind can dispatch to your subconsciousness.
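As a toy sketch of that dispatch (the tool name and wiring are invented for illustration, not any vendor's actual API): the model emits a structured call, and plain code does the counting before the result is handed back.

    def count_letter(word: str, letter: str) -> int:
        return word.lower().count(letter.lower())

    # What a parsed, model-issued tool call might look like:
    tool_call = {"name": "count_letter",
                 "arguments": {"word": "strawberry", "letter": "r"}}

    TOOLS = {"count_letter": count_letter}
    result = TOOLS[tool_call["name"]](**tool_call["arguments"])
    print(result)  # 3 -- this value goes back into the next prompt as the tool result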
This doesn't make any sense. The LLM knows the token to letter associations if you ask it. Once it has split the token into letters, it can answer the question anyway.
This tells us that the model is unaware of its own limitations, otherwise it would have figured out that this task needs it to split the letters first.
Well splitting into letters doesn't help with Mistral Large 2. I tried this with Mistral 2:
[Question: How many r's in strawberry? First split the word into letters, one letter on each line then answer the question.]
[Answer: Let's split the word "strawberry" into letters:
s t r a w b e r r y
Now, let's count the number of 'r's:
There are 2 'r's in the word "strawberry".]
[Question: Did you count the number of r's?]
[answer: Yes, I did. Here's the count:
r r
So, there are 2 'r's in the word "strawberry".]
I suspect the issue is these models have to be trained to accomplish tasks, and this is not a task it makes sense to train a model to perform. You might train it to do this with function calling one day, calling a python library to count the r's in a string, but actually manually counting r's just doesn't play to an LLM's strengths.
Testing models on their tokenization has always struck me as kinda odd. Like, that has nothing to do with their intelligence.
It doesn’t test “on tokenization” though. What happens when an answer is generated is a few abstraction levels deeper than tokens. A “thinking” “slice” of an LLM is completely unaware of tokens as an immediate part of its reasoning. The question just shows a lack of systemic knowledge about strawberry as a word (which isn’t surprising, tbh).
Surfacing and underscoring obvious failure cases for general "helpful chatbot" use is always going to be valuable because it highlights how the "helpful chatbot" product is not really intuitively robust.
Meanwhile, it helps make sure engineers and product designers who want to build a more targeted product around LLM technology know that it's not suited to tasks that may trigger those kinds of failures. This may be obvious to you as an engaged enthusiast or cutting edge engineer or whatever you are, but it's always going to be new information to somebody as the field grows.
I would counterargue with "that's the model's problem, not mine".
Here's a thought experiment: if I gave you 5 boxes and asked you "how many balls are there in all of these boxes?" and you answered "I don't know because they are inside boxes", that's a fail. A truly intelligent individual would open them and look inside.
A truly intelligent model would (say) retokenize the word into its individual letters (which I'm optimistic they can) and then would count those. The fact that models cannot do this is proof that they lack some basic building blocks for intelligence. Model designers don't get to argue "we are human-like except in the tasks where we are not".
4o will get the answer right on the first go if you ask it "Search the Internet to determine how many R's are in strawberry?" which I find fascinating
I didn't even need to do that. 4o got it right straight away with just:
"how many r's are in strawberry?"
The funny thing is, I replied, "Are you sure?" and got back, "I apologize for the mistake. There are actually two 'r's in the word strawberry."
I just tried Llama 3.1 8B; this is its reply.
According to multiple sources, including linguistic analysis and word breakdowns, there are 3 Rs in the word "strawberry".
Sonnet 3.5 thinks 2.
Lots of replies mention tokens as the root cause and I’m not well versed in this stuff at the low level but to me the answer is simple:
When this question is asked (from what the models trained on) the question is NOT “count the number of times r appears in the word strawberry” but instead (effectively) “I’ve written ‘strawbe’, now how many r’s are in strawberry again? Is it 1 or 2?”.
I think most humans would probably answer “there are 2” if we saw someone was writing and they asked that question, even without seeing what they have written down. Especially if someone said “does strawberry have 1 or 2 r’s in it?”. You could be a jerk and say “it actually has 3” or answer the question they are actually asking.
It’s an answer that is _technically_ incorrect but the answer people want in reality.
Due to the fact that LLMs work on tokens and not characters, these sort of questions will always be hard for them.
Claude 3 Opus gave correct answer.
I wrote and published a paper at COLING 2022 on why LLMs in general won't solve this without either 1. radically increasing vocab size, 2. rethinking how tokenizers are done, or 3. forcing it with constraints:
Longer context window (1M+)
What's your use case for this? Uploading multiple documents/books?
Correct
That would make each API call cost at least $3 ($3 is price per million input tokens). And if you have a 10 message interaction you are looking at $30+ for the interaction. Is that what you would expect?
This might be when it's better to not use the API and just pay for the flat-rate subscription.
Maybe they're summarizing/processing the documents in a specific format instead of chatting? If they needed chat, might be easier to build using RAG?
Gemini 1.5 Pro charges $0.35/million tokens up to the first million tokens or $0.70/million tokens for prompts longer than one million tokens, and it supports a multi-million token context window.
Substantially cheaper than $3/million, but I guess Anthropic’s prices are higher.
Uploading large codebases is particularly useful.
Books, especially textbooks, would be amazing. These things can get pretty huge (1000+ pages) and usually do not fit into GPT-4o or Claude Sonnet 3.5 in my experience. I envision the models being able to help a user (student) create their study guides and quizzes after ingesting the entire book. Given the ability to ingest an entire book, I imagine a model could plan how and when to introduce each concept in the textbook better than a model that only sees part of it.
Large 2 is significantly smaller at 123B so it being comparable to llama 3 405B would be crazy.
This race for the top model is getting wild. Everyone is claiming to one-up each other with every version.
My experience (benchmarks aside) Claude 3.5 Sonnet absolutely blows everything away.
I'm not really sure how to even test/use Mistral or Llama for everyday use though.
I stopped my ChatGPT subscription and subscribed instead to Claude; it's simply much better. But it's hard to tell how much better it is day to day beyond my main use case of coding. It's more that ChatGPT felt degraded than that Claude was much better. The hedonic treadmill runs deep.
GPT-4 was probably as good as Claude Sonnet 3.5 at its outset, but OpenAI ran it into the ground with whatever they’re doing to save on inference costs, scale it, align it, or add dumb product features.
Indeed, it used to output all the code I needed, but now it only outputs a draft of the code with prompts telling me to fill in the rest. If I wanted to fill in the rest, I wouldn't have asked you, now would I?
It's doing something different for me. It seems almost desperate to generate vast chunks of boilerplate code that are only tangentially related to the question.
That's my perception, anyway.
This is also my perception using it daily for the last year or so. Sometimes it also responds with exactly what I provided it with and does not make any changes. It's also bad at following instructions.
GPT-4 was great until it became "lazy" and filled the code with lots of `// Draw the rest of the fucking owl` type comments. Then GPT-4o was released and it's addicted to "Here's what I'm going to do: 1. ... 2. ... 3. ..." and lots of frivolous, boilerplate output.
I wish I could go back to some version of GPT-4 that worked well but with a bigger context window. That was like the golden era...
This is also my experience. Previously it got good at giving me only relevant code, which, as an experienced coder, is what I want. My favorites were the one-line responses.
Now it often falls back to generating full examples, explanations, restating the question and its approach. I suspect this is by design, as (presumably) less experienced folks want or need all that. For me, I wish I could consistently turn it into one of those way-too-terse devs who reply with the bare minimum example and expect you to infer the rest. Usually that is all I want or need, and I can ask for elaboration when it's not. I haven't found the best prompts to retrigger this persona from it yet.
I wouldn't have asked you, now would I?
That's what I said to it - "If I wanted to fill in the missing parts myself, why would I have upgraded to paid membership?"
Have you (or anyone) swapped in an Anthropic API key on Cursor?
As a coding assistant, it's on my to-do list to try. Cursor needs some serious work on model selection clarity though, so I keep putting it off.
I did it (fairly simple really) but found most of my (unsophisticated) coding these days to go through Aider [1] paired with Sonnet, for UX reasons mostly. It is easier to just prompt over the entire codebase, vs Cursor way of working with text selections.
I believe Cursor allows for prompting over the entire codebase too: https://docs.cursor.com/chat/codebase
That is chatting, but it will not change the code.
One big advantage Claude artifacts have is that they maintain conversation context, versus when I am working with Cursor I have to basically repeat a bunch of information for each prompt, there is no continuity between requests for code edits.
If Cursor fixed that, the user experience would become a lot better.
Sonnet 3.5 to me still seems far ahead. Maybe not on the benchmarks, but in everyday life I am finding it renders the other models useless. Even still, this monthly progress across all companies is exciting to watch. It's very gratifying to see useful technology advance at this pace; it makes me excited to be alive.
Such a relief/contrast to the period between 2010 and 2020, when the top five (Google, Apple, Facebook, Amazon, and Microsoft) monopolized their own regions and refused to compete with any other player in new fields.
Google : Search
Facebook : social
Apple : phones
Amazon : shopping
Microsoft : enterprise ..
Even still, this monthly progress across all companies is exciting to watch. It's very gratifying to see useful technology advance at this pace; it makes me excited to be alive.
Google refused to compete with Apple in phones?
Microsoft also competes in search, phones
Microsoft, Amazon and Google compete in cloud too
Given we don’t know precisely what’s happening in the black box, we can say that specs alone don’t give you the full picture of the experience… Apple style.
I’ve stopped using anything else as a coding assistant. It’s head and shoulders above GPT-4o on reasoning about code and correcting itself.
I'm not really sure how to even test/use Mistral or Llama for everyday use though.
Both Mistral and Meta offer their own hosted versions of their models to try out.
You have to sign into the first one to do anything at all, and you have to sign into the second one if you want access to the new, larger 405B model.
Llama 3.1 is certainly going to be available through other platforms in a matter of days. Groq supposedly offered Llama 3.1 405B yesterday, but I never once got it to respond, and now it’s just gone from their website. Llama 3.1 70B does work there, but 405B is the one that’s supposed to be comparable to GPT-4o and the like.
meta.ai is inaccessible in a large portion of world territories, but the Llama 3.1 70B and 405B are also available in https://hf.co/chat
Additionally, all Llama 3.1 models are available in https://api.together.ai/playground/chat/meta-llama/Meta-Llam... and in https://fireworks.ai/models/fireworks/llama-v3p1-405b-instru... by logging in.
Groq’s models are also heavily quantised so you won’t get the full experience there.
Claude is pretty great, but it's lacking the speech recognition and TTS, isn't it?
Correct. IMO the official Claude app is pretty garbage. Sonnet 3.5 API + Open-WebUI is amazing though and supports STT+TTS as well as a ton of other great features.
But Projects are great in Sonnet: you just dump the DB schema and some core files and you can figure stuff out quickly. I guess Aider is similar, but I was lacking a good history of chats and changes.
3.5 sonnet is the quality of the OG GPT-4, but mind blowingly fast. I need to cancel my chatgpt sub.
mind blowingly fast
I would imagine this might change once enough users migrate to it.
Eventually it comes down to who has deployed more silicon: AWS or Azure.
I don't get it. My husband also swears by Claude Sonnet 3.5, but every time I use it, the output is considerably worse than GPT-4o.
I don't see how that's possible. I decided to give GPT-4o a second chance after hitting my daily usage limit on Sonnet 3.5; after 10 prompts GPT-4o failed to give me what Claude did in a single prompt (game-related programming). And with fragments and projects on top of that, the UX is miles ahead of anything OpenAI offers right now.
It’s this kind of praise that makes me wonder if they are all paid to give glowing reviews; this is not my experience with Sonnet at all. It absolutely does not blow away GPT-4o.
My hunch is this comes down to personal prompting style. It's likely that your own style works more effectively with GPT-4o, while other people have styles that are more effective with Claude 3.5 Sonnet.
Agree on Claude. I also feel like ChatGPT has gotten noticeably worse over the last few months.
To help keep track of the race, I put together a simple dashboard to visualize model/provider leaders in capability, throughput, and cost. Hope someone finds it useful!
Google Sheet: https://docs.google.com/spreadsheets/d/1foc98Jtbi0-GUsNySddv...
I'm not really sure how to even test/use Mistral or Llama for everyday use though.
I've been recording installation and usage instructions on how I've been using these Open Source AI models on my machine (a Mac). If that sounds interesting to you, sign up and I'll put together a free webinar.
I'm building an AI coding assistant (https://double.bot), so I've tried pretty much all the frontier models. I added this one this morning to play around with, and it's probably the worst model I've ever played with. Less coherent than 8B models. Worst case of benchmark hacking I've ever seen.
Are you sure the chat history is being passed when the second message is sent? That looks like the kind of response you'd expect if it only received the prompt "in python" with no chat history at all.
Yes, I built the extension. I also just sent another message asking what the first message was, just to double-check I didn't have a bug, and it does know what the first message was.
Thanks, that's some really bad accuracy/performance
to be fair that's quite a weird request (the initial one) – I feel a human would struggle to understand what you mean
definitely not an articulate request, but the point of using these tools is to speed me up. The less the user has to articulate and the more it can infer correctly, the more helpful it is. Other frontier models don't have this problem.
Llama 405B response would be exactly what I expect
What was the expected outcome for you? AFAIK, Python doesn't have a const dictionary. Were you wanting it to refactor into a dataclass?
Yes, there are a few things wrong: 1. If it assumes TypeScript, it should do `as const` in the first message. 2. If it is Python, it should be something like https://x.com/WesleyYue/status/1816157147413278811, which is what I wanted, but I didn't want to bother with the typing.
"Mistral Large 2 is equipped with enhanced function calling and retrieval skills and has undergone training to proficiently execute both parallel and sequential function calls, enabling it to serve as the power engine of complex business applications."
Why does the chart below say the "Function Calling" accuracy is about 50%? Does that mean it fails half the time with complex operations?
Relatedly, what does "parallel" function calling mean in this context?
That's when the LLM can respond with multiple functions it wants you to call at once. You might send it:
Location and population of Paris, France
A parallel function calling LLM could return:

    {
      "role": "assistant",
      "content": "",
      "tool_calls": [
        {
          "function": {
            "name": "get_city_coordinates",
            "arguments": "{\"city\": \"Paris\"}"
          }
        },
        {
          "function": {
            "name": "get_city_population",
            "arguments": "{\"city\": \"Paris\"}"
          }
        }
      ]
    }
Indicating that you should execute both of those functions and return the results to the LLM as part of the next prompt.
Ah, thank you!
Mistral forgot to say which benchmark they were using for that chart, without that information it's impossible to determine what it actually means.
When I see this "© 2024 [Company Name], All rights reserved", it's a tell that the company does not understand how hopelessly behind they are about to be.
Could you elaborate on this? Would love to understand what leads you to this conclusion.
E = T/A! [0]
A faster evolving approach to AI is coming out this year that will smoke anyone who still uses the term "license" in regards to ideas [1].
[0] https://breckyunits.com/eta.html [1] https://breckyunits.com/freedom.html
So it's made up?
I do what I say and I say what I do.
https://github.com/breck7/breckyunits.com/blob/afe70ad66cfbb...
Personally, I think language diversity should be the last thing on the list. If we had optimized every piece of software from the get-go for a dozen languages, our forward progress would have been dead in the water.
You'd think so, but 3.5-turbo was multilingual from the get go and benefitted massively from it. If you want to position yourself as a global leader, then excluding 95% of the world who aren't English native speakers seems like a bad idea.
Yeah clearly, OpenAI is rocketing forward and beyond.
Constant infighting and most of the competent people leaving will do that to a company.
I mean more on a model performance level though. It's been shown that something trained in one language trains the model to be able to output it in any other language it knows. There's quality human data being left on the table otherwise. Besides, translation is one of the few tasks that language models are by far the best at if trained properly, so why not do something you can sell as a main feature?
Language diversity means access to more training data, and you might also hope that by learning the same concept in multiple languages it does a better job of learning the underlying concept independent of the phrase structure...
At least from a distance it seems like training a multilingual state of the art model might well be easier than a monolingual one.
The question I (and I suspect most other HN readers) have is which model is best for coding? While I appreciate the advances in open weights models and all the competition from other companies, when it comes to my professional use I just want the best. Is that still GPT-4?
My personal experience says Claude 3.5 Sonnet.
The benchmarks agree as well.
I kinda trust https://aider.chat/docs/leaderboards/
What do they mean by "single-node inference"?
Do they mean inference done on a single machine?
Yes, albeit a really expensive one. Large models like GPT-4 are rumored to run inference on multiple machines because they don't fit in VRAM for even the most expensive GPUs.
(I wouldn't be surprised if GPT-4o mini is small enough to fit on a single large instance though, would explain how they could drop the price so much.)
Yeah that’s how I read it. Probably means 8 x 80 GB GPUs.
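A back-of-envelope check of that reading (assuming FP16 weights and ignoring KV cache and activation overhead; the 123B and 405B parameter counts are the publicly stated ones):

    params = {"Mistral Large 2": 123e9, "Llama 3.1 405B": 405e9}
    bytes_per_param = 2      # FP16
    node_vram_gb = 8 * 80    # 8 x 80 GB GPUs

    for name, p in params.items():
        gb = p * bytes_per_param / 1e9
        print(f"{name}: ~{gb:.0f} GB of weights (node has {node_vram_gb} GB)")
    # Mistral Large 2 (~246 GB) fits on one such node; Llama 405B (~810 GB)
    # needs lower precision (e.g. FP8) or more than one node.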
Anyone know what caused the very big performance jump from Large 1 to Large 2 in just a few months?
Besides, parameter redundancy seems evident. Frontier models used to be 1.8T, then 405B, and now 123B. If frontier models in the future were <10B or even <1B, that would be a game changer.
Lots and lots of synthetic data from the bigger models training the smaller ones would be my guess.
(For things like code, where you can compile the synthetic results and test whether the generated code matches the prompt, synthetic data after filtering basically amounts to lots of professionally written, perfect ground-truth data.)
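A stripped-down sketch of that filtering idea (real pipelines run candidates in a sandbox; exec() here is purely illustrative, and the sample is made up):

    def passes_test(candidate_code: str, test_code: str) -> bool:
        scope: dict = {}
        try:
            exec(candidate_code, scope)  # define the candidate function(s)
            exec(test_code, scope)       # run the asserts against them
            return True
        except Exception:
            return False

    sample = {
        "prompt": "Write add(a, b) that returns the sum.",
        "code": "def add(a, b):\n    return a + b",
        "test": "assert add(2, 3) == 5",
    }
    kept = [sample] if passes_test(sample["code"], sample["test"]) else []
    print(len(kept))  # 1 -- only verified samples make it into the training set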
Counter-intuitively, larger models are cheaper to train. However, smaller models are cheaper to serve. At first, everyone was focusing on training, so the models were much larger. Now, so many people are using AI everyday, so companies spend more on training smaller models to save on serving.
A significant effort was also devoted to enhancing the model’s reasoning capabilities. One of the key focus areas during training was to minimize the model’s tendency to “hallucinate” or generate plausible-sounding but factually incorrect or irrelevant information. This was achieved by fine-tuning the model to be more cautious and discerning in its responses, ensuring that it provides reliable and accurate outputs.
Is there a benchmark or something similar that compares this "quality" across different models?
Unfortunately not, as it captures such a wide spectrum of use cases and scenarios. There are some benchmarks to measure this quality in specific settings, e.g. summarization, but AFAIK nothing general.
Thanks, any ideas why it's not possible to build a generic eval for this? Since it's about asking a set of questions that's not public knowledge (or making stuff up) and checking if the model says "I don't know"?
The models are converging slowly. In the end, it will come down to the user experience and the "personality." I have been enjoying the new Claude Sonnet. It feels sharper than the others, even though it is not the highest-scoring one.
One thing that `exponentialists` forget is that each step also requires exponentially more energy and resources.
I have been paying for OpenAI since they started accepting payment, but to echo your comment, Claude is so good I am primarily relying on it now for LLM driven work and cancelled my OpenAI subscription. Genuine kudos to Mistral, they are a worthy competitor in the space against Goliaths. They make someone mediocre at writing code less so, so I can focus on higher value work.
And a factor for Mistral typically is that it will give you fewer refusals and can be uncensored. So if I had to guess, any task that requires creative output could be better suited for it.
All evals we have are just far too easy! <1% difference is just noise/bad data
We need to figure out how to measure intelligence that is greater than human.
Give it problems most/all humans can't solve on their own, but that are easy to verify.
Math problems being one of them, if only LLMs were good at pure math. Another possibility is graph problems. Haven't tested this much though.
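A tiny sketch of what "easy to verify" could look like for graph problems: generate a random graph, ask the model for a shortest-path length, and grade it with BFS. The sizes and the placeholder model answer are made up for illustration.

    import random
    from collections import deque

    def random_graph(n, p, seed=0):
        rng = random.Random(seed)
        adj = {i: set() for i in range(n)}
        for i in range(n):
            for j in range(i + 1, n):
                if rng.random() < p:
                    adj[i].add(j)
                    adj[j].add(i)
        return adj

    def shortest_path_len(adj, start, goal):
        # Breadth-first search gives the ground-truth shortest path length.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if node == goal:
                return dist[node]
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        return None  # unreachable

    adj = random_graph(12, 0.25)
    truth = shortest_path_len(adj, 0, 11)
    model_answer = 3  # placeholder for whatever the LLM replied
    print("correct" if model_answer == truth else f"wrong, expected {truth}")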
How does their API pricing compare to 4o and 3.5 Sonnet?
3 USD per 1M input tokens, so the same as 3.5 Sonnet but worse quality
I still prefer ChatGPT-4o and use Claude if I have issues, but it never does any better.
This is super interesting to me.
Claude Sonnet 3.5 outperforms GPT-4o by a significant margin on every one of my use cases.
What do you use it for?
Nice, they finally got the memo that GPT4 exists and include it in their benchmarks.
The name just makes me think of the screaming cowboy song. https://youtu.be/rvrZJ5C_Nwg?t=138
The non-commercial license is underwhelming.
It seems to be competitive with Llama 3.1 405b but with a much more restrictive license.
Given how the difference between these models is shrinking, I think you're better off using llama 405B to finetune the 70B on the specific use case.
This would be different if it was a major leap in quality, but it doesn't seem to be.
Very glad that there's a lot of competition at the top, though!
"It's not the size that matters, but how you use it."
I like Claude 3.5 Sonnet, but despite paying for a plan, I run out of tokens after about 10 minutes. Text only, I'm typing everything in myself.
It's almost useless because I literally can't use it.
The graphs seem to indicate their model trades blows with Llama 3.1 405B, which has more than 3x the number of parameters and (presumably) a much bigger compute budget. It's kind of baffling if this is confirmed.
Apparently Llama 3.1 relied on artificial data, would be very curious about the type of data that Mistral uses.
I don't care about Russian, Korean, or Java and C#. Where can I find a language model that speaks English and Python and is small enough to self-host?
Or maybe I should ask this instead: can we create really small models for specific domains, whether from scratch or out of larger models?
I love how much AI is bringing competition (and thus innovation) back to tech. Feels like things were stagnant for 5-6 years prior because of the FAANG stranglehold on the industry. Love also that some of this disruption is coming at out of France (HuggingFace and Mistral), which Americans love to typecast as incapable of this.
important to note that this time around weights are available https://huggingface.co/mistralai/Mistral-Large-Instruct-2407
I tested it with my claude prompt history, the results are as good as Claude 3.5 Sonnet, but it's 2 or 3 times slower
A side note about the ever-increasing costs of advancing these models: I feel certain that some branch of government, perhaps connected to the NSA, is running and advancing models that probably exceed what the open market provides today.
Maybe they are running it on proprietary or semi-proprietary hardware, but if they don't, how much does the market know about where various shipments of NVIDIA processors end up?
I imagine most intelligence agencies are in need of vast quantities.
I presume if M$ announces new availability of AI compute, it means they have received and put into production X NVIDIA processors, which might make it possible to guesstimate within some bounds how many.
Same with other open market compute facilities.
Is it likely that a significant share of NVIDIA processors is going to government / intelligence agencies / fronts?
Just in case you haven't RTFA: Mistral Large 2 is 123B.
I'm really glad these guys exist
I suspect this is why OpenAI is going more in the direction of optimising for price / latency / whatever with 4o-mini and whatnot. Presumably they found out long before the rest of us did that models can't really get all that much better than what we're approaching now, and once you're there the only thing you can compete on is how many parameters it takes and how cheaply you can serve that to users.
Meta just claimed the opposite in their Llama 3.1 paper. Look at the conclusion. They say that their experience indicates significant gains for the next iteration of models.
The current crop of benchmarks might not reflect these gains, by the way.
I sell widgets. I promise the incalculable power of widgets has yet to be unleashed on the world, but it is tremendous and awesome and we should all be very afraid of widgets taking over the world because I can't see how they won't.
Anyway here's the sales page. the widget subscription is so premium you won't even miss the subscription fee.
That is a strong (and fun) point, but this is peer-reviewable and has more open-collaboration elements than purely selling widgets.
We should still be skeptical, because companies often want to claim to be better than they are or to have unearned answers, but I don't think the motive to lie is quite as strong as a salesman's.
It's not peer-reviewable in any shape or form.
It is kind of "peer-reviewable" in the "Elon Musk vs Yann LeCun" form, but I doubt that the original commenter meant this.
This. It's really weird the way we suddenly live in a world where it's the norm to take whatever a tech company says about future products at face value. This is the same world where Tesla promised "zero intervention LA to NYC self driving" by the end of the year in 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, and 2024. The same world where we know for a fact that multiple GenAI demos by multiple companies were just completely faked.
It's weird. In the late 2010s it seems like people were wising up to the idea that you can't implicitly trust big tech companies, even if they have nap pods in the office and have their first day employees wear funny hats. Then ChatGPT lands and everyone is back to fully trusting these companies when they say they are mere months from turning the world upside down with their AI, which they say every month for the last 12-24 months.
In the 2000s we only had Microsoft, and none of us were confused as to whether to trust Bill Gates or not...
I'm not sure anyone is asking you to take it at face value or implicitly trust them? There's a 92-page paper with details: https://ai.meta.com/research/publications/the-llama-3-herd-o...
Except: Meta doesn't sell AI at all. Zuck is just doing this for two reasons:
- flex
- deal a blow to Altman
Meta uses AI in all the recommendation algorithms. They absolutely hope to turn their chat assistants into a product on WhatsApp too, and GenAI is crucial to creating the metaverse. This isn't just a charity case.
That would make sense if it were from OpenAI, but Meta doesn't actually sell these widgets. They release the widget machines for free in the hopes that other people will build a widget ecosystem around them to rival the closed widget ecosystem that threatens to lock them out of a potential "next platform" powered by widgets.
Wouldn't the equivalent for Meta actually be something like:
Given that Meta isn't actually selling their models?
Your response might make sense if it were to something OpenAI or Anthropic said, but as is I can't say I follow the analogy.
Meta doesn't sell widgets in this scenario - they give them away for free. Their competition sells widgets, so Meta would be perfectly happy if the widget market totally collapsed.
If OpenAI was saying this you'd have a point but I wouldn't call Facebook a widget seller in this case when they're giving their widgets away for free.
LLMs are reaching saturation on even some of the latest benchmarks and yet I am still a little disappointed by how they perform in practice.
They are by no means bad, but I am now mostly interested in long context competency. We need benchmarks that force the LLM to complete multiple tasks simultaneously in one super long session.
I don't know anything about AI, but there's one thing I want it to do for me: program a long-term full-body exercise routine based on the parameters I give it, such as available equipment, past workout context, and goals. I haven't had good success with ChatGPT, but I assume what you're talking about is relevant to my goals.
Aren't there apps that already do this like Fitbod?
Fitbod might do the trick. Thanks! The availability of equipment was a difficult thing for me to incorporate into a fitness program.
They also said in the paper that 405B was only trained to "compute-optimal", unlike the smaller models, which were trained well past that point, indicating the larger model still had some runway; had they continued, it would have kept getting stronger.
Makes sense right? Otherwise why make a model so large that nobody can conceivably run it if not to optimize for performance on a limited dataset/compute? It was always a distillation source model, not a production one.
Or maybe they just want to avoid getting sued by shareholders for dumping so much money into unproven technology that ended up being the same or worse than the competitor
Yeah, but what does that actually mean? That if they had simply doubled the parameters on Llama 405b it would score way better on benchmarks and become the new state-of-the-art by a long mile?
I mean, going by their own model evals on various benchmarks (https://llama.meta.com/), Llama 405b scores anywhere from a few points to almost 10 points more than Llama 70b even though the former has ~5.5x more params. As far as scale is concerned, the relationship isn't even linear.
Which in most cases makes sense: you obviously can't get 200% on these benchmarks, so if the smaller model is already at ~95% or whatever then there isn't much room for improvement. There is, however, the GPQA benchmark. Whereas Llama 70b scores ~47%, Llama 405b only scores ~51%. That's not a huge improvement despite the significant difference in size.
Most likely, we're going to see improvements in small model performance by way of better data. Otherwise, though, I fail to see how we're supposed to get significantly better model performance by way of scale when the relationship between model size and benchmark scores is nowhere near linear. I really wish someone who's on team "scale is all you need" could help me see what I'm missing.
And of course we might find some breakthrough that enables actual reasoning in models or whatever, but I find that purely speculative at this point, anything but inevitable.
The problem with this strategy is that it's really tough to compete with open models in this space over the long run.
If you look at OpenAI's homepage right now they're trying to promote "ChatGPT on your desktop", so it's clear even they realize that most people are looking for a local product. But once again this is a problem for them because open models run locally are always going to offer more in terms of privacy and features.
In order for proprietary models served through an API to compete long term they need to offer significant performance improvements over open/local offerings, but that gap has been perpetually shrinking.
On an M3 MacBook Pro you can easily run open models that perform close enough to OpenAI's that I can use them as my primary LLM, effectively for free, with complete privacy and lots of room for improvement if I want to dive into the details. Ollama today is pretty much easier to install than just logging into ChatGPT, and the performance feels a bit more responsive for most tasks. If I'm doing a serious LLM project I most certainly won't use proprietary models, because the control I have over the model is too limited.
At this point I have completely stopped using proprietary LLMs despite working with LLMs everyday. Honestly can't understand any serious software engineer who wouldn't use open models (again the control and tooling provided is just so much better), and for less technical users it's getting easier and easier to just run open models locally.
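For anyone curious what "running open models locally" looks like in practice, here's a minimal sketch against Ollama's local HTTP API (assumes Ollama is running on its default port and a Llama 3.1 model has been pulled with `ollama pull llama3.1`):

    import json
    import urllib.request

    payload = {
        "model": "llama3.1",
        "prompt": "In one sentence, why does local inference help with privacy?",
        "stream": False,
    }
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])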
I think their desktop app still runs the actual LLM queries remotely.
This. It's a mac port of the iOS app. Using the API.
In the long run, maybe, but it's probably going to take 5 years or more before laptops such as a MacBook M3 with 64 GB of RAM are mainstream. It's also going to take a while before models with 70B params are bundled with Windows and macOS in a system update, and even more time before you have such models on your smartphone.
OpenAI made a good move in making GPT-4o mini so dirt cheap that it's faster and cheaper to run than Llama 3.1 70B. Most consumers will interact with LLMs via apps using an LLM API, a web panel on desktop, or a native mobile app, for the same reason most people use Gmail etc. instead of a native email client. Setting up IMAP, POP, etc. is out of reach for most people, much like installing Ollama + Docker + OpenWebUI.
App developers are not going to bet on local-only LLMs as long as they are not mainstream and preinstalled on 50%+ of devices.
Totally. I wrote about this when they announced their dev-day stuff.
In my opinion, they've found that intelligence with the current architecture is actually an S-curve and not an exponential, so they're trying to make progress in other directions: UX and EQ.
https://nicholascharriere.com/blog/thoughts-openai-spring-re...
Indeed. I pointed out in https://buttondown.email/ainews/archive/ainews-llama-31-the-... that the frontier model curve is currently going down 1 OoM every 4 months, meaning every model release has a very short half-life[0]. However, this progress is still worth it if we can deploy it to improve millions and eventually billions of people's lives. A commenter pointed out that the amount spent on Llama 3.1 was only like 60% of the cost of Ant-Man and the Wasp: Quantumania, in which case I'd advocate for killing all Marvel slop and dumping all that budget on LLM progress.
[0] not technically complete depreciation, since for example 4o mini is widely believed to be a distillation of 4o, so 4o's investment still carries over into 4o mini
Agreed on everything, but calling the Marvel movies slop… I think that word has gone too far.
Not all Marvel films are slop. But, as a fan who comes from a family of fans and someone who has watched almost all of them: let's be real. That particular film, and really most of them, contain copious amounts of what is absolutely slop.
I don't know if the utility is worse than an LLM that is SOTA for 2 months and that no one even bothers switching to, however; at least the Marvel slop is being used for entertainment by someone. I think the market is definitely prioritizing the LLM researcher over Disney's latest slop sequel though, so whoever made that comparison can rest easy, because we'll find out.
I thought that was the allure, something that's camp funny and an easy watch.
I have only watched a few of them so I am not fully familiar?
The marvel movies are the genesis for this use of the word slop.
Can you back that claim up with a link or similar?
It’s junk food. No one is disputing how tasty it is though (including the recent garbage).
Not only are Marvel movies slop, they are very concentrated slop. The only way to increase the concentration of slop in a Marvel movie would be to ask ChatGPT to write the next one.
Has there been any indication that we're improving the lives of millions of people?
Yes, just like the internet: power users have found use cases. It'll take education/habit for general users.
Ah yes. We're in the crypto stages of "it's like the internet".
Just me coding 30% faster is worth it
I haven't found a single coding problem where any of these coding assistants were anything but annoying.
If I need to babysit a junior developer fresh out of school and review every single line of code it spits out, I can find them elsewhere.
All that Marvel slop was created by the first real LLM: <https://marvelcinematicuniverse.fandom.com/wiki/K.E.V.I.N.>
The thing I don't understand is why everyone is throwing money at LLMs for language, when there are much simpler use cases which are more useful?
For example, has anyone ever attempted an image -> HTML/CSS model? Seems like it'd be great if I could draw something on a piece of paper and have it generate a website view for me.
That's a thought I had. For example, could a model be trained to take a description, and create a Blender (or whatever other software) model from it? I have no idea how LLMs really work under the hood, so please tell me if this is nonsense.
I'm waiting for exactly this; GPT-4 currently trips up a lot with Blender (nonsensical order of operations, etc.).
Have you tried uploading the image to an LLM with vision capabilities, like GPT-4o or Claude 3.5 Sonnet?
I tried, and Sonnet 3.5 can copy most common UIs.
Perhaps if we think of LLMs as search engines (Google, Bing etc) then there's more money to be made by being the top generic search engine than the top specialized one (code search, papers search etc)
I was under the impression that you could more or less do something like that with the existing LLMs?
(May work poorly of course, and the sample I think I saw a year ago may well be cherry picked)
They did that in the GPT-4 demo 1.5 years ago. https://www.youtube.com/watch?v=GylMu1wF9hw
There are already companies selling services where they generate entire frontend applications from vague natural language inputs.
https://vercel.com/blog/announcing-v0-generative-ui
All of the multi-modal LLMs are reasonably good at this.
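A rough sketch of the sketch-to-HTML idea using an OpenAI-style vision endpoint (the model name and sketch.png are placeholders; other providers use a slightly different message format):

    import base64
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set

    with open("sketch.png", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Turn this hand-drawn sketch into a single HTML file "
                         "with inline CSS."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    print(response.choices[0].message.content)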
I had a discussion with a friend about doing this, but for CNC code. The answer was that a model trained on a narrow data set underperforms one trained on a large data set and then fine tuned with the narrow one.
Yes. This is exactly why I'm skeptical of AI doomerism/saviorism.
Too many people have been looking at the pace of LLM development over the last two (2) years, modeled it as an exponential growth function, and come to the conclusion that AGI is inevitable in the next ${1-5} years and we're headed for ${(dys|u)topia}.
But all that assumes that we can extrapolate a pattern of long-term exponential growth from less than two years of data. It's simply not possible to project in that way, and we're already seeing that OpenAI has pivoted from improving on GPT-4's benchmarks to reducing cost, while competitors (including free ones) catch up.
All the evidence suggests the rate of growth in the capabilities of SOTA LLMs has been slowing for at least the past year, which means predictions based on exponential growth all need to be reevaluated.
Indeed. All exponential growth curves are sigmoids in disguise.
except when it isn't and we ded :P
I don't think Special Relativity would allow that.
This is something that is definitionally true in a finite universe, but doesn't carry a lot of useful predictive value in practice unless you can identify when the flattening will occur.
If you have a machine that converts mass into energy and then uses that energy to increase the rate at which it operates, you could rightfully say that it will level off well before consuming all of the mass in the universe. You just can't say that next week after it has consumed all of the mass of Earth.
I don't think we are approaching limits, if you take off the English-centric glasses. You can ask LLMs pretty basic questions about Polish language or literature and they're going to either bullshit you or say they don't know the answer.
Example:
The correct answer is that "ekspres" is a zipper in the Łódź dialect.
That's just same same but different, not a step change towards significant cognitive ability.
What this means is just that Polish support (and probably most other languages besides English) in the models is behind SOTA. We can gradually get those languages closer to SOTA, but that doesn't bring us closer to AGI.
Tbf, you can ask it basic questions in English and it will also bullshit you.
I'm also wondering about the extent to which we are simply burning venture capital versus actually charging subscription prices that are sustainable long-term. Its easy to sell dollars for $0.75 but you can only do that for so long.
I think GPT5 will be the signal of whether or not we have hit a plateau. The space is still rapidly developing, and while large model gains are getting harder to pick apart, there have been enormous gains in the capabilities of light weight models.
I think GPT5 will tell if OpenAI hit a plateau.
Sam Altman has been quoted as claiming "GPT-3 had the intelligence of a toddler, GPT-4 was more similar to a smart high-schooler, and that the next generation will look to have PhD-level intelligence (in certain tasks)"
Notice the high degree of upselling based on vague claims of performance, and the fact that the jump from high schooler to PhD can very well be far less impressive than the jump from toddler to high schooler. In addition, notice the use of weasel words to frame expectations regarding "the next generation" and limit these gains to corner cases.
There's some degree of salesmanship in the way these models are presented, but even between the hyperboles you don't see claims of transformative changes.
PhD level-of-task-execution sounds like the LLM will debate whether the task is ethical instead of actually doing it
I wish I could frame this comment
lol! Producing academic papers for future training runs then.
Buddy, every few weeks one of these bozos is telling us their product is literally going to eclipse humanity and we should all start fearing the inevitable great collapse.
It's like how no one owns a car anymore because of ai driving and I don't have to tell you about the great bank disaster of 2019, when we all had to accept that fiat currency is over.
You've got to be a particular kind of unfortunate to believe it when sam altman says literally anything.
Basically every single word out of Mr Worldcoin's mouth is a scam of some sort.
I’m waiting for the same signal. There are essentially 2 vastly different states of the world depending on whether GPT-5 is an incremental change vs a step change compared to GPT-4.
Which is why they'll keep calling the next few models GPT4.X
And even if there is another breakthrough, all of these companies will implement it more or less simultaneously, and they will remain in a dead heat.
Presuming the breakthrough is openly shared. It remains surprising how transparent many of these companies are about new approaches that push the SotA forward, and I suspect we're going to see a change: companies won't reveal the secret sauce so readily.
E.g., almost the entire market relies upon the "Attention Is All You Need" paper detailing transformers, and it would be an entirely different market if Google had held that as a trade secret.
Given how absolutely pitiful the proprietary advancements in AI have been, I would posit we have little to worry about.
OTOH the companies who are sharing their breakthroughs openly aren't yet making any money, so something has to give. Their research is currently being bankrolled by investors who assume there will be returns eventually, and eventually can only be kicked down the road for so long.
Well, that's because the potential reward from picking the right horse is MASSIVE and the cost of potentially missing out is lifelong regret. Investors are driven by FOMO more than anything else. They know most of these will be duds but one of these duds could turn out to be life changing. So they will keep bankrolling as long as they have the money.
"Eventually" can be (and has been) bankrolled by Nvidia. They did a lot of ground-floor research on GANs and training optimization, which only made sense to release as public research. Similarly, Meta and Google are both well incentivized to share their research through PyTorch and TensorFlow respectively.
I really am not expecting Apple or Microsoft to discover AGI and ferret it away for profitability purposes. Strictly speaking, I don't think superhuman intelligence even exists in the domain of text generation.
Sort of yes, sort of no.
Of course, I agree that Stability AI made Stable Diffusion freely available and is worth orders of magnitude less than OpenAI, to the point that they're struggling to keep the lights on.
But it doesn't necessarily make that much difference whether you openly share the inner technical details. When you've got a motivated and well financed competitor, merely demonstrating a given feature is possible, showing the output and performance and price, might be enough.
If OpenAI adds a feature, who's to say Google and Facebook can't match it even though they can't access the code?
I would guess that in that timeline, Google would never have been able to learn about the incredible capabilities of transformer models outside of translation, at least not until much later.
I think you're just seeing the "make it work" stage of the combo "first make it work, then make it fast".
Time to market is critical, as you yourself attest by framing the situation as "on par with GPT-4o and Claude Opus". You're seeing huge investments because being the first to get a working model stands to pay off greatly. You can only assess models that exist, and for that you need to train them at huge computational cost.
ChatGPT is like Google now. It is the default. Even if Claude becomes as good as ChatGPT or even slightly better it won't make me switch. It has to be like a lot better. Way better.
It feels like ChatGPT won the time to market war already.
If ChatGPT fails to do a task you want, your instinct isn't "I'll run the prompt through Claude and see if it works" but "oh well, who needs LLMs?"
Please don't assume your experience applies to everyone. If ChatGPT can't do what I want, my first reaction is to ask Claude for the same thing, often to find that Claude performs much better. I've already cancelled ChatGPT Plus for exactly that reason.
But plenty of people switched to Claude, especially with Sonnet 3.5. Many of them are in this very thread.
You may be right about the average person on the street, but I wonder how many have lost interest in using LLMs and cancelled their GPT Plus subscription.
-1: I know many people who are switching to Claude. And Google makes it near-zero friction to adopt Gemini with Gsuite. And more still are using the top-N of them.
This is similar to the early days of the search engine wars, the browser wars, and other categories where a user can easily adopt, switch between and use multiple. It's not like the cellphone OS/hardware war, PC war and database war where (most) users can only adopt one platform at a time and/or there's a heavy platform investment.
Eh, with the degradation of coding performance in ChatGPT I made the switch. Seems much better to work with on problems, and I have to do way less hand holding to get good results.
I'll switch again as soon as something better is out.
The next iteration depends on NVIDIA & co; what we need is sparse libraries. Most of the weights in LLMs are 0, and once we deal with those more efficiently we will get to the next iteration.
that's interesting. Do you have a rough percentage of this?
Does this mean these connections have no influence at all on output?
My uneducated guess is that with many layers you can implement something akin to a graph in the brain by nulling lots of previous layer outputs. I actually suspect that current models aren't optimal with layers all of the same size, but I know shit.
This is quite intuitive. We know that a biological neural net is a graph data structure, while ML systems on GPUs are more like layers of bitmaps in Photoshop (it is a graphics processor, after all). So if most of the layers are akin to transparent pixels, stacked up just to build a graph, that's hyper memory-inefficient.
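To make the sparsity point concrete, here's a minimal sketch of why sparse storage matters, assuming a weight matrix where most entries really are zero (scipy's CSR format is used purely as an illustration; real sparse LLM kernels on GPUs are far more involved):

    import numpy as np
    from scipy import sparse

    # Illustrative only: a weight matrix where 95% of entries are exactly zero.
    rng = np.random.default_rng(0)
    dense = rng.standard_normal((4096, 4096)).astype(np.float32)
    dense[rng.random(dense.shape) < 0.95] = 0.0

    # CSR stores only the non-zero values plus their column indices and row offsets.
    csr = sparse.csr_matrix(dense)

    dense_bytes = dense.nbytes
    sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
    print(f"dense: {dense_bytes/1e6:.1f} MB, sparse (CSR): {sparse_bytes/1e6:.1f} MB")

    # Matrix-vector products also skip the zeros entirely.
    x = rng.standard_normal(4096).astype(np.float32)
    y = csr @ x

The point is only that storage and matrix-vector work scale with the number of non-zeros, which is what dedicated sparse kernels would need to exploit.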
And with the increasing parameter size, the main winner will be Nvidia.
Frankly I just don't understand the economics of training a foundation model. I'd rather own an airline. At least I can get a few years out of the capital investment of a plane.
But billionaires already have that, they want a chance of getting their own god.
Benchmark scores aren't a good measure because they were designed for previous generations of LLMs. That 2.23% uptick can actually represent a world of difference in subjective tests and can definitely be worth the investment.
Progress is not slowing down but it gets harder to quantify.
I think it’s impressive that they’re doing it on a single (large) node. Costs matter. Efficiency improvements like this will probably increase capabilities eventually.
I’m also optimistic about building better (rather than bigger) datasets to train on.
This is already what the Chinchilla paper surmised; it's no wonder that their prediction is now coming to fruition. It's like an accelerated version of Moore's Law, because software development itself moves faster than hardware development.
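For a rough sense of what the Chinchilla result implies in practice, here's a back-of-the-envelope sketch using the commonly cited heuristics of roughly 20 training tokens per parameter and about 6*N*D training FLOPs (ballpark numbers, not the paper's exact fits; the parameter counts are just examples):

    # Rough illustration of the Chinchilla-style compute-optimal rule of thumb.
    def chinchilla_optimal(params: float, tokens_per_param: float = 20.0):
        """Return (training tokens, approx. training FLOPs) for a given parameter count."""
        tokens = tokens_per_param * params
        flops = 6 * params * tokens  # common ~6*N*D estimate for transformer training
        return tokens, flops

    for n_params in (8e9, 70e9, 405e9):  # example sizes, roughly in the Llama 3.1 range
        tokens, flops = chinchilla_optimal(n_params)
        print(f"{n_params/1e9:.0f}B params -> ~{tokens/1e12:.1f}T tokens, ~{flops:.2e} FLOPs")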
What else can be done?
If you are sitting on $1 billion of GPU capex, what's $50 million in energy/training cost for another incremental run that may beat the leaderboard?
Over the last few years the market has placed its bets that this stuff will make gobs of money somehow. We're all not sure how. They're probably thinking that whoever gets even a few % ahead is likely to sweep up most of this hypothetical value. What's another few million, especially if you already have the GPUs?
I think you're right: we are towards the right end of the sigmoid, and with no "killer app" in sight. It is great for all of us that they have created all this value, because I don't think anyone will be able to capture it. They certainly haven't yet.
There are different directions in which AI still has lots of room to improve: multimodal, which branches into robotics, and single-modal, like image, video, and sound generation and understanding. I'd also check back when OpenAI releases GPT-5.
For some time, we have been at a plateau because everyone has caught up, which essentially means that everyone now has good training datasets and uses similar tweaks to the architecture. It seems that, besides new modalities, transformers might be a dead end as an architecture. Better scores on benchmarks result from better training data and fine-tuning. The so-called 'agents' and 'function calling' also boil down to training data and fine-tuning.
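To illustrate the "function calling boils down to training data and fine-tuning" point: at inference time the model only emits structured text, and the caller does the rest. A minimal sketch, with a made-up tool name, schema, and model output (real APIs differ in the details):

    import json

    # Hypothetical example: the model is fine-tuned to emit JSON instead of prose;
    # the tool registry and dispatch live entirely on the caller's side.
    TOOLS = {
        "get_weather": lambda city: f"Sunny in {city}, 24°C",  # stand-in implementation
    }

    # What a tool-use fine-tuned model might emit for "What's the weather in Paris?":
    model_output = '{"tool": "get_weather", "arguments": {"city": "Paris"}}'

    call = json.loads(model_output)                    # parse the structured output
    result = TOOLS[call["tool"]](**call["arguments"])  # dispatch to the named tool
    print(result)  # the caller would feed this back into the next model turn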
For this model, it seems like the point is that it uses far fewer parameters than at least the large Llama model while delivering near-identical performance. Given how large these models are getting, that's an important thing to do before pushing performance up again.
We always needed a "tock" to see real advancement, like with the last model generation. The "tick" we got with the H100 was enough to bring these models to market, but that's it.