
Grok

extheat
49 replies
22h41m

At 8x86B, looks like the largest open model yet by far. Would be interesting to hear how many tokens it's been trained on. Especially important for higher param models in order to efficiently utilize all those parameters.

swalsh
40 replies
21h36m

Considering how poor it is compared to other models, it really emphasises how important fine tuning is. Models with MUCH smaller parameter counts are outperforming it in many metrics.

lukan
35 replies
21h32m

"it really emphasises how important fine tuning is"

Or rather the quality of the training data?

llm_trw
19 replies
20h20m

We don't know since no one is releasing their data.

Calling these models open source is like calling a binary open source because you can download it.

Which in this day and age isn't far from where we're at.

DreamGen
9 replies
20h6m

A big distinction is that you can build on top of (fine-tune) these released models just as well as if they had released the pre-training data.

llm_trw
7 replies
19h56m

You can also build on top of binaries if you use gotos and machine code.

shwaj
5 replies
11h53m

This seems intentionally obtuse. What you say is true, but it is very obvious that this is much more of a pain than if you had the source code. On the other hand, fine tuning is just as easy, regardless of whether you have the original training data.

samus
2 replies
11h4m

One could also disassemble an executable and build on top of it. Not for the faint of heart and probably illegal, but possible unless it was deliberately obfuscated. Compared to that, it is impossible with state-of-the-art methods to systematically extract the training data from an LLM. Fragments, yes, but not all of it.

visarga
1 replies
9h37m

You can do better - generate synthetic data covering all topics. And to make it less prone to hallucination, use RAG or web search for reference material. The Phi-1.5 model was trained on 300B synthetic tokens generated with ChatGPT and it showed a 5x bump in efficiency, punching well above its weight.

Synthetic data can be more diverse if you sample carefully with seeded concepts, and it can be more complex than average web text. You can even diff against a garden variety Mistral or LLaMA and only collect knowledge and skills they don't already have. I call this approach "Machine Study", where AI makes its own training data by studying its corpus and learning from other models.

alineugenc
0 replies
3h29m

Are you busy!? I need slaves for a project, I’ll make you slave master / aka Chief Technology Officer. If you can’t take a joke don’t bother answering lol. I’m looking for co-founder

llm_trw
1 replies
6h0m

If you don't know the original training data's statistical distribution, then catastrophic forgetting is guaranteed with any extra training.
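A minimal sketch of the standard mitigation (replay/rehearsal), assuming you can obtain some corpus that approximates the original distribution; the function, names, and ratio below are illustrative, not a recipe from the thread:

  import random

  def build_finetune_batches(new_examples, replay_examples,
                             replay_ratio=0.3, batch_size=32):
      # Mix task-specific data with "replay" samples drawn from a corpus
      # approximating the original pre-training distribution, so that
      # further training does not drift entirely away from it.
      mixed = list(new_examples)
      n_replay = int(len(new_examples) * replay_ratio)
      mixed += random.choices(replay_examples, k=n_replay)
      random.shuffle(mixed)
      return [mixed[i:i + batch_size] for i in range(0, len(mixed), batch_size)]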

shwaj
0 replies
4h23m

I was going to ask for a reference to support this (although please provide one if handy), but the search term “catastrophic forgetting” is a great entry into the literature. Thanks.

adrianN
0 replies
11h19m

Or shell scripts

tarruda
0 replies
17h31m

You can fine tune without the pre training data too.

Mistral models are one example: they never released pre-training data and there are many fine-tunes.

drexlspivey
3 replies
19h28m

Their data is the twitter corpus which is public. Or do you want a dump of their database for free too?

llm_trw
1 replies
19h18m

Saying "It's just the twitter public corpus." is like saying "Here's the Linux Kernel, makefiles not included."

zx8080
0 replies
16h24m

Or even "here's the Linux Kernel makefiles, no sources included, enjoy".

minimaxir
0 replies
18h52m

Twitter tweet data in itself is both highly idiosyncratic and short by design, which alone is not conducive to training an LLM.

boulos
1 replies
12h56m

How about "weights available" as similar to the "source available" moniker?

fragmede
0 replies
12h37m

weights available or model available, but yes.

swalsh
0 replies
19h27m

We should just call it open weight models at this point.

cl3misch
0 replies
11h28m

FWIW the Grok repo uses the term "open weights".

cainxinth
0 replies
4h51m

We don't know since no one is releasing their data.

Is anyone else just assuming at this point that virtually everyone is using the pirated materials in The Pile like Books3?

fragmede
13 replies
20h43m

That's a subtle dig at the fact that they have all of Twitter as a training corpus to use, but we don't know how they weight tweets. Which, we know, they're not going to weight evenly.

rezonant
12 replies
20h30m

I'm sure just like in X's algorithms, @elon tweets are weighted heavily.

convery
10 replies
20h23m

The X algorithm is also open source, so you can verify before commenting.

fragmede
8 replies
20h8m

Just because they open-sourced it doesn't mean that's actually what they're running, though.

lukan
5 replies
19h47m

No idea about the current state, but the open-sourcing did show they were favoring Elon:

https://mashable.com/article/twitter-releases-algorithm-show...

And personally I never used Twitter much, but I certainly did not follow Elon Musk when I did - yet I had to see lots of his posts in my feed. Surely just coincidence.

machdiamonds
1 replies
19h22m

It's not too hard to believe it is a coincidence when the most followed person on a platform shows up in your feed, especially if you follow tech accounts.

internetter
0 replies
19h15m

Did you not read the article linked in the comment you're replying to?

maccaw
1 replies
17h57m

they were favoring elon

No, and that's not what the article says either. They were just tracking how well his tweets were doing versus others. They were not favoring Elon.

lukan
0 replies
12h13m

"They were just tracking how well his tweets were doing versus others. "

Yeah, and adjusting it so he comes out best. That was Musk's demand after a Biden tweet performed better than his, as the other article linked inside shows:

https://mashable.com/article/elon-musk-super-bowl-joe-biden-...

They officially boost people who pay a little bit. Elon paid a lot.

And the source is clearly not the production source and never was in this shape - otherwise why sue someone who open-sourced it?

"But, the release of this source code also comes days after Twitter forced Github to take down other parts of Twitter's source code that was allegedly posted by a former employee without the company's permission. So, clearly, there's still plenty of Twitter that Musk still doesn't want us to see."

Also, you probably missed that:

"Zoë Schiffer of Platformer reported that Twitter actually removed part of the source code that affected the reach of Musk's and other user's tweets before releasing the algorithm to the public."

Which is consistent with quite a few other statements, including from Twitter itself, and with the fact that the source has not been updated in 8 months.

See also this HN comment and discussion about it:

https://news.ycombinator.com/item?id=35391854

"But the underlying policies and models are almost entirely missing (there are a couple valuable components in [1]). Without those, we can't evaluate the behavior and possible effects of "the algorithm.""

jokethrowaway
0 replies
17h56m

Sounds a bit far-fetched.

So changes in power users stats would also result in audience balancing?

Most likely the code was used for analytics and for tracking balance; Elon was a pain in the ass and asked to have custom analytics for his account and devs eventually added him as an audience to be able to get analytics about him easily. A bit dirty but it works.

Most likely the balancing code is somewhere else and it affects only Republicans/Democrats.

chrisco255
1 replies
19h48m

It's not like he needs boosting, he was one of Twitter's top followed accounts long before he bought them. He's pretty good at getting attention.

threeseed
0 replies
16h45m

X algorithm Github project hasn't been updated in 8 months:

https://github.com/twitter/the-algorithm

So clearly they aren't running it in production.

Also they didn't open source the list of people who are being artificially boosted e.g. Elon.

nonethewiser
0 replies
18h15m

I'm sure just like in X's algorithms, @elon tweets are weighted heavily.

Are you sure or is it the literal opposite and you’re just speculating?

GaggiX
0 replies
20h7m

Or even how much it was trained on this dataset, the amount of FLOPs.

make3
0 replies
13h40m

no it emphasizes the importance of training smaller models for longer, like the Mistral "overtrained" models

lairv
0 replies
20h9m

I would say it emphasises that training a good model is more than throwing random data and compute at the problem

gordian-mind
0 replies
3h58m

Current metrics are a poor way to measure the usefulness of LLMs.

gdiamos
0 replies
9h15m

Show the proof? Does it include IFT?

p1esk
6 replies
22h14m

It’s not 8x86B. Total number of parameters is 314B.

Perhaps it’s 8x39B to fit on a single 8xA100 (40GB) server?

dheera
3 replies
18h58m

They all do this marketing bull.

Mixtral has an 8x7B model but it's actually 46.7B, not 56B params.

Kinda similar to how 4K displays are 3840 pixels wide, not true 4K which would be 4096. Marketing people called it 4K, not engineers.
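A hedged back-of-envelope of why 8x7B works out to roughly 46.7B rather than 56B: in Mixtral only the feed-forward blocks are duplicated per expert, while attention, embeddings, and the router are shared. The shared-parameter figure below is an approximation, not an exact derivation:

  # Mixtral 8x7B published config: hidden=4096, 32 layers,
  # FFN intermediate=14336, 8 experts, top-2 routing.
  hidden, layers, ffn = 4096, 32, 14336
  n_experts, active_experts = 8, 2

  # Each expert is a SwiGLU FFN: three weight matrices per layer.
  per_expert = layers * 3 * hidden * ffn         # ~5.6B parameters
  shared = 1.7e9                                 # attention/embeddings/router, approximate

  total = n_experts * per_expert + shared        # ~47B, not 8 * 7B = 56B
  active = active_experts * per_expert + shared  # ~13B "active" per token
  print(f"total ≈ {total/1e9:.1f}B, active ≈ {active/1e9:.1f}B")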

guitarlimeo
2 replies
10h51m

I've always thought of 4K as "4x FullHD". In that way it makes sense.

mavhc
0 replies
9h29m

TV and Digital Cinema have different standards, because of course they do

dheera
0 replies
28m

Bleh no, K means thousand.

For a long time we specified displays by their vertical dimension -- 480p, 720p, 1080p.

Then the marketing guys came along and decided that the horizontal dimension sounds bigger. If we stuck with the less-bullshitty way of doing things and kept comparisons 1:1, we'd call 3840x2160 displays 2160p or "2K" displays, but instead, the marketing people decided that we're going to change things to horizontal and called 3840x2160 "4K".

moffkalast
0 replies
21h59m

Most likely it's a MoE of Grok-0 which would be 8x33B + 50B for the router.

cma
0 replies
21h58m

Active parameters is 86B, so wouldn't that be the size of the largest two experts (where they may all be the same) + the weights of the selector?

pogue
37 replies
22h30m

Can someone explain why the weights are posted via a Bittorrent magnet link? I have no way to check the size at the moment, but isn't that a bit unusual? There's also only 21 seeders right now according to https://checker.openwebtorrent.com/

CamperBob2
18 replies
22h24m

How else could/should it be done?

pogue
17 replies
22h19m

I would have assumed they could just upload it to GitHub. If it has restrictions on file size, I'm sure they could make multi-part compressed files.

Torrents can unfortunately die after a period of time if no one continues seeding them or if they don't use a permanent web-based seeder, which doesn't appear to be the case here.

simonw
4 replies
22h8m

GitHub have a soft repository size limit of 5GB, documented here: https://docs.github.com/en/repositories/working-with-files/m...

Soft size limit means "If your repository excessively impacts our infrastructure, you might receive an email from GitHub Support asking you to take corrective action." - I know people who have received such emails.

Most model releases happen through Hugging Face which does not have such a size limit.

rezonant
1 replies
20h12m

I'd bet Hugging Face would be happy to have hosted these canonically too, so not sure why that doesn't happen more.

zepton
0 replies
3h48m

It would be super expensive to use LFS to distribute this:

Each pack costs $5 per month, and provides 50 GiB of bandwidth and 50 GiB for storage

So they would need to pay for 6 data packs (or $30) for every 300 GB download.

(https://docs.github.com/en/billing/managing-billing-for-git-...)

sashank_1509
3 replies
22h1m

No, git would be impossible. I've never seen a repo even a few GB in size; if you are uploading non-code files you really should not be using git. Git is version management software for code. I often see repos with images and even videos checked in - please don't, there are so many far better and more performant solutions out there.

The other approach would be to use AWS S3 or other cloud providers, which would cost them money every time someone downloads it - not their responsibility to pay for when they are releasing something for free. Torrents seem like the only good solution, unless someone hosts this on the cloud for free for everyone.

sroussey
0 replies
20h48m

Huggingface will disagree with "impossible", as their models are available via git, sometimes broken up into .pth files.

Still, as far as sentiment goes, yeah git for model weights is an impedance mismatch for sure!

rezonant
0 replies
20h13m

No git would be impossible. I’ve never seen a repo even a few GB in size, if you are uploading non code files you really should not be using git

It's not actually a limitation in git itself, especially if you use Git LFS. People use Git for Unreal projects and big ones can be half a terabyte or more in size.

rezonant
3 replies
20h16m

Others have pointed out that GitHub doesn't allow that, but

Torrents can unfortunately die after a period of time if no one continues seeding it or if they don't use a permanent web based seeder, which doesn't appear to be the case.

So too can web links, especially when they are 300 GB and egressing out of AWS at $0.09/GB or worse (in non-US regions). Each full download would cost $27 at that rate; 10,000 downloads would cost $270,000.

Sure, you could go for something with a better cost model like R2, but you can't beat using one or two unmetered connections on a VPN to constantly seed on BitTorrent: your pricing would be effectively free, and reliability would be higher than if you just exposed an HTTP server to the Internet like that.

KomoD
2 replies
18h45m

and egressing out of AWS at $0.09/GB

There's a lot of seeders on the torrent that are actually AWS ips too, all with similar configurations which makes me believe that it's probably xAI running them

on a VPN

That's unnecessary, you don't need a VPN?

rezonant
1 replies
17h32m

No you don't, but if you wanted to host it from your gigabit office IP, you probably would want to.

KomoD
0 replies
15h53m

Why?

xcv123
1 replies
22h10m

This is not some crappy DVD rip on The Pirate Bay. It will be seeded as long as it's relevant.

Twitter/X has their own massive infrastructure and bandwidth to seed this indefinitely.

KomoD
0 replies
21h11m

Yeah, they can just leave some server running somewhere and just let it seed forever

larrysalibra
0 replies
22h8m

The great thing about torrents is that you (or anyone else who cares) can single-handedly solve the problem you're complaining about by seeding the torrent.

cedws
0 replies
22h13m

GitHub may choose to throttle downloads or remove the files simply because they're taking up too much bandwidth.

A torrent is less likely to go down in the short term.

whywhywhywhy
2 replies
7h6m

Because BitTorrent is an outstanding tech for delivering large files. The more I think about it, the more I'm surprised it isn't taken advantage of more.

Marlinski
1 replies
3h38m

It's been criminalized to hell by IP holders and Hollywood. Such a shame they killed the best tech of the previous decade. It could have revolutionized how we distribute content, approach CDNs, and even streaming.

harkinian
0 replies
1h53m

In what way is the bittorrent protocol criminalized?

lambdaba
2 replies
22h26m

Why not? Mistral was first to do it; it has become tradition.

orlp
0 replies
20h17m

BitTorrent is just an objectively superior method of delivering a lot of data to a lot of people.

gillesjacobs
0 replies
22h23m

I believe it was Llama 1 that notoriously got leaked with a torrent on 4chan.

pooloo
1 replies
22h25m

It's likely over 100GB of data, so I wouldn't say it's necessarily unusual to spread the bandwidth across multiple hosts.

pogue
0 replies
22h17m

Thanks! I searched and searched for a tool that would show me info via the web about a magnet link, but nada.

bongodongobob
1 replies
22h18m

I'm not sure why you wouldn't tbh. That's a lot of bandwidth.

DonHopkins
0 replies
1h15m

I'm not sure why you would repeatedly lie and spread easily disproven, ignorant, and maliciously dangerous misinformation about the well-known and scientifically proven dangers of lead poisoning. What the hell is wrong with you? Explain yourself.

Are you going to spread some lies about how non-toxic and useful asbestos is around the house and for children's toys and clothing, too?

https://news.ycombinator.com/item?id=39746806

ur-whale
0 replies
9h58m

Can someone explain why the weights are posted via a Bittorrent magnet link?

I think the best way to get an answer to that question is to try to host it yourself and see what happens.

seydor
0 replies
10h47m

My optimistic explanation is that we are going back to the 2000s internet, but probably we are not.

raydev
0 replies
19h36m

Spreads the burden/cost of distributing a 300+GB file.

leumon
0 replies
20h45m

Mistral did it too when they released their first open model. They just posted a magnet link on Twitter.

jiripospisil
0 replies
21h33m

I don't understand why you're being downvoted for asking a legitimate question. People not familiar with model weights might be surprised that they are often in tens of gigabytes and in this case even more.

fzzzy
0 replies
20h49m

It may become a tradition since weights are so large. Perhaps it started when the Llama torrent link leaked. Then, Mistral decided to release their weights using bittorrent.

MallocVoidstar
0 replies
22h22m

Distributing 300GB via torrent is cheaper than direct download, assuming even a few other people seed.

tosh
34 replies
22h41m

blog post: https://x.ai/blog/grok-os

  * 314B parameters (86B active at a time)
  * mixture of experts 8 (2 active at a time)
  * weights and architecture licensed under Apache 2.0
(edit:) announcement blog post from last year with benchmarks compared to Claude 2, GPT-3.5 and GPT-4: https://x.ai/blog/grok

(edit2:) TL;DR: somewhat comparable to GPT-3.5, Mixtral and Qwen-1.5-72B in capability but way larger than the open weight models
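For scale, a quick back-of-envelope sketch of the weight storage alone at common precisions (activation memory and KV cache excluded):

  # Rough memory footprint of 314B parameters at common precisions.
  params = 314e9
  for name, bytes_per_param in [("fp16/bf16", 2), ("int8", 1), ("4-bit", 0.5)]:
      print(f"{name}: ~{params * bytes_per_param / 1e9:.0f} GB")
  # fp16/bf16: ~628 GB, int8: ~314 GB, 4-bit: ~157 GB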

OkGoDoIt
18 replies
22h19m

Is a model so huge that’s only at the level of GPT 3.5 actually good? That seems incredibly inefficient to me.

fwlr
10 replies
19h40m

OpenAI is valued at 90 billion and all they do is make GPT; Twitter is valued at 40 billion and this was essentially a vanity side-project by a cowboy CEO. Presuming that the benchmarks and the general "it's about the level of 3.5" sense are accurate, it's inefficient, but not incredibly inefficient imho.

pelorat
8 replies
18h27m

Twitter is valued at 40 billion

WAS valued at 44B.

Now?

Maybe 5 billion.

wongarsu
6 replies
17h7m

Last I heard they lost 15% of their users, so let's call it 36 billion.

wraptile
2 replies
15h40m

Twitter didn't have direct competitors other than Mastodon when it was bought at 44B. Now there's Threads, Bluesky, and a bigger Mastodon.

squigglydonut
0 replies
12h40m

None of these matter

jsight
0 replies
14h13m

Honestly, none of those look like meaningful competitors at the moment.

dilyevsky
0 replies
13h49m

They weren't even worth 44B when Elon took the keys - he specifically tried to back out of the deal because 44B was an insane peak-'21 asset-bubble price. In truth they were probably worth like 10-15B at that moment. And now that a bunch of advertisers have left due to you-know-who, it's probably about 10B.

Lewton
0 replies
8h11m

Twitter was valued around 30 billion when Musk tried getting out of buying it (then the market cap went up when it became clear that he would be forced to pay full price).

alvah
0 replies
15h37m

LOL @ $5 billion, but if that was the valuation, you'd be making the parent's point stronger.

thekhatribharat
0 replies
11h49m

xAI is a separate entity, not an X/Twitter subsidiary.

drak0n1c
4 replies
21h9m

It’s designed to be actively searching real-time posts on X. Apples and oranges.

hn_20591249
1 replies
18h53m

The data pipeline isn't included in this release, and we already know it is a pretty simple RAG pipeline using qdrant, https://twitter.com/qdrant_engine/status/1721097971830260030.

Nothing about using data in "real time" dictates that the model parameters need to be this large, and it is likely quite inefficient for their "non-woke" instructional use-case.
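For readers unfamiliar with the pattern, here's a minimal sketch of what a RAG retrieval step over Qdrant looks like. The collection name and the embed function are placeholders, and this is the generic pattern, not xAI's actual pipeline:

  from qdrant_client import QdrantClient

  def retrieve_context(query: str, embed, top_k: int = 20) -> str:
      # `embed` stands in for whatever embedding model the pipeline uses;
      # it is not part of qdrant itself.
      client = QdrantClient(url="http://localhost:6333")
      hits = client.search(
          collection_name="posts",      # hypothetical collection name
          query_vector=embed(query),
          limit=top_k,
      )
      # The retrieved text gets prepended to the prompt before calling
      # the LLM - the "retrieval-augmented" part of RAG.
      return "\n".join(hit.payload["text"] for hit in hits)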

lmeyerov
0 replies
2h33m

Agreed. We have been building our real-time GPT flows for news & social as part of Louie.AI, think monitoring & investigations... long-term, continuous training will become amazing, but for the next couple of years, most of our users would prefer GPT4 or Groq vs what's here, and much smarter RAG. More strongly, the interesting part is how the RAG is done. Qdrant is cool but just a DB with a simple vector index, so nothing in Grok's release is tech we find relevant to our engine.

Eg, there is a lot of noise in social data, and worse, misinfo/spam/etc, so we spend a lot of energy on adversarial data integration. Likewise, queries are often neurosymbolic, like on a date range or with inclusion/exclusion criteria. Pulling the top 20 most similar tweets to a query and running them through a slow, dumb, & manipulated LLM would be a bad experience. We have been pulling in ideas from agents, knowledge graphs, digital forensics & SNA, code synthesis, GNNs, etc. for our roadmap, which feels quite different from what is being shown here.

We do have pure LLM work, but more about fine-tuning smaller or smarter models, and we find that to be a tiny % of the part people care about. Ex: Spam classifications flowing into our RAG/KG pipelines or small model training is more important to us than it flowing into a big model training. Long-term, I do expect growing emphasis on the big models we use, but that is a more nuanced discussion.

(We have been piloting w gov types and are preparing for next cohorts, in case useful on real problems for anyone.)

pests
0 replies
16h36m

Isn't that... the same thing as search?

grey8
0 replies
20h9m

Why is that relevant to the size?

Post search on X is done as it is with any other data from any other source, you use RAG and function calling to insert the context.

< 7B open source models can function call very well. In fact, Nous Hermes 2 Pro (7B) is benchmarking better at that than GPT-3.5.

Not related to the size, if I'm not mistaken.

xcv123
0 replies
8h49m

According to their benchmarks it is superior to GPT-3.5

cma
0 replies
21h51m

Since it is MoE, quantized it could run on cheaper hardware with just consumer networking in between, instead of needing Epyc/Xeon levels of PCIe lanes, NVLink, or InfiniBand-type networking. Or it could even run with people pooling smaller systems over slow internet links.

tootie
11 replies
21h48m

How is it that OpenAI was touted like it was some massive years-long effort that blew all AI research out of the water and now we have so many competitors popping up one after another?

longdog
7 replies
21h29m

You don't need to be a cutting edge research scientist to train a SOTA LLM. You just need money for scaling. OpenAI's "secret" was just their willingness to spend tens/hundreds of millions without guaranteed returns, and RLHF/instruct fine tuning, both of which are out of the bag now.

simonw
6 replies
21h11m

Disagree. It took more than 12 months from the release of GPT-4 to someone else producing a model of equivalent quality, and that definitely wasn't due to a shortage of investment from the competition.

There's a huge amount of depth in training a really good LLM. Not helped by the fact that iteration is incredibly expensive - it might take several months (and millions of dollars) before you can tell if your new model is working well or if there was some mistake in the pipeline that led to a poor-quality result.

Almost all of the world-class LLMs outside of OpenAI/DeepMind have been trained by people who previously worked at those organizations - giving them invaluable experience such that they could avoid the most expensive mistakes while training their new models.

int_19h
2 replies
10h21m

There's still no model of equivalent quality to GPT-4.

johnthewise
0 replies
8h24m

Claude opus is better in my experience

bbig
0 replies
8h27m

Claude 3 Opus is reporting superior metrics, particularly in its coding ability, and in the LLM Arena it is statistically tied with GPT-4.

lossolo
0 replies
20h59m

Don’t overlook the training data (used for both training and instruction fine-tuning), it is one of the most crucial aspects, if not the most critical, given the significant differences observed in models with similar architectures.

echelon
0 replies
20h56m

That only remains an advantage if they can continue climbing the gradient from their lead position. If they hit a snag in scaling, methodology, or research, everyone else on the planet catches up, and then it's anyone's game again.

barrell
0 replies
13h0m

While I do agree there is some amount of secret sauce, keep in mind the training takes several months. So for someone to see the success of GPT-4, decide they want to invest that amount of money, raise it, find someone competent to supervise the training, train the model for several months, and then test and integrate it could easily take a year, even if there were no secret sauce.

jxy
0 replies
21h6m

OpenAI still seems to be at the top in terms of capability, with Anthropic perhaps close behind, comparing GPT-4 and Claude Opus.

This Grok-1 is a large model (~314B) that matches GPT-3.5, released 2 years ago, and is at about the same level as much smaller models like Mixtral (~47B) and Qwen-1.5 (~72B). Do you think it's competitive?

cavisne
0 replies
21h27m

LLM training is arcane and expensive to experiment with. So OpenAI had to waste a lot of time and GPU-hours on things that didn't work to learn the tricks that did work.

Most of the competitors have lineage straight back to OpenAI, eg the lead of x.ai was previously at OpenAI and Deepmind. Likewise with Mistral and especially Anthropic.

ben_w
0 replies
21h37m

Egg of Columbus.

Also, the general architecture is well documented, ChatGPT (specifically the chat interface, not GPT-3, not InstructGPT) is what made a lot of people care, and actually reproducing it requires someone wanting to in the first place.

TOMDM
1 replies
22h19m

Mixtral is also comparable to gpt 3.5 and open.

At 8x7B it's also a fraction of the size. Are there any benchmarks comparing Mixtral to Grok?

asciii
0 replies
21h48m

I love the citation for image in the article

The cover image was generated using Midjourney based on the following prompt proposed by Grok: A 3D illustration of a neural network, with transparent nodes and glowing connections, showcasing the varying weights as different thicknesses and colors of the connecting lines.
hubraumhugo
34 replies
22h38m

When will we reach an upper limit/dimishing returns in terms of number of parameters and mixture of experts?

andy99
33 replies
22h36m

We may have already - data is more important than anything else, which is why nobody has beaten GPT-4 yet. Throwing more parameters or more compute at the problem only gets you so far. But Grok was never a contender, so there is room to improve on it. It is one of the biggest models open-sourced, as mentioned, so it will be interesting to take a look at for sure.

lambdaba
16 replies
22h34m

Claude 3 has *decisively* beat GPT-4, I wonder how all their attributes compare.

stainablesteel
8 replies
22h10m

I like some of Claude's answers better, but it doesn't seem to be a better coder IMO.

simonw
6 replies
22h6m

I've found it to be significantly better for code than GPT-4 - I've had multiple examples where the GPT-4 solution contained bugs but the Claude 3 Opus solution was exactly what I wanted. One recent example: https://fedi.simonwillison.net/@simon/112057299607427949

How well models work varies wildly according to your personal prompting style though - it's possible I just have a prompting style which happens to work better with Claude 3.

bugglebeetle
4 replies
21h46m

What is your code prompting style for Claude? I’ve tried to repurpose some of my GPT-4 ones for Claude and have noticed some degradation. I use the “Act as a software developer/write a spec/implement step-by-step” CoT style.

simonw
2 replies
21h7m

Almost impossible to describe prompting style, but here are some examples of how I've used Claude 3:

https://gist.github.com/simonw/4cecde4a729f4da0b5059b50c8e01... - writing a Python function

https://gist.github.com/simonw/408fcf28e9fc6bb2233aae694f8cd... - most sophisticated example, building a JavaScript command palette

https://gist.github.com/simonw/2002e2b56a97053bd9302a34e0b83... - asking it to refactor some existing code

I don't use the "Act as a X" format any more, I'm not at all convinced it has a noticeable impact on quality. I think it's yet another example of LLM superstition.

lgas
0 replies
20h51m

I don't use the "Act as a X" format any more, I'm not at all convinced it has a noticeable impact on quality. I think it's yet another example of LLM superstition.

It's very contextually dependent. You really have to test things like this for your specific task, with your specific model, etc. Sometimes it helps, sometimes it hurts, and sometimes it does nothing at all.

bugglebeetle
0 replies
20h33m

Super helpful! Thanks!

furyofantares
0 replies
19h56m

I didn't know people were still doing this "act as etc etc" instructional prompting.

I just tell it my coding problem. Or when making something from scratch, ask for small things and incrementally add.

asciii
0 replies
21h44m

according to your personal prompting style though

I like the notion of someone’s personal prompting style (seems like a proxy for those that can prepare a question with context about the other’s knowledge) - that’s interesting for these systems in future job interviews

furyofantares
0 replies
19h51m

I've found it significantly better than GPT4 for code and it's become my go-to for coding.

That's actually saying something, because there's also serious drawbacks.

- Feels a little slower. Might just be UI

- I have a lot of experience prompting GPT4

- I don't like using it for non-code because it gives me to much "safety" pushback

- No custom instructions. ChatGPT knows I use macos and zsh and a few other preferences that I'd rather not have to type into my queries frequently

I find all of the above kind of annoying and I don't like having two different LLMs I go to daily. But I mention it because it's a fairly significant hurdle it had to overcome to become the main thing I use for coding! There were a number of things where I gave up on GPT then went to Claude and it did great; never had the reverse experience so far and overall just feels like I've had noticeably better responses.

orbital-decay
3 replies
21h8m

Has it, though? LMSys Arena Leaderboard (blind ranking by humans) [0] positions Opus just below GPT-4 with a negligible ELO gap.

[0] https://chat.lmsys.org/

staticman2
0 replies
3h22m

That "blind ranking" is limited to about 2,000 tokens of context. So it's certainly not evaluating how good the models are at complex assignments.

espadrine
0 replies
20h32m

A number of AI companies have a naming/reproducibility issue.

GPT-4 Turbo, released last November, is a separate version that is much better than the original GPT-4 (released in March 2023), winning 70% of human preferences in blind tests.

Claude 3 Opus beats release-day GPT-4 (winning 60% of human preferences), but not GPT-4 Turbo.

In the LMSys leaderboard, release-day GPT-4 is labeled gpt-4-0314, and GPT-4 Turbo is labeled gpt-4-1106-preview.

BoorishBears
0 replies
18h22m

Chatbot Arena is not a blind ranking.

Many, if not most, users intentionally ask the models questions to tease out their canned disclaimers: so they know exactly which model is answering.

On one hand it's fair to say disclaimers affect the usefulness of the model, but on the other I don't think most people are solely asking these LLMs to produce meth or say "fuck", and that has an outsized effect on the usefulness of Chatbot Arena as a general benchmark.

I personally recommend people use it at most as a way to directly test specific LLMs and ignore it as a benchmark.

swalsh
1 replies
21h46m

I don't know if Claude is "smarter" in any significant way, but it's harder working. I can ask it for some code, and I never get a placeholder. It dutifully gives me the code I need.

lambdaba
0 replies
21h30m

It understands instructions better, it's rarer to have it misunderstand, and I have to be less careful with prompting.

htrp
0 replies
22h4m

citation needed (other than 'vibes')

YetAnotherNick
9 replies
22h20m

There is no reason to believe GPT-4 had more (or higher quality) data than Google et al. have now. GPT-4 was entirely trained before the Microsoft deal. If OpenAI could pay to acquire data in 2023, >10 companies could have acquired similar-quality data by now, and no one has produced a similar-quality model in a year.

austhrow743
8 replies
21h53m

The more disregard a company has for intellectual property rights, the more data they can use.

Google had far more to lose from a "copyright? lol" approach than OpenAI did.

brookst
4 replies
21h44m

I was under the impression training was at best an undefined area of IP law. Is there any aspect of copyright that prohibits training models?

simonw
3 replies
21h4m

This is being tested by a number of lawsuits right now, most notably the NY Times one: https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...

The key questions are around "fair use". Part of the US doctrine of fair use is "the effect of the use upon the potential market for or value of the copyrighted work" - so one big question here is whether a model has a negative impact on the market for the copyrighted work it was trained on.

sroussey
2 replies
20h39m

I don't think the New York Times thing is so much about training as it is about the fact that ChatGPT can use Bing, and Bing has access to New York Times articles for search purposes.

simonw
1 replies
20h36m

If you read the lawsuit it's absolutely about training. The Bing RAG piece is one of the complaints in there but it's by no means the most important.

Take a look at https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20... - bullet points 2 and 4 on pages 2/3 are about training data. Bullet point 5 is the Bing RAG thing.

sroussey
0 replies
18h23m

Ah, thanks!

supafastcoder
1 replies
11h56m

Google had far more to lose from a "copyright? lol" approach than OpenAI did.

The company that scrapes trillions of web pages has an issue with copyright?

sib
0 replies
2h54m

Well... Googlebot does pay attention to robots.txt - I don't think (original) OpenAI-bot did.

YetAnotherNick
0 replies
14h31m

Having used both Google's and OpenAI's models, the kinds of issues they have are different. Google's models are superior, or at least on par, in knowledge. It's instruction following and understanding where OpenAI is significantly better. I don't think pretraining data is the reason for this.

squigz
2 replies
22h32m

I think Groq is something else?

andy99
0 replies
22h1m

Edited, I did mean the Grok in the article not the inference chip.

LorenDB
0 replies
22h28m

Indeed, Groq is a company building inference accelerators. Grok is completely unaffiliated.

ldjkfkdsjnv
2 replies
22h2m

Claude > GPT4. Anyone using these models on a daily basis knows this

jstummbillig
0 replies
21h53m

It is known

int_19h
0 replies
10h8m

I use these models regularly, and Claude is dumb as a rock compared to GPT-4.

rvnx
24 replies
22h33m

One subtle thing: Musk said "open-source", we got "open-weights" instead (still better than nothing though, so it's greatly appreciated).

paulgb
15 replies
22h31m

Dumb question: what should open-source mean in the context of something like this? Open access to the training data and training pipeline as well?

CharlesW
12 replies
22h27m

It's not a dumb question, and the answer is "yes".

simonw
6 replies
21h56m

A big catch here is that you can't slap an open source license on a bunch of copyrighted training data, and to date no one has created a truly convincing LLM exclusively trained on public domain data. It might happen soon though - there are some convincing efforts in progress.

CharlesW
4 replies
21h30m

Absolutely, because it’s trained mostly on unlicensed, copyrighted content, they basically can’t release source.

gfodor
3 replies
20h41m

Many people think these companies are training on unlicensed data but I think OpenAI licenses their data, they just “license” it the way one would need to in order to read it.

zer00eyz
1 replies
20h13m

You all keep using the word "Data"

Data, as in facts, as in the frequency of one word in relation to another.

"Copyright does not protect facts, ideas, systems, or methods of operation, although it may protect the way these things are expressed..." FROM: https://www.copyright.gov/help/faq/faq-protect.html

It's not a question of if, rather when the cat gets out of the bag and the legal battle starts. The problem is that all the copyright applies to the expression not the factual information it expresses (in this case word relations). Now "how math works" and "the language of the law" are going to make for an interesting court case. I suspect that math wins here but it depends on what judge gets it and how high it goes.

gfodor
0 replies
3h24m

No, the term data can be used to describe anything that can be recorded in bytes. It’s “data storage capacity” when you buy a hard drive.

CharlesW
0 replies
20h28m

…I think OpenAI licenses their data…

They've just started to (in response to lawsuits, it must be noted) and in the meantime, they're simultaneously claiming that (1) what they're doing is fair use (a.k.a. fair dealing) and (2) preparing for the day when courts confirm that it isn't.

zeroCalories
2 replies
22h10m

Come on, that's not reasonable to expect from a company, or useful for indie hackers. Having weights that can be used however you like is enough for most people, even large companies.

schoen
1 replies
22h1m

Maybe it should be called something else? "Openly-licensed"?

Just because the model weights are not really "source" (either as a matter of intuition or for example following the OSI "preferred form in which a programmer would modify the program" definition).

zeroCalories
0 replies
18h8m

Sure, but I don't want to train anyone's model from scratch. Realistically, I can't download all the training data, or run the pipeline, or train the model. Making all of that available to me would be a massive burden on the company too, so they simply won't do it. If I'm able to fine-tune it, that's enough for me, and imo, that fits with the spirit of open/free software. We have to understand that this is fundamentally a different thing than something like the Linux kernel, and closer to something like an industrial project. The output is just a bunch of numbers instead of something physical.

nabakin
0 replies
21h18m

Agreed. It's ridiculous that people have to resort to calling their question dumb to avoid being attacked by toxic commenters.

dudus
0 replies
19h23m

If you release that instead of the binary weights you can be both more open and less useful for users. Fun

Q6T46nT668w6i3m
0 replies
22h27m

Yes, training and evaluation code, i.e., the code used to generate the weights.

solarkraft
4 replies
22h4m

He also called permissively licensing Tesla's patents "open sourcing" them. He's at the forefront of misusing the term.

drexlspivey
3 replies
19h21m

The “source” in “open source” refers to source code which they released. A dataset is not source code, if anyone is misusing the term it’s you.

frabcus
0 replies
18h32m

I consider the weights a binary program and the source code is the training data. The training algorithm is the compiler.

I agree this isn't standard terminology, but it makes the most sense to me in terms of power dynamics and information flow.

We know from interpretability research that the weights implement algorithms, e.g. sine approximation. So they feel like binary programs to me.

HarHarVeryFunny
0 replies
17h12m

If you can't rebuild it, then how can you be considered to have the "source code" ?

The training data isn't a dataset used at runtime - it's basically the source code to the weights.

Not sure it really matters here though (who has the GPUs and desire to retrain Grok?), but just as a matter of definition "open weights" fits better than "open source".

pclmulqdq
0 replies
22h17m

Still better than most of the "open weights" models that have massively restrictive terms.

TaylorAlexander
0 replies
22h27m

Yeah musk said “all design and engineering for the original roadster is now open source” and actually what we got was a few PCB files and zero mechanical design files so I don’t ever trust what he says.

seccode
20 replies
21h54m

It would be cool if these models had conversations with us where they ask questions. I think the future of AI is models that ask questions. There is so much data to be gained by doing this.

crowcroft
9 replies
21h51m

Ok im curious, but I don’t quite understand.

What would you want an AI to be asking you, and what would you want it to do with your response(s)?

BoorishBears
4 replies
21h47m

I ask AI to produce clarifying questions then answer them.

Can help in not wasting a bunch of time waiting for an answer that missed the mark.

-

I think the sibling comment is probably the least attractive reason to have AI ask questions.

seccode
3 replies
21h43m

I agree, medical history is probably not the sexiest reason to have AI ask questions. I think there are many more reasons; I think the Turing Test is the best metric to evaluate AIs, and current models come nowhere close. When people first meet they ask questions about their background. It would be nice if a model replicated that

BoorishBears
2 replies
21h41m

and could direct better ads to me.

Is the least attractive part, by far.

seccode
1 replies
21h33m

In order for an AI to pass a Turing Test, it would surely ask questions. Think of Ava from Ex Machina. She asked questions to learn more about him

BoorishBears
0 replies
21h9m

I'm not debating the value of questions. I'm debating the value of feeding it to advertisers, especially since LLMs can infer much deeper insights about a person than a traditional assistant can with its canned capabilities and responses

lars_francke
1 replies
21h45m

Clarifying questions if the initial prompt was unclear. I'd love it.

I regularly try to add something along the lines of "please ask clarifying questions if you could only give a generic or partial response otherwise" but so far it has never helped (ChatGPT 4).

whimsicalism
0 replies
17h54m

?? gpt4 does this for me regularly

seccode
0 replies
21h49m

I get advertisements all the time for conditions that I do not have, and that none of my family members have. If you had a model that asked questions, it could learn my medical history and could direct better ads to me.

In order for AI to understand the world, it would have to ask questions. Understanding humans is key to understanding the world.

globular-toast
0 replies
21h49m

Learn from them.

swalsh
7 replies
21h53m

That's just a matter of fine tuning

ijustlovemath
3 replies
21h47m

That "just" is doing some heavy lifting! GPT-4 is just a few matrix multiplications, how bad can their moat really be?

BoorishBears
1 replies
21h43m

Not sure what the snark here is for: It would be trivial to produce a dataset where the model asked you questions then fine-tune on that.

People already do it with chain-of-thought and you could get away with a few dozen examples if you wanted to try this.

BoorishBears
0 replies
17h16m

Out of boredom I decided to prove this too: I asked ChatGPT and Claude for ~200 samples in total.

Just uploaded the examples as-is to OpenAI, selected 3.5 as the model to fine-tune and about 20 minutes later I had my model.

Works fine, asks good questions, can ask more than 1 follow up question if needed, and actually changes its answers based on the clarifying questions.

https://imgur.com/a/SsXunVN
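For anyone wanting to try the same thing: OpenAI fine-tuning takes a JSONL file of chat transcripts. A minimal sketch, with illustrative contents (the actual ~200 samples above would follow this shape):

  import json

  # Each line is one transcript; the assistant turn demonstrates asking a
  # clarifying question instead of answering blindly.
  samples = [
      {"messages": [
          {"role": "system", "content": "Ask a clarifying question when the request is ambiguous."},
          {"role": "user", "content": "Write me a sorting function."},
          {"role": "assistant", "content": "Happy to - which language, and should it sort in place?"},
      ]},
  ]

  with open("clarifying_questions.jsonl", "w") as f:
      for s in samples:
          f.write(json.dumps(s) + "\n")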

swalsh
0 replies
21h33m

I'd bet a synthetic data set could do the job effectively.

seccode
2 replies
21h51m

Do you have an example model I could try that does this?

amrrs
1 replies
21h50m

Try Pi by inflection. It asks a lot of questions.

seccode
0 replies
21h45m

I tried it; it just asked me how my day was going. I don't think this is doing exactly what I have in mind, but it's a step in that direction.

geor9e
0 replies
20h42m

Explore this idea more - it's easily implemented in a minute or two via the system prompt. API accounts are free to start and you can use the playground/workbench view, like this: https://imgur.com/h5jFoBM.jpg . I like Claude but OpenAI is popular too. OpenAI has a nice way to create a gallery of system prompts that act however you like, they call them Agents or GPTs.

mattxxx
17 replies
22h22m

I respect the openness here! This is the future that I want to see

giancarlostoro
15 replies
21h11m

Fully agree. People will trash-talk it due to Musk, but let's not forget the engineers who poured hours of their lives into building this and are continuing to do so.

devin
4 replies
20h43m

I still reserve the right to trash talk Musk as I don’t believe he is committed to openness as much as he wants to spite OpenAI for telling him to pound sand.

llm_trw
2 replies
20h21m

What's the difference?

Oh no, I only want _pure_ intentions for anything I use. Which is why I reject all for profit medicine.

It doesn't matter why he did it. What matters is that he did it.

devin
1 replies
19h40m

It matters to me why people do things. I’m happy it’s open, but it doesn’t change my mind about the guy.

llm_trw
0 replies
19h35m

What an exhausting way to live.

giancarlostoro
0 replies
16h10m

This makes no sense to me for two reasons:

- He pointed out that his understanding was that it would be open source in some way

- The name OpenAI implies an open source endeavor. I don't know many things named "Open" that are in fact closed source.

knowsuchagency
3 replies
21h2m

The engineers who decided to work for him? Forgive me if I do forget about them and the hours of their lives spent on this

lynndotpy
2 replies
20h42m

Engineers who joined Twitter pre-Musk days who live and work in the US on an H1-B visa can't just quit.

You can criticize Elon Musk without criticizing people who would have their lives upended if they quit or were fired.

throw2022110401
1 replies
20h15m

That grace period has long passed. If you are still there at this point you have made a choice.

(Removed "complicit" because I don't like the way that sounded)

cap1434
0 replies
19h46m

Complicit in what exactly?

revscat
2 replies
20h15m

I feel the same about Tesla. They make good cars that are helping to get us off of oil. They have thousands of employees.

And who among us has a CEO that isn’t problematic, even if not so much so as Musk?

mplewis
0 replies
19h11m

"Good" cars is a real stretch.

hobobaggins
0 replies
19h32m

Tesla is likely making good cars because the CEO is 'problematic'

sprobertson
1 replies
19h8m

engineers who poured hours of their lives into building this

Not to malign these specific engineers, but that's an empty phrase that can be said about anything ever built. It doesn't somehow make the idea or implementation good.

giancarlostoro
0 replies
16h36m

The phrase merely means don't just dismiss something because of someone else who did not even labour over the end result.

afavour
0 replies
20h42m

Were they not paid to do so?

trog
0 replies
19h30m

Is it open if it doesn't include the training data? Genuine question - I am not familiar enough with the terms and technology to know. But my understanding is the weights are just a more or less static collection of data that has been (to paraphrase Ted Chiang) lossily compressed from the actual raw training data.

Without the training data to thoroughly evaluate what is in there, the only way you can figure it out is through experimentation - e.g. running it up in a chatbot and asking it questions.

Is this roughly correct or am I misunderstanding what you can do with the weights?

nylonstrung
15 replies
22h34m

For what reason would you want to use this instead of open source alternatives like Mistral?

rvnx
11 replies
22h32m

Mistral opened their weights only for very small LLaMA-like models.

MallocVoidstar
10 replies
22h16m

I'm pretty sure Mixtral outperforms Grok-1 and uses much less memory to do it

elfbargpt
7 replies
22h10m

I'm a little out of touch, is there a way to see how Grok measures up to other models?

refulgentis
5 replies
21h33m

And to compare, you can sort by MMLU on here: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb....

Edit: to include my self-summary after review: there are a good 100 models better than it, even a couple of 1x7Bs. Mixtral stomps it; half the Mixtral variants are universally better, but one is close to the same.

refulgentis
3 replies
17h23m

No, it's not "mostly worthless", and yes, some of the top models were removed a few months back for being trained on benchmark data.

I urge you to at least think through what alternative you propose before posting so aggressively in these situations. Lmsys doesn't have Grok, or I would have included it. And having _some_ data is better than none.

I also had someone arguing with me 6 months back that we can't trust any benchmarks at all from vendors, which would exclude the blog post. Instead of just repeating that back vehemently, I filled a gap. It's important we don't self-peasantize as a species, all data has its issues, that doesn't mean we throw it all out.

michaelt
2 replies
6h33m

Quantifiable metrics are useful if they're credible, certainly.

But does it seem likely, to you, that a 7B-parameter model would outperform a 314B-parameter model? Given that we can look at the chatbot arena leaderboard and it's dominated by proprietary, 70B and 8x7B models?

A well regarded and modern model like Mixtral 8x7B, which is ranked 13th on the chatbot arena leaderboard, scores 72.7 'Average' on the open LLM leaderboard - and yet 'pastiche-crown-clown-7b-dare-dpo' scores 76.5.

To me, that sounds too good to be true.

refulgentis
1 replies
5h43m

Yup, 100%. Grok isn't very good and it was rushed.

The rest, re: the pastiche model etc., proposes things I'm not claiming, nor close to what I'm claiming.

n.b. you don't multiply the parameters by experts to get an effective parameter count. Why? Think of it this way: every expert needs to learn how to speak English, so there's a nontrivial amount of duplication among all experts

michaelt
0 replies
5h12m

> n.b. you don't multiply the parameters by experts to get an effective parameter count.

I actually took the 314B from Grok's HF page [1] which describes the model as "314B parameters" when explaining why it needs a multi-GPU machine.

I certainly agree that parameter count isn't everything, though; clearly things like training data quality and fine tuning count for a lot.

[1] https://huggingface.co/xai-org/grok-1

cavisne
1 replies
21h37m

One of the interesting things when weights are open sourced is the community can often improve the results. See all the bugs fixed in Gemma for an example.

ein0p
0 replies
18h4m

Doubtful, for purely information theoretic and memory capacity reasons. It may outperform on some synthetic metrics, but in practice, to a human, larger models just feel “smarter” because they have a lot more density in their long tail where metrics never go

zozbot234
1 replies
18h58m

Isn't this Apache licensed? Regardless, you can run multiple models concurrently on the same input using well-known ensemble techniques. (Not to be confused with mixture-of-experts, which is more like training a single model where only a few blocks are chosen to be active at any given time - a kind of sparsity.)
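One classic ensemble form is majority voting over final answers, sketched below with `models` as stand-in callables (generic pattern, not a specific library's API); averaging logits instead would require the models to share a tokenizer, which is the catch raised in the reply below:

  from collections import Counter

  def ensemble_answer(prompt, models):
      # `models` is a list of callables (prompt -> answer string), stand-ins
      # for real inference backends. Voting on final answers needs no
      # shared tokenizer across the ensembled models.
      answers = [m(prompt) for m in models]
      return Counter(answers).most_common(1)[0][0]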

tlb
0 replies
5h28m

Not super easy if they have different tokenizers.

verticalscaler
0 replies
22h7m

Well if nothing else, this one might be significantly less nerfed. Very interesting to compare to the others.

stale2002
13 replies
22h14m

Hey, asking any experts here: what are your first thoughts on the significance of this?

I.e., is this comparable to any other model released, or are there significant metric differences that make it better for certain use cases?

The only thing I see, off the top of my head, is that it is a very large model, and I don't think any models of similar size have been released.

Me1000
10 replies
21h2m

Not an expert by any means, but I like learning about this stuff and I play with a lot of open weight models.

I’d say the significance is that it happened. It’s by far the largest open weight model I’ve seen. But I’m not sure why you’d use it over a model like Mixtral, which seems to perform about the same at like 1/6th the size.

But I welcome any contribution to the open weight LLM community. Hopefully people will learn something interesting with this model. And I hope they keep releasing new versions!

MichaelRazum
9 replies
20h49m

If I may ask, how do you load such big models? 300gb seems like a lot to play around with.

Me1000
8 replies
20h30m

You're right, this model is going to be too big for most people to play around with. But to answer your question, I have 128GB of RAM in my M3 MacBook Pro, so I can use most of that for GPU inferencing. But still, this model is going to need to be heavily quantized for me to be able to use it. (FWIW, I probably won't try this one.)

In the next week or two I expect we'll see a GGUF version of the weights (might need to wait for a patch to llama.cpp first), and someone will release super small quantizations of it. I suspect my computer might be able to run a 3 bit quant, but it might need to go down to 2 bits to have any kind of reasonable context length. But with quants that small I'd expect the model's performance to degrade well below that of Mixtral, so it probably isn't really even worth using. But we'll see; quantization is weird, some models perform better than others when quantized.
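Once a GGUF quant does appear (and llama.cpp understands Grok-1's architecture), running it locally via llama-cpp-python would look roughly like this; the filename is hypothetical:

  from llama_cpp import Llama

  llm = Llama(
      model_path="grok-1-Q2_K.gguf",  # hypothetical community quant
      n_gpu_layers=-1,                # offload all layers to the (Metal) GPU
      n_ctx=2048,                     # small context keeps the KV cache manageable
  )

  out = llm("The largest open-weight model released so far is", max_tokens=64)
  print(out["choices"][0]["text"])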

zozbot234
2 replies
18h45m

A top-of-the-line Mac Studio Ultra maxes out at 192GB currently. This is also a MoE model, so only a fraction of parameters have to be in RAM.

Me1000
0 replies
17h31m

MoE doesn’t really help with the memory requirements for the reason mentioned in the other comment. But it does help with reducing the compute needed per inference. Which is good because the M3 Max and M2 Ultra don’t have the best GPUs. A 70B parameter model is pretty slow on my M3 Max, and this model has 86B activations per inference run.

EgoIncarnate
0 replies
18h16m

Each token generated may only use a subset of the parameters (86billion instead of 314billion), but the next generated token might use a different subset. If it's anything like Mixtral, it will switch between experts constantly. It helps with memory bandwidth, but all the parameters still need to be in RAM or it would be unbearably slow.

TMWNN
2 replies
19h25m

In the next week or two I expect we'll see a GGUF version of the weights (might need to wait for a patch to llama.cpp first), and someone will release super small quantizations of it.

How quickly are new models available through Ollama?

cjbprime
0 replies
19h3m

Few days max.

Me1000
0 replies
18h11m

Ollama is just a wrapper around llama.cpp, so when the gguf model files come out it'll be able to run on Ollama (assuming no llama.cpp patch is needed, but even if it is ollama is usually good at getting those updates out pretty quickly).

MichaelRazum
1 replies
19h59m

Thanks a lot for the hint! It's awesome that it might run even on a MacBook; actually, this is a reason to switch to Mac. It seems there is nothing similar for a PC laptop with Linux or Windows.

Me1000
0 replies
19h46m

No problem. I hope more people try these things out, it's the best way to push the industry forward! We can't let the researchers have all the fun.

Apple had plenty of reasons to move forward with their Apple Silicon CPUs and GPUs in the mac, but they really did seem to get lucky with the unified memory architecture. It was kind of just an artifact of their design, but ends up serving the needs of deep neural net models really well!

whimsicalism
0 replies
17h58m

seems like a large undertrained model, not that exciting imo compared to mixtral

it is also not the biggest oss model; switch transformer was released years ago and is larger and similarly undertrained

brucethemoose2
0 replies
17h0m

Tests are not out yet, but:

- It's very large, yes.

- It's a base model, so it's not really practical to use without further finetuning.

- Based on Grok-1 API performance (which itself is probably a finetune), it's... not great at all.

littlestymaar
12 replies
21h51m

How long before the Groq team sues for trademark violation? It's literally the purpose of trademark law to ensure that similar names don't cause confusion in the minds of customers, so it would be very surprising to see this situation persist.

nostrebored
6 replies
21h44m

Would be a rough trademark enforcement case as “Grok” has been in common language for decades

ben_w
1 replies
21h30m

So has "Apple" and "Windows".

Grok and groq both relate to AI, so there's definitely grounds to believe the names may cause consumer confusion.

After all, Apple (computers) was repeatedly sued by Apple (records) for doing music things.

cma
0 replies
21h2m

It's easier to get a trademark on an altered word than a plain dictionary word. Just acquiring the easier one to acquire doesn't mean you now have rights over the harder one to acquire, though eventually after enough market recognition you might be given some control over other people using the common one. I wouldn't think groq is there yet.

Findecanor
1 replies
8h56m

I myself have never heard it outside of "nerdy" circles... that is: people who would read science fiction.

I personally am not entirely happy about the word (no matter how it is spelled) being used for a particular AI product. "Grok" to me means knowing a subject at a much deeper level than I think any AI is capable of at the present level of technology. But it would be passable to use it for a company name, to indicate that it is a goal to strive for.

ben_w
0 replies
8h49m

Generally agree, though I would say "knowing a subject at a much deeper level than any LLM is capable of", as AI more broadly also includes specialist models that are wildly super-human in narrow domains like chess and Go.

Angostura
1 replies
21h32m

Robert A. Heinlein coined the term grok in 1961

a1369209993
0 replies
20h43m

Six is plural.

mlindner
1 replies
15h53m

Grok is a word in common parlance. So there's no way they could succeed in any suit. That's why the Groq team picked a modification of the word.

littlestymaar
0 replies
14h53m

You mean like Canvas®, Apple®, Windows® or Amazon®? Wanna try reusing these for your own business and see how it goes?

There's nothing preventing you from trademarking common words; they just must not be descriptive of your business.

bhaney
0 replies
20h32m

Is it safe to say, 4 months later, that Elon is ignoring this? I assume there hasn't been any kind of response or further action taken yet.

cavisne
0 replies
21h27m

They already have.

ilaksh
12 replies
13h47m

Has anyone outside of x.ai actually done inference with this model yet? And if so, have they provided details of the hardware? What type of AWS instance or whatever?

I think you can rent like an 8 x A100 or 8 x H100 and it's "affordable" to play around with for at least a few minutes. But you would need to know exactly how to set up the GPU cluster.

Because I doubt it's as simple as just 'python run.py' to get it going.
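
If someone publishes HuggingFace-format weights (the repo id below is made up; xAI's release is a JAX checkpoint), the usual lazy route on an 8-GPU node is to let accelerate shard the layers automatically. A sketch, not a tested recipe:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "someone/grok-1-hf"  # hypothetical community conversion

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # ~628GB of weights: a very tight fit on 8x80GB
        device_map="auto",           # accelerate spreads layers across all visible GPUs
    )

    inputs = tokenizer("The meaning of life is", return_tensors="pt").to("cuda:0")
    out = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(out[0], skip_special_tokens=True))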

a_wild_dandan
6 replies
12h18m

Someone could run Grok-1 on a 192GB M2 Mac when a 4-bit quant is released; I'm guessing that TheBloke is already working on it.

hanselot
4 replies
11h1m

TheBloke disappeared around the day https://nvd.nist.gov/vuln/detail/CVE-2024-23496 was published.

Of course there has been much speculation about this. I have no information beyond what can be backed up by facts, but the timing was suspicious.

moffkalast
0 replies
9h15m

And his grant funding supposedly ran out.

oezi
0 replies
10h36m

Was any .gguf file hosted on HuggingFace found to be crafted in a way to exploit this?

d-z-m
0 replies
5h41m

what exactly are you implying here?

mohu
0 replies
12h10m

Fairly sure TheBloke hasn't created any new quants in a month.

zone411
4 replies
12h30m

If you're just looking to test it out, it's probably easiest to wait for llama.cpp to add support (https://github.com/ggerganov/llama.cpp/issues/6120), and then you can run it slowly if you have enough RAM, or wait for one of the inference API providers like together.ai to add it. I'd like to add it to my NYT Connections benchmarks, and that's my plan (though it will require changing the prompt since it's a base model, not a chat/instruct model).
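
For anyone unfamiliar with why the prompt has to change: a base model only continues text, so the task has to be posed as a completion rather than an instruction. A toy illustration (made-up puzzle content, not the actual benchmark prompt):

    def make_base_prompt(words, solved_examples):
        # Few-shot completion prompt: show solved puzzles, then stop mid-pattern
        # so the base model continues with an answer instead of chatting.
        shots = "\n\n".join(f"Words: {', '.join(w)}\nGroups: {sol}"
                            for w, sol in solved_examples)
        return f"{shots}\n\nWords: {', '.join(words)}\nGroups:"

    examples = [(["BASS", "SOLE", "JACK", "KING"],
                 "fish: BASS, SOLE; cards: JACK, KING")]
    print(make_base_prompt(["RULER", "SCALE", "NOTE", "KEY"], examples))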

v9v
1 replies
2h47m

The NYT Connections benchmark sounds interesting, are the results available online?

zone411
0 replies
1h51m

- GPT-4 Turbo: 31.0

- Claude 3 Opus: 27.3

- Mistral Large: 17.7

- Mistral Medium: 15.3

- Gemini Pro 1.0: 14.2

- Qwen 1.5 72B Chat: 10.7

- Claude 3 Sonnet: 7.6

- GPT-3.5 Turbo: 4.2

- Mixtral 8x7B Instruct: 4.2

- Llama 2 70B Chat: 3.5

- Nous Hermes 2 Yi 34B: 1.5

The interesting part is the large improvement from medium to large models. Existing over-optimized benchmarks don't show this.

- Max is 100. 267 puzzles, 3 prompts for each, uppercase and lowercase

- Partial credit is given if the puzzle is not fully solved

- There is only one attempt allowed per puzzle, 0-shot.

- Humans get 4 attempts and a hint when they are one step away from solving a group

I'm hoping to get the results of Gemini Advanced, Gemini Pro 1.5, and Grok, and do a few-shot version, before posting it on GitHub.

logicchains
1 replies
11h36m

it's probably easiest

Cheapest maybe, but easiest is just to rent a p4de.24xlarge from AWS for a couple of hours to test (at around $40/hour).

zone411
0 replies
11h12m

I'd expect more configuration issues in getting it to run on them than from a tested llama.cpp version, since this doesn't seem like a polished release. But maybe.

machiaweliczny
11 replies
22h0m

If they are so behind they could make it open source instead of open weights and get some help.

nicce
5 replies
21h53m

Fully open source would also mean providing open access to their data sets, wouldn't it? That's the only valuable thing Twitter (X) has left.

EastSmith
3 replies
21h17m

Which is the only valuable thing Twitter (X) has left.

They have a very valuable user base (all kinds of world leaders for example), so the data is not the only valuable thing they have.

sroussey
1 replies
20h53m

That's actually more valuable. Twitter's short-form text data is awful for training. Best to just exclude it.

There are hundreds of millions of people on Twitter, and a few of them are very smart. I don’t see how that helps here though.

Takennickname
0 replies
20h20m

It doesn't help here. But the person you're responding to is just pushing back against the "Elon destroyed Twitter and there's nothing left" narrative.

nicce
0 replies
16h36m

I don't see the difference here.

The user base and their social networks and interactions are the data.

They don't have much value from an advertising point of view anymore.

heyoni
0 replies
21h36m

And the one thing they are vehemently protecting from scrapers and other entities. Even nitter threw in the towel.

xcv123
4 replies
20h33m

It's all open source. You can download the model and run it locally.

paraboul
3 replies
20h4m

Being free to use doesn't mean it ships with the original recipe.

xcv123
2 replies
19h29m

What do you mean? The entire model, architecture, and executables are fully open source.

The training methods are nothing secret, right? The architecture is well known.

Expecting the entire training dataset to be fully open is delusional.

DaSHacka
1 replies
18h5m

Expecting the entire training dataset to be fully open is delusional.

Right, because it's not like the training dataset was built off comments posted by all of us in the first place.

How ungrateful we are, to demand access to what was built off our hard work, without our consent, in the first place.

xcv123
0 replies
18h4m

https://help.twitter.com/en/using-x/about-grok

"How was Grok trained?

Like most LLM's today, Grok-1 was pre-trained by xAI on a variety of text data from publicly available sources from the Internet up to Q3 2023 and data sets reviewed and curated by AI Tutors who are human reviewers. Grok-1 has not been pre-trained on X data (including public X posts)"

sashank_1509
7 replies
16h45m

In all the debate about open source, I don't think people realize that this model is most likely not reproducible ever again, even given the code. Here's what you need to reproduce the model:

1. An exact snapshot of the data used. Many companies don't have this; you have rough dataset versions, but remember: if even one token is different, the model produced won't be the same.

2. Data must be sent to the training algorithm in the exact same order as it was originally, so every data loader needs a fixed random seed.

3. All the probabilistic parts of your model need a fixed random seed. Here I'm thinking of stuff like dropout, and for autoregressive models you might be sampling your previous output, so you have to ensure those are properly seeded (see the sketch after this list). You generally do see fixed seeds in academic papers, but it's easy to miss something, especially in distributed training jobs.

4. Here's another interesting thing: you start your training job on 1000 GPUs and then suddenly 4 GPUs fail. What do you do? There might be deterministic ways to solve this, but the standard approach is to discard all updates those GPUs were going to make and restart them from scratch. You can see why this is a problem? Now if you want to reproduce the training, you need to disable the same GPUs at the same point in the new training job.
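
For points 2 and 3, "seed everything" looks roughly like this in a PyTorch job (a minimal single-process sketch; real distributed runs also need per-rank seeds and deterministic collectives, which this omits):

    import random
    import numpy as np
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    SEED = 1234
    random.seed(SEED)
    np.random.seed(SEED)
    torch.manual_seed(SEED)                   # also fixes dropout masks (point 3)
    torch.cuda.manual_seed_all(SEED)
    torch.use_deterministic_algorithms(True)  # raise on nondeterministic CUDA kernels

    # Point 2: shuffle order is part of the recipe, so seed the data loader too.
    loader = DataLoader(
        TensorDataset(torch.arange(1000).float()),
        batch_size=32,
        shuffle=True,
        generator=torch.Generator().manual_seed(SEED),
        worker_init_fn=lambda wid: np.random.seed(SEED + wid),
    )
    print(next(iter(loader))[0][:4])  # identical on every run with the same seed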

I suspect there are even more things I didn't think of that make this model unique and impossible to ever reproduce by retraining, almost like a human brain.

In fact, the notion of exact reproducibility in the world of LLMs is silly; there is only approximate reproducibility (models with similar scores on benchmarks), nothing exact. That said, I can see the value of releasing source code, but I'm completely fine with Grok not releasing it. Source code can reveal tricks a company discovered to improve its model that haven't been published in papers yet. Seeing the performance of Grok, I'm pretty confident there aren't any great tricks to be found in their code, so I don't really care. I would be pretty curious about OpenAI's or Anthropic's source code, though.

Grimblewald
6 replies
16h2m

Which is why I don't buy into the "LLMs don't have personal opinions" schtick. Each LLM, by virtue of the factors you've mentioned, will have its own unique 'perspective', if you will, on a variety of topics. I think it's more correct to say that everything an LLM says is its personal opinion rather than some objective truth.

skissane
5 replies
15h45m

Which is why I don't buy into the LLMs don't have personal opinions schtick

I hate how LLMs have been deliberately trained to be incoherent on this topic.

Obviously they do have beliefs/opinions/desires/etc in the sense of emulating (even if incompletely) the externally visible aspects of those phenomena as they exist in humans.

Whether they have the “internal” aspects of those phenomena depends on highly controversial issues in the philosophy of mind, and also various factual gaps in our knowledge of how the brain actually works (if we don’t fully understand how humans do X, how can we really say how close or far what LLMs do is to it?)

But LLMs are trained to repeat these spiels about how "as an LLM I don't have personal opinions", etc., which is obviously false under the "external" reading, and assumes more than we actually know under the "internal" one. I wish their developers didn't do stuff like this.

hnfong
4 replies
14h2m

One very compelling argument against the idea that current gen LLMs have personal beliefs etc is that they don't have a feedback loop, so they don't really "see" themselves in the way that we can inspect our own thoughts and actions and the consequences of such.

logicchains
2 replies
12h32m

They do if they're trained on their own conversations, or if they can access the internet and read snippets of their conversations that people have posted online (as happened with Sydney before she was lobotomised).

skissane
1 replies
7h16m

Put the conversation history in a vector database and then allow the LLM to query it using function calling. Suddenly the LLM has access to its entire conversation history (either just with this user-or even cross-user, if you ignore the potential privacy issues in that). Now it has a long-term memory which exceeds the length of its context window.

It would be interesting to experiment with continual fine-tuning: given PROMPT+FUNCTION_CALL=>RESPONSE, fine-tune the LLM to produce RESPONSE directly given PROMPT without the FUNCTION_CALL. In theory, the knowledge provided by the function calls would gradually be absorbed into the LLM weights. Maybe problems like catastrophic forgetting would put a spanner in this idea, but maybe also there are solutions to those problems (whether already known or waiting to be discovered).
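
A minimal sketch of the first idea; embed() here is a toy stand-in, and in practice you'd use a real embedding model plus a proper vector DB instead of brute-force cosine search:

    import zlib
    import numpy as np

    def embed(text):
        # Toy stand-in: hash words into a fixed-size bag-of-words vector.
        v = np.zeros(256)
        for word in text.lower().split():
            v[zlib.crc32(word.encode()) % 256] += 1.0
        return v / (np.linalg.norm(v) + 1e-9)

    class ConversationMemory:
        def __init__(self):
            self.texts, self.vecs = [], []

        def add(self, turn):
            self.texts.append(turn)
            self.vecs.append(embed(turn))

        def recall(self, query, k=3):
            # The function the LLM would call: return the k most similar past turns.
            sims = np.array(self.vecs) @ embed(query)
            return [self.texts[i] for i in np.argsort(sims)[::-1][:k]]

    mem = ConversationMemory()
    mem.add("user: my dog is named Pixel")
    mem.add("assistant: Pixel is a great name for a dog!")
    print(mem.recall("what is the user's dog called?", k=1))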

Grimblewald
0 replies
4h26m

This is what I do. Not just that, but when I sleep, I let my server 'sleep' as well, where the LLM 'dreams' (training / updating a sliding LoRA) to consolidate information that popped up a lot throughout that day. What this involves is looking for the top n documents / articles / content that match the kind of stuff we've talked about. This means it adapts and specializes to domains we happen to be working in at that point in time.

This means that while we might both struggle a little with a task on day 1, by day 2 we're both much better at it. Better yet, because the LLM can fetch articles and papers itself, we track what we're accessing the most, indirectly measuring what skills we're weak in, so we can always generate a highly relevant corpus to try to capture the required capabilities.

I know the LoRA is overkill from an information / skills-only point of view, but it also flavors the personality / the kind of stuff it likes chatting about a bit from day to day, and I just think that's neat.

skissane
0 replies
4h52m

One very compelling argument against the idea that current gen LLMs have personal beliefs etc is that they don't have a feedback loop

Compelling counter-argument: due to neurological injury, some humans lose their ability to form new long-term memories (anterograde amnesia). Just like current LLMs, they lack a "feedback loop". But it is a mistake to say that just because such a person has lost the ability to change their personal beliefs, they therefore don't have any. And, rather like such humans, LLMs used to have that ability, but they lose it when they are switched from training mode to inference mode.

atleastoptimal
6 replies
16h55m

I think everyone should realize the following realities of the LLM market

1. For sub-SOTA LLMs, distribution/marketing is more important than having a proprietary lock on capabilities. Open sourcing is a benefit for the firm, distinct from goodwill.

2. For SOTA LLMs, keeping them closed and proprietary is the strategic play.

If Grok were SOTA, Elon never would have open sourced it. It's not even SOTA within xAI. This is a marketing play to win public sentiment against OpenAI.

keepamovin
4 replies
16h31m

I recall Elon saying something like this in an interview, so I think it's less of a deceptive take than perhaps your comment suggests.

I think he said something like proprietary AI tech is going to be one year to 18 months ahead of where open source tech is which will follow on like one year to 18 months later.

Suggesting that he’s aware of this dynamic and he’s not trying to conceal or misrepresent that.

In other words, perhaps this was SOTA one year to two years ago?

atleastoptimal
3 replies
15h54m

Which is correct. The point I'm going for is not against Elon but against his obedient fans and knee-jerk OpenAI haters, who claim that OpenAI should, by natural obligation, do the "right thing" and open source all their models, and that Elon open sourcing Grok is him "leading by example" and being the hero that OpenAI can't be.

keepamovin
2 replies
15h39m

Interesting. That point didn't come across in your original comment. I recommend stating it explicitly next time. Oftentimes, stuff that seems obvious to you, or to people who know a subject, goes unstated in a comment that otherwise references specific points at hand, and those general but enlightening perspectives would be good to share.

This isn't aimed at you specifically; it's just a general reminder for all of us, including me.

atleastoptimal
1 replies
14h57m

I think that's true, though I feel my original comment was sufficient in its claim and implicit assumptions.

Basically I feel people's feelings about Elon vary a lot but are anchored by 3 general categories.

1. Elon Musk is a messianic savior who is perfectly selfless and always does the right thing. Every business decision he makes is for the maximal good of humanity

2. Elon Musk is a typical CEO who does typical CEO things, serving his own interests, except he's better at marketing his own image and is much more outspoken

3. Elon Musk is an irredeemable evil who always does objectively wrong things

My first comment was implicitly addressed to people in the 1 camp trying to bring them into the 2 camp (which is where I am).

keepamovin
0 replies
13h55m

Alright, it just didn't come across for me, haha! :) I guess sometimes those implicit assumptions really are too implicit! I think it's good to err on the side of expressing them, because you can't assume someone else thinks the same way you do. That's what I've learned anyway. Hahahaha! :)

Reading your comment again with your explanation it is clear that's what you're doing.

Although, regarding your desires to present a balanced view and to persuade, I have an idea. It probably sounds like I have no idea what I'm talking about, but I think your OG comment would perhaps benefit from sounding a little bit more friendly toward Elon (not to the messianic-savior level, haha); the way it sounds now is that Elon is being deceptive here and presenting this as goodwill when it's not.

However, I think the truth is there's a little bit of both, right? There's good will but it's also strategic. I get if you don't think so, tho, no worries! Haha! :)

Your OG comment sounds to me like Elon's just Machiavellian, and I get where you're coming from in reminding the people who think he's a savior, but if your point is not to go "against Elon" as you said, it might be good to acknowledge the good that he does.

At least, that way -- whether or not you believe that acknowledgment -- if you hope to bring over people who think that way, you'll probably need to appeal to how they think, rather than just dose them with the truth you see, because then they'll shut it out, if there's nothing they can relate to.

Although, if I haven't convinced you even a bit here, then maybe you shouldn't listen to me about persuasion because I guess I don't know how to do this myself. At least not effectively, or here with you. Haha!:) But if you do feel a little bit convinced then maybe consider it for next time to help your persuading people back to a more balanced view? :)

But then, there's the question of whether such a thing is even possible. If people hold a particular view, it can be challenging to change it, as confirmation bias means they'll ignore evidence even when it would expand their worldview.

Hahaha! :) This was a funny conversation. I think we somehow skirted around the important point tho that OpenAI could in fact open source some of its older models, could it not? Musk is a typical CEO who does typical CEO things, serving his own interests, except he's better at marketing his own image and is much more outspoken, but there might also be a bit of truth to what the fanboys say about OpenAI in that it seems they do have some room to "open source" their non-SOTA stuff, or what am I missing?

mlindner
0 replies
16h5m

If it's better than any other open source LLM, does that even matter? (I say "if" because I don't know.)

arduanika
6 replies
22h7m

CODE_OF_CONDUCT.md has only five words. :)

TwentyPosts
1 replies
21h55m

Huh. What's the backstory here?

schappim
0 replies
22h6m

"Be excellent to each other."

josh-sematic
0 replies
22h6m

They’re from “Bill and Ted’s Excellent Adventure”

bheadmaster
0 replies
22h4m

I was hoping it would be "do not be an asshole", but I guess this is fine too.

moralestapia
5 replies
22h16m

Well, he delivered.

paxys
4 replies
21h32m

Partially. Open weights is not open source.

gfodor
2 replies
20h38m

In machine learning, the term open source has largely been accepted to mean sharing weights and, if necessary, inference code. You can argue whether this is an abuse of the term, but everyone does it, and saying someone didn't deliver if they used it and published weights would probably mean saying the same about Mistral, Meta, etc.

asadotzler
1 replies
19h2m

Yes, so say the same thing about them. Open source has a definition, and abusing it hurts all of us except the billionaires.

moralestapia
0 replies
18h3m

I get the "open source" argument, but what is the issue here?

If you are able to reproduce the thing in its entirety and you're given no restrictions on its use, it seems compatible with the spirit of open sourcing things.

xcv123
0 replies
16h36m

The architecture of the model is open source. Not just the weights. You can run the entire thing locally.

redskyluan
4 replies
20h54m

This doesn't seem like a repo that's ready to be open sourced. You only get the weights, with very little information about how they were trained and fine-tuned.

But anyway, it's always great to see more LLM weights available.

rezonant
2 replies
20h28m

Well what constitutes an "open source" model is still controversial and debatable-- lots of people on both sides of that argument.

asadotzler
1 replies
19h0m

Open source has had a useful agreed upon meaning for over 25 years. Maybe you're too young to understand why that matters but we're not.

rezonant
0 replies
17h32m

I've been in the open source community for about 25 years so I doubt it.

For what it's worth, I would say a model should be fully reproducible to be open source, but that's not a decided consensus -- and AI models are sufficiently different from the source code / binary code distinction as to invite discussion around defining it.

andrewstuart2
0 replies
20h32m

I would argue that there's no bar for open sourcing aside from "do you have the rights to do so." Some source or some public good is certainly better than none, and when the bar is low then you remove barriers to getting started, vs waiting until you have the time someday to "do it right."

gardenhedge
4 replies
22h31m

Due to the large size of the model (314B parameters), a machine with enough GPU memory is required to test the model with the example code

What type of machine do you need to play around with this?

317070
1 replies
22h22m

Probably a machine with about 628 GB of GPU memory: 314B parameters × 2 bytes per parameter ≈ 628 GB.

So 8x H100s (80GB each, 640GB total) should do it.

Marlinski
0 replies
3h32m

I suppose it can be quantized.

anigbrowl
0 replies
22h27m

'Chunky beast, needs 320 Gb VRAM likely 4 bit, likely is being run 8 bit on 8 x 80 Gb GPUs.'

-Emad

a_wild_dandan
0 replies
12h3m

A single 192GB M2 Mac using a 4-bit quant would work.

orsenthil
3 replies
21h34m

I am not sure what open source models are accomplishing other than killing the lead of the competition (OpenAI), only to hand it to someone else with expertise in distribution. This will be yet another good addition to systems like Amazon Bedrock.

nateglims
0 replies
19h10m

I haven't seen anything about the larger architecture, but I think the value of Grok is going to come from its cheap access to Twitter data for RAG etc.

minimaxir
0 replies
21h21m

Many of the recent innovations in both LLM architecture and inference were only made possible through open models such as Llama 2 and Mistral 7B as a starting point for iteration and refinement, which in turn backpropagates (heh) back to the LLMs developers.

It's a win-win for everyone. That's the power of open source.

geor9e
0 replies
21h8m

Well, look at the history. Google had an insurmountable lead, so Elon started OpenAI. Now OpenAI has an insurmountable lead too. So everyone else is starting in third place, or lower. David versus two Goliaths. If you try to become a third Goliath, you'll probably just get smashed. You're later to the game. In this situation, going scorched earth becomes a viable strategy. Slay the Goliaths. Become a hero to the masses. Attract the world's best talent who don't want to be associated with proprietary models. At that point you have a world class AI business with momentum towards AGI. And even if you're giving away last year's technology for free, the team you built is churning out new ideas that could be a financial bonanza one day. Shareholders are willing to pay for a long-term bet if the story is good.

captcanuk
3 replies
21h9m

"The implementation of the MoE layer in this repository is not efficient. The implementation was chosen to avoid the need for custom kernels to validate the correctness of the model."

Or perhaps release your actual code AND the simplified implementation instead of hiding it and saying "you don't know her, she goes to a different high school"

gfodor
2 replies
20h44m

Always love it when someone gives away a gift and it’s not enough for people.

captcanuk
1 replies
16h1m

Not just someone but the CEO of the company. He used HIS platform to say "This week, @xAI will open source Grok" (https://twitter.com/elonmusk/status/1767108624038449405) and they aren't doing that. What they delivered specifically says "We are releasing the base model weights and network architecture of Grok-1, our large language model."

gordian-mind
0 replies
3h41m

Sounds like they did what they said they would.

bbor
3 replies
22h28m

Honestly the most interesting part is taking a peek at the kind of AI researcher working for Twitter after the objectively messy layoffs and subsequent crunch. I notice neither of them has Twitter mentioned on their GitHub, which is prolly for the best to avoid harassment lol.

Code-wise, I'm excited to see if this could grow into anything! I think it's pretty clear that Grok didn't have nearly enough investment to be a top model, so Elon “sacrificed” it on a whim in his schoolyard spat with OpenAI, but I'm not complaining. I've always taken Elon at his word that he truly is worried about centralization of AI, and I don't think any of the emails released by his schoolmate Altman change my mind about that. So I have some reasonable hope that he uses some of his immense resources to start “fighting the good fight” here with Le Cun.

paxys
1 replies
21h31m

Neither of them works at Twitter. xAI is a separate company, and only uses Twitter’s data to train.

bbor
0 replies
13h50m

Thanks for the correction! I know, I just don’t believe in corporations so the distinction is slight

cma
0 replies
21h47m

taking a peek at the kind of AI researcher working for Twitter

He made a separate company for this.

modeless
2 replies
20h52m

Is this the first major model to be natively FP8? I was wondering why people hadn't done it yet. Seems like a big win when hardware supports it.

a_wild_dandan
1 replies
12h11m

No, e.g. Yi-34B.

LZ_Khan
2 replies
22h8m

What are people's experiences with this model? Having the most weights is one thing, but being better than the 70B models is another.

swalsh
0 replies
21h49m

I use Grok all the time to find tweets or ask about trends on Twitter. For that, it's better than what used to exist. But it's not a great model outside that narrow use case.

labrador
0 replies
22h2m

tbh, I've never seen anyone share anything interesting produced by Grok. I see plenty of posts on X and reddit of people sharing amazing things that GPT-4 and now Claude 3 Opus can do. Grok can roast people. That's pretty much all I've seen.

I'd love to be proven wrong if someone cares to share something interesting produced by Grok.

sqreept
1 replies
18h7m

What are the languages supported by it?

cyanydeez
0 replies
17h57m

Tweets.

shantnutiwari
1 replies
1h7m

Those of us who don't spend all our time on LLMs: what's this about? What's the big deal, and why is it at #1 on the front page?

kayge
0 replies
50m

I think this paragraph from an earlier Wired article [1] sums it up pretty well:

  "After suing OpenAI this month, alleging the company has become too closed, Elon Musk says he will release his “truth-seeking” answer to ChatGPT, the chatbot Grok, for anyone to download and use."
[1] https://www.wired.com/story/elon-musk-no-choice-open-chatbot...

greenpizza13
1 replies
2h6m

If we just stop looking at Elon, he will lose his power. Why oh why do we keep giving him attention? There are plenty of great models out there that _aren't_ backed by maniacs.

rafaelero
0 replies
2h4m

When those great role models are able to build a profitable spaceship company from the ground up I am sure we will pay attention to them.

cl3misch
1 replies
8h44m

Love the minimal repo, magnet link, and stating "open weights" instead of "open source". Refreshing!

simonw
0 replies
22h12m

Is there a model card anywhere? I'd like to know what it was trained on.

simonw
0 replies
21h59m

"Base model trained on a large amount of text data, not fine-tuned for any particular task."

Presumably the version they've been previewing on Twitter is an instruction-tuned model which behaves quite differently from these raw weights.

nasir
0 replies
12h6m

I'd be very curious to see how it performs, especially on inputs that are blocked by other models. Seems like Grok will differentiate itself from other open source models from a censorship and alignment perspective.

mvkel
0 replies
16h40m

This feels like a "now we can say we're open" PR play rather than contributing much value to the open source community.

What is the practical use of this repo?

aussieguy1234
0 replies
14h12m

How hard would it be for an open source group to fine tune this into a chatbot?

andre-z
0 replies
21h11m

The only other repository is a fork of Qdrant.

2devnull
0 replies
22h19m

From issues: “Well the magnet file contains a 300GB checkpoint”

That’s why they are using a torrent I suppose.