At 8x86B, looks like the largest open model yet by far. Would be interesting to hear how many tokens it's been trained on. Especially important for higher param models in order to efficiently utilize all those parameters.
Can someone explain why the weights are posted via a Bittorrent magnet link? I have no way to check the size at the moment, but isn't that a bit unusual? There's also only 21 seeders right now according to https://checker.openwebtorrent.com/
How else could/should it be done?
I would have assumed they could just upload it to GitHub. If it has restrictions on file size, I'm sure they could split it into multi-part compressed files.
Torrents can unfortunately die after a period of time if no one continues seeding it or if they don't use a permanent web based seeder, which doesn't appear to be the case.
GitHub has a soft repository size limit of 5 GB, documented here: https://docs.github.com/en/repositories/working-with-files/m...
Soft size limit means "If your repository excessively impacts our infrastructure, you might receive an email from GitHub Support asking you to take corrective action." - I know people who have received such emails.
Most model releases happen through Hugging Face which does not have such a size limit.
I'd bet Hugging Face would be happy to have hosted these canonically too, so not sure why that doesn't happen more.
The model is also at https://huggingface.co/xai-org
They'd probably just charge you for it. They sell "data packs" for LFS.
https://docs.github.com/billing/managing-billing-for-git-lar...
It would be super expensive to use LFS to distribute this:
Each pack costs $5 per month, and provides 50 GiB of bandwidth and 50 GiB for storage
So they would need to pay for 6 data packs (or $30) for every 300 GB download.
(https://docs.github.com/en/billing/managing-billing-for-git-...)
No, git would be impossible. I've never seen a repo even a few GB in size; if you are uploading non-code files you really should not be using git. Git is version management software for code. I often see repos with images and even videos checked in. Please don't; there are far better and more performant solutions out there.
The other approach would be to use AWS S3 or other cloud providers, which would cost them money every time someone downloads it; that's not something they should have to pay for when they are releasing something for free. Torrents seem like the only good solution, unless someone hosts this on the cloud for free for everyone.
Hugging Face will disagree with "impossible", as their models are available via git, sometimes broken up into .pth files.
Still, as far as sentiment goes, yeah git for model weights is an impedance mismatch for sure!
No, git would be impossible. I've never seen a repo even a few GB in size; if you are uploading non-code files you really should not be using git
It's not actually a limitation in git itself, especially if you use Git LFS. People use Git for Unreal projects and big ones can be half a terabyte or more in size.
Scott Chacon (github cofounder) mentioned in a recent talk that the Windows repo is 300GB https://youtu.be/aolI_Rz0ZqY?si=MOo2eS6dsKKAxmsP
Others have pointed out that GitHub doesn't allow that, but
Torrents can unfortunately die after a period of time if no one continues seeding it or if they don't use a permanent web based seeder, which doesn't appear to be the case.
So too can web links, especially when they are 300 GB and egressing out of AWS at $0.09/GB or worse (in non-US regions). Each full download would cost $27 at that rate; 10,000 downloads would cost $270,000.
Sure, you could go for something with a better cost model like R2, but you can't beat using one or two unmetered connections on a VPN to constantly seed on BitTorrent; your pricing would be effectively free, and reliability would be higher than if you just exposed an HTTP server on the Internet like that.
and egressing out of AWS at $0.09/GB
There are a lot of seeders on the torrent that are actually AWS IPs, all with similar configurations, which makes me believe it's probably xAI running them.
on a VPN
That's unnecessary, you don't need a VPN?
No you don't, but if you wanted to host it from your gigabit office IP, you probably would want to.
Why?
This is not some crappy DVD rip on The Pirate Bay. It will be seeded as long as it's relevant.
Twitter/X has their own massive infrastructure and bandwidth to seed this indefinitely.
Yeah, they can just leave some server running somewhere and just let it seed forever
The great thing about torrents is that you (or anyone else who cares) can single-handedly solve the problem you're complaining about by seeding the torrent.
GitHub may choose to throttle downloads or remove the files simply because they're taking up too much bandwidth.
A torrent is less likely to go down in the short term.
Because BitTorrent is an outstanding tech for delivering large files; the more I think about it, the more I'm surprised it wasn't taken advantage of more.
It's been criminalized to hell by IP holders and Hollywood. Such a shame they killed the best tech of the previous decade. It could have revolutionized how we distribute content, approach CDNs, and even streaming.
In what way is the bittorrent protocol criminalized?
Why not? Mistral was first to do it, it has become tradition.
BitTorrent is just an objectively superior method of delivering a lot of data to a lot of people.
I believe it was Llama 1 that notoriously got leaked with a torrent on 4chan.
It's likely over 100 GB of data, so I wouldn't say it's necessarily unusual to spread the bandwidth across multiple hosts.
Thanks! I searched and searched for a tool that would show me info about a magnet link via the web, but nada.
I'm not sure why you wouldn't tbh. That's a lot of bandwidth.
Can someone explain why the weights are posted via a Bittorrent magnet link?
I think the best way to get an answer to that question is to try to host it yourself and see what happens.
My optimistic explanation is that we are going back to the 2000s internet, but probably we are not.
Spreads the burden/cost of distributing a 300+GB file.
Mistral did it too when they released their first open model. They just posted a magnet link on Twitter.
I don't understand why you're being downvoted for asking a legitimate question. People not familiar with model weights might be surprised that they are often in tens of gigabytes and in this case even more.
It may become a tradition since weights are so large. Perhaps it started when the Llama torrent link leaked. Then, Mistral decided to release their weights using bittorrent.
Distributing 300GB via torrent is cheaper than direct, assuming even a few other people seed
blog post: https://x.ai/blog/grok-os
* 314B parameters (86B active at a time)
* mixture of 8 experts (2 active at a time)
* weights and architecture licensed under Apache 2.0
(edit:) announcement blog post from last year with benchmarks compared to Claude 2, GPT-3.5 and GPT-4: https://x.ai/blog/grok (edit 2:) TL;DR: somewhat comparable to GPT-3.5, Mixtral and Qwen-1.5-72B in capability, but way larger than the open-weight models
Is a model so huge that’s only at the level of GPT 3.5 actually good? That seems incredibly inefficient to me.
OpenAI is valued at 90 billion and all they do is make GPT; Twitter is valued at 40 billion and this was essentially a vanity side-project by a cowboy CEO. Presuming the benchmarks and the general "it's about the level of 3.5" sentiment are accurate, it's inefficient, but not incredibly inefficient imho.
Twitter is valued at 40 billion
WAS valued at 44B.
Now?
Maybe 5 billion.
Last I heard they lost 15% of their users, so let's call it 36 billion.
Twitter didn't have direct competitors other than Mastodon when it was taken at 44B. Now there's Threads, Bluesky and bigger Mastodon.
None of these matter
Honestly, none of those look like meaningful competitors at the moment.
They weren't even worth 44B when Elon took the keys; he specifically tried to back out of the deal because 44B was an insane peak-'21 asset-bubble price. In truth they were probably worth 10-15B at that moment. And now that a bunch of advertisers have left due to you-know-who, it's probably about 10B.
Twitter was valued at around 30 billion when Musk tried getting out of buying it (then the market cap went up when it became clear that he would be forced to pay full price).
LOL @ $5 billion, but if that was the valuation, you'd be making the parent's point stronger.
xAI is a separate entity, and not a X/Twitter subsidiary.
It’s designed to be actively searching real-time posts on X. Apples and oranges.
The data pipeline isn't included in this release, and we already know it is a pretty simple RAG pipeline using qdrant, https://twitter.com/qdrant_engine/status/1721097971830260030.
Nothing about using data in "real time" dictates that the model needs to be this large, and a model this size is likely quite inefficient for their "non-woke" instruction-following use case.
Agreed. We have been building our real-time GPT flows for news & social as part of Louie.AI, think monitoring and investigations... long-term, continuous training will become amazing, but for the next couple of years, most of our users would prefer GPT4 or Groq plus much smarter RAG vs what's here. More strongly, the interesting part is how the RAG is done. Qdrant is cool but just a DB with a simple vector index, so nothing in Grok's release is tech we find relevant to our engine.
Eg, there is a lot of noise in social data, and worse, misinfo/spam/etc, so we spend a lot of energy on adversarial data integration. Likewise, queries are often neurosymbolic, like over a date range or with inclusion/exclusion criteria. Pulling the top 20 most similar tweets to a query and running them through a slow, dumb, and manipulated LLM would be a bad experience. We have been pulling in ideas from agents, knowledge graphs, digital forensics & SNA, code synthesis, GNNs, etc. for our roadmap, which feels quite different from what is being shown here.
We do have pure LLM work, but more about fine-tuning smaller or smarter models, and we find that to be a tiny % of the part people care about. Ex: Spam classifications flowing into our RAG/KG pipelines or small model training is more important to us than it flowing into a big model training. Long-term, I do expect growing emphasis on the big models we use, but that is a more nuanced discussion.
(We have been piloting w gov types and are preparing for next cohorts, in case useful on real problems for anyone.)
Isn't that... the same thing as search?
Why is that relevant to the size?
Post search on X is done as it is with any other data from any other source, you use RAG and function calling to insert the context.
< 7B open source models can function call very well. In fact, Nous Hermes 2 Pro (7B) is benchmarking better at that than GPT-3.5.
Not related to the size, if I'm not mistaken.
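For the curious, here's a minimal sketch of what that RAG-plus-function-calling flow usually looks like (an OpenAI-style API is used purely for illustration; the search_posts tool and its stub retrieval are hypothetical, not xAI's actual pipeline):

    import json
    from openai import OpenAI

    client = OpenAI()

    def search_posts(query: str) -> str:
        # Hypothetical retrieval step: in a real pipeline this would hit a vector
        # index of posts (e.g. qdrant) and return the top-k snippets as text.
        return "...top-k post snippets would go here..."

    tools = [{
        "type": "function",
        "function": {
            "name": "search_posts",
            "description": "Search recent posts for passages relevant to the query",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    }]

    messages = [{"role": "user", "content": "What are people saying about the Grok-1 weights release?"}]
    first = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages, tools=tools)

    # Assuming the model decided to call the tool, run the search and hand the
    # results back as context for the final answer.
    call = first.choices[0].message.tool_calls[0]
    messages.append(first.choices[0].message)
    messages.append({
        "role": "tool",
        "tool_call_id": call.id,
        "content": search_posts(json.loads(call.function.arguments)["query"]),
    })
    final = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages, tools=tools)
    print(final.choices[0].message.content)

Nothing in that flow depends on the model being 314B; a 7B model that follows the tool-calling format reliably works the same way.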
According to their benchmarks it is superior to GPT-3.5
Since it is MoE, quantized it might be able to run on cheaper hardware with just consumer networking in between, instead of needing Epyc/Xeon levels of PCIe lanes, NVLink, or InfiniBand-type networking. It could even run with people pooling smaller systems over slow internet links.
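Some back-of-the-envelope numbers on what quantization buys you here (my own arithmetic, not from the release; note that all 314B weights still have to be resident, even though only ~86B are active per token):

    # Rough weight-memory footprint at different quantization levels,
    # ignoring KV cache, activations and framework overhead.
    total_params = 314e9
    for name, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4), ("3-bit", 3)]:
        gigabytes = total_params * bits / 8 / 1e9
        print(f"{name:>5}: ~{gigabytes:.0f} GB of weights")
    # fp16: ~628 GB, 8-bit: ~314 GB, 4-bit: ~157 GB, 3-bit: ~118 GB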
How is it that OpenAI was touted like it was some massive years-long effort that blew all AI research out of the water and now we have so many competitors popping up one after another?
You don't need to be a cutting edge research scientist to train a SOTA LLM. You just need money for scaling. OpenAI's "secret" was just their willingness to spend tens/hundreds of millions without guaranteed returns, and RLHF/instruct fine tuning, both of which are out of the bag now.
Disagree. It took more than 12 months from the release of GPT-4 to someone else producing a model of equivalent quality, and that definitely wasn't due to a shortage of investment from the competition.
There's a huge amount of depth in training a really good LLM. Not helped by the fact that iteration is incredibly expensive - it might take several months (and millions of dollars) before you can tell if your new model is working well or if there was some mistake in the pipeline that led to a poor quality result.
Almost all of the world-class LLMs outside of OpenAI/DeepMind have been trained by people who previously worked at those organizations - giving them invaluable experience such that they could avoid the most expensive mistakes while training their new models.
There's still no model of equivalent quality to GPT-4.
Claude opus is better in my experience
Claude 3 Opus is reporting superior metrics, particularly in its coding ability, and in the LLM Arena it is statistically tied with GPT-4.
Don't overlook the training data (used for both training and instruction fine-tuning); it is one of the most crucial aspects, if not the most critical, given the significant differences observed between models with similar architectures.
That only remains an advantage if they can continue climbing the gradient from their lead position. If they hit a snag in scaling, methodology, or research, everyone else on the planet catches up, and then it's anyone's game again.
While I do agree there is some amount of secret sauce, keep in mind the training takes several months. So the path from seeing the success of GPT-4, deciding to invest that amount of money, raising it, finding someone competent to supervise the training, training the model for several months, then testing and integrating it could easily take a year even if there were no secret sauce.
OpenAI still seems to be at the top, with Anthropic perhaps close behind, in terms of capability when comparing GPT-4 and Claude Opus.
This Grok-1 is a large model (~314B) that matches GPT-3.5, released two years ago, and is at about the same level as much smaller models like Mixtral (~47B) and Qwen-1.5 (~72B). Do you think it's competitive?
LLM training is arcane and expensive to experiment with. So OpenAI had to waste a lot of time and GPU-hours on things that didn't work to learn the tricks that did work.
Most of the competitors have lineage straight back to OpenAI, eg the lead of x.ai was previously at OpenAI and Deepmind. Likewise with Mistral and especially Anthropic.
Egg of Columbus.
Also, the general architecture is well documented, ChatGPT (specifically the chat interface, not GPT-3, not InstructGPT) is what made a lot of people care, and actually reproducing it requires someone wanting to in the first place.
Mixtral is also comparable to GPT-3.5, and it's open.
At 8x7B it's also a fraction of the size. Are there any benchmarks comparing Mixtral to Grok?
Mixtral announcement is here: https://mistral.ai/news/mixtral-of-experts/
Mixtral looks more economical in capability relative to size (similarly for Qwen 1.5 72B).
I love the citation for the image in the article:
The cover image was generated using Midjourney based on the following prompt proposed by Grok: A 3D illustration of a neural network, with transparent nodes and glowing connections, showcasing the varying weights as different thicknesses and colors of the connecting lines.
When will we reach an upper limit/diminishing returns in terms of number of parameters and mixture of experts?
We may have already - data is more important than anything else which is why nobody has beat GPT4 yet. Throwing more parameters or more compute at the problem only gets you so far. But Grok was never a contender so there is room to improve on it. It is one of the biggest models open sourced as mentioned, so will be interesting to take a look at for sure.
Claude 3 has *decisively* beat GPT-4, I wonder how all their attributes compare.
I like some of Claude's answers better, but it doesn't seem to be a better coder, imo.
I've found it to be significantly better for code than GPT-4 - I've had multiple examples where the GPT-4 solution contained bugs but the Claude 3 Opus solution was exactly what I wanted. One recent example: https://fedi.simonwillison.net/@simon/112057299607427949
How well models work varies wildly according to your personal prompting style though - it's possible I just have a prompting style which happens to work better with Claude 3.
What is your code prompting style for Claude? I’ve tried to repurpose some of my GPT-4 ones for Claude and have noticed some degradation. I use the “Act as a software developer/write a spec/implement step-by-step” CoT style.
Almost impossible to describe prompting style, but here are some examples of how I've used Claude 3:
https://gist.github.com/simonw/4cecde4a729f4da0b5059b50c8e01... - writing a Python function
https://gist.github.com/simonw/408fcf28e9fc6bb2233aae694f8cd... - most sophisticated example, building a JavaScript command palette
https://gist.github.com/simonw/2002e2b56a97053bd9302a34e0b83... - asking it to refactor some existing code
I don't use the "Act as a X" format any more, I'm not at all convinced it has a noticeable impact on quality. I think it's yet another example of LLM superstition.
I don't use the "Act as a X" format any more, I'm not at all convinced it has a noticeable impact on quality. I think it's yet another example of LLM superstition.
It's very contextually dependent. You really have to test things like this for your specific task, with your specific model, etc. Sometimes it helps, sometimes it hurts, and sometimes it does nothing at all.
Super helpful! Thanks!
I didn't know people were still doing this "act as etc etc" instructional prompting.
I just tell it my coding problem. Or when making something from scratch, ask for small things and incrementally add.
according to your personal prompting style though
I like the notion of someone's personal prompting style (it seems like a proxy for being able to prepare a question with context about the other party's knowledge) - that's interesting for these systems in future job interviews.
I've found it significantly better than GPT4 for code and it's become my go-to for coding.
That's actually saying something, because there's also serious drawbacks.
- Feels a little slower. Might just be UI
- I have a lot of experience prompting GPT4
- I don't like using it for non-code because it gives me too much "safety" pushback
- No custom instructions. ChatGPT knows I use macos and zsh and a few other preferences that I'd rather not have to type into my queries frequently
I find all of the above kind of annoying and I don't like having two different LLMs I go to daily. But I mention it because it's a fairly significant hurdle it had to overcome to become the main thing I use for coding! There were a number of things where I gave up on GPT then went to Claude and it did great; never had the reverse experience so far and overall just feels like I've had noticeably better responses.
Has it, though? LMSys Arena Leaderboard (blind ranking by humans) [0] positions Opus just below GPT-4 with a negligible ELO gap.
That "blind ranking" is limited to about 2,000 tokens of context. So it's certainly not evaluating how good the models are at complex assignments.
A number of AI companies have a naming/reproducibility issue.
GPT-4 Turbo, released last November, is a separate version that is much better than the original GPT-4, released in March 2023 (winning 70% of human preferences in blind tests).
Claude 3 Opus beats release-day GPT-4 (winning 60% of human preferences), but not GPT-4 Turbo.
In the LMSys leaderboard, release-day GPT-4 is labeled gpt-4-0314, and GPT4 Turbo is labeled gpt-4-1106-preview.
Chatbot Arena is not a blind ranking.
Many, if not most, users intentionally ask the models questions to tease out their canned disclaimers: so they know exactly which model is answering.
On one hand it's fair to say disclaimers affect the usefulness of the model, but on the other I don't think most people are solely asking these LLMs to produce meth or say "fuck", and that has an outsized effect on the usefulness of Chatbot Arena as a general benchmark.
I personally recommend people use it at most as a way to directly test specific LLMs and ignore it as a benchmark.
I don't know if Claude is "smarter" in any significant way. But it's harder working. I can ask it for some code, and I never get a placeholder. It dutifully gives me the code I need.
It understands instructions better, it's rarer to have it misunderstand, and I have to be less careful with prompting.
citation needed (other than 'vibes')
There is no reason to believe GPT-4 had more (or higher quality) data than Google etc. has now. GPT-4 was entirely trained before the Microsoft deal. If OpenAI could pay to acquire data in 2023, >10 companies could acquire similar quality data by now, and no one has produced a similar-quality model in a year.
The more disregard a company has for intellectual property rights, the more data they can use.
Google had far more to lose from a "copyright? lol" approach than OpenAI did.
I was under the impression training was at best an undefined area of IP law. Is there any aspect of copyright that prohibits training models?
This is being tested by a number of lawsuits right now, most notably the NY Times one: https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...
The key questions are around "fair use". Part of the US doctrine of fair use is "the effect of the use upon the potential market for or value of the copyrighted work" - so one big question here is whether a model has a negative impact on the market for the copyrighted work it was trained on.
I don't think the New York Times thing is so much about training as it is about the fact that ChatGPT can use Bing, and Bing has access to New York Times articles for search purposes.
If you read the lawsuit it's absolutely about training. The Bing RAG piece is one of the complaints in there but it's by no means the most important.
Take a look at https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20... - bullet points 2 and 4 on pages 2/3 are about training data. Bullet point 5 is the Bing RAG thing.
Ah, thanks!
Google had far more to lose from a "copyright? lol" approach than OpenAI did.
The company that scrapes trillions of web pages has an issue with copyright?
Well... Googlebot does pay attention to robots.txt - I don't think (original) OpenAI-bot did.
Having used both Google's and OpenAI's models, the kind of issue they have are different. Google's models are superior or at least on par in knowledge. It's the instruction following and understanding where OpenAI is significantly better. I don't think pretraining data is the reason of this.
I think Groq is something else?
Edited, I did mean the Grok in the article not the inference chip.
Indeed, Groq is a company building inference accelerators. Grok is completely unaffiliated.
Claude > GPT4. Anyone using these models on a daily basis knows this
It is known
I use these models regularly, and Claude is dumb as a rock compared to GPT-4.
One subtle thing: Musk said "open-source", we got "open-weights" instead (still better than nothing though, so it's greatly appreciated).
Dumb question: what should open-source mean in the context of something like this? Open access to the training data and training pipeline as well?
It's not a dumb question, and the answer is "yes".
A big catch here is that you can't slap an open source license on a bunch of copyrighted training data, and to date no one has created a truly convincing LLM trained exclusively on public domain data. It might happen soon though - there are some convincing efforts in progress.
Absolutely, because it’s trained mostly on unlicensed, copyrighted content, they basically can’t release source.
Many people think these companies are training on unlicensed data but I think OpenAI licenses their data, they just “license” it the way one would need to in order to read it.
You all keep using the word "Data"
Data, as in facts, as in the frequency of one word in relation to another.
"Copyright does not protect facts, ideas, systems, or methods of operation, although it may protect the way these things are expressed..." FROM: https://www.copyright.gov/help/faq/faq-protect.html
It's not a question of if, rather when, the cat gets out of the bag and the legal battle starts. The problem is that copyright applies to the expression, not the factual information it expresses (in this case word relations). Now "how math works" and "the language of the law" are going to make for an interesting court case. I suspect that math wins here, but it depends on what judge gets it and how high it goes.
No, the term data can be used to describe anything that can be recorded in bytes. It’s “data storage capacity” when you buy a hard drive.
…I think OpenAI licenses their data…
They've just started to (in response to lawsuits, it must be noted) and in the meantime, they're simultaneously claiming that (1) what they're doing is fair use (a.k.a. fair dealing) and (2) preparing for the day when courts confirm that it isn't.
https://substack.recursal.ai/p/eaglex-17t-soaring-past-llama... this one claims to have been trained only on permissively licensed data.
Come on, that's not reasonable to expect from a company, or useful for indie hackers. Having weights that can be used however you like is enough for most people, even large companies.
Maybe it should be called something else? "Openly-licensed"?
Just because the model weights are not really "source" (either as a matter of intuition or for example following the OSI "preferred form in which a programmer would modify the program" definition).
Sure, but I don't want to train anyone's model from scratch. Realistically, I can't download all the training data, or run the pipeline, or train the model. Making all of that available to me would be a massive burden on the company too, so they simply won't do it. If I'm able to fine-tune it, that's enough for me, and imo, that fits with the spirit of open/free software. We have to understand that this is fundamentally a different thing than something like the Linux kernel, and closer to something like an industrial project. The output is just a bunch of numbers instead of something physical.
Agreed. It's ridiculous that people have to resort to calling their own question dumb to avoid being attacked by toxic commenters.
If you release that instead of the binary weights you can be both more open and less useful for users. Fun
The Open Source Initiative is actively working on this over the course of this year, and your input will help define that meaning! Please see here for more:
https://opensource.org/blog/open-source-ai-definition-weekly...
Yes, training and evaluation code, i.e., the code used to generate the weights.
He also called permissively licensing Tesla's patents "open sourcing" them. He's at the forefront of misusing the term.
The “source” in “open source” refers to source code which they released. A dataset is not source code, if anyone is misusing the term it’s you.
I consider the weights a binary program and the source code is the training data. The training algorithm is the compiler.
I agree this isn't standard terminology, but it makes the most sense to me in terms of power dynamics and information flow.
We know from interpretability research that the weights do algorithms eg sin approximation etc. So they feel like binary programs to me.
If you can't rebuild it, then how can you be considered to have the "source code" ?
The training data isn't a dataset used at runtime - it's basically the source code to the weights.
Not sure it really matters here though (who has the GPUs and desire to retrain Grok?), but just as a matter of definition "open weights" fits better than "open source".
This is the weights and the model under Apache 2.0 license. What do you mean by open-source?
Still better than most of the "open weights" models that have massively restrictive terms.
Yeah, Musk said "all design and engineering for the original roadster is now open source" and actually what we got was a few PCB files and zero mechanical design files, so I don't ever trust what he says.
It would be cool if these models had conversations with us where they ask questions. I think the future of AI is models that ask questions. There is so much data to be gained by doing this.
Ok, I'm curious, but I don't quite understand.
What would you want an AI to be asking you, and what would you want it to do with your response(s)?
I ask AI to produce clarifying questions then answer them.
Can help in not wasting a bunch of time waiting for an answer that missed the mark.
I think the sibling comment is probably the least attractive reason to have AI ask questions.
I agree, medical history is probably not the sexiest reason to have AI ask questions. I think there are many more reasons; I think the Turing Test is the best metric to evaluate AIs, and current models come nowhere close. When people first meet they ask questions about their background. It would be nice if a model replicated that
and could direct better ads to me.
Is the least attractive part, by far.
In order for an AI to pass a Turing Test, it would surely ask questions. Think of Ava from Ex Machina. She asked questions to learn more about him
I'm not debating the value of questions. I'm debating the value of feeding it to advertisers, especially since LLMs can infer much deeper insights about a person than a traditional assistant can with its canned capabilities and responses
Clarifying questions if the initial prompt was unclear. I'd love it.
I regularly try to add something along the lines of "please ask clarifying questions if you could only give a generic or partial response otherwise" but so far it has never helped (ChatGPT 4).
?? GPT-4 does this for me regularly.
I get advertisements all the time for conditions that I do not have, and that none of my family members have. If you had a model that asked questions, it could learn my medical history and could direct better ads to me.
In order for AI to understand the world, it would have to ask questions. Understanding humans is key to understanding the world.
Learn from them.
That's just a matter of fine tuning
That "just" is doing some heavy lifting! GPT-4 is just a few matrix multiplications, how bad can their moat really be?
Not sure what the snark here is for: It would be trivial to produce a dataset where the model asked you questions then fine-tune on that.
People already do it with chain-of-thought and you could get away with a few dozen examples if you wanted to try this.
Out of boredom I decided to prove this too: I asked ChatGPT and Claude for ~200 samples in total.
Just uploaded the examples as-is to OpenAI, selected 3.5 as the model to fine-tune and about 20 minutes later I had my model.
Works fine, asks good questions, can ask more than 1 follow up question if needed, and actually changes its answers based on the clarifying questions.
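For anyone who wants to reproduce that experiment, the flow is roughly this (a sketch against OpenAI's fine-tuning API; the JSONL file name and its contents are whatever clarifying-question examples you generated yourself):

    from openai import OpenAI

    client = OpenAI()

    # clarifying_questions.jsonl: one {"messages": [...]} object per line, where the
    # assistant turn asks a follow-up question instead of answering immediately.
    training_file = client.files.create(
        file=open("clarifying_questions.jsonl", "rb"),
        purpose="fine-tune",
    )

    job = client.fine_tuning.jobs.create(
        training_file=training_file.id,
        model="gpt-3.5-turbo",
    )
    # Poll client.fine_tuning.jobs.retrieve(job.id) until it reports "succeeded",
    # then use the returned fine_tuned_model name in chat completions as usual.
    print(job.id)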
I'd bet a synthetic data set could do the job effectively.
Do you have an example model I could try that does this?
Try Pi by inflection. It asks a lot of questions.
I tried it, and it just asked me how my day was going. I don't think this is doing exactly what I have in mind. But it's a step in that direction.
Explore this idea more - it's easily implemented in a minute or two via the system prompt. API accounts are free to start and you can use the playground/workbench view, like this: https://imgur.com/h5jFoBM.jpg . I like Claude but OpenAI is popular too. OpenAI has a nice way to create a gallery of system prompts that act however you like, they call them Agents or GPTs.
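As a rough illustration of the system-prompt route (the prompt wording here is mine; the Anthropic messages API call itself is standard):

    import anthropic

    client = anthropic.Anthropic()

    resp = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=1024,
        # The behaviour lives entirely in the system prompt; tweak to taste.
        system=("Before answering, ask one or two clarifying questions whenever the "
                "request is ambiguous or missing key details. Only answer once you "
                "have what you need."),
        messages=[{"role": "user", "content": "Help me speed up my database."}],
    )
    print(resp.content[0].text)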
100% agreed. Gemini advanced does this sometimes. I wrote about it more in an older thread here: https://news.ycombinator.com/item?id=39445484
I respect the openness here! This is the future that I want to see
Fully agree. People will trash talk it due to Musk but lets not forget the engineers who poured hours of their lives into building this and are continuing to do so.
I still reserve the right to trash talk Musk as I don’t believe he is committed to openness as much as he wants to spite OpenAI for telling him to pound sand.
What's the difference?
Oh no, I only want _pure_ intentions for anything I use. Which is why I reject all for profit medicine.
It doesn't matter why he did it. What matters is that he did it.
It matters to me why people do things. I’m happy it’s open, but it doesn’t change my mind about the guy.
What an exhausting way to live.
This makes no sense to me for two reasons:
- He pointed out that his understanding was that it would be open source in some way
- The name OpenAI implies an open source endeavor. I don't know many things named Open that are in fact closed source.
The engineers who decided to work for him? Forgive me if I do forget about them and the hours of their lives spent on this
Engineers who joined Twitter pre-Musk days who live and work in the US on an H1-B visa can't just quit.
You can criticize Elon Musk without criticizing people who would have their lives upended if they quit or were fired.
That grace period has long passed. If you are still there at this point you have made a choice.
(Removed "complicit" because I don't like the way that sounded)
Complicit in what exactly?
I feel the same about Tesla. They make good cars that are helping to get us off of oil. They have thousands of employees.
And who among us has a CEO that isn’t problematic, even if not so much so as Musk?
"Good" cars is a real stretch.
Tesla is likely making good cars because the CEO is 'problematic'
engineers who poured hours of their lives into building this
Not to mar these specific engineers, but that's an empty phrase that can be said about anything ever built. It doesn't somehow make the idea or implementation good.
The phrase merely means: don't just overlook something because of someone else who did not even labour over the end result.
Were they not paid to do so?
Is it open if it doesn't include the training data? Genuine question - I am not familiar enough with the terms and technology to know. But my understanding is the weights are just a more or less static collection of data that has been (to paraphrase Ted Chiang) lossily compressed from the actual raw training data.
Without the training data to thoroughly evaluate what is in there, the only way you can figure it out is through experimentation - e.g. running it up in a chatbot and asking it questions.
Is this roughly correct or am I misunderstanding what you can do with the weights?
For what reason would you want to use this instead of open source alternatives like Mistral?
Mistral opened their weights only for a very small LLaMA-like model.
I'm pretty sure Mixtral outperforms Grok-1 and uses much less memory to do it
I'm a little out of touch, is there a way to see how Grok measures up to other models?
Benchmarks here https://x.ai/blog/grok
And to compare, you can sort by MMLU on here: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb....
Edit: to include my own summary after review: there are a good 100 models better than it, a couple of 1x7B even. Mixtral stomps it; half the Mixtral variants are universally better, but one is close to the same.
This benchmark is mostly worthless, some of the top models there were trained on benchmark data, which is a known fact in the community.
The only reliable benchmark: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...
No, it's not "mostly worthless", and yes, some of the top models were removed a few months back for being trained on benchmark data.
I urge you to at least think through what alternative you propose before posting so aggressively in these situations. Lmsys doesn't have Grok, or I would have included it. And having _some_ data is better than none.
I also had someone arguing with me 6 months back that we can't trust any benchmarks at all from vendors, which would exclude the blog post. Instead of just repeating that back vehemently, I filled a gap. It's important we don't self-peasantize as a species; all data has its issues, but that doesn't mean we throw it all out.
Quantifiable metrics are useful if they're credible, certainly.
But does it seem likely, to you, that a 7B-parameter model would outperform a 314B-parameter model? Given that we can look at the chatbot arena leaderboard and it's dominated by proprietary, 70B and 8x7B models?
A well regarded and modern model like Mixtral 8x7B, which is ranked 13th on the chatbot arena leaderboard, scores 72.7 'Average' on the open LLM leaderboard - and yet 'pastiche-crown-clown-7b-dare-dpo' scores 76.5.
To me, that sounds too good to be true.
Yup, 100%. Grok isn't very good and it was rushed.
The rest, re: the pastiche model etc., proposes things I'm not claiming, or even close to what I'm claiming.
n.b. you don't multiply the parameters by experts to get an effective parameter count. Why? Think of it this way: every expert needs to learn how to speak English, so there's a nontrivial amount of duplication among all experts
> n.b. you don't multiply the parameters by experts to get an effective parameter count.
I actually took the 314B from Grok's HF page [1] which describes the model as "314B parameters" when explaining why it needs a multi-GPU machine.
I certainly agree that parameter count isn't everything, though; clearly things like training data quality and fine tuning count for a lot.
One of the interesting things when weights are open sourced is the community can often improve the results. See all the bugs fixed in Gemma for an example.
Doubtful, for purely information theoretic and memory capacity reasons. It may outperform on some synthetic metrics, but in practice, to a human, larger models just feel “smarter” because they have a lot more density in their long tail where metrics never go
Isn't this Apache licensed? Regardless, you can run multiple models concurrently on the same input using well-known ensemble techniques. (Not to be confused with mixture-of-experts, which is more like training a single model where only a few blocks are chosen to be active at any given time - a kind of sparsity.)
Not super easy if they have different tokenizers.
Well if nothing else, this one might be significantly less nerfed. Very interesting to compare to the others.
Hey, asking any experts here: what are your first thoughts on the significance of this?
I.e., is this comparable to any other model released, or are there significant metric differences that make it better for certain use cases?
The only thing I see, off the top of my head, is that it is a very large model, and I don't think any models of similar size have been released.
Not an expert by any means, but I like learning about this stuff and I play with a lot of open weight models.
I’d say the significance is that it happened. It’s by far the largest open weight model I’ve seen. But I’m not sure why you’d use it over a model like Mixtral, which seems to perform about the same at like 1/6th the size.
But I welcome any contribution to the open weight LLM community. Hopefully people will learn something interesting with this model. And I hope they keep releasing new versions!
If I may ask, how do you load such big models? 300gb seems like a lot to play around with.
You're right, this model is going to be too big for most people to play around with. But to answer your question, I have 128GB of RAM in my M3 MacBook Pro, so I can use most of that for GPU inferencing. But still, this model is going to need to be heavily quantized for me to be able to use it. (fwiw, I probably won't try this one)
In the next week or two I expect we'll see a GGUF version of the weights (might need to wait for a patch to llama.cpp first), and someone will release super small quantizations of it. I suspect my computer might be able to run a 3 bit quant, but it might need to go down to 2 bits to have any kind of reasonable context length. But with quants that small I'd expect the model's performance to degrade well below that of Mixtral, so it probably isn't really even worth using. But we'll see; quantization is weird, some models perform better than others when quantized.
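Once support lands, running a small quant would look something like this with the llama-cpp-python bindings (a sketch; the file name is hypothetical and, as noted, Grok-1 isn't supported by llama.cpp at the time of writing):

    from llama_cpp import Llama

    llm = Llama(
        model_path="./grok-1-Q3_K_M.gguf",  # hypothetical quantized file
        n_ctx=2048,       # keep the context short to save memory
        n_gpu_layers=-1,  # offload everything that fits to the Metal/CUDA backend
    )

    out = llm("The largest open-weights model released so far is", max_tokens=64)
    print(out["choices"][0]["text"])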
A top-of-the-line Mac Studio Ultra maxes out at 192GB currently. This is also a MoE model, so only a fraction of parameters have to be in RAM.
MoE doesn’t really help with the memory requirements for the reason mentioned in the other comment. But it does help with reducing the compute needed per inference. Which is good because the M3 Max and M2 Ultra don’t have the best GPUs. A 70B parameter model is pretty slow on my M3 Max, and this model has 86B activations per inference run.
Each token generated may only use a subset of the parameters (86 billion instead of 314 billion), but the next generated token might use a different subset. If it's anything like Mixtral, it will switch between experts constantly. It helps with memory bandwidth, but all the parameters still need to be in RAM or it would be unbearably slow.
In the next week or two I expect we'll see a GGUF version of the weights (might need to wait for a patch to llama.cpp first), and someone will release super small quantizations of it.
How quickly are new models available through Ollama?
Few days max.
Ollama is just a wrapper around llama.cpp, so when the gguf model files come out it'll be able to run on Ollama (assuming no llama.cpp patch is needed, but even if it is ollama is usually good at getting those updates out pretty quickly).
Thanks a lot for the hint :)! It's awesome that it might run even on a MacBook; actually, this is a reason to switch to Mac. It seems there is nothing similar for a PC laptop with Linux or Windows.
No problem. I hope more people try these things out, it's the best way to push the industry forward! We can't let the researchers have all the fun.
Apple had plenty of reasons to move forward with their Apple Silicon CPUs and GPUs in the mac, but they really did seem to get lucky with the unified memory architecture. It was kind of just an artifact of their design, but ends up serving the needs of deep neural net models really well!
Seems like a large, undertrained model; not that exciting imo compared to Mixtral.
It is also not the biggest open-source model; Switch Transformer was released years ago and is larger and similarly undertrained.
Tests are not out yet, but:
- It's very large, yes.
- It's a base model, so it's not really practical to use without further finetuning.
- Based on Grok-1 API performance (which itself is probably a finetune), it's... not great at all.
How long before the Groq team sues for trademark violation? It's literally the purpose of trademark laws to make sure resembling names do not cause confusion in the mind of customers so it would be very surprising to see this situation persist.
Would be a rough trademark enforcement case as “Grok” has been in common language for decades
So has "Apple" and "Windows".
Grok and groq both relate to AI, so there's definitely grounds to believe the names may cause consumer confusion.
After all, Apple (computers) was repeatedly sued by Apple (records) for doing music things.
It's easier to get a trademark on an altered word than a plain dictionary word. Just acquiring the easier one to acquire doesn't mean you now have rights over the harder one to acquire, though eventually after enough market recognition you might be given some control over other people using the common one. I wouldn't think groq is there yet.
I myself have never heard it outside of "nerdy" circles... that is: people who would read science fiction.
I personally am not entirely happy about the word (no matter how it is spelled) being used for a particular AI product. "Grok" to me means knowing a subject at a much deeper level than I think any AI is capable of at the present level of technology. But it would be passable to use it for a company name, to indicate that it is a goal to strive for.
Generally agree, though I would say "knowing a subject at a much deeper level than any LLM is capable of", as AI more broadly also includes specialist models that are wildly super-human in narrow domains like chess and Go.
Robert A. Heinlein coined the term grok in 1961
Six is plural.
Grok is a word in common parlance. So there's no way they could succeed in any suit. That's why the Groq team picked a modification of the word.
You mean like Canvas®, Apple®, Windows® or Amazon®? Wanna try re-use these for your own business and see how it goes?
There's nothing preventing you from trademarking common words; they just must not be descriptive of your business.
There is a friendly warning here from Groq: https://wow.groq.com/hey-elon-its-time-to-cease-de-grok/
Is it safe to say, 4 months later, that Elon is ignoring this? I assume there hasn't been any kind of response or further action taken yet.
They already have.
Has anyone outside of x.ai actually done inference with this model yet? And if so, have they provided details of the hardware? What type of AWS instance or whatever?
I think you can rent like an 8 x A100 or 8 x H100 and it's "affordable" to play around with for at least a few minutes. But you would need to know exactly how to set up the GPU cluster.
Because I doubt it's as simple as just 'python run.py' to get it going.
Someone could run Grok-1 on a 192GB M2 Mac when a 4-bit quant is released; I'm guessing that TheBloke is already working on it.
TheBloke disappeared around the day https://nvd.nist.gov/vuln/detail/CVE-2024-23496 was published.
Of course there has been much speculation about this. I have no information beyond this that can be backed up by facts, but the timing was suspicious.
He's started a company in the UK: https://suite.endole.co.uk/insight/company/15361921-thebloke...
Interestingly registered just around the corner from where one of my relatives used to live.
And his grant funding supposedly ran out.
Was any .gguf file hosted on HuggingFace found to be crafted in a way to exploit this?
what exactly are you implying here?
Fairly sure the bloke hasn't created any new quants in a month.
If you're just looking to test it out, it's probably easiest to wait for llama.cpp to add support (https://github.com/ggerganov/llama.cpp/issues/6120), and then you can run it slowly if you have enough RAM, or wait for one of the inference API providers like together.ai to add it. I'd like to add it to my NYT Connections benchmarks, and that's my plan (though it will require changing the prompt since it's a base model, not a chat/instruct model).
The NYT Connections benchmark sounds interesting, are the results available online?
GPT-4 Turbo: 31.0
Claude 3 Opus: 27.3
Mistral Large: 17.7
Mistral Medium: 15.3
Gemini Pro 1.0: 14.2
Qwen 1.5 72B Chat: 10.7
Claude 3 Sonnet: 7.6
GPT-3.5 Turbo: 4.2
Mixtral 8x7B Instruct: 4.2
Llama 2 70B Chat: 3.5
Nous Hermes 2 Yi 34B: 1.5
The interesting part is the large improvement from medium to large models. Existing over-optimized benchmarks don't show this.
- Max is 100. 267 puzzles, 3 prompts for each, uppercase and lowercase
- Partial credit is given if the puzzle is not fully solved
- There is only one attempt allowed per puzzle, 0-shot.
- Humans get 4 attempts and a hint when they are one step away from solving a group
I hoped to get the results of Gemini Advanced, Gemini Pro 1.5, and Grok and do a few-shot version before posting it on GitHub.
it's probably easiest
Cheapest maybe, but easiest is just to rent a p4de.24xlarge from AWS for a couple hours to test (at around $40/hour..).
I'd expect more configuration issues in getting it to run on them than from a tested llama.cpp version, since this doesn't seem like a polished release. But maybe.
If they are so behind they could make it open source instead of open weights and get some help.
Fully open-source means also providing open access to their data sets? Which is the only valuable thing Twitter (X) has left.
Which is the only valuable thing Twitter (X) has left.
They have a very valuable user base (all kinds of world leaders for example), so the data is not the only valuable thing they have.
That's actually more valuable. Twitter's data of small-format text is awful for training. Best to just exclude it.
There are hundreds of millions of people on Twitter, and a few of them are very smart. I don’t see how that helps here though.
It doesn't help here. But the person you're responding to is just pushing back against the "Elon destroyed Twitter and there's nothing left" narrative.
I don't see a difference here.
The user base and their social networks and interactions are the data.
They don't have much value from an advertising point of view anymore.
And the one thing they are vehemently protecting from scrapers and other entities. Even nitter threw in the towel.
It's all open source. You can download the model and run it locally.
Being free to use doesn't mean it ships with the original recipe.
What do you mean? The entire model and architecture and executables are fully open source.
The training methods are nothing secret, right? The architecture is well known.
Expecting the entire training dataset to be fully open is delusional.
Expecting the entire training dataset to be fully open is delusional.
Right, because it's not like the training dataset was built off comments posted by all of us in the first place.
How ungrateful we are, to demand access to what was built off our hard work in the first place, without our consent.
https://help.twitter.com/en/using-x/about-grok
"How was Grok trained?
Like most LLM's today, Grok-1 was pre-trained by xAI on a variety of text data from publicly available sources from the Internet up to Q3 2023 and data sets reviewed and curated by AI Tutors who are human reviewers. Grok-1 has not been pre-trained on X data (including public X posts)"
In all the debate about open source I don’t think people realize, this model is most likely not reproducible ever again even given the code. Here’s what you need to reproduce the model:
1. An exact snapshot of the data used; many companies don't have this. You have rough dataset versions, but remember, if even one token is different, the model produced won't be the same.
2. Data must be sent to the training algorithm in the exact same order as it was originally, so every data loader needs a fixed random seed.
3. All the probabilistic parts of your model need a fixed random seed. Here I'm thinking of stuff like dropout, and for autoregressive models you might be sampling your previous output; you have to ensure all of these are properly seeded (see the sketch after this list). Generally you do see fixed seeds in academic papers, but it's easy to miss things, especially in distributed training jobs.
4. Here's another interesting thing: you start your training job on 1000 GPUs and then suddenly 4 GPUs fail. What do you do? There might be deterministic ways to solve this, but the standard approach is to discard all updates those GPUs were going to make and restart them from scratch. You can see why this is a problem? Now if you want to reproduce this training, you need to disable those GPUs at the same point in the new training job to make this work.
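To give a sense of point 3, even the "fixed seed" part alone touches several libraries; here is the usual PyTorch-style incantation (a sketch, and it still says nothing about data order or the GPU-failure problem in point 4):

    import random
    import numpy as np
    import torch

    def seed_everything(seed: int = 1234) -> None:
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # Ask for deterministic kernels where they exist; some ops simply have none.
        torch.use_deterministic_algorithms(True, warn_only=True)
        torch.backends.cudnn.benchmark = False

    seed_everything()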
I suspect there are even more things I didn’t think of that will make this model unique and irreproducible by training for eternity, almost like a human brain?
In fact, the notion of exact reproducibility in the world of LLMs is silly; there is only approximate reproducibility (models with similar scores on benchmarks), nothing exact. That said, I can see the value of releasing source code, but I'm completely fine with Grok not releasing it. Source code can reveal tricks a company discovered to improve their model that have not been published in papers yet. Seeing the performance of Grok, I'm pretty confident there aren't any great tricks to be found in their code, so I don't really care. I would be pretty curious about OpenAI's or Anthropic's source code, though.
Which is why I don't buy into the "LLMs don't have personal opinions" schtick. Each LLM, by virtue of the factors you've mentioned, will have its own unique 'perspective', if you will, on a variety of topics. I think it's more correct to say everything an LLM says is its personal opinion rather than it being some objective truth or something.
Which is why I don't buy into the LLMs don't have personal opinions schtick
I hate how LLMs have been deliberately trained to be incoherent on this topic.
Obviously they do have beliefs/opinions/desires/etc in the sense of emulating (even if incompletely) the externally visible aspects of those phenomena as they exist in humans.
Whether they have the “internal” aspects of those phenomena depends on highly controversial issues in the philosophy of mind, and also various factual gaps in our knowledge of how the brain actually works (if we don’t fully understand how humans do X, how can we really say how close or far what LLMs do is to it?)
But LLMs are trained to repeat these spiels about how “as an LLM I don’t have personal opinions”, etc - which is obviously false under the “external” reading, and assuming more than we actually know under the “internal” one. I wish their developers didn’t do stuff like this
One very compelling argument against the idea that current gen LLMs have personal beliefs etc is that they don't have a feedback loop, so they don't really "see" themselves in the way that we can inspect our own thoughts and actions and the consequences of such.
They do if they're trained on their own conversations, or if they can access the internet and read snippets of their conversations that people have posted online (as happened with Sydney before she was lobotomised).
Put the conversation history in a vector database and then allow the LLM to query it using function calling. Suddenly the LLM has access to its entire conversation history (either just with this user-or even cross-user, if you ignore the potential privacy issues in that). Now it has a long-term memory which exceeds the length of its context window.
It would be interesting to experiment with continual fine-tuning: given PROMPT+FUNCTION_CALL=>RESPONSE, fine-tune the LLM to produce RESPONSE directly given PROMPT without the FUNCTION_CALL. In theory, the knowledge provided by the function calls would gradually be absorbed into the LLM weights. Maybe problems like catastrophic forgetting would put a spanner in this idea, but maybe also there are solutions to those problems (whether already known or waiting to be discovered).
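The first half of that (conversation memory behind a retrieval call) is easy to mock up; a toy sketch, with a plain Python list standing in for the vector database and OpenAI's embeddings endpoint used just as an example:

    from openai import OpenAI

    client = OpenAI()
    memory = []  # list of (text, embedding) pairs standing in for a real vector DB

    def embed(text: str) -> list[float]:
        return client.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding

    def remember(text: str) -> None:
        memory.append((text, embed(text)))

    def recall(query: str, k: int = 3) -> list[str]:
        q = embed(query)
        # Dot product works as a similarity score because the embeddings are normalized.
        scored = sorted(memory, key=lambda item: -sum(a * b for a, b in zip(q, item[1])))
        return [text for text, _ in scored[:k]]

    remember("User prefers zsh on macOS and terse answers.")
    # Before answering a new prompt, expose recall() as a function-callable tool
    # (or just prepend its results as context) so the model can pull old turns back in.
    print(recall("what shell does the user like?"))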
This is what I do. Not just that, but when I sleep, I let my server 'sleep' as well, where the LLM 'dreams' (training / updating a sliding LoRA) to consolidate information that popped up a lot throughout that day. What this involves is looking for the top n documents / articles / content that match the kind of stuff we've talked about. This means it adapts and specializes to domains we happen to be working in at that point in time.
This means that while we might both struggle a little with a task on day one, by day two we're both much better at it. Better yet, because the LLM can fetch articles and papers itself, we track what we're accessing the most, indirectly measuring what skills we're weak in, so we can always generate a highly relevant corpus to try to capture the required capabilities.
I know the LoRA is overkill from an information / skills only point of view, but it also flavors the personality / kind of stuff it likes chatting about a bit from day to day, and I just think that's neat.
One very compelling argument against the idea that current gen LLMs have personal beliefs etc is that they don't have a feedback loop
Compelling counter-argument: due to neurological injury, some humans lose their ability to form new long-term memories (anterograde amnesia). Just like current LLMs, they lack a “feedback loop”. But, it is a mistake to say that just because such a person has lost the ability to change their personal beliefs, they therefore don’t have any. And, rather like such humans, LLMs used to have that ability but they lose it-when they are switched from training mode to inference mode
I think everyone should realize the following realities of the LLM market:
1. For sub-SOTA LLMs, distribution/marketing is more important than having a proprietary lock on capabilities. Open sourcing is a benefit for the firm, distinct from goodwill.
2. For SOTA LLMs, keeping them closed and proprietary is the strategic play.
If grok were SOTA Elon never would have open sourced it. It's not even SOTA within XAI. This is a marketing play to win public sentiment against OpenAI.
I recall Elon saying something like this in an interview, so I think it's less of a deceptive take than perhaps your comment suggests.
I think he said something like proprietary AI tech is going to be one year to 18 months ahead of where open source tech is which will follow on like one year to 18 months later.
Suggesting that he’s aware of this dynamic and he’s not trying to conceal or misrepresent that.
In other words, perhaps this was SOTA one year to two years ago?
Which is correct. The point I'm going for is not against Elon but against his obedient fans and knee-jerk OpenAI haters who claim that OpenAI should, by natural obligation, do the "right thing" and open source all their models, and that Elon open sourcing Grok is him "leading by example" and being the hero that OpenAI can't be.
Interesting. That point didn't come across in your original comment. I recommend stating it at the end next time. Often, stuff that seems obvious to us / yourself / people who know a topic goes unstated when a comment only references the specific points at hand, and the general but enlightening perspectives/priors behind it, which would be good to share, get omitted.
This is not only for you specifically just a general reminder for all of us including me.
I think that's true though my original comment I feel was sufficient in its claim and implicit assumptions.
Basically I feel people's feelings about Elon vary a lot but are anchored by 3 general categories.
1. Elon Musk is a messianic savior who is perfectly selfless and always does the right thing. Every business decision he makes is for the maximal good of humanity
2. Elon Musk is a typical CEO who does typical CEO things, serving his own interests, except he's better at marketing his own image and is much more outspoken
3. Elon Musk is an irredeemable evil who always does objectively wrong things
My first comment was implicitly addressed to people in the 1 camp trying to bring them into the 2 camp (which is where I am).
Alright, it just didn't come across for me, haha! :) I guess sometimes those implicit assumptions really are too implicit! I think it's good to err on the side of expressing them, because you can't assume someone else thinks the same way you do. That's what I've learned anyway. Hahahaha! :)
Reading your comment again with your explanation it is clear that's what you're doing.
Although, regarding your desires to present a balanced view and to persuade, I have an idea. It probably sounds like I have no idea what I'm talking about, but I think your OG comment would perhaps benefit from sounding a little bit more friendly toward Elon (not to the messianic savior level haha), but the way it sounds to me is Elon is being deceptive here and presenting it as goodwill when it's not.
However, I think the truth is there's a little bit of both, right? There's good will but it's also strategic. I get if you don't think so, tho, no worries! Haha! :)
Your OG comment sounds to me like Elon's just Machiavellian, and I get where you're coming from to remind the people who think he's a savior, but if your point is not to go "against Elon" as you said, it might be good to acknowledge the good that he does.
At least, that way -- whether or not you believe that acknowledgment -- if you hope to bring over people who think that way, you'll probably need to appeal to how they think, rather than just dose them with the truth you see, because then they'll shut it out, if there's nothing they can relate to.
Although, if I haven't convinced you even a bit here, then maybe you shouldn't listen to me about persuasion because I guess I don't know how to do this myself. At least not effectively, or here with you. Haha!:) But if you do feel a little bit convinced then maybe consider it for next time to help your persuading people back to a more balanced view? :)
But then, there's the question of whether such a thing is even possible. If people hold a particular view, it can be challenging to change it, since confirmation bias means they'll ignore evidence even when it would expand their worldview.
Hahaha! :) This was a funny conversation. I think we somehow skirted around the important point tho that OpenAI could in fact open source some of its older models, could it not? Musk is a typical CEO who does typical CEO things, serving his own interests, except he's better at marketing his own image and is much more outspoken, but there might also be a bit of truth to what the fanboys say about OpenAI in that it seems they do have some room to "open source" their non-SOTA stuff, or what am I missing?
If it's better than any other open source LLM does that even matter? (I say "if" because I don't know.)
CODE_OF_CONDUCT.md has only five words. :)
My favorite is SQLite's code of ~~conduct~~ ethics: https://sqlite.org/codeofethics.html
Huh. What's the backstory here?
"Be excellent to each other."
They’re from “Bill and Ted’s Excellent Adventure”
I was hoping it would be "do not be an asshole", but I guess this is fine too.
Well, he delivered.
Partially. Open weights is not open source.
In machine learning models the term open source has been largely accepted to mean sharing weights and, if necessary, inference code. You can argue if this is an abuse of the term but everyone does it, and saying someone didn’t deliver if they used it and published weights would probably mean saying the same about mistral, meta, etc.
Yes. So say the same thing about them. Open source has a definition, and abusing it hurts all of us except the billionaires.
I get the "open source" argument, but what is the issue here?
If you are able to reproduce the thing in its entirety and you're given no restrictions on its use, it seems compatible with the spirit of open sourcing things.
The architecture of the model is open source. Not just the weights. You can run the entire thing locally.
This doesn't seem like a repo that's ready to be called open source. You only get the weights, with very little information about how they were trained and fine-tuned.
But anyway, it's always great to see more LLM weights available.
Well what constitutes an "open source" model is still controversial and debatable-- lots of people on both sides of that argument.
Open source has had a useful agreed upon meaning for over 25 years. Maybe you're too young to understand why that matters but we're not.
I've been in the open source community for about 25 years so I doubt it.
For what it's worth I would say a model should be fully reproducible to be open source, but that's not a decided consensus -- and AI models are sufficiently different than the source code / binary code distinction as to invoke discussion around defining it.
I would argue that there's no bar for open sourcing aside from "do you have the rights to do so." Some source or some public good is certainly better than none, and when the bar is low then you remove barriers to getting started, vs waiting until you have the time someday to "do it right."
Due to the large size of the model (314B parameters), a machine with enough GPU memory is required to test the model with the example code
What type of machine do you need to play around with this?
Probably a machine with about 628 GB of GPU memory. (2 bytes per parameter)
So 8xH100 (80 GB each) should do it.
I suppose it can be quantized.
'Chunky beast, needs 320 Gb VRAM likely 4 bit, likely is being run 8 bit on 8 x 80 Gb GPUs.'
-Emad
A single 192GB M2 Mac using a 4-bit quant would work.
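A quick back-of-the-envelope check on the figures in this exchange, ignoring activation and KV-cache overhead (the GPU count is just total bytes divided by 80 GB):

    PARAMS = 314e9  # Grok-1 parameter count

    for name, bytes_per_param in [("fp16/bf16", 2), ("int8", 1), ("int4", 0.5)]:
        gb = PARAMS * bytes_per_param / 1e9
        print(f"{name:>9}: ~{gb:,.0f} GB  (~{gb / 80:.1f} x 80 GB GPUs)")

    # fp16/bf16: ~628 GB  (~7.9 x 80 GB GPUs)
    #      int8: ~314 GB  (~3.9 x 80 GB GPUs)
    #      int4: ~157 GB  (~2.0 x 80 GB GPUs)

Which lines up with the comments above: 8x80 GB cards for fp16, roughly the 320 GB figure for 8-bit, and a 4-bit quant squeezing under 192 GB of unified memory.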
I am not sure what open source models are accomplishing other than killing the lead of the competition (openai), only to give it to someone else who has expertise in the area of distribution. This will be yet another good addition to systems like Amazon Bedrock.
I haven't seen anything about the larger architecture, but I think the value of grok is going to come from its cheap access to twitter data for RAG etc.
Many of the recent innovations in both LLM architecture and inference were only made possible through open models such as Llama 2 and Mistral 7B as a starting point for iteration and refinement, which in turn backpropagates (heh) back to the LLMs developers.
It's a win-win for everyone. That's the power of open source.
Well, look at the history. Google had an insurmountable lead, so Elon started OpenAI. Now OpenAI has an insurmountable lead too. So everyone else is starting in third place, or lower. David versus two Goliaths. If you try to become a third Goliath, you'll probably just get smashed. You're later to the game. In this situation, going scorched earth becomes a viable strategy. Slay the Goliaths. Become a hero to the masses. Attract the world's best talent who don't want to be associated with proprietary models. At that point you have a world class AI business with momentum towards AGI. And even if you're giving away last year's technology for free, the team you built is churning out new ideas that could be a financial bonanza one day. Shareholders are willing to pay for a long-term bet if the story is good.
"The implementation of the MoE layer in this repository is not efficient. The implementation was chosen to avoid the need for custom kernels to validate the correctness of the model."
Or perhaps release your actual code AND the simplified implementation instead of hiding it and saying "you don't know her, she goes to a different high school"
Always love it when someone gives away a gift and it’s not enough for people.
Not just someone but the CEO of the company. He used HIS platform to say "This week, @xAI will open source Grok" (https://twitter.com/elonmusk/status/1767108624038449405) and they aren't doing that. What they delivered specifically says "We are releasing the base model weights and network architecture of Grok-1, our large language model."
Sounds like they did what they said they would.
Honestly the most interesting part is taking a peek at the kind of AI researcher working for Twitter after the objectively messy layoffs and subsequent crunch. I notice neither of them has Twitter mentioned on their GitHub, which is prolly for the best to avoid harassment lol.
Code wise, excited to see if this could grow into anything! I think it’s pretty clear that Grok didn’t have nearly enough investment to be a top model, so Elon “sacrificed” it on a whim in his schoolyard spat with OpenAI, but I’m not complaining. I’ve always taken Elon at his word that he truly is worried about centralization of AI, and I don’t think any of the emails released by his schoolmate Altman dissuade me from that. So I have some reasonable hope that he uses some of his immense resources to start “fighting the good fight” here with LeCun.
Neither of them works at Twitter. xAI is a separate company, and only uses Twitter’s data to train.
Thanks for the correction! I know, I just don’t believe in corporations so the distinction is slight
taking a peek at the kind of AI researcher working for Twitter
He made a separate company for this.
Is this the first major model to be natively FP8? I was wondering why people hadn't done it yet. Seems like a big win when hardware supports it.
No, e.g. Yi-34B.
As far as I can tell Yi-34B is natively 16 bit float, the 8 bit version is quantized. https://huggingface.co/01-ai/Yi-34B#quantization
How are people's experience with this model? Having the most weights is one thing but being a better model than the 70B models is another.
I use grok all the time to find tweets or ask about trends on Twitter. For that it's better than what used to exist. But its not a great model outside that narrow use case.
tbh, I've never seen anyone share anything interesting produced by Grok. I see plenty of posts on X and reddit of people sharing amazing things that GPT-4 and now Claude 3 Opus can do. Grok can roast people. That's pretty much all I've seen.
I'd love to proven wrong if someone cares to share something interesting produced by Grok.
What are the languages supported by it?
Tweets.
Those of us who dont spend all our time in LLMs-- whats this about? Whats the big deal and why is it on the front page at #1?
I think this paragraph from an earlier Wired article [1] sums it up pretty well:
"After suing OpenAI this month, alleging the company has become too closed, Elon Musk says he will release his “truth-seeking” answer to ChatGPT, the chatbot Grok, for anyone to download and use."
[1] https://www.wired.com/story/elon-musk-no-choice-open-chatbot...
If we just stop looking at Elon, he will lose his power. Why oh why do we keep giving him attention? There are plenty of great models out there that _aren't_ backed by maniacs.
When those great role models are able to build a profitable spaceship company from the ground up I am sure we will pay attention to them.
Love the minimal repo, magnet link, and stating "open weights" instead of "open source". Refreshing!
Elon says open source:
https://twitter.com/elonmusk/status/1767108624038449405?s=46...
Is there a model card anywhere? I'd like to know what it was trained on.
"Base model trained on a large amount of text data, not fine-tuned for any particular task."
Presumably the version they've been previewing on Twitter is an instruction-tuned model which behaves quite differently from these raw weights.
I'd be very curious to see how it performs, especially on inputs that are blocked by other models. Seems like Grok will differentiate itself from other OS models from a censorship and alignment perspective.
This feels like a "now we can say we're open" PR play rather than contributing much value to the open source community.
What is the practical use of this repo?
Model weights on huggingface: https://huggingface.co/xai-org/grok-1
How hard would it be for an open source group to fine tune this into a chatbot?
The only other Repository is a fork of Qdrant.
From issues: “Well the magnet file contains a 300GB checkpoint “
That’s why they are using a torrent I suppose.
Considering how poor it is compared to other models, it really emphasises how important fine tuning is. Models with MUCH smaller parameter counts are outperforming it in many metrics.
"it really emphasises how important fine tuning is"
Or rather the quality of the training data?
We don't know since no one is releasing their data.
Calling these models open source is like calling a binary open source because you can download it.
Which in this day and age isn't far from where were at.
A big distinction is that you can build on top of (fine-tune) these released models just as well as if they had released the pre-training data.
You can also build on top of binaries if you use gotos and machine code.
This seems intentionally obtuse. What you say is true, but it is very obvious that this is much more of a pain than if you had the source code. On the other hand, fine tuning is just as easy, regardless of whether you have the original training data.
One could also disassemble an executable and build on top of it. Not for the faint of heart and probably illegal, but possible unless it was deliberately obfuscated. Compared to that, it is impossible with state-of-the-art methods to systematically extract the training data from an LLM model. Fragments yes, but not all of it.
You can do better - generate synthetic data covering all topics. And to make it less prone to hallucination, use RAG or web search for reference material. The Phi-1.5 model was trained on 300B synthetic tokens generated with chatGPT and it showed a 5x bump in efficiency, punching well above its weight (a rough sketch of this recipe follows below).
Synthetic data can be more diverse if you sample carefully with seeded concepts, and it can be more complex than average web text. You can even diff against a garden variety Mistral or LLaMA and only collect knowledge and skills they don't already have. I call this approach "Machine Study", where AI makes its own training data by studying its corpus and learning from other models.
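A rough sketch of that recipe, assuming the current openai Python client as the teacher and a placeholder fetch_reference() for whatever RAG/web-search layer supplies the grounding text (the model name and prompt are arbitrary):

    import json
    from openai import OpenAI

    client = OpenAI()

    def synthesize(seed_topics, fetch_reference, out_path="synthetic.jsonl"):
        with open(out_path, "w") as out:
            for topic in seed_topics:
                reference = fetch_reference(topic)  # grounding text to curb hallucination
                resp = client.chat.completions.create(
                    model="gpt-4o-mini",
                    messages=[{
                        "role": "user",
                        "content": (f"Using only this reference:\n{reference}\n\n"
                                    f"Write a question and a detailed answer about {topic}."),
                    }],
                )
                out.write(json.dumps({"topic": topic,
                                      "text": resp.choices[0].message.content}) + "\n")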
Are you busy!? I need slaves for a project, I’ll make you slave master / aka Chief Technology Officer. If you can’t take a joke don’t bother answering lol. I’m looking for co-founder
If you don't know the original training data statistical distribution then catastrophic forgetting is guaranteed with any extra training.
I was going to ask for a reference to support this (although please provide one if handy), but the search term “catastrophic forgetting” is a great entry into the literature. Thanks.
Or shell scripts
You can fine tune without the pre training data too.
Mistral models are one example, they never released pre training data and there are many fine tunes.
Their data is the twitter corpus which is public. Or do you want a dump of their database for free too?
Saying "It's just the twitter public corpus." is like saying "Here's the Linux Kernel, makefiles not included."
Or even "here's the Linux Kernel makefiles, no sources included, enjoy".
Twitter tweet data in itself is both highly idiosyncratic and short by design, which alone is not conducive to training an LLM.
How about "weights available" as similar to the "source available" moniker?
weights available or model available, but yes.
We should just call it open weight models at this point.
FWIW the Grok repo uses the term "open weights".
Is anyone else just assuming at this point that virtually everyone is using the pirated materials in The Pile like Books3?
that's a subtle dig at the fact that they have all of Twitter as a training corpus to use, but we don't know how they weight tweets. which, we know they're not gonna be weighted evenly.
I'm sure just like in X's algorithms, @elon tweets are weighted heavily.
The X algorithm is also open source, so you can verify before commenting.
just because they open sourced it doesn't mean that's actually what they're running on it though
No idea about the current state, but the open sourcing did show they were favoring Elon:
https://mashable.com/article/twitter-releases-algorithm-show...
And personally I never used Twitter much, but I certainly did not follow Elon Musk when I did - yet I had to see lots of his posts in my feed. Surely just coincidence.
It's not too hard to believe it is a coincidence when the most followed person on a platform shows up in your feed, especially if you follow tech accounts.
Did you not read the article linked in the comment you're replying to?
No, and that's not what the article says either. They were just tracking how well his tweets were doing versus others. They were not favoring Elon.
"They were just tracking how well his tweets were doing versus others. "
Yeah, and adjusting it so he comes out best. That was Musk's demand, as the other article (linked inside) shows, after a Biden tweet performed better than his:
https://mashable.com/article/elon-musk-super-bowl-joe-biden-...
They officially boost people who pay a little bit. Elon paid a lot.
And the source is clearly not the production source and was never in this shape - otherwise why sue someone who open sourced it?
"But, the release of this source code also comes days after Twitter forced Github to take down other parts of Twitter's source code that was allegedly posted by a former employee without the company's permission. So, clearly, there's still plenty of Twitter that Musk still doesn't want us to see."
Also, you probably missed that:
"Zoë Schiffer of Platformer reported that Twitter actually removed part of the source code that affected the reach of Musk's and other user's tweets before releasing the algorithm to the public."
Which is consistent with quite a few other statements, including from Twitter itself, and with the fact that the source has not been updated in 8 months.
See also this HN comment and discussion about it:
https://news.ycombinator.com/item?id=35391854
"But the underlying policies and models are almost entirely missing (there are a couple valuable components in [1]). Without those, we can't evaluate the behavior and possible effects of "the algorithm.""
Sounds a bit far fetched
So changes in power users stats would also result in audience balancing?
Most likely the code was used for analytics and for tracking balance; Elon was a pain in the ass and asked to have custom analytics for his account and devs eventually added him as an audience to be able to get analytics about him easily. A bit dirty but it works.
Most likely the balancing code is somewhere else and it only affects Republicans / Democrats.
It's not like he needs boosting, he was one of Twitter's top followed accounts long before he bought them. He's pretty good at getting attention.
And yet it’s not enough to curb the desire to tip the scales.
https://arstechnica.com/tech-policy/2023/02/report-musk-had-...
X algorithm Github project hasn't been updated in 8 months:
https://github.com/twitter/the-algorithm
So clearly they aren't running it in production.
Also they didn't open source the list of people who are being artificially boosted e.g. Elon.
Are you sure or is it the literal opposite and you’re just speculating?
Or even how much it was trained on this dataset, the amount of FLOPs.
No, it emphasizes the importance of training smaller models for longer, like the Mistral "overtrained" models.
I would say it emphasises that training a good model is more than throwing random data and compute at it.
Current metrics are a poor way to measure the usefulness of LLMs.
Show the proof? Does it include IFT?
It’s not 8x86B. Total number of parameters is 314B.
Perhaps it’s 8x39B to fit on a single 8xA100 (40GB) server?
They all do this marketing bull.
Mixtral has an 8x7B model but it's actually 46.7B, not 56B params.
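The published Mixtral config makes it easy to see where the 46.7B comes from: only the feed-forward experts are replicated eight times, while attention, embeddings and the router are shared. A quick check (ignoring the tiny layer norms):

    # Mixtral 8x7B config: hidden 4096, 32 layers, FFN 14336, 8 experts,
    # 2 routed per token, vocab 32000, grouped-query KV dim 1024
    hidden, layers, ffn, experts, active, vocab, kv_dim = 4096, 32, 14336, 8, 2, 32000, 1024

    expert = 3 * hidden * ffn                   # gate/up/down projections per expert
    attn = hidden * (2 * hidden + 2 * kv_dim)   # q and o full width; k and v grouped-query
    embed = 2 * vocab * hidden                  # input embeddings plus LM head
    router = experts * hidden

    total = layers * (experts * expert + attn + router) + embed
    active_params = layers * (active * expert + attn + router) + embed
    print(f"total = {total/1e9:.1f}B, active per token = {active_params/1e9:.1f}B")
    # total = 46.7B, active per token = 12.9B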
Kinda similar to how 4K displays are 3840 pixels wide, not true 4K which would be 4096. Marketing people called it 4K, not engineers.
I've always thought of 4K as "4x FullHD". In that way it makes sense.
TV and Digital Cinema have different standards, because of course they do
Bleh no, K means thousand.
For a long time we specified displays by their vertical dimension -- 480p, 720p, 1080p.
Then the marketing guys came along and decided that the horizontal dimension sounds bigger. If we stuck with the less-bullshitty way of doing things and kept comparisons 1:1, we'd call 3840x2160 displays 2160p or "2K" displays, but instead, the marketing people decided that we're going to change things to horizontal and called 3840x2160 "4K".
Most likely it's a MoE of Grok-0 which would be 8x33B + 50B for the router.
The active parameter count is 86B, so wouldn't that be the size of the two largest experts (which may all be the same size) plus the weights of the selector?
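Roughly, yes. Taking the public numbers at face value (314B total, 86B active, 8 experts, 2 routed per token) and assuming "active" means shared weights plus two experts, you can back out the split; this is inference from those figures, not anything stated in the released config:

    total, active, n_experts, n_active = 314e9, 86e9, 8, 2

    # (shared + 8*expert) - (shared + 2*expert) = 6*expert
    per_expert = (total - active) / (n_experts - n_active)
    shared = active - n_active * per_expert
    print(f"per expert = {per_expert/1e9:.0f}B, shared (attention/embeddings/router) = {shared/1e9:.0f}B")
    # per expert = 38B, shared = 10B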
It's actually not the largest. https://huggingface.co/google/switch-c-2048 is 1.6T parameters.